rst2html5 Design Notes

These notes record what was learned while implementing rst2html5. They may be helpful to people who want to contribute to the project or write another rst converter.

Docutils

Docutils is a set of tools for converting plaintext documentation written in reStructuredText markup (rst) into other formats such as HTML, PDF and LaTeX. Its design issues and implementation details are described at http://docutils.sourceforge.net/docs/peps/pep-0258.html

In the early stages of the translation process, the rst document is parsed and transformed into an intermediate format called doctree, which is then passed to a translator that produces the desired output format:

                       Translator
                 +-------------------+
                 |    +---------+    |
---> doctree -------->|  Writer |-------> output
                 |    +----+----+    |
                 |         |         |
                 |         |         |
                 |  +------+------+  |
                 |  | NodeVisitor |  |
                 |  +-------------+  |
                 +-------------------+

Doctree

The doctree is a hierarchical structure of the elements of an rst document. It is defined in docutils.nodes and is used internally by Docutils components.

The command rst2pseudoxml.py produces a textual representation of a doctree, which is very useful for visualizing how the elements of an rst document are nested. This information was of great help for both the design and the tests of rst2html5.

Given the following rst snippet:

Title
=====

Text and more text

The textual representation produced by rst2pseudoxml.py is:

<document ids="title" names="title" source="snippet.rst" title="Title">
    <title>
        Title
    <paragraph>
        Text and more text
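
To make the nesting concrete, here is a minimal sketch of a pseudo-XML-like printer over a toy tree. ToyNode and pformat are hypothetical names used only for illustration. They are not the docutils.nodes classes, and the sketch omits the attributes (ids, names, source) that rst2pseudoxml.py prints.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a doctree node: a name plus children."""
    name: str
    children: list = field(default_factory=list)  # ToyNode or str items


def pformat(node, level=0):
    """Return an indented, pseudo-XML-like dump of the toy tree."""
    indent = '    ' * level
    if isinstance(node, str):  # text leaf
        return f'{indent}{node}\n'
    dump = f'{indent}<{node.name}>\n'
    for child in node.children:
        dump += pformat(child, level + 1)
    return dump


doc = ToyNode('document', [
    ToyNode('title', ['Title']),
    ToyNode('paragraph', ['Text and more text']),
])
print(pformat(doc), end='')
```

Each level of nesting adds one level of indentation, which is the property that makes the pseudo-XML dump convenient when designing visit and depart handlers.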

Translator, Writer and NodeVisitor

A translator is composed of two parts: a Writer and a NodeVisitor. The Writer is responsible for preparing and coordinating the translation performed by the NodeVisitor. The NodeVisitor visits each doctree node and performs all the actions needed to translate the node to the desired format, according to the node’s type and content.

Important

To develop a new docutils translator, you need to specialize these two classes.

Note

These classes correspond to a variation of the Visitor pattern called “Extrinsic Visitor”, which is more commonly used in Python. See The “Visitor Pattern”, Revisited.

               +-------------+
               |             |
               |    Writer   |
               |  translate  |
               |             |
               +------+------+
                      |
                      |    +---------------------------+
                      |    |                           |
                      v    v                           |
                 +------------+                        |
                 |            |                        |
                 |    Node    |                        |
                 |  walkabout |                        |
                 |            |                        |
                 +--+---+---+-+                        |
                    |   |   |                          |
          +---------+   |   +----------+               |
          |             |              |               |
          v             |              v               |
 +----------------+     |    +--------------------+    |
 |                |     |    |                    |    |
 |  NodeVisitor   |     |    |    NodeVisitor     |    |
 | dispatch_visit |     |    | dispatch_departure |    |
 |                |     |    |                    |    |
 +--------+-------+     |    +---------+----------+    |
          |             |              |               |
          |             +--------------|---------------+
          |                            |
          v                            v
+-------------------+        +--------------------+
|                   |        |                    |
|   NodeVisitor     |        |   NodeVisitor      |
| visit_<NODE_TYPE> |        | depart_<NODE_TYPE> |
|                   |        |                    |
+-------------------+        +--------------------+

During the doctree traversal through docutils.nodes.Node.walkabout(), two NodeVisitor dispatch methods are called: dispatch_visit() and dispatch_departure(). The former is called at the start of the node visitation. Then walkabout() is called on each child node, and finally the latter dispatch method is called. Each dispatch method calls another method whose name follows the pattern visit_<NODE_TYPE> or depart_<NODE_TYPE>, such as visit_paragraph or depart_title, which should be implemented by the NodeVisitor subclass.
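
The traversal just described can be sketched with toy classes. This is a simplified illustration, not the docutils code: the real docutils.nodes.Node.walkabout() also handles control-flow exceptions for skipping nodes, and TraceVisitor is a hypothetical helper that only records which methods were called.

```python
class Node:
    """Toy doctree node (the real classes live in docutils.nodes)."""

    def __init__(self, *children):
        self.children = children

    def walkabout(self, visitor):
        visitor.dispatch_visit(self)        # called before the children
        for child in self.children:
            child.walkabout(visitor)
        visitor.dispatch_departure(self)    # called after the children


class document(Node):
    pass


class title(Node):
    pass


class paragraph(Node):
    pass


class Text(Node):
    pass


class NodeVisitor:
    """Dispatches to visit_<NODE_TYPE> / depart_<NODE_TYPE> by class name."""

    def dispatch_visit(self, node):
        name = 'visit_' + type(node).__name__
        return getattr(self, name, self.unknown_visit)(node)

    def dispatch_departure(self, node):
        name = 'depart_' + type(node).__name__
        return getattr(self, name, self.unknown_departure)(node)

    def unknown_visit(self, node):
        raise NotImplementedError(type(node).__name__)

    def unknown_departure(self, node):
        raise NotImplementedError(type(node).__name__)


class TraceVisitor(NodeVisitor):
    """Hypothetical visitor that records the name of every call."""

    def __init__(self):
        self.calls = []

    def visit_paragraph(self, node):        # explicit handler
        self.calls.append('visit_paragraph')

    def depart_title(self, node):           # explicit handler
        self.calls.append('depart_title')

    def unknown_visit(self, node):          # fallback for other node types
        self.calls.append('visit_' + type(node).__name__)

    def unknown_departure(self, node):
        self.calls.append('depart_' + type(node).__name__)


tree = document(title(Text()), paragraph(Text()))
visitor = TraceVisitor()
tree.walkabout(visitor)
print(visitor.calls)
```

The node classes know nothing about the output format. All format-specific behavior lives in the visitor, which is what makes the pattern "extrinsic".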

rst2html5

In rst2html5, Writer and NodeVisitor are specialized through the HTML5Writer and HTML5Translator classes.

rst2html5.HTML5Translator is a NodeVisitor subclass that implements all visit_<NODE_TYPE> and depart_<NODE_TYPE> methods needed to translate a doctree to its HTML5 content. The rst2html5.HTML5Translator uses an object of the ElemStack helper class that controls a context stack to handle indentation and the nesting of the doctree traversal:

                   rst2html5
           +-----------------------+
           |    +-------------+    |
doctree ---|--->| HTML5Writer |----|-->  HTML5
           |    +------+------+    |
           |           |           |
           |           |           |
           |  +--------+--------+  |
           |  | HTML5Translator |  |
           |  +--------+--------+  |
           |           |           |
           |           |           |
           |     +-----+-----+     |
           |     | ElemStack |     |
           |     +-----------+     |
           +-----------------------+

The standard visit_<NODE_TYPE> action is called default_visit and it initiates a new element context:

    def default_visit(self, node: NodeElement) -> None:
        """
        Initiate a new context to store inner HTML5 elements.
        """
        if 'ids' in node and self.once_attr('expand_id_to_anchor', default=True):
            # create an anchor <a id=id></a> on top of the current element
            # for each id found.
            for id in node['ids'][1:]:
                self.context.begin_elem()
                self.context.commit_elem(tag.a(id=id))
            node['ids'] = node['ids'][0:1]
        self.context.begin_elem()
        return

The standard depart_<NODE_TYPE> action is default_departure and it creates the HTML5 element corresponding to the saved context:

    def default_departure(self, node: NodeElement) -> None:
        """
        Create the node's corresponding HTML5 element and combine it with its
        stored context.
        """
        tag_name, indent, attributes = self.parse(node)
        elem = getattr(tag, tag_name)(**attributes)
        self.context.commit_elem(elem, indent)
        return

Not all rst elements follow this procedure. The Text element, for example, is a leaf node and thus doesn’t need a context of its own. Other elements are processed in the same way and can share the same visit_ and/or depart_ method. To take advantage of these similarities, the rst_terms dict maps a node type to its visit_ and depart_ methods:

    rst_terms = {
        # 'term': ('tag', 'visit_func', 'depart_func', use_term_in_class,
        #          indent_elem)
        # use_term_in_class and indent_elem are optional.
        # If not given, they default to False and True, respectively.
        'Text': (None, 'visit_Text', None),
        'abbreviation': ('abbr', dv, dp),
        'acronym': ('abbr', dv, dp),
        'address': (None, 'visit_address', None),
        'admonition': ('aside', 'visit_aside', 'depart_aside', True),
        'attention': ('aside', 'visit_aside', 'depart_aside', True),
        'attribution': ('p', dv, dp, True),
        'author': (None, 'visit_bibliographic_field', None),
        'authors': (None, 'visit_authors', None),
        'block_quote': ('blockquote', 'visit_blockquote', dp),
        'bullet_list': ('ul', dv, dp, False),
        'caption': ('figcaption', dv, dp, False),
        'caution': ('aside', 'visit_aside', 'depart_aside', True),
        'citation': (None, 'visit_citation', 'depart_citation', True),
        'citation_reference': (
            'a',
            'visit_citation_reference',
            'depart_reference',
            True,
            False,
        ),
        'classifier': (None, 'visit_classifier', None),
        'colspec': (None, pass_, 'depart_colspec'),
        'comment': (None, 'visit_comment', None),
        'compound': ('div', dv, dp),
        'contact': (None, 'visit_bibliographic_field', None),
        'container': ('div', dv, dp),
        'copyright': (None, 'visit_bibliographic_field', None),
        'danger': ('aside', 'visit_aside', 'depart_aside', True),
        'date': (None, 'visit_bibliographic_field', None),
        'decoration': (None, 'do_nothing', None),
        'definition': ('dd', dv, dp),
        'definition_list': ('dl', dv, dp),
        'definition_list_item': (None, 'do_nothing', None),
        'description': ('td', dv, dp),
        'docinfo': (None, 'do_nothing', None),
        'doctest_block': (
            'pre',
            'visit_literal_block',
            'depart_literal_block',
            True,
        ),
        'document': (None, 'visit_document', 'depart_document'),
        'emphasis': ('em', dv, dp, False, False),
        'entry': (None, dv, 'depart_entry'),
        'enumerated_list': ('ol', dv, 'depart_enumerated_list'),
        'error': ('aside', 'visit_aside', 'depart_aside', True),
        'field': (None, 'visit_field', None),
        'field_body': (None, 'do_nothing', None),
        'field_list': (None, 'do_nothing', None),
        'field_name': (None, 'do_nothing', None),
        'figure': (None, 'visit_figure', dp),
        'footer': (None, dv, dp),
        'footnote': (None, 'visit_citation', 'depart_citation', True),
        'footnote_reference': (
            'a',
            'visit_citation_reference',
            'depart_reference',
            True,
            False,
        ),
        'generated': (None, 'do_nothing', None),
        'header': (None, dv, dp),
        'hint': ('aside', 'visit_aside', 'depart_aside', True),
        'image': ('img', 'visit_image', 'depart_image'),
        'important': ('aside', 'visit_aside', 'depart_aside', True),
        'inline': ('span', dv, dp, False, False),
        'label': ('th', 'visit_reference', 'depart_label'),
        'legend': ('div', dv, dp, True),
        'line': (None, 'visit_line', None),
        'line_block': ('pre', 'visit_line_block', 'depart_line_block', True),
        'list_item': ('li', dv, dp),
        'literal': ('code', 'visit_literal', 'depart_literal', False, False),
        'literal_block': (
            'pre',
            'visit_literal_block',
            'depart_literal_block',
        ),
        'math': (None, 'visit_math_block', None),
        'math_block': (None, 'visit_math_block', None),
        'meta': (None, 'visit_meta', None),
        'note': ('aside', 'visit_aside', 'depart_aside', True),
        'option': ('kbd', 'visit_option', dp, False, False),
        'option_argument': ('var', 'visit_option_argument', dp, False, False),
        'option_group': ('td', 'visit_option_group', 'depart_option_group'),
        'option_list': (None, 'visit_option_list', 'depart_option_list', True),
        'option_list_item': ('tr', dv, dp),
        'option_string': (None, 'do_nothing', None),
        'organization': (None, 'visit_bibliographic_field', None),
        'paragraph': ('p', 'visit_paragraph', dp),
        'pending': (None, dv, dp),
        'problematic': (
            'a',
            'visit_problematic',
            'depart_reference',
            True,
            False,
        ),
        'raw': (None, 'visit_raw', None),
        'reference': (
            'a',
            'visit_reference',
            'depart_reference',
            False,
            False,
        ),
        'revision': (None, 'visit_bibliographic_field', None),
        'row': ('tr', 'visit_row', 'depart_row'),
        'rubric': ('p', dv, 'depart_rubric', True),
        'section': ('section', 'visit_section', 'depart_section'),
        'sidebar': ('aside', 'visit_aside', 'depart_aside', True),
        'status': (None, 'visit_bibliographic_field', None),
        'strong': (None, dv, dp, False, False),
        'subscript': ('sub', dv, dp, False, False),
        'substitution_definition': (None, 'skip_node', None),
        'substitution_reference': (None, 'skip_node', None),
        'subtitle': (None, 'visit_target', 'depart_subtitle'),
        'superscript': ('sup', dv, dp, False, False),
        'system_message': ('div', 'visit_system_message', dp),
        'table': (None, 'visit_table', 'depart_table'),
        'target': ('a', 'visit_target', 'depart_reference', False, False),
        'tbody': (None, dv, dp),
        'term': ('dt', dv, dp),
        'tgroup': (None, 'do_nothing', None),
        'thead': (None, 'visit_thead', 'depart_thead'),
        'tip': ('aside', 'visit_aside', 'depart_aside', True),
        'title': (None, dv, 'depart_title'),
        'title_reference': ('cite', dv, dp, False, False),
        'topic': ('aside', 'visit_aside', 'depart_aside', True),
        'transition': ('hr', dv, dp),
        'version': (None, 'visit_bibliographic_field', None),
        'warning': ('aside', 'visit_aside', 'depart_aside', True),
    }

where dv is default_visit and dp means default_departure.
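
One way such a table can drive a visitor is sketched below, under stated assumptions. ToyTranslator, TERMS, no_op and the log attribute are hypothetical, and the handler names are stored as strings. The binding mechanism shown (setattr in the constructor) is an illustration and is not claimed to be how HTML5Translator does it.

```python
# Reduced table in the same shape as rst_terms (illustrative entries only).
TERMS = {
    'Text': (None, 'visit_Text', None),
    'emphasis': ('em', 'default_visit', 'default_departure', False, False),
    'note': ('aside', 'default_visit', 'default_departure', True),
}


class ToyTranslator:
    """Hypothetical visitor whose visit_/depart_ methods come from TERMS."""

    def __init__(self):
        self.log = []
        self.specs = {}
        for term, (tag, visit, depart, *flags) in TERMS.items():
            # Fill in the optional flags: use_term_in_class=False, indent_elem=True
            use_term_in_class, indent_elem = flags + [False, True][len(flags):]
            self.specs[term] = (tag, use_term_in_class, indent_elem)
            # Bind visit_<term> / depart_<term> to the shared handlers.
            setattr(self, 'visit_' + term, getattr(self, visit))
            setattr(self, 'depart_' + term, getattr(self, depart or 'no_op'))

    def default_visit(self, node):
        self.log.append(('open', node))

    def default_departure(self, node):
        self.log.append(('close', node))

    def visit_Text(self, node):
        self.log.append(('text', node))

    def no_op(self, node):
        pass


translator = ToyTranslator()
translator.visit_emphasis('emphasis node')
translator.visit_Text('some text')
translator.depart_Text('some text')      # bound to no_op
translator.depart_emphasis('emphasis node')
print(translator.log)
print(translator.specs['note'])
```

With this arrangement, supporting a new node type that behaves like an existing one only requires a new table entry.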

HTML5 Tag Construction

HTML5 Tags are constructed by the genshi.builder.tag object.
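
The expression getattr(tag, tag_name)(**attributes) used in default_departure relies on a factory object that turns attribute lookups into element constructors. The sketch below imitates that style with standard-library code only. Element and TagFactory here are hypothetical stand-ins, not Genshi's implementation, and their rendering rules are simplified.

```python
from html import escape


class Element:
    """Hypothetical, minimal element: a name, attributes and children."""

    def __init__(self, name, *children, **attrs):
        self.name = name
        self.children = list(children)
        self.attrs = attrs

    def __str__(self):
        attrs = ''.join(
            f' {key}="{escape(str(value))}"' for key, value in self.attrs.items()
        )
        inner = ''.join(
            escape(child) if isinstance(child, str) else str(child)
            for child in self.children
        )
        return f'<{self.name}{attrs}>{inner}</{self.name}>'


class TagFactory:
    """Any attribute lookup returns a constructor for that tag name."""

    def __getattr__(self, name):
        return lambda *children, **attrs: Element(name, *children, **attrs)


tag = TagFactory()
print(tag.h1('Title'))
# The tag name can also be chosen at run time, as default_departure does:
print(getattr(tag, 'p')('Text & more text', id='intro'))
```

Because the tag name is just a string passed to getattr, a table such as rst_terms can select the HTML element for each node type without any per-tag code.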

ElemStack

For the previous doctree example, the sequence of visit_... and depart_... calls is this:

1. visit_document
    2. visit_title
        3. visit_Text
        4. depart_Text
    5. depart_title
    6. visit_paragraph
        7. visit_Text
        8. depart_Text
    9. depart_paragraph
10. depart_document

For this sequence, the behavior of an ElemStack context object is:

  1. Initial State. The context stack is empty:

    context = []
    
  2. visit_document. A new context for document is reserved:

    context = [ [] ]
                 \
                  document
                  context
    
  3. visit_title. A new context for title is pushed into the context stack:

                    title
                    context
                     /
    context = [ [], [] ]
                 \
                  document
                  context
    

  4. visit_Text. A Text node doesn’t need a new context because it is a leaf node. Its text is simply added to the context of its parent node:

                      title
                      context
                     /
    context = [ [], ['Title'] ]
                 \
                  document
                  context

  5. depart_Text. No action is performed. The context stack remains the same.

  6. depart_title. This is the end of the title processing. The title context is popped from the context stack to form an h1 tag, which is then inserted into the context of the title’s parent node (the document context):

    context = [ [tag.h1('Title')] ]
                 \
                  document
                  context

  7. visit_paragraph. A new context is added:

                                     paragraph
                                     context
                                    /
    context = [ [tag.h1('Title')], [] ]
                 \
                  document
                  context

  8. visit_Text. Again, the text is inserted into the context of its parent node:

                                     paragraph
                                     context
                                    /
    context = [ [tag.h1('Title')], ['Text and more text'] ]
                 \
                  document
                  context

  9. depart_Text. No action is performed.

  10. depart_paragraph. This follows the standard procedure: the current context is popped and forms a new tag, which is appended to the context of the parent node:

    context = [ [tag.h1('Title'), tag.p('Text and more text')] ]
                 \
                  document
                  context

  11. depart_document. The document node doesn’t have an HTML tag. Its context is simply combined with the outer context to form the body of the HTML5 document:

    context = [tag.h1('Title'), tag.p('Text and more text')]
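
The walkthrough can be replayed with a small string-based sketch of a context stack. ToyElemStack is a hypothetical stand-in for ElemStack. The method names begin_elem and commit_elem follow the calls seen in default_visit and default_departure, but the signatures are simplified assumptions: commit_elem takes a tag name instead of an element, indentation is ignored, and append is an invented helper for text.

```python
class ToyElemStack:
    """Hypothetical, string-based stand-in for ElemStack."""

    def __init__(self):
        self.stack = []

    def begin_elem(self):
        # Reserve a new context for the element being visited.
        self.stack.append([])

    def append(self, text):
        # Leaf content goes straight into the innermost open context.
        self.stack[-1].append(text)

    def commit_elem(self, tag_name=None):
        # Pop the innermost context and wrap it in a tag, if there is one.
        children = self.stack.pop()
        if tag_name is None:            # e.g. document: no HTML tag of its own
            result = children
        else:
            inner = ''.join(children)
            result = [f'<{tag_name}>{inner}</{tag_name}>']
        # Hand the result over to the parent context (or the outer one).
        parent = self.stack[-1] if self.stack else self.stack
        parent.extend(result)


context = ToyElemStack()
context.begin_elem()                    # visit_document
context.begin_elem()                    # visit_title
context.append('Title')                 # visit_Text
context.commit_elem('h1')               # depart_title
context.begin_elem()                    # visit_paragraph
context.append('Text and more text')    # visit_Text
context.commit_elem('p')                # depart_paragraph
context.commit_elem()                   # depart_document
print(context.stack)
```

At every point the length of the stack equals the depth of the node currently being processed, which is why a stack is enough to keep track of the nesting.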
    

rst2html5 Tests

The test cases are located at tests/cases.py and each test case is a dictionary whose main keys are:

rst: text snippet in rst format
out: expected output
part: specifies which part of the rst2html5 output will be compared to out. Possible values are head, body or whole.

Other possible keys are rst2html5 configuration settings such as indent_output, script, script-defer, html-tag-attr or stylesheet.
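
A test case in this shape might look like the sketch below. The values are illustrative placeholders and are not copied from tests/cases.py. In particular, the out string is not a verified rst2html5 result. The split_case helper is hypothetical and only shows how the main keys can be separated from the configuration settings.

```python
MAIN_KEYS = {'rst', 'out', 'part'}

# Illustrative test case; the 'out' value is a placeholder, not real output.
emphasis_case = {
    'rst': '*emphasis*',
    'out': '<p><em>emphasis</em></p>',
    'part': 'body',
    'indent_output': False,     # a configuration setting, not a main key
}


def split_case(case):
    """Separate the main keys from the rst2html5 configuration settings."""
    main = {key: value for key, value in case.items() if key in MAIN_KEYS}
    settings = {key: value for key, value in case.items() if key not in MAIN_KEYS}
    return main, settings


main, settings = split_case(emphasis_case)
print(sorted(main))
print(settings)
```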

When a test fails, three auxiliary files are created in the default temporary directory (/tmp):

  1. TEST_CASE_NAME.rst contains the rst snippet of the test case;
  2. TEST_CASE_NAME.result contains the result produced by rst2html5; and
  3. TEST_CASE_NAME.expected contains the expected result.

Their differences can be easily visualized with a diff tool:

$ kdiff3 /tmp/TEST_CASE_NAME.result /tmp/TEST_CASE_NAME.expected
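
When no graphical diff tool is at hand, the same comparison can be made with the standard library. The following is a minimal sketch, assuming a hypothetical helper named dump_failure. It is not the code used by the rst2html5 test suite. It writes the three auxiliary files and returns a unified diff between the result and the expected output.

```python
import difflib
import tempfile
from pathlib import Path


def dump_failure(name, rst, result, expected, directory=None):
    """Write NAME.rst, NAME.result and NAME.expected; return a unified diff."""
    directory = Path(directory or tempfile.gettempdir())
    files = (('rst', rst), ('result', result), ('expected', expected))
    for suffix, content in files:
        (directory / f'{name}.{suffix}').write_text(content, encoding='utf-8')
    diff = difflib.unified_diff(
        result.splitlines(keepends=True),
        expected.splitlines(keepends=True),
        fromfile=f'{name}.result',
        tofile=f'{name}.expected',
    )
    return ''.join(diff)


# A throwaway directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    report = dump_failure(
        'TEST_CASE_NAME', '*a*\n', '<p>a</p>\n', '<p><em>a</em></p>\n', tmp
    )
    print(report, end='')
```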

Workaround to Conflicts with Docutils

Installing the rst2html5 package should make it usable both from the command line and as a library imported by other projects. For example, to use it from the command line:

$ rst2html5 example.rst example.html

And programmatically from another project:

from rst2html5 import HTML5Writer

...

The problem is that, after 0.13.1, the docutils installation creates two scripts called rst2html5 and rst2html5.py in <venv>/bin, where <venv> is the installation path of the virtual environment being used. Both do the same thing.

Since it is not possible to delete a script from another package, rst2html5 package installation overwrites both, but rst2html5.py still causes problems. When importing rst2html5 from one of those scripts, Python reaches <venv>/bin/rst2html5.py instead of <venv>/lib/<python_version>/site-packages/rst2html5 because the former comes first in sys.path during the execution, in a virtual environment.

A typical sys.path is:

[
    '/tmp/py39/bin',
    '/usr/lib/python39.zip',
    '/usr/lib/python3.9',
    '/usr/lib/python3.9/lib-dynload',
    '/tmp/py39/lib/python3.9/site-packages'
]

where /tmp/py39 is the path of the virtual environment, and python3.9 is the current Python version.

Note

The sys.path seen from an interactive command line is different from the one seen inside a running script. To get the real value, you must manually insert a breakpoint in an installed script or print sys.path from it.
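
The first entry can be checked without touching an installed script. The sketch below writes a throwaway script to a temporary directory and runs it with the current interpreter. The function name and the file name show_path.py are arbitrary, and the sketch assumes the interpreter is not running with the PYTHONSAFEPATH setting, which disables this behavior.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def script_dir_and_first_entry():
    """Run a throwaway script; return (its directory, the sys.path[0] it saw)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / 'show_path.py'
        script.write_text('import sys\nprint(sys.path[0])\n', encoding='utf-8')
        # Drop PYTHONSAFEPATH so that the default behavior is observed.
        env = {k: v for k, v in os.environ.items() if k != 'PYTHONSAFEPATH'}
        completed = subprocess.run(
            [sys.executable, str(script)],
            env=env, capture_output=True, text=True, check=True,
        )
        first_entry = completed.stdout.strip()
        # Resolve symlinks on both sides before comparing.
        return os.path.realpath(tmp), os.path.realpath(first_entry)


script_dir, first_entry = script_dir_and_first_entry()
print(script_dir == first_entry)
```

For a script installed in <venv>/bin, this first entry is <venv>/bin itself, which is why a file named rst2html5.py in that directory takes precedence over the package in site-packages.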

For 1.9.2 <= rst2html5 < 2.0, the immediate solution was to rename the module from rst2html5 to rst2html5_, so that importing would skip <venv>/bin/rst2html5.py and find the right module at <venv>/lib/<python_version>/site-packages.

Version 2.0 implements a more elegant solution that allows both the script and the module to be named rst2html5. The script <venv>/bin/rst2html5 still imports rst2html5_, but instead of reaching a package, the import hits a file called rst2html5_.py that modifies sys.path and only then imports the module rst2html5:

                <venv>/bin              <venv>/lib/<python_version>/site-packages


before 2.0:     rst2html5    ----->     rst2html5_/
                (import rst2html5_)


2.0 onwards:    rst2html5    ----->     rst2html5_.py     -------->   rst2html5/
                (import rst2html5_)     (modifies sys.path
                                         and then imports rst2html5)

<venv>/bin/rst2html5 is generated automatically during the package installation, and contains something very similar to this:

#!/<venv>/bin/python
from rst2html5_ import main

if __name__ == '__main__':
    main()

The intermediary file rst2html5_.py is shown below:

import sys
from pathlib import Path

from docutils.core import default_description, publish_cmdline

# inserts <venv>/lib/<python_version>/site-packages before <venv>/bin in sys.path
# so that ``from rst2html5 ...`` reaches <venv>/lib/<python_version>/site-packages/rst2html5
# instead of docutils' <venv>/bin/rst2html5.py
sys.path.insert(0, str(Path(__file__).parent.absolute()))

from rst2html5 import HTML5Writer  # noqa E402


def main():
    description = 'Generates (X)HTML5 documents from standalone reStructuredText sources.' + default_description
    publish_cmdline(writer=HTML5Writer(), description=description)
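
The effect of the sys.path.insert() call can be reproduced in isolation. The sketch below builds a throwaway layout in which bin/ and site/ stand in for <venv>/bin and site-packages, and the hypothetical names shadowdemo and shadowdemo_ play the roles of rst2html5 and rst2html5_. PYTHONPATH is used to put site/ on sys.path after the script directory. This setup is an assumption made for the demo and is not how a virtual environment is configured.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

FILES = {
    # Shadowing file next to the scripts (plays docutils' bin/rst2html5.py).
    'bin/shadowdemo.py': "WHERE = 'bin'\n",
    # Naive script: imports the name directly and gets the bin/ file.
    'bin/run_naive.py': "import shadowdemo\nprint(shadowdemo.WHERE)\n",
    # Fixed script: goes through the intermediary module first.
    'bin/run_fixed.py': "from shadowdemo_ import main\nmain()\n",
    # The real package (plays site-packages/rst2html5).
    'site/shadowdemo/__init__.py': "WHERE = 'site'\n",
    # Intermediary module (plays rst2html5_.py).
    'site/shadowdemo_.py': (
        "import sys\n"
        "from pathlib import Path\n"
        "sys.path.insert(0, str(Path(__file__).parent.absolute()))\n"
        "import shadowdemo  # now resolved from site/ instead of bin/\n"
        "def main():\n"
        "    print(shadowdemo.WHERE)\n"
    ),
}


def run_demo():
    """Return what the naive and the fixed scripts print."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for relative, content in FILES.items():
            path = root / relative
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(content, encoding='utf-8')
        env = {k: v for k, v in os.environ.items() if k != 'PYTHONSAFEPATH'}
        env['PYTHONPATH'] = str(root / 'site')

        def run(script):
            completed = subprocess.run(
                [sys.executable, str(root / 'bin' / script)],
                env=env, capture_output=True, text=True, check=True,
            )
            return completed.stdout.strip()

        return run('run_naive.py'), run('run_fixed.py')


print(run_demo())
```

The naive script picks up the file in bin/ because the script directory comes first in sys.path. The fixed script reaches the package in site/ because the intermediary module puts its own directory at the front of sys.path before importing.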

The package installation is configured in the file pyproject.toml:

[tool.poetry]
...
packages = [
    {include = "rst2html5"}
]
include = ["rst2html5_.py"]

[tool.poetry.scripts]
rst2html5 = "rst2html5_:main"  # overwrites docutils' rst2html5
...

Attention

It is very likely that projects using an rst2html5 version prior to 2.0 won’t need to change their imports, because rst2html5_.HTML5Writer is still reachable through the new rst2html5_.py file. However, they are advised to change them anyway.