rst2html5 Design Notes

The following documentation describes the knowledge collected during the rst2html5 implementation. It is probably neither complete nor entirely accurate, but it might be helpful to other people who want to create another rst converter.

Docutils

Docutils is a set of tools for processing plaintext documentation in reStructuredText markup (rst) into other formats such as HTML, PDF and LaTeX. Its design issues and implementation details are described at http://docutils.sourceforge.net/docs/peps/pep-0258.html

In the early stages of the translation process, the rst document is analyzed and transformed into an intermediary format called doctree which is then passed to a translator to be transformed into the desired formatted output:

                       Translator
                 +-------------------+
                 |    +---------+    |
---> doctree -------->|  Writer |-------> output
                 |    +----+----+    |
                 |         |         |
                 |         |         |
                 |  +------+------+  |
                 |  | NodeVisitor |  |
                 |  +-------------+  |
                 +-------------------+

Doctree

The doctree is a hierarchical structure of the elements of an rst document. It is defined in docutils.nodes and is used internally by Docutils components.

The command rst2pseudoxml.py produces a textual representation of a doctree that is very useful for visualizing the nesting of the elements of an rst document. This information was of great help for both the design and the tests of rst2html5.

Given the following rst snippet:

Title
=====

Text and more text

The textual representation produced by rst2pseudoxml is:

<document ids="title" names="title" source="snippet.rst" title="Title">
    <title>
        Title
    <paragraph>
        Text and more text
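
The nesting shown above can be mimicked with a small stand-alone structure. The sketch below is only an illustration: the Node class and the pformat function are hypothetical stand-ins written for this example, not the classes from docutils.nodes, and the output only imitates the layout of rst2pseudoxml (attributes are omitted).

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """Hypothetical stand-in for a doctree element (not docutils.nodes)."""
    name: str
    children: List[Union["Node", str]] = field(default_factory=list)


def pformat(node, level=0, indent="    "):
    """Return an indented, pseudo-XML-like rendering of the tree."""
    pad = indent * level
    if isinstance(node, str):
        # Leaf: plain text, printed one level deeper than its parent.
        return pad + node + "\n"
    text = pad + "<" + node.name + ">\n"
    for child in node.children:
        text += pformat(child, level + 1, indent)
    return text


doctree = Node("document", [
    Node("title", ["Title"]),
    Node("paragraph", ["Text and more text"]),
])
print(pformat(doctree), end="")
```

Each level of nesting adds one indentation step, which is what makes the parent/child relationships easy to read.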

Translator, Writer and NodeVisitor

A translator is comprised of two parts: a Writer and a NodeVisitor. The Writer is responsible for preparing and coordinating the translation made by the NodeVisitor. The NodeVisitor is used when visiting each doctree node and performs all actions needed to translate the node to the desired format according to its type and content.

Important

To develop a new docutils translator, one needs to specialize these two classes.

Note

These classes implement a variation of the Visitor pattern called “Extrinsic Visitor”, which is more commonly used in Python. See The “Visitor Pattern”, Revisited.

              +-------------+
              |             |
              |    Writer   |
              |  translate  |
              |             |
              +------+------+
                     |
                     |    +---------------------------+
                     |    |                           |
                     v    v                           |
                +------------+                        |
                |            |                        |
                |    Node    |                        |
                |  walkabout |                        |
                |            |                        |
                +--+---+---+-+                        |
                   |   |   |                          |
         +---------+   |   +----------+               |
         |             |              |               |
         v             |              v               |
+----------------+     |    +--------------------+    |
|                |     |    |                    |    |
|  NodeVisitor   |     |    |    NodeVisitor     |    |
| dispatch_visit |     |    | dispatch_departure |    |
|                |     |    |                    |    |
+--------+-------+     |    +---------+----------+    |
         |             |              |               |
         |             +--------------|---------------+
         |                            |
         v                            v
+-----------------+          +------------------+
|                 |          |                  |
|   NodeVisitor   |          |   NodeVisitor    |
| visit_NODE_TYPE |          | depart_NODE_TYPE |
|                 |          |                  |
+-----------------+          +------------------+

During the doctree traversal performed by docutils.nodes.Node.walkabout(), two NodeVisitor dispatch methods are called for each node: dispatch_visit() and dispatch_departure(). The former is called as soon as the node is reached. Then walkabout() is called on each child node, and lastly dispatch_departure() is called. Each dispatch method calls another method whose name follows the pattern visit_NODE_TYPE or depart_NODE_TYPE, such as visit_paragraph or depart_title, which should be implemented by the NodeVisitor subclass.
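
The traversal order can be reproduced with a few lines of plain Python. This is a minimal sketch of the dispatch idea only: the Node, NodeVisitor and TracingVisitor classes below are hypothetical stand-ins that borrow the Docutils method names, they are not the Docutils implementation.

```python
class Node:
    """Hypothetical stand-in for a doctree node."""

    def __init__(self, *children):
        self.children = list(children)

    def walkabout(self, visitor):
        visitor.dispatch_visit(self)        # on the way in
        for child in self.children:
            child.walkabout(visitor)        # recurse into the children
        visitor.dispatch_departure(self)    # on the way out


class document(Node):
    pass


class title(Node):
    pass


class paragraph(Node):
    pass


class Text(Node):
    def __init__(self, data):
        super().__init__()
        self.data = data


class NodeVisitor:
    """Looks up visit_NODE_TYPE / depart_NODE_TYPE by node class name."""

    def dispatch_visit(self, node):
        return getattr(self, "visit_" + type(node).__name__)(node)

    def dispatch_departure(self, node):
        return getattr(self, "depart_" + type(node).__name__)(node)


class TracingVisitor(NodeVisitor):
    """Records the name of every visit_/depart_ method that is requested."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Reached only for names that are not defined on the class.
        if name.startswith(("visit_", "depart_")):
            return lambda node: self.calls.append(name)
        raise AttributeError(name)


tree = document(title(Text("Title")), paragraph(Text("Text and more text")))
visitor = TracingVisitor()
tree.walkabout(visitor)
print(visitor.calls)
```

A real translator would define the visit_ and depart_ methods explicitly; the __getattr__ hook is used here only to log the order of the calls.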

rst2html5

In rst2html5, Writer and NodeVisitor are specialized through HTML5Writer and HTML5Translator classes.

rst2html5.HTML5Translator is a NodeVisitor subclass that implements all visit_NODE_TYPE and depart_NODE_TYPE methods needed to translate a doctree to its HTML5 content. HTML5Translator uses an instance of the rst2html5.ElemStack helper class, which keeps a context stack to handle indentation and the nesting of elements during the doctree traversal:

                   rst2html5
           +-----------------------+
           |    +-------------+    |
doctree ---|--->| HTML5Writer |----|-->  HTML5
           |    +------+------+    |
           |           |           |
           |           |           |
           |  +--------+--------+  |
           |  | HTML5Translator |  |
           |  +--------+--------+  |
           |           |           |
           |           |           |
           |     +-----+-----+     |
           |     | ElemStack |     |
           |     +-----------+     |
           +-----------------------+

The standard visit_NODE_TYPE action is to initiate a new node context:

    def default_visit(self, node):
        '''
        Initiate a new context to store inner HTML5 elements.
        '''
        if 'ids' in node and self.once_attr('expand_id_to_anchor', default=True):
            # create an anchor <a id=id></a> for each id found before the
            # current element.
            for id in node['ids'][1:]:
                self.context.begin_elem()
                self.context.commit_elem(tag.a(id=id))
            node.attributes['ids'] = node.attributes['ids'][0:1]
        self.context.begin_elem()
        return

The standard depart_NODE_TYPE action is to create the HTML5 element according to the saved context:

    def default_departure(self, node):
        '''
        Create the node's corresponding HTML5 element and combine it with its
        stored context.
        '''
        tag_name, indent, attributes = self.parse(node)
        elem = getattr(tag, tag_name)(**attributes)
        self.context.commit_elem(elem, indent)
        return

Not all rst elements follow this procedure. The Text element, for example, is a leaf node and thus doesn’t need a context of its own. Other elements share common processing and can use the same visit_ and/or depart_ method. To take advantage of these similarities, the rst_terms dict maps each node type to its visit_ and depart_ methods:

    def append(self, element, indent=True):
        '''
        Append to current element
        '''
        self.stack[-1].append(self._indent_elem(element, indent))
        return

    def begin_elem(self):
        '''
        Start a new element context
        '''
        self.stack.append([])
        self.indent_level += 1
        return

    def commit_elem(self, elem, indent=True):
        '''
        A new element is created by removing its stack to make a tag.
        This tag is pushed back into its parent's stack.
        '''
        pop = self.stack.pop()
        elem(*pop)
        self.indent_level -= 1
        self.append(elem, indent)
        return

    def pop(self):
        return self.pop_elements(1)[0]

    def pop_elements(self, num_elements):
        assert num_elements > 0
        parent_stack = self.stack[-1]
        result = []
        for x in range(num_elements):
            pop = parent_stack.pop()
            elem = pop[0 if len(pop) == 1 else self.indent_output]
            result.append(elem)
        result.reverse()
        return result


dv = 'default_visit'
dp = 'default_departure'
pass_ = 'no_op'


class HTML5Translator(nodes.NodeVisitor):

    rst_terms = {
        # 'term': ('tag', 'visit_func', 'depart_func', use_term_in_class,
        #          indent_elem)
        # use_term_in_class and indent_elem are optional.
        # If not given, the default is False, True
        'Text': (None, 'visit_Text', None),
        'abbreviation': ('abbr', dv, dp),
        'acronym': (None, dv, dp),
        'address': (None, 'visit_address', None),
        'admonition': ('aside', 'visit_aside', 'depart_aside', True),
        'attention': ('aside', 'visit_aside', 'depart_aside', True),
        'attribution': ('p', dv, dp, True),
        'author': (None, 'visit_bibliographic_field', None),
        'authors': (None, 'visit_authors', None),
        'block_quote': ('blockquote', 'visit_blockquote', dp),
        'bullet_list': ('ul', dv, dp, False),
        'caption': ('figcaption', dv, dp, False),
        'caution': ('aside', 'visit_aside', 'depart_aside', True),
        'citation': (None, 'visit_citation', 'depart_citation', True),
        'citation_reference': ('a', 'visit_citation_reference',
                               'depart_reference', True, False),
        'classifier': (None, 'visit_classifier', None),
        'colspec': (None, pass_, 'depart_colspec'),
        'comment': (None, 'visit_comment', None),
        'compound': ('div', dv, dp),
        'contact': (None, 'visit_bibliographic_field', None),
        'container': ('div', dv, dp),
        'copyright': (None, 'visit_bibliographic_field', None),
        'danger': ('aside', 'visit_aside', 'depart_aside', True),
        'date': (None, 'visit_bibliographic_field', None),
        'decoration': (None, 'do_nothing', None),
        'definition': ('dd', dv, dp),
        'definition_list': ('dl', dv, dp),
        'definition_list_item': (None, 'do_nothing', None),
        'description': ('td', dv, dp),
        'docinfo': (None, 'do_nothing', None),
        'doctest_block': ('pre', 'visit_literal_block', 'depart_literal_block', True),
        'document': (None, 'visit_document', 'depart_document'),
        'emphasis': ('em', dv, dp, False, False),
        'entry': (None, dv, 'depart_entry'),
        'enumerated_list': ('ol', dv, 'depart_enumerated_list'),
        'error': ('aside', 'visit_aside', 'depart_aside', True),
        'field': (None, 'visit_field', None),
        'field_body': (None, 'do_nothing', None),
        'field_list': (None, 'do_nothing', None),
        'field_name': (None, 'do_nothing', None),
        'figure': (None, 'visit_figure', dp),
        'footer': (None, dv, dp),
        'footnote': (None, 'visit_citation', 'depart_citation', True),
        'footnote_reference': ('a', 'visit_citation_reference', 'depart_reference', True, False),
        'generated': (None, 'do_nothing', None),
        'header': (None, dv, dp),
        'hint': ('aside', 'visit_aside', 'depart_aside', True),
        'image': ('img', dv, dp),
        'important': ('aside', 'visit_aside', 'depart_aside', True),
        'inline': ('span', dv, dp, False, False),
        'label': ('th', 'visit_reference', 'depart_label'),
        'legend': ('div', dv, dp, True),
        'line': (None, 'visit_line', None),
        'line_block': ('pre', 'visit_line_block', 'depart_line_block', True),
        'list_item': ('li', dv, dp),
        'literal': ('code', 'visit_literal', 'depart_literal', False, False),
        'literal_block': ('pre', 'visit_literal_block', 'depart_literal_block'),
        'math': (None, 'visit_math_block', None),
        'math_block': (None, 'visit_math_block', None),
        'meta': (None, 'visit_meta', None),
        'note': ('aside', 'visit_aside', 'depart_aside', True),
        'option': ('kbd', 'visit_option', dp, False, False),
        'option_argument': ('var', 'visit_option_argument', dp, False, False),
        'option_group': ('td', 'visit_option_group', 'depart_option_group'),
        'option_list': (None, 'visit_option_list', 'depart_option_list', True),
        'option_list_item': ('tr', dv, dp),
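
One way such a table can drive a visitor is to resolve the visit_NODE_TYPE and depart_NODE_TYPE names through the table at lookup time. The sketch below illustrates that general technique under stated assumptions and is not the actual rst2html5 code: the TableVisitor class, its __getattr__ hook, the log attribute and the three-entry table are hypothetical, and nodes are represented by their type name only. Just dv, dp, no_op and the tuple layout are borrowed from the listing above.

```python
dv = "default_visit"
dp = "default_departure"


class TableVisitor:
    # node type -> (tag, visit method name, depart method name)
    rst_terms = {
        "Text": (None, "visit_Text", None),
        "compound": ("div", dv, dp),
        "list_item": ("li", dv, dp),
    }

    def __init__(self):
        self.log = []

    def default_visit(self, node_type):
        self.log.append("open " + self.rst_terms[node_type][0])

    def default_departure(self, node_type):
        self.log.append("close " + self.rst_terms[node_type][0])

    def visit_Text(self, node_type):
        self.log.append("text")

    def no_op(self, node_type):
        pass

    def __getattr__(self, name):
        # Called only when normal lookup fails, e.g. for visit_list_item.
        prefix, _, node_type = name.partition("_")
        if prefix not in ("visit", "depart") or node_type not in self.rst_terms:
            raise AttributeError(name)
        visit_name, depart_name = self.rst_terms[node_type][1:3]
        target = visit_name if prefix == "visit" else depart_name
        # A missing entry (None) means there is nothing to do for that step.
        return getattr(self, target or "no_op")


v = TableVisitor()
v.visit_list_item("list_item")
v.visit_Text("Text")
v.depart_Text("Text")
v.depart_list_item("list_item")
print(v.log)
```

With this arrangement, adding support for a node type that follows the standard procedure only takes a new table entry.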

HTML5 Tag Construction

HTML5 tags are constructed with the genshi.builder.tag object.

Genshi Builder

Support for programmatically generating markup streams from Python code using a very simple syntax. The main entry point to this module is the tag object (which is actually an instance of the ElementFactory class). You should rarely (if ever) need to directly import and use any of the other classes in this module.

Elements can be created using the tag object using attribute access. For example:

>>> doc = tag.p('Some text and ', tag.a('a link', href='http://example.org/'), '.')
>>> doc
<Element "p">

This produces an Element instance which can be further modified to add child nodes and attributes. This is done by “calling” the element: positional arguments are added as child nodes (alternatively, the Element.append method can be used for that purpose), whereas keyword arguments are added as attributes:

>>> doc(tag.br)
<Element "p">
>>> print(doc)
<p>Some text and <a href="http://example.org/">a link</a>.<br/></p>

If an attribute name collides with a Python keyword, simply append an underscore to the name:

>>> doc(class_='intro')
<Element "p">
>>> print(doc)
<p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

As shown above, an Element can easily be directly rendered to XML text by printing it or using the Python str() function. This is basically a shortcut for converting the Element to a stream and serializing that stream:

>>> stream = doc.generate()
>>> stream 
<genshi.core.Stream object at ...>
>>> print(stream)
<p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

The tag object also allows creating “fragments”, which are basically lists of nodes (elements or text) that don’t have a parent element. This can be useful for creating snippets of markup that are attached to a parent element later (for example in a template). Fragments are created by calling the tag object, which returns an object of type Fragment:

>>> fragment = tag('Hello, ', tag.em('world'), '!')
>>> fragment
<Fragment>
>>> print(fragment)
Hello, <em>world</em>!

ElemStack

For the previous doctree example, the sequence of visit_... and depart_... calls is:

1. visit_document
    2. visit_title
        3. visit_Text
        4. depart_Text
    5. depart_title
    6. visit_paragraph
        7. visit_Text
        8. depart_Text
    9. depart_paragraph
10. depart_document

For this sequence, the behavior of an ElemStack context object is:

  1. Initial State. The context stack is empty:

    context = []
    
  2. visit_document. A new context for document is reserved:

    context = [ [] ]
                 \
                  document
                  context
    
  3. visit_title. A new context for title is pushed into the context stack:

                    title
                    context
                     /
    context = [ [], [] ]
                 \
                  document
                  context
    

  4. visit_Text. A Text node doesn’t need a new context because it is a leaf node. Its text is simply added to the context of its parent node:

                      title
                      context
                     /
    context = [ [], ['Title'] ]
                 \
                  document
                  context

  5. depart_Text. No action is performed. The context stack remains the same.

  6. depart_title. This is the end of the title processing. The title context is popped from the context stack to form an h1 tag, which is then inserted into the context of the title’s parent node (the document context):

    context = [ [tag.h1('Title')] ]
                 \
                  document
                  context

  7. visit_paragraph. A new context is added:

                                     paragraph
                                     context
                                    /
    context = [ [tag.h1('Title')], [] ]
                 \
                  document
                  context

  8. visit_Text. Again, the text is inserted into the context of its parent node:

                                     paragraph
                                     context
                                    /
    context = [ [tag.h1('Title')], ['Text and more text'] ]
                 \
                  document
                  context

  9. depart_Text. No action is performed.

  10. depart_paragraph. Follows the standard procedure: the current context is popped to form a new tag, which is appended to the context of the parent node:

    context = [ [tag.h1('Title'), tag.p('Text and more text')] ]
                 \
                  document
                  context

  11. depart_document. The document node doesn’t have an HTML tag. Its context is simply combined with the outer context to form the body of the HTML5 document:

    context = [tag.h1('Title'), tag.p('Text and more text')]
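
The walk-through above can be replayed with a simplified stack. The MiniStack class below is a hypothetical reduction written for this example: it builds plain strings instead of Genshi elements and leaves out indentation and escaping, so it only approximates what ElemStack does.

```python
class MiniStack:
    """Hypothetical, simplified version of the context stack."""

    def __init__(self):
        self.stack = []

    def begin_elem(self):
        # Start a new, empty context for the element being visited.
        self.stack.append([])

    def append(self, item):
        # Add text or a finished element to the current context.
        self.stack[-1].append(item)

    def commit_elem(self, tag_name):
        # Pop the current context, wrap it in a tag and hand the
        # result over to the parent context.
        children = self.stack.pop()
        self.append("<%s>%s</%s>" % (tag_name, "".join(children), tag_name))


ctx = MiniStack()
ctx.begin_elem()                    # visit_document
ctx.begin_elem()                    # visit_title
ctx.append("Title")                 # visit_Text
ctx.commit_elem("h1")               # depart_title
ctx.begin_elem()                    # visit_paragraph
ctx.append("Text and more text")    # visit_Text
ctx.commit_elem("p")                # depart_paragraph
body = "".join(ctx.stack.pop())     # depart_document: no tag of its own
print(body)
```

Because every visit pushes a context and every departure pops one, the stack depth always matches the nesting depth of the node being processed.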
    

rst2html5 Tests

The tests executed in rst2html5.tests.test_html5writer are based on generators (see http://nose.readthedocs.org/en/latest/writing_tests.html#test-generators). The test cases are in tests/cases.py. Each test case is a dictionary whose main keys are:

rst: text snippet in rst format
out: expected output
part: specifies which part of the rst2html5 output will be compared to out. Possible values are head, body or whole.

All other keys are rst2html5 configuration settings such as indent_output, script, script-defer, html-tag-attr or stylesheet.
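
As an illustration of this layout, the sketch below defines one test case and separates the three main keys from the configuration settings. The case content, the split_case helper and the 'body' fallback for part are assumptions made for this example; they are not taken from tests/cases.py, and the out value is a placeholder rather than verified rst2html5 output.

```python
MAIN_KEYS = ("rst", "out", "part")

# Hypothetical test case following the layout described above.
title_case = {
    "rst": "Title\n=====\n\nText and more text\n",
    "out": "<placeholder for the expected HTML5>",
    "part": "body",
    "indent_output": False,    # any other key is a configuration setting
}


def split_case(case):
    """Return (rst, out, part, settings) without modifying the case."""
    settings = {k: v for k, v in case.items() if k not in MAIN_KEYS}
    # Falling back to 'body' when part is missing is an assumption.
    return case["rst"], case["out"], case.get("part", "body"), settings


rst, out, part, settings = split_case(title_case)
print(part, settings)
```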

When a test fails, three auxiliary files are saved in the temporary directory (/tmp):

  1. TEST_CASE.rst with the rst snippet of the test case;
  2. TEST_CASE.result with the result produced by rst2html5; and
  3. TEST_CASE.expected with the result expected by the test case.

Their differences can be easily visualized:

$ kdiff3 /tmp/TEST_CASE.result /tmp/TEST_CASE.expected
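
When a graphical tool is not available, a plain-text diff gives a similar view. The sketch below shows how such files could be written and compared using only the standard library; save_failure and text_diff are hypothetical names chosen for this example, not functions from the rst2html5 test suite.

```python
import difflib
import os
import tempfile


def save_failure(name, rst, result, expected, directory=None):
    """Write NAME.rst, NAME.result and NAME.expected; return their paths."""
    directory = directory or tempfile.gettempdir()
    paths = {}
    for ext, text in (("rst", rst), ("result", result), ("expected", expected)):
        path = os.path.join(directory, "%s.%s" % (name, ext))
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths[ext] = path
    return paths


def text_diff(result, expected):
    """Unified diff of the produced output against the expected one."""
    lines = difflib.unified_diff(
        result.splitlines(), expected.splitlines(),
        fromfile="result", tofile="expected", lineterm="")
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    paths = save_failure("TEST_CASE", "Title\n=====\n",
                         "<h1>Title</h1>\n", "<h2>Title</h2>\n", tmp)
    saved = sorted(os.path.basename(p) for p in paths.values())
print(saved)
print(text_diff("<h1>Title</h1>\n", "<h2>Title</h2>\n"))
```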