======================
rst2html5 Design Notes
======================
This document records what was learned while implementing ``rst2html5``.
It may be helpful to people who want to contribute to the project or to write another rst converter.
Docutils
========
Docutils_ is a set of tools for converting plain-text documentation written in reStructuredText_ markup (rst)
into other formats such as HTML, PDF and LaTeX.
Its design and implementation details are described in PEP 258 (Docutils Design Specification):
http://docutils.sourceforge.net/docs/peps/pep-0258.html
In the early stages of the translation process,
the rst document is analyzed and transformed into an intermediary format called *doctree*
which is then passed to a translator to be transformed into the desired formatted output::

                          Translator
                     +-------------------+
                     |    +---------+    |
    ---> doctree -------->|  Writer |-------> output
                     |    +----+----+    |
                     |         |         |
                     |         |         |
                     |  +------+------+  |
                     |  | NodeVisitor |  |
                     |  +-------------+  |
                     +-------------------+

Doctree
-------
The doctree_ is a hierarchical structure of the elements of a ``rst`` document.
It is defined in ``docutils.nodes`` and is used internally by Docutils components.
The command :command:`rst2pseudoxml.py` produces a textual representation of a doctree
that is very useful for visualizing how the elements of a ``rst`` document are nested.
This information was of great help for both the design and the tests of ``rst2html5``.
Given the following ``rst`` snippet:
.. code-block:: rst

    Title
    =====

    Text and more text

The textual representation produced by :command:`rst2pseudoxml.py` is:
.. code-block:: xml

    <document ids="title" names="title" source="<stdin>" title="Title">
        <title>
            Title
        <paragraph>
            Text and more text

Translator, Writer and NodeVisitor
----------------------------------
A translator consists of two parts: a |Writer| and a |NodeVisitor|.
The |Writer| is responsible for preparing and coordinating the translation performed by the |NodeVisitor|.
The |NodeVisitor| visits each doctree node and
performs all the actions needed to translate the node to the desired format
according to its type and content.
.. important::
To develop a new docutils translator, you need to specialize these two classes.
.. note::
Those classes correspond to a variation of the Visitor pattern,
called "Extrinsic Visitor" that is more commonly used in Python.
See
`The "Visitor Pattern", Revisited `_.
.. seealso::
`Double Dispatch and the "Visitor" Pattern `_.
::

                     +-------------+
                     |             |
                     |   Writer    |
                     |  translate  |
                     |             |
                     +------+------+
                            |
                            |    +---------------------------+
                            |    |                           |
                            v    v                           |
                       +------------+                        |
                       |            |                        |
                       |    Node    |                        |
                       | walkabout  |                        |
                       |            |                        |
                       +--+---+---+-+                        |
                          |   |   |                          |
                +---------+   |   +----------+               |
                |             |              |               |
                v             |              v               |
       +----------------+     |    +--------------------+    |
       |                |     |    |                    |    |
       |  NodeVisitor   |     |    |    NodeVisitor     |    |
       | dispatch_visit |     |    | dispatch_departure |    |
       |                |     |    |                    |    |
       +--------+-------+     |    +---------+----------+    |
                |             |              |               |
                |             +--------------|---------------+
                |                            |
                v                            v
      +-------------------+        +--------------------+
      |                   |        |                    |
      |    NodeVisitor    |        |    NodeVisitor     |
      | visit_<node_type> |        | depart_<node_type> |
      |                   |        |                    |
      +-------------------+        +--------------------+

.. http://www.asciiflow.com/#Draw
During the doctree traversal through :func:`docutils.nodes.Node.walkabout`,
there are two |NodeVisitor| dispatch methods called:
:func:`~docutils.nodes.NodeVisitor.dispatch_visit` and
:func:`~docutils.nodes.NodeVisitor.dispatch_departure`.
The former is called at the start of the node visit.
Then :func:`~docutils.nodes.Node.walkabout` is called on each child node, and lastly,
the latter dispatch method is called.
Each dispatch method calls another method whose name follows the pattern
``visit_<node_type>`` or ``depart_<node_type>``,
such as ``visit_paragraph`` or ``depart_title``,
which should be implemented by the |NodeVisitor| subclass.
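
The traversal described above can be illustrated with a minimal, self-contained sketch.
It does not use Docutils: ``Node``, ``NodeVisitor`` and ``RecordingVisitor`` below are simplified, hypothetical stand-ins
that only mimic the dispatch-by-class-name idea
(the real ``docutils.nodes.NodeVisitor`` additionally falls back to ``unknown_visit`` and ``unknown_departure``).

```python
class Node:
    """Toy stand-in for ``docutils.nodes.Node`` (simplified, hypothetical)."""

    def __init__(self, *children):
        self.children = list(children)

    def walkabout(self, visitor):
        visitor.dispatch_visit(self)        # before the children
        for child in self.children:
            child.walkabout(visitor)        # each child is traversed in turn
        visitor.dispatch_departure(self)    # after the children


# Node types are plain subclasses; the class name selects the method.
class document(Node): pass
class title(Node): pass
class paragraph(Node): pass
class Text(Node): pass


class NodeVisitor:
    """Toy dispatcher: builds the method name from the node's class name."""

    def dispatch_visit(self, node):
        return getattr(self, 'visit_' + type(node).__name__)(node)

    def dispatch_departure(self, node):
        return getattr(self, 'depart_' + type(node).__name__)(node)


class RecordingVisitor(NodeVisitor):
    """Records the name of every visit_/depart_ method that gets called."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that are not defined explicitly.
        if name.startswith(('visit_', 'depart_')):
            return lambda node: self.calls.append(name)
        raise AttributeError(name)


visitor = RecordingVisitor()
document(title(Text()), paragraph(Text())).walkabout(visitor)
calls = visitor.calls
```

After the traversal, ``calls`` holds the method names in depth-first order:
every ``visit_`` call for a node precedes the calls for its children,
and the matching ``depart_`` call follows them.
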
rst2html5
=========
In :mod:`rst2html5`,
|Writer| and |NodeVisitor| are specialized through
:class:`~rst2html5.HTML5Writer` and :class:`~rst2html5.HTML5Translator` classes.
:class:`rst2html5.HTML5Translator` is a |NodeVisitor| subclass
that implements all ``visit_`` and ``depart_`` methods
needed to translate a doctree to its HTML5 content.
The :class:`rst2html5.HTML5Translator` uses
an object of the :class:`~rst2html5.ElemStack` helper class that controls a context stack
to handle indentation and the nesting of the doctree traversal::

                       rst2html5
               +-----------------------+
               |    +-------------+    |
    doctree ---|--->| HTML5Writer |----|--> HTML5
               |    +------+------+    |
               |           |           |
               |           |           |
               |  +--------+--------+  |
               |  | HTML5Translator |  |
               |  +--------+--------+  |
               |           |           |
               |           |           |
               |     +-----+-----+     |
               |     | ElemStack |     |
               |     +-----------+     |
               +-----------------------+

The standard ``visit_`` action is called ``default_visit`` and it initiates a new element context:
.. literalinclude:: ../rst2html5/__init__.py
:pyobject: HTML5Translator.default_visit
:emphasize-lines: 12
The standard ``depart_`` action is ``default_departure`` and it creates the HTML5 element
corresponding to the saved context:
.. literalinclude:: ../rst2html5/__init__.py
:pyobject: HTML5Translator.default_departure
:emphasize-lines: 6-8
Not all rst elements follow this procedure.
The ``Text`` element, for example, is a leaf node and thus doesn't need a context of its own.
Other elements are processed in the same way and can share the same ``visit_`` and/or ``depart_`` method.
To take advantage of these similarities,
the ``rst_terms`` dict maps a node type to its ``visit_`` and ``depart_`` methods:
.. literalinclude:: ../rst2html5/__init__.py
:pyobject: HTML5Translator
:lines: 3-141
where ``dv`` is ``default_visit`` and ``dp`` means ``default_departure``.
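
The idea of a table that maps node types to shared handlers can be sketched as follows.
This is only an illustration under assumptions: ``SketchTranslator``, its ``handle`` method and the entries of its ``rst_terms`` table
are hypothetical, and the entries of the real ``rst_terms`` dict may carry more information than a pair of method names.

```python
class SketchTranslator:
    """Hypothetical, simplified translator; not the real HTML5Translator."""

    # Short aliases, mirroring the ``dv``/``dp`` naming used in the text.
    dv, dp = 'default_visit', 'default_departure'

    # node type -> (visit method name, depart method name); None means "do nothing".
    rst_terms = {
        'title': (dv, dp),
        'paragraph': (dv, dp),
        'Text': ('visit_Text', None),
    }

    def __init__(self):
        self.log = []

    def default_visit(self, node_type):
        self.log.append('open context for ' + node_type)

    def default_departure(self, node_type):
        self.log.append('build element for ' + node_type)

    def visit_Text(self, node_type):
        self.log.append('append text to parent context')

    def handle(self, node_type, phase):
        """Look up and call the handler for phase 0 (visit) or 1 (depart)."""
        method_name = self.rst_terms[node_type][phase]
        if method_name is not None:
            getattr(self, method_name)(node_type)


translator = SketchTranslator()
for node_type, phase in [('title', 0), ('Text', 0), ('Text', 1), ('title', 1)]:
    translator.handle(node_type, phase)
log = translator.log
```

With such a table, node types that need no special treatment share the two default handlers,
and only the exceptions (here ``Text``) need a method of their own.
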
HTML5 Tag Construction
----------------------
HTML5 Tags are constructed by the :class:`genshi.builder.tag` object.
ElemStack
---------
For the previous doctree example,
the sequence of ``visit_...`` and ``depart_...`` calls is this::
1. visit_document
2. visit_title
3. visit_Text
4. depart_Text
5. depart_title
6. visit_paragraph
7. visit_Text
8. depart_Text
9. depart_paragraph
10. depart_document
For this sequence,
the behavior of an ``ElemStack`` context object is:

0. **Initial State**. The context stack is empty::

       context = []

1. **visit_document**. A new context for ``document`` is reserved::

       context = [ [] ]
                    \
                     document
                     context

2. **visit_title**. A new context for *title* is pushed onto the context stack::

                          title
                          context
                         /
       context = [ [], [] ]
                    \
                     document
                     context

3. **visit_Text**. A ``Text`` node doesn't need a new context because it is a leaf node.
   Its text is simply added to the context of its parent node::

                          title
                          context
                         /
       context = [ [], ['Title'] ]
                    \
                     document
                     context

4. **depart_Text**. No action performed. The context stack remains the same.

5. **depart_title**. This is the end of the title processing.
   The title context is popped from the context stack to form an ``h1`` tag
   that is then inserted into the context of the title's parent node (*document context*)::

       context = [ [tag.h1('Title')] ]
                    \
                     document
                     context

6. **visit_paragraph**. A new context is added::

                                         paragraph
                                         context
                                        /
       context = [ [tag.h1('Title')], [] ]
                    \
                     document
                     context

7. **visit_Text**. Again, the text is inserted into the context of its parent node::

                                         paragraph
                                         context
                                        /
       context = [ [tag.h1('Title')], ['Text and more text'] ]
                    \
                     document
                     context

8. **depart_Text**. No action performed.

9. **depart_paragraph**. Follows the standard procedure:
   the current context is popped to form a new tag that is appended to
   the context of the parent node::

       context = [ [tag.h1('Title'), tag.p('Text and more text')] ]
                    \
                     document
                     context

10. **depart_document**. The document node doesn't have an HTML tag.
    Its context is simply merged into the outer context to form the body of the HTML5 document::

        context = [tag.h1('Title'), tag.p('Text and more text')]

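
The ten steps above can be replayed with a small, self-contained sketch.
``ContextStack`` is a hypothetical, much simplified stand-in for ``ElemStack``:
its method names (``begin``, ``append``, ``commit``) are assumptions made for this example,
and it builds plain strings instead of ``genshi`` tags.

```python
from html import escape


class ContextStack:
    """Hypothetical stand-in for ElemStack: a stack of lists of children."""

    def __init__(self):
        self.stack = []

    def begin(self):
        """Push a new, empty context (the default ``visit_`` behavior)."""
        self.stack.append([])

    def append(self, item):
        """Add an item (text or finished element) to the current context."""
        self.stack[-1].append(item)

    def commit(self, tag=None):
        """Pop the current context and turn it into an element.

        With ``tag=None`` the children are only joined, with no wrapping tag.
        The result is appended to the parent context, if there is one.
        """
        content = ''.join(self.stack.pop())
        element = content if tag is None else '<%s>%s</%s>' % (tag, content, tag)
        if self.stack:
            self.stack[-1].append(element)
        return element


ctx = ContextStack()
ctx.begin()                               # 1. visit_document
ctx.begin()                               # 2. visit_title
ctx.append(escape('Title'))               # 3. visit_Text (4. depart_Text: no-op)
ctx.commit('h1')                          # 5. depart_title
ctx.begin()                               # 6. visit_paragraph
ctx.append(escape('Text and more text'))  # 7. visit_Text (8. depart_Text: no-op)
ctx.commit('p')                           # 9. depart_paragraph
body = ctx.commit()                       # 10. depart_document
```

At the end, the stack is empty again and ``body`` contains the ``h1`` element followed by the ``p`` element.
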
.. _tests:
rst2html5 Tests
===============
The test cases are located at :file:`tests/cases.py` and
each test case is a dictionary whose main keys are:
:rst: text snippet in rst format
:out: expected output
:part: specifies which part of **rst2html5** output will be compared to **out**.
Possible values are **head**, **body** or **whole**.
Other possible keys are ``rst2html5`` configuration settings such as
*indent_output*, *script*, *script-defer*, *html-tag-attr* or *stylesheet*.
When a test fails,
three auxiliary files are created in the default temporary directory (:file:`/tmp`):

#. :file:`TEST_CASE_NAME.rst` contains the rst snippet of the test case;
#. :file:`TEST_CASE_NAME.result` contains the result produced by **rst2html5**; and
#. :file:`TEST_CASE_NAME.expected` contains the expected result.

Their differences can be easily visualized with a diff tool::

    $ kdiff3 /tmp/TEST_CASE_NAME.result /tmp/TEST_CASE_NAME.expected

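
The shape of a test case and the failure behavior described above can be sketched as follows.
This is not the project's test runner: ``run_case``, ``fake_convert`` and the case name ``sketch_title_case`` are hypothetical,
and ``convert`` stands in for whatever function actually performs the conversion.

```python
import os
import tempfile


def run_case(name, case, convert):
    """Run one test case; on failure, write the three auxiliary files.

    ``convert`` is a hypothetical callable ``(rst, part, **settings) -> str``
    standing in for the real conversion entry point.
    """
    # Every key other than rst/out/part is treated as a configuration setting.
    settings = {k: v for k, v in case.items() if k not in ('rst', 'out', 'part')}
    result = convert(case['rst'], case['part'], **settings)
    if result == case['out']:
        return True
    tmp = tempfile.gettempdir()
    for ext, content in (('rst', case['rst']),
                         ('result', result),
                         ('expected', case['out'])):
        with open(os.path.join(tmp, name + '.' + ext), 'w', encoding='utf-8') as f:
            f.write(content)
    return False


# A hypothetical test case using the keys described above.
title_case = {
    'rst': 'Title\n=====\n',
    'out': '<h1>Title</h1>\n',
    'part': 'body',
    'indent_output': False,
}


def fake_convert(rst, part, **settings):
    """Deliberately wrong converter, used only to trigger the failure path."""
    return '<h2>Title</h2>\n'


passed = run_case('sketch_title_case', title_case, fake_convert)
```

Because ``fake_convert`` returns the wrong heading level, ``passed`` is ``False``
and the three files are left in the temporary directory, ready to be compared with a diff tool.
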
.. _workaround:
Workaround to Conflicts with ``Docutils``
=========================================
Installing the ``rst2html5`` package should make it usable both from the command line
and as a module imported by other projects.
For example, to use it via command line:
.. code:: bash
$ rst2html5 example.rst example.html
And programmatically from another project:
.. code:: python
from rst2html5 import HTML5Writer
...
The problem is that after version ``0.13.1``,
the ``docutils`` installation creates two scripts called ``rst2html5``
*and* ``rst2html5.py`` in ``<venv>/bin``,
where ``<venv>`` is the installation path of the virtual environment being used.
Both do the same thing.
Since it is not possible to delete a script from another package,
the ``rst2html5`` package installation overwrites both,
but ``rst2html5.py`` still causes problems.
When importing ``rst2html5`` from one of those scripts,
Python reaches ``<venv>/bin/rst2html5.py``
instead of ``<venv>/lib/<pythonX.Y>/site-packages/rst2html5``
because the former comes first in ``sys.path``
**during the execution, in a virtual environment**.
A typical ``sys.path`` is:
.. code:: python
[
'/tmp/py39/bin',
'/usr/lib/python39.zip',
'/usr/lib/python3.9',
'/usr/lib/python3.9/lib-dynload',
'/tmp/py39/lib/python3.9/site-packages'
]
where ``/tmp/py39`` is the path of the virtual environment,
and ``python3.9`` is the current Python version.
.. note::

    The ``sys.path`` seen from the interactive command line is different
    from the one inside a running script.
    To get the real value,
    you must manually insert a breakpoint or print it from an installed script.

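
The reason the script directory comes first can be checked with a small experiment.
The sketch below writes a throwaway script (the name ``show_path.py`` is arbitrary) to a temporary directory,
runs it with the current interpreter and compares the first ``sys.path`` entry it reports with the script's own directory.
It assumes Python's default behavior, that is, safe-path mode (``-P`` or ``PYTHONSAFEPATH``) is not enabled.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, 'show_path.py')
    with open(script, 'w', encoding='utf-8') as f:
        f.write('import sys\nprint(sys.path[0])\n')

    # Run the script as a separate process, as an installed script would be.
    completed = subprocess.run(
        [sys.executable, script],
        capture_output=True, text=True, check=True,
    )
    # Resolve symlinks so that both paths are comparable.
    first_entry = os.path.realpath(completed.stdout.strip())
    script_dir = os.path.realpath(tmp)

same_directory = (first_entry == script_dir)
```

When a file is run as a script, Python puts the directory containing that file at the front of ``sys.path``,
which is why a module sitting next to the script is found before anything in ``site-packages``.
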
For ``1.9.2 <= rst2html5 < 2.0``,
the immediate solution was to rename the module from ``rst2html5`` to ``rst2html5_``,
so that importing would skip ``<venv>/bin/rst2html5.py``
and find the right module at ``<venv>/lib/<pythonX.Y>/site-packages``.
Version ``2.0`` implements a more elegant solution
that allows both the script *and* the module to be named ``rst2html5``.
The script ``<venv>/bin/rst2html5`` still imports ``rst2html5_``,
but instead of reaching a module, the import hits a file called ``rst2html5_.py``
that modifies ``sys.path`` and only then imports the module ``rst2html5``::

                  <venv>/bin       <venv>/lib/<pythonX.Y>/site-packages

    before 2.0:   rst2html5 -----> rst2html5_/
                  (import rst2html5_)

    2.0 onwards:  rst2html5 -----> rst2html5_.py --------> rst2html5/
                  (import rst2html5_)   (modifies sys.path
                                         and then imports rst2html5)

``<venv>/bin/rst2html5`` is generated automatically during the package installation
and contains something very similar to this:

.. code:: python

    #!<venv>/bin/python
    from rst2html5_ import main

    if __name__ == '__main__':
        main()

The intermediary file ``rst2html5_.py`` is shown below:
.. literalinclude:: ../rst2html5_.py
:lines: 7-
:name: rst2html5_.py
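
To make the idea of "modifying ``sys.path`` before importing" more concrete,
here is a minimal sketch of one *possible* modification: dropping the script's own directory from the search path.
It is not the content of ``rst2html5_.py``; the helper ``without_script_dir`` is hypothetical
and the real file may adjust ``sys.path`` differently.

```python
import os


def without_script_dir(path_entries, script_file):
    """Return ``path_entries`` without the directory that holds ``script_file``.

    Hypothetical helper: entries are compared after resolving them to
    real paths, and the original strings are kept in the result.
    """
    script_dir = os.path.realpath(os.path.dirname(script_file))
    return [
        entry for entry in path_entries
        if os.path.realpath(entry or os.curdir) != script_dir
    ]


# The typical ``sys.path`` shown earlier in this section.
sample_path = [
    '/tmp/py39/bin',
    '/usr/lib/python39.zip',
    '/usr/lib/python3.9',
    '/usr/lib/python3.9/lib-dynload',
    '/tmp/py39/lib/python3.9/site-packages',
]
cleaned = without_script_dir(sample_path, '/tmp/py39/bin/rst2html5')
```

With ``/tmp/py39/bin`` removed, an ``import rst2html5`` can no longer reach ``rst2html5.py`` in the scripts directory
and resolves to the package in ``site-packages`` instead.
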
The package installation is configured in the file ``pyproject.toml``:
.. code-block:: toml
[tool.poetry]
...
packages = [
{include = "rst2html5"}
]
include = ["rst2html5_.py"]
[tool.poetry.scripts]
rst2html5 = "rst2html5_:main" # overwrites docutils' rst2html5
...
.. attention::
It is very likely that projects that use ``rst2html5`` prior to 2.0 *won't* need to change their imports
because ``rst2html5_.HTML5Writer`` is still reachable through the new ``rst2html5_.py`` file.
However, they're advised to do so.