Last updated Oct 3, 2024

HTML to markup conversion for the rich text editor

Introduction

This component enables the rich text editor by converting HTML (created by the renderer, then edited by the user) into Confluence Wiki Markup.

It works like this:

  1. Submit HTML to WysiwygConverter.convertXHtmlToWikiMarkup
  2. ...
  3. Get Wiki Markup back.

This document explains step 2 in some more detail. Most problems with this stage stem from difficulty in determining the correct amount of whitespace to put between two pieces of markup.

Classes and Responsibilities

This section briefly describes the main classes involved and their responsibilities.

DefaultConfluenceWysiwygConverter

Converts Wiki Markup to HTML to be given to the rich text editor, and converts edited HTML back to markup. Creates RenderContexts from pages and delegates the conversion operations to a WysiwygConverter instance.

DefaultWysiwygConverter

Converts Wiki Markup to XHTML to be given to the rich text editor, and converts edited XHTML back to markup. This class contains the guts of the HTML -> Markup conversion, and delegates the Markup -> HTML conversion to a WikiStyleRenderer, with the setRenderingForWysiwyg flag set to true in the RenderContext.

WysiwygNodeConverter

Interface for any class which can convert an HTML DOM tree into Markup. Can be implemented to convert particular macros back into markup. The macro class must implement WysiwygNodeConverter and give the macro's outer DIV a 'wysiwyg' attribute with the value 'macro:<macroname>'.

Styles

Aggregates text styles as we traverse the HTML DOM tree. Immutable. Responsible for interpreting Node attributes as styles and decorating markup text with style and colour macros/markup.

ListContext

Keeps track of nested lists - the depth and the type.
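As a rough illustration of that responsibility, the sketch below tracks the type of each enclosing list and builds the markup prefix for an item at the current depth. The class name ListContextSketch and all of its methods are hypothetical; the real ListContext may be structured quite differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for ListContext: remembers the type of each enclosing list. */
final class ListContextSketch {
    // One entry per enclosing list, outermost first: '*' for bulleted, '#' for numbered.
    private final Deque<Character> types = new ArrayDeque<>();

    /** Enter a nested list. */
    void push(boolean ordered) {
        types.addLast(ordered ? '#' : '*');
    }

    /** Leave the innermost list. */
    void pop() {
        types.removeLast();
    }

    /** Current nesting depth (0 when not inside any list). */
    int depth() {
        return types.size();
    }

    /** Markup prefix for a list item at the current nesting, e.g. "*#". */
    String itemPrefix() {
        StringBuilder sb = new StringBuilder();
        for (char c : types) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

A numbered list nested inside a bulleted list would then give the prefix "*#" for its items.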

WysiwygLinkHelper

Just a place to put some static methods for creating HTML attributes describing links, and for converting link HTML nodes into markup.

Overview of the HTML to Markup Conversion Process

Preprocessing the HTML

  1. First the incoming HTML is stripped of newlines and 'thinspaces', which were inserted during the rendering process so that there were places to put the cursor to insert text.
  2. XML processing instructions (which can be present when HTML is pasted from MS Word) are stripped.
  3. NekoHTML is used to parse the HTML into an XML document fragment.
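The first two preprocessing steps can be sketched with plain regular expressions. This is only an illustration: the class and method names are hypothetical, U+2009 is assumed to be the 'thinspace' character in question, and the processing-instruction pattern is a deliberate simplification (anything from "<?" up to the next ">"). The NekoHTML parse in step 3 is not shown.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of preprocessing steps 1 and 2. */
final class PreprocessSketch {
    // Newlines plus U+2009 THIN SPACE (assumed to be the character the renderer inserts).
    private static final Pattern NEWLINES_AND_THIN_SPACES =
            Pattern.compile("[\\r\\n\\u2009]");

    // Simplified match for processing instructions: "<?" up to the next ">".
    private static final Pattern PROCESSING_INSTRUCTION =
            Pattern.compile("<\\?[^>]*>");

    static String preprocess(String html) {
        // Step 1: remove newlines and thin spaces.
        String stripped = NEWLINES_AND_THIN_SPACES.matcher(html).replaceAll("");
        // Step 2: remove processing instructions.
        return PROCESSING_INSTRUCTION.matcher(stripped).replaceAll("");
    }
}
```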

Converting the Document Fragment to Markup

This uses the convertNode method, which has the honour of being the longest method in Atlassian (although not the most complex by cyclomatic complexity measures).

The signature of this method is:

String convertNode(
        Node node,
        Node previousSibling,
        Styles styles,
        ListContext listContext,
        boolean inTable,
        boolean inListItem,
        boolean ignoreText,
        boolean escapeWikiMarkup)

That is, the method returns the markup needed to represent the HTML contained in the DOM tree, based on the current context (what styles have been applied by parent nodes, are we already in a table or a list and so on).

The body of this method is a large case statement based on the type of the current node and the current state. The typical case gets the markup produced by its children, using the convertChildren method, decorates it in some way and returns the resulting string.

The convertChildren method simply iterates over a node's children calling convertNode and concatenating the markup returned.

In order to determine how much whitespace separates the markup produced by two sibling nodes we often need to know the type of each node. That is why convertNode takes a previousSibling argument. The getSep method takes the two nodes to be separated and some state information. It uses a lookup table to decide what type of whitespace (or other text) to use.
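The interplay between convertNode, convertChildren and getSep can be sketched in a heavily cut-down form. Everything here is an assumption made for illustration: the class name, the two-argument convertNode, the handful of tags handled, and the contents of the separator table. The JDK's XML parser stands in for NekoHTML, so the input must be well-formed XHTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical, much-simplified version of the recursive conversion. */
final class ConvertSketch {
    // Illustrative separator table keyed by "<previous>-<current>" node names.
    private static final Map<String, String> SEPARATORS = new HashMap<>();
    static {
        SEPARATORS.put("li-li", "\n");
        SEPARATORS.put("p-p", "\n\n");
        SEPARATORS.put("p-ul", "\n\n");
        SEPARATORS.put("ul-p", "\n\n");
    }

    /** Looks up the text that separates two sibling nodes. */
    static String getSep(Node previous, Node current) {
        if (previous == null) {
            return "";
        }
        String key = previous.getNodeName() + "-" + current.getNodeName();
        return SEPARATORS.getOrDefault(key, "");
    }

    /** Converts each child in turn and concatenates the results. */
    static String convertChildren(Node parent) {
        StringBuilder sb = new StringBuilder();
        Node previous = null;
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            sb.append(convertNode(child, previous));
            previous = child;
        }
        return sb.toString();
    }

    /** Decorates the children's markup according to the node type. */
    static String convertNode(Node node, Node previousSibling) {
        String sep = getSep(previousSibling, node);
        if (node.getNodeType() == Node.TEXT_NODE) {
            return sep + node.getNodeValue();
        }
        String body = convertChildren(node);
        switch (node.getNodeName()) {
            case "b":  return sep + "*" + body + "*";
            case "li": return sep + "* " + body;
            default:   return sep + body; // p, ul and anything unrecognised
        }
    }

    /** Parses well-formed XHTML and converts the children of the root element. */
    static String convert(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return convertChildren(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }
}
```

The real method also threads styles, list context and the boolean flags through every call; they are omitted here to keep the recursion and the separator lookup visible.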

Post-processing the markup

  1. Clean up whitespace and multiple newlines - the conversion process may insert too many newlines or multiple "TEXTSEP" strings to separate text - these are collapsed into single newlines and single spaces.
  2. Replace {*} style markup with simply * where possible.
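The whitespace clean-up in step 1 might look something like the sketch below. The class name is hypothetical, the literal token "TEXTSEP" is taken at face value from the description above, and the exact collapsing rules (in particular, allowing at most one blank line) are assumptions. Step 2 is not covered.

```java
/** Hypothetical sketch of the whitespace clean-up pass. */
final class PostprocessSketch {
    // Placeholder separator token; the real value used by the converter may differ.
    static final String TEXTSEP = "TEXTSEP";

    static String cleanUp(String markup) {
        return markup
                // One or more adjacent separator tokens become a single space.
                .replaceAll("(?:" + TEXTSEP + ")+", " ")
                // Runs of spaces collapse to one.
                .replaceAll(" {2,}", " ")
                // Three or more newlines collapse to a single blank line (assumed rule).
                .replaceAll("\\n{3,}", "\n\n");
    }
}
```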

Worthwhile Style Improvements

  1. Split up convertNode so that it is responsible for deciding what treatment the current node needs, and then calling convertTextNode, convertDivNode etc.
  2. Put the state passed to convertNode into an immutable object to reduce the parameter clutter. Don't use a Map.
  3. Refactor WysiwygLinkHelper - it's very confusing.
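For the second suggestion, one possible shape for the immutable state object is sketched below. The class name, the "with" methods and the initial flag values are all hypothetical; a real version would also carry the Styles and ListContext, which are left out so the example stands alone.

```java
/** Hypothetical immutable holder for the flags currently passed to convertNode. */
final class ConversionState {
    final boolean inTable;
    final boolean inListItem;
    final boolean ignoreText;
    final boolean escapeWikiMarkup;

    // Assumed starting state: outside tables and lists, with markup escaping on.
    static final ConversionState INITIAL = new ConversionState(false, false, false, true);

    private ConversionState(boolean inTable, boolean inListItem,
                            boolean ignoreText, boolean escapeWikiMarkup) {
        this.inTable = inTable;
        this.inListItem = inListItem;
        this.ignoreText = ignoreText;
        this.escapeWikiMarkup = escapeWikiMarkup;
    }

    // Each "with" method returns a modified copy and leaves the original untouched.
    ConversionState withInTable(boolean value) {
        return new ConversionState(value, inListItem, ignoreText, escapeWikiMarkup);
    }

    ConversionState withInListItem(boolean value) {
        return new ConversionState(inTable, value, ignoreText, escapeWikiMarkup);
    }

    ConversionState withIgnoreText(boolean value) {
        return new ConversionState(inTable, inListItem, value, escapeWikiMarkup);
    }

    ConversionState withEscapeWikiMarkup(boolean value) {
        return new ConversionState(inTable, inListItem, ignoreText, value);
    }
}
```

A case that descends into a table cell could then pass state.withInTable(true) to its children without affecting the state seen by its siblings.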

Rendering in 'For Wysiwyg' Mode

The HTML produced by the renderer to be displayed by the rich text editor is not identical to that generated for display. It contains extra attributes which are cues to the conversion process. The following list isn't exhaustive, but gives the flavour of the types of considerations involved.

  1. Some errors should be rendered differently so that the original markup isn't lost - e.g. an embedded image which can't be found should be displayed as a placeholder, not just an error message.

  2. When links are rendered extra attributes are added to the tag so that the appropriate alias, destination and tooltip can be determined. See WysiwygLinkHelper's javadoc for details.

  3. Some errors put the erroneous markup in a span with the "wikisrc" class, which causes its contents to be directly used as markup.

  4. This speaks for itself:

    // @HACK 
    // The newline before the title parameter below fixes CONF-4562. I have absolutely no idea HOW it fixes 
    // CONF-4562, but the simple fact that it does fix the problem indicates that I could spend my whole life 
    // trying to work out why and be none the wiser. I suggest you don't think too hard about it either, and 
    // instead contemplate the many joys that can be found in life -- the sunlight reflecting off Sydney 
    // Harbour; walking through the Blue Mountains on a dew-laden Autumn morning; the love of a beautiful 
    // woman -- this should in some way distract you from the insane ugliness of the code I am about to check 
    // in. 
    // 
    // Oh, and whatever you do, don't remove the damn newline.
    // 
    // -- Charles, November 09, 2005 
    if (renderContext.isRenderingForWysiwyg())
        buffer.append("\n");
    
  5. Thin spaces are added at strategic points so that there is somewhere to place the cursor when inserting text, e.g. at the end of the page, in a new paragraph.

  6. Curly brackets are treated differently: a '{' typed in the RTE is interpreted as the start of a macro tag, not as an escaped '{' - you must explicitly escape '{' and '}' in the RTE.

  7. Macros.
    From a wysiwyg point of view there are four cases:

    1. Macros with unrendered bodies (or no bodies). These appear as {macro} ... unrendered body ... {macro}, so the user can edit the body text in wysiwyg mode.
    2. Macros with rendered bodies, but which the editor doesn't 'understand' - that is, the editor can't manipulate the HTML produced by the macro. These are rendered as {macro} ... rendered body ... {macro}. A macro indicates that the editor doesn't understand it by returning true from suppressMacroRenderingDuringWysiwyg(). Most macros should do this, unless the Wysiwyg converter understands how to create a new instance of the macro. The user can edit the HTML in the body of these macros, which will be converted back to markup.
    3. Macros we fully understand. These are simply rendered as normal (but surrounded by a div or span describing them). These return false from suppressMacroRenderingDuringWysiwyg().
    4. Macros which are responsible for their own rendering. These return true from suppressSurroundingTagDuringWysiwygRendering().
  8. The bq. markup adds an attribute to the tag to distinguish it from a blockquote tag produced by the {quote} macro.

  9. The header DIV of panel macros is given a wysiwyg="ignore" attribute, because it is generated from the macro parameters. This means that if you edit the title of a panel macro in the RTE, the change is ignored.

  10. Look at the InlineHtmlMacro for an example of a macro which implements WysiwygNodeConverter.

How To Fix Bugs

Writing Tests

The first thing to do is to write a failing test. At the moment all the tests are in com.atlassian.renderer.wysiwyg.TestSimpleMarkup. Keeping them all together is reasonable, as they run quickly and you will want to make sure that your fixes don't break any of the other tests.

There are two types of test - markup tests and XHTML tests.

Use a markup test when you have a piece of markup which doesn't 'round trip' correctly. For instance, perhaps the markup:

* foo

* bar 

becomes

* foo 
* bar 

when you go from wiki markup mode to rich text mode and back again.
The body of the test you write would be:

testMarkup("* foo\n\n* bar"); 

which will check that the markup is the same after a round trip. Note that it is OK for markup to change in some circumstances - two different markup strings may be equivalent, and the round trip will convert the starting markup to 'canonical markup' which renders identically to the initial markup. There are also pathological cases where a round trip may switch markup between two equivalent strings. These should be fixed even though they don't break the rendering, because they show up as changes in the version history.

Use an XHTML test when a bug is caused by the conversion of user-edited (or pasted) HTML into markup.
In this case you write a test like this:

testXHTML("...offending HTML...", "...desired markup...");

This test first checks that the desired markup round-trips correctly, then that the HTML converts to that markup.
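To make the two checks concrete, here is a sketch of what helpers of this kind assert. It is not the code in TestSimpleMarkup: the class RoundTripChecks and its constructor are hypothetical, and the two conversion directions are passed in as plain functions so the example stands alone.

```java
import java.util.function.UnaryOperator;

/** Hypothetical illustration of the assertions behind testMarkup and testXHTML. */
final class RoundTripChecks {
    private final UnaryOperator<String> markupToXhtml;
    private final UnaryOperator<String> xhtmlToMarkup;

    RoundTripChecks(UnaryOperator<String> markupToXhtml, UnaryOperator<String> xhtmlToMarkup) {
        this.markupToXhtml = markupToXhtml;
        this.xhtmlToMarkup = xhtmlToMarkup;
    }

    /** Markup must be unchanged after markup -> XHTML -> markup. */
    void testMarkup(String markup) {
        String actual = xhtmlToMarkup.apply(markupToXhtml.apply(markup));
        check(markup, actual);
    }

    /** The desired markup must round-trip, and the XHTML must convert to it. */
    void testXHTML(String xhtml, String expectedMarkup) {
        testMarkup(expectedMarkup);
        check(expectedMarkup, xhtmlToMarkup.apply(xhtml));
    }

    private static void check(String expected, String actual) {
        if (!expected.equals(actual)) {
            throw new AssertionError("expected <" + expected + "> but was <" + actual + ">");
        }
    }
}
```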

Finding Problems

Once you have written your test you need to find out what the converter is doing.

Running the test in debug mode and putting breakpoints in testMarkup/testXHTML is the best way of doing this. As you track down the nodes causing problems you can put breakpoints in the part of convertNode which handles the offending type of node.

You can also set 'debug' to true in DefaultWysiwygConverter.java:44 - this will dump the XHTML produced by Neko, turn off the post-processing mentioned above, and print out details of the separator calculations in the generated markup string.

So you might see:

[li-li 
false,false] 

which means that two list items, not in a table and not in a (nested) list, get separated by a newline. You can tweak the table of separators as needed.
