Parsing HTML email for presentation

I needed to display HTML emails in the browser. However, since anything can show up in an inbox, it is important to sanitise the content before displaying it. It would also be useful to report what was sanitised and to optionally render differently depending on settings, for example to prevent images from loading, as many of the bigger web-based email services do.

For this solution I will take the incoming HTML, then parse and sanitise it before displaying the content to the user. The first thing we need is a DOM to work with: either a virtual DOM or, more simply, one built with DOMParser.

javascript
function cleanup (source, _options) {
    const doc = new DOMParser().parseFromString(source.trim(), 'text/html');

    const output = Object.assign({}, DEFAULT_OUTPUT);
    const options = Object.assign({}, DEFAULT_OPTIONS, _options);

    // TODO: cleanup the DOM

    output.html = renderDoctype(doc) + doc.documentElement.outerHTML;

    return output;
}

There are already a few things missing in the above code snippet. We need to define our DEFAULT_OPTIONS, that is, the settings we expect callers to be able to modify.

javascript
const DEFAULT_OPTIONS = {
    hideImages: false,
    stripJs: true,
    target: '_blank'
};
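
As a quick illustration of how the merge in cleanup behaves, the sketch below applies one override on top of the defaults. The helper name resolveOptions is hypothetical; it only wraps the same Object.assign call so the result can be inspected on its own.

```javascript
const DEFAULT_OPTIONS = {
    hideImages: false,
    stripJs: true,
    target: '_blank'
};

// Hypothetical helper: mirrors the Object.assign call inside cleanup().
function resolveOptions (overrides) {
    // Copy into a fresh object so DEFAULT_OPTIONS itself is never modified.
    return Object.assign({}, DEFAULT_OPTIONS, overrides);
}

const options = resolveOptions({ hideImages: true });
console.log(options); // { hideImages: true, stripJs: true, target: '_blank' }
console.log(DEFAULT_OPTIONS.hideImages); // false
```

Any option the caller leaves out falls back to its default, and calling the helper with no argument simply returns a copy of the defaults.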

Next we need to define the output we expect to see when nothing of interest is found.

javascript
const DEFAULT_OUTPUT = {
    hasJs: false,
    hasImages: false,
    hasInlineAttachments: false,
    hasLinks: false
};

You may handle this however you like. My implementation's output is meant to detail what was found in the source HTML rather than what is contained in the resulting HTML. For example, if I elect to hide images, it will still report hasImages: true.

The outerHTML property doesn't include the doctype, which may have been included with the source HTML, so I'll try to recover it. Unfortunately there's no really easy way to do this, but a common implementation looks like the following.

javascript
function renderDoctype (doc) {
    const node = doc.doctype;
    if (!node) return '';
    return '<!DOCTYPE ' +
        node.name +
        (node.publicId ? ' PUBLIC "' + node.publicId + '"' : '') +
        (!node.publicId && node.systemId ? ' SYSTEM' : '') +
        (node.systemId ? ' "' + node.systemId + '"' : '') +
        '>';
}
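
To see what this produces without a browser, the function can be fed plain objects carrying the three properties it reads. This is only a sketch: the objects below are stand-ins rather than real DocumentType nodes, and the function is restated with template literals so the snippet runs on its own.

```javascript
// Same logic as renderDoctype, restated so this snippet is self-contained.
function renderDoctype (doc) {
    const node = doc.doctype;
    if (!node) return '';
    const publicPart = node.publicId ? ` PUBLIC "${node.publicId}"` : '';
    const systemKeyword = !node.publicId && node.systemId ? ' SYSTEM' : '';
    const systemPart = node.systemId ? ` "${node.systemId}"` : '';
    return `<!DOCTYPE ${node.name}${publicPart}${systemKeyword}${systemPart}>`;
}

// Stand-in documents: only the properties renderDoctype reads are present.
const html5 = { doctype: { name: 'html', publicId: '', systemId: '' } };
const html4 = { doctype: {
    name: 'html',
    publicId: '-//W3C//DTD HTML 4.01//EN',
    systemId: 'http://www.w3.org/TR/html4/strict.dtd'
} };
const none = { doctype: null };

console.log(renderDoctype(html5)); // <!DOCTYPE html>
console.log(renderDoctype(html4)); // <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
console.log(JSON.stringify(renderDoctype(none))); // ""
```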

Now that we have our snippet in place, we need to implement DOM traversal. This will be a recursive function which visits every child of every node. Recursive functions are fairly simple to implement once you know how they work. In place of the first TODO in cleanup:

javascript
function traverse (node) {
    if (node.nodeType === 1) {
        // TODO: this is an element node
    }

    Array.from(node.children).forEach(traverse);
}
traverse(doc);

We are only really interested in modifying element nodes; however, it would be relatively trivial to change nodes of other types if you would like, for example to modify text on the page. In that case you would want to traverse childNodes rather than children.

Because we want to iterate easily over the child nodes, we convert the collection to an array, which gives us forEach. The array is also a snapshot, so removing a node during the walk does not disturb the iteration. We process each node and then all of its children in turn throughout the entire DOM.

I'll provide examples of some of the changes I elected to make in my implementation below; that way anyone can point out how to exploit its flaws. I'm only kidding of course, but please do. Each of these snippets goes inside the element-node branch of traverse.
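
To make the difference between element and non-element nodes concrete, here is a small sketch of the same recursive walk over childNodes. It runs against a plain-object stand-in for a DOM tree, and the names walk, tree and seen are illustrative only.

```javascript
// Node type constants as defined by the DOM (Node.ELEMENT_NODE, Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Visits a node, then recurses into each of its child nodes.
function walk (node, visit) {
    visit(node);
    // Copy the list first; in a real DOM childNodes is live, and removing a
    // child while iterating over it directly would skip siblings.
    Array.from(node.childNodes || []).forEach((child) => walk(child, visit));
}

// Plain-object stand-in for <p>Hello <b>world</b></p>.
const tree = {
    nodeType: ELEMENT_NODE, nodeName: 'P',
    childNodes: [
        { nodeType: TEXT_NODE, nodeValue: 'Hello ', childNodes: [] },
        { nodeType: ELEMENT_NODE, nodeName: 'B', childNodes: [
            { nodeType: TEXT_NODE, nodeValue: 'world', childNodes: [] }
        ] }
    ]
};

const seen = { elements: [], text: [] };
walk(tree, (node) => {
    if (node.nodeType === ELEMENT_NODE) seen.elements.push(node.nodeName);
    if (node.nodeType === TEXT_NODE) seen.text.push(node.nodeValue);
});
console.log(seen.elements); // [ 'P', 'B' ]
console.log(seen.text);     // [ 'Hello ', 'world' ]
```

Walking children instead would only ever reach the two element nodes, which is all the sanitiser needs.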

javascript
if (node.nodeName === 'SCRIPT') {
    output.hasJs = true;
    if (options.stripJs) {
        node.parentNode.removeChild(node);
        return;
    }
}

First I want to get rid of any unwanted nodes; that way I can return early and move on to the next node. For HTML elements, the nodeName property is always uppercase.

javascript
if (node.nodeName === 'FORM' || node.nodeName === 'A') {
    output.hasLinks = true;
    if (options.target)
        node.setAttribute('target', options.target);
}

This will ensure that any links I click on, or inline forms that are submitted, open in a new tab; I don't want to navigate away from my email viewer. Next I want to deal with images. This is a little more complex, and some of what I do in this section should be considered completely optional.

javascript
if (node.nodeName === 'IMG') {
    const src = node.getAttribute('src');
    if (/^\s*cid/i.test(src)) {
        output.hasInlineAttachments = true;
        const attachment = attachmentForCid(options.attachments, src);
        node.setAttribute('src', (attachment ? attachment.path : ''));
    } else {
        output.hasImages = true;
        if (options.hideImages)
            node.setAttribute('src', '');
    }
}

Very often email clients implement inline attachments using CIDs (Content-IDs), where a cid: URI is interpreted to mean "there's an attachment for this." In my implementation I use that cid: URI and a set of attachments which may have been provided in the options object. What attachmentForCid does is outside the scope of this article, but essentially its purpose is to figure out which attachment the URI relates to and to use that attachment's path on the server.

Alternatively, if it is a normal external image, which is most common, I simply hide it if the option is set. Next I tackle background images, which is a little more complex but much the same.
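
Since attachmentForCid is not shown here, the following is only a guess at what such a helper might look like. It assumes attachments is an array of objects with a cid property (the Content-ID header value, possibly wrapped in angle brackets) alongside the path used above; the cid property name and the matching rules are assumptions, not the original implementation.

```javascript
// Hypothetical sketch: the { cid, path } attachment shape is an assumption.
function attachmentForCid (attachments, src) {
    if (!Array.isArray(attachments) || typeof src !== 'string') return null;
    // Drop the "cid:" scheme and any surrounding whitespace.
    const match = /^\s*cid:(.*?)\s*$/i.exec(src);
    if (!match) return null;
    let id = match[1];
    // cid: URIs may be percent-encoded; fall back to the raw value if malformed.
    try { id = decodeURIComponent(id); } catch (e) { /* keep the raw value */ }
    // Content-ID headers are written as "<id>", so compare without the brackets.
    const normalise = (value) => String(value).trim().replace(/^<|>$/g, '');
    return attachments.find((a) => normalise(a.cid) === id) || null;
}

const attachments = [
    { cid: '<logo@example.com>', path: '/attachments/1/logo.png' }
];
console.log(attachmentForCid(attachments, 'cid:logo@example.com').path); // /attachments/1/logo.png
console.log(attachmentForCid(attachments, 'cid:missing@example.com'));   // null
```

Returning null for an unknown identifier lets the caller fall back to an empty src, as the image handling above does.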

javascript
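function getBackground (node) {
    // Prefer the inline style: getComputedStyle() may return empty values for
    // elements in a DOMParser document, because that document is never rendered.
    const result = (node.style && node.style.backgroundImage) ||
        window.getComputedStyle(node, null).getPropertyValue('background-image');
    // Only a url(...) value names an image; 'none' and gradients are ignored.
    if (result && /^\s*url\(/i.test(result)) {
        return result.trim().slice(4, -1).replace(/['"]/g, '');
    }
}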

const background = getBackground(node);
if (background) {
    if (/^\s*cid/i.test(background)) {
        output.hasInlineAttachments = true;
        const attachment = attachmentForCid(options.attachments, background);
        node.style.backgroundImage = (attachment ? `url('${attachment.path}')` : 'none');
    } else {
        output.hasImages = true;
        if (options.hideImages)
            node.style.backgroundImage = 'none';
    }
}

Here we look up the element's background-image value. If there is one, we strip the url() wrapper and any single or double quotes to get at the URL. Then we treat it much the same way we would treat any IMG tag.

That should take care of our image problems. Next let's look at stripping out JavaScript. We definitely, usually, definitely don't want any strange JavaScript code running on our pages.
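
A background-image value is not always a single url(...): it can also be none, a gradient, or several comma-separated layers. As a sketch, a more defensive parser could collect every URL found in the value. The name extractCssUrls is hypothetical, and the regular expression is a simplification that does not handle CSS escape sequences.

```javascript
// Hypothetical helper: returns every url(...) target found in a CSS value.
function extractCssUrls (value) {
    const urls = [];
    // Optional opening quote, the URL itself, then the same quote again.
    const pattern = /url\(\s*(['"]?)(.*?)\1\s*\)/gi;
    for (const match of String(value || '').matchAll(pattern)) {
        urls.push(match[2]);
    }
    return urls;
}

console.log(extractCssUrls('url("cid:logo@example.com")'));
// [ 'cid:logo@example.com' ]
console.log(extractCssUrls("url(a.png), url('b.png')"));
// [ 'a.png', 'b.png' ]
console.log(extractCssUrls('none'));
// []
console.log(extractCssUrls('linear-gradient(red, blue)'));
// []
```

Each returned URL could then go through the same cid: check as an IMG src.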

javascript
for (const name of node.getAttributeNames()) {
    if (/^on/i.test(name)) {
        output.hasJs = true;
        if (options.stripJs)
            node.removeAttribute(name);
    }
}

We iterate over every attribute of the element looking for JavaScript hooks, that is, event handler attributes such as onclick or onload, and if we find any we remove them.
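
The same loop can be tried outside the browser with a minimal stand-in for an element. This is only a sketch: stripEventHandlers and fakeElement are hypothetical names, and the stand-in implements just the two standard Element methods the loop relies on, getAttributeNames and removeAttribute.

```javascript
// Removes (or merely reports) inline event handlers such as onclick or onload.
function stripEventHandlers (element, output, options) {
    // getAttributeNames() returns a plain array, so removing while looping is safe.
    for (const name of element.getAttributeNames()) {
        if (/^on/i.test(name)) {
            output.hasJs = true;
            if (options.stripJs) element.removeAttribute(name);
        }
    }
}

// Minimal stand-in exposing the two Element methods used above.
function fakeElement (attributes) {
    const attrs = new Map(Object.entries(attributes));
    return {
        getAttributeNames: () => Array.from(attrs.keys()),
        removeAttribute: (name) => { attrs.delete(name); }
    };
}

const element = fakeElement({ src: 'a.png', onerror: 'steal()', onload: 'track()' });
const output = { hasJs: false };
stripEventHandlers(element, output, { stripJs: true });
console.log(output.hasJs);                // true
console.log(element.getAttributeNames()); // [ 'src' ]
```

With stripJs set to false the attributes stay in place, but hasJs is still reported as true.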

javascript
const href = node.getAttribute('href');
if (href && /^\s*javascript/i.test(href)) {
    output.hasJs = true;
    if (options.stripJs)
        node.removeAttribute('href');
}

We also make sure to check that links aren't decorated with JavaScript code. Otherwise it would run when a link is clicked, which is not what we want.

And that should be it. Your output HTML will be cleaner than what went in, and it will be a full document including html, head and body tags. At this stage you can do anything else you like with the DOM after traversal, including inserting some default styles at the top, or even your own JavaScript.
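
One caveat: when browsers parse a URL they skip leading control characters and ignore tabs and newlines anywhere inside it, so a value like "java", a tab, then "script:" can get past a test that only allows leading whitespace. A slightly stricter check, as a sketch with the hypothetical name isJavascriptUrl, might normalise the value first. The same check could also be applied to other URL-bearing attributes, such as a form's action.

```javascript
// Hypothetical helper: detects javascript: URLs after browser-style clean-up.
function isJavascriptUrl (value) {
    if (typeof value !== 'string') return false;
    const cleaned = value
        .replace(/[\t\n\r]/g, '')           // tabs and newlines are ignored anywhere
        .replace(/^[\u0000-\u0020]+/, '');  // leading control characters and spaces
    return /^javascript:/i.test(cleaned);
}

console.log(isJavascriptUrl('javascript:alert(1)'));            // true
console.log(isJavascriptUrl('  JaVaScRiPt:alert(1)'));          // true
console.log(isJavascriptUrl('java\tscript:alert(1)'));          // true
console.log(isJavascriptUrl('https://example.com/javascript')); // false
console.log(isJavascriptUrl(null));                             // false
```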

javascript
const DEFAULT_STYLE = document.createElement('style');
DEFAULT_STYLE.textContent = `body {
    margin: 15px 25px;
    font-family: sans-serif;
    color: #08090A;
    font-size: 1rem;
    font-weight: 400;
    line-height: 1.5;
}
blockquote {
    margin: 0;
    padding: 0;
    padding-left: 13px;
    border-left: 2px solid #ccc;
}`;

This, for example, adds some simple styles to improve the layout of our email before showing it. The following also overrides some styles that otherwise commonly cause difficulty.

javascript
doc.documentElement.style.overflowY = 'hidden';
doc.documentElement.style.height = 'auto';
doc.documentElement.style.maxHeight = 'none';
doc.body.style.height = 'auto';
doc.body.style.maxHeight = 'none';

const style = DEFAULT_STYLE.cloneNode(true);
doc.head.insertBefore(style, doc.head.firstChild);

Now we're ready to insert the HTML into an iframe on our page, which is outside the scope of this article. I'm sure I'll cover dealing with an iframe in this manner in a later article.