Inspired by a blog post by Django co-BDFL Jacob Kaplan-Moss, I wanted to try using html5lib to sanitize user input. I’m using Markdown on most of the site. But in one particular place (news items), I am (currently) allowing users to submit HTML news stories with the TinyMCE JavaScript editor. This is mainly because my users like to copy and paste content from sites like MySpace, and TinyMCE might be easier for them to use than Markdown. I may revisit this decision, but for now we’ll go with it.
I had been using the lxml sanitizer for this purpose. But given the high praise html5lib received from Jacob, and after studying the source code of both, I have more confidence in html5lib, even though it is an order of magnitude slower. This code isn’t going to run more than a few times a day, so speed isn’t a concern.
I had never used html5lib, or any other HTML/XML parser, so it was a bit confusing to figure out how to use it for this task. After studying the code and the html5lib newsgroup, I came up with the following bit of code, which I thought I would share. Comments are extremely welcome.
import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer


def sanitizer_factory(*args, **kwargs):
    """Build the sanitizing tokenizer that the parser will use."""
    san = sanitizer.HTMLSanitizer(*args, **kwargs)
    # This isn't available yet
    # san.strip_tokens = True
    return san


def clean_html(buf):
    """Cleans HTML of dangerous tags and content."""
    buf = buf.strip()
    if not buf:
        return buf

    # Parse the input as a fragment, sanitizing tokens as they are read.
    # (This relies on the parser accepting a tokenizer argument.)
    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
                            tokenizer=sanitizer_factory)
    dom_tree = p.parseFragment(buf)

    # Walk the sanitized DOM tree and serialize it back to an HTML string.
    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)

    s = serializer.htmlserializer.HTMLSerializer(
        omit_optional_tags=False,
        quote_attr_values=True)
    return s.render(stream)
I haven’t tested it extensively yet, but it seems to do the trick. I understand a future version of html5lib will have an option to strip offending tags out completely. Right now they are simply rendered harmless and remain in the output, with their angle brackets escaped as &lt; and &gt;. This is fine, as I can see them in the admin when I review submitted stories.
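To make the difference between escaping and stripping concrete, here is a toy sketch that uses only the Python standard library. It is not how html5lib does it, and it is nowhere near safe enough for real use. The names ALLOWED_TAGS, EscapingSanitizer and toy_clean are made up for this illustration, and the whitelist is an arbitrary assumption.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist, chosen only for this illustration.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}


class EscapingSanitizer(HTMLParser):
    """Toy sanitizer: keeps whitelisted tags, escapes or strips the rest."""

    def __init__(self, strip=False):
        super().__init__(convert_charrefs=True)
        self.strip = strip  # True: drop bad tags; False: escape them
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Drop every attribute to keep the sketch simple.
            self.out.append("<%s>" % tag)
        elif not self.strip:
            # Keep the offending tag visible but harmless.
            self.out.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)
        elif not self.strip:
            self.out.append(escape("</%s>" % tag))

    def handle_data(self, data):
        # Text content is always escaped on the way out.
        self.out.append(escape(data))


def toy_clean(buf, strip=False):
    parser = EscapingSanitizer(strip=strip)
    parser.feed(buf)
    parser.close()
    return "".join(parser.out)


dirty = "<p>Hi <script>alert(1)</script></p>"
print(toy_clean(dirty))              # <p>Hi &lt;script&gt;alert(1)&lt;/script&gt;</p>
print(toy_clean(dirty, strip=True))  # <p>Hi alert(1)</p>
```

The first call mirrors what I get today: the script tag is still there for me to see in the admin, but a browser will show it as text instead of running it. The second call is roughly what I would expect a strip option to do. Note that stripping only removes the tags themselves, so the text that was inside them still shows up.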