My newline-to-break extension now shipping with Python-Markdown
Here is a quick update on a previous post I made about a newline-to-break extension for Python-Markdown. I'm very happy to report that the extension will now be shipping with Python-Markdown! Thanks to developer Waylan Limberg for including it!
Django Uploads and UnicodeEncodeError
Something strange happened that I wish to document in case it helps others. I had to reboot my Ubuntu server while troubleshooting a disk problem. After the reboot, I began receiving internal server errors whenever someone tried to view a certain forum thread on my Django powered website. After some detective work, I determined it was because a user that had posted in the thread had an avatar image whose filename contained non-ASCII characters. The image file had been there for months, and I still cannot explain why it just suddenly started happening.
The traceback I was getting ended with something like this:
  File "/django/core/files/storage.py", line 159, in _open
    return File(open(self.path(name), mode))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 72-79: ordinal not in range(128)
So it appeared that the open() call was triggering the error. This led me on a twisty Google search which had many dead ends. Eventually I found a suitable explanation. Apparently, Linux filesystems don't enforce a particular Unicode encoding for filenames. Linux applications must decide how to interpret filenames all on their own. The Python OS library (on Linux) uses environment variables to determine what locale you are in, and this chooses the encoding for filenames. If these environment variables are not set, Python falls back to ASCII (by default), and hence the source of my UnicodeEncodeError.
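To see what an interpreter has actually picked up from its environment, a quick check along these lines can help. This is only a sketch in Python 3 syntax, the filename is made up, and the two helper functions are my own names for illustration:

```python
import locale
import sys


def describe_filename_encoding():
    """Report the encodings this interpreter derived from its environment."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }


def can_encode(filename, encoding):
    """Return True if filename can be represented in the given encoding."""
    try:
        filename.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(describe_filename_encoding())
    name = "avatar_\u00e9t\u00e9.png"  # hypothetical non-ASCII filename
    print(can_encode(name, "ascii"))   # False: same failure as in the traceback
    print(can_encode(name, "utf-8"))   # True
```

Running something like this inside the web server process, rather than in a login shell, is what matters, since the two can have very different environments.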
So how do you tell a Python instance that is running under Apache / mod_wsgi about these environment variables? It turns out the answer is in the Django documentation, albeit in the mod_python integration section.
So, to fix the issue, I added the following lines to my /etc/apache2/envvars file:
export LANG='en_US.UTF-8'
export LC_ALL='en_US.UTF-8'
Note that you must fully stop and restart Apache for these changes to take effect. I got tripped up at first because I ran apache2ctl graceful, and that was not sufficient to create a new environment.
I contributed to Fructose
At work we started using CxxTest as our unit testing framework. We like it because it is very light-weight and easy to use. We've gotten a tremendous amount of benefit from using a unit testing framework, much more than I had ever imagined. We now have almost 700 tests, and I cannot imagine going back to the days of no unit tests or ad-hoc testing. It is incredibly reassuring to see all the tests pass after making a significant change to the code base. There is no doubt in my mind that our software-hardware integration phases have gone much smoother thanks to our unit tests.
Sadly it seems CxxTest is no longer actively supported. However this is not of great concern to us. The code is so small we are fairly confident we could tweak it if necessary.
I recently discovered Fructose, a unit testing framework written by Andrew Marlow. It has similar goals of being small and simple to use. One thing I noticed that CxxTest had and Fructose did not was a Python code generator that took care of creating the main() function and registering all the tests with the framework. Since C++ has very few introspection capabilities, C++ unit testing frameworks have historically laid the burden of registering tests on the programmer. Some use macros to help with this chore, but littering your code with ugly macros makes tests annoying to write. And if anything, you want your tests to be easy to write so your colleagues will write lots of tests. CxxTest approached this problem by providing first a Perl script, then later a Python script, to automate this part of the process.
I decided it would be interesting to see if I could provide such a script for Fructose. After a Saturday of hacking, I'm happy to say Andrew has accepted the script and it now ships with Fructose version 1.1.0. I hope to improve the script to not only run all the tests but to also print out a summary of the number of tests that passed and failed at the end, much like CxxTest does. This will require some changes to the C++ code. Also on my wish list is to make the script extensible, so that others can easily change the output and code generation to suit their needs.
I've hosted the code for the Python script, which I call fructose_gen.py, on Bitbucket. Feedback is greatly appreciated.
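For readers curious about the general idea behind such a generator, here is a rough sketch. It is not fructose_gen.py itself. It assumes a made-up convention (test methods are void, take no arguments, and start with test_), and the output template is a placeholder rather than Fructose's actual registration call:

```python
import re

# Assumed convention: test methods are void, take no arguments,
# and have names starting with "test_".
TEST_METHOD_RE = re.compile(r"\bvoid\s+(test_\w+)\s*\(\s*\)")
CLASS_RE = re.compile(r"\b(?:class|struct)\s+(\w+)")


def find_tests(source):
    """Return (class_name, [method names]) for the first class in source."""
    match = CLASS_RE.search(source)
    class_name = match.group(1) if match else None
    return class_name, TEST_METHOD_RE.findall(source)


def render_registrations(class_name, methods, template):
    """Fill in one line of generated code per discovered test method."""
    return "\n".join(
        template.format(cls=class_name, method=name) for name in methods
    )


if __name__ == "__main__":
    header = """
    struct sample_test {
        void test_add() {}
        void helper(int x) {}
        void test_sub() {}
    };
    """
    cls, methods = find_tests(header)
    # Placeholder line format; substitute whatever your framework expects.
    print(render_registrations(cls, methods, "REGISTER({cls}, {method});"))
```

A regular expression is a blunt instrument for reading C++, but for test files that follow a simple convention it is often enough.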
A newline-to-break Python-Markdown extension
When I launched a new version of my website, I decided the new forums would use Markdown instead of BBCode for the markup. This decision was mainly a personal one for aesthetic reasons. I felt that Markdown was more natural to write compared to the clunky square brackets of BBCode.
My new site is coded in Python using the Django framework. For a Markdown implementation I chose Python-Markdown.
My mainly non-technical users seemed largely indifferent to the change from BBCode to Markdown. This was probably because I gave them a nice JavaScript editor (MarkItUp!) which inserted the correct markup for them.
However, shortly after launch, one particular feature of Markdown really riled up some users: the default line break behavior. In strict Markdown, to create a new paragraph, you must insert a blank line between paragraphs. Hard returns (newlines) are simply ignored, just like they are in HTML. You can, however, force a break by ending a line with two blank spaces. This isn't very intuitive, unlike the rest of Markdown.
Now I agree the default behavior is useful if you are creating an online document, like a blog post. However, non-technical users really didn't understand this behavior at all in the context of a forum post. For example, many of my users post radio-show playlists, formatted with one song per line. When such a playlist was pasted into a forum post, Markdown made it all one giant run-together paragraph. This did not please my users. Arguably, they should have used a Markdown list. But it became clear teaching people the new syntax wasn't going to work, especially when it used to work just fine in BBCode and they had created their playlists in the same way for several years.
It turns out I am not alone in my observations (or on the receiving end of user wrath). Other, much larger sites, like StackOverflow and GitHub, have altered their Markdown parsers to treat newlines as hard breaks. How can this be done with Python-Markdown?
It turns out this is really easy. Python-Markdown was designed with user customization in mind by offering an extension facility. The extension documentation is good, and you can find extension writing help on the friendly mailing list.
Here is a simple extension for Python-Markdown that turns newlines into HTML <br /> tags.
"""
A python-markdown extension to treat newlines as hard breaks; like
StackOverflow and GitHub flavored Markdown do.
"""
import markdown

BR_RE = r'\n'

class Nl2BrExtension(markdown.Extension):
    def extendMarkdown(self, md, md_globals):
        br_tag = markdown.inlinepatterns.SubstituteTagPattern(BR_RE, 'br')
        md.inlinePatterns.add('nl', br_tag, '_end')

def makeExtension(configs=None):
    return Nl2BrExtension(configs)
I saved this code in a file called mdx_nl2br.py and put it on my PYTHONPATH. You can then use it in a Django template like this:
{{ value|markdown:"nl2br" }}
To use the extension in Python code, something like this should do the trick:
import markdown
md = markdown.Markdown(safe_mode=True, extensions=['nl2br'])
converted_text = md.convert(text)
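To make the intended output concrete, here is a small stand-alone sketch that uses only the standard library. It is not the extension above and it understands no other Markdown syntax. It simply shows the rule my users expected: blank lines separate paragraphs and single newlines become breaks. The function name is my own invention for this illustration:

```python
import html
import re


def nl2br_paragraphs(text):
    """Blank lines separate paragraphs; single newlines become <br /> tags."""
    rendered = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        # Escape each line so user input cannot inject markup.
        lines = [html.escape(line.strip()) for line in paragraph.split("\n")]
        rendered.append("<p>" + "<br />\n".join(lines) + "</p>")
    return "\n".join(rendered)


if __name__ == "__main__":
    playlist = "Song A\nSong B\n\nThanks for listening!"
    print(nl2br_paragraphs(playlist))
```

With the playlist above, the two songs end up in one paragraph separated by a break, and the closing remark gets a paragraph of its own.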
Update (June 21, 2011): This extension is now being distributed with Python-Markdown! See issue 13 on GitHub for the details. Thanks to Waylan Limberg for the help in creating the extension and for including it with Python-Markdown.
A better "Who's Online" with Redis & Python
Updated on December 17, 2011: I found a better solution. Head on over to the new post to check it out.
Who's What?
My website, like many others, has a "who's online" feature. It displays the names of authenticated users that have been seen over the course of the last ten minutes or so. It may seem a minor feature at first, but I find it really does a lot to "humanize" the site and make it seem more like a community gathering place.
My first implementation of this feature used the MySQL database to update a per-user timestamp whenever a request from an authenticated user arrived. Actually, this seemed excessive to me, so I used a strategy involving an "online" cookie that has a five minute expiration time. Whenever I see an authenticated user without the online cookie I update their timestamp and then hand them back a cookie that will expire in five minutes. In this way I don't have to hit the database on every single request.
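The cookie trick boils down to a few lines. Here is a framework-agnostic sketch. The cookie name, the function name, and the dict standing in for the database table are all assumptions for illustration:

```python
import time

ONLINE_COOKIE = "online"   # hypothetical cookie name
COOKIE_MAX_AGE = 5 * 60    # five minutes, in seconds


def note_request(username, cookies, last_seen, now=None):
    """
    Record that username was seen, unless the "online" cookie is still present.

    cookies is the dict of cookies sent with the request; last_seen is any
    dict-like store of timestamps. Returns the max-age for a fresh cookie,
    or None if the existing cookie has not expired yet.
    """
    if ONLINE_COOKIE in cookies:
        return None  # seen recently; skip the write entirely
    last_seen[username] = time.time() if now is None else now
    return COOKIE_MAX_AGE
```

The caller sets the cookie on the response whenever a max-age comes back, so each user triggers at most one write every five minutes.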
This approach worked fine but it has some aspects that didn't sit right with me:
- It seems like overkill to use the database to store temporary, trivial information like this. It doesn't feel like a good use of a full-featured relational database management system (RDBMS).
- I am writing to the database during a GET request. Ideally, all GET requests should be idempotent. Of course if this is strictly followed, it would be impossible to create a "who's online" feature in the first place. You'd have to require the user to POST data periodically. However, writing to an RDBMS during a GET request is something I feel guilty about and try to avoid when I can.
Redis
Enter Redis. I discovered Redis recently, and it is pure, white-hot awesomeness. What is Redis? It's one of those projects that gets slapped with the "NoSQL" label. And while I'm still trying to figure that buzzword out, Redis makes sense to me when described as a lightweight data structure server. Memcached can store key-value pairs very fast, where the value is always a string. Redis goes one step further and stores not only strings, but data structures like lists, sets, and hashes. For a great overview of what Redis is and what you can do with it, check out Simon Willison's Redis tutorial.
Another reason why I like Redis is that it is easy to install and deploy. It is straight C code without any dependencies. Thus you can build it from source just about anywhere. Your Linux distro may have a package for it, but it is just as easy to grab the latest tarball and build it yourself.
I've really come to appreciate Redis for being such a small and lightweight tool. At the same time, it is very powerful and effective for filling those tasks that a traditional RDBMS is not good at.
For working with Redis in Python, you'll need to grab Andy McCurdy's redis-py client library. It can be installed with a simple
$ sudo pip install redis
Who's Online with Redis
Now that we are going to use Redis, how do we implement a "who's online" feature? The first step is to get familiar with the Redis API.
One approach to the "who's online" problem is to add a user name to a set whenever we see a request from that user. That's fine but how do we know when they have stopped browsing the site? We have to periodically clean out the set in order to time people out. A cron job, for example, could delete the set every five minutes.
A small problem with deleting the set is that people will abruptly disappear from the site every five minutes. In order to give more gradual behavior we could utilize two sets, a "current" set and an "old" set. As users are seen, we add their names to the current set. Every five minutes or so (season to taste), we simply overwrite the old set with the contents of the current set, then clear out the current set. At any given time, the set of who's online is the union of these two sets.
This approach doesn't give exact results of course, but it is perfectly fine for my site.
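Before touching Redis at all, the two-set idea can be tried out with plain Python sets. This is just an in-memory model for experimenting, with a class name I made up for the purpose:

```python
class WhoIsOnline:
    """In-memory model of the two-set scheme; Python sets stand in for Redis."""

    def __init__(self):
        self.current = set()
        self.old = set()

    def report_user(self, username):
        """Call whenever a request from username is seen."""
        self.current.add(username)

    def tick(self):
        """Age out: the current set becomes the old set, then starts empty."""
        self.old = self.current
        self.current = set()

    def get_users_online(self):
        """Everyone seen in the current or the previous interval."""
        return self.current | self.old


if __name__ == "__main__":
    online = WhoIsOnline()
    online.report_user("alice")
    online.tick()
    online.report_user("bob")
    print(sorted(online.get_users_online()))  # ['alice', 'bob']
    online.tick()
    print(sorted(online.get_users_online()))  # ['bob']
```

A user who stops browsing therefore lingers for between one and two intervals before dropping off the list, which is the gradual behavior I was after.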
Looking over the Redis API, we see that we'll be making use of the following commands:
- SADD for adding members to the current set.
- RENAME for copying the current set to the old, as well as destroying the current set all in one step.
- SUNION for performing a union on the current and old sets to produce the set of who's online.
And that's it! With these three primitives we have everything we need. This is because of the following useful Redis behaviors:
- Performing a SADD against a set that doesn't exist creates the set and is not an error.
- Performing a SUNION with sets that don't exist is fine; they are simply treated as empty sets.
The one caveat involves the RENAME command. If the key you wish to rename does not exist, the Python Redis client treats this as an error and an exception is thrown.
Experimenting with algorithms and ideas is quite easy with Redis. You can either use the Python Redis client in a Python interactive interpreter shell, or you can use the command-line client that comes with Redis. Either way you can quickly try out commands and refine your approach.
Implementation
My website is powered by Django, but I am not going to show any Django specific code here. Instead I'll show just the pure Python parts, and hopefully you can adapt it to whatever framework, if any, you are using.
I created a Python module to hold this functionality: whos_online.py. Throughout this module I use a lot of exception handling, mainly because if the Redis server has crashed (or if I forgot to start it, say in development) I don't want my website to be unusable. If Redis is unavailable, I simply log an error and drive on. Note that in my limited experience Redis is very stable and has not crashed on me once, but it is good to be defensive.
The first important function used throughout this module is a function to obtain a connection to the Redis server:
import logging
import redis
logger = logging.getLogger(__name__)
def _get_connection():
    """
    Create and return a Redis connection. Returns None on failure.
    """
    try:
        conn = redis.Redis(host=HOST, port=PORT, db=DB)
        return conn
    except redis.RedisError as e:
        logger.error(e)
        return None
The HOST, PORT, and DB constants can come from a configuration file or they could be module-level constants. In my case they are set in my Django settings.py file. Once we have this connection object, we are free to use the Redis API exposed via the Python Redis client.
To update the current set whenever we see a user, I call this function:
# Redis key names:
USER_CURRENT_KEY = "wo_user_current"
USER_OLD_KEY = "wo_user_old"
def report_user(username):
    """
    Call this function when a user has been seen. The username will be added to
    the current set.
    """
    conn = _get_connection()
    if conn:
        try:
            conn.sadd(USER_CURRENT_KEY, username)
        except redis.RedisError as e:
            logger.error(e)
If you are using Django, a good spot to call this function is from a piece of custom middleware. I kept my "5 minute cookie" algorithm to avoid doing this on every request although it is probably unnecessary on my low traffic site.
Periodically you need to "age out" the sets by destroying the old set, moving the current set to the old set, and then emptying the current set.
def tick():
    """
    Call this function to "age out" the old set by renaming the current set
    to the old.
    """
    conn = _get_connection()
    if conn:
        # An exception may be raised if the current key doesn't exist; if that
        # happens we have to delete the old set because no one is online.
        try:
            conn.rename(USER_CURRENT_KEY, USER_OLD_KEY)
        except redis.ResponseError:
            try:
                del conn[USER_OLD_KEY]
            except redis.RedisError as e:
                logger.error(e)
        except redis.RedisError as e:
            logger.error(e)
As mentioned previously, if no one is on your site, eventually your current set will cease to exist as it is renamed and not populated further. If you attempt to rename a non-existent key, the Python Redis client raises a ResponseError exception. If this occurs we just manually delete the old set. In a bit of Pythonic cleverness, the Python Redis client lets you use the del statement for this operation.
The tick() function can be called periodically by a cron job, for example. If you are using Django, you could create a custom management command that calls tick() and schedule cron to execute it. Alternatively, you could use something like Celery to schedule a job to do the same. (As an aside, Redis can be used as a back-end for Celery, something that I hope to explore in the near future).
Finally, you need a way to obtain the current "who's online" set, which again is a union of the current and old sets.
def get_users_online():
    """
    Returns a set of user names which is the union of the current and old
    sets.
    """
    conn = _get_connection()
    if conn:
        try:
            # Note that keys that do not exist are considered empty sets
            return conn.sunion([USER_CURRENT_KEY, USER_OLD_KEY])
        except redis.RedisError as e:
            logger.error(e)
    return set()
In my Django application, I call this function from a custom inclusion template tag.
Conclusion
I hope this blog post gives you some idea of the usefulness of Redis. I expanded on this example to also keep track of non-authenticated "guest" users. I simply added another pair of sets to track IP addresses.
If you are like me, you are probably already thinking about shifting some functions that you awkwardly jammed onto a traditional database to Redis and other "NoSQL" technologies.