
Don't Parse Markdown at Runtime

I’m really pleased to see the popularity of Markdown growing over the last few years. Helped, no doubt, by its adoption by major forces in the developer world like GitHub and Stack Overflow; when developers like to use something, they put it in their own projects, and so it grows. I’ve always personally preferred Textile over Markdown, but either way I’m of the opinion that a neutral, simple text-based language that can be easily transformed into any number of other formats is the most responsible way to author and store content.

We have both Textile and Markdown available in Perch in preference to HTML-based WYSIWYG editors, and it’s really positive to see other content management systems taking the same approach.

From a developer point of view, using either of these languages is pretty straightforward. The user inputs the content in, say, Markdown, and you then store that directly in Markdown format to facilitate later editing. Obviously you can’t just output Markdown to the browser, so at some point that needs to be converted into HTML. The question that is sometimes debated is when this should happen.

If you’ve ever looked at the source code for a parser of this nature, it should be clear that transcoding from text to HTML is a fair amount of work. The PHP version of Markdown is about 1500 lines of mostly regular expressions and string manipulation. What other single component of your application is comparable?

I’m always of the opinion that if the outcome of a task is known then it shouldn’t be performed more than once. For a given Markdown input, we know the output will always be the same, so in my own applications I transform the text to HTML once and store it in the database alongside the original. That just seems like the smart thing to do. However, I see lots of CMSs these days (especially those purporting to be ‘lightweight’) that parse Markdown at runtime and don’t appear to suffer from it.
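
As a rough sketch of that approach (assuming a hypothetical save routine and a content table with markdown and html columns), the edit-time transform might look something like this:

<?php
    require('markdown.php');

    // Hypothetical edit-time save: transform once, store both formats.
    function save_content(PDO $dbh, $id, $markdown) {
        $parser = new Markdown_Parser;
        $html   = $parser->transform($markdown);

        // Keep the Markdown for future editing and the HTML for output.
        $stmt = $dbh->prepare('UPDATE content SET markdown=?, html=? WHERE id=?');
        $stmt->execute(array($markdown, $html, $id));
    }
?>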

But which is better, parsing Markdown at runtime, or parsing at edit time and retrieving? There’s only one way to find out…

FIGHT!

Ok, perhaps not a fight, but I thought it would be interesting to run some highly unscientific, finger-in-the-air benchmarks to get an idea of whether parsing Markdown really does impact page performance compared to fetching HTML from a file or database. Is it really that slow?

Using the PHP version of Markdown, I took the jQuery GitHub README.md file as an example document. I figured it wasn’t too long or short, contained a few different features of the language, and was pretty much a typical example.

My methodology was simply to write a PHP script to perform the task being tested, and then hit it with apachebench a few times to get the number of requests per second. Being thoroughly unscientific, I expected my results to be useful only for comparison: the conditions weren’t perfect, but they were consistent across tests.

In the most basic terms, measuring requests per second tells you how many visitors your site can support at once. The faster the code, the higher the number, the better.

Test 1: Runtime parsing

Below is the script I used. Pretty much no-nonsense, reading in the source Markdown file, instantiating the parser and parsing the text.

<?php
    // Read the source Markdown and transform it to HTML on every request.
    require('markdown.php');
    $text = file_get_contents('jquery.md');
    $Markdown_Parser = new Markdown_Parser;
    $html = $Markdown_Parser->transform($text);
    unset($Markdown_Parser);
?>

I blasted this with apachebench for 10,000 requests with a concurrency of 100.
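
For reference, the apachebench invocation was along these lines (the URL is just a placeholder for wherever the test script happened to live):

ab -n 10000 -c 100 http://localhost/markdown-test/test1.php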

Result: around 155 requests per second.

Test 2: Retrieving HTML from a database

I created a very simple database with one table containing one row. I pasted in the HTML result of the parsed Markdown (created using the same method as above). I then took some boilerplate PHP PDO database connection code from the PHP manual.

<?php
    // Connect and fetch the pre-transformed HTML for the one stored row.
    $dbh = new PDO('mysql:host=localhost;dbname=markdown-test', 'username', 'password');
    foreach ($dbh->query('SELECT html FROM content WHERE id=1') as $row) {
        $text = $row['html'];
    }
    $dbh = null;
?>
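
For completeness, the table behind that query was nothing more elaborate than something like this (the exact column types are my guess):

CREATE TABLE content (
    id   INT NOT NULL PRIMARY KEY,
    html TEXT NOT NULL
);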

I restarted the server, and then hit this script with the same ab settings.

Result: around 3,575 requests per second.

Test 3: Retrieving HTML from a file

For comparison, I thought it would be interesting to look at a file-based approach. For this test, I parsed the Markdown on the first request, and then reused the result for subsequent runs. A very basic form of runtime parsing and caching, if you will.

<?php
    if (file_exists('jquery.html')) {
        // Cache hit: serve the previously parsed HTML straight from disk.
        $html = file_get_contents('jquery.html');
    } else {
        // Cache miss: parse the Markdown once and write the HTML out for next time.
        require('markdown.php');
        $text = file_get_contents('jquery.md');
        $Markdown_Parser = new Markdown_Parser;
        $html = $Markdown_Parser->transform($text);
        file_put_contents('jquery.html', $html);
        unset($Markdown_Parser);
    }
?>

In theory, this should be very fast, as it’s basically just stat-ing a file and then fetching it. I hit it with the same settings again.

Result: around 12,425 requests per second.

Conclusion

It would be improper to draw a formal conclusion from such rough tests, but I think we can get an idea of the overall work involved with each method, and the numbers tally with common sense.

Parsing Markdown is slow. It can be around 25 times slower than fetching pre-transformed HTML from the database. Considering you’re likely already fetching your Markdown text from the database, you’re effectively doing the work of Test 2 and then Test 1 on top.
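
Put concretely, the runtime-parsing path looks roughly like the two scripts above combined, with the raw Markdown stored in a hypothetical markdown column:

<?php
    require('markdown.php');

    // The work of Test 2: fetch the stored Markdown from the database...
    $dbh = new PDO('mysql:host=localhost;dbname=markdown-test', 'username', 'password');
    foreach ($dbh->query('SELECT markdown FROM content WHERE id=1') as $row) {
        $text = $row['markdown'];
    }
    $dbh = null;

    // ...plus the work of Test 1: parse it to HTML on every single request.
    $Markdown_Parser = new Markdown_Parser;
    $html = $Markdown_Parser->transform($text);
    unset($Markdown_Parser);
?>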

It would be interesting to compare the third test with caching the output to something like Redis. Depending on your traffic profile, that could be quite an effective approach if you really didn’t want to store the HTML permanently, although I’m not sure why that would be an issue. It would also be interesting to compare these rough results with some properly conducted ones, if anyone’s set up to do those and has the time.
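
A sketch of that Redis approach, assuming the phpredis extension and a made-up cache key, might look something like this:

<?php
    require('markdown.php');

    // Assumes the phpredis extension; the key name is made up.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $html = $redis->get('content:1:html');
    if ($html === false) {
        // Cache miss: parse the Markdown once and cache the resulting HTML.
        $text = file_get_contents('jquery.md');
        $Markdown_Parser = new Markdown_Parser;
        $html = $Markdown_Parser->transform($text);
        $redis->set('content:1:html', $html);
        unset($Markdown_Parser);
    }
?>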

All applications and situations are different, and therefore everyone has their own considerations and allowances to make. Operating at different scales, on different platforms, can affect your choices. Perhaps you have CPU in abundance, but are bottlenecking on IO.

However, for the typical scenario of a basic content-managed website on any given web hosting, parsing Markdown at runtime can vastly reduce the number of visitors your site can support at once. It could make the difference between surviving a Fireballing and not. For my own work, I will continue to parse at edit time and store HTML in the database.