Don't Parse Markdown at Runtime

3 January 2013

I’m really pleased to see the popularity of Markdown growing over the last few years. Helped, no doubt, by its adoption by major forces in the developer world like GitHub and Stack Overflow: when developers like to use something, they put it in their own projects, and so it grows. I’ve always personally preferred Textile over Markdown, but either way I’m of the opinion that a neutral, simple text-based language that can be easily transformed into any number of other formats is the most responsible way to author and store content.

We have both Textile and Markdown available in Perch in preference to HTML-based WYSIWYG editors, and it’s really positive to see other content management systems taking the same approach.

From a developer point of view, using either of these languages is pretty straightforward. The user inputs the content in, say, Markdown, and you store it directly in Markdown format to facilitate later editing. Obviously you can’t just output Markdown to the browser, so at some point it needs to be converted into HTML. The question that is sometimes debated is when this should happen.

If you’ve ever looked at the source code for a parser of this nature, it should be clear that transcoding from text to HTML is a fair amount of work. The PHP version of Markdown is about 1500 lines of mostly regular expressions and string manipulation. What other single component of your application is comparable?

I’m always of the opinion that if the outcome of a task is known then it shouldn’t be performed more than once. For a given Markdown input, we know the output will always be the same, so in my own applications I transform the text to HTML once and store it in the database alongside the original. That just seems like the smart thing to do. However, I see lots of CMSs these days (especially those purporting to be ‘lightweight’) that parse Markdown at runtime and don’t appear to suffer from it.
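To make that concrete, here’s a minimal sketch of an edit-time save handler (the function name is illustrative, not from any particular CMS; it assumes a content table with markdown and html columns):

<?php
    require('markdown.php');

    // Transform the Markdown once, at edit time, and store the HTML
    // alongside the original source for display later.
    function save_content(PDO $dbh, $id, $markdown) {
        $parser = new Markdown_Parser();
        $html = $parser->transform($markdown);
        $stmt = $dbh->prepare(
            'UPDATE content SET markdown = ?, html = ? WHERE id = ?');
        $stmt->execute(array($markdown, $html, $id));
    }
?>

At display time the application just selects the html column; the Markdown is only parsed again when the content is edited.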

But which is better, parsing Markdown at runtime, or parsing at edit time and retrieving? There’s only one way to find out…

FIGHT!

Ok, perhaps not a fight, but I thought it would be interesting to run some highly unscientific, finger-in-the-air benchmarks to get an idea of whether parsing Markdown really does impact page performance compared to fetching HTML from a file or database. Is it really that slow?

Using the PHP version of Markdown, I took the jQuery GitHub README.md file as an example document. I figured it wasn’t too long or short, contained a few different features of the language, and was pretty much a typical example.

My methodology was simply to write a PHP script to perform the task being tested, and then hit it with apachebench a few times to get the number of requests per second. Under such unscientific conditions I expected my results to be useful only for comparison: the conditions weren’t perfect, but they were consistent across tests.

In the most basic terms, measuring requests per second tells you how many visitors your site can support at once. The faster the code, the higher the number, the better.

Test 1: Runtime parsing

Below is the script I used. Pretty much no-nonsense, reading in the source Markdown file, instantiating the parser and parsing the text.

<?php
    require('markdown.php');
    $text = file_get_contents('jquery.md');
    $Markdown_Parser = new Markdown_Parser();
    $html = $Markdown_Parser->transform($text);
    unset($Markdown_Parser);
?>

I blasted this with apachebench for 10,000 requests with a concurrency of 100.
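The invocation was along these lines (the script URL here is a placeholder):

    ab -n 10000 -c 100 http://localhost/test1.php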

Result: around 155 requests per second.

Test 2: Retrieving HTML from a database

I created a very simple database with one table containing one row. I pasted in the HTML result of the parsed Markdown (created using the same method as above). I then took some boilerplate PHP PDO database connection code from the PHP manual.
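For reference, the one-off setup amounted to something like this (a sketch: the table and column names match the query below, but the types are a guess):

<?php
    require('markdown.php');

    // One-off setup: a single table holding the pre-parsed HTML.
    $dbh = new PDO('mysql:host=localhost;dbname=markdown-test',
        'username', 'password');
    $dbh->exec('CREATE TABLE content (
        id INT UNSIGNED NOT NULL PRIMARY KEY,
        html MEDIUMTEXT NOT NULL)');

    $parser = new Markdown_Parser();
    $html = $parser->transform(file_get_contents('jquery.md'));
    $stmt = $dbh->prepare('INSERT INTO content (id, html) VALUES (1, ?)');
    $stmt->execute(array($html));
?>

The test script itself: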

<?php
    $dbh = new PDO(
        'mysql:host=localhost;dbname=markdown-test',
        'username', 'password');
    foreach ($dbh->query('SELECT html FROM content WHERE id=1') as $row) {
        $text = $row['html'];
    }
    $dbh = null;
?>

I restarted the server, and then hit this script with the same ab settings.

Result: around 3,575 requests per second.

Test 3: Retrieving HTML from a file

For comparison, I thought it would be interesting to look at a file-based approach. For this test, I parsed the Markdown on the first request, and then reused the result for subsequent runs. A very basic form of runtime parsing and caching, if you will.

<?php
    if (file_exists('jquery.html')) {
        $html = file_get_contents('jquery.html');
    } else {
        require('markdown.php');
        $text = file_get_contents('jquery.md');
        $Markdown_Parser = new Markdown_Parser();
        $html = $Markdown_Parser->transform($text);
        file_put_contents('jquery.html', $html);
        unset($Markdown_Parser);
    }
?>

In theory, this should be very fast, as it’s basically just stat()ing a file and then fetching it. I hit it with the same settings again.

Result: around 12,425 requests per second.

Conclusion

It would be improper to draw a formal conclusion from such rough tests, but I think we can get an idea of the overall work involved with each method, and the numbers tally with common sense.

Parsing Markdown is slow. In these tests it was around 23 times slower than fetching pre-transformed HTML from the database (155 versus 3,575 requests per second). Considering you’re likely already fetching your Markdown text from the database, you’re effectively doing the work of Test 2 and then Test 1 on top.

It would be interesting to compare the third test with caching the output to something like Redis. Depending on your traffic profile, that could be quite an effective approach if you really didn’t want to store the HTML permanently, although I’m not sure why that would be an issue. It would also be interesting to compare these rough results with some properly conducted ones, if anyone’s set up to do those and has the time.
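For what it’s worth, the Redis variant might look something like this: a sketch only, assuming the phpredis extension, with an arbitrary key name and expiry:

<?php
    require('markdown.php');

    // Cache the transformed HTML in Redis rather than on disk.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $html = $redis->get('html:jquery');
    if ($html === false) {
        $parser = new Markdown_Parser();
        $html = $parser->transform(file_get_contents('jquery.md'));
        $redis->set('html:jquery', $html, 3600); // expire after an hour
    }
?>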

All applications and situations are different, and therefore everyone has their own considerations and allowances to make. Operating at different scales, on different platforms, can affect your choices. Perhaps you have CPU in abundance, but are bottlenecking on IO.

However, for the typical scenario of a basic content managed website and for any given web hosting, parsing Markdown at runtime can vastly reduce the number of visitors your site can support at once. It could make the difference between surviving a Fireballing and not. For my own work, I will continue to parse at edit time and store HTML in the database.

- Drew McLellan

Comments

  1. Ben Lancaster:

    Personally I’d be inclined to take the DB and flat files out of the equation for the “cached” HTML version and instead use a memory cache like APC (now enabled in PHP by default, I believe) or Memcache, using a checksum of the Markdown content as the cache key. That way, if the content changes, the checksum will change and it’ll re-generate the HTML and cache it in memory. It’ll make for smaller storage on disk and speed up the database too, as your queries will return smaller result sets.

    Memory-based caches will almost certainly be quicker than hitting the disk for flat files too.
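    A minimal sketch of that approach, assuming APC’s user cache (apc_fetch/apc_store), with the cache key built from a checksum as described:

        <?php
            require('markdown.php');

            // Key the cache on a checksum of the Markdown source, so
            // edited content regenerates automatically.
            $markdown = file_get_contents('jquery.md');
            $key = 'html:' . md5($markdown);

            $html = apc_fetch($key);
            if ($html === false) {
                $parser = new Markdown_Parser();
                $html = $parser->transform($markdown);
                apc_store($key, $html);
            }
        ?>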

  2. jamie knight:

    Hiya,

    Good post, thanks for providing some numbers. In my own experiment building a little file publisher, parsing a 4KB file of medium-complexity Markdown took about 20-30ms on my little shared hosting account. Markdown was one of many culprits I found. I found strtotime() (5-10ms!) to be pretty slow as well.

    I know premature optimisation is the root of all evil, but sometimes it can be fun (and relaxing) to performance-tweak and experiment. I found that once you’re getting below around 5ms or so, even things like file includes start taking noticeable amounts of time. When experimenting I reduced PHP execution time from 5ms to < 2ms just by combining some files which didn’t benefit that much from being separate.

    In my day job with BBC Radio & Music we use Varnish and ESI to take the load off the PHP servers and keep the site peppy.

    Cheers,

    Jamie + Lion
