Paging Large Datasets in SQL Server
A common requirement of a web application is to be able to perform paging on sets of data. Paging, as the name suggests, is simply the act of taking a bunch of data and splitting it across a number of pages. The practice is then to offer ‘previous’ and ‘next’ links for the user to navigate back and forth. The challenge for the developer lies in knowing what page he is on and what the next and previous pages are in order to request the right section of data from the database.
Well, the maths is pretty simple and boils down mostly to two key datapoints:- the total number of rows you need to display and the number of rows to display per page. The rest is just schoolboy sums.
Know Your LIMITs
When working with MySQL as your data store, retrieving a ‘page’ of data is simple. MySQL provides a
LIMIT clause in its SQL syntax for returning only sections of result sets. It takes two arguments. The first is the row number to start at (counting from zero), and the second is the number of rows to return. So if you’re displaying 20 results per page, to retrieve page 3 of your set you’d need to specify something like:
SELECT some_data FROM a_table WHERE search_conditions LIMIT 40, 20
This would return 20 rows from the result set, starting at the 41st row. In laymans’ terms, page 3.
Now this is all well and good if you’re using MySQL. If you’re using Microsoft SQL Server (as I happen to be at the moment), it’s not so rosy. Unlike MySQL, SQL Server does not support the
LIMIT clause. Instead it has
TOP, which is used right after the
SELECT as in:
SELECT TOP 20 some_data FROM a_table WHERE search_conditions
The downside to
LIMIT is that it only takes a single argument, that being the number of rows to return. There’s no way to specify where in the result set to start from. Which is a problem.
The traditional way of handling this problem in SQL Server is to select the result set into a new temporary table with its own index column. This then gives you a method of selecting a portion from the middle of the data, as the rows are numbered sequentially.
For my purposes, however, this method had its drawbacks. The primary problem is that I needed to perform paging on all sorts of data sets right across my app. Some of those are queried direct from ASP classes via ADO, and some are called through stored procedures, depending on the type of query and where and when it is used. The temporary table shenanigans need to be performed at the database level, so there went any hope for a unified solution using that technique. In addition to this (and I admit that this would be a problem with MySQL too) some of the queries are extremely intensive and take a long time to return. Running the query over for each page was not acceptable.
Combine this with the requirement to be able to dynamically sort the results, and I came around to the conclusion that I’d need to somehow run the query once and then cache the result for subsequent processing. In theory I think I should have been able to do this using temporary tables, but in practise I couldn’t get the tables to persist in the way I needed, and as I mentioned, didn’t provide the uniform ‘drop in’ solution I was aiming for. The caching was going to have to be done at the client (which in this instance is the application layer – ASP).
One really nifty feature of SQL Server (and I wish MySQL would support this more fully) is native XML output. You can basically take any query and put
FOR XML on the end, and instead of a recordset object, it returns an XML document. Supercool. This has lead to many elegant solutions in my projects over the last five years, and WOW has it been that long already?
Based on the availability of the XML output, I decided to run each query that was going to be paged as XML. I’d then take the result and write it to disk on the web server. Once it was cached on disk (and for every subsequent request of the data with a valid paging or sorting query string parameter) I’d read the XML back off the disk and process it for display with an XSLT template.
XSLT is a fantastic technology which never fails to regenerate interest in whatever I’m working on. For my purposes on this project, it has three key features. Firstly, it has a
postion() function which returns what is effectively the row number as we loop through the data. Obviously this is invaluable for paging as it enables me to process for output only rows within a certain range (my page).
The second feature that helps is that it has its own sorting capabilities built in. This means there’s no need to go back to the database server and rely on SQL when the user wishes to reorder a result set. The third key feature is the ability to take my source XML document and process it in a number of different ways. I can use the same cached result set to produce both the XHTML and the CSV versions of a report, for example.
So it was a bit of a journey to get there, but I found what was for me a really good solution to paging large datasets in SQL Server. Retrieve as XML, and process with XSLT. Job done.