All in the <head>

– Ponderings & code by Drew McLellan –

– Live from The Internets since 2003 –

About

Paging Large Datasets in SQL Server

11 July 2005

A common requirement of a web application is to be able to perform paging on sets of data. Paging, as the name suggests, is simply the act of taking a bunch of data and splitting it across a number of pages. The practice is then to offer ‘previous’ and ‘next’ links for the user to navigate back and forth. The challenge for the developer lies in knowing what page he is on and what the next and previous pages are in order to request the right section of data from the database.

Well, the maths is pretty simple and boils down mostly to two key datapoints:- the total number of rows you need to display and the number of rows to display per page. The rest is just schoolboy sums.

Know Your LIMITs

When working with MySQL as your data store, retrieving a ‘page’ of data is simple. MySQL provides a LIMIT clause in its SQL syntax for returning only sections of result sets. It takes two arguments. The first is the row number to start at (counting from zero), and the second is the number of rows to return. So if you’re displaying 20 results per page, to retrieve page 3 of your set you’d need to specify something like:

SELECT some_data FROM a_table WHERE search_conditions LIMIT 40, 20

This would return 20 rows from the result set, starting at the 41st row. In laymans’ terms, page 3.

Now this is all well and good if you’re using MySQL. If you’re using Microsoft SQL Server (as I happen to be at the moment), it’s not so rosy. Unlike MySQL, SQL Server does not support the LIMIT clause. Instead it has TOP, which is used right after the SELECT as in:

SELECT TOP 20 some_data FROM a_table WHERE search_conditions

The downside to TOP vs LIMIT is that it only takes a single argument, that being the number of rows to return. There’s no way to specify where in the result set to start from. Which is a problem.

Temporary Tables

The traditional way of handling this problem in SQL Server is to select the result set into a new temporary table with its own index column. This then gives you a method of selecting a portion from the middle of the data, as the rows are numbered sequentially.

For my purposes, however, this method had its drawbacks. The primary problem is that I needed to perform paging on all sorts of data sets right across my app. Some of those are queried direct from ASP classes via ADO, and some are called through stored procedures, depending on the type of query and where and when it is used. The temporary table shenanigans need to be performed at the database level, so there went any hope for a unified solution using that technique. In addition to this (and I admit that this would be a problem with MySQL too) some of the queries are extremely intensive and take a long time to return. Running the query over for each page was not acceptable.

Combine this with the requirement to be able to dynamically sort the results, and I came around to the conclusion that I’d need to somehow run the query once and then cache the result for subsequent processing. In theory I think I should have been able to do this using temporary tables, but in practise I couldn’t get the tables to persist in the way I needed, and as I mentioned, didn’t provide the uniform ‘drop in’ solution I was aiming for. The caching was going to have to be done at the client (which in this instance is the application layer – ASP).

SQLXML

One really nifty feature of SQL Server (and I wish MySQL would support this more fully) is native XML output. You can basically take any query and put FOR XML on the end, and instead of a recordset object, it returns an XML document. Supercool. This has lead to many elegant solutions in my projects over the last five years, and WOW has it been that long already?

Based on the availability of the XML output, I decided to run each query that was going to be paged as XML. I’d then take the result and write it to disk on the web server. Once it was cached on disk (and for every subsequent request of the data with a valid paging or sorting query string parameter) I’d read the XML back off the disk and process it for display with an XSLT template.

Using XSLT

XSLT is a fantastic technology which never fails to regenerate interest in whatever I’m working on. For my purposes on this project, it has three key features. Firstly, it has a postion() function which returns what is effectively the row number as we loop through the data. Obviously this is invaluable for paging as it enables me to process for output only rows within a certain range (my page).

The second feature that helps is that it has its own sorting capabilities built in. This means there’s no need to go back to the database server and rely on SQL when the user wishes to reorder a result set. The third key feature is the ability to take my source XML document and process it in a number of different ways. I can use the same cached result set to produce both the XHTML and the CSV versions of a report, for example.

So it was a bit of a journey to get there, but I found what was for me a really good solution to paging large datasets in SQL Server. Retrieve as XML, and process with XSLT. Job done.

- Drew McLellan

Comments

  1. § Jonathan Snook: ADO has a paging mechanism built into it as well. The properties of the RecordSet that you’d be interested in would be PageSize (sets or returns how many records per page), AbsolutePage (sets or returns the page number), PageCount (returns # of pages).

    It still returns all records from the database just like the XML solution so it’s just another approach but may be more appropriate for things that don’t use XML.
  2. § Mike Rainey: Well, Jonathan beat me to this one.

    I agree with you that MySQL should support XML output more. I have run into the same problems you have described on more than one occasion and have never really been able to come up with a solution I found acceptable. Thanks to you, now I have.

    Although I know ASP and in a round-about way, MicrosoftSQL, I prefer PHP/MySQL. Just plain easier to work with, in my opinion.
  3. § Turnip: I remember struggling with this one ages ago… can’t remember the exact solution I came up with but it basically was a subquery taking the form of something like:

    SELECT TOP 60 * FROM myTable WHERE id NOT IN (SELECT TOP 40 id FROM myTable);

    Don’t quote me on that though… it’s entirely from memory.
  4. § Joseph Lindsay: I had to do this recently and came up with a way of doing it as some nested sub queries and sorting. Like this
  5. § Jennifer Grucza: Oh wow, that’s pretty dumb that SQL Server doesn’t support an OFFSET. Caching works ok for smaller datasets, but what if you have lots of data?

    I was annoyed that I couldn’t use scrollable result sets with Postgres, but this is much worse.
  6. § Bob Gordon: J Snook’s comment is OK, if the Provider supports bookmarks, but my SQL2000 Provider does not, so I like J. Lindsay’s solution. Of course his solution needs some paramterizing so that you can change the page size, offset, table name, and sorted column on the fly. I have done this in a stored proc that I will probably use quite a bit.
  7. § Donal: A) of all – If you’re paging in MySQL, you might be interested in using SELECT SQL_CALC_FOUND_ROWS to find out how many rows the statement would have returned without the LIMIT.

    Apologies if this is widely known already – I’ve never encountered it in other people’s work.

    And B – I’ve only done this with PHP, but in my experience, caching data as XML is not as convenient or fast as just serializing the object and caching that text. It might be worth profiling this option – I haven’t tested it myself in about two years, but back then the results were convincing.

    Useful if XSLT isn’t your cup of tea.
  8. § jorgempf: gordon: how did you put the dynamic value in the top parameter of the select in a store procedure?
  9. § Frans Nagel: I would rather let SQL server do as much as possible by itself. Turnip was right:
    use pubs
    select top 5 * from authors where au_id not in(select top 5 au_id from authors order by au_lname)order by au_lname
  10. § Drew: That’s all well and good when writing a basic query against the pubs database, but in a real life situation with multiple subqueries and calculated columns and it quickly becomes superbly complicated, bordering on impractical.
  11. § Claudine: Yes I think Joseph Lindsay’s method is probably the best.
  12. § Mathew Waters: Very useful article, thanks Drew. For those that do not have write access to a web server, or just don’t want to use XML+XSL, beware of using the …where NOT IN (... query. The “NOT IN” is very much harder for SQL Server to deal with, as it must perform a table scan instead of a seek. I would agree that Joseph Lindsay’s method is the best here.
  13. § Ben: Thanks for this useful article. And thanks to Donal for the hint about SQL_CALC_FOUND_ROWS. Just what I needed!
  14. § will sahatdjian: I’m relatively new to SQL Server but this is what I came up with (requires an autonumber column to work):

    Select TOP 20 * From [Table] WHERE ID > (SELECT TOP 1 x FROM (SELECT TOP 1000 ID as x FROM [Table] ORDER BY ID) as x ORDER BY x DESC)

    20 is the Page Size and 1000 is the Offset
  15. § will sahatdjian: Edit:

    Select TOP 20 * From [Table] WHERE ID >= (SELECT TOP 1 x FROM (SELECT TOP 1000 ID as x FROM [Table] ORDER BY ID) as x ORDER BY x DESC)

    The Offset value (1000 in this case) must be at least 1, as zero will obviously not work
  16. § Zeeshan Ahmed:

    Hello,

    Is there any way to develop stored procedures in MySql database?

    Thank you,
    Zeeshan Ahmed

  17. § Wahya:

    Now, i have a problem in SQL SERVER 2000. I want to viewing 5 record, from record 5 to 10. How can i solve this problem?
    I can find the answer using Select TOP 5 * from Table

    Help ME

  18. § lee:

    Refreshing graphics and layout, Drew.

  19. § meyer:

    Dont use top!
    Use row_number()

    link:

    http://msdn2.microsoft.com/en-us/library/ms186734.aspx

    ROW_NUMBER ( ) OVER ( [ ] )

    example:

    with ordernr as (SELECT *, ROW_NUMBER() OVER (order by trade_id) as row FROM trade) SELECT * FROM ordernr WHERE row between 5 and 10

  20. § venkat:

    ROW_NUMBER() is in SQL 2k5 not in SQL 2k. So, we have to use Paging mechanism writing our own algorithm.

  21. § farhan:

    im using this sql in msaccess
    Select TOP 5 * From [Table] WHERE ID=abc ORDER BY date DESC
    but i got 8 records in result

Photographs

Work With Me

edgeofmyseat.com logo

At edgeofmyseat.com we build custom content management systems, ecommerce solutions and develop web apps.

Recent Links

Affiliation

  • Web Standards Project
  • Britpack
  • 24 ways

About Drew McLellan

Photo of Drew McLellan

Drew McLellan has been hacking on the web since around 1996 following an unfortunate incident with a margarine tub. Since then he’s spread himself between both front- and back-end development projects, and now is Director and Senior Web Developer at edgeofmyseat.com in Maidenhead, UK (GEO: 51.5217, -0.7177). Prior to this, Drew was a Web Developer for Yahoo!, and before that primarily worked as a technical lead within design and branding agencies for clients such as Nissan, Goodyear Dunlop, Siemens/Bosch, Cadburys, ICI Dulux and Virgin.net. Somewhere along the way, Drew managed to get himself embroiled with Dreamweaver and was made an early Macromedia Evangelist for that product. This lead to book deals, public appearances, fame, glory, and his eventual downfall.

Picking himself up again, Drew is now a strong advocate for best practises, and stood as Group Lead for The Web Standards Project 2006-08. He has had articles published by A List Apart, Adobe, and O’Reilly Media’s XML.com, mostly due to mistaken identity. Drew is a proponent of the lower-case semantic web, and is currently expending energies in the direction of the microformats movement, with particular interests in making parsers an off-the-shelf commodity and developing simple UI conventions. He writes here at all in the head and, with a little help from his friends, at 24 ways.