SQL Server MVP Adam Haines and I have been going back and forth for months in our blogs on the subject of creating and splitting a delimited string… it’s been kind of a virtual T-SQL Tennis match… not a competition, but rather a friendly rally, knocking the subject back and forth, practicing our skills. Between the two of us, we’ve just about beaten the subject to death; however, I like to think that our collective research has been a win-win situation for the SQL Server community.
(If you’d like to look at these past articles, read Adam’s posts here, here, here, and here and read my posts and comments here, here, here, and here.)
In Adam’s latest volley a few days ago (which you can read here), he did some in-depth time tests on various methods of splitting a delimited string…. the XML approach using XPath and the .nodes() and .value() methods, the permanent Numbers table approach, and the approach using a TVF with a virtual table of Numbers.
His verdict was that the XML approach was a biiiiig disappointment when handling a comma-delimited string of many items, in one case performing over 200 times slower (!!) than the Numbers table approaches. Like Adam, I was a bit crestfallen. The XML method is the usual method suggested on the T-SQL forums for doing this sort of thing.
It is literally shocking how badly it performs.
Take a look at the following example, in which I create a variable that contains a comma-delimited list of 5000 items. Then it uses the XML approach to shred the list into a result set of 5000 rows:
/*
Note that I have to CAST() the first parameter of REPLICATE
in order to force it to produce more than 8000 bytes
*/
declare @CommaList varchar(max)
set @CommaList='ABC'+replicate(cast(',ABC' as varchar(max)),4999)
select x.i.value('.','varchar(3)')
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)) a
cross apply XMLList.nodes('i') x(i)
The task that we’re asking SQL Server to do is not that complicated, but do you want to guess how long this takes to fully execute on my system? Are you sitting down?
It takes over 6 minutes!
Really disappointing…
I had a lot of trouble accepting this. After doing several experiments, I finally figured out what was going on.
The problem lies in the derived table in the FROM clause, where we create a derived XML column via the CAST function.
It seems we are not really creating a complete column or value at all. Apparently we are inadvertently creating an XML stream, which sends small chunks of data to the .nodes() method and then the chunks get handed over to the .value() method to come up with our result… one chunk at a time.
On my system, when I execute the above code, I see nothing at all happen for about almost 50 seconds. Then the first 676 rows get spit out into the results window. After another 50 seconds or so, another similar-sized chunk of rows get spit out. This happens like clockwork until it spits out the final 5000th row over 6 minutes after I started the query.
You may recall in a previous post of mine where I said that an XML stream is processed in 2033-byte chunks. Well, 676 rows of 3 bytes (‘ABC’) each equals 2028 bytes. (I’m assuming that there’s a little tiny bit of overhead involved and that’s why we can’t squeeze in the 677th row). My comma-delimited string consists of 5000 items at 3 bytes each, coming to 15000 bytes. Divide that by 2033 bytes and you get 7.38, which is the number of chunks of data that get processed. Multiply that by 50 seconds per chunk, and you get… voila… 6.15 minutes.
Well, this must be all really exciting for you math nerds out there, but how does it help us?
It does help to explain what’s going on, and it does show us that streaming the XML is reeeeeaaalllly sloooooow.
So is there hope for the XML method?
YES! YES! YES!
This slow streaming only occurs when we are creating the XML to be processed via a derived table or CTE. What If we instead transfer our comma-delimited string into an actual XML variable and then process that variable, like so?:
/*
Note that I have to CAST() the first parameter of REPLICATE
in order to force it to produce more than 8000 bytes
*/
declare @CommaList varchar(max)
set @CommaList='ABC'+replicate(cast(',ABC' as varchar(max)),4999)
declare @XMLList xml
set @XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)
select x.i.value('.','varchar(3)')
from @XMLList.nodes('i') x(i)
Believe it or not, this takes only 0.174 seconds!
In comparing the two pieces of code outlined in this article, here are the results:
/* Derived XML
Table Variable Percent
Measurement Method Method Improvement
--------------------------------------------
CPU 386,172 125 2989%
Duration(ms) 368,747 174 2019%
*/
So I’m happy to report that all is not lost for the XML method. It’s still a contender in the list-splitting competition, giving the Numbers table methods a run for their money… as long as it is implemented in a certain way.