This is a continuation of my previous blog post, Delimited String Tennis Anyone?, which playfully talked about a virtual T-SQL Tennis match… a friendly rally back and forth between SQL Server MVP Adam Haines and myself in discussing methods of shredding comma-delimited strings in our blogs.
In my previous post, I expressed my disappointment with the following bit of code, which shreds elements from a comma-delimited string consisting of 5000 elements:
/*
Note that I have to CAST() the first parameter of REPLICATE
in order to force it to produce more than 8000 bytes
*/
declare @CommaList varchar(max)
set @CommaList='ABC'+replicate(cast(',ABC' as varchar(max)),4999)
select x.i.value('.','varchar(3)')
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)) a
cross apply XMLList.nodes('i') x(i)
This code, which I will call the “Derived Table Method”, took over 6 painful minutes to execute!
It had to do with how the XML column in the derived table was actually streamed and processed in small chunks as opposed to just immediately instantiating a normal, vanilla XML column.
My solution to this problem was to just create an intermediate XML variable and process it directly:
declare @XMLList xml
set @XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)
select x.i.value('.','varchar(3)')
from @XMLList.nodes('i') x(i)
This code, which I will call the “XML Variable Method”, was more like it… it took less than a second to execute… an improvement of over 2000%!
Since I wrote that blog entry, I received a few comments and e-mails that suggested some other approaches.
A person named “Mister Magoo” (no relation to the cartoon character voiced by Jim Backus in the 1950’s and 1960’s I’m sure) commented that the “Derived Table Method” could be improved dramatically just by adding a .query() to the XML column created in the derived table:
select x.i.value('.','varchar(3)')
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml).query('.')) a
cross apply XMLList.nodes('i') x(i)
He was correct! This also took less than a second! I suppose the theory here is that adding the .query() method forces the XML column to be instantiated immediately instead of bringing about the whole streaming nonsense.
This is excellent… it eliminates the step of having to create an intermediate XML variable… everything can be done in one single statement. It’s not quite as fast as the “XML Variable Method”, but we’re only talking about a small difference in milliseconds… it certainly performs like lightning.
The results of the above 3 methods come out as follows:
/*
Method CPU Duration(ms)
----------------------------------------------------
Derived Table 369,187 398,791
Derived Table with .query() 156 168
XML Variable 99 117
*/
I also got an e-mail from my friend SQL Server MVP Alejandro Mesa (some of you may know him as Hunchback on various SQL Forums), who mentioned that things could be sped up even more by changing the XQuery expression passed as the first parameter to the .value() method.
Instead of the simple ‘.’ that I passed…
select x.i.value('.','varchar(3)')
…Alejandro suggested using the .text() accessor and an ordinal [1] predicate, like so:
select x.i.value('(./text())[1]','varchar(3)')
True enough, this did improve the performance of the query. With this XQuery expression, we are indicating that we want the first instance ([1]) of the text-only node values (i.e. not any XML sub-elements) in our x.i node. This is more specific than our original ‘.’ expression, which indicated that we wanted to process the entire x.i node.
You can get a clearer idea of how these two XQuery expressions work in the following examples:
declare @XMLVar xml
set @XMLVar='<a>abc<b>xyz</b>123</a>'
select x.i.value('.','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns 'abcxyz123' */
select x.i.value('(./text())[1]','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns 'abc' */
select x.i.value('(./text())[2]','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns '123' */
The @XMLVar variable above had two text nodes within the <a></a> element. In our original queries at the beginning of this article, each of the <i></i> elements contained only a single text node (and that’s why the expressions of ‘.’ and ‘(./text())[1]’ both produced the same result), and yet the expression ‘(./text())[1]’ was much more efficient and therefore faster because it was so much more specific in indicating what we wanted to process.
When I applied Alejandro’s suggestion to the queries, you can see how the CPU and Duration decreased even more:
/*
Method CPU Duration(ms)
----------------------------------------------------------------
Derived Table 369,187 398,791
Derived Table with .query() 156 168
Derived Table with .query() and text()[1] 130 159
XML Variable 99 117
XML Variable with text()[1] 76 77
*/
Alejandro also pointed me towards a very detailed technical article at Microsoft TechNet entitled Performance Optimizations for the XML Data Type in SQL Server 2005. If you do a lot of XML manipulation, this article looks like a must-read.
Thank you again to Mister Magoo and Alejandro for their contributions! With all of this knowledge, we can all now be the speed demons of delimited string shredding.