Usually, if we have to extract numbers from a string, in any programming language, the first idea is to use regular expressions. How could we use regular expressions to extract integers from strings in TSQL?
Let’s suppose we have a table with a string column in which we can have letters and numbers. We want to extract the numbers, the possible situations are:
- numbers can be at any possition in the string column
- the numbers are integers but they can have different length
- in a string there can be one or many numbers
We want to obtain a column that will contain only the numbers in the string column, if there are many numbers in a cell, we will have all of them separated by comma (for example). We will define the following test data:
DECLARE @sample TABLE(
id INT,
data VARCHAR(200) )
INSERT @sample
SELECT 1, 'bar 1245 foo' UNION ALL
SELECT 2, 'foo 35234 bar 38383' UNION ALL
SELECT 3, 'bar 126831' UNION ALL
SELECT 4, '72348707 foo ' UNION ALL
SELECT 5, 'foo bar';
We want to use regular expressions, and we have to find an algorithm to solve the following problems:
- how to identify where it starts a number and where it ends
- how to find one or many occurences in the same string
For the first problem, identifing a number in a string can be done using regular experssions like [0-9][^0-9] which means “one or no number followed by not a number”. We can use this regular expression in PATINDEX and manage the special case of a number at the end of the string.
For the second problem, we can use a recursive CTE to find all occurences of an integer number in the string.
So, a possible solution can be:
WITH numberList AS (
SELECT
id, data, nEnd, nStart,
SUBSTRING(data, nStart,
CASE
WHEN nEnd = 0 THEN LEN(data)
ELSE nEnd
END - nStart + 1) AS numbers
FROM
(SELECT
data, id,
PATINDEX('%[0-9]%', data) AS nStart,
PATINDEX('%[0-9][^0-9]%', data) AS nEnd
FROM
@sample
WHERE
data LIKE '%[0-9]%') process
UNION ALL
SELECT
id, data, nEnd, nStart,
SUBSTRING(data, nStart,
CASE
WHEN nEnd = 0 THEN LEN(data)
ELSE nEnd
END - nStart + 1) AS numbers
FROM
(SELECT
numbers, id,
SUBSTRING(data, nEnd + 1, LEN(data) - nEnd) AS data,
PATINDEX('%[0-9]%', SUBSTRING(data, nEnd + 1, LEN(data) - nEnd)) AS nStart,
PATINDEX('%[0-9][^0-9]%', SUBSTRING(data, nEnd + 1, LEN(data) - nEnd)) AS nEnd
FROM
numberList
WHERE
nEnd > 0 AND
data LIKE '%[0-9]%') process
WHERE
nStart > 0)
SELECT distinct
id, STUFF((SELECT
', ' + numbers
FROM
numberList nl2
WHERE
nl2.id = nl.id
FOR XML PATH('')), 1, 2, '' ) AS numbers
FROM
numberList nl
The most important points in this algorithm are:
- to have inside the CTE a select that keeps the start and end positions of the numbers found in the string
- the condition WHERE nEnd > 0 is used to exclude rows from CTE processing once they have been searched
- the SUBSTRING functions uses for the last parameter a CASE instruction that takes care of the special case of having numbers at the end of the string
- the result contains two columns, the id and the concatenated numbers found in the initial string column. To concatenate strings a combination of two instruction have been used: STUFF and FOR XML PATH('')
This example only works for integers, decimals present the extra difficulty of interpreting the decimal separator ('.' or ',') in different ways depending on its position.