Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Strip numbers from string after certain characters into a recursive table

I have unstructured, free-form data, which requires cleansing to produce a clean list of all PO numbers in a column. Unfortunately, the column can accommodate multiple PO numbers in various formats (nightmare!!).

See test example:-

DECLARE @temp TABLE
(
    string VARCHAR(50)
)

INSERT INTO @temp (string)
VALUES 
    ('Po998 blah blah po1001'),
    ('PO 999'),
    ('PO 1000 12345 blah PO1002')

Ideally, I'm looking for a single table which holds the following:-

PO - All PO numbers will begin with the letters' PO' and vary in length.

  1. 998
  2. 999
  3. 1000
  4. 1001
  5. 1002

Note 12345 is not a PO and should not appear in the table.

Here is my current attempt to isolate the numerical values after the PO, but it does not give the correct results.

with
clean_stg as (
    select  *,
            ltrim(SUBSTRING(string,CHARINDEX('po',string)+2,len(string))) Val
    from @temp
),

clean_po as (
            select b.*,
            LEFT(b.Val,PATINDEX('%[^0-9]%', b.Val+'a')-1) PO
from clean_stg b
) 

select * from clean_po;

I would love to do this with REGEX in Python, but I need to keep this in the SQL Server environment. I think a recursive Common Table Expression (CTE) might be my answer, but I'm unsure how to structure it best.

like image 768
Lee Murray Avatar asked Sep 07 '25 15:09

Lee Murray


1 Answers

I don't think you need recursion here, and the CTE is only useful in that it prevents having to repeat expressions.

;WITH x AS 
(
  SELECT x = REPLACE(UPPER(string), 'PO ', 'PO') 
    FROM @temp
) 
SELECT PO = TRY_CONVERT(int, SUBSTRING(value,3,255))
  FROM x 
 CROSS APPLY STRING_SPLIT(x, ' ')
 WHERE value LIKE 'PO[0-9]%'
 ORDER BY PO;

And in fact you can do it without the CTE this way:

SELECT PO = TRY_CONVERT(int, SUBSTRING(value,3,255))
  FROM @temp
 CROSS APPLY STRING_SPLIT(REPLACE(UPPER(string), 'PO ', 'PO') , ' ')
 WHERE value LIKE 'PO[0-9]%'
 ORDER BY PO;

It wasn't mentioned but you may want to filter duplicates.

You can do additional safety checks to filter out a value like PO3x5 instead of returning NULL, but I already think you should have better/cleaner starting data in the first place. Parsing messy, inconsistent strings for nuggets of info is a nightmare.

like image 111
Aaron Bertrand Avatar answered Sep 10 '25 07:09

Aaron Bertrand