I have a string as below
Welcome to the world of the Hackers
I am trying to replace the occurrences of listed strings i.e. of,to,the
in between the entire string using below query, but it's not working properly if the patterns are consecutive, it fails.
SELECT regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' )
FROM dual;
Output: Welcome the world the Hackers
Even if the pattern is repeating consecutive it is not working i.e.
SELECT regexp_replace( 'Welcome to to the world of the Hackers', '( to )|( the )|( of )', ' ' )
FROM dual;
Output: Welcome to world the Hackers
Whereas my expected output is: Welcome world Hackers
Is there any alternative/solution for this using REGEXP_REPLACE
?
You can use the regular expression (^|\s+)((to|the|of)(\s+|$))+
:
SQL Fiddle
Query 1:
WITH test_data ( sentence ) AS (
SELECT 'to the of' FROM DUAL UNION ALL
SELECT 'woof breathe toto' FROM DUAL UNION ALL -- has all the words as sub-strings of words
SELECT 'theory of the offer to total' FROM DUAL -- mix of words to replace and words starting with those words
)
SELECT sentence,
regexp_replace(
sentence,
'(^|\s+)((to|the|of)(\s+|$))+',
'\1'
) AS replaced
FROM test_data
Results:
| SENTENCE | REPLACED |
|------------------------------|--------------------|
| to the of | (null) | -- All words replaced
| woof breathe toto | woof breathe toto |
| theory of the offer to total | theory offer total |
Why doesn't
regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' )
work with successive matches?
Because the regular expression parser will look for the second match after the end of the first match and will not include the already parsed part of the string or the replacement text when looking for subsequent matches.
So the first match will be:
'Welcome to the world of the Hackers'
^^^^
The second match will look in the sub-string following that match
'the world of the Hackers'
^^^^
The 'the '
at the start of the sub-string will not be matched as it has no leading space character (yes, there was a space before it but that was matched in the previous match and, yes, that match was replaced with a space but overlapping matches and matches on previous replacements are not how regular expressions work).
So the second match is the ' of '
in the middle of the remaining sub-string.
There will be no third match as the remaining un-parsed sub-string is:
'the Hackers'
and, again, the 'the '
is not matched as there is not leading space character to match.
REGEXP_REPLACE
does not match a second pattern which is a part of the already matched pattern. This is more apparent when you use the multi-pattern matching like |
. Thus, you can't rely on spaces for word boundaries to match multiple patterns this way. One solution could be to split and combine the characters. This may not be the best way, but works nonetheless. I would be glad to know a better solution.
This also assumes that you are ok with single spaces in the combined string when it had more than one in the original string.Also, words ending with comma or semicolon aren't considered. You may enhance it using NOT REGEXP_LIKE
instead of NOT IN
for such cases.
WITH t (id,s)
AS (
SELECT 1 , 'Welcome to the world of the Hackers, you told me these words at the'
FROM DUAL
UNION ALL
SELECT 2, 'The second line.Welcome to the world of the Hackers, you told me these words at the'
FROM DUAL
)
SELECT LISTAGG(word, ' ') WITHIN
GROUP (
ORDER BY w
)
FROM (
SELECT id,
LEVEL AS w
,REGEXP_SUBSTR(s, '[^ ]+', 1, LEVEL) AS word
FROM t CONNECT BY LEVEL <= REGEXP_COUNT(s, '[^ ]+')
AND PRIOR id = id
AND PRIOR SYS_GUID() IS NOT NULL
)
WHERE lower(word) NOT IN (
'to'
,'the'
,'of'
)
GROUP BY id;
Demo
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With