Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL tsvector parse include xml tags in search

We store "broken" xml in a text field in a table. Fixing xml is not currently possible, so normal use pattern is to search through it using ~* E'my_regex' in where clause. The challenge is that size of the table is very very large.

I am trying to employ gin index to make queries faster. The problem is that while parsing text to tsvector, the parser ignores xml tags.
For example I am looking for the records containing word "file". The word could be a part of xml tag or not part of tag. This is what I tried:

select to_tsvector ( 'simple', '<file mode="blah"><value>bar</value>' ) @@ to_tsquery('simple','file');  
?column?  
f

What am I doing wrong? How can I specify delimiters to be one of <>&=,./? Thank you in advance

like image 244
alex_123 Avatar asked Jan 18 '26 21:01

alex_123


1 Answers

If you are only intereseted in searching for words/tagnames etc then use a regexp to remove all the markup when creating your tsvector:

select to_tsvector( 'simple', 
    regexp_replace('<file mode="blah"><value>bar</value>', E'[^A-Za-z0-9]', ' ', 'g')
) @@ to_tsquery('simple','file');
like image 191
Chris Farmiloe Avatar answered Jan 20 '26 20:01

Chris Farmiloe



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!