We store "broken" xml in a text field in a table. Fixing xml is not currently possible, so normal use pattern is to search through it using ~* E'my_regex' in where clause. The challenge is that size of the table is very very large.
I am trying to employ gin index to make queries faster. The problem is that while parsing text to tsvector, the parser ignores xml tags.
For example I am looking for the records containing word "file". The word could be a part of xml tag or not part of tag.
This is what I tried:
select to_tsvector ( 'simple', '<file mode="blah"><value>bar</value>' ) @@ to_tsquery('simple','file');
?column?
f
What am I doing wrong? How can I specify delimiters to be one of <>&=,./? Thank you in advance
If you are only intereseted in searching for words/tagnames etc then use a regexp to remove all the markup when creating your tsvector:
select to_tsvector( 'simple',
regexp_replace('<file mode="blah"><value>bar</value>', E'[^A-Za-z0-9]', ' ', 'g')
) @@ to_tsquery('simple','file');
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With