In a text database every word of the text is in a single row an has it's unique Id, which determines the sequence. The DB is static, the text will not be changed, no insertions or deletions will be made. So it looks like this:
Tablename = "text"
Id | Word
---------
1 | The
2 | texts
3 | articles
4 | and
5 | conversations
6 | are
7 | brief
8 | and
9 | appropriate
10 | for
11 | all
... | ...
Now, the user should be able to do a KWIC (keyword in context) search, the result should be, e.g. when the user searches for "conversations" and context is 2 (that is 2 words to the left and 2 word to the right) the result should be "articles and conversations are brief".
When the user searches for a word, which occurs multiple times, they should get multiple rows as a result, e.g. if they search for "and" in the given example and the context is 2 again, they should get both "texts articles and conversations are" and "are brief and appropriate for".
So far I have tried this:
SELECT * FROM text
WHERE Id
BETWEEN ((SELECT Id FROM text WHERE Word='conversations') - 2)
AND ((SELECT Id FROM text WHERE Word='conversations') + 2)
This works fine, but only if the searched word occurs only once in the text. (Which is the case with "conversations".) If the search term occurs multiple times in the text (e.g. if "conversations" in the query is substituted with "and"), it crashes, probably because the result of the inner SELECTs in this case is not a single Id, but a list of IDs. I would appreciate if somebody could give me advice on how to deal with multiple occurrences of the search term. Thank you!
---- Edit -----
Thank you, Dale, for the hint with LEAD and LAG! I've tried this together with the CONCAT function and the query now looks like this:
SELECT Word, temp FROM
(SELECT Word, CONCAT (
LAG(Word, 1) OVER (ORDER BY Id),
' <B>', Word, '</B> ',
LEAD(Word, 1) OVER (ORDER BY Id))
AS temp FROM text) t
WHERE Word='and';
It now gives the results "articels and conversations" and "brief and appropriate" as expected. Furthermore, the searched word ("and") can be bolded, which is a great byproduct. Now, I am stuck with the problem, how to expand the range from one single word left and right of the searched word to a another number, eg. 10 or max. 20. This number should be determined by the user in the final application. Is there a way, how this could be accomplished?
You can use a join with a range. This is will be more flexible and scalable more than (Lag/Lead), especially for arbitrary context length:
SELECT kwic.Id AS CenterId,
STRING_AGG(ctx.Word, ' ') WITHIN GROUP (ORDER BY ctx.Id) AS KWIC_Text
FROM text kwic
JOIN text ctx
ON ctx.Id BETWEEN kwic.Id - 2 AND kwic.Id + 2
WHERE kwic.Word = 'and'
GROUP BY kwic.Id
ORDER BY kwic.Id;
If you use this method, you will get one result per match of the word.
You can also make the context dynamic by declaring these variables:
DECLARE @context INT = 2;
DECLARE @keyword NVARCHAR(100) = 'and';
If you are stick to use (Lag/Lead) which I not prefer because it is not dynamic and needs hardcoded number of shifts, then this is how you can do it:
SELECT *
FROM (
SELECT Id, Word,
CONCAT_WS(' ',
LAG(Word, 2) OVER (ORDER BY Id),
LAG(Word, 1) OVER (ORDER BY Id),
'<B>' + Word + '</B>',
LEAD(Word, 1) OVER (ORDER BY Id),
LEAD(Word, 2) OVER (ORDER BY Id)
) AS KWIC_Text
FROM text
) AS KWIC
WHERE Word = 'and';
---- Edit -----
To make it work with MySQL, you will use GROUP_CONCAT method, but if you used it with JOIN you will lose the ability to mark (E.G. Bold) the search term because the entire phrase comes from the ctx table since you generating the full context around the word.
So, to solve this issue you need to inject the bolded version of the word only when ctx.Id = Kwic.Id - That is, where the context word is the exact match (the center word).
If you are rendering to a web application, this will appear bold in HTML. If not, you could instead use word or [word] if it is going to plain text.
So, this is how to solve it:
SET @context := 2;
SET @keyword := 'and';
SELECT kwic.Id AS CenterId,
GROUP_CONCAT(
CASE
WHEN ctx.Id = kwic.Id THEN CONCAT('<b>', ctx.Word, '</b>')
ELSE ctx.Word
END
ORDER BY ctx.Id SEPARATOR ' '
) AS KWIC_Text
FROM text kwic
JOIN text ctx
ON ctx.Id BETWEEN kwic.Id - @context AND kwic.Id + @context
WHERE kwic.Word = @keyword
GROUP BY kwic.Id
ORDER BY kwic.Id;
---- Edit 2 -----
To increase the performance of the query, you need to do the following:
1- Add composite index on (Word, Id); Note: If you already have Id as a primary key (Which is good), adding Word to the index makes it covering for this query.
2- Use the inner join only after filtering; This will reduce the number of rows processed in the join by huge factor.
So, this is how to solve it:
Firstly, you need to run the SQL query to add the index for both (Id, Word):
CREATE INDEX idx_word_id ON text (Word, Id);
Secondly, you need to update your SQL query to use the inner join only after filtering:
SET @context := 5;
SET @keyword := 'and';
SELECT kwic.Id AS CenterId,
GROUP_CONCAT(
CASE
WHEN ctx.Id = kwic.Id THEN CONCAT('<b>', ctx.Word, '</b>')
ELSE ctx.Word
END
ORDER BY ctx.Id SEPARATOR ' '
) AS KWIC_Text
FROM (SELECT * FROM text WHERE Word = @keyword) kwic
JOIN text ctx
ON ctx.Id BETWEEN kwic.Id - @context AND kwic.Id + @context
GROUP BY kwic.Id
ORDER BY kwic.Id;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With