I have a piece of text. It can contain every character from ASCII 32 (space) to ASCII 126 (tilde) and including ASCII 9 (horizontal tab).
The text may contain sentences. Every sentence ends with a dot, question mark, or exclamation mark, directly followed by a space.
The text may contain basic Markdown styling, that is: bold text (**, also __), italic text (*, also _) and strikethrough (~~). Markdown may occur inside sentences (e.g. **this** is a sentence.) or outside them (e.g. **this is a sentence!**). Markdown may not span a sentence boundary, that is, there may not be a situation like this: **sentence. sente** nce.. Markdown may enclose more than one sentence, that is, there may be a situation like this: **sentence. sentence.**.
It can also contain two sequences of characters: <!-- and -->. Everything between these sequences is treated as a comment (like in HTML). Comments can occur at any position in the text, but cannot contain newline characters (I hope that on Linux this is just ASCII 10).
I want to detect all sentences in JavaScript, and after each of them insert its length in a comment, like this: sentence.<!-- 9 -->. I do not much care whether the length includes the Markdown tags or not, but it would be nice if it did not.
So far, with the help of this answer, I have prepared the following regex for detecting sentences. It mostly fits my needs, except that it includes comments.
const basicSentence = /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?]/gi;
I have also prepared the following regex for detecting comments. It also works as expected, at least in my own tests.
const comment = /<!--.*?-->/gi;
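As a quick check of what the two patterns do together, here is a minimal sketch. The sample string is made up for illustration, and the `i` flag is dropped because neither pattern contains letters:

```javascript
const basicSentence = /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?]/g;
const comment = /<!--.*?-->/g;

// Hypothetical sample: one sentence with an embedded comment, then a second sentence.
const sample = 'b<!-- comment -->ar. Next one!';

// Each match keeps embedded comments and the separator that precedes the sentence.
const matches = sample.match(basicSentence);
console.log(matches); // [ 'b<!-- comment -->ar.', ' Next one!' ]

// Removing comments from a match leaves only the visible sentence text.
const visible = matches.map((m) => m.replace(comment, ''));
console.log(visible); // [ 'bar.', ' Next one!' ]
```

The second match starts with the space that separates the two sentences, so that character is counted unless it is trimmed.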
To better see what I want to achieve, let us have an example. Say, I have the following piece of text:
foo0
b<!-- comment -->ar.
foo1 bar?
<!-- comment -->
foo2bar!
(There is also a newline at the end of it, but I do not know how to add an empty line in Stack Overflow Markdown.)
And the expected result is:
foo0
b<!-- comment -->ar.<!-- 10 -->
foo1 bar?<!-- 9 -->
<!-- comment -->
foo2bar!<!-- 12 -->
(This time, there is also no newline at the end.)
UPDATE: Sorry, I have corrected the expected result in the example.
Pass a callback to .replace that removes all comments from each match and appends the length of what is left (call .trim() on it first if the leading space or newline should not be counted):
const input = `foo0
b<!-- comment -->ar.
foo1 bar?
<!-- comment -->
foo2bar!
`;
const output = input.replace(
  /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?]/g,
  (match) => {
    const matchWithoutComments = match.replace(/<!--.*?-->/g, '');
    return `${match}<!-- ${matchWithoutComments.length} -->`;
  }
);
console.log(output);
Of course, you can use a similar pattern to replace markdown notation with the inner text content as well, if you wish:
.replace(/([*_]{1,2}|~~)((.|\n)*?)\1/g, '$2')
(Because tags can be nested and possibly unbalanced, which regex does not handle well, you may have to repeat that replacement until no further replacements are found.)
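One way to repeat the replacement until nothing changes is a small loop. This is only a sketch: `stripMarkdown` is a hypothetical helper name, and it assumes every `*`, `_` or `~~` pair in the text really is markup (a name like `foo_bar_baz` would lose its underscores).

```javascript
// Same pattern as above: an opening marker, lazily matched content, the same closing marker.
const markdownPair = /([*_]{1,2}|~~)((.|\n)*?)\1/g;

// Hypothetical helper: unwrap marker pairs until a pass changes nothing.
function stripMarkdown(text) {
  let previous;
  let current = text;
  do {
    previous = current;
    current = previous.replace(markdownPair, '$2');
  } while (current !== previous);
  return current;
}

console.log(stripMarkdown('**this** is a sentence.')); // "this is a sentence."
console.log(stripMarkdown('**_bold italic_**'));       // "bold italic"
```

Each pass that changes the string removes at least two marker characters, so the loop always terminates.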
Also, as pointed out in a comment, your current regular expression expects every sentence to end in ., !, or ?, so the ! in <!-- can be treated as the end of a (short) sentence. One option is to add a lookahead at the very end of the regex that requires whitespace (a space or a newline), the end of the input, or a Markdown character (*, _ or ~) to follow:
const input = `foo0
b<!-- comment -->ar.
foo1 bar?
<!-- comment -->
foo2bar!
<!-- comment -->`;
const output = input.replace(
  /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?](?=\s|$|[*_~])/g,
  (match) => {
    const matchWithoutComments = match.replace(/<!--.*?-->/g, '');
    return `${match}<!-- ${matchWithoutComments.length} -->`;
  }
);
console.log(output);
https://regex101.com/r/RaTIOi/1
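Putting the pieces together, the approach could be wrapped in a reusable function along these lines. This is a sketch, not a definitive implementation: `annotateSentences` is a hypothetical name, and the `.trim()` call is an added assumption so that the space or newline in front of a sentence is not counted. Markdown markers are still counted here.

```javascript
// Sentence pattern with the trailing lookahead, and the comment pattern.
const sentence = /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?](?=\s|$|[*_~])/g;
const comment = /<!--.*?-->/g;

// Hypothetical helper: append "<!-- N -->" after every detected sentence,
// where N is the sentence length without comments and surrounding whitespace.
function annotateSentences(text) {
  return text.replace(sentence, (match) => {
    const visible = match.replace(comment, '').trim();
    return `${match}<!-- ${visible.length} -->`;
  });
}

console.log(annotateSentences('Hello world. How are you?'));
// Hello world.<!-- 12 --> How are you?<!-- 12 -->
```

A trailing comment with no sentence after it is left untouched, because the ! in <!-- is followed by - and so fails the lookahead.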