Please have a look at the following.
String[] sentenceHolder = titleAndBodyContainer.split("\n|\\.(?!\\d)|(?<!\\d)\\.");
This is how I tried to split a paragraph into sentences, but there is a problem. My paragraph includes dates like Jan. 13, 2014, abbreviations like U.S and numbers like 2.2, and they all get split by the above code. Basically, this code splits on almost every dot, whether it is a full stop or not.
I also tried String[] sentenceHolder = titleAndBodyContainer.split(".\n"); and String[] sentenceHolder = titleAndBodyContainer.split("\\.");. Both failed.
How can I split a paragraph into sentences "properly"?
Use sent_tokenize() to split text into sentences. Call nltk.tokenize.sent_tokenize(text) with a string as text to split the string into a list of sentences.
To split a sentence, first mark the clauses. Then make the sub-clauses independent by omitting subordinating linkers and inserting subjects or other words wherever necessary. Example: When I went to Delhi, I met my friend who lives there. Clause 1: (When) I went to Delhi.
In Word documents etc., each newline indicates a new paragraph, so you'd just use `text.split("\n")` (where `text` is a string variable containing the text of your file). In other formats, paragraphs are separated by a blank line (two consecutive newlines), so you'd use `text.
Splitting a string with a sentence as the delimiter: you can also split a string by passing a whole sentence as the delimiter. Each time the specified sentence occurs, the string is divided there into separate tokens.
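As a rough illustration of splitting on a literal delimiter in Java: String.split treats its argument as a regular expression, so a delimiter that contains dots or other metacharacters should be escaped first. The sketch below uses Pattern.quote for that. The class name DelimiterSplit, the helper splitOnLiteral and the sample text are made up for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DelimiterSplit {
    // Hypothetical helper: splits text on every occurrence of a literal delimiter.
    static String[] splitOnLiteral(String text, String delimiter) {
        // Pattern.quote escapes regex metacharacters (such as '.') so the
        // delimiter is matched character for character.
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String text = "alpha. AND THEN. beta. AND THEN. gamma";
        // Prints [alpha, beta, gamma]
        System.out.println(Arrays.toString(splitOnLiteral(text, ". AND THEN. ")));
    }
}
```

Without the quoting step, each dot in the delimiter would match any character, which is the same reason split(".\n") does not behave like "a full stop followed by a newline".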
You can try this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2. They all got split by the above code.";
// A sentence runs up to a '.', '!' or '?' that is followed by whitespace or end of input,
// so dots inside tokens such as "Jan.13", "U.S" and "2.2" do not end it.
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE);
Matcher reMatcher = re.matcher(str);
while (reMatcher.find()) {
    System.out.println(reMatcher.group());
}
Output:
This is how I tried to split a paragraph into a sentence.
But, there is a problem.
My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2.
They all got split by the above code.
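Note that the sample string uses "Jan.13" with no space after the dot. The pattern treats a dot followed by whitespace as a sentence end, so the "Jan. 13, 2014" spelling from the question would still be cut after "Jan.". One possible workaround, sketched here under a stated assumption rather than as a definitive fix, is to split only where the whitespace after the punctuation is followed by an uppercase letter. The class name SentenceSplitSketch, the helper splitSentences and the sample text are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

class SentenceSplitSketch {
    // Assumption: a sentence ends at '.', '!' or '?' followed by whitespace
    // and then an uppercase letter. Dots inside "2.2" or "U.S" (no whitespace
    // after the dot) and "Jan. 13" (digit after the whitespace) are left alone.
    static List<String> splitSentences(String text) {
        // Lookbehind and lookahead keep the punctuation and the capital letter;
        // only the whitespace between them is consumed by the split.
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+(?=\\p{Lu})"));
    }

    public static void main(String[] args) {
        String text = "Prices rose 2.2 percent on Jan. 13, 2014. "
                + "The U.S market reacted. Did it recover? Yes!";
        // Prints four lines, one per sentence.
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

This heuristic still breaks after abbreviations that are followed by a capitalized word, such as "Dr. Smith", so text with many such abbreviations would need an exception list or a dedicated sentence tokenizer.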