
Extra white space when using regex in Scanner.useDelimiter()

Tags: java, regex

I am trying to read a text file, whose name the user enters, with a Scanner and split it into words using a custom delimiter. One of the rules is that an apostrophe at the beginning or end of a word should be treated as a delimiter, while apostrophes inside a word are left alone. For example, if the scanner sees a word such as 'tis, Scanner.useDelimiter() should strip the apostrophe and leave the word "tis", but if it sees a word like "don't", it should leave the word as is.

I am using a regex to cover the multiple cases that should delimit the words. The regex does what I need, but for some reason my output contains an extra blank line before words that are preceded by a space and then an apostrophe. I am new to regex and don't know how to fix this; any suggestions would be greatly appreciated.

Below are the words in my text file:

'Twas the night before christmas! But don't open your presents. 'Tis the only way to celebrate.

Code:

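  import java.io.File;
  import java.io.FileNotFoundException;
  import java.util.Scanner;
  import java.util.regex.Pattern;

  public static void main(String[] args) throws FileNotFoundException {
      // Delimiter: punctuation/whitespace other than ', or an apostrophe at a word edge
      Pattern p = Pattern.compile("[\\p{Punct}\\s&&[^']]+|('(?![\\w]))+|((?<![\\w])')+");
      System.out.println("Please enter a text file name.");

      Scanner sc = new Scanner(System.in);
      File file = new File(sc.nextLine());

      Scanner nSc = new Scanner(file);
      nSc.useDelimiter(p);

      while (nSc.hasNext()) {
          String word = nSc.next().toLowerCase();
          System.out.println(word);
      }
      nSc.close();
  }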

Expected:

twas 
the 
night 
before 
christmas 
but 
don't 
open 
your 
presents 
tis 
the 
only 
way 
to 
celebrate

Actual:

twas 
the 
night 
before 
christmas 
but 
don't 
open 
your 
presents

tis 
the 
only 
way 
to 
celebrate
asked Dec 22 '25 00:12 by ssang


1 Answer

You can use the regex '?\b\w+'?\w+\b to grab the desired words from the string, and then strip a leading apostrophe from each match by replacing the regex ^' with an empty string. Anchoring the replacement at the start of the match (^) leaves apostrophes inside words such as don't untouched.

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        String str = "'Twas the night before christmas! But don't open your presents. 'Tis the only way to celebrate.";
        // Matcher.results() requires Java 9 or later
        List<String> list = Pattern.compile("'?\\b\\w+'?\\w+\\b")
                .matcher(str)
                .results()
                // Remove only an apostrophe at the very start of the match
                .map(r -> r.group().replaceAll("^'", ""))
                .collect(Collectors.toList());

        System.out.println(list);
    }
}

Output:

[Twas, the, night, before, christmas, But, don't, open, your, presents, Tis, the, only, way, to, celebrate]

Explanation of the regex '?\b\w+'?\w+\b:

  1. \b specifies a word boundary.
  2. \w+ specifies one or more word characters.
  3. '? specifies an optional '
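
One thing to be aware of with the pattern explained above: it contains two \w+ parts, so it only matches words of at least two characters and would skip a one-letter word such as "a". The following is a minimal sketch of a variant, not part of the original answer, that matches a run of word characters optionally followed by apostrophe-plus-word-character groups. Because a match can only start and end on a word character, leading and trailing apostrophes are never captured and no second replacement step is needed. The class name WordExtractor and the method words are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // \w+        : one or more word characters
    // (?:'\w+)*  : zero or more groups of an apostrophe followed by word characters,
    //              so "don't" is kept whole while a leading or trailing ' is left out
    private static final Pattern WORD = Pattern.compile("\\w+(?:'\\w+)*");

    // Returns every word found in the text, lowercased, in order of appearance.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group().toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("'Tis a 'quoted' word, don't you think?"));
    }
}
```

Note that \w also matches digits and underscores, so this sketch treats "abc_123" as one word; whether that is acceptable depends on the input.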

If you are not familiar with the Stream API, you can do it as follows:

// Requires java.util.Scanner, java.util.regex.Matcher and java.util.regex.Pattern;
// "file" is the File object from the question.
Pattern pattern = Pattern.compile("'?\\b\\w+'?\\w+\\b");
Scanner nSc = new Scanner(file);
while (nSc.hasNextLine()) {
    String line = nSc.nextLine().toLowerCase();
    Matcher matcher = pattern.matcher(line);
    while (matcher.find()) {
        String word = matcher.group();
        // Remove only an apostrophe at the very start of the match
        System.out.println(word.replaceAll("^'", ""));
    }
}
nSc.close();
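
If you would rather keep the Scanner.useDelimiter() approach from the question, the blank line appears to come from how the delimiter is quantified. In the original pattern each alternative carries its own +, so in the text ". 'Tis" the first alternative consumes ". " as one delimiter and the apostrophe is then matched as a second, separate delimiter. Scanner returns the empty token that lies between two adjacent delimiter matches. The sketch below, an assumption-labeled variant and not the answer's method, wraps all alternatives in a single group with one +, so the whole run ". '" counts as one delimiter. The class name DelimiterDemo and the method tokenize are hypothetical, and the sketch reads from a String instead of a file to stay self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class DelimiterDemo {
    // Same three alternatives as in the question, but grouped under a single '+':
    //   [\p{Punct}\s&&[^']]  punctuation or whitespace, except the apostrophe
    //   '(?!\w)              an apostrophe not followed by a word character
    //   (?<!\w)'             an apostrophe not preceded by a word character
    private static final Pattern DELIMITER = Pattern.compile(
            "(?:[\\p{Punct}\\s&&[^']]|'(?!\\w)|(?<!\\w)')+");

    // Splits the text with the delimiter above and lowercases each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(DELIMITER);
            while (sc.hasNext()) {
                tokens.add(sc.next().toLowerCase());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "'Twas the night before christmas! But don't open your presents. "
                + "'Tis the only way to celebrate.";
        for (String word : tokenize(text)) {
            System.out.println(word);
        }
    }
}
```

To read from a file as in the question, the same pattern can be passed to useDelimiter() on a Scanner constructed from the File.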
answered Dec 23 '25 14:12 by Arvind Kumar Avinash


