I've been having an issue getting java.util.Scanner to read a text file I saved in Notepad, even though it works fine with others. Basically, when it tries to read the problem file, it comes up completely empty-handed -- hasNextLine() is false, the buffer is empty, etc. I narrowed it down to the fact that it won't even read the first line if there is a curly quote anywhere in the file. No exceptions are thrown. Note that a BufferedReader on the same file doesn't have a problem.
try {
    int count = 0;
    Scanner scanner = new Scanner(new File("C:/myfile.txt"));
    while (scanner.hasNextLine()) {
        count++;
        scanner.nextLine();
    }
    scanner.close();
    System.out.print(count);
    count = 0;
    BufferedReader reader = new BufferedReader(new FileReader("C:/myfile.txt"));
    while (reader.readLine() != null) {
        count++;
    }
    reader.close();
    System.out.print(count);
}
catch(IOException e) {
    e.printStackTrace();
}
The above code, reading a file that contains nothing but a single curly quote, prints out "01". Searches on Google led me to try this:
Scanner scanner = new Scanner(new File("C:/myfile.txt"), "ISO-8859-1");
This makes it work (i.e., it prints out "11"). I also noticed that if I go into Notepad and do a Save As..., the default encoding at the bottom is "ANSI". If I change this to "UTF-8" and save the file, then the scanner (without an encoding) also works. If I tell the scanner "UTF-8", then understandably it only works if I save as UTF-8, but "ISO-8859-1" seems to make it work even if I save it as "ANSI".
So, I know it has something to do with file encoding, but the problem is I don't understand anything about file encoding. My knowledge of what "ISO-8859-1" means is extremely vague; why does that make it work no matter how I save the file? Why does BufferedReader work regardless?
EDIT:
The links/comments below really helped point me in the right direction! I think I've got it figured out.
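First of all, in Notepad, "ANSI" means the system's ANSI code page, which is Windows-1252 (CP1252) on a typical Western-locale system.
In hexadecimal, a curly apostrophe (U+2019) is represented as 92 in CP1252 and as E2 80 99 in UTF-8.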
The default encoding Java uses on my system, according to Charset.defaultCharset(), is UTF-8. So when I saved the file in UTF-8, the scanner knew what to expect. When I saved the file in CP1252, however, it choked once it hit that "92", because a lone 0x92 byte is not a valid way to represent a character in UTF-8. It works fine as long as there aren't any such characters in the file -- the bytes for "hello world" happen to be the same in both CP1252 and UTF-8, so they don't cause a problem.
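To make the byte-level difference concrete, here is a minimal sketch that prints how the curly apostrophe is encoded in each charset and then checks whether a lone 0x92 byte decodes under a strict decoder. The class and helper names (CurlyQuoteBytes, hex, decodesCleanly) are made up for illustration, and it assumes the JVM ships the "windows-1252" charset, which standard builds generally do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CurlyQuoteBytes {
    // Render a byte array as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Strict decode: a decoder from newDecoder() reports malformed input
    // by throwing instead of silently substituting a replacement character.
    static boolean decodesCleanly(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // right single quotation mark (curly apostrophe)
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available

        System.out.println("CP1252:   " + hex(quote.getBytes(cp1252)));
        System.out.println("UTF-8:    " + hex(quote.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16LE: " + hex(quote.getBytes(StandardCharsets.UTF_16LE)));

        byte[] lone92 = {(byte) 0x92};
        System.out.println("0x92 valid as UTF-8?  " + decodesCleanly(lone92, StandardCharsets.UTF_8));
        System.out.println("0x92 valid as CP1252? " + decodesCleanly(lone92, cp1252));
    }
}
```

The strict check is the interesting part: in UTF-8, 0x92 can only appear as a continuation byte inside a multi-byte sequence, so on its own it is malformed.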
A scanner set to UTF-8 doesn't work with a UTF-16 file, because the bytes of the byte order mark ("FF FE") are never valid in UTF-8, regardless of what characters are in the file.
On the other hand, when I set the scanner to CP1252 or ISO-8859-1, it's much more tolerant. It doesn't necessarily interpret the characters correctly, mind you, but there's nothing that prevents it from recognizing lines in the file and looping through.
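A small sketch of what "tolerant but not necessarily correct" means here: both single-byte charsets accept the byte 0x92, but they map it to different characters. The class and method names (TolerantDecoding, codePoint) are hypothetical, and the availability of "windows-1252" is again an assumption about the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TolerantDecoding {
    // Decode a single byte with the given charset and describe the resulting character.
    static String codePoint(byte b, Charset cs) {
        String s = new String(new byte[] {b}, cs);
        return String.format("U+%04X", (int) s.charAt(0));
    }

    public static void main(String[] args) {
        byte b = (byte) 0x92;
        // ISO-8859-1 maps every byte value straight to the code point with the same number,
        // so 0x92 becomes an invisible C1 control character.
        System.out.println("ISO-8859-1:   " + codePoint(b, StandardCharsets.ISO_8859_1));
        // windows-1252 reuses part of the 0x80-0x9F range for printable characters,
        // including the curly apostrophe.
        System.out.println("windows-1252: " + codePoint(b, Charset.forName("windows-1252")));
    }
}
```

So a scanner set to ISO-8859-1 will loop through an "ANSI" file without complaint, but the curly quote comes out as a control character rather than as U+2019.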
As far as why Scanner has a problem but the FileReader/BufferedReader does not, I am going to guess that it's because the scanner needs to tokenize the file, i.e., interpret the characters so it can identify whitespace and other patterns, so it chokes when there's something unrecognizable. The reader doesn't need to do that. All it needs to identify are the line breaks.
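One way to test that guess is to ask the scanner what went wrong: Scanner swallows I/O errors from its source, and ioException() returns the last one. The sketch below is an experiment, not a definitive explanation. It writes a one-byte file containing 0x92 to a temp location, reads it with a Scanner told to use UTF-8, then reads it again through an InputStreamReader, which is what FileReader is built on. The class and method names are made up for illustration. As far as I can tell, the scanner's decoder reports the malformed byte as an error, while the reader's decoder quietly substitutes U+FFFD.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

public class ScannerVsReader {
    // Create a temp file holding a single CP1252 curly apostrophe (byte 0x92).
    static Path writeSample() throws IOException {
        Path p = Files.createTempFile("curly", ".txt");
        Files.write(p, new byte[] {(byte) 0x92});
        return p;
    }

    // Count lines with a UTF-8 Scanner; store the scanner's swallowed exception in errorOut[0].
    static int countWithScanner(Path p, IOException[] errorOut) throws IOException {
        int count = 0;
        try (Scanner scanner = new Scanner(p.toFile(), "UTF-8")) {
            while (scanner.hasNextLine()) {
                count++;
                scanner.nextLine();
            }
            errorOut[0] = scanner.ioException(); // null if nothing went wrong
        }
        return count;
    }

    // Read the first line through an InputStreamReader, as FileReader does internally.
    static String firstLineWithReader(Path p) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(p), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = writeSample();
        try {
            IOException[] error = new IOException[1];
            System.out.println("Scanner lines: " + countWithScanner(p, error));
            System.out.println("Scanner error: " + error[0]);

            String line = firstLineWithReader(p);
            System.out.printf("Reader char:   U+%04X%n", (int) line.charAt(0));
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```

If the scanner reports a character-coding exception here, the difference comes from how each class configures its decoder rather than from tokenizing as such. Checking ioException() after a loop that ends suspiciously early is a cheap habit either way.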
hasNextLine() is a method of Java's Scanner class that checks whether there is another line in the scanner's input. It returns true if there is one and false otherwise.
If you don't specify an encoding when you create the scanner, it doesn't try to detect one from the file; it uses the platform default charset (Charset.defaultCharset()). On Windows that has traditionally been cp-1252, although you report UTF-8 on your system. Notepad's "ANSI" option saves your text file as cp-1252, which is similar to, but not the same as, ISO-8859-1. See this link for more details:
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
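When you save it as UTF-8, Notepad probably places the UTF-8 BOM (EF BB BF) at the beginning of the file. The scanner doesn't use the BOM to pick an encoding, though: the file reads correctly because its bytes are valid in your UTF-8 default, and the BOM itself comes through as a U+FEFF character at the start of the first line.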
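Here is a rough sketch for checking what happens to a UTF-8 BOM, assuming a file that starts with EF BB BF followed by "hi". The class and method names (BomDemo, firstLine) are hypothetical. Java's UTF-8 decoder does not strip the BOM, so the first line should come back one character longer than it looks.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

public class BomDemo {
    // Write the bytes to a temp file and return the first line as read by a UTF-8 Scanner.
    static String firstLine(byte[] content) throws IOException {
        Path p = Files.createTempFile("bom", ".txt");
        try {
            Files.write(p, content);
            try (Scanner scanner = new Scanner(p.toFile(), "UTF-8")) {
                return scanner.hasNextLine() ? scanner.nextLine() : null;
            }
        } finally {
            Files.deleteIfExists(p);
        }
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 BOM (EF BB BF) followed by the ASCII text "hi".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String line = firstLine(withBom);

        System.out.println("Length: " + line.length());
        System.out.printf("First char: U+%04X%n", (int) line.charAt(0));

        // Dropping a leading U+FEFF by hand, if the BOM is unwanted.
        String cleaned = line.startsWith("\uFEFF") ? line.substring(1) : line;
        System.out.println("Cleaned: " + cleaned);
    }
}
```

This matters mostly when the first line is compared or parsed: a leading U+FEFF is invisible when printed but will make an equals() check against "hi" fail.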
If you want to look more into BOMs, look it up on Wikipedia -- the article is quite good. You can also download PSPad and open the text file in hex mode to see the individual bytes. Hope that helps :)