
Reading data from a UTF-8 text file and tokenizing it

I'm trying to read UTF-8 from a text file and do some tokenization, but I'm having issues with the encoding:

try {
    fis = new FileInputStream(fName);
} catch (FileNotFoundException ex) {
    //...
}

DataInputStream myInput = new DataInputStream(fis);
try {
    while ((thisLine = myInput.readLine()) != null) {
        StringTokenizer st = new StringTokenizer(thisLine, ";");
        while (st.hasMoreElements()) {
            // do something with st.nextToken();
        }
    }
} catch (Exception e) {
    //...
}

And DataInputStream doesn't have any parameter for setting the encoding!


2 Answers

Let me quote the Javadoc for this method.

DataInputStream.readLine()

Deprecated. This method does not properly convert bytes to characters. As of JDK 1.1, the preferred way to read lines of text is via the BufferedReader.readLine() method. Programs that use the DataInputStream class to read lines can be converted to use the BufferedReader class by replacing code of the form:

     DataInputStream d = new DataInputStream(in);

with:

     BufferedReader d
          = new BufferedReader(new InputStreamReader(in));

BTW: JDK 1.1 came out in Feb 1997 so this shouldn't be new to you.

Just think how much time everyone would have saved if you had read the Javadoc. ;)
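To make the Javadoc's suggestion concrete for the question's case, here is a minimal sketch. The form quoted above, without a charset argument, decodes with the JVM's default charset, so this version passes UTF-8 to InputStreamReader explicitly. The class name Utf8LineTokenizer, the method readTokens, and the in-memory sample data are made up for illustration and are not from the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, named only for this example.
class Utf8LineTokenizer {

    // Decodes the stream as UTF-8 and splits every line on ';'.
    static List<String> readTokens(InputStream in) {
        List<String> tokens = new ArrayList<>();
        // The explicit charset is the important part: without it,
        // InputStreamReader falls back to the JVM's default charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line, ";");
                while (st.hasMoreTokens()) {
                    tokens.add(st.nextToken());
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the file; for a real file, pass
        // new FileInputStream(fName) instead.
        byte[] sample = "caf\u00e9;na\u00efve\nd\u00e9j\u00e0;vu"
                .getBytes(StandardCharsets.UTF_8);
        for (String token : readTokens(new ByteArrayInputStream(sample))) {
            System.out.println(token);
        }
    }
}
```

The try-with-resources block closes the reader, and with it the underlying stream, once reading finishes.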

Peter Lawrey, answered Dec 10 '25 01:12


You can use InputStreamReader:

BufferedReader br = new BufferedReader(new InputStreamReader(source, charset));
String line;
while ((line = br.readLine()) != null) { ... }

You could also try Scanner, but I'm not sure it would work as well.
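For what it's worth, Scanner does have a constructor that takes a charset name, so a version along these lines should be possible. This is only a sketch: the class name Utf8ScannerTokenizer, the method readTokens, and the delimiter pattern (which treats ';' and line breaks alike as separators) are assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, named only for this example.
class Utf8ScannerTokenizer {

    // Decodes the stream as UTF-8 and splits on ';' and line breaks.
    static List<String> readTokens(InputStream in) {
        List<String> tokens = new ArrayList<>();
        // Scanner(InputStream, String) takes the charset by name.
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            // One or more of ';', CR or LF counts as a single delimiter,
            // so no empty tokens are produced between them.
            scanner.useDelimiter("[;\\r\\n]+");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the file contents.
        byte[] sample = "caf\u00e9;na\u00efve\nd\u00e9j\u00e0;vu"
                .getBytes(StandardCharsets.UTF_8);
        for (String token : readTokens(new ByteArrayInputStream(sample))) {
            System.out.println(token);
        }
    }
}
```

Unlike the BufferedReader loop, this does not keep track of which line a token came from, so it only fits when line boundaries carry no meaning beyond separating tokens.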

Roman, answered Dec 10 '25 02:12


