I'm writing code which reads in a text file (each line a tweet) and goes through each tweet comparing it against a list of English words to see if the word is misspelled.
So the list of English words is read in from a text file as well, this is then stored in a List. When I run the code for this alone, it operates in less than one second. When I run the code for storing each word in the tweet file (without checking for spelling) for the 1,000,000 tweets, it stores each word and its frequency in a HashMap<String, Integer> in around 20-30sec.
But when I add the line to check if the word is spelled correctly, it causes a ridiculous run time increase, to the point where I could almost watch a movie before it finished running.
The simple aspect of invoking isSpelledCorrectly(X) (which just invokes list.contains(x), which has a worst case run-time of O(n)), yet it seems quite confounding that it causes the code to go from a 30 sec runtime to a 50 min runtime?
Code:
Spelling:
static List<String> spellCheck = new ArrayList<String>();
public AssignTwo() throws IOException{
spellCheck = initCorrectSpelling("C:\\Users\\Gregs\\InfoRetrieval\\src\\english-words");
}
public static List<String> initCorrectSpelling(String filename) throws IOException { //store correct spelling of words in list
Scanner scanner = new Scanner(new FileInputStream(filename));
try{
while(scanner.hasNextLine()){
String next = scanner.nextLine();
spellCheck.add(next);
}
}
finally{
scanner.close();
}
return spellCheck;
}
public static boolean isSpelledCorrectly(String word){ //check if any given word is spelled correctly by seeing if it is
boolean output = false; //contained within the spellCheck list
if(spellCheck.contains(word)) output = true;
return output;
}
Code storing Tweets:
public static HashMap<String, Integer> misSpell;
public AssignOne() throws IOException { //read in file from path, test functions
index("C:\\Users\\Gregs\\InfoRetrieval\\src\\tweets");
}
public static void index(String filename) throws IOException {
misSpell = new HashMap<String, Integer>();
Scanner scanner = new Scanner(new FileInputStream(filename));
try{
while(scanner.hasNextLine()){
String line = scanner.nextLine();
String[] lineArr = line.split(" ");
for(int i=3; i<lineArr.length; i++){
int count=1;
lineArr[i] = lineArr[i].replaceAll("[^a-zA-Z0-9]", "");
//if(!AssignTwo.isSpelledCorrectly(lineArr[i].toLowerCase())){ //with this line commented out, runtime <30sec, with line >50mins
if(misSpell.containsKey(lineArr[i].toLowerCase())){
count = 1 + misSpell.get(lineArr[i].toLowerCase());
}
misSpell.put(lineArr[i].toLowerCase(), count);
//}
}
}
}
finally{
scanner.close();
}
}
Any suggestion on where to improve code or how to make the comparisons more efficient? Is there a faster data structure for correctly spelled words?
List.contains() is O(N), N being the number of words in the dictionary.
Use a HashSet, where contains() is O(1).
Using A buffered reader would also speed things up. And avoiding to call toLowerCase() three times on each word would too.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With