Comparing one data structure against another resulting in run time of over 50 mins

Question

I'm writing code which reads in a text file (each line a tweet) and goes through each tweet comparing it against a list of English words to see if the word is misspelled.

So the list of English words is read in from a text file as well, this is then stored in a List. When I run the code for this alone, it operates in less than one second. When I run the code for storing each word in the tweet file (without checking for spelling) for the 1,000,000 tweets, it stores each word and its frequency in a HashMap<String, Integer> in around 20-30sec.

But when I add the line to check if the word is spelled correctly, it causes a ridiculous run time increase, to the point where I could almost watch a movie before it finished running.

The simple aspect of invoking isSpelledCorrectly(X) (which just invokes list.contains(x), which has a worst case run-time of O(n)), yet it seems quite confounding that it causes the code to go from a 30 sec runtime to a 50 min runtime?

Code:

Spelling:

static List<String> spellCheck = new ArrayList<String>();

public AssignTwo() throws IOException{
    spellCheck = initCorrectSpelling("C:\Users\Gregs\InfoRetrieval\src\english-words");
}

public static List<String> initCorrectSpelling(String filename) throws IOException { //store correct spelling of words in list
    Scanner scanner = new Scanner(new FileInputStream(filename));
    try{
        while(scanner.hasNextLine()){
            String next = scanner.nextLine();
            spellCheck.add(next);
        }
    }
    finally{
        scanner.close();
    }
    return spellCheck;
}

public static boolean isSpelledCorrectly(String word){  //check if any given word is spelled correctly by seeing if it is 
    boolean output = false;                             //contained within the spellCheck list
    if(spellCheck.contains(word)) output = true;                    
    return output;
}

Code storing Tweets:

public static HashMap<String, Integer> misSpell;
public AssignOne() throws IOException {         //read in file from path, test functions
    index("C:\Users\Gregs\InfoRetrieval\src\tweets");
}

public static void index(String filename)  throws IOException {
    misSpell = new HashMap<String, Integer>();
    Scanner scanner = new Scanner(new FileInputStream(filename));
    try{
        while(scanner.hasNextLine()){
            String line = scanner.nextLine();       
            String[] lineArr = line.split(" ");
            for(int i=3; i<lineArr.length; i++){
                int count=1;
                lineArr[i] = lineArr[i].replaceAll("[^a-zA-Z0-9]", "");
                //if(!AssignTwo.isSpelledCorrectly(lineArr[i].toLowerCase())){  //with this line commented out, runtime <30sec, with line >50mins
                    if(misSpell.containsKey(lineArr[i].toLowerCase())){             
                        count = 1 + misSpell.get(lineArr[i].toLowerCase());             
                    }
                    misSpell.put(lineArr[i].toLowerCase(), count);
                //}
            }
        }
    } 
    finally{
        scanner.close();
    }
}

Any suggestion on where to improve code or how to make the comparisons more efficient? Is there a faster data structure for correctly spelled words?

JB Nizet · Accepted Answer

List.contains() is O(N), N being the number of words in the dictionary.

Use a HashSet, where contains() is O(1).

Using A buffered reader would also speed things up. And avoiding to call toLowerCase() three times on each word would too.

Comparing one data structure against another resulting in run time of over 50 mins

Tags:

java

performance

hashmap

spelling

NoName

1 Answers

JB Nizet

Recent Activity

Donate For Us

Comparing one data structure against another resulting in run time of over 50 mins

Tags:

java

performance

hashmap

spelling

NoName

1 Answers

JB Nizet

Related questions

Recent Activity

Donate For Us