I recently had to do a coding challenge for a company: merge 3 CSV files into one based on the first attribute of each (that attribute is repeated across all the files).
I wrote the code and sent it to them, but they said it took 2 minutes to run. That was funny, because it ran in 10 seconds on my machine, which had the same processor, 16 GB of RAM, and an SSD as well; the environments were very similar.
I tried optimising it and resubmitted. This time they said they ran it on an Ubuntu machine and got 11 seconds, while it still took 100 seconds on Windows 10.
Another peculiar thing was that when I tried profiling it with the Profile module, it went on forever and I had to terminate it after 450 seconds. I moved to cProfile and it recorded the run in 7 seconds.
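In case it helps anyone reproduce this, here is roughly how cProfile can be driven with sorted output (just a sketch; the sort key and the number of lines printed are arbitrary choices, not necessarily what my report below used):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
merge_files()    #the merge_files() defined in the full code below
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)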
EDIT: The exact formulation of the problem is
Write a console program to merge the files provided in a timely and efficient manner. File paths should be supplied as arguments so that the program can be evaluated on different data sets. The merged file should be saved as CSV; use the id column as the unique key for merging; the program should do any necessary data cleaning and error checking.
Feel free to use any language you’re comfortable with – only restriction is no external libraries as this defeats the purpose of the test. If the language provides CSV parsing libraries (like Python), please avoid using them as well as this is a part of the test.
Without further ado here's the code:
#!/usr/bin/python3
import sys
from multiprocessing import Pool
HEADERS = ['id']
def csv_tuple_quotes_valid(a_tuple):
    """
    checks whether the quotes in each attribute of an entry (i.e. a tuple) agree with the CSV format
    returns True or False
    """
    for attribute in a_tuple:
        in_quotes = False
        attr_len = len(attribute)
        skip_next = False
        for i in range(0, attr_len):
            if not skip_next and attribute[i] == '\"':
                if i < attr_len - 1 and attribute[i + 1] == '\"':
                    skip_next = True
                    continue
                elif i == 0 or i == attr_len - 1:
                    in_quotes = not in_quotes
                else:
                    return False
            else:
                skip_next = False
        if in_quotes:
            return False
    return True
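#examples: csv_tuple_quotes_valid(['"a""b"', 'c']) -> True  (an escaped "" inside a quoted attr is fine)
#          csv_tuple_quotes_valid(['a"b'])         -> False (a stray " in the middle of an attr)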
def check_and_parse_potential_tuple(to_parse):
    """
    receives a string and returns an array of the attributes of the csv line
    if the string was not a valid csv line, then returns False
    """
    a_tuple = []
    attribute_start_index = 0
    to_parse_len = len(to_parse)
    in_quotes = False
    i = 0
    #iterate through the string (line from the csv)
    while i < to_parse_len:
        current_char = to_parse[i]
        #this works the following way: if we meet a quote ("), it must be in one
        #of five cases: "" | ", | ," | "\0 | (start_of_string)"
        #in case we are inside a quoted attribute (i.e. "123"), then commas are ignored
        #the following code also extracts the tuples' attributes 
        if current_char == '\"':
            if i == 0 or (to_parse[i - 1] == ',' and not in_quotes): # (start_of_string)" and ," case
                #not including the quote in the next attr
                attribute_start_index = i + 1
                #starting a quoted attr
                in_quotes = True
            elif i + 1 < to_parse_len:
                if to_parse[i + 1] == '\"': # "" case
                    i += 1 #skip the next " because it is part of a ""
                elif to_parse[i + 1] == ',' and in_quotes: # ", case
                    a_tuple.append(to_parse[attribute_start_index:i].strip())
                    #not including the quote and comma in the next attr
                    attribute_start_index = i + 2
                    in_quotes = False #the quoted attr has ended
                    #skip the next comma - we know what it is for
                    i += 1
                else:
                    #since we cannot have a random " in the middle of an attr
                    return False 
            elif i == to_parse_len - 1: # "\0 case
                a_tuple.append(to_parse[attribute_start_index:i].strip())
                #reached end of line, so no more attr's to extract
                attribute_start_index = to_parse_len
                in_quotes = False
            else:
                return False
        elif current_char == ',':
            if not in_quotes:
                a_tuple.append(to_parse[attribute_start_index:i].strip())
                attribute_start_index = i + 1
        i += 1
    #in case the last attr was left empty or unquoted
    if attribute_start_index < to_parse_len or (not in_quotes and to_parse[-1] == ','):
        a_tuple.append(to_parse[attribute_start_index:])
    #line ended while parsing; i.e. a quote was opened but not closed
    if in_quotes:
        return False
    return a_tuple
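#example: check_and_parse_potential_tuple('1,"a,b",c') -> ['1', 'a,b', 'c']
#         (the comma inside the quoted attribute is kept, the surrounding quotes are stripped)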
def parse_tuple(to_parse, no_of_headers):
    """
    parses a string and returns an array with no_of_headers number of headers
    raises an error if the string was not a valid CSV line
    """
    #get rid of the newline at the end of every line
    to_parse = to_parse.strip()
    # return to_parse.split(',') #if we assume the data is in a valid format
    #the following format checking roughly doubles the execution time;
    #if the data is known to be valid, uncomment the return statement above instead
    #if there are more commas than fields, then we must take into consideration
    #how the quotes parse and then extract the attributes
    if to_parse.count(',') + 1 > no_of_headers:
        result = check_and_parse_potential_tuple(to_parse)
        if result:
            a_tuple = result
        else:
            raise TypeError('Error while parsing CSV line %s. The quotes do not parse' % to_parse)
    else:
        a_tuple = to_parse.split(',')
        if not csv_tuple_quotes_valid(a_tuple):
            raise TypeError('Error while parsing CSV line %s. The quotes do not parse' % to_parse)
    #if the line parsed into fewer fields than there are headers, the indexing below raises IndexError
    #a try/except here works faster than an if statement that checks the length of a_tuple
    try:
        a_tuple[no_of_headers - 1]
    except IndexError:
        raise TypeError('Error while parsing CSV line %s. Unknown reason' % to_parse)
    #this interning replaces the use of my own hashtables to store the duplicated attribute values
    for i in range(1, no_of_headers):
        a_tuple[i] = sys.intern(a_tuple[i])
    return a_tuple
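#example: parse_tuple('1,"a,b",c\n', 3) -> ['1', 'a,b', 'c'], while a line with
#unbalanced quotes or too few fields raises TypeError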
def read_file(path, file_number):
    """
    reads the csv file and returns (dict, int)
    the dict is the mapping of id's to attributes
    the integer is the number of attributes (headers) for the csv file
    """
    global HEADERS
    try:
        file = open(path, 'r')
    except FileNotFoundError as e:
        print('error in %s:\n%s\nexiting...' % (path, e))
        exit(1)
    main_table = {}
    headers = file.readline().strip().split(',')
    no_of_headers = len(headers)
    HEADERS.extend(headers[1:]) #keep the headers from the file
    lines = file.readlines()
    file.close()
    args = []
    for line in lines:
        args.append((line, no_of_headers))
    #pool is a pool of worker processes parsing the lines in parallel
    with Pool() as workers:
        try:
            all_tuples = workers.starmap(parse_tuple, args, 1000)
        except TypeError as e:
            print('Error in file %s:\n%s\nexiting thread...' % (path, e.args))
            exit(1)
    for a_tuple in all_tuples:
        #add quotes to key if needed
        key = a_tuple[0] if a_tuple[0][0] == '\"' else ('\"%s\"' % a_tuple[0])
        main_table[key] = a_tuple[1:]
    return (main_table, no_of_headers)
def merge_files():
    """
    produces a file called merged.csv 
    """
    global HEADERS
    no_of_files = len(sys.argv) - 1
    processed_files = [None] * no_of_files
    for i in range(0, no_of_files):
        processed_files[i] = read_file(sys.argv[i + 1], i)
    out_file = open('merged.csv', 'w+')
    merged_str = ','.join(HEADERS)
    all_keys = {}
    #this is to ensure that we include all keys in the final file.
    #even those that are missing from some files and present in others
    for processed_file in processed_files:
        all_keys.update(processed_file[0])
    for key in all_keys:
        merged_str += '\n%s' % key
        for i in range(0, no_of_files):
            (main_table, no_of_headers) = processed_files[i]
            try:
                for attr in main_table[key]:
                    merged_str += ',%s' % attr
            except KeyError:
                print('NOTE: no values found for id %s in file \"%s\"' % (key, sys.argv[i + 1]))
                merged_str += ',' * (no_of_headers - 1)
    out_file.write(merged_str)
    out_file.close()
if __name__ == '__main__':
    # merge_files()
    import cProfile
    cProfile.run('merge_files()')
# import time
# start = time.time()
# merge_files()
# print(time.time() - start)
Here is the profiler report I got on my Windows machine.
EDIT: The rest of the CSV data provided is here. Pastebin was taking too long to process the files, so...
It might not be the best code, and I know that, but my question is: what slows it down so much on Windows that doesn't slow it down on Ubuntu? The merge_files() function takes the longest, with 94 seconds just for itself, not including the calls to other functions, and nothing obvious stands out to me as the reason it is so slow.
Thanks
EDIT: Note: we both ran the code on the same dataset.
It turns out that Windows and Linux handle very long strings differently. When I moved the out_file.write(merged_str) inside the outer for loop (for key in all_keys:) and stopped appending everything to one huge merged_str, it ran in 11 seconds as expected. I don't know enough about either OS's memory management to predict why the difference is so large.
That said, I would say the second method (the row-by-row writing that fixes the Windows run) is the more fail-safe one, because it is unreasonable to keep a 30 MB string in memory anyway. Linux apparently just copes with that better and doesn't end up rebuilding or re-caching the whole string on every append.
Funnily enough, I did initially try both writing strategies a few times on my Linux machine, and the one with the large string seemed to be faster, so I stuck with it. I guess you never know.
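If anyone wants to see the effect in isolation, here is a small self-contained benchmark sketch (purely synthetic data, nothing from the challenge's dataset; the row count is arbitrary) comparing the two writing strategies:

import time

def build_one_big_string(rows):
    #strategy 1: append every row to one huge string, write once at the end
    merged_str = ''
    for row in rows:
        merged_str += '\n%s' % ','.join(row)
    with open('out_big.csv', 'w') as f:
        f.write(merged_str)

def write_row_by_row(rows):
    #strategy 2: build only one row at a time and write it immediately
    with open('out_rows.csv', 'w') as f:
        for row in rows:
            f.write(','.join(row) + '\n')

if __name__ == '__main__':
    rows = [[str(i)] * 10 for i in range(300000)]    #a few hundred thousand short rows
    for strategy in (build_one_big_string, write_row_by_row):
        start = time.time()
        strategy(rows)
        print(strategy.__name__, time.time() - start)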
Here's the modified code:
    for key in all_keys:
        merged_str = '%s' % key
        for i in range(0, no_of_files):
            (main_table, no_of_headers) = processed_files[i]
            try:
                for attr in main_table[key]:
                    merged_str += ',%s' % attr
            except KeyError:
                print('NOTE: no values found for id %s in file \"%s\"' % (key, sys.argv[i + 1]))
                merged_str += ',' * (no_of_headers - 1)
        out_file.write(merged_str + '\n')
    out_file.close()
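For what it's worth, a variant that avoids repeated string concatenation altogether (just a sketch of the same loop, producing the same rows) is to collect the pieces of each row in a list and join them once per row:

    for key in all_keys:
        row_parts = [key]
        for i in range(0, no_of_files):
            (main_table, no_of_headers) = processed_files[i]
            try:
                row_parts.extend(main_table[key])
            except KeyError:
                print('NOTE: no values found for id %s in file \"%s\"' % (key, sys.argv[i + 1]))
                row_parts.extend([''] * (no_of_headers - 1))
        out_file.write(','.join(row_parts) + '\n')
    out_file.close()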