Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Import and process text file within MySQL

I'm working on a research project that requires me to process large csv files (~2-5 GB) with 500,000+ records. These files contain information on government contracts (from USASpending.gov). So far, I've been using PHP or Python scripts to attack the files row-by-row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks to see if the entity named is already in the database (using a combination of string and regex matching); if it is not, it first adds the entity to a table of entities and then proceeds to parse the rest of the record and inserts the information into the appropriate tables. The list of entities is over 100,000.

Here are the basic functions (part of a class) that try to match each record with any existing entities:

private function _getOrg($data)
{
    // if name of organization is null, skip it
    if($data[44] == '') return null;

    // use each of the possible names to check if organization exists
    $names = array($data[44],$data[45],$data[46],$data[47]);

    // cycle through the names
    foreach($names as $name) {
        // check to see if there is actually an entry here
        if($name != '') {
            if(($org_id = $this->_parseOrg($name)) != null) {
                $this->update_org_meta($org_id,$data); // updates some information of existing entity based on record
                return $org_id;
            }
        }
    }

    return $this->_addOrg($data);
}

private function _parseOrg($name)
{
    // check to see if it matches any org names
    // db class function, performs simple "like" match
    $this->db->where('org_name',$name,'like');

    $result = $this->db->get('orgs');

    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    // check to see if matches any org aliases
    $this->db->where('org_alias_name',$name,'like');

    $result = $this->db->get('orgs_aliases');

    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }
    return null; // no matches, have to add new entity
 }

The _addOrg function inserts the new entity's information into the db, where hopefully it will match subsequent records.

Here's the problem: I can only get these scripts to parse about 10,000 records / hour, which, given the size, means a few solid days for each file. The way my db is structured requires a several different tables to be updated for each record because I'm compiling multiple external datasets. So, each record updates two tables, and each new entity updates three tables. I'm worried that this adds too much lag time between MySQL server and my script.

Here's my question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or PHP/Python wrapper) to speed up the processing?

I'm running this on my Mac OS 10.6 with local MySQL server.

like image 276
tchaymore Avatar asked Dec 06 '25 04:12

tchaymore


1 Answers

load the file into a temporary/staging table using load data infile and then use a stored procedure to process the data - shouldnt take more than 1-2 mins at the most to completely load and process the data.

you might also find some of my other answers of interest:

Optimal MySQL settings for queries that deliver large amounts of data?

MySQL and NoSQL: Help me to choose the right one

How to avoid "Using temporary" in many-to-many queries?

60 million entries, select entries from a certain month. How to optimize database?

Interesting presentation:

http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/

example code (may be of use to you)

truncate table staging;

start transaction;

load data infile 'your_data.dat' 
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');

commit;

drop procedure if exists process_staging_data;

delimiter #

create procedure process_staging_data()
begin

    insert ignore into organisations (org_name) select distinct org_name from staging;

    update...

    etc.. 

    -- or use a cursor if you have to ??

end#

delimiter ;

call  process_staging_data();

Hope this helps

like image 56
Jon Black Avatar answered Dec 07 '25 19:12

Jon Black



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!