Import and process a text file in MySQL

I am working on a research project that requires me to process large CSV files (~2-5 GB) with 500,000+ records. These files contain information about government contracts (from USASpending.gov). So far, I have used PHP or Python scripts to go through the files row by row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks whether the entity named is already in the database (using a combination of string matching and regular expressions); if it is not, it first adds the entity to the entity table, then parses the rest of the record and inserts the information into the appropriate tables. The list of entities has more than 100,000 entries.

Here are the main functions (part of a class) that try to match each record against the existing entities:

private function _getOrg($data)
{
    // if name of organization is null, skip it
    if($data[44] == '') return null;

    // use each of the possible names to check if organization exists
    $names = array($data[44],$data[45],$data[46],$data[47]);

    // cycle through the names
    foreach($names as $name) {
        // check to see if there is actually an entry here
        if($name != '') {
            if(($org_id = $this->_parseOrg($name)) != null) {
                $this->update_org_meta($org_id,$data); // updates some information of existing entity based on record
                return $org_id;
            }
        }
    }

    return $this->_addOrg($data);
}

private function _parseOrg($name)
{
    // check to see if it matches any org names
    // db class function, performs simple "like" match
    $this->db->where('org_name',$name,'like');

    $result = $this->db->get('orgs');

    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    // check to see if matches any org aliases
    $this->db->where('org_alias_name',$name,'like');

    $result = $this->db->get('orgs_aliases');

    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }
    return null; // no matches, have to add new entity
}

The _addOrg function inserts the new entity's information into the db, where it will hopefully match subsequent records.

Here's the problem: I can only get these scripts to parse about 10,000 records per hour, which, given the size, means several solid days for each file. The way my db is structured requires several different tables to be updated for each record, because I am combining several external datasets. So each record updates two tables, and each new entity updates three tables. I am worried that this adds too much round-trip latency between the MySQL server and my script.

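Question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or a PHP/Python wrapper) to speed up the processing?

I am running this on Mac OS 10.6 with MySQL.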


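Load the file into a temporary/staging table using load data infile, then process the data with a stored procedure. Loading and processing should take no more than 1-2 minutes.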

A related video on the InnoDB storage engine:

http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/

Example code:

truncate table staging;

start transaction;

load data infile 'your_data.dat' 
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');

commit;

drop procedure if exists process_staging_data;

delimiter #

create procedure process_staging_data()
begin

    insert ignore into organisations (org_name) select distinct org_name from staging;

    update...

    etc.. 

    -- or use a cursor if you have to ??

end#

delimiter ;

call process_staging_data();
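
To make the two-step idea concrete (bulk load into a staging table, then one set-based statement instead of a lookup per record), here is a minimal Python sketch. It is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for MySQL, so `INSERT OR IGNORE` plays the role of `insert ignore`, and `executemany` stands in for `load data infile`. The function name `load_and_process`, the sample data, and the extra `IS NOT NULL` filter are hypothetical additions, not part of the answer above.

```python
import csv
import io
import sqlite3


def load_and_process(csv_text):
    """Bulk-load rows into a staging table, then fill `organisations` set-wise."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (org_name TEXT)")
    conn.execute(
        "CREATE TABLE organisations ("
        "org_id INTEGER PRIMARY KEY, org_name TEXT UNIQUE)"
    )

    # Step 1: bulk load (stand-in for LOAD DATA INFILE); '' becomes NULL,
    # mirroring nullif(org_name, '') in the SQL above.
    rows = [(row[0] or None,) for row in csv.reader(io.StringIO(csv_text)) if row]
    with conn:
        conn.executemany("INSERT INTO staging (org_name) VALUES (?)", rows)

    # Step 2: a single set-based statement replaces one lookup per record.
    # The IS NOT NULL filter is an assumption added for this sketch.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO organisations (org_name) "
            "SELECT DISTINCT org_name FROM staging WHERE org_name IS NOT NULL"
        )
    return conn


conn = load_and_process('"ACME Corp"\n"Globex"\n"ACME Corp"\n""\n')
org_names = [
    r[0] for r in conn.execute("SELECT org_name FROM organisations ORDER BY org_name")
]
```

The point of the design is that the database deduplicates the names in one pass, so the script never issues a query per input row.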

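It sounds like you would benefit most from tuning your SQL queries, which is probably where your script spends most of its time. I don't know how the PHP MySQL client performs, but MySQLdb for Python is fairly fast. In naive benchmarks I can easily sustain about 10k insert/select queries per second. Instead of running one SELECT after another to test whether the organization exists, checking all the names at once, for example with a REGEXP, may be more efficient (discussion: MySQL LIKE IN()?). MySQLdb lets you use executemany() to do multiple inserts at once, which you could almost certainly use to your advantage; perhaps your PHP client lets you do the same?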
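
As a rough sketch of both suggestions, the snippet below looks up all candidate names in one query and inserts new rows in one `executemany()` call. It is an illustration only, with assumptions worth stating: it runs against the standard-library sqlite3 module rather than MySQLdb (both follow the Python DB-API, but MySQLdb uses `%s` placeholders where sqlite3 uses `?`), it uses an exact-match `IN (...)` lookup instead of LIKE or REGEXP, and the helper names `find_existing` and `add_orgs` are hypothetical. The `orgs` table and its columns are taken from the question's code.

```python
import sqlite3


def find_existing(conn, names):
    """Return {org_name: org_id} for all candidate names using a single query."""
    names = [n for n in names if n]  # skip empty or missing names
    if not names:
        return {}
    placeholders = ",".join("?" * len(names))
    sql = "SELECT org_name, org_id FROM orgs WHERE org_name IN (%s)" % placeholders
    return dict(conn.execute(sql, names).fetchall())


def add_orgs(conn, names):
    """Insert many organisations in one executemany() call and one transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO orgs (org_name) VALUES (?)", [(n,) for n in names]
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orgs (org_id INTEGER PRIMARY KEY, org_name TEXT UNIQUE)")
add_orgs(conn, ["ACME Corp", "Globex"])

# One round trip for four candidate names instead of four separate SELECTs.
found = find_existing(conn, ["Initech", "", "Globex", "ACME Corp"])
```

Only the values are passed as parameters; the placeholder string is built from the number of names, so no user data is interpolated into the SQL text.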

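Another thing to consider: with Python you can use multiprocessing to parallelize as much of the work as possible. PyMOTW has a good article on it.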
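
A minimal sketch of that idea, assuming the per-record parsing is CPU-bound and independent of the database: split the records into chunks and parse the chunks in worker processes with `multiprocessing.Pool`. The function `parse_record` is a hypothetical placeholder for the real parsing logic, and the chunk size and process count are arbitrary.

```python
from multiprocessing import Pool


def parse_record(fields):
    """Hypothetical per-record parsing: normalise the organisation name."""
    return fields[0].strip().upper()


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parse_chunk(chunk):
    """Parse every record in one chunk; runs inside a worker process."""
    return [parse_record(fields) for fields in chunk]


def parse_in_parallel(records, processes=2, chunk_size=1000):
    """Parse chunks in worker processes; pool.map keeps the input order."""
    with Pool(processes) as pool:
        parsed = pool.map(parse_chunk, chunked(records, chunk_size))
    return [name for chunk in parsed for name in chunk]


if __name__ == "__main__":
    sample = [[" acme corp "], ["Globex"], ["initech"]]
    print(parse_in_parallel(sample, chunk_size=2))
```

Keeping the database writes in the parent process (for example, one batched insert per returned chunk) avoids sharing a connection between processes.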
