I am working on a research project that requires me to process large CSV files (~2-5 GB) with 500,000+ records. These files contain information about government contracts (from USASpending.gov). So far, I have used PHP or Python scripts to go through the files line by line, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks whether the entity named in it is already in the database (using a combination of string matching and regular expressions); if it is not, it first adds the entity to the entity table, and then proceeds to parse the rest of the record and insert the information into the appropriate tables. The list of entities has more than 100,000 entries.
Here are the main functions (part of a class) that try to match each record against the existing entities:
    // Return the org_id for this record, adding a new org if none matches.
    private function _getOrg($data)
    {
        // No primary name means there is nothing to match on.
        if($data[44] == '') return null;
        // Columns 44-47 hold the candidate names for the organization.
        $names = array($data[44],$data[45],$data[46],$data[47]);
        foreach($names as $name) {
            if($name != '') {
                if(($org_id = $this->_parseOrg($name)) != null) {
                    $this->update_org_meta($org_id,$data);
                    return $org_id;
                }
            }
        }
        // None of the names matched an existing org or alias.
        return $this->_addOrg($data);
    }

    // Look the name up in orgs, then in orgs_aliases.
    // Only an unambiguous (single-row) match counts.
    private function _parseOrg($name)
    {
        $this->db->where('org_name',$name,'like');
        $result = $this->db->get('orgs');
        if(mysql_num_rows($result) == 1) {
            $row = mysql_fetch_object($result);
            return $row->org_id;
        }
        $this->db->where('org_alias_name',$name,'like');
        $result = $this->db->get('orgs_aliases');
        if(mysql_num_rows($result) == 1) {
            $row = mysql_fetch_object($result);
            return $row->org_id;
        }
        return null;
    }
The _addOrg function inserts the new entity's information into the database, where, hopefully, it will match subsequent records.
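For anyone who finds Python easier to read, here is a rough sketch of the same matching flow. It is only an illustration under several assumptions: sqlite3 stands in for MySQL so the snippet runs on its own, the function names find_org and get_or_add_org are made up for this example, the insert at the end is a guessed minimal stand-in for _addOrg (whose body is not shown above), and update_org_meta is left out entirely. Only the table and column names come from the PHP code.

```python
import sqlite3


def find_org(conn, name):
    """Return org_id if exactly one org or alias matches name, else None."""
    lookups = (("orgs", "org_name"), ("orgs_aliases", "org_alias_name"))
    for table, column in lookups:
        # Table and column names are fixed constants; only the name is bound.
        rows = conn.execute(
            f"SELECT org_id FROM {table} WHERE {column} LIKE ?", (name,)
        ).fetchall()
        if len(rows) == 1:  # only an unambiguous match counts
            return rows[0][0]
    return None


def get_or_add_org(conn, names):
    """Try each candidate name in order; add a new org if none matches."""
    if not names or not names[0]:
        return None  # mirrors the early return on an empty primary name
    for name in names:
        if name:
            org_id = find_org(conn, name)
            if org_id is not None:
                return org_id
    # Hypothetical stand-in for _addOrg: store only the primary name.
    cur = conn.execute("INSERT INTO orgs (org_name) VALUES (?)", (names[0],))
    return cur.lastrowid


# In-memory database with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orgs (org_id INTEGER PRIMARY KEY, org_name TEXT);
    CREATE TABLE orgs_aliases (org_alias_name TEXT, org_id INTEGER);
    """
)

first = get_or_add_org(conn, ["Acme Corp", "", "", ""])   # inserted as new
again = get_or_add_org(conn, ["Acme Corp", "", "", ""])   # found in orgs
conn.execute("INSERT INTO orgs_aliases VALUES (?, ?)", ("ACME", first))
via_alias = get_or_add_org(conn, ["Acme Corporation", "ACME", "", ""])
org_count = conn.execute("SELECT COUNT(*) FROM orgs").fetchone()[0]
```

Each call to get_or_add_org can issue up to two queries per candidate name before it gives up and inserts, which is the per-record pattern I am asking about.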
Here's the problem: I can only get these scripts to process about 10,000 records per hour, which, given the file sizes, means several solid days for each file. The way my database is structured requires updating several different tables for each record, because I am combining several external datasets. Thus, each record updates two tables, and each new entity updates three tables. I am worried that this adds too much round-trip latency between the MySQL server and my script.
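To put numbers on this, here is a back-of-the-envelope estimate in Python. It uses only the figures above and the structure of the two functions shown earlier (up to four candidate names, up to two lookups per name). It ignores whatever update_org_meta and the database wrapper do internally, so treat the statement counts as rough lower-end guesses rather than measurements.

```python
RECORDS = 500_000          # lower bound on records per file
RECORDS_PER_HOUR = 10_000  # observed throughput

# Total run time for one file and the time budget per record.
hours_per_file = RECORDS / RECORDS_PER_HOUR       # 50.0 hours
seconds_per_record = 3600 / RECORDS_PER_HOUR      # 0.36 seconds

# Worst case for the matching step: every candidate name is tried
# against both orgs and orgs_aliases before a match (or no match).
CANDIDATE_NAMES = 4
LOOKUPS_PER_NAME = 2
worst_case_selects = CANDIDATE_NAMES * LOOKUPS_PER_NAME  # 8

# Table writes per record, as described above.
WRITES_EXISTING_ENTITY = 2
WRITES_NEW_ENTITY = 3

# A new entity misses all lookups and then writes to three tables.
statements_new_entity = worst_case_selects + WRITES_NEW_ENTITY  # 11

print(f"{hours_per_file:.0f} hours per file")
print(f"{seconds_per_record:.2f} s per record")
print(f"up to {statements_new_entity} statements for a new entity")
```

So a file with 500,000 records takes at least about 50 hours at the current rate, and a single record can cost on the order of ten separate round trips to the server.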
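My question: is there a way to import the file into MySQL and then use MySQL itself (perhaps through a PHP/Python wrapper) to speed up the processing?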
I am running this on Mac OS 10.6 with MySQL.