A good way to pull new lines from a huge (unindexed / unsorted) file

I have a large CSV file (> 1 GB) sitting on a NAS that is updated weekly with new entries. The file always has the same columns:

Customer ID | Product | Online? (Bool) | Amount | Date

I need to use this file to update a PostgreSQL database of customer IDs with monthly totals by product and store. Something like this:

Customer ID | Month | (several unrelated fields) | Product 1 (Online) | Product 1 (Offline) | Product 2 (Online) | etc.

Since the file is so large (and gets bigger with every update), I need an efficient way to pick out the new and updated records and apply them to the database. Unfortunately, our server updates the file by customer ID, not by date, so I cannot just read the new rows from the end of the file.

Is there a smart way to split a file in a way that won't break as the file continues to grow?

+3
3 answers

COPY the file into a staging table. This assumes, of course, that you have a PK: a unique identifier for each row that does not change. I checksum the remaining columns, do the same for the rows already loaded into the destination table, and compare source with destination. This detects updates and deletes as well as new rows.

As you can see, I did not add any indexes or tune it in any other way. My goal was just to make it work correctly.

create schema source;
create schema destination;

--DROP TABLE source.employee; 
--DROP TABLE destination.employee;

select x employee_id, CAST('Bob' as text) first_name,cast('H'as text) last_name, cast(21 as integer) age
INTO source.employee
from generate_series(1,10000000) x;

select x employee_id, CAST('Bob' as text) first_name,cast('H'as text) last_name, cast(21 as integer) age
INTO destination.employee
from generate_series(1,10000000) x;

-- Audit query: the two tables are identical at this point, so it returns no rows.
select
    d.*,
    s.*,
    CASE WHEN d.employee_id IS NULL THEN 'Missing'   -- in source only: a new row
         WHEN s.employee_id IS NULL THEN 'Orphan'    -- in destination only: deleted from source
         ELSE 'CHECKSUM'                             -- in both, but the non-key columns differ
    END AS AuditFailureType
FROM destination.employee AS d
FULL OUTER JOIN source.employee AS s
             on d.employee_id = s.employee_id
WHERE d.employee_id IS NULL
   OR s.employee_id IS NULL
   -- concat_ws ignores NULL arguments, so one NULL column no longer makes the whole checksum NULL,
   -- and the separator keeps ('ab','c') distinct from ('a','bc').
   OR md5(concat_ws('|', s.first_name, s.last_name, s.age))
      <> md5(concat_ws('|', d.first_name, d.last_name, d.age));

--Mimic source data getting an update.
UPDATE source.employee
SET age = 99
where employee_id = 45000;

-- Re-run the audit query: employee 45000 now shows up as a CHECKSUM failure.
select
    d.*,
    s.*,
    CASE WHEN d.employee_id IS NULL THEN 'Missing'   -- in source only: a new row
         WHEN s.employee_id IS NULL THEN 'Orphan'    -- in destination only: deleted from source
         ELSE 'CHECKSUM'                             -- in both, but the non-key columns differ
    END AS AuditFailureType
FROM destination.employee AS d
FULL OUTER JOIN source.employee AS s
             on d.employee_id = s.employee_id
WHERE d.employee_id IS NULL
   OR s.employee_id IS NULL
   -- concat_ws ignores NULL arguments, so one NULL column no longer makes the whole checksum NULL,
   -- and the separator keeps ('ab','c') distinct from ('a','bc').
   OR md5(concat_ws('|', s.first_name, s.last_name, s.age))
      <> md5(concat_ws('|', d.first_name, d.last_name, d.age));
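
For the sales file from the question, the same idea might look roughly like the sketch below. Everything named here is an assumption for illustration: the tables staging_sales and loaded_sales, the column names, and above all sale_id, which stands in for the PK this approach needs (the file as described does not have one). It also compares the non-key columns directly with a row-wise IS DISTINCT FROM instead of md5, and it feeds COPY from inline data via psql's FROM STDIN so that it runs without a file; against the real file you would point COPY (or \copy) at the CSV on the NAS.

```sql
-- Hypothetical staging and destination tables; TEMP so nothing is left behind.
CREATE TEMP TABLE staging_sales (
    sale_id     bigint PRIMARY KEY,   -- assumed PK, not present in the original file
    customer_id integer,
    product     text,
    online      boolean,
    amount      numeric(12,2),
    sale_date   date
);
CREATE TEMP TABLE loaded_sales (LIKE staging_sales INCLUDING ALL);

-- Rows that an earlier run already loaded.
INSERT INTO loaded_sales VALUES
    (1, 100, 'Product 1', true,  10.00, '2014-11-03'),
    (2, 100, 'Product 2', false, 25.00, '2014-11-20');

-- This week's file: row 2 has a changed amount, row 3 is new.
COPY staging_sales FROM STDIN WITH (FORMAT csv, HEADER true);
sale_id,customer_id,product,online,amount,sale_date
1,100,Product 1,true,10.00,2014-11-03
2,100,Product 2,false,30.00,2014-11-20
3,200,Product 1,false,5.00,2014-12-01
\.

-- 'Missing' rows: in the file but not loaded yet.
INSERT INTO loaded_sales
SELECT s.*
FROM staging_sales AS s
WHERE NOT EXISTS (SELECT 1 FROM loaded_sales AS d WHERE d.sale_id = s.sale_id);

-- 'CHECKSUM' rows: same key, at least one non-key column differs (NULL-safe).
UPDATE loaded_sales AS d
SET customer_id = s.customer_id,
    product     = s.product,
    online      = s.online,
    amount      = s.amount,
    sale_date   = s.sale_date
FROM staging_sales AS s
WHERE d.sale_id = s.sale_id
  AND (d.customer_id, d.product, d.online, d.amount, d.sale_date)
      IS DISTINCT FROM
      (s.customer_id, s.product, s.online, s.amount, s.sale_date);

-- 'Orphan' rows: loaded earlier but no longer in the file.
DELETE FROM loaded_sales AS d
WHERE NOT EXISTS (SELECT 1 FROM staging_sales AS s WHERE s.sale_id = d.sale_id);
```

After these statements loaded_sales matches the file: row 2 carries the new amount and row 3 has been added. The inline data block after COPY only works when the script is run through psql.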

+1

CSV > 1 . current_week_sales. script, 2014_12_sales current_week_sales.

+1

The only truly effective solution is to gain control over the program that creates this file and make it do something more sensible.

If you cannot do that, > 1 GB is just not that big, unless it is >> 1 GB. Just recompute the whole thing. If it is slow, then make it faster. There is no reason to believe that computing summaries over a few GB should be slow.
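
A minimal sketch of what "recompute the whole thing" could look like in PostgreSQL, assuming version 9.4 or later for the FILTER clause. The table raw_sales, its column names, and the two product names are made up for illustration; the handful of INSERTed rows stand in for the full file, which in practice you would bulk-load with COPY after a TRUNCATE.

```sql
-- Hypothetical table holding one row per line of the CSV.
CREATE TEMP TABLE raw_sales (
    customer_id integer,
    product     text,
    online      boolean,
    amount      numeric(12,2),
    sale_date   date
);

-- Stand-in for reloading the entire file on every run.
INSERT INTO raw_sales VALUES
    (100, 'Product 1', true,  10.00, '2014-11-03'),
    (100, 'Product 1', true,   5.00, '2014-11-15'),
    (100, 'Product 1', false,  7.50, '2014-11-20'),
    (200, 'Product 2', true,  20.00, '2014-12-01');

-- One row per customer and month, one column per product/channel combination.
-- COALESCE turns "no matching sales" (NULL) into 0.
SELECT customer_id,
       date_trunc('month', sale_date)::date AS sales_month,
       COALESCE(sum(amount) FILTER (WHERE product = 'Product 1' AND online), 0)     AS product_1_online,
       COALESCE(sum(amount) FILTER (WHERE product = 'Product 1' AND NOT online), 0) AS product_1_offline,
       COALESCE(sum(amount) FILTER (WHERE product = 'Product 2' AND online), 0)     AS product_2_online
FROM raw_sales
GROUP BY customer_id, date_trunc('month', sale_date)
ORDER BY customer_id, sales_month;
```

The result of that SELECT can replace the monthly totals wholesale, which avoids having to work out which lines of the file are new.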

0