ETL for processing history records

I work on a kind of DWH project (not really, but close enough). There is a problem we constantly run into, and I was wondering whether there is a better solution. It is as follows.

We get several large files with records containing all the states a user has been in, for example:

UID | State    | Date
1   | Active   | 20120518
2   | Inactive | 20120517
1   | Inactive | 20120517
...

We are usually interested in the last state of each user. So far so good: with a little sorting we can get the data the way we want it. The only problem is that these files are usually large, around 20-60 GB, and sorting them is sometimes a pain, since the sorting logic is usually not that simple.

As a rule, we load everything into our Oracle database and use intermediate tables and materialized views for this. However, performance sometimes bites us.

20-60 GB may be big, but it is not that big. I mean, there should be a somewhat more specialized way to handle these records, right?

I can think of two main ways to solve the problem:

1) Programming outside the DBMS, with scripts and compiled code. But this is perhaps not very flexible unless more time is spent on developing something. In addition, I might have to deal with administering the machine's resources, which I would rather not worry about.

2) Load everything into the DBMS (in our case, Oracle) and use whatever tools it provides to sort and trim the data. That is our case, although I'm not sure whether we are using all the available tools, or using them the right way, on Oracle 10g.

Question:

60gb , , .

, , ?

!

+3
4

, , .

, , - . Enterprise Edition , .

- , . . , , Hadoop, (, , , , , ).

: . , , . ( , ). .

, ,

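-- Note: UID is a reserved word in Oracle; rename the column (e.g. user_id) or quote it before running this.
CREATE TABLE user_status_external (
   uid     NUMBER(6),
   status      VARCHAR2(10),
   sdate        DATE
)
ORGANIZATION EXTERNAL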
(TYPE oracle_loader
 DEFAULT DIRECTORY data_dir
 ACCESS PARAMETERS
 (
  RECORDS DELIMITED BY newline
  BADFILE 'usrsts.bad'
  DISCARDFILE 'usrsts.dis'
  LOGFILE 'usrsts.log'
  FIELDS TERMINATED BY ","  OPTIONALLY ENCLOSED BY '"'
  (
   uid     INTEGER EXTERNAL(6),
   status     CHAR(10),
   sdate       date 'yyyymmdd' )
 )
 LOCATION ('usrsts.dmp')
)
PARALLEL
REJECT LIMIT UNLIMITED;

, DATA_DIR.

, insert:

insert into user_status (uid, status, last_status_date)
    select  sq.uid
            ,  sq.status
            ,  sq.sdate
    from (
        select /*+ parallel (et,4) */ 
               et.uid
               , et.status
               , et.sdate
               , row_number() over (partition by et.uid order by et.sdate desc) rn  
        from user_status_external et
        ) sq
    where sq.rn = 1;

, , , . .

- INSERT: , USERID, , . , , , MERGE .


: , , , , . , . , .. . .

+2

- , APC . , , , , . ? 20-60 - , - , X 2GB, ?

, 60 , - , , . - , , user_id. Oracle X , .

, :

  • .
  • ., , .
  • , - , /* + nologging */, . force_logging true, nologging .
  • , , , .

nologging, , - 60 , 60 , , , . , , !

, , , . , - IO, , , , .

+1

, - :

create materialized view my_view
tablespace my_tablespace
nologging
build immediate
refresh complete on demand
with primary key
as
-- Note: uid and date are reserved words in Oracle; rename or quote these columns in a real table.
select uid,state,date from
(
  select /*+ parallel (t,4) */ uid, state, date, row_number() over (partition by uid order by date desc) rnum
  from my_table t
)
where rnum = 1;

, .

: , , uid.

0

, , . .

, , .

In general, this becomes (in pseudo-code):

foreach row in file
    savedrow = row saved so far for this row's user (if any)
    if savedrow is null
        save row for this user
    else
        if row is more desirable than savedrow
             save row for this user
        end
    end
end

send saved rows to database

The point is that you need to define how one row is considered more desirable than another. In the simple case, a row is more desirable when its date is later than that of the row already saved for the same user. At the end you will have a list of rows, one for each user, each carrying the most recent date that you saw.

You can write the script or program so that this framework is separate from the code that understands each data file.

It will still take some time, mind :-)
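As a rough illustration of the loop above, here is a minimal Python sketch. It assumes the pipe-separated layout from the question's sample with no header line, a date in `YYYYMMDD` form (so string comparison orders dates correctly), and that one row per distinct user fits in memory. The names `Row`, `parse_pipe_line`, `is_more_desirable` and `latest_per_user` are hypothetical, chosen only for this example.

```python
import io
from typing import Callable, Dict, Iterable, NamedTuple


class Row(NamedTuple):
    uid: str
    state: str
    date: str  # 'YYYYMMDD', so plain string comparison orders dates


def parse_pipe_line(line: str) -> Row:
    """File-specific part: parse one 'UID | State | Date' line."""
    uid, state, date = (field.strip() for field in line.split("|"))
    return Row(uid, state, date)


def is_more_desirable(candidate: Row, saved: Row) -> bool:
    """Simple case: the row with the later date wins."""
    return candidate.date > saved.date


def latest_per_user(
    lines: Iterable[str],
    parse: Callable[[str], Row] = parse_pipe_line,
    better: Callable[[Row, Row], bool] = is_more_desirable,
) -> Dict[str, Row]:
    """Generic part: stream the input once, keeping one row per user."""
    saved: Dict[str, Row] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = parse(line)
        current = saved.get(row.uid)
        if current is None or better(row, current):
            saved[row.uid] = row
    return saved


if __name__ == "__main__":
    sample = io.StringIO(
        "1 | Active | 20120518\n"
        "2 | Inactive | 20120517\n"
        "1 | Inactive | 20120517\n"
    )
    for row in latest_per_user(sample).values():
        print(row.uid, row.state, row.date)
```

Passing a real file object instead of the `StringIO` sample reads the file line by line, so the whole input is never held in memory or sorted; only the saved rows are. A different file layout or a more involved ordering rule would only need a different `parse` or `better` function.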

-1