Cut | Sort | Uniq -d -c | but?

Question


I have a file in the following format.

GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr

I need to pull out the duplicates and count them (a duplicate is defined by fields 1, 2, 5 and 14). Then I insert the first record of each group of duplicates, with all of its fields, into the database and store the count (dups) in another column. To do this, I cut out the 4 fields mentioned, sort them, find the duplicates with uniq -d, and get the counts with -c. But after sorting out the duplicates I have to get back to the full records, and the result should be in the form below.

3,GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr

Here 3 is the number of records that share the same f1,2,5,14, and the remaining fields can come from any one of the duplicate lines.

So the duplicates must be removed from the source file and shown in the format above. The lines that remain in the source file are unique and go through as they are ...

What I've done...

awk '{printf("%5d,%s\n", NR,$0)}' renewstatus_2012-04-19.txt > n_renewstatus_2012-04-19.txt 
cut -d',' -f2,3,6,15 n_renewstatus_2012-04-19.txt |sort | uniq -d -c

but to get the full lines for the duplicates, I have to go back to the original file again ...

I don't want to go around in circles .. I need a different point of view on this .. my brain keeps clinging to my own approach .. I need a cigar .. Any thoughts ... ??

+3

shell

mannoj Apr 21 '12 at 11:09


3 answers

fanlix · Answer 1 · 2012-04-21T11:38:25+0000

sort has the -k option

   -k, --key=POS1[,POS2]
          start a key at POS1, end it at POS2 (origin 1)

uniq has the -f option

   -f, --skip-fields=N
          avoid comparing the first N fields

So sort and uniq by field number (work out the NUM values and check this command yourself, please):

awk -F"," '{print $0,$1,$2,...}' file.txt | sort -k NUM,NUM2 | uniq -f NUM3 -c
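One way the NUM placeholders above might be filled in, as a rough sketch rather than the answerer's exact command. It assumes the records contain no whitespace (uniq -f counts blank-separated fields), and the sample rows are hypothetical ones adapted from the question. The trailing awk that rebuilds the "count,record" form is an addition for illustration.

```shell
#!/bin/sh
# Hypothetical sample: three rows share fields 1,2,5,14; one row is unique.
sample=$(mktemp)
cat > "$sample" <<'EOF'
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASB,0264993504,on:X,10.0,N,20120419,migr
EOF

# 1. Append the key (f1,f2,f5,f14) as a second blank-separated field.
# 2. Sort on that second field so equal keys become adjacent.
# 3. uniq -f1 skips the full record and compares only the key; -c adds the count.
# 4. The last awk drops the helper key and prints "count,record".
result=$(awk -F, '{print $0, $1 FS $2 FS $5 FS $14}' "$sample" |
    sort -k2,2 |
    uniq -f1 -c |
    awk '{print $1 "," $2}')

printf '%s\n' "$result"
rm -f "$sample"
```

Which of the duplicate lines survives depends on the sort order, which is fine here since the question says the remaining fields can come from any of them.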

glenn jackman · Answer 2 · 2012-04-21T18:21:59+0000

Using awk associative arrays is a convenient way to find unique / duplicate lines:

awk '
    BEGIN {FS = OFS = ","}
    {
        key = $1 FS $2 FS $5 FS $14
        if (key in count) 
            count[key]++
        else {
            count[key] = 1
            line[key] = $0
        }
    }
    END {for (key in count) print count[key], line[key]}
' filename
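The question also asks for unique lines to pass through unchanged, with only the duplicates getting a count in front. A possible variation on the script above, as a sketch: the order array and the sample rows are hypothetical additions, used here to keep the first-seen order (a plain for-in loop in awk returns keys in no particular order).

```shell
#!/bin/sh
# Hypothetical sample: three rows share fields 1,2,5,14; one row is unique.
sample=$(mktemp)
cat > "$sample" <<'EOF'
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASB,0264993504,on:X,10.0,N,20120419,migr
EOF

out=$(awk '
    BEGIN {FS = OFS = ","}
    {
        key = $1 FS $2 FS $5 FS $14
        if (!(key in count)) {
            line[key] = $0        # remember the first record for this key
            order[++n] = key      # remember the order keys were first seen
        }
        count[key]++
    }
    END {
        for (i = 1; i <= n; i++) {
            key = order[i]
            if (count[key] > 1)
                print count[key], line[key]   # duplicates: count, then record
            else
                print line[key]               # unique records go out as they are
        }
    }
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```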

mannoj · Answer 3 · 2012-08-21T10:07:24+0000

What I ended up using:

awk -F, '
    # keep the first record seen for each f1,2,5,14 key
    !(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq) {uniq[$1,$2,$5,$14] = $0}
    {count[$1,$2,$5,$14]++}
    END {
        # keys seen more than once go to "dupes", the rest to "uniq"
        for (i in count) {
            if (count[i] > 1) file = "dupes"; else file = "uniq"
            print uniq[i], "," count[i] > file
        }
    }
' renewstatus_2012-04-19.txt

Checking the results:

sym@localhost:~$ cut -f16 -d',' uniq | sort | uniq -d -c
 124275 1    -----> UNIQ (1)

sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -d -c
   3860 2
    850 3
     71 4
      7 5
      3 6
sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -u -c
      1 7

10614 ------> NUMBER OF RECORDS THAT HAD DUPLICATES, NOW COLLAPSED WITH THEIR COUNTS

sym@localhost:~$ wc -l renewstatus_2012-04-19.txt
134889 renewstatus_2012-04-19.txt    ---> TOTAL LINES IN THE ORIGINAL FILE, MATCHING EXACTLY (124275 + 10614) = 134889
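The 10614 figure is each duplicate count multiplied by the number of groups with that count, summed up. A small sketch that redoes the arithmetic from the numbers quoted in the output above (the variable names are just for illustration):

```shell
#!/bin/sh
# Each input line is "groups count", as printed by uniq -c on the dupes file.
dupe_records=$(printf '%s\n' '3860 2' '850 3' '71 4' '7 5' '3 6' '1 7' |
    awk '{sum += $1 * $2} END {print sum}')

# Add the 124275 unique records to get the size of the original file.
total=$((124275 + dupe_records))

echo "$dupe_records $total"   # prints "10614 134889"
```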
