Cut | Sort | Uniq -d -c | but?

This file is in the following format.

GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr

I need to pull out the duplicates and count them (two records are duplicates when they match on fields 1, 2, 5 and 14). Then I need to insert the first record of each duplicate group, with all of its fields, into the database and store the duplicate count in another column. To do this, I cut out the four fields, sort them, and find the duplicates with uniq -d, adding -c for the counts. After all that sorting, though, I still have to go back to the original file for the full records, and the result should look like this:

3,GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr

Here 3 is the number of records that share the same fields 1, 2, 5 and 14; the remaining fields can come from any of the duplicate lines.

So the duplicates must be removed from the source file and emitted in the format above. Whatever remains in the source file is unique and goes through as is.


What I've done...

awk '{printf("%5d,%s\n", NR,$0)}' renewstatus_2012-04-19.txt > n_renewstatus_2012-04-19.txt 
cut -d',' -f2,3,6,15 n_renewstatus_2012-04-19.txt |sort | uniq -d -c 

But to get the full lines for the duplicates, I have to go back to the original file again...

I don't want to get confused, so I need a different point of view; my brain is stuck on my own approach. I need a cigar. Any thoughts?
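One way to go back to the original file is a second awk pass that reads the uniq -c counts first and then the data. The following is only a sketch under assumptions: the sample lines and temporary file names are made up for illustration, it works on the original file (fields 1, 2, 5, 14) rather than the line-numbered copy, and it assumes the key fields contain no whitespace.

```shell
# Hypothetical sample data: three lines share fields 1, 2, 5 and 14.
work=$(mktemp -d)
cat > "$work/in.txt" <<'EOF'
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,kr,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASB,0264993503,on:X,10.0,N,20120419,migr
EOF

# Pass 1: count how often each key (fields 1, 2, 5, 14) occurs.
cut -d',' -f1,2,5,14 "$work/in.txt" | sort | uniq -c > "$work/counts.txt"

# Pass 2: load the counts, then walk the original file and keep only the
# first line seen for each key. Duplicated keys get their count prepended;
# unique lines pass through unchanged.
result=$(awk -F, '
    NR == FNR {                      # first file: "   <count> <key>"
        split($0, part, " ")
        count[part[2]] = part[1]
        next
    }
    {
        key = $1 FS $2 FS $5 FS $14
        if (key in seen) next        # later copies of a key are dropped
        seen[key] = 1
        if (count[key] > 1) { print count[key] "," $0 }
        else                { print $0 }
    }
' "$work/counts.txt" "$work/in.txt")
printf '%s\n' "$result"
rm -rf "$work"
```

On this sample the GLSH group collapses to its first line prefixed with 3, and the GBRPW line is printed as is.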

3 answers

sort has the -k option

   -k, --key=POS1[,POS2]
          start a key at POS1, end it at POS2 (origin 1)

uniq has the -f option

   -f, --skip-fields=N
          avoid comparing the first N fields

So sort and uniq on field numbers (work out the NUM values and check this command yourself, please):

awk -F"," '{print $0,$1,$2,...}' file.txt | sort -k NUM,NUM2 | uniq -f NUM3 -c
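As a concrete illustration of the idea, here is one possible way to fill in the field numbers. It is a sketch, not the answerer's exact command: it appends the whole key as a single second whitespace-separated field, and it assumes the data lines contain no whitespace. The sample lines are hypothetical.

```shell
# Hypothetical sample data: three lines share fields 1, 2, 5 and 14.
data=$(mktemp)
cat > "$data" <<'EOF'
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,kr,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASB,0264993503,on:X,10.0,N,20120419,migr
EOF

# 1. awk appends the key after a space, so each line becomes "<record> <key>".
# 2. sort orders by the second whitespace-separated field (the key).
# 3. uniq -f 1 skips the record itself and compares only the key; -c counts.
# 4. The last awk drops the helper key and joins count and record with a comma.
result=$(awk -F, '{print $0, $1 FS $2 FS $5 FS $14}' "$data" |
    LC_ALL=C sort -k2,2 |
    uniq -f 1 -c |
    awk '{print $1 "," $2}')
printf '%s\n' "$result"
rm -f "$data"
```

Which record represents a group depends on the sort order here, which matches the requirement that the remaining fields may come from any of the duplicate lines.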

Using awk associative arrays is a convenient way to find unique and duplicate lines:

awk '
    BEGIN {FS = OFS = ","}
    {
        key = $1 FS $2 FS $5 FS $14
        if (key in count) 
            count[key]++
        else {
            count[key] = 1
            line[key] = $0
        }
    }
    END {for (key in count) print count[key], line[key]}
' filename

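Solution:

awk -F, '
    # Remember the first full line seen for each key (fields 1, 2, 5, 14).
    !(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq) {uniq[$1,$2,$5,$14] = $0}
    # Count every occurrence of the key.
    {count[$1,$2,$5,$14]++}
    END {
        # Keys seen more than once go to the file "dupes", the rest to "uniq".
        for (i in count) {
            if (count[i] > 1) file = "dupes"; else file = "uniq"
            print uniq[i], "," count[i] > file
        }
    }
' renewstatus_2012-04-19.txt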

Verification:

sym@localhost:~$ cut -f16 -d',' uniq | sort | uniq -d -c
 124275 1     -----> UNIQ (1)

sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -d -c
   3860 2
    850 3
     71 4
      7 5
      3 6
sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -u -c
      1 7

10614 ------> number of original records replaced by lines carrying their counts

sym@localhost:~$ wc -l renewstatus_2012-04-19.txt
134889 renewstatus_2012-04-19.txt   ---> total of the original file matches exactly: (124275 + 10614) = 134889
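The 10614 figure can be rechecked from the frequency table printed by uniq -c: each row is "number of groups, copies per group", so the covered records are the sum of their products. A small sketch, with the table values copied from the output above:

```shell
# Rows are "<number of duplicate groups> <copies in each group>",
# copied from the uniq -c output on the dupes file.
total=$(printf '%s\n' '3860 2' '850 3' '71 4' '7 5' '3 6' '1 7' |
    awk '{ sum += $1 * $2 } END { print sum }')

echo "$total"                # 10614 records covered by the dupes file
echo $((124275 + total))     # 134889, the wc -l count of the original file
```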
