Print the whole row once for each unique column value (Bash)

Question

This should certainly be a trivial task with awk one way or another, but it has left me scratching my head this morning. I have a file with a format similar to this:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750

I would like to print one line for each distinct peptide value in column 2, so that the above input becomes:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750

This is what I have tried so far, but it obviously does not do what I need:

awk '{print $2}' file | sort | uniq
# Prints only the peptides...
awk '{print $0, "\t", $1}' file |sort | uniq -u -f 4
# Altogether omits peptides which are not unique...

One last thing: the solution needs to treat peptides that are substrings of other peptides as separate values (for example, VSSILED and VSSILEDKILSR). Thanks :)

+5

bash shell awk uniq

Bede Constantinides Aug 21 '12 at 10:09

4 answers

Just use sort:

sort -k 2,2 -u file

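-u removes duplicate entries, and -k 2,2 restricts the sort key to field 2 only, so the rest of the line is ignored when checking for duplicates.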
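
One consequence worth seeing in practice: the output of sort comes back ordered by the key, not in the original file order. The sketch below reuses the sample rows from the question; the file name peptides.txt is an assumption made for this example. Which of several rows sharing a key survives is not something to rely on across sort implementations, so only the order of the peptides is noted here.

```shell
# Sample rows copied from the question; the file name is an assumption.
cat > peptides.txt <<'EOF'
pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> AIQLTGK        8   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> AIQLTGK        10  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR   2   genes ADUm.2146,ADUm.5750
EOF

# One row per distinct value of field 2, ordered by that field.
sort -k 2,2 -u peptides.txt
```

With this input, VSSILEDKILSR is listed before VSSILEDKTT, which differs from the order asked for in the question. If the original order matters, an approach that does not sort is a better fit.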

+16

flolo Aug 21 '12 at 10:23

I would use Perl for this:

perl -nae 'print unless exists $seen{$F[1]}; undef $seen{$F[1]}' < input.txt

The -n switch processes the input line by line, and the -a switch splits each line into the array @F.

+2

choroba Aug 21 '12 at 10:20

awk '{if($2==temp){next;}else{print}temp=$2}' your_file

Tested below:

> awk '{if($2==temp){next;}else{print}temp=$2}' temp
pep> AEYTCVAETK         2       genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK            1       genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR      5       genes ADUm.367
pep> VSSILEDKTT         9       genes ADUm.1192,ADUm.2731
pep> AIQLTGK            10      genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR       3       genes ADUm.2146,ADUm.5750
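
Note that comparing each line only with the previous one collapses adjacent duplicates, which is why AIQLTGK still appears twice in the output above. One possible variation, assuming the rows may be reordered, is to sort on column 2 first so that equal peptides become neighbours. The function name, the file name, and the sample rows below are made up for illustration. The -s (stable) flag is available in GNU and BSD sort but is not required by POSIX.

```shell
# Hypothetical sample file, made up for this sketch.
cat > peptides_adjacent.txt <<'EOF'
pep> AIQLTGK       1  genes ADUm.1999,ADUm.3560
pep> VSSILED       4  genes ADUm.2146
pep> AIQLTGK       8  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR  3  genes ADUm.2146,ADUm.5750
EOF

# Hypothetical helper: sort on column 2, then keep the first line of each run.
dedupe_after_sort() {
    # -s keeps the input order among rows that share the same key.
    sort -s -k2,2 "$1" |
        awk '$2 != prev { print }   # column 2 changed: first row of a new run
             { prev = $2 }          # remember the key for the next line'
}

dedupe_after_sort peptides_adjacent.txt
```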

+2

Vijay Aug 21 '12 at 10:35

Steve · Accepted Answer · 2012-08-21T10:22:23+0000

One way using awk:

awk '!array[$2]++' file.txt

Results:

pep> AEYTCVAETK     2   genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK        1   genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR  5   genes ADUm.367
pep> VSSILEDKTT     9   genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR   3   genes ADUm.2146,ADUm.5750
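
For readers unfamiliar with the idiom, the one-liner can be spelled out. The sketch below is a long-hand equivalent, assuming whitespace-separated columns; the function name first_per_key, the file name peptides_sample.txt, and the sample rows are made up for illustration.

```shell
# Hypothetical sample file, made up for this sketch.
cat > peptides_sample.txt <<'EOF'
pep> AIQLTGK       1  genes ADUm.1999,ADUm.3560
pep> VSSILED       4  genes ADUm.2146
pep> AIQLTGK       8  genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR  3  genes ADUm.2146,ADUm.5750
EOF

# Long-hand version of: awk '!array[$2]++'
first_per_key() {
    awk '{
        if (!($2 in seen)) {   # first time this exact column-2 string appears
            print              # print the whole line
        }
        seen[$2] = 1           # remember the key for later lines
    }' "$1"
}

first_per_key peptides_sample.txt
```

Because array keys are matched as whole strings, VSSILED and VSSILEDKILSR are counted as different peptides, which covers the substring requirement in the question. The input order is preserved, and the first row seen for each peptide is the one that gets printed.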
