Find the number and delete adjacent characters equal to this number

Part of my 4 column output is as follows:

5    cc1kcc1kc    5    cc1kcc1kc
5    cc2ppggg   5    cc2ppggg
6    ccg12qqqqqqqqqqqqggg    10 ccccg11qqqqqqqqqqqggggg 
3    4qqqqcgc1q   12    cgccgccgccgc

I want to change the second and fourth columns. Is there a way with awk or sed to remove each number together with the characters next to it? Or would it be easier or better to use a Perl script for this conversion?

The result should look like this:

5    ccccc    5    ccccc
5    ccggg    5    ccggg
6    ccgggg   10    ccccgggggg 
3    cgc    12    cgccgccgccgc
4 answers

Taking the question literally, this removes each number n embedded in fields 2 and 4, together with the n characters that follow it.

perl -lane 'for $i (1, 3) {@nums = $F[$i] =~ /(\d+)/g; for $num (@nums) {$F[$i] =~ s/$num.{$num}//}}; print join("\t", @F)'

Other answers delete the number and all subsequent characters that are the same.

To illustrate the difference between my answer and others, use the following input:

6    ccg8qqqqqqqqqqqqggg    10 ccccg3qqqqqqqqqqqggggg

My version outputs this:

6    ccgqqqqggg     10      ccccgqqqqqqqqggggg

The other answers output this:

6    ccgggg    10 ccccgggggg

perl:

perl -pe 's/\d+([^\d\s])\1*//g'

sed:

sed 's/[0-9]\+\([a-z]\)\1*//g'

Matches any run of digits ([0-9]\+), followed by any lowercase letter ([a-z]). \1* then matches any further occurrences of that same letter. The /g (global) modifier ensures the replacement is applied to every match on the line, not just the first.
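
Note that \+ is a GNU extension in basic regular expressions. A portable spelling, sketched below, writes "one or more digits" as [0-9][0-9]* instead. The function name strip_runs is made up for illustration.

```shell
# Hypothetical wrapper: same substitution, with [0-9][0-9]* replacing the GNU-only [0-9]\+
strip_runs() {
  sed 's/[0-9][0-9]*\([a-z]\)\1*//g' "$@"
}

printf '%s\n' '5    cc2ppggg   5    cc2ppggg' | strip_runs
# prints: 5    ccggg   5    ccggg
```

The count columns are left alone because their digits are followed by whitespace rather than a letter, and the original spacing between columns is preserved.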


This may work for you (GNU sed):

sed 'h;s/\S*\s*\(\S*\).*/\1/;:a;s/[^0-9]*\([0-9]\+\).*/sed "s|\1.\\{\1\\}||" <<<"&"/e;ta;H;g;/\n.*\n/bb;s/\(\S*\s*\)\{3\}\(\S*\).*/\2/;ba;:b;s/^\(\S*\s*\)\(\S*\)\([^\n]*\)\n\(\S*\)/\1\4\3/;s/\(\S*\s*\)\n\(.*\)/\2/' file