PIG: Get all the tuples from a grouped bag

I use PIG to generate groups from tuples as follows:

a1, b1
a1, b2
a1, b3
...

->

a1, [b1, b2, b3]
...

That part is easy and works well. My problem is the next step: from the resulting groups, I would like to generate all pairs of values in each group's bag:

a1, [b1, b2, b3]

->

b1,b2
b1,b3
b2,b3

It would be easy if I could nest "foreach" statements: first iterate over each group, and then over the bag inside it.

I suspect I am misunderstanding the concept, and I would be grateful for an explanation.

Thanks.

3 answers

It looks like you need a Cartesian product of the bag with itself. To get it, use FLATTEN(bag) twice in the same generate statement.

The code:

inpt = load '.../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as value_bag;
result = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2; 
dump result;
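
The expected output in the question lists each pair once (b1,b2 but not b2,b1 or b1,b1), whereas the double FLATTEN above emits every ordered combination. A minimal sketch of one way to narrow it down, assuming the values can be compared with `<`; the `chararray` types and the `crossed` and `pairs` aliases are assumptions added here for illustration and are not part of the original answer:

```pig
-- Same load as above, with explicit types so that v1 < v2 is a string comparison (assumed schema)
inpt = load '.../group.txt' using PigStorage(',') as (id:chararray, val:chararray);
grp = group inpt by id;
id_grp = foreach grp generate group as id, inpt.val as value_bag;

-- Cross product of each bag with itself, as in the answer
crossed = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2;

-- Keep only rows where v1 sorts strictly before v2:
-- this drops self-pairs (b1,b1) and mirrored duplicates (b2,b1)
pairs = filter crossed by v1 < v2;
dump pairs;
```

If a bag can contain the same value more than once, this filter would still repeat that pair, so a `distinct` on the bag beforehand may be needed.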

To keep the result small, you can restrict the bag, for example with TOP(...), before applying FLATTEN:

inpt = load '....group.txt' using PigStorage(',')  as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    limited_bag = TOP(50, 0, values); -- all sorts of filtering could be done here
    generate id, FLATTEN(limited_bag) as v1, FLATTEN(limited_bag) as v2; 
};
dump result;

You can also filter the bag before applying FLATTEN:

inpt = load '..../group.txt' as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    l = filter values by val == 'b1' or val == 'b2';
    generate id, FLATTEN(l) as v1, FLATTEN(values) as v2; 
};
result = filter result by v1 != v2;


You can use Pig's GROUP ALL operator for this:

A  = -- Some bag
B  = -- Another bag

groupedB = group B ALL;
result   = foreach A GENERATE 
    TOTUPLE(*), groupedB.$1;

-- Will generate
((a1), {(b1, b2, b3)})
((a2), {(b1, b2, b3)})
((a3), {(b1, b2, b3)})
...