This is the way I am looking to process my data.. from pig..
A = Load 'data' ...
B = FOREACH A GENERATE my.udfs.extract(*);
or
B = FOREACH A GENERATE my.udfs.extract('flag');
So basically extract either has no arguments or takes an argument... 'flag'
On my udf side...
@Override
public DataBag exec(Tuple input) throws IOException {
//if flag == true
//do this
//else
// do that
}
Now how do i implement this in pig?
The preferred way is to use DEFINE.
,,Use DEFINE to specify a UDF function when:
...
The constructor for the function takes string parameters. If you need to use different constructor parameters for different calls to the function you will need to create multiple defines – one for each parameter set"
E.g:
Given the following UDF:
public class Extract extends EvalFunc<String> {
private boolean flag;
public Extract(String flag) {
//Note that a boolean param cannot be passed from script/grunt
//therefore pass it as a string
this.flag = Boolean.valueOf(flag);
}
public Extract() {
}
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0) {
return null;
}
try {
if (flag) {
...
}
else {
...
}
}
catch (Exception e) {
throw new IOException("Caught exception processing input row ", e);
}
}
}
Then
define ex_arg my.udfs.Extract('true');
define ex my.udfs.Extract();
...
B = foreach A generate ex_arg(); --calls extract with flag set to true
C = foreach A generate ex(); --calls extract without any flag set
Another option (hack?) :
In this case the UDF gets instantiated with its noarg constructor and you pass the flag you want to evaluate in its exec method. Since this method takes a tuple as a parameter you need to first check whether the first field is the boolean flag.
public class Extract extends EvalFunc<String> {
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0) {
return null;
}
try {
boolean flag = false;
if (input.getType(0) == DataType.BOOLEAN) {
flag = (Boolean) input.get(0);
}
//process rest of the fields in the tuple
if (flag) {
...
}
else {
...
}
}
catch (Exception e) {
throw new IOException("Caught exception processing input row ", e);
}
}
}
Then
...
B = foreach A generate Extract2(true,*); --use flag
C = foreach A generate Extract2();
I'd rather stick to the first solution as this smells.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With