Part one covered working with both primitive types and embedded data structures, but the UDF interfaces discussed there are limited to a single output value per input row.
In this post we will look at user-defined table functions, represented by the org.apache.hadoop.hive.ql.udf.generic.GenericUDTF interface. This function type is more complex to implement, but it allows us to output multiple rows and multiple columns for a single input (nifty!).
The table that will be used for demonstration is called people. It has one column, name, which contains the names of individuals and couples.
The data is stored in a file called people.txt.
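As an illustration (this is sample data, not the original file), people.txt might contain lines like:

```
John Smith
Nick and Nicole Smith
Bloggs
Jane Doe
```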
We can upload this file to HDFS into a directory called people:
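Assuming a standard Hadoop installation with the `hadoop` client on the path, the upload might look like this (paths are illustrative):

```shell
# create the target directory and copy the file into it
hadoop fs -mkdir people
hadoop fs -put people.txt people/
```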
Then load up the Hive shell and create the Hive table:
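A minimal table definition matching the one-column layout above might look like the following (the HDFS location is a hypothetical path, not necessarily the one used in the original project):

```sql
CREATE EXTERNAL TABLE people (name STRING)
LOCATION '/user/matthew/people';  -- hypothetical HDFS path
```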
The Value of UDTF
The UDF and GenericUDF functions from the previous article manipulate a single row of data. They return a single element, and they must always return a value.
This is not convenient for every data-processing task. Since Hive can store data of many kinds, sometimes we do not want exactly one row of output for a given input row. Perhaps we wish to output a few rows per input row, or no rows at all. As an example, consider what the Hive built-in function explode can do.
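For instance, explode turns a single array-valued input into one output row per element:

```sql
-- explode() is a built-in UDTF: one input, three output rows
SELECT explode(array(1, 2, 3)) AS element;
```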
Similarly, perhaps we also wish to output several columns of data, instead of simply returning a single value.
Both these things we can accomplish with a UDTF.
A Practical Example
Let's suppose that we would like to create a cleaner table of people's names. The new table will have:
Separate columns for First Name and Surname.
No records that lack both a first and a last name (i.e., values with no separating whitespace).
Separate rows for each person in a couple (e.g. Nick and Nicole Smith).
To accomplish this goal, we will implement the org.apache.hadoop.hive.ql.udf.generic.GenericUDTF API.
The UDTF takes a string as a parameter and returns a struct with two fields. As with the GenericUDF, we have to manually configure all of the input and output object inspectors Hive needs in order to understand the inputs and outputs.
We identify a PrimitiveObjectInspector for the input string.
Defining the output object inspectors requires us to define both field names, and the object inspectors required to read each field (in our case, both fields are strings).
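Put together, the initialize method might look something like the sketch below (the class and field names are illustrative; the full implementation is on GitHub). It validates that the single argument is a string, keeps its object inspector for later, and declares a two-field string struct as the output:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class NameParserGenericUDTF extends GenericUDTF {

    // Inspector for the single string argument, saved for use in process()
    private PrimitiveObjectInspector stringOI = null;

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args)
            throws UDFArgumentException {
        if (args.length != 1
                || args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory()
                        != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentException(
                "NameParserGenericUDTF() takes exactly one string argument");
        }
        stringOI = (PrimitiveObjectInspector) args[0];

        // Output: a struct with two string fields, name and surname
        List<String> fieldNames = new ArrayList<>(2);
        List<ObjectInspector> fieldOIs = new ArrayList<>(2);
        fieldNames.add("name");
        fieldNames.add("surname");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // process() and close() omitted here
}
```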
The bulk of our logic resides in the processInputRecord function, which is fairly straightforward. Keeping this logic separate makes it easier to test, without having to wrestle with object inspectors.
Finally, once we have a result we can forward it; this registers that object as an output row for Hive to process.
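As a sketch of that processing logic, here it is written as a plain static method (the class name and exact rules are illustrative) so it can be exercised without any Hive machinery. A two-token name yields one (name, surname) pair, a "X and Y Surname" couple yields two pairs sharing the surname, and anything else is dropped:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone version of the record-processing logic,
// free of Hive object inspectors so it can be unit tested directly.
public class NameSplitter {

    // Splits a raw name into (firstName, surname) pairs:
    //   "John Smith"            -> one pair
    //   "Nick and Nicole Smith" -> two pairs sharing the surname
    //   "Smith" (no whitespace) -> no pairs, record is dropped
    public static List<String[]> processInputRecord(String name) {
        List<String[]> result = new ArrayList<>();
        if (name == null || name.trim().isEmpty()) {
            return result;
        }
        String[] tokens = name.trim().split("\\s+");
        if (tokens.length == 2) {
            result.add(new String[] { tokens[0], tokens[1] });
        } else if (tokens.length == 4 && tokens[1].equals("and")) {
            // A couple: both people share the trailing surname
            result.add(new String[] { tokens[0], tokens[3] });
            result.add(new String[] { tokens[2], tokens[3] });
        }
        return result;
    }
}
```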
Using our function
We can build our function and register it in Hive:
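If the project is built with Maven (an assumption; any build tool that produces a jar works), packaging is a single step:

```shell
mvn package   # produces the jar under target/
```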
Then use it from the Hive shell:
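Registering and invoking the function might look like this (the jar path, function name, and package are illustrative):

```sql
ADD JAR /tmp/hive-extensions.jar;  -- hypothetical jar path
CREATE TEMPORARY FUNCTION process_names AS 'com.example.NameParserGenericUDTF';

-- Each input row yields zero, one, or two (name, surname) rows
SELECT process_names(name) AS (name, surname) FROM people;
```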
It is best to divide testing of a UDTF into two parts: testing the data-processing logic itself, and then testing the function as a whole in Hive. The latter is always recommended because of the complexity introduced by the different elements, input formats, and data.
Below is an example unit test for splitting a person's name into first name and surname; again, this can be found in full on GitHub:
By now you should be a pro at customizing Hive functions.