I’ve written a lot about Hadoop-related technologies, and about MapReduce frameworks in particular, implementing the same data-querying solution across many of them. Until now, however, I haven’t said much about how to get data into Hadoop in the first place.
A common problem is moving data from a relational datastore (like MySQL or Postgres) to Hadoop HDFS in order to process it with Hive, MapReduce or Spark.
There’s a great open source tool called Sqoop that provides these capabilities, letting you move data from MySQL, Postgres, Oracle, and Teradata. Just to make things confusing, there is also Sqoop2, which has slightly different database version support but does include SQL Server.
The goal of this article is to walk through a practical data import example with a couple of datasets that need moving from a relational database to HDFS.
Let’s pretend that I have an e-commerce site that stores:
- User information (id, email, language, location)
- Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)
This information is stored in a relational database (MySQL).
I need to export that data to a Hadoop cluster to combine it with other datasets, use it in MapReduce jobs, and serve as a base for reporting. In fact, if you want to see how someone might use this information, take a look at my personal blog, where I use these exact datasets in my many MapReduce guides:
- Hive HDFS Query Walkthrough
- Pig Query Walkthrough with Examples
- Java Map Reduce Guide
- Hadoop Python MapReduce Tutorial for Beginners
- Cascading MapReduce Hadoop Tutorial with Examples
- Type-Safe Scalding MapReduce Tutorial - Joining and Summarizing Data
- Apache Spark Scala Tutorial with Examples
Why Should I Import Relational Data into Hadoop?
If you’ve started using Hadoop for other data storage and analysis tasks you’ll find that much of this analysis could be improved with access to reliable user or product data. In our example maybe user and transaction records could be used in coordination with website log data to understand purchasing patterns, come up with recommendations, or to drive an inventory supply algorithm.
By building an automated, reliable way to populate Hadoop with relational data, you’ll be one step closer to a rich and fully populated data lake and a world of possibilities.
An Introduction to Sqoop
While there are ways to stream data (like log data) directly into Hadoop, data that resides in a relational database is usually imported with a batch read/write tool such as Sqoop.
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. Sqoop can be used to import data from a relational database such as MySQL or Oracle into the Hadoop Distributed File System (HDFS). It can also be used to transform data stored in Hadoop as part of a MapReduce and then export the data back into a relational database.
Sqoop is a Java utility that is run from the command line: you give it configuration, such as the connection string for your database, and it sets up a series of MapReduce jobs to extract and process the data.
Sqoop2 has a client-server architecture and transfers data between more types of sources: structured, semi-structured and unstructured (e.g. Cassandra, MongoDB).
Citing Cloudera on Sqoop: “Sqoop 2 is essentially the future of the Apache Sqoop project. However, since Sqoop 2 currently lacks some of the features of Sqoop 1, Cloudera recommends you use Sqoop 2 only if it contains all the features required for your use case; otherwise, continue to use Sqoop 1.”
Also of note: the client/server architecture makes Sqoop2 more complex to get started with.
In this tutorial I’ll show examples from both versions of Sqoop, but won’t go into too much detail on the varied ways you can interact with Sqoop2. Instead I’ll mostly focus on command-line interaction with both pieces of software.
Amazon’s Elastic MapReduce (EMR) has Sqoop preinstalled. Outside of EMR you have to install it yourself, which you can do by downloading a Sqoop (or Sqoop2) release compatible with your Hadoop version from the Sqoop website. Some Hadoop distributions (like CDH) also include it by default; check your distribution’s docs to find out.
To start, you need to add a JDBC driver for your database to Sqoop’s lib directory (it isn’t bundled, usually due to licensing restrictions), e.g. the MySQL JDBC Connector.
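With a standard tarball install, something like this does it (a minimal sketch: SQOOP_HOME and the jar name depend on your install and driver version):

```bash
# copy the MySQL JDBC driver into Sqoop's lib directory
# (SQOOP_HOME and the jar name are examples; adjust to your setup)
cp mysql-connector-java-*-bin.jar $SQOOP_HOME/lib/
```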
As it has a client-server architecture, Sqoop2 requires installing both client and server following the installation guide.
MySQL Database Schema
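For reference, here is a minimal sketch of the two tables. The column names follow the fields listed earlier; the table names, column types and database name are assumptions for the example.

```sql
-- hypothetical schema for the example e-commerce database
CREATE TABLE users (
  id        INT PRIMARY KEY,
  email     VARCHAR(255),
  language  VARCHAR(10),
  location  VARCHAR(255)
);

CREATE TABLE transactions (
  transaction_id   INT PRIMARY KEY,
  product_id       INT,
  user_id          INT,
  purchase_amount  DECIMAL(10,2),
  item_description VARCHAR(255)
);
```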
Don’t forget to allow your user access to the database from your Hadoop machines: create a user with the required permissions to connect directly from the Hadoop cluster, as sketched below.
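Something along these lines (user name, password and database name are placeholders; in practice you would restrict the host pattern to your cluster’s addresses rather than using '%'):

```sql
-- create a dedicated user and allow it to connect from the Hadoop nodes
CREATE USER 'sqoop_user'@'%' IDENTIFIED BY 'secret';
GRANT SELECT ON ecommerce.* TO 'sqoop_user'@'%';
FLUSH PRIVILEGES;
```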
There are 2 main options to export data:
- Export the entire table
- Export the results of a query
And there are a number of ways to run Sqoop:
- From the command line
- Via a Java API
- Via an interactive shell (Sqoop2 only)
Sqoop : Command Line
To collect all of the data from the table:
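A minimal sketch, where the host, database name, credentials and target directory are placeholders (-P prompts for the password):

```bash
sqoop import \
  --connect jdbc:mysql://mysql-host/ecommerce \
  --username sqoop_user -P \
  --table users \
  --split-by id \
  --target-dir /data/users \
  -m 1
```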
To preselect what to import with a mysql query:
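Again a sketch with placeholder values and a hypothetical filter; note the literal $CONDITIONS token in the WHERE clause, explained below:

```bash
sqoop import \
  --connect jdbc:mysql://mysql-host/ecommerce \
  --username sqoop_user -P \
  --query 'SELECT * FROM transactions WHERE purchase_amount > 100 AND $CONDITIONS' \
  --split-by transaction_id \
  --lines-terminated-by '\n' \
  --target-dir /data/transactions \
  -m 1
```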
In this example, the literal $CONDITIONS token must appear in the WHERE clause: Sqoop replaces it at runtime with the range conditions it uses to split the work across map tasks, and you add your own filters alongside it.
Most variables are obvious, but there are a couple of non-obvious flags here:
- -m 1 sets the number of map tasks the job uses. This controls how quickly the data is copied, as well as the number of resulting output chunks.
- --split-by defines the table column used to break the work (and the resulting files) up between map tasks. By default this is the primary key, which is usually what you want, but you can specify a different column.
- --lines-terminated-by defines the record terminator for the resulting files. If you’re going to load the data into Hive, the newline character \n is the best option here.
Sqoop2 : Shell
While Sqoop2 performs the same operations as Sqoop, the way you interact with it is different: instead of running a single command line task with flags, you communicate with Sqoop2 through an interactive shell.
Establishing a database connection:
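A rough sketch of what this looks like in the Sqoop2 shell. The syntax follows the older 1.99.x releases (newer releases use links rather than connections), the prompts are abbreviated, and all values are placeholders:

```
sqoop:000> set server --host sqoop2-host --port 12000 --webapp sqoop
sqoop:000> create connection --cid 1
Name: mysql-connection
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://mysql-host/ecommerce
Username: sqoop_user
Password: ******
```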
Register job to import data:
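Again a 1.99.x-style sketch; the shell prompts for the table and output settings (abbreviated here), and the flag names differ in later releases:

```
sqoop:000> create job --xid 1 --type import
Name: import-users
Table name: users
Output directory: /data/users
```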
Run the job:
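Then start it by id. Again the exact command varies between Sqoop2 releases (some use submission start, newer ones start jobs by name):

```
sqoop:000> start job --jid 1
```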
So the interactive shell makes Sqoop2 a little easier to get started with from the client side, as you don’t need to remember a set of esoteric command line flags, but it’s certainly more work to fill in all the details as prompted.
The nice thing about the command line client is that you can specify your configuration in a .sqoop file and have the client run that directly. This file is literally just a list of Sqoop client commands, so you’ll have to learn those first.
That said, I think it’s still a whole lot more complex than the original Sqoop, especially for newcomers: if you’re new to Hadoop, it isn’t clear what the best way to get started with these tools is.
Sqoop : Java API
This is the API for original Sqoop (not Sqoop2). The options resemble the command line options almost 1-1.
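Here’s a minimal sketch using the classic SqoopOptions and ImportTool classes. The connection details are placeholders; in a real job you’d pull them from configuration rather than hard-coding them:

```java
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.ImportTool;

public class UsersImport {
    public static void main(String[] args) {
        SqoopOptions options = new SqoopOptions();
        options.setConnectString("jdbc:mysql://mysql-host/ecommerce"); // placeholder host/db
        options.setUsername("sqoop_user");
        options.setPassword("secret");
        options.setTableName("users");
        options.setTargetDir("/data/users");
        options.setNumMappers(1);

        // run the import; returns 0 on success
        int ret = new ImportTool().run(options);
        System.exit(ret);
    }
}
```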
The code is very self-explanatory.
Sqoop2 : Java API
This code is a little more involved and abstracts many of the common database concepts behind ‘forms’.
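A rough sketch against the Sqoop2 1.99.x client API. The form and input names (connection.connectionString, table.tableName, output.outputDirectory and so on) are recalled from that era’s docs and may differ in your release, so treat this as an outline rather than copy-paste code:

```java
import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MConnection;
import org.apache.sqoop.model.MConnectionForms;
import org.apache.sqoop.model.MJob;
import org.apache.sqoop.model.MJobForms;

public class Sqoop2Import {
    public static void main(String[] args) {
        SqoopClient client = new SqoopClient("http://sqoop2-host:12000/sqoop/");

        // describe the MySQL connection via the connector's 'forms'
        MConnection connection = client.newConnection(1); // 1 = generic JDBC connector id
        MConnectionForms connectorForms = connection.getConnectorPart();
        connection.setName("mysql-connection");
        connectorForms.getStringInput("connection.jdbcDriver").setValue("com.mysql.jdbc.Driver");
        connectorForms.getStringInput("connection.connectionString").setValue("jdbc:mysql://mysql-host/ecommerce");
        connectorForms.getStringInput("connection.username").setValue("sqoop_user");
        connectorForms.getStringInput("connection.password").setValue("secret");
        client.createConnection(connection);

        // describe the import job and where its output should land
        MJob job = client.newJob(connection.getPersistenceId(), MJob.Type.IMPORT);
        MJobForms jobConnectorForms = job.getConnectorPart();
        MJobForms jobFrameworkForms = job.getFrameworkPart();
        job.setName("import-users");
        jobConnectorForms.getStringInput("table.tableName").setValue("users");
        jobFrameworkForms.getStringInput("output.outputDirectory").setValue("/data/users");
        client.createJob(job);

        // submit the job to the Sqoop2 server
        client.startSubmission(job.getPersistenceId());
    }
}
```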
The Sqoop2 API is a little more involved, but still looks similar to the original Sqoop API. It does still require you to have the server running.
The output from either Sqoop tool should be the data inside a folder on HDFS. This can then serve as input for downstream processing with MapReduce or Spark.
If further processing will be done in Hive, I recommend working with the data as an external table in Hive. That way the data does not have to be controlled by Hive, but is instead treated as mostly read-only.
Sqoop actually comes with some built-in flags for creating Hive tables (--create-hive-table and more), but for some reason does not support creating external tables.
Personally I prefer using external tables for all data that was originally ‘owned’ by a different system – this prevents accidental deletion of data that is slow to import or hard to recover, and it makes explicit to users that the data is not owned by Hive (which it isn’t).
Actually making an external table semi-automatically with Sqoop isn’t fun; here’s the best guide I could find, which isn’t great.
You can of course manually create external tables with Hive:
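For example, for the users data imported above (the column types and HDFS location are assumptions matching the earlier examples; adjust the field terminator to whatever you used during the import):

```sql
CREATE EXTERNAL TABLE users (
  id        INT,
  email     STRING,
  language  STRING,
  location  STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/data/users';
```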
Both Sqoop and Sqoop2 provide DB <> Hadoop data import and export. While Sqoop2 is the ‘future’ of Sqoop it is more complicated and harder to get started with. Frankly I prefer the simpler semantics of Sqoop, but Sqoop2 does come with advantages, such as the centralized coordination of connections and ongoing jobs. Either way I would recommend using Sqoop as simply the means of data transfer, and managing workflows and connections with a robust workflow management framework like Luigi that you can also use for downstream jobs.
The biggest problem I found with both tools is that it’s hard to figure out what documentation to look through, and even after you find the right thing it can be hard to get through. This is a problem generally with Hadoop projects, but the introduction of a second, very different, version of Sqoop makes this doubly onerous, especially as in some places there is no naming distinction made between Sqoop and Sqoop2.