Surprising Sqoop-to-Hive Gotchas
In this blog post, I describe a few surprising gotchas related to the import of a MySQL table into Hive using Sqoop 1.4.5 (the most recent version supported by vendors like Hortonworks or Cloudera at the time of writing this post).
In my simple (yet real-world) use-case, I have a MySQL table and I want to copy it as an external table in Hive. The data in Hive should be stored in a columnar format such as Parquet or ORC.
Although Sqoop was created many years ago and it seems to be a very mature tool, you’ll see that, even in this simple use-case, it still lacks useful features here and there.
Two possible ways of importing
Basically, there are two ways to import a MySQL table into Hive using Sqoop:
- One-shoot command import – just adding the --hive-import option to your Sqoop command
- Two-hop import – first dumping the content of the table to files in HDFS using a Sqoop command, and then creating Hive table (or partition) on top of it using a Hive command.
Both of them have own pros, cons and surprising limitations.
Gotchas related to the “one-shoot command” import
Because of simplicity, this seems to be the first choice to go. Let’s review it.
1. Non-textual file formats are not supported in Sqoop to Hive import (in 1.4.5)
So if you want to use a one-shoot import from MySQL to Hive today, you must use text format. Text is not only more time consuming to parse and more storage heavy file format, but it’s also vulnerable to many issues caused by dirty data. For example, the occurrence of the new line character (\n) or field delimiter (e.g. \u001 or ,) in any individual column of your imported rows might cause problems later when reading them in Hive. Fortunately, you can overcome this problem by using the --hive-delims-replacement option to replace these delimiters with your own string when importing table, or even drop them by using the --hive-drop-import-delims option. Although it helps you avoid parsing errors, the dataset imported to Hive will slightly differ from the original one in MySQL. Changing the delimiters also impacts performance, because Sqoop will have to parse the mysqldump‘s output into fields and transcode them into the user-specified delimiter instead of just copying the data directly from mysqldump‘s output into HDFS.
2. External tables/partitions are not supported
See SQOOP-816 to learn more.
3. Partitions are created under the default location
When loading data into some partition, Sqoop will create a subdirectory for a partition inside the directory where the table is located. Unfortunately, you can’t override it, because the --target-dir and --warehouse-dir options are not respected. See SQOOP-1293 to learn more.
4. Existing Hive table can’t be truncated before the import
The --delete-target-dir option is not respected. This means that if you run your import more than once, you risk duplication of the dataset.
See SQOOP-1293 to learn more.
In other words…
The one-shot command doesn’t satisfy the requirement stated in this blog post. Hari Sekhon already mentioned this in the last comment of SQOOP-1293 saying that a following feature is needed:
Import a table from database into Hive as external table (or partition) placing data in a given path in HDFS (in columnar format), deleting the directory if it exists to avoid cumulative data build up (i.e a total table refresh operation from source).
Cons of the “two-hop” import
Although some multi-step workarounds can satisfy our requirements, they have their own cons…
1. Inelegant because it forces a multi-step procedure
It’s no longer a just a single command… If you have multiple commands, then something can fail in the middle and the cleanup/recovery procedure becomes a bit more complex.
2. You must write code to create a Hive table and partitions
… and probably integrate it with your scheduling tool e.g. Oozie (and/or Falcon), Azkaban, Luigi.
3. Requires an additional job to change the file format
Sqoop 1.4.5 allows you to dump data to Avro or SequenceFile format (using the --as-avrodatafile option, for example), but you still need to implement and run an identity map-only MapReduce job (e.g. in Pig or Hive) to convert your dataset to a columnar format such as Parquet or ORC.
If you decide to dump data as Avro files (e.g. to make parsing of your records easier), you won’t be able to … leverage the direct (fast) import. It turns out that MySQL direct import currently supports only text output format. Therefore, an option like --as-avrodatafile isn’t supported with the --direct option when importing a MySQL table. This makes the --direct option much less exciting for the MySQL users.
Sqoop, Hive, HCatalog and ORC
Updated on June 25th, 2015
As Venkat Ranganathan wrote in the comment, the import from RDBMS to Hive in ORC format via HCatalog is supported. I tested it and it works fine:
$ hive -e "CREATE TABLE concert (id int, artist string, day date, location string) STORED AS ORCFILE;"
When running Sqoop import to Hive, the main difference is that we use the --hcatalog-database and --hcatalog-table options instead of the --hive-table option as described in Sqoop-HCatalog Integration.
$ sqoop import \ --connect jdbc:mysql://$(hostname)/streamrock \ --username kawaa -P \ --table concert \ --hcatalog-database streamrock \ --hcatalog-table concert
The integration of Sqoop and HCatalog doesn’t work with Parquet, though (due to HIVE-7502).
Although the “two-hop” import is less elegant, it gives us more flexibility. Therefore, it’s a better fit for the use-case that I described in the beginning of this blog post.