Surprising Apache Sqoop-to-Hive Gotchas

In this blog post, I describe a few surprising gotchas related to the import of a MySQL table into Apache Hive using Apache Sqoop 1.4.5 (the most recent version supported by vendors like Hortonworks or Cloudera at the time of writing this post).

Real-world scenario

In my simple (yet real-world) use-case, I have a MySQL table and I want to copy it as an external table in Hive. The data in Hive should be stored in a columnar format such as Parquet or ORC.

Although Apache Sqoop was created many years ago and it seems to be a very mature tool, you’ll see that, even in this simple use-case, it still lacks useful features here and there.

Two possible ways of importing

Basically, there are two ways to import a MySQL table into Apache Hive using Apache Sqoop:

  • One-shoot command import – just adding the --hive-import option to your Sqoop command
  • Two-hop import – first dumping the content of the table to files in HDFS using a Sqoop command, and then creating Hive table (or partition) on top of it using a Hive command.

Both of them have own pros, cons and surprising limitations.

Gotchas related to the “one-shoot command” import

Because of simplicity, this seems to be the first choice to go. Let’s review it.

1. Non-textual file formats are not supported in Apache Sqoop to Apache Hive import (in 1.4.5)

There is no support for Avro (see SQOOP-324), ORC or Parquet today. The support for Parquet was added to Sqoop 1.4.6 (see SQOOP-1366), but Sqoop 1.4.6 is not shipped with HDP 2.2 or CDH 5.3.

So if you want to use a one-shoot import from MySQL to Hive today, you must use text format. The text is not only more time consuming to parse and more storage heavy file format, but it’s also vulnerable to many issues caused by dirty data. For example, the occurrence of the new line character (\n) or field delimiter (e.g. \u001 or ,) in any individual column of your imported rows might cause problems later when reading them in Hive. Fortunately, you can overcome this problem by using the --hive-delims-replacement option to replace these delimiters with your own string when importing table or even drop them by using the --hive-drop-import-delims option. Although it helps you avoid parsing errors, the dataset imported to Hive will slightly differ from the original one in MySQL. Changing the delimiters also impacts performance, because Sqoop will have to parse the mysqldump‘s output into fields and transcode them into the user-specified delimiter instead of just copying the data directly from mysqldump‘s output into HDFS.

2. External tables/partitions are not supported

See SQOOP-816 to learn more.

3. Partitions are created under the default location

When loading data into some partition, Sqoop will create a subdirectory for a partition inside the directory where the table is located. Unfortunately, you can’t override it, because of the --target-dir and --warehouse-dir options are not respected. See SQOOP-1293 to learn more.

4. Existing Hive table can’t be truncated before the import

The --delete-target-dir option is not respected. This means that if you run your import more than once, you risk duplication of the dataset.

See SQOOP-1293 to learn more.

In other words…

The one-shot command doesn’t satisfy the requirement stated in this blog post. Hari Sekhon already mentioned this in the last comment of SQOOP-1293 saying that the following feature is needed:

Import a table from a database into Hive as an external table (or partition) placing data in a given path in HDFS (in columnar format), deleting the directory if it exists to avoid cumulative data build up (i.e a total table refresh operation from source).

Cons of the “two-hop” import

Although some multi-step workarounds can satisfy our requirements, they have their own cons…

1. Inelegant because it forces a multi-step procedure

It’s no longer a just a single command… If you have multiple commands, then something can fail in the middle and the cleanup/recovery procedure becomes a bit more complex.

2. You must write code to create a Hive table and partitions

… and probably integrate it with your scheduling tool e.g. Oozie (and/or Falcon), Azkaban, Luigi.

3. Requires an additional job to change the file format

Sqoop 1.4.5 allows you to dump data to Avro or SequenceFile format (using the --as-avrodatafile option, for example), but you still need to implement and run an identity map-only MapReduce job (e.g. in Pig or Hive) to convert your dataset to a columnar format such as Parquet or ORC.

If you decide to dump data as Avro files (e.g. to make parsing of your records easier), you won’t be able to … leverage the direct (fast) import. It turns out that MySQL direct import currently supports only text output format. Therefore, an option like --as-avrodatafile isn’t supported with the --direct option when importing a MySQL table. This makes the --direct option much less exciting for the MySQL users.

Apache Sqoop, Apache Hive, HCatalog and ORC

Updated on June 25th, 2015

As Venkat Ranganathan wrote in the comment, the import from RDBMS to Hive in ORC format via HCatalog is supported. I tested it and it works fine:

$ hive -e "CREATE TABLE concert (id int, artist string, day date, location string) STORED AS ORCFILE;"

When running Sqoop import to Hive, the main difference is that we use the --hcatalog-database and --hcatalog-table options instead of the --hive-table option as described in Sqoop-HCatalog Integration.

$ sqoop import \
--connect jdbc:mysql://$(hostname)/streamrock \
--username kawaa -P \
--table concert \
--hcatalog-database streamrock \
--hcatalog-table concert

The integration of Sqoop and HCatalog doesn’t work with Parquet, though (due to HIVE-7502).


Although the “two-hop” import is less elegant, it gives us more flexibility. Therefore, it’s a better fit for the use-case that I described at the beginning of this blog post.

Post by Adam Kawa

Adam became a fan of Big Data after implementing his first Hadoop job in 2010. Since then he has been working with Hadoop at Spotify (where he had proudly operated one of the largest and fastest-growing Hadoop clusters in Europe for two years), Truecaller, Authorized Cloudera Training Partner and finally now at GetInData. He works with technologies like Hadoop, Hive, Spark, Flink, Kafka, HBase and more. He has helped a number of companies ranging from fast-growing startups to global corporations. Adam regularly blogs about Big Data and he also is a frequent speaker at major Big Data conferences and meetups. He is the co-founder of Stockholm HUG and the co-organizer of Warsaw HUG.

8 Responses to Surprising Apache Sqoop-to-Hive Gotchas

  1. Naveen Arumugam

    Thanks for the excellent write-up! I always subscribe to two-phase transfer. What I did not know is how the partition is handled and can it differ from primary key (eg: date partition derived from datetime column)?

    Hopefully, they introduce RDB to Hive as external table under desired hive schema in desired file format. 🙁

  2. Venkat Ranganathan

    Apache Sqoop 1.4.4+ has supported importing into Hive using HCatalog interfaces providing storage format agnostic imports and exports. In 1.4.5, Sqoop was enhanced to support all Hive datatypes that have been introduced since hive 0.7 (yes, the hive-import option in Sqoop only supports basic data types and only text formats). Sqoop supports both static and dynamic partitioning keys.

    Please see the following

    External table import is not available currently (a hive limitation that is being addressed0. Parquet support for HCatalog is being discussed in a separate JIRA as well

    So Sqoop can take advantage of those once Hive support is there.

    • Hi! The presentation about Sqoop and HCatalog integration is very informative (thanks for sharing!). I will try the conversion to ORC and update my post accordingly. This feature sounds great!

    • Karthik

      I am trying to load partitioned tables dynamically using sqoop. It is failing with null pointer exception!!!
      I have created the below partition table as below:
      create table customer(cust_id int, name string) partitioned by (country string) stored as orc;
      In my sqoop i have setup as below:
      sqoop import –connect <> –query “select cust_id, name, country from cust_table” –hcatalog-database testdb –hcatalog-table customer –split by cust_id –num-mappers 20

      What am I doing wrong?

  3. Pravin Yadav

    First way : I am trying to import data from mysql to Hive tables with sqoop. It is importing data file in HDFS but not loading data properly in Hive tables as their delimiters are different. Mysql is structured and Hive is t seperated fields.
    Second Way : But I tried second way which is imported data from mysql to HDFS using sqoop and then loaded data into HIVE tables from HDFS its working fine. Can you explain why first way is not working properly.

  4. ronnie

    I have to settle using SQOOP2 to import MSSQL to HDFS to HIVE the “two hop” import due to some problem with SQOOP. However, the table imported to HDFS is splitted to a few files with not meaningful file name such as f9b07da3-e4be-4007-bcbe-dcf882f59e5f.txt, how can I set in hadoop so that it can be something like this file1.txt, file2.txt….fileN.txt so that I can merge them and import to HIVE?

Leave a Reply

Your email address will not be published. Required fields are marked *

Blue Captcha Image