Tutorial: Using Presto to combine data from Hive and MySQL in one SQL-like query


Presto (which originated at Facebook) is yet another distributed SQL query engine for Hadoop that has recently generated huge excitement. What makes Presto so interesting, especially in comparison to other existing engines like Cloudera Impala or Hive on Tez? The list of key differentiators includes:

  • Presto seems to be great for ad-hoc analysis – it runs queries quickly and scales from gigabytes to petabytes.
  • A single Presto query can process data from multiple sources e.g. HDFS, MySQL, Cassandra or even Kafka.
  • Large-scale companies like Facebook, Netflix, Airbnb or Dropbox like it and actively contribute to it.

Because (a) the excitement about Presto is evident, and (b) running our own benchmarks is time-consuming, in this tutorial we focus on the feature that allows us to process data from HDFS and MySQL in a single Presto query. This feature is quite unique, because it hasn’t been provided by other open-source projects like Hive and Impala (Impala, however, can process Hive and HBase tables in a single query).

Presto Connectors

First, a bit of technical background. Presto supports pluggable connectors that provide metadata and data for queries. There is a Hive connector (which currently supports the Text, SequenceFile, RCFile, ORC and, in a limited way, Parquet formats), a MySQL connector, and several others. You can even implement a custom one.

A coordinator (a master daemon) uses connectors to get the metadata (such as table schemas) needed to build a query plan. Workers use connectors to get the actual data that they will process.

(Figure: Presto plugins and connectors)

Our use-case

To demonstrate this feature, we use a publicly available 1K-users listening dataset. It contains profile information (such as gender, age, and country) about 1K users, plus the tracks these users have been listening to.

In our tutorial, user profile information (40 kilobytes) will be kept in a MySQL table (something that actually makes sense in real-world scenarios when the user base is small enough), while information about streamed tracks (2.4 gigabytes) will be uploaded to a Hive table.

Training environment

For simplicity and quick iterations, we use an already pre-configured single-node Hadoop YARN cluster with Hive and MySQL. An easy way to get it is to download Hortonworks Sandbox (we used Hortonworks Sandbox 2.2).

Presto installation

Because the installation of Presto is very smooth and well documented, we don’t duplicate these steps here. If you follow the instructions carefully, you should have a running Presto server in less than 10 minutes.

Please note that Presto requires Java 8.

Unfortunately, Presto on YARN is not “officially” supported yet. This means that a Presto daemon should be installed on each node of your cluster (and share resources with NodeManager and YARN containers, if you happen to run YARN on the same nodes). There is, however, an interesting blog post that describes how to run Presto on YARN over Apache Twill.

Integrating Presto with MySQL and Hive

Please follow the instructions needed to integrate Presto with MySQL and Hive. In the case of Hortonworks Sandbox 2.2, the files shown below contain exemplary (and working) configuration settings.
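Since the original file listings are not preserved in this copy, here is a minimal sketch of what the two catalog files typically contain. The property keys are standard Presto connector configuration; the MySQL credentials below are an assumption for the Sandbox, not taken from the original post:

```ini
# etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
```

```ini
# etc/catalog/mysql.properties
# connection-user/password are assumed Sandbox credentials
connector.name=mysql
connection-url=jdbc:mysql://localhost:3306
connection-user=root
connection-password=hadoop
```

The file name (without the .properties suffix) becomes the catalog name you reference in queries.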

When these files are created, you can start the Presto server.
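Assuming a standard installation, the server is controlled with the bundled launcher script (run from the Presto installation directory):

```shell
bin/launcher start    # start the server as a background daemon
bin/launcher run      # alternatively, run in the foreground to watch the logs
```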

Uploading input datasets into MySQL and Hive

First, we create a MySQL database and table. Then we upload the user profile information into it.
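The exact statements did not survive in this copy; a hedged sketch follows, where the column list is an assumption based on the profile fields mentioned earlier and the input file name is hypothetical:

```sql
CREATE DATABASE tutorial;
USE tutorial;

CREATE TABLE user (
  user_id    VARCHAR(64) PRIMARY KEY,
  gender     CHAR(1),
  age        INT,
  country    VARCHAR(64),
  registered DATE
);

-- hypothetical input file with tab-separated profile rows
LOAD DATA LOCAL INFILE 'user-profiles.tsv'
INTO TABLE user
FIELDS TERMINATED BY '\t';
```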

Next, we create a Hive database and table. Then we upload the streamed tracks into it.
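Again a sketch rather than the original listing; the stream table name appears later in the post, while the columns and the local file path are assumptions:

```sql
CREATE DATABASE tutorial;

CREATE TABLE tutorial.stream (
  user_id     STRING,
  played_at   STRING,
  artist_id   STRING,
  artist_name STRING,
  track_id    STRING,
  track_name  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

-- hypothetical local path to the 2.4GB of streamed-track rows
LOAD DATA LOCAL INPATH '/tmp/streams.tsv' INTO TABLE tutorial.stream;
```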

Using Presto to query both MySQL and Hive tables

Finally, we are ready to process data! 🙂

Let’s start the Presto terminal-based interactive shell for running queries.
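The CLI is a self-executing JAR; assuming it was downloaded as presto and the coordinator listens on its default port 8080, it can be launched with:

```shell
./presto --server localhost:8080 --catalog hive --schema default
```

The --catalog and --schema options only set defaults; fully qualified table names can still reach any configured catalog.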

In one of the previous steps, we defined two catalogs:

  • mysql (configured in its properties file under etc/catalog/)
  • hive (configured in its properties file under etc/catalog/)

With Presto, to specify a table that you wish to process, you use the catalog.database.table naming convention, e.g. mysql.tutorial.user.
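For instance, counting rows in each table by its fully qualified name (the stream table name matches the Hive table used in this tutorial):

```sql
SELECT count(*) FROM mysql.tutorial.user;
SELECT count(*) FROM hive.tutorial.stream;
```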

Without further ado, let’s run some basic queries:

  • List the tables in the tutorial database in MySQL
  • Find the two countries with the largest number of users
  • Describe the stream table in Hive
  • Find the top three countries with the largest number of streams – this query actually requires combining data from MySQL (user profiles) and Hive (streamed tracks)
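The original query listings did not survive in this copy; hedged reconstructions of each bullet follow (column names such as user_id and country are assumptions):

```sql
-- List the tables in the tutorial database in MySQL
SHOW TABLES FROM mysql.tutorial;

-- Find the two countries with the largest number of users
SELECT country, count(*) AS users
FROM mysql.tutorial.user
GROUP BY country
ORDER BY users DESC
LIMIT 2;

-- Describe the stream table in Hive
DESCRIBE hive.tutorial.stream;

-- Find the top three countries with the largest number of streams:
-- the join pulls user profiles from MySQL and streams from Hive
SELECT u.country, count(*) AS streams
FROM hive.tutorial.stream s
JOIN mysql.tutorial.user u
  ON s.user_id = u.user_id
GROUP BY u.country
ORDER BY streams DESC
LIMIT 3;
```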

The summary line 2:37 [19.2M rows, 2.36GB] [122K rows/s, 15.3MB/s] means that it took 2 min 37 s to process 19.2M rows (2.36GB in total) at an average speed of 122K rows/s (15.3MB/s).
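As a sanity check, the reported averages can be reproduced with a few lines of arithmetic (the figures come straight from the stats above; the small gap to Presto’s 15.3MB/s is rounding of the inputs):

```python
# Reproduce the averages Presto reports in
# "2:37 [19.2M rows, 2.36GB] [122K rows/s, 15.3MB/s]".
elapsed_s = 2 * 60 + 37                    # 2 min 37 s = 157 s
rows = 19_200_000                          # 19.2M rows
gigabytes = 2.36                           # 2.36 GB read

rows_per_sec = rows / elapsed_s            # ~122,000 rows/s
mb_per_sec = gigabytes * 1024 / elapsed_s  # ~15.4 MB/s (Presto shows 15.3)

print(f"{rows_per_sec / 1000:.0f}K rows/s")  # 122K rows/s
print(f"{mb_per_sec:.1f} MB/s")              # 15.4 MB/s
```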

You can find more statistics if you navigate to the Presto Web UI.

One SQL semantics for all

Since Presto gives you a single SQL-like interface to query data from multiple data sources, one should understand its implications.

Presto uses ANSI SQL syntax and semantics, which, for example, has several differences from HiveQL (which is loosely modeled after MySQL). Be aware of these syntax and semantics incompatibilities when you try to re-run your HiveQL queries in Presto.
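One concrete illustration (a well-known ANSI-vs-HiveQL difference, not an example from the original post): Presto treats double quotes as identifier quoting, so string literals must use single quotes:

```sql
-- ANSI SQL / Presto: single quotes delimit the string literal
SELECT count(*) FROM mysql.tutorial.user WHERE country = 'Poland';

-- Accepted by Hive, but rejected by Presto, which parses "Poland"
-- as a (non-existent) column reference rather than a string:
-- SELECT count(*) FROM user WHERE country = "Poland";
```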


Our first impression of Presto is very positive – the installation process is painless, the tested features work as expected, and no troubleshooting was actually needed.

Although Hive on Tez (or possibly Hive on Spark in the near future) aims to provide a single SQL-like query and BI solution on Hadoop regardless of dataset size, Presto goes one step further and aims to provide a single SQL-like query solution regardless of dataset location, e.g. HDFS, RDBMS, NoSQL, or proprietary data stores. This idea is really cool and it’s great to see a tool that makes it happen.

Because Presto is a relatively new project, it still lacks some useful features: integration with YARN (which means less efficient sharing of resources between Presto and other engines like MapReduce and Spark), the ability to write results back to Hive tables (problematic if you want to integrate Presto into your ETL pipelines), and support for Avro, to name a few. Fortunately, Presto’s adoption and community are growing fast, so hopefully these features will be implemented soon. Stay tuned!

Adam Kawa

Big Data Consultant and Founder at GetInData
Adam became a fan of Hadoop after implementing his first MapReduce job in 2010. Since then he has been working with Hadoop at Netezza, the University of Warsaw, and Spotify (where he operated one of the largest and fastest-growing Hadoop clusters in Europe for two years), and as an Authorized Cloudera Training Partner. Now he works as a Big Data consultant at GetInData.

On December 27, 2014

6 Responses to Tutorial: Using Presto to combine data from Hive and MySQL in one SQL-like query

  1. Guruveer says:

Seems good tool!

  2. Sumanth says:

I always get this error when trying to access a table in MySQL running on my localhost:
Query 20150605_121024_00042_dhedn failed: Catalog mysql does not exist

Please help

  3. Vitaliy says:

    Thank you! Everything really works!

  4. Bob says:

    Great tutorial. I’m very new to Presto and I’m trying to get a feel for if Presto would be a good solution for our company. I haven’t been able to find much information or documentation on this but is it possible to just do a select * on a table from multiple connectors without actually having to join a specific table from one connector to a different specific table from a different connector on a column id? For example, can Presto recognize that two or more MySQL connectors have the same schema allowing you to just query on all of those connectors? Thank you.

