
Schema Evolution With Avro and Hive


Schema evolution of a Hive table backed by the Avro file format allows you to modify the table schema in several “schema-compatible” ways without rewriting all existing data. Thanks to that, your HiveQL queries can read old and new Avro files uniformly using the current table schema. In this blog post I briefly explain this concept and demonstrate a working example of how to use it.

Schema Evolution

There are two main types of “schema-compatibility” – backward and forward. With backward compatibility, a new schema can be applied to read data created using previous schemas. With forward compatibility, an older schema can be applied to read data created using newer schemas – this is useful when the schema evolves but (legacy) applications are not updated immediately and still need to read files written in newer schemas (perhaps skipping new fields).

Below you can find a picture that illustrates backward schema compatibility, where a single Hive query can read Avro files written in four different, but backward-compatible, schemas.

[Figure: Schema Evolution in Hive and Avro]

You can see that a number of operations are allowed as requirements change:

  • Adding a new column to a table (the “country” column in the 2nd file)
  • Dropping a column from a table (the “id” column in the 3rd file)
  • Renaming a column (the “birthday” column in the 4th file)

Playing with Schema Evolution of a Hive Table with Avro

Let’s see the schema evolution in practice. We will reproduce exactly the same example that you see in the picture above.

Download input Avro files

You can download the input Avro files here. Each file contains data written in a different, but backward-compatible, schema.

Create a Hive table backed by Avro format

The table contains only three columns: id, name and bday.
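A minimal sketch of such a table, assuming Hive 0.14+ (which supports STORED AS AVRO); the table name and HDFS location are illustrative:

    CREATE EXTERNAL TABLE users (
      id   INT,
      name STRING,
      bday STRING
    )
    STORED AS AVRO
    LOCATION '/user/hive/warehouse/users';

With STORED AS AVRO, Hive derives the Avro schema from the column definitions (treating the fields as nullable), so no hand-written schema file is needed yet.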

Load an Avro file into the table

Our input file contains three fields – the same names and types as in our table’s columns.

We can load this file into the Hive table by simply placing it in the table’s directory in HDFS.
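For example, assuming the first file is named user1.avro and the table lives at the illustrative location above:

    hadoop fs -put user1.avro /user/hive/warehouse/users/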

We can query our table and see the expected results:
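With the illustrative names above, that is simply:

    SELECT * FROM users;
    -- each record from user1.avro appears with its id, name and bday values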

Modify the schema by adding an extra column

Let’s now modify the schema of our Hive table by adding an extra column (at the end):
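A sketch of this step, assuming the same hypothetical users table and a Hive version where ALTER TABLE works for Avro-backed tables:

    ALTER TABLE users ADD COLUMNS (country STRING);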

From now on, the Hive table will use the new schema, which contains four columns, when reading its data. Because our input file doesn’t contain the country column, we should see NULLs in its place.
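For example:

    SELECT id, name, bday, country FROM users;
    -- rows that come from the first file show NULL in the country column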

Load a second Avro file into the table

We have the second Avro file that contains the country field:
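Its writer schema might look like this (the record name and exact types are illustrative):

    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "id",      "type": "int"},
        {"name": "name",    "type": "string"},
        {"name": "bday",    "type": "string"},
        {"name": "country", "type": "string"}
      ]
    }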

Let’s upload the second Avro file into the table.
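Again, placing it in the table’s HDFS directory is enough (the file name is assumed):

    hadoop fs -put user2.avro /user/hive/warehouse/users/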

We can query this table – the new schema will be applied to both the first and the second file.

So far so good!

Modify the schema by removing an existing column

We modify the schema of the Hive table again, but this time we remove the existing column id.
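The attempted statement would look like this (illustrative table name; as explained next, this is not valid HiveQL):

    -- attempt to drop the id column
    ALTER TABLE users DROP COLUMN id;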

Unfortunately, the above command doesn’t work because this syntax is not supported by Hive. DROP can only be applied to remove PARTITIONs, not COLUMNs. Let’s use the other method that is shown in the Hive DDL documentation.
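That documented method is REPLACE COLUMNS, which rewrites the column list without the dropped column:

    ALTER TABLE users REPLACE COLUMNS (name STRING, bday STRING, country STRING);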

Well, the above command doesn’t work for the Avro format…

The working method is to actually drop the whole table and re-create it with the new Avro schema.
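A sketch with the same illustrative names; the id column simply disappears from the new definition:

    DROP TABLE users;  -- the Avro files stay in HDFS because the table is EXTERNAL

    CREATE EXTERNAL TABLE users (
      name    STRING,
      bday    STRING,
      country STRING
    )
    STORED AS AVRO
    LOCATION '/user/hive/warehouse/users';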

Please note that this method should only be used when you create an EXTERNAL Hive table (because the data in HDFS is not removed when the table is dropped).

If we now query the table, the id column is ignored (not visible), while the values in the three other columns are shown:

Load the third Avro file into the table

The third Avro file that we want to load into our table doesn’t contain the id field:
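Illustratively, its writer schema has only three fields:

    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name",    "type": "string"},
        {"name": "bday",    "type": "string"},
        {"name": "country", "type": "string"}
      ]
    }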

Our Hive query processes the data correctly and prints the expected output:

Modify the schema by renaming an existing column

To my knowledge, there is no easy way to rename a column in a Hive table. The method that I know is to re-create the whole table and specify the Avro schema as part of its definition.
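A sketch of this re-creation, using the same illustrative names and an explicit avro.schema.literal (country is declared nullable with a default so that older files missing it can still be read):

    DROP TABLE users;

    CREATE EXTERNAL TABLE users
    STORED AS AVRO
    LOCATION '/user/hive/warehouse/users'
    TBLPROPERTIES ('avro.schema.literal' = '{
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name",     "type": "string"},
        {"name": "birthday", "type": "string", "aliases": ["bday"]},
        {"name": "country",  "type": ["null", "string"], "default": null}
      ]
    }');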

Note that we specify that birthday is the new name of the column that was previously named bday – this is expressed using the aliases property.

We are able to read the existing three Avro files correctly:

We have the fourth file that contains a field named birthday (not bday).
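In that file’s writer schema, the field is declared under its new name, along the lines of:

    {"name": "birthday", "type": "string"}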

Let’s upload it into the table’s directory:

As we can see, Hive handles it well, presenting both birthday and bday values in the same column!

As you can see, the support for schema evolution with Hive and Avro is very good.

Play More

Feel free to experiment with schema changes for a Hive table backed by Parquet.

Adam Kawa
Big Data Consultant and Founder at GetInData

Adam became a fan of Hadoop after implementing his first MapReduce job in 2010. Since then he has been working with Hadoop at Netezza, the University of Warsaw, and Spotify (where he operated one of the largest and fastest-growing Hadoop clusters in Europe for two years), and as an Authorized Cloudera Training Partner. Now he works as a Big Data consultant at GetInData.

On November 8, 2016
