Schema Evolution With Avro and Hive

Nov 08, 2016

Adam Kawa

Schema evolution of a Hive table backed by the Avro file format allows you to modify the table schema in several “schema-compatible” ways without the need to rewrite all existing data. Thanks to that, your HiveQL queries can read old and new Avro files uniformly using the current table schema. In this blog post I briefly explain this concept and demonstrate a working example of how to use it.

Schema Evolution

There are two main types of “schema compatibility” – backward and forward. With backward compatibility, a new schema can be applied to read data created using previous schemas. With forward compatibility, an older schema can be applied to read data created using newer schemas – this is useful when the schema evolves but (legacy) applications may not be updated immediately and still need to read new files written in newer schemas (perhaps skipping the new fields).
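As a minimal sketch of how this works in Avro (these are illustrative schemas, not the exact ones used by the files later in this post), backward compatibility is typically achieved by giving every newly added field a default value, so that a reader using the new schema can still decode files written with the old one:

{ "type" : "record",
  "name" : "user",
  "fields" : [ {"name" : "id",   "type" : "long"},
               {"name" : "name", "type" : "string"},
               {"name" : "bday", "type" : "string"}
             ]
}

An evolved, backward-compatible version adds a field with a default:

{ "type" : "record",
  "name" : "user",
  "fields" : [ {"name" : "id",      "type" : "long"},
               {"name" : "name",    "type" : "string"},
               {"name" : "bday",    "type" : "string"},
               {"name" : "country", "type" : "string", "default" : "unknown"}
             ]
}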

Below you can find a picture that illustrates backward schema compatibility, where a single Hive query can read Avro files written in four different, but backward-compatible, schemas.

Schema Evolution In Hive and Avro

You can see that a number of operations are allowed as requirements change:

  • Adding a new column to a table (the “country” column in the 2nd file)
  • Dropping a column from a table (the “id” column in the 3rd file)
  • Renaming a column (the “birthday” column in the 4th file)

Example Use-Case

Let’s see schema evolution in practice. We will reproduce exactly the same example that you see in the picture above.

Download input Avro files

You can download the input Avro files here. Each file contains data written in a different, but backward-compatible schema.
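If the download link doesn’t work for you, similar files can be generated with avro-tools – a rough sketch, assuming the first schema shown above is saved as user1.avsc (all file names here are only examples):

$ echo '{"id":1,"name":"adam","bday":"1985-09-30"}' > user1.json
$ java -jar avro-tools-1.7.4.jar fromjson --schema-file user1.avsc user1.json > user1.avro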

Create a Hive table backed by Avro format

The table contains only three columns: id, name and bday.

hive> CREATE EXTERNAL TABLE user_profile (id BIGINT, name STRING, bday STRING) 
      STORED AS avro;
hive> DESCRIBE FORMATTED user_profile;

Load an Avro file into the table

Our input file contains three fields – the same names and types as in our table’s columns.

$ java -jar avro-tools-1.7.4.jar tojson user1.avro

{"id":1,"name":"adam","bday":"1985-09-30"}

We can load this file into the Hive table by simply placing it in the table’s directory in HDFS.

$ hadoop fs -put user1.avro /apps/hive/warehouse/user_profile
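If you prefer not to write into the warehouse directory by hand, an equivalent approach (a sketch – the /tmp path is just an example) is to put the file anywhere in HDFS and let Hive move it into the table’s directory:

$ hadoop fs -put user1.avro /tmp/user1.avro
hive> LOAD DATA INPATH '/tmp/user1.avro' INTO TABLE user_profile;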

We can query our table and see the expected results:

hive> SELECT * FROM user_profile;

1	adam	1985-09-30

Modify the schema by adding an extra column

Let’s now modify the schema of our Hive table by adding an extra column (at the end):

hive> ALTER TABLE user_profile ADD COLUMNS (country STRING);

From now on, the Hive table will use the new schema that contains four columns when reading its data. Because our input file doesn’t contain the country column, we should see NULLs in its place.

hive> SELECT * FROM user_profile;

1	adam	1985-09-30	NULL

Load a second Avro file into the table

We have the second Avro file that contains the country field:

$ java -jar avro-tools-1.7.4.jar tojson user2.avro

{"id":2,"name":"natalia","bday":"1989-08-09","country":"Poland"}

Let’s upload the second Avro file into the table.

$ hadoop fs -put user2.avro /apps/hive/warehouse/user_profile

We can query this table – the new schema will be applied to both the first and the second file.

hive> SELECT * FROM user_profile;

1	adam	1985-09-30	NULL
2	natalia	1989-08-09	Poland

So far so good!

Modify the schema by removing an existing column

We modify the schema of the Hive table again, but this time we remove an existing column, id.

hive> ALTER TABLE user_profile DROP COLUMN id;
...
FAILED: ParseException line 1:30 mismatched input 'column' 
        expecting PARTITION near 'drop' in drop partition statement

Unfortunately, the above command doesn’t work because this syntax is not supported by Hive. The DROP command can only be applied to remove PARTITIONs, not COLUMNs. Let’s try the other method that is shown in the Hive DDL documentation.

hive> ALTER TABLE user_profile 
      REPLACE COLUMNS (name STRING, bday STRING, country STRING);
... 
FAILED: Execution Error, return code 1 
        from org.apache.hadoop.hive.ql.exec.DDLTask. 
        Replace columns is not supported for table default.user_profile. 
        SerDe may be incompatible.

Well, the above command doesn’t work for the Avro format…

The working method is to actually drop the whole table and re-create it with the new set of columns.

hive> DROP TABLE user_profile;
hive> CREATE EXTERNAL TABLE user_profile (name STRING, bday STRING, country STRING) 
      STORED AS avro;
hive> DESCRIBE FORMATTED user_profile;

Please note that this method should only be used with an EXTERNAL Hive table (because the data in HDFS is not removed when the table is dropped).
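If you are not sure whether the table was created as EXTERNAL, DESCRIBE FORMATTED shows the table type (output trimmed to the relevant line):

hive> DESCRIBE FORMATTED user_profile;
...
Table Type:         	EXTERNAL_TABLE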

If we now query the table, the ID column is ignored (not visible), while the values in the three other columns are shown:

hive> SELECT * FROM user_profile;

adam	1985-09-30	NULL
natalia	1989-08-09	Poland

Load the third Avro file into the table

The third Avro file that we want to load into our table doesn’t contain the id field:

$ java -jar avro-tools-1.7.4.jar tojson user3.avro

{"name":"tofi","bday":"2006-06-06","country":"Sweden"}
$ hadoop fs -put user3.avro /apps/hive/warehouse/user_profile

Our Hive query processes the data correctly and prints the expected output:

hive> SELECT * FROM user_profile;

adam	1985-09-30	NULL
natalia	1989-08-09	Poland
tofi	2006-06-06	Sweden

Modify the schema by renaming an existing column

To my knowledge, there is no easy way to rename a column in an Avro-backed Hive table. The way that I know is to re-create the whole table and specify the Avro schema as a part of its definition.

hive> DROP TABLE user_profile;

hive> CREATE EXTERNAL TABLE user_profile
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
   'avro.schema.literal'='{
    "type" : "record",
    "name" : "user",
    "namespace" : "com.getindata.avro",
    "fields" : [ {"name" : "name", "type" : "string"}, 
                 {"name" : "birthday", "type" : "string", "aliases" : ["bday"]}, 
                 {"name" : "country", "type" : "string", "default" : "NULL"}
               ]
}');

hive> DESCRIBE user_profile;

name                	string
birthday            	string
country             	string

Note that we specify that birthday is the new name of the column that was previously named bday – this is done using the aliases property.
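As a side note, instead of embedding the schema with avro.schema.literal, the same schema could be kept in a file in HDFS and referenced with the avro.schema.url table property – a sketch with an example path:

hive> CREATE EXTERNAL TABLE user_profile
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/user_profile.avsc');

Here, we continue with the avro.schema.literal version defined above.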

We are able to read the existing three Avro files correctly:

hive> SELECT * FROM user_profile;

adam	1985-09-30	NULL
natalia	1989-08-09	Poland
tofi	2006-06-06	Sweden

We also have a fourth file that contains a field named birthday (not bday).

$ java -jar avro-tools-1.7.4.jar tojson user4.avro

{"name":"fox","birthday":"2013-03-22","country":"Poland"}

Let’s upload it into the table’s directory:

$ hadoop fs -put user4.avro /apps/hive/warehouse/user_profile

As we can see, Hive handles it well and prints both the birthday and bday fields in the same column!

hive> SELECT * FROM user_profile;

adam	1985-09-30	NULL
natalia	1989-08-09	Poland
tofi	2006-06-06	Sweden
fox	2013-03-22	Poland

As you can see, the support for schema evolution with Hive and Avro is very good.

Play More

Feel free to experiment with schema changes for a Hive table backed by Parquet, as sketched below.
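For example, you could repeat the steps above with a Parquet-backed table and see which of the changes behave the same way – a minimal starting point (the table name is just a suggestion):

hive> CREATE EXTERNAL TABLE user_profile_parquet (id BIGINT, name STRING, bday STRING)
      STORED AS parquet;
hive> ALTER TABLE user_profile_parquet ADD COLUMNS (country STRING);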

Post by Adam Kawa

Adam became a fan of Big Data after implementing his first Hadoop job in 2010. Since then he has been working with Hadoop at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe for two years), Truecaller, an Authorized Cloudera Training Partner, and now at GetInData. He works with technologies like Hadoop, Hive, Spark, Flink, Kafka, HBase and more. He has helped a number of companies ranging from fast-growing startups to global corporations. Adam regularly blogs about Big Data and is also a frequent speaker at major Big Data conferences and meetups. He is the co-founder of Stockholm HUG and the co-organizer of Warsaw HUG.

One Response to Schema Evolution With Avro and Hive

  1. Austin

    Adam, this is the most straightforward article on Hive with Avro I’ve seen. So thank you! My question/hang-up is that manually updating the Hive structure is NOT “schema evolution” – SE as I understand it is that, theoretically, your Hive table can point to several Avro files and its structure will be such that it accommodates all the underlying data. From an operational standpoint, if my Avro data is “evolving” over time and I don’t have any forewarning, then running DDL on my Hive table is not feasible. What are your thoughts on this? Again, thanks so much for the straightforward article.
