Tuesday, February 18, 2020

Hive and Spark

In this blog post, we take a slight deviation from core issues related to the spring framework and look at an issue that spring programmers might face regularly. Recently I was looking with Spark and found a need to read the data from MySQL, do some processing and write it back to a hive instance. We will look at this issue in this post.
We start by making sure our hive instance is backed by a database. To do this, we do the following.
$ cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml
$ vi gnu/apache-hive-3.1.2-bin/conf/hive-site.xml

We edit the hive-site.xml file and make sure it is configured as below.
We can configure the values to suit our needs. But make sure the MySQL username, password, database URL exists and has relevant permissions.
Now we create another database in MySQL which will contain the data that we need to process. I am calling this database mystuff, with username mystuff, password mystuff123. We need to run following commands in MySQL to make sure everything exists and permissions are appropriate.
create database mystuff;
Query OK, 1 row affected (0.00 sec)

mysql> create user mystuff@localhost identified by 'mystuff123';
Query OK, 0 rows affected (0.00 sec)

mysql> create user mystuff@'%' identified by 'mystuff123';
Query OK, 0 rows affected (0.01 sec)

mysql> grant all on mystuff.* to mystuff@localhost;
Query OK, 0 rows affected (0.00 sec)

mysql> grant all on mystuff.* to mystuff@'%';
Query OK, 0 rows affected (0.00 sec)

Now we create a plain java project in IntelliJ with the following pom.xml file.
Now to look at the problem at hand. We have a table in MySQL with the following structure.
mysql> desc mydata;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int          | YES  |     | NULL    |       |
| k     | varchar(10)  | YES  |     | NULL    |       |
| v     | varchar(255) | YES  |     | NULL    |       |
+-------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

We want to flatten this table such that each key gets converted to a column for each id. So assume our current data is as below.
mysql> select * from mydata;
+------+-------+-------------------+
| id   | k     | v                 |
+------+-------+-------------------+
|    1 | NAME  | John Doe.         |
|    1 | EMAIL | jd@example.com    |
|    2 | NAME  | Jane Doe          |
|    2 | EMAIL | janed@example.com |
+------+-------+-------------------+
2 rows in set (0.00 sec)
We want to load this data and convert it into a flattened table that has three columns i.e. id, name. email. Then we want to populate this into table person into hive with flattened data.
hive (default)> select * from person;
OK
person.id person.email person.name
1 jd@example.com.  John Doe
2 janed@example.com Jane Doe
Time taken: 0.094 seconds, Fetched: 2 row(s)

The following code will perform the above conversion.
Line numbers 11 through 17 create a SparkSession for hive operations. The key instruction here is enableHiveSupport. Lines 20 through 26 create a SparkSession that will be used for MySQL operations. Lines 29 through 37 load the complete contents of the table. Lines 39 through 42 will group the results by the id and pivot the table on the field K. Line 46 through 49 creates a data frame for hiveSession and writes the contents in a table with named person.

No comments:

Post a Comment