Monday, February 24, 2020

Hadoop and Spark Locally

Continuing from the last post, we now look at how to make the program we developed there deployable on a proper Spark/Hadoop cluster. We will not go into the details of setting up these clusters themselves, but rather into how to make sure that a program we developed earlier can run as a job in a cluster.
We will continue with the setup on our local machine. I am using a Mac, so the instructions are written for that, but most of them apply to any other platform.
If Spark is processing data from a database and writing into Hive, pretty much what we did in the last post would work. The problem arises if some of the data being processed exists as flat files. If we want to submit our jobs to a Spark cluster, we cannot use local files, because the jobs run on the cluster nodes, which do not have access to our local file system.
The best approach is to either use an HDFS cluster or deploy a single-node HDFS on your machine. Here I enumerate the steps to set up a single-node HDFS cluster on a Mac OS X machine.
  • Download the Hadoop distribution for your machine here.
  • The Hadoop distribution is available as a .tar.gz file, which you can expand in any directory on your machine. The expansion creates a directory named hadoop-x.y.z, where x.y.z is your Hadoop version. Set the environment variable HADOOP_HOME to the full pathname of this directory.
  • Add $HADOOP_HOME/bin to the PATH variable.
  • Now we need to update the Hadoop configuration files.
$ cd $HADOOP_HOME/etc/hadoop
$ vi core-site.xml

We update the file with the following properties.
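A typical single-node entry for core-site.xml looks like the sketch below. The address hdfs://localhost:9000 is an assumption taken from the standard Apache Hadoop single-node guide; use whatever host and port suit your machine.

```xml
<configuration>
  <!-- Default file system that Hadoop clients (and Spark jobs) talk to. -->
  <!-- Assumption: NameNode on localhost, port 9000. -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

With this setting, a path such as hdfs://localhost:9000/some/file refers to a file inside the local HDFS instance rather than the Mac's own file system.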

$ vi hdfs-site.xml

We update the file with the following properties.
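For hdfs-site.xml, a minimal sketch, assuming we only want HDFS to work with a single DataNode. The default replication factor is 3, which a one-node setup cannot satisfy, so it is commonly lowered to 1.

```xml
<configuration>
  <!-- Assumption: single-node setup, so each block is stored only once. -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```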
$ vi mapred-site.xml

We update the file with the following properties.
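For mapred-site.xml, the usual minimal entry tells MapReduce to run on YARN. This is a sketch based on the standard single-node guide; your Hadoop version may need additional properties.

```xml
<configuration>
  <!-- Assumption: MapReduce jobs are scheduled through YARN rather than run locally. -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```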
$ vi yarn-site.xml

We update the file with the following properties.
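For yarn-site.xml, a minimal sketch, assuming we only need the shuffle service that MapReduce jobs on YARN rely on. Other YARN settings are left at their defaults here.

```xml
<configuration>
  <!-- Assumption: enable the MapReduce shuffle auxiliary service on the NodeManager. -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```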
Now we start Hadoop.
$ sbin/
Now our Spark jobs can access files stored in HDFS.
The next post will go into more detail on how to process files in Spark.
