Configuring a Hadoop cluster using Ansible
Ansible is an open-source software provisioning, configuration management, and application-deployment tool that enables infrastructure as code. It runs on many Unix-like systems and can configure both Unix-like systems and Microsoft Windows.
Apache Hadoop is an open-source, Java-based software framework and parallel data processing engine. It enables big data analytics workloads to be broken down into smaller tasks that can be performed in parallel, using an algorithm (such as MapReduce) and distributing the tasks across a Hadoop cluster.
A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform these kinds of parallel computations on big data sets. Unlike other computer clusters, Hadoop clusters are designed specifically to store and analyze massive amounts of structured and unstructured data in a distributed computing environment.
- Create an Ansible role to configure the NameNode of the Hadoop cluster.
- Create an Ansible role to configure the DataNodes of the Hadoop cluster.
- Create an Ansible playbook that configures the Hadoop cluster using the NameNode and DataNode roles.
Let's start with the practical part.
Prerequisites before we start:
1. Create an ansible.cfg configuration file.
2. Set up your Ansible inventory with the groups all, namenode, and datanode.
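As a sketch, a minimal ansible.cfg and inventory for this setup could look like the following. The file paths, user, and IP addresses are placeholders, not values from the article; note that the `all` group is built into Ansible and automatically covers every host in the inventory.

```ini
# ansible.cfg -- minimal example configuration
[defaults]
inventory = /root/inventory
host_key_checking = False
remote_user = root

# /root/inventory -- a separate file; group names match the roles used below
# (the built-in group "all" already contains every host listed here)
[namenode]
192.168.1.10

[datanode]
192.168.1.11
192.168.1.12
```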
ANSIBLE ROLE FOR NAME NODE CONFIGURATION
Step 1: Create an Ansible role for the name node. To create a role in Ansible, use the following command:
$ ansible-galaxy init hadoop_name
Step 2: Create the following tasks to configure the Hadoop name node:
- Create a name node directory "/nn".
- Copy the configured hdfs-site.xml and core-site.xml files into the Hadoop configuration folder of the name node using the template module.
- Format the name node directory using the command module.
- Check the running Java processes, and start the name node if it is not already running.
Place the configured hdfs-site.xml and core-site.xml files in the role's templates folder; only then can the template module copy the files to the managed system.
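The steps above might be sketched in the role's tasks file roughly as follows. This is an illustration, not the article's exact code: the Hadoop configuration path /etc/hadoop and the use of `echo Y` to answer the format prompt are assumptions typical of a Hadoop 1.x RPM install.

```yaml
# roles/hadoop_name/tasks/main.yml -- a sketch under the assumptions above
- name: Create the name node directory
  file:
    path: /nn
    state: directory

- name: Copy the configured hdfs-site.xml and core-site.xml from templates/
  template:
    src: "{{ item }}"
    dest: "/etc/hadoop/{{ item }}"   # assumed Hadoop config location
  loop:
    - hdfs-site.xml
    - core-site.xml

- name: Format the name node directory (format prompts for confirmation)
  shell: echo Y | hadoop namenode -format

- name: Check the running Java processes
  command: jps
  register: jps_out
  changed_when: false

- name: Start the name node if it is not already running
  command: hadoop-daemon.sh start namenode
  when: "'NameNode' not in jps_out.stdout"
```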
ANSIBLE ROLE FOR DATA NODE CONFIGURATION
Step 1: Create an Ansible role for the data node. To create a role in Ansible, use the following command:
$ ansible-galaxy init hadoop_data
Step 2: Create the following tasks to configure the Hadoop data node:
- Create a data node directory "/dn".
- Copy the configured hdfs-site.xml and core-site.xml files into the Hadoop configuration folder of the data node using the template module.
- Check the running Java processes, and start the data node if it is not already running.
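The data node tasks mirror the name node role, except that no formatting step is needed. A sketch, again assuming a Hadoop 1.x layout with configuration in /etc/hadoop:

```yaml
# roles/hadoop_data/tasks/main.yml -- a sketch, not the article's exact code
- name: Create the data node directory
  file:
    path: /dn
    state: directory

- name: Copy the configured hdfs-site.xml and core-site.xml from templates/
  template:
    src: "{{ item }}"
    dest: "/etc/hadoop/{{ item }}"   # assumed Hadoop config location
  loop:
    - hdfs-site.xml
    - core-site.xml

- name: Check the running Java processes
  command: jps
  register: jps_out
  changed_when: false

- name: Start the data node if it is not already running
  command: hadoop-daemon.sh start datanode
  when: "'DataNode' not in jps_out.stdout"
```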
Step 3: Create the variable for the name node IP address in the vars/main.yml file.
Place the configured hdfs-site.xml and core-site.xml files in the role's templates folder; only then can the template module copy the files to the managed system.
Add a Jinja variable for the name node IP address in the core-site.xml file to make the playbook dynamic; it will then automatically pick up the IP address of the namenode group from the Ansible inventory.
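A sketch of what the templated core-site.xml and the variable could look like. The property name and port are typical Hadoop 1.x values, and the variable name `namenode_ip` is an example, not taken from the article:

```xml
<!-- roles/hadoop_data/templates/core-site.xml (sketch) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- namenode_ip is defined in vars/main.yml, e.g.
         namenode_ip: "{{ groups['namenode'][0] }}"
         so it resolves to the first host of the namenode inventory group -->
    <value>hdfs://{{ namenode_ip }}:9001</value>
  </property>
</configuration>
```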
CREATE AN ANSIBLE PLAYBOOK TO INSTALL HADOOP ON ALL THE NODES AND CONFIGURE THE HADOOP NAME NODE AND DATA NODES
Step 1: Create a playbook with the following tasks for all the nodes:
- Install the JDK using the yum module.
- Check for the Hadoop software and install it using the command module, because a force install is needed and the yum module does not offer that option.
- After installation, clear the cached memory so that enough free space is available for the name node to start.
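The first play of the playbook, covering all nodes, might look roughly like this. The JDK package name, the Hadoop RPM path and version, and the use of drop_caches to free cached memory are assumptions for illustration:

```yaml
# First play -- runs on every host; package names and paths are placeholders
- hosts: all
  tasks:
    - name: Install the JDK
      yum:
        name: java-1.8.0-openjdk   # example package name
        state: present

    - name: Check whether Hadoop is already installed
      command: rpm -q hadoop
      register: hadoop_check
      changed_when: false
      failed_when: false

    - name: Force-install Hadoop (yum module has no force option)
      command: rpm -ivh /root/hadoop-1.2.1.rpm --force   # example RPM path
      when: hadoop_check.rc != 0

    - name: Clear cached memory so the name node has room to start
      shell: echo 3 > /proc/sys/vm/drop_caches
```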
Step 2: For the hosts in the namenode group, run the hadoop_name role, which configures the name node; for the hosts in the datanode group, run the hadoop_data role and pass the name node's IP address as a group variable to configure the data nodes.
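The remaining plays simply map the two roles onto the inventory groups. A sketch, assuming the group and role names used earlier (the variable name `namenode_ip` is an example):

```yaml
# Remaining plays -- apply the roles to their inventory groups
- hosts: namenode
  roles:
    - hadoop_name

- hosts: datanode
  vars:
    # pass the name node's IP address to the data node role
    namenode_ip: "{{ groups['namenode'][0] }}"
  roles:
    - hadoop_data
```

The whole playbook can then be run with, for example, `ansible-playbook hadoop.yml` (the filename is a placeholder).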
Step 3: Run the Ansible playbook using the ansible-playbook command.
NAME NODE CONFIGURATION
DATA NODES CONFIGURATION
PRINT HADOOP REPORT
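To print the Hadoop report and verify that the data nodes have joined the cluster, the admin report command can be run on any node (this is the Hadoop 1.x form; newer versions use `hdfs dfsadmin -report`):

```shell
# Prints cluster capacity and the list of live data nodes
hadoop dfsadmin -report
```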
Thank you for reading.