Big Data: Part 2 - Let’s differentiate Hadoop Version 1 and Version 2

Deeksha Sharma
Published in Towards Dev
Jan 26, 2024


In Part 1 we covered Hadoop version 1 and touched briefly on Hadoop version 2. Now let’s look at the advantages that Hadoop version 2 introduced:

🥁YARN replaced the Job Tracker of Hadoop version 1. In Hadoop version 2, YARN's Resource Manager manages the cluster's resources, and it has two parts:

→Scheduler (Decides which application gets resources on which node.)

→Applications Manager (Accepts job submissions, obtains the first container for the job's Application Master, and restarts the Application Master if it fails.)

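🥁The Task Tracker of Hadoop version 1 was replaced by the Node Manager, which runs on every worker node. Two concepts matter here:

→Container (The resources, such as memory and CPU, allocated to a particular job on that node.)

→App Master (Runs in a container of its own, one per application. It requests further containers from the Resource Manager and coordinates the job's tasks running in them.)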
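To make the division of work above concrete, here is a minimal sketch in Python. It is a toy model, not YARN's API: the class names, the memory-only resource model, and the "pick the node with the most free memory" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy stand-in for a Node Manager: tracks free memory on one worker node."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id: str, mb: int) -> str:
        # Reserve memory on this node and hand back a container id.
        self.free_mb -= mb
        container_id = f"{app_id}@{self.name}#{len(self.containers)}"
        self.containers.append(container_id)
        return container_id


class ResourceManager:
    """Toy Scheduler: places each request on the node with the most free memory."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id: str, mb: int):
        candidates = [n for n in self.nodes if n.free_mb >= mb]
        if not candidates:
            return None  # in this sketch an unsatisfiable request is simply refused
        node = max(candidates, key=lambda n: n.free_mb)
        return node.launch(app_id, mb)


nodes = [NodeManager("node1", 4096), NodeManager("node2", 8192)]
rm = ResourceManager(nodes)

am_container = rm.allocate("app_1", 1024)    # first container hosts the App Master
task_container = rm.allocate("app_1", 6144)  # the App Master then asks for a task container
too_big = rm.allocate("app_1", 16384)        # no node can fit this request
```

The point of the sketch is only the split of roles: the Resource Manager decides where a container goes, and the Node Manager on that machine is the one that actually hosts it.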

🤠Name Node Drawbacks in Hadoop V1

In Hadoop version 1, there was one name node and one checkpoint node (the secondary name node). The name node keeps the file system image (fs image) in RAM, and every block in HDFS adds an entry to it. So there is a real chance that the name node's RAM gets exhausted, and then, despite free space on the data nodes, we cannot add more data block entries to HDFS.
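A quick back-of-the-envelope calculation shows why RAM runs out before disk does. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of name node memory per file or block object and a 128 MiB block size; both numbers are assumptions for illustration, not measurements.

```python
BYTES_PER_OBJECT = 150          # assumed rule of thumb per file/block entry
BLOCK_SIZE = 128 * 1024**2      # assumed block size: 128 MiB


def namenode_heap_bytes(num_files: int, file_size: int) -> int:
    """Rough estimate of name node memory needed for num_files files of file_size bytes."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)            # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


# The same total amount of data, stored two different ways:
small_files = namenode_heap_bytes(num_files=100_000_000, file_size=1 * 1024**2)
large_files = namenode_heap_bytes(num_files=781_250, file_size=128 * 1024**2)
```

Under these assumptions, 100 million files of 1 MiB need about 30 GB of name node memory, while the same data stored as 781,250 files of 128 MiB needs about 234 MB. The data nodes hold the same bytes either way; only the name node feels the difference.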

To overcome this problem, Hadoop v2 introduced the concept of HDFS Federation.

⛩️HDFS Federation

To scale the name service horizontally, federation uses multiple independent name nodes, each managing its own namespace. For example, one name node could store the data block info of the Finance dept while another stores the data block info of the Sales dept. The data nodes are shared and store blocks for all the name nodes.
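One way to picture federation from the client's side is a mount table that maps a path prefix to the name node that owns it (Hadoop offers ViewFs for this purpose). The sketch below is only a toy lookup in Python; the host names and the `resolve_namenode` helper are hypothetical.

```python
# Hypothetical mount table: path prefix -> name node responsible for that namespace.
MOUNT_TABLE = {
    "/finance": "namenode-finance",
    "/sales": "namenode-sales",
}


def resolve_namenode(path: str) -> str:
    """Return the name node that owns the given path; the longest matching prefix wins."""
    matches = [p for p in MOUNT_TABLE if path == p or path.startswith(p + "/")]
    if not matches:
        raise KeyError(f"no namespace mounted for {path}")
    return MOUNT_TABLE[max(matches, key=len)]


finance_owner = resolve_namenode("/finance/2024/q1.csv")
sales_owner = resolve_namenode("/sales")
```

Because each name node only keeps the entries of its own namespace in RAM, adding another name node adds memory for more block entries without touching the existing ones.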

Another drawback of Hadoop version 1 was that there was no hot standby for the name node, so a name node failure meant downtime for the whole cluster. Hadoop version 2 introduced high availability with an Active name node and a Standby name node. In such a setup the Standby also takes over checkpointing, so a separate Secondary name node is not needed.

All namespace edits are logged to shared storage, so every change made on the active name node can be replayed on the standby name node. The standby name node periodically reads these edit logs and applies them to its own copy of the namespace.
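The idea of replaying logged edits on top of a saved image can be sketched in a few lines of Python. This is a toy model: the edit format (`create`, `add_block`, `delete`) and the dictionary used as an fs image are assumptions for illustration, not HDFS's real on-disk formats.

```python
def apply_edit(namespace: dict, edit: tuple) -> None:
    """Apply one logged edit to an in-memory namespace (path -> list of block ids)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(edit[2])
    elif op == "delete":
        namespace.pop(path, None)


def checkpoint(fsimage: dict, edits: list) -> dict:
    """Merge an old fs image with the edits logged since, producing a new fs image."""
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edits:
        apply_edit(new_image, edit)
    return new_image


fsimage = {"/a.txt": ["blk_1"]}
edits = [
    ("create", "/b.txt"),
    ("add_block", "/b.txt", "blk_2"),
    ("delete", "/a.txt"),
]
new_fsimage = checkpoint(fsimage, edits)
```

A standby that keeps applying edits this way stays close to the active name node's state, which is what makes a fast takeover possible.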

🧩Journal Nodes in Hadoop

Journal Nodes are dedicated servers that maintain copies of the edit logs (the shared storage mentioned in the paragraph above). The active name node writes each change to the journal nodes, and the standby name node asks the journal nodes what has changed. There is more than one journal node, so that the failure of one can be tolerated. The process of merging these edit logs with the fs image is called checkpointing.
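Why more than one journal node helps can be shown with a small majority-write sketch. It assumes an edit counts as safely logged once more than half of the journal nodes acknowledge it; the class and function names are hypothetical, and the real protocol is more careful about recovery than this toy version.

```python
class JournalNode:
    """Toy journal node: stores (txid, edit) pairs while it is up."""

    def __init__(self, name, up=True):
        self.name, self.up, self.edits = name, up, []

    def write(self, txid, edit):
        if not self.up:
            return False
        self.edits.append((txid, edit))
        return True


def log_edit(journal_nodes, txid, edit):
    """Active name node side: the edit is durable once a majority acknowledges it."""
    acks = sum(jn.write(txid, edit) for jn in journal_nodes)
    return acks > len(journal_nodes) // 2


def read_edits(journal_nodes, after_txid):
    """Standby side: ask the first live journal node what changed after after_txid."""
    for jn in journal_nodes:
        if jn.up:
            return [e for e in jn.edits if e[0] > after_txid]
    return []


jns = [JournalNode("jn1"), JournalNode("jn2"), JournalNode("jn3", up=False)]
logged = log_edit(jns, 1, "mkdir /sales")        # 2 of 3 acknowledge: still a majority
seen_by_standby = read_edits(jns, after_txid=0)

jns[1].up = False
logged_again = log_edit(jns, 2, "mkdir /finance")  # only 1 of 3 acknowledges
```

With three journal nodes the cluster keeps logging through the loss of one, but not of two, which is why journal nodes are usually deployed in odd numbers.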


🐯Zookeeper

ZooKeeper is a coordination service that runs on its own set of servers. It keeps a small amount of shared state, such as configuration and status information, consistent across the machines in the Hadoop cluster. Data nodes as well as master nodes can contact ZooKeeper.

ZooKeeper also provides a locking mechanism, so that only one process at a time holds a given lock. This avoids race conditions when several processes want to change the same thing.

ZooKeeper also helps handle the failure of the name node. If the active name node fails, the standby name node is activated automatically through the ZooKeeper Failover Controller (ZKFC).

A failover controller runs alongside each name node, checks its health periodically, and keeps a heartbeat-backed session open with ZooKeeper on its behalf. In that sense, ZooKeeper is a centralized watchman.
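The automatic failover described above can be pictured as a race for a single lock. The sketch below replaces ZooKeeper with a tiny in-memory class; `ToyZooKeeper`, `FailoverController`, and the `tick` method are hypothetical names, and a real deployment relies on ZooKeeper sessions rather than direct method calls.

```python
class ToyZooKeeper:
    """In-memory stand-in for the one feature used here: a lock tied to a session."""

    def __init__(self):
        self.lock_holder = None

    def try_acquire(self, session):
        if self.lock_holder is None:
            self.lock_holder = session
        return self.lock_holder == session

    def session_expired(self, session):
        # When the holder's session goes away, the lock is released.
        if self.lock_holder == session:
            self.lock_holder = None


class FailoverController:
    """Toy failover controller: one per name node, decides active vs. standby."""

    def __init__(self, zk, namenode):
        self.zk, self.namenode = zk, namenode
        self.state = "standby"

    def tick(self, namenode_healthy=True):
        if not namenode_healthy:
            self.zk.session_expired(self.namenode)  # give up the lock
            self.state = "failed"
        elif self.zk.try_acquire(self.namenode):
            self.state = "active"
        else:
            self.state = "standby"
        return self.state


zk = ToyZooKeeper()
fc1, fc2 = FailoverController(zk, "nn1"), FailoverController(zk, "nn2")

states = [
    fc1.tick(),                        # nn1 grabs the lock first
    fc2.tick(),                        # nn2 finds the lock taken
    fc1.tick(namenode_healthy=False),  # nn1 fails, its lock disappears
    fc2.tick(),                        # nn2 takes over
]
```

Whoever holds the lock is the active name node, so at most one controller can report "active" at any moment.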

If the active name node fails, the built-in fencing mechanism ensures that the failed name node can no longer touch the edit logs or communicate with the data nodes.
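One way to keep a failed name node away from the edit logs is to number each "term" of an active name node and have the journal refuse writers from an older term. The sketch below assumes such an epoch scheme; the `FencedJournal` class is hypothetical, and Hadoop also allows other fencing methods (for example, shutting the old node down over SSH) that are not shown here.

```python
class FencedJournal:
    """Toy journal that only accepts writes from the newest epoch it has promised."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        # Called when a name node becomes active; epochs must keep increasing.
        if epoch <= self.promised_epoch:
            raise PermissionError("stale epoch")
        self.promised_epoch = epoch

    def write(self, epoch, edit):
        if epoch < self.promised_epoch:
            raise PermissionError(f"writer with epoch {epoch} is fenced")
        self.edits.append(edit)


journal = FencedJournal()
journal.new_epoch(1)                     # nn1 becomes active
journal.write(1, "mkdir /sales")
journal.new_epoch(2)                     # failover: nn2 becomes active
journal.write(2, "mkdir /finance")

try:
    journal.write(1, "delete /sales")    # nn1 wakes up and tries to write again
    old_writer_blocked = False
except PermissionError:
    old_writer_blocked = True
```

Even if the old name node is still running and believes it is active, its writes are rejected, so the edit log has a single writer.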

So, this sums up this article. Happy Learning!
