HDFS file-system

In the first part, I describe the HDFS file-system in detail. Then I go through some configuration concepts based on the Hadoop cluster we set up in the previous part. Finally, I will go through tuning the HDFS file-system.

Concepts

HDFS is a Java-based file system that provides scalable and reliable (fault-tolerant) data storage. It can support close to a billion files and blocks, and up to around 200 PB of storage. Perhaps its main advantage is its ability to serve a variety of data-access applications, coordinated by YARN, which I will cover in a separate post.

Design:

Similar to many file-systems such as Lustre and BeeGFS, HDFS stores the metadata and the actual data separately. The server that stores the metadata is called the NameNode, and the actual data is stored on slave nodes, which are called DataNodes.

All nodes communicate with each other over TCP. Therefore, if you are using InfiniBand as a high-speed interconnect, make sure that IPoIB is working properly.

Unlike many existing file-systems such as Lustre, which need a RAID configuration on the storage nodes (DataNodes) to protect against data loss, HDFS has a built-in replication mechanism that replicates data blocks across several DataNodes. Besides providing data protection, this strategy also increases the aggregate transfer rate (bandwidth) and enables the data-locality concept (computation running near the data).

Important: HDFS is by nature append-only (write-once, read-many-times). This means that in order to modify any part of a file that has already been written to the namespace, the whole file has to be rewritten and then replace the old file. In most cases this does not matter, since big-data applications are usually built on the assumption that data is not changed or modified. HBase, which is an open-source counterpart of Google's Bigtable, solves this problem, as I will discuss in a separate post. For now, just keep in mind that HBase usually runs on top of HDFS.
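To make the write-once behaviour concrete, here is a minimal sketch (with a hypothetical file path) of how an "edit" has to be emulated: the file is copied back to the local file-system, modified there, and re-uploaded, overwriting the old copy with -put -f.

[root@hadoop-master ~]# hdfs dfs -get /tmp/example.txt ./example.txt        # copy the file out of HDFS
[root@hadoop-master ~]# echo "one more line" >> ./example.txt               # modify it locally
[root@hadoop-master ~]# hdfs dfs -put -f ./example.txt /tmp/example.txt     # re-upload, replacing the whole file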

Normally we have only one namespace, managed by a single NameNode. In order to increase the availability of the NameNode we can implement a High-Availability (HA) NameNode, which works in active-passive mode (no Secondary NameNode is needed in that case). In order to keep a backup of the metadata on a separate node, and to offload the CPU-intensive checkpointing work from the NameNode, we can also have a Secondary NameNode.

The NameNode stores the entire file-system metadata in memory. This limits the number of blocks, files, and directories supported by the file system to what can be accommodated in the memory of a single NameNode. Therefore I strongly suggest giving the NameNode as much RAM as possible. For very large clusters, such as those at Yahoo and Facebook, the concept of multiple namespaces is useful; it was introduced in newer Hadoop versions as HDFS Federation, which I will not cover here.

NameNode

The NameNode takes care of the whole namespace (the hierarchy of files and directories) by mapping blocks to DataNodes. Keep in mind that files are divided into blocks and blocks are replicated to different DataNodes. By default each block is replicated to 3 DataNodes (a replication factor of 3); however, we can increase the replication factor in order to get higher read bandwidth.
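As a quick illustration (with a hypothetical file path), the replication factor of an existing file shows up in the second column of hdfs dfs -ls and can be changed per file with hdfs dfs -setrep; the cluster-wide default comes from the dfs.replication property:

# default replication factor configured for this cluster
[root@hadoop-master ~]# hdfs getconf -confKey dfs.replication
# the second column of the listing is the replication factor of the file
[root@hadoop-master ~]# hdfs dfs -ls /tmp/example.txt
# raise the replication of this file to 5 and wait until the extra replicas exist
[root@hadoop-master ~]# hdfs dfs -setrep -w 5 /tmp/example.txt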


The complete namespace is represented on the NameNode by inodes. Each inode stores the attributes of a file or directory (permissions, modification and access times, disk-space quotas) and the block location(s) of the file on the DataNodes. These inodes, which essentially define the NameNode metadata, are collectively called the image (fsimage-* files, in our case located at /hadoop/hdfs/namenode/current).

The entire metadata (the image, or simply the inodes) is kept in RAM, and all requests are served from this in-memory snapshot of the metadata. Obviously we need a mechanism to persist the image by writing it to the local file-system in case the NameNode crashes. This is part of the design, and the file that is written is called a checkpoint (fsimage-*). Besides checkpointing on the NameNode itself (which happens by default), there is an option to do it on a separate node as well. That separate node is called the Secondary NameNode (Checkpoint Node).
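If you want to look at these files yourself, the sketch below lists the checkpoint directory used in our setup and converts an fsimage to readable XML with the Offline Image Viewer; the exact fsimage file name is a placeholder.

# fsimage-* (checkpoints) and edits-* (journal) live side by side here
[root@hadoop-master ~]# ls /hadoop/hdfs/namenode/current/
# dump a checkpoint to XML with the Offline Image Viewer (file name is a placeholder)
[root@hadoop-master ~]# hdfs oiv -p XML -i /hadoop/hdfs/namenode/current/fsimage_0000000000000000042 -o /tmp/fsimage.xml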

Description: checkpointing is the process of merging the content of the most recent fsimage with all edits applied after that fsimage was written, in order to create a new fsimage. As a result, the old fsimage files are deleted unless the configuration says otherwise.


Several important points need to be fully understood regarding the checkpoint concept:

a. The checkpoint (fsimage-*) is never changed by the NameNode while the NameNode is running. This implies that changes to the namespace must be stored somewhere other than the checkpoint. The NameNode does this by recording the changes in a file called the journal in its local file-system (edits-*, in our case located at /hadoop/hdfs/namenode/current).

b. The journal file keeps growing on the NameNode, and if it grows very large the risk of corruption (and the time needed to replay it) grows with it. So we need to periodically merge the checkpoint (fsimage-*) and the journal (edits-*), create new checkpoint files, and start completely new, empty journal files.

c. There are three ways to initiate this merge phase: the administrator can trigger it by configuring the NameNode appropriately; the NameNode can simply be restarted, since it merges during restart; or a Checkpoint Node (Secondary NameNode) can be used.

d. The Checkpoint Node periodically downloads the checkpoint and journal files from the NameNode (in our case from /hadoop/hdfs/namenode/current), merges them locally, writes the result to its own local file-system (in our case /hadoop/hdfs/namesecondary/current), and uploads the new checkpoint back to the NameNode, which then starts new, initially empty, journal file(s).

e. How often the Secondary NameNode (Checkpoint Node) contacts the NameNode is driven by configuration parameters. The start of the checkpoint process on the Checkpoint Node is controlled by a configuration parameter, HDFS Maximum Checkpoint Delay, which specifies the maximum delay between two consecutive checkpoints (see the commands after this list).
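In the plain HDFS configuration, the Ambari setting "HDFS Maximum Checkpoint Delay" corresponds, as far as I know, to the dfs.namenode.checkpoint.period property (seconds between checkpoints), with dfs.namenode.checkpoint.txns as an additional transaction-count trigger. The sketch below reads these values and also shows how an administrator can force a checkpoint manually by saving the namespace from safemode.

# read the checkpoint triggers from the active configuration
[root@hadoop-master ~]# hdfs getconf -confKey dfs.namenode.checkpoint.period
[root@hadoop-master ~]# hdfs getconf -confKey dfs.namenode.checkpoint.txns
# force a checkpoint manually (as the hdfs superuser): enter safemode,
# merge the journal into a new fsimage, then leave safemode again
[root@hadoop-master ~]# su - hdfs
[hdfs@hadoop-master ~]$ hdfs dfsadmin -safemode enter
[hdfs@hadoop-master ~]$ hdfs dfsadmin -saveNamespace
[hdfs@hadoop-master ~]$ hdfs dfsadmin -safemode leave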

Summary:

The NameNode keeps the entire namespace image in RAM. The persistent record of the image stored in the NameNode's local native filesystem is called a checkpoint. The NameNode records changes to HDFS in a write-ahead log called the journal in its local native filesystem. The locations of block replicas are not part of the persistent checkpoint.

Each client-initiated transaction is recorded in the journal, and the journal file is flushed and synced before the acknowledgment is sent to the client. The checkpoint file is never changed by the NameNode; a new file is written when a checkpoint is created during restart, when requested by the administrator, or by the CheckpointNode described in the next section. During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal. A new checkpoint and an empty journal are written back to the storage directories before the NameNode starts serving clients.

High-Availability NameNode

This is a node that eliminates the NameNode as a single point of failure, a concept introduced in newer Hadoop versions. It provides automatic fail-over in case the active NameNode crashes.

It is a completely different concept from the Secondary NameNode (Checkpoint Node), and the HA and Secondary NameNode roles cannot be used together. As I said previously, the Checkpoint Node does not provide any fail-over capability; instead it takes over the CPU-intensive task of merging fsimage-* (the inodes, also called the checkpoint) and edits-* (the journal) and sends the result back to the NameNode, which of course also indirectly gives the metadata extra protection.

DataNode (slave node)

It is responsible for storing the actual data in HDFS. The NameNode and the DataNodes are in constant communication through heartbeats. The assumption is that DataNodes will fail at some point; as a result, the NameNode needs to re-replicate the blocks that were on the failed node to other nodes in the cluster. This is why we have the concept of block replication in HDFS, which by default is three.
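To see this block placement in action, hdfs fsck can report, for any path, which blocks a file consists of and on which DataNodes each replica lives (hypothetical path again); its summary also counts under-replicated and missing blocks, which is what the NameNode acts on after a DataNode failure.

# show per-file blocks, replica counts and the DataNodes holding each replica
[root@hadoop-master ~]# hdfs fsck /tmp/example.txt -files -blocks -locations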

Installation and Configuration

1. Directories: during the HDFS installation with Ambari, some directories were created. We could change these directories based on our needs, but I left them at their defaults.

a. Namenode directories

In my case, /hadoop/hdfs/namenode on the local native file-system of the master node is devoted to the NameNode, which stores the file-system image there.

b. DataNode directories

On each slave node we have the following directories, which store the actual data:

  • /grid/0/hadoop/hdfs/data/
  • /grid/1/hadoop/hdfs/data/
  • /grid/2/hadoop/hdfs/data/
  • /grid/3/hadoop/hdfs/data/

I had already created the /grid/0-3 partitions with ext3 on each slave node during the kickstart installation, as can be seen here:

  • part /grid/0 --size 900000 --fstype=ext3 --ondisk=sda
  • part /grid/1 --size 900000 --fstype=ext3 --ondisk=sda
  • part /grid/2 --size 900000 --fstype=ext3 --ondisk=sda
  • part /grid/3 --size 900000 --fstype=ext3 --ondisk=sda

and during the installation with Ambari we can point HDFS at them. The end result can be seen on one of the slave nodes, for example:

  • [root@node-01 ~]# parted -l
  • Model: ATA ST4000NC001-1FS1 (scsi)
  • Disk /dev/sda: 4001GB
  • Sector size (logical/physical): 512B/4096B
  • Partition Table: gpt
  • Disk Flags: pmbr_boot
  • Number Start End Size File system Name Flags
  • 1 1049kB 2097kB 1049kB bios_grub
  • 2 2097kB 526MB 524MB ext4
  • 3 526MB 944GB 944GB ext3
  • 4 944GB 1888GB 944GB ext3
  • 5 1888GB 2832GB 944GB ext3
  • 6 2832GB 3775GB 944GB ext3
  • 7 3775GB 3901GB 126GB linux-swap(v1)
  • 8 3901GB 4001GB 99.6GB ext4

  • [root@node-01 ~]# df -hT
  • /dev/sda5 ext3 865G 176M 821G 1% /grid/2
  • /dev/sda6 ext3 865G 259M 821G 1% /grid/3
  • /dev/sda4 ext3 865G 297M 821G 1% /grid/1
  • /dev/sda3 ext3 865G 243M 821G 1% /grid/0
  • ..

c. Secondary Namenode (Checkpoint Node) directories

Ambari creates /hadoop/hdfs/namesecondary, which stores the checkpoint image.
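The three directory sets above are driven by standard HDFS properties (written by Ambari into hdfs-site.xml); a quick way to double-check what the running configuration actually points at is:

# where the NameNode keeps its fsimage/edits files
[root@hadoop-master ~]# hdfs getconf -confKey dfs.namenode.name.dir
# where each DataNode stores block data (comma-separated list of directories)
[root@hadoop-master ~]# hdfs getconf -confKey dfs.datanode.data.dir
# where the Secondary NameNode writes its merged checkpoints
[root@hadoop-master ~]# hdfs getconf -confKey dfs.namenode.checkpoint.dir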

Namespace directories

We should not confuse the local Linux file-system with the HDFS file-system; they have separate namespaces. So we simply have two root directories, one for our Linux file-system and one for HDFS. We cannot simply use Linux commands to work with the HDFS namespace; we need special commands, as you will see later.

In order to see the complete HDFS namespace we can use the following command:

  • [root@hadoop-master ~]# hdfs dfs -ls /
  • Found 9 items
  • drwxrwxrwx - yarn hadoop 0 2016-08-19 15:23 /app-logs
  • drwxr-xr-x - hdfs hdfs 0 2016-08-19 15:30 /apps
  • drwxr-xr-x - yarn hadoop 0 2016-08-19 15:19 /ats
  • drwxr-xr-x - hdfs hdfs 0 2016-08-19 15:03 /hdp
  • drwxr-xr-x - mapred hdfs 0 2016-08-19 15:23 /mapred
  • drwxrwxrwx - mapred hadoop 0 2016-08-19 15:23 /mr-history
  • drwxrwxrwx - spark hadoop 0 2016-08-23 16:23 /spark-history
  • drwxrwxrwx - hdfs hdfs 0 2016-08-19 15:23 /tmp
  • drwxr-xr-x - hdfs hdfs 0 2016-08-19 15:30 /user

As can be seen, we have several directories inside the root namespace of the HDFS file-system. Please understand that the complete namespace (all files and directories) shown here is distributed across all slave nodes and has nothing to do with the local native Linux file-system.

As can be seen, there is a directory called /user that is used for hosting the users' home directories, so you can compare it with the /home directory of the native local file-system. I am going to make a directory for myself called hossein here:

  • [root@hadoop-master ~]# hdfs dfs -mkdir /user/hossein
  • mkdir: Permission denied: user=root, access=WRITE, inode="/user/hossein":hdfs:hdfs:drwxr-xr-x

As can be seen, we get a permission-denied error. The reason is that only the hdfs user and the members of the hdfs group are allowed to write to this directory. There are two options to make it work: either we switch to the hdfs user simply with 'su - hdfs' and then create the directory, or we add our root account to the hdfs group. I chose the second option, adding root to the hdfs group.

  • [root@hadoop-master ~]# grep 'hdfs:x' /etc/group
  • hdfs:x:1004:hdfs
  • [root@hadoop-master ~]# usermod -a -G hdfs root
  • [root@hadoop-master ~]# grep 'hdfs:x' /etc/group
  • hdfs:x:1004:hdfs,root

So now we try again, and everything should be OK:

  • [root@hadoop-master ~]# hdfs dfs -mkdir /user/hossein
  • [root@hadoop-master ~]# hdfs dfs -ls /
  • Found 9 items
  • drwxrwxrwx - yarn hadoop 0 2016-08-19 15:23 /app-logs
  • drwxr-xr-x - hdfs hdfs 0 2016-08-19 15:30 /apps
  • drwxr-xr-x - yarn hadoop 0 2016-08-19 15:19 /ats
  • drwxr-xr-x - hdfs hdfs 0 2016-08-19 15:03 /hdp
  • drwxr-xr-x - mapred hdfs 0 2016-08-19 15:23 /mapred
  • drwxrwxrwx - mapred hadoop 0 2016-08-19 15:23 /mr-history
  • drwxrwxrwx - spark hadoop 0 2016-08-23 17:50 /spark-history
  • drwxrwxrwx - hdfs hdfs 0 2016-08-19 15:23 /tmp
  • drwxr-xr-x - hdfs hdfs 0 2016-08-23 17:50 /user

Explanation: as we can see, the /user directory does not have write permission for hdfs group members (keep in mind that there is an hdfs user as well as an hdfs group with the same name); the mkdir above presumably succeeded anyway because, in an Ambari/HDP setup, the hdfs group typically also acts as the HDFS superuser group, and superusers bypass the permission checks. If we also want to give the hdfs group members write permission, we can use the following command (I haven't done it):

hdfs dfs -chmod -R 775 /user

  • [root@hadoop-master ~]# hdfs dfs -ls /user
  • Found 6 items
  • drwxrwx--- - ambari-qa hdfs 0 2016-08-19 15:00 /user/ambari-qa
  • drwxr-xr-x - hcat hdfs 0 2016-08-19 15:30 /user/hcat
  • drwx------ - hdfs hdfs 0 2016-08-23 17:19 /user/hdfs
  • drwxr-xr-x - hive hdfs 0 2016-08-19 15:30 /user/hive
  • drwxr-xr-x - root hdfs 0 2016-08-23 17:50 /user/hossein
  • drwxrwxr-x - spark hdfs 0 2016-08-19 15:03 /user/spark

Then we change the ownership of /user/hossein to the hossein user:

  • [root@hadoop-master ~]# hdfs dfs -chown hossein:hossein /user/hossein
  • [root@hadoop-master ~]# hdfs dfs -ls /user
  • Found 6 items
  • drwxrwx--- - ambari-qa hdfs 0 2016-08-19 15:00 /user/ambari-qa
  • drwxr-xr-x - hcat hdfs 0 2016-08-19 15:30 /user/hcat
  • drwx------ - hdfs hdfs 0 2016-08-23 17:19 /user/hdfs
  • drwxr-xr-x - hive hdfs 0 2016-08-19 15:30 /user/hive
  • drwxr-xr-x - hossein hossein 0 2016-08-23 17:50 /user/hossein
  • drwxrwxr-x - spark hdfs 0 2016-08-19 15:03 /user/spark

We can test it by creating a local file and then copying it into the HDFS namespace.

  • [root@hadoop-master ~]# hdfs dfs -put testfile.txt /user/hossein
  • [root@hadoop-master ~]# hdfs dfs -ls /user/hossein
  • Found 1 items
  • -rw-r--r-- 3 root hossein 46 2016-08-23 18:20 /user/hossein/testfile.txt
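As a sketch of the full round trip (with a new, hypothetical file name), the local file is created with ordinary Linux tools, only the hdfs dfs commands touch the HDFS namespace, and -cat reads the file back out of HDFS:

# create a small file on the local Linux file-system
[root@hadoop-master ~]# echo "hello hdfs" > testfile2.txt
# copy it into the HDFS namespace and read it back from there
[root@hadoop-master ~]# hdfs dfs -put testfile2.txt /user/hossein/
[root@hadoop-master ~]# hdfs dfs -cat /user/hossein/testfile2.txt
hello hdfs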

Advanced

1. In order to get full information about the HDFS status on all slave nodes, we can use the following command, which gives the complete status of the HDFS namespace plus the status of each DataNode individually:

  • [root@hadoop-master ~]# hdfs dfsadmin -report
  • Configured Capacity: 7421610426368 (6.75 TB)
  • Present Capacity: 7042846941516 (6.41 TB)
  • DFS Remaining: 7041411293516 (6.40 TB)
  • DFS Used: 1435648000 (1.34 GB)
  • DFS Used%: 0.02%
  • Under replicated blocks: 402
  • Blocks with corrupt replicas: 0
  • Missing blocks: 0
  • Missing blocks (with replication factor 1): 0
  • ————————————————-
  • Live datanodes (2):
  • Name: 10.0.0.1:50010 (node-01)
  • Hostname: node-01.hadoop.cluster
  • Decommission Status : Normal
  • Configured Capacity: 3710805213184 (3.37 TB)
  • DFS Used: 717824000 (684.57 MB)
  • Non DFS Used: 189314633562 (176.31 GB)
  • DFS Remaining: 3520772755622 (3.20 TB)
  • DFS Used%: 0.02%
  • DFS Remaining%: 94.88%
  • Configured Cache Capacity: 0 (0 B)
  • Cache Used: 0 (0 B)
  • Cache Remaining: 0 (0 B)
  • Cache Used%: 100.00%
  • Cache Remaining%: 0.00%
  • Xceivers: 7
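Two follow-up commands pair naturally with this report (a sketch; the 10% threshold is just an example value): -printTopology shows the rack/DataNode tree the NameNode sees, and the balancer evens out the per-DataNode "DFS Used%" figures when they drift apart.

# print the rack/DataNode topology as reported by the NameNode
[root@hadoop-master ~]# hdfs dfsadmin -printTopology
# move blocks around until every DataNode is within 10% of the cluster average utilization
[root@hadoop-master ~]# su - hdfs -c "hdfs balancer -threshold 10"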

2. HDFS daemons and processes

Let's look at which services run for HDFS on the different nodes and how we can stop or start them. There are mainly three daemons: the NameNode, the Secondary NameNode (if we have one), and the DataNode.

Name Node
  • There is a single NameNode process, which runs on the master node(s). If we have an HA or federated NameNode in place, this needs special attention.
  • It is responsible for managing metadata about the files distributed across the cluster
  • It manages information such as the location of file blocks across the cluster and their permissions
  • This process reads all the metadata from a file named fsimage and keeps it in memory
  • After this process is started, it updates the metadata for newly added or removed files in RAM
  • It periodically records these changes in a file called edits (the edit log)
  • This process is the heart of HDFS; if it is down, HDFS is not accessible any more

Starting and stopping processes in a Hadoop cluster depends heavily on other processes. In our case, in order to start the HDFS daemons, we first need to make sure the ZooKeeper services are up. I will talk about ZooKeeper in a separate post. In summary, the following services have to be started in this order:

  • Zookeeper
  • HDFS
  • YARN
  • HBase

Interestingly, for stopping the services we need to reverse the order: HBase, YARN, HDFS, and then ZooKeeper. For HDFS itself, if we have the corresponding services, we should perform the following steps in order to start everything:

If you are running NameNode HA (High Availability), start the JournalNodes first:

[root@hadoop-master ~]# su - hdfs

[root@hadoop-master ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh start journalnode

Then we start the NameNode service:

[root@hadoop-master ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh start namenode

If we have NameNode HA, we also need to start the ZooKeeper Failover Controller (zkfc) on all NameNode machines:

[root@hadoop-master ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh start zkfc

Secondary Name Node
  • For this too, only a single instance of the process runs in a cluster
  • This process can run on a master node (for smaller clusters) or on a separate node (in larger clusters), depending on the size of the cluster
  • A common misinterpretation of the name is that this is a backup NameNode, but IT IS NOT!
  • It manages the metadata for the NameNode, in the sense that it reads the information written to the edit logs (by the NameNode) and creates an updated file of the current cluster metadata
  • Then it transfers that file back to the NameNode so that the fsimage file can be updated
  • So, whenever the NameNode daemon is restarted, it can always find up-to-date information in the fsimage file

If you are not running NameNode HA, execute the following command on the Secondary NameNode host machine. If you are running NameNode HA, the Standby NameNode takes on the role of the Secondary NameNode.

[root@hadoop-master ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh start secondarynamenode

Data Node
  • There are many instances of this process, running on the various slave nodes (referred to as DataNodes)
  • It is responsible for storing the individual file blocks on the slave nodes of the Hadoop cluster
  • Based on the replication factor, a single block is replicated to multiple slave nodes (only if the replication factor is > 1) to prevent data loss
  • Whenever required, this process handles access to a data block in coordination with the NameNode
  • This process periodically sends heartbeats to the NameNode to make it aware that the slave process is alive

[root@hadoop-master ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh start datanode

This command needs to be run on all slave nodes.
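A quick way to verify which HDFS daemons are actually running on a node is the JDK's jps tool; on the master you would expect a NameNode (and a SecondaryNameNode, if it is co-located), and on each slave a DataNode process.

# on the master: expect NameNode (and SecondaryNameNode, if co-located there)
[root@hadoop-master ~]# jps | grep -E 'NameNode|DataNode'
# on a slave: expect DataNode
[root@node-01 ~]# jps | grep DataNode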

For stopping all HDFS-related services completely, we perform the steps in reverse order: first the DataNodes, then the Secondary NameNode, and so on, all the way back to the JournalNodes if we are running HA.
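Concretely, with the same hadoop-daemon.sh script, the stop sequence is the start sequence run backwards; a sketch (paths as above, the HA components only if you have them):

# on every slave node
[root@node-01 ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh stop datanode
# on the Secondary NameNode host (non-HA setups)
[root@hadoop-master ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh stop secondarynamenode
# on the NameNode host(s); with HA also stop zkfc and, last, the JournalNodes
[root@hadoop-master ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh stop zkfc
[root@hadoop-master ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh stop namenode
[root@hadoop-master ~]# /usr/hdp/2.4.2.0-258/hadoop/sbin/hadoop-daemon.sh stop journalnode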

Tuning the HDFS file-system

coming soon …

 
