Hadoop
HDFS
Architecture
A file is divided into blocks;
128 MB per block by default (the last block of a file can be smaller);
NameNode (master): has multiple children (DataNodes);
DataNode (slave/worker): has only one parent (NameNode) and contains file blocks;
1 NameNode or DataNode = 1 computer;
When a file is divided, the NameNode records which DataNodes hold each of its blocks;
File division and recomposition are done outside the architecture: the cluster receives the file already divided into blocks, and when a user asks for a file, it does not recompose the file for them; it just returns the file blocks;
The usual replication factor is 3: a file of 4 blocks is stored as 12 block replicas in total. Each DataNode holds at most one replica of a given block, if possible;
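In most cases there are 2 NameNodes (1 primary, and 1 secondary that takes over if the primary fails); DataNodes are linked to the primary NameNode, and there is a logical link between the two NameNodes. The secondary mirrors the primary: the primary shares all information about what it is doing. In Hadoop's own terminology this failover role belongs to the Standby NameNode of a high-availability setup; the daemon actually named Secondary NameNode only checkpoints the primary's metadata periodically and cannot take over on its own;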
With small files, some DataNodes may stay empty, and that is fine.
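The block arithmetic and the NameNode's bookkeeping described above can be sketched in a few lines of Python. This is only a toy model, not HDFS code: the function names, the DataNode names (dn1 to dn5) and the round-robin placement are assumptions made for illustration, and the real HDFS placement policy is more involved.

```python
import math

BLOCK_SIZE_MB = 128  # block size given in the notes
REPLICATION = 3      # usual replication factor given in the notes


def count_blocks(file_size_mb):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def place_replicas(num_blocks, datanodes):
    """Toy NameNode block map: block index -> DataNodes holding a replica.

    Assumption: simple round-robin placement, NOT the real HDFS policy.
    Replicas of one block land on different DataNodes when enough exist.
    """
    copies = min(REPLICATION, len(datanodes))
    block_map = {}
    for block in range(num_blocks):
        block_map[block] = [
            datanodes[(block + i) % len(datanodes)] for i in range(copies)
        ]
    return block_map


# A hypothetical 500 MB file on a hypothetical 5-node cluster.
blocks = count_blocks(500)  # 4 blocks
block_map = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4", "dn5"])
total_replicas = sum(len(nodes) for nodes in block_map.values())  # 12
for block, nodes in block_map.items():
    print(f"block {block}: {nodes}")
```

With 4 blocks and a replication factor of 3, the map holds 12 replicas, which matches the example in the notes. A client that wants the file would ask for this map, fetch each block from one of the listed DataNodes, and reassemble the file itself.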
MapReduce
Map:
Shuffle and Sort: group together the elements that share the same key
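To make the Shuffle and Sort step concrete, here is a minimal in-memory sketch in Python. It is not Hadoop's API: the word-count map and reduce functions and all names below are assumptions chosen for illustration, and a real job would run these phases across many machines.

```python
from collections import defaultdict


def map_words(line):
    """Illustrative map step (assumed word count): emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group values that share the same key, then order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Illustrative reduce step: sum the values gathered for one key."""
    return key, sum(values)


lines = ["big data big", "data node"]
pairs = [pair for line in lines for pair in map_words(line)]
grouped = shuffle_and_sort(pairs)
# grouped == [("big", [1, 1]), ("data", [1, 1]), ("node", [1])]
counts = [reduce_counts(key, values) for key, values in grouped]
print(counts)  # [('big', 2), ('data', 2), ('node', 1)]
```

The point of the middle step is that every value for a given key ends up in one place, so the reduce step can process each key independently.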