Hadoop
The three components below help us process large volumes of data:
Distributed storage - e.g. HDFS (Hadoop Distributed File System) to store data
Computing framework - e.g. MapReduce/Spark
Cluster resource manager - e.g. YARN
Big data - definition - Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
In a vertically scalable system we increase capacity by adding disks to a single high-end server; HDFS is horizontally scalable, where multiple systems are connected as a cluster.
Hadoop client (edge nodes) -> In a large Hadoop cluster, a few nodes are dedicated as edge nodes. These nodes do not run any Hadoop services; they are used to connect to the Hadoop cluster for day-to-day activity. Keeping users off the cluster nodes avoids unnecessary issues and improves security.
hdfs <command> <command options>
HDFS Commands Guide : http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html
HDFS directory's size
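A directory's size can be checked with the `du` subcommand of `hdfs dfs`. A minimal sketch (the path `/user/data` is a placeholder):

```shell
# Summarized, human-readable size of a whole directory
hdfs dfs -du -s -h /user/data

# Per-item breakdown of the same directory
hdfs dfs -du -h /user/data
```

The `-s` flag aggregates the total instead of listing each child, and `-h` prints sizes in human-readable units (K, M, G).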
Find text in specified folder's files
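Since `hdfs dfs -cat` can stream multiple files to stdout, text search can be done by piping it into `grep`. A sketch, where the path and search term are placeholders:

```shell
# Stream every file under the folder and search for a string
hdfs dfs -cat /user/data/* | grep "search-term"
```

Note this pulls the full file contents to the client, so it is best suited for small folders on an edge node.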
Hadoop is more than HDFS. Hadoop supports many file systems other than HDFS, so HDFS can be replaced with another file system: Amazon S3, Azure Blob Storage, Azure Data Lake Storage, the Linux local file system, etc. That means Hadoop can access external file systems from a Hadoop cluster.
So hadoop commands are valid for all other supported file systems along with HDFS, whereas hdfs commands work only with the HDFS file system.
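The difference can be sketched with `ls`; the bucket, container, and host names below are placeholder assumptions:

```shell
# 'hadoop fs' works against any supported file system via a URI scheme
hadoop fs -ls hdfs://namenode:8020/user/data   # HDFS
hadoop fs -ls s3a://my-bucket/data             # Amazon S3 (s3a connector)
hadoop fs -ls file:///tmp                      # local file system

# 'hdfs dfs' targets HDFS only
hdfs dfs -ls /user/data
```

Without an explicit scheme, both commands resolve paths against the default file system configured in `fs.defaultFS`.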
Find sample data from an HDFS file.
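To peek at sample records without downloading the whole file, `cat` can be combined with the local `head`, or `tail` can be used directly. A sketch, with a placeholder path:

```shell
# First 10 lines of a file (stream is cut off once head exits)
hdfs dfs -cat /user/data/sample.txt | head -n 10

# Last kilobyte of the file, using the built-in tail subcommand
hdfs dfs -tail /user/data/sample.txt
```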