Apache Hadoop

1. Distributed File System

Hadoop is a widely used, open-source framework for distributed storage and processing of large data sets across clusters of commodity hardware. Here's a concise breakdown:

  • Hadoop Distributed File System (HDFS):
    • Purpose: Designed to store very large files across multiple machines within a cluster.
    • Architecture: Master/worker, consisting of a single NameNode (master) and multiple DataNodes (workers).
    • Characteristics:
      • Fault Tolerance: Automatically replicates data across multiple nodes.
      • High Throughput: Optimized for a write-once, read-many access pattern with large streaming reads.
      • Scalability: Easily scalable by adding more nodes to the cluster.
    • Components:
      • NameNode: Manages the filesystem namespace and block metadata; it was a single point of failure until high-availability support arrived in Hadoop 2.
      • DataNode: Stores the actual data blocks and performs read, write, and replication operations as instructed by the NameNode.
    • Data Model: Stores data in large blocks (128 MB by default in Hadoop 2+; 64 MB in Hadoop 1), optimizing for large sequential reads and writes.
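The block-splitting and replication behavior above can be sketched in a few lines. This is an illustrative model only, not HDFS code: the block size and replication factor mirror HDFS defaults, but the round-robin placement is a deliberate simplification of HDFS's rack-aware placement policy, and the DataNode names are hypothetical.

```python
# Illustrative sketch of HDFS-style block splitting and replica placement.
# NOT real HDFS logic: round-robin placement stands in for the actual
# rack-aware policy, and "dn1".."dn4" are made-up DataNode names.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the Hadoop 2+ default
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of `file_size` bytes splits into."""
    full, rem = divmod(file_size, block_size)
    return [block_size] * full + ([rem] if rem else [])

def place_replicas(num_blocks: int, datanodes: list[str],
                   replication: int = REPLICATION) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(300 * 1024 * 1024)      # a 300 MB file
print([b // (1024 * 1024) for b in blocks])        # → [128, 128, 44]
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(placement[0])                                # → ['dn1', 'dn2', 'dn3']
```

Losing any single DataNode leaves two surviving replicas of every block, which is why the NameNode can transparently re-replicate and the cluster tolerates node failure.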

1.0.1. Connections and Context

  • Hadoop Ecosystem:
    • MapReduce: Programming model for large-scale data processing in parallel.
    • YARN (Yet Another Resource Negotiator): Manages resources and job scheduling for Hadoop clusters.
    • Other Components: Often includes elements like Hive, Pig, and HBase for data querying and management.
  • Comparison with Other Systems:
    • Vs. Traditional Relational Databases: Better suited to unstructured data and large-scale batch processing than to transactional workloads.
    • Vs. Other File Systems: Purpose-built for distributed, parallel processing with data locality, unlike general-purpose distributed filesystems.
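The MapReduce model mentioned above can be shown in miniature with the classic word-count example. This is a single-process sketch of the data flow only; real Hadoop MapReduce distributes the map and reduce tasks across the cluster and performs the shuffle over the network.

```python
# Minimal in-process sketch of the MapReduce model (word count).
# Map, shuffle, and reduce run sequentially here just to show the data flow.
from collections import defaultdict

def map_phase(docs):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # → {'big': 3, 'data': 2, 'cluster': 1}
```

Because each mapper and each reducer key can be processed independently, the framework can scale the same program from two toy documents to petabytes spread across thousands of nodes.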

2. Evolution

Tags::tool:data: