Apache Hadoop
Table of Contents
1. Distributed File System
Hadoop is a widely-used framework for dealing with large data sets distributed across clusters of computers. Here's a concise breakdown:
- Hadoop Distributed File System (HDFS):
- Purpose: Designed to store very large files across multiple machines within a cluster.
- Architecture: Master-slave, consisting of a NameNode (master) and multiple DataNodes (slaves).
- Characteristics:
- Fault Tolerance: Automatically replicates each block across multiple nodes (default replication factor of 3).
- High Throughput: Designed for write-once, read-many workloads; optimized for streaming reads of large files rather than low-latency random access.
- Scalability: Easily scalable by adding more nodes to the cluster.
- Components:
- NameNode: Manages metadata and the filesystem namespace; a single point of failure before High Availability support arrived in Hadoop 2.x.
- DataNode: Stores the actual data blocks and performs operations as instructed by the NameNode.
- Data Model: Stores data in large blocks (default 128 MB in Hadoop 2.x and later; 64 MB in Hadoop 1.x), optimizing for large sequential file operations.
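The block-based data model above can be illustrated with a small back-of-the-envelope calculation. This is a hypothetical sketch (the constants mirror common HDFS defaults, and `hdfs_footprint` is an illustrative name, not part of any Hadoop API): it shows how a file is split into fixed-size blocks and how replication multiplies raw storage usage.

```python
import math

# Illustrative constants mirroring common HDFS defaults; not a Hadoop API.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the Hadoop 2.x default block size
REPLICATION = 3                  # default replication factor

def hdfs_footprint(file_size_bytes):
    """Return (num_blocks, raw_bytes_stored) for a file of the given size."""
    # A file occupies at least one block; the last block may be partially full.
    num_blocks = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    # Every block is replicated, so raw cluster usage is size * replication.
    return num_blocks, file_size_bytes * REPLICATION

# A 1 GB file splits into 8 blocks and consumes 3 GB of raw cluster storage.
blocks, raw = hdfs_footprint(1 * 1024**3)
print(blocks, raw)  # 8 3221225472
```

Note that, unlike filesystems with small fixed allocation units, a partially filled final HDFS block only consumes the bytes actually written, not the full 128 MB.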
1.0.1. Connections and Context:
- Hadoop Ecosystem: HDFS is the storage layer beneath the rest of the stack, including MapReduce and YARN for processing and resource management, and higher-level tools such as Hive, Pig, and HBase.
- Comparison with Other Systems:
- Vs. Traditional Relational Databases: Better suited to unstructured data and large-scale batch processing; trades ACID transactions and low-latency queries for throughput and scale.
- Vs. Other File Systems: Specifically designed for distributed, parallel processing with data locality in mind, unlike general-purpose distributed filesystems.
2. Evolution
- See Apache Spark