Apache Spark
1. Overview
1.1. Definition
Apache Spark is an open-source unified analytics engine for large-scale data processing, known for its speed, ease of use, and sophisticated analytics.
1.2. Core Features
1.2.1. Speed
Spark processes data in memory where possible, which cuts the time spent on disk I/O and can speed up processing significantly, especially for workloads that reuse the same data.
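As a rough illustration of why keeping data in memory helps, the toy sketch below (plain Python, not Spark's API) counts how often a simulated slow source is read when the data is reloaded for every pass versus loaded once and reused. The names `load_from_disk` and `run_passes` are hypothetical. In Spark itself, the comparable mechanism is marking a dataset with `cache()` or `persist()`.

```python
# Toy model: count simulated disk reads with and without in-memory reuse.
reads = {"count": 0}

def load_from_disk():
    """Stand-in for an expensive read of a dataset from disk."""
    reads["count"] += 1
    return list(range(1, 6))

def run_passes(n_passes, keep_in_memory):
    """Run n_passes over the data; return (results, number of disk reads)."""
    reads["count"] = 0
    cached = load_from_disk() if keep_in_memory else None
    results = []
    for i in range(n_passes):
        # Either reuse the in-memory copy or go back to the slow source.
        data = cached if keep_in_memory else load_from_disk()
        results.append(sum(x * (i + 1) for x in data))
    return results, reads["count"]

without_cache = run_passes(3, keep_in_memory=False)
with_cache = run_passes(3, keep_in_memory=True)
```

Both runs produce the same results; only the number of simulated reads differs, which is the effect that matters for iterative workloads.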
1.2.2. Ease of Use
Spark provides simple and expressive APIs in Python, Java, Scala, and R, which makes it accessible to a wide range of developers and data scientists.
1.2.3. Advanced Analytics
Spark supports SQL queries, streaming data, machine learning, and graph processing within a single engine.
1.3. Components
1.3.1. Spark Core
The underlying engine that handles memory management and task scheduling. It also provides basic functionality such as task dispatching and input/output operations.
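The division of labour described above can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: the data is split into partitions, one task per partition is dispatched to a pool of workers, and the partial results are merged. The helper names (`partition`, `count_words`, `word_count`) are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(items, num_partitions):
    """Split items into num_partitions roughly equal slices."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def count_words(lines):
    """Task run on one partition: count the words in its lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def word_count(lines, num_partitions=2):
    """Dispatch one task per partition, then merge the partial counts."""
    parts = partition(lines, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

lines = ["spark is fast", "spark is simple", "hadoop is older"]
result = word_count(lines)
```

A thread pool on one machine stands in for the cluster here; in Spark the tasks would be shipped to executors on different nodes.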
1.3.2. Spark SQL
Enables querying data via SQL as well as working with DataFrames and Datasets, which are distributed collections of data organized into named columns.
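A minimal PySpark sketch of the two query styles, assuming PySpark is installed and a `SparkSession` is passed in. The function name `average_salary_by_dept`, the view name `employees`, and the column names are hypothetical; the calls used (`createDataFrame`, `createOrReplaceTempView`, `sql`, `groupBy`, `avg`) belong to the PySpark DataFrame API.

```python
def average_salary_by_dept(spark, rows):
    """Compute average salary per department via SQL and via the DataFrame API.

    `spark` is expected to be a pyspark.sql.SparkSession; `rows` is a list
    of (name, dept, salary) tuples. Both returned DataFrames describe the
    same aggregation.
    """
    # Build a DataFrame with named columns from plain Python tuples.
    df = spark.createDataFrame(rows, ["name", "dept", "salary"])

    # SQL style: register the DataFrame as a temporary view and query it.
    df.createOrReplaceTempView("employees")
    via_sql = spark.sql(
        "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept"
    )

    # DataFrame style: the same aggregation expressed as method calls.
    via_api = df.groupBy("dept").avg("salary")
    return via_sql, via_api

# Example call (requires PySpark):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("sql-sketch").getOrCreate()
#   rows = [("Ann", "eng", 100.0), ("Bob", "eng", 80.0), ("Cy", "ops", 70.0)]
#   via_sql, via_api = average_salary_by_dept(spark, rows)
#   via_sql.show()
```

Passing the session in as a parameter keeps the function free of global state and makes it easy to reuse an existing session.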
1.3.3. Spark Streaming
Allows processing of live data streams in near real time by dividing the stream into small batches.
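Spark Streaming's original DStream API handles a stream as a series of small batches, each covering a fixed time interval. The plain-Python sketch below imitates that idea only; it is not Spark code, and the `micro_batches` helper and the sample events are invented for illustration.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-width time batches.

    Returns a list of (batch_start, values) pairs ordered by batch_start.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Round the timestamp down to the start of its batch interval.
        batch_start = (timestamp // batch_interval) * batch_interval
        batches[batch_start].append(value)
    return sorted(batches.items())

# Events as (seconds since start, reading); batches are 2 seconds wide.
events = [(0, 10), (1, 20), (2, 5), (3, 15), (5, 40)]
totals = [(start, sum(values)) for start, values in micro_batches(events, 2)]
```

Each batch is then processed like a small, ordinary batch job, which is why latency is bounded below by the batch interval.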
1.3.4. MLlib (Machine Learning Library)
A scalable machine-learning library that leverages Spark’s parallel processing capabilities.
1.3.5. GraphX
Spark's API for graphs and graph-parallel computation.
1.4. Deployment Modes
- Standalone: Runs on Spark’s own built-in cluster manager, with no external resource manager required.
- YARN: Deploys within a Hadoop cluster using YARN (Yet Another Resource Negotiator).
- Mesos: Runs on Apache Mesos, a general-purpose cluster manager that can also run other distributed frameworks (Mesos support is deprecated as of Spark 3.2).
- Kubernetes: Runs on a Kubernetes-managed cluster.
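The deployment mode is normally selected through the master URL passed to `spark-submit --master` or set on the session builder. The sketch below collects the URL forms for the modes listed above in one hypothetical helper, `master_url`; the host names and ports in the example are placeholders, not values to rely on.

```python
# URL prefixes for the cluster managers that are addressed by host and port.
SCHEMES = {
    "standalone": "spark://",
    "mesos": "mesos://",
    "kubernetes": "k8s://https://",
}

def master_url(mode, host=None, port=None):
    """Return the Spark master URL string for a deployment mode."""
    if mode == "yarn":
        # YARN takes no address; the cluster is located via Hadoop configuration.
        return "yarn"
    if mode not in SCHEMES:
        raise ValueError(f"unknown mode: {mode!r}")
    if host is None or port is None:
        raise ValueError(f"mode {mode!r} needs a host and port")
    return f"{SCHEMES[mode]}{host}:{port}"

def submit_command(mode, app, host=None, port=None):
    """Build a spark-submit argument list for the given mode and application."""
    return ["spark-submit", "--master", master_url(mode, host, port), app]

cmd = submit_command("standalone", "app.py", host="master.example.com", port=7077)
```

Returning the command as a list rather than a single string makes it safe to pass to `subprocess.run` without shell quoting.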
1.5. Use Cases
- Real-time data analysis
- Batch processing
- Machine learning model training and evaluation
- Interactive data exploration
1.6. Connections
- Spark is often integrated with Hadoop’s HDFS for storage, utilizing Hadoop clusters to scale out data processing.
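- It competes with tools like Apache Hadoop MapReduce but is often significantly faster, particularly for iterative and interactive workloads, because it can keep intermediate data in memory instead of writing it to disk between steps.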
- Apache Kafka is frequently used alongside Spark Streaming for real-time data ingestion.
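For the Kafka integration, here is a hedged sketch that uses the newer Structured Streaming API rather than the DStream-based Spark Streaming named above. The helper `kafka_source_options` is hypothetical and only assembles option names that Spark's Kafka source reads; the broker addresses and topic are placeholders. Running the commented part assumes PySpark plus the `spark-sql-kafka-0-10` connector package.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Assemble options for Spark's Structured Streaming Kafka source."""
    return {
        "kafka.bootstrap.servers": ",".join(bootstrap_servers),
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }

options = kafka_source_options(["broker1:9092", "broker2:9092"], "events")

# With a SparkSession named `spark` (assumed to exist), the stream could be
# opened roughly like this:
#   stream = spark.readStream.format("kafka").options(**options).load()
#   messages = stream.selectExpr("CAST(value AS STRING) AS value")
```

Kafka delivers message keys and values as binary, which is why the commented example casts `value` to a string before further processing.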