Apache Spark

1. Overview

1.1. Definition

Apache Spark is an open-source unified analytics engine for large-scale data processing, known for its speed, ease of use, and sophisticated analytics.

1.2. Core Features

1.2.1. Speed

Spark keeps intermediate data in memory where possible, which avoids repeated disk I/O between processing steps and significantly speeds up data processing.

1.2.2. Ease of Use

Spark provides simple and expressive APIs in Python, Java, Scala, and R, which makes it accessible to a wide range of developers and data scientists.

1.2.3. Advanced Analytics

Spark supports SQL queries, streaming data, machine learning, and graph processing within a single engine.

1.3. Components

1.3.1. Spark Core

The underlying engine that handles memory management and job scheduling. It also provides basic functionality such as task dispatching and input/output operations.

1.3.2. Spark SQL

Enables querying data via SQL as well as working with DataFrames and Datasets, which are distributed collections of data organized into named columns.

1.3.3. Spark Streaming

Enables processing of live data streams in near real time by handling incoming data as a series of small batches.

1.3.4. MLlib (Machine Learning Library)

A scalable machine-learning library that leverages Spark’s parallel processing capabilities.

1.3.5. GraphX

A library for graph processing and graph-parallel computation.

1.4. Deployment Modes

Standalone: Uses Spark's own built-in cluster manager, with no external resource manager required.
YARN: Runs within a Hadoop cluster, with YARN (Yet Another Resource Negotiator) allocating resources.
Mesos: Runs on Apache Mesos, a cluster manager that can also manage other distributed frameworks. Mesos support has been deprecated since Spark 3.2.
Kubernetes: Runs on a Kubernetes-managed cluster.
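In practice, the deployment mode is usually selected through the master URL passed to Spark (for example via the `--master` option of `spark-submit`). The helper below is a small, hypothetical sketch that maps each mode listed above to its master URL format; the function name `master_url` and the host names are assumptions for illustration, not part of Spark itself.

```python
def master_url(mode, host=None, port=None):
    """Build a Spark master URL string for one of the deployment modes above.

    Hypothetical helper: Spark only sees the resulting string.
    """
    if mode == "yarn":
        # YARN locates the cluster from the Hadoop configuration, so no host is given.
        return "yarn"
    if host is None or port is None:
        raise ValueError(f"mode {mode!r} needs a host and a port")
    if mode == "standalone":
        return f"spark://{host}:{port}"
    if mode == "mesos":
        return f"mesos://{host}:{port}"
    if mode == "kubernetes":
        # The URL points at the Kubernetes API server.
        return f"k8s://https://{host}:{port}"
    raise ValueError(f"unknown deployment mode: {mode!r}")


# Example hosts and ports are placeholders.
standalone = master_url("standalone", "master.example.com", 7077)
kubernetes = master_url("kubernetes", "k8s-api.example.com", 6443)
```

The resulting string could then be passed as `spark-submit --master <url>` or to `SparkSession.builder.master(...)`.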

1.5. Use Cases

Real-time data analysis
Batch processing
Machine learning model training and evaluation
Interactive data exploration

1.6. Connections

Spark is often paired with Hadoop's HDFS for storage and can run on existing Hadoop clusters to scale out data processing.
It is commonly used as an alternative to Apache Hadoop MapReduce and is often considerably faster, largely because it keeps intermediate results in memory instead of writing them to disk between stages.
Apache Kafka is frequently used alongside Spark Streaming for real-time data ingestion.