Consistent Hashing

1. Overview

Definition:
- Consistent hashing is a hashing scheme that minimizes the reorganization of data when nodes are added or removed from a distributed system.
Traditional vs. Consistent Hashing:
- Traditional hashing:
  - Data is uniformly distributed among a fixed number of nodes.
  - High re-distribution of data when nodes change.
- Consistent hashing:
  - Uses a circular space (hash ring).
  - Each node is assigned to a position on the ring.
  - Each data point is hashed to find its position on the ring.
Key Characteristics:
- Scalability: Adding or removing nodes only affects a fraction of the data.
- Load Balancing: Ensures that data is evenly distributed, though some variations may arise.
Challenges:
- Data Skew: Different nodes may store significantly varying amounts of data.
- Hotspots: Certain keys may be accessed more frequently than others, causing load issues.
Enhancements:
- Virtual Nodes: Each physical node can be represented by multiple positions on the hash ring to improve balance.
- Replica Management: Distributing replicas of data across multiple nodes for fault tolerance.

The Hash Ring: Imagine a circular ring (the "hash ring"). Both data items and nodes are assigned positions on this ring using a hash function.
Data Placement: Each data item is mapped to the first node encountered clockwise on the ring from its hash position. This node is responsible for storing and serving that data item.
Node Addition: When a new node is added, its position on the ring is determined by hashing its identifier. Only the data items that were previously mapped to the nearest clockwise node now get remapped to the new node.
Node Removal: When a node is removed, the data items it was responsible for are remapped to the next available node in the clockwise direction.

Minimizes Remapping: When nodes are added or removed, only a small fraction of the data needs to be remapped, reducing disruption to the system.
Balanced Load Distribution: Data is distributed evenly across nodes, preventing any single node from becoming a bottleneck.
Scalability: Easily scales horizontally by adding more nodes to the ring.

Distributed Caching: Used in systems like Memcached and Redis to distribute cache data across multiple servers.
Load Balancing: Used to distribute requests across multiple web servers or application servers.
Distributed Hash Tables (DHTs): Used in peer-to-peer systems to locate data stored across a network of nodes.

Hash Function Choice: Choosing a good hash function is crucial for ensuring even data distribution.
Virtual Nodes: To further improve load balancing, each physical node can be represented by multiple "virtual nodes" on the hash ring.
- allows for heterogeneity by alloting virtual nodes proportional to server sizes
Data Replication: Consistent hashing can be combined with replication strategies to ensure data availability even if nodes fail.

Data Organization:
- Each data item is hashed to determine its position on the consistent hashing ring.
- Nodes (servers) are also assigned positions on the same ring.
- Data is stored on the node that is closest in the clockwise direction from the hashed position of the data.
Replication Strategy:
- Each data item can be replicated to multiple nodes to provide redundancy.
- With consistent hashing, replicas can be stored on the next 'k' nodes in the clockwise direction from the original node.
- This ensures that even if one node fails, the data can still be accessed from replicas.
- see Quorum for synchronization across replicas

Handling Node Changes:
- When a new node is added, only a fraction of the data needs to be rehashed and migrated; this is achieved by identifying the keys that fall within the new node's range on the ring.
- When a node is removed, its data is reassigned to the next nodes in the clockwise direction, ensuring minimal disruption.