Distributed Compute

As the scale of the problem one wishes to solve increases, several computers need to coordinate to solve it efficiently and correctly.

These "compute"rs might not even be the conventional computer that pops up in one's mind and might rather be a group of servers, database instances or several edge devices such as in decentralized applications.

The uniting feature of all distributed systems is that the workload is distributed across the constituent computers.

Some basic advantages of distributed compute:

  • efficiency increases when the workload is spread across machines
  • fault tolerance when the workload is replicated redundantly
  • execution modularity: when an error arises, only the affected module needs to be debugged rather than the whole code-base

A few trade-offs incurred due to the above:

  • latency increases due to the communication overhead
  • a poorly designed distribution of work can degrade the user experience rather than improve it

The theory of correctness in distributed compute is explored further in the Consensus node.

This is an index of relevant nodes:

1. Fallacies of Distributed Computing

1.1. The Network is reliable

1.2. Latency is zero

1.3. Bandwidth is infinite

1.4. The Network is secure

1.5. Topology doesn't change

1.6. There is one administrator

1.7. Transport cost is zero

1.8. The network is homogeneous

2. Versioning

2.1. Lamport Timestamps

  • a simple way to track the order of events in a distributed system where multiple computers work together.
  • help determine which event happened before another, even when the computers' clocks are not perfectly synchronized.

2.1.1. Working Mechanism

  1. Initialization: Each computer starts with a timestamp of 0.
  2. Event Occurs: When an event happens on a computer, it increases its timestamp by 1.
  3. Message Sending: When a computer sends a message, it includes its current timestamp in the message.
  4. Message Receiving: When a computer receives a message, it compares its own timestamp with the timestamp in the message and updates its own timestamp to the maximum of the two values + 1 (see the sketch after this list).
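
A minimal sketch of this mechanism in Python; the class and method names are illustrative, not taken from any particular library.

  class LamportClock:
      """Logical clock for one computer (node) in the system."""

      def __init__(self):
          self.time = 0                    # 1. Initialization: start at 0

      def tick(self):
          """2. Event Occurs: a local event advances the clock by 1."""
          self.time += 1
          return self.time

      def send(self):
          """3. Message Sending: sending is itself an event; the new
          timestamp travels with the outgoing message."""
          return self.tick()

      def receive(self, msg_time):
          """4. Message Receiving: take the maximum of the local and the
          received timestamp, plus 1 for the receive event itself."""
          self.time = max(self.time, msg_time) + 1
          return self.time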

2.1.2. Determining Order

  • If event A happened before event B, then A has a lower timestamp than B. The converse is not guaranteed: a lower timestamp alone does not prove that A happened before B.
  • If events A and B on different computers carry the same timestamp, they are concurrent; a tie-breaking rule is needed to impose a total order (see the comparison helper below).
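
A common convention for turning the timestamps into a total order, anticipating the tie-breaking limitation noted further down, is to compare (timestamp, node_id) pairs; the node ids here are just illustrative labels.

  def lamport_before(a, b):
      """a and b are (timestamp, node_id) pairs. Ties in timestamp are
      broken arbitrarily but consistently by node id, which yields a
      total order compatible with the happened-before relation."""
      return a < b   # tuple comparison: timestamp first, then node id

  # Same timestamp on different nodes: the node id decides.
  assert lamport_before((3, "A"), (4, "B"))
  assert lamport_before((3, "A"), (3, "B"))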

2.1.3. Example

Imagine two computers, A and B:

A's timestamp is 3. B's timestamp is 2.

  1. A sends a message to B with timestamp 3.
  2. B receives the message and updates its timestamp to max(2, 3) + 1 = 4.
  3. Now, we know that A's event happened before B's event (3 < 4); the exchange is reproduced in code below.
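
The same exchange with the LamportClock sketch above, treating A's send as the event that brings its clock to 3:

  a, b = LamportClock(), LamportClock()
  a.time, b.time = 2, 2     # state just before the exchange
  msg_ts = a.send()         # the send event brings A to 3; 3 travels with the message
  b.receive(msg_ts)         # B: max(2, 3) + 1 = 4
  print(a.time, b.time)     # 3 4 -> A's send (3) happened before B's receive (4)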

2.1.4. Key Points

  • Lamport timestamps provide an ordering that is consistent with causality: if one event happened before another, its timestamp is smaller. They cannot tell you whether two events are causally related or merely concurrent.
  • They are simple to implement and require minimal overhead.
  • They are often used as a building block for more advanced algorithms like vector clocks.

2.1.5. Limitations

  • comparing timestamps alone cannot tell whether two events are causally related or concurrent.
  • a tie-breaking mechanism (such as comparing node IDs) is needed when events on different computers carry the same timestamp.

2.2. Vector Clocks

  • ensure causal ordering of events in distributed systems
  • a tool for tracking the order of events and detecting causal relationships between them.
  • address the main limitation of Lamport timestamps, which cannot distinguish concurrent events from causally related ones.

2.2.1. Working Mechanism

  1. Structure: A vector clock is an array of integers, one for each node in the system. Each node maintains its own vector clock.
  2. Initialization: All vector clocks start with all zeros.
  3. Event Occurs: When an event occurs at a node, it increments its own clock value in the vector. For example, if event 'e' happens at node 2, the second element in node 2's vector clock is incremented.
  4. Message Sending: When a node sends a message, it increments its own clock value and attaches its entire vector clock to the message. For example, node 1 sends a message to node 3. Node 1 increments its own clock value and sends the message along with its vector clock.
  5. Message Receiving: When a node receives a message, it updates its own vector clock by taking the element-wise maximum of its current vector clock and the received vector clock. For example, node 3 receives a message from node 1 along with node 1's vector clock; node 3 updates its own clock by taking the maximum of each pair of elements. (Many formulations also increment the receiving node's own entry afterwards, to record the delivery itself as an event; the example and the sketch below follow the max-only rule.)
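
A compact sketch of these rules in Python, following the max-only receive rule used here; the names are illustrative.

  class VectorClock:
      """Vector clock for one node in a system of n nodes."""

      def __init__(self, node_id, n):
          self.node_id = node_id           # index of this node (0-based)
          self.clock = [0] * n             # 2. Initialization: all zeros

      def tick(self):
          """3. Event Occurs: increment this node's own entry."""
          self.clock[self.node_id] += 1
          return list(self.clock)

      def send(self):
          """4. Message Sending: the send is itself an event; a copy of
          the whole vector travels with the message."""
          return self.tick()

      def receive(self, other):
          """5. Message Receiving: element-wise maximum of the local
          vector and the received one."""
          self.clock = [max(x, y) for x, y in zip(self.clock, other)]
          return list(self.clock)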

2.2.2. Determining Causality

  • Causally Related Events: Event A happens before event B (A -> B) if and only if every element in A's vector clock is less than or equal to the corresponding element in B's vector clock, and at least one element is strictly less.
  • Concurrent Events: If neither A -> B nor B -> A holds, then events A and B are concurrent (see the helper functions below).
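
These two checks fall directly out of the definition above; a small sketch:

  def happens_before(a, b):
      """a -> b iff a[i] <= b[i] for every i, with at least one strict <."""
      return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

  def concurrent(a, b):
      """a and b are concurrent iff neither happens before the other."""
      return not happens_before(a, b) and not happens_before(b, a)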

2.2.3. Example

Consider a system with three nodes (N1, N2, N3).

N1: [1, 0, 0]   N2: [0, 1, 0]   N3: [0, 0, 1]  
  • These vector clocks represent the state after each node has experienced one local event (all clocks start at [0, 0, 0]).
  • Let's say N1 sends a message to N3. N1 increments its clock and attaches its vector clock [2, 0, 0] to the message. When N3 receives the message, it updates its vector clock to [2, 0, 1] (taking the element-wise maximum).

Now we can determine that N1's events happened before the updated event at N3: both the first event at N1 ([1, 0, 0]) and the send event ([2, 0, 0]) precede N3's receive ([2, 0, 1]), since each is element-wise <= with at least one strict inequality. The exchange is reproduced in code below.
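
Reproducing the exchange with the sketches above:

  n1 = VectorClock(node_id=0, n=3)
  n3 = VectorClock(node_id=2, n=3)

  n1.tick()                  # N1's first local event -> [1, 0, 0]
  n3.tick()                  # N3's first local event -> [0, 0, 1]

  msg = n1.send()            # send event -> [2, 0, 0] travels with the message
  n3.receive(msg)            # element-wise maximum -> [2, 0, 1]

  print(happens_before([2, 0, 0], n3.clock))   # True: the send precedes the receive
  print(concurrent([1, 0, 0], [0, 1, 0]))      # True: N1's and N2's first events are concurrent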

2.2.4. Benefits of Vector Clocks

  • Accurate Causality Tracking: Captures the partial ordering of events in a distributed system.
  • Conflict Detection: Helps identify conflicting updates to replicated data.

  • Versioning: used to manage versions of data in distributed systems.

3. Distributed Databases

3.2. etcd

3.3. Cassandra

Tags::cs: