Failure Detection

  • Identifying when a component (e.g., server, network link) in a distributed system stops functioning correctly.
  • Essential for triggering recovery mechanisms such as failover, replication, or reconfiguration.

1. Challenges

  • Network delays and partitions can make it difficult to distinguish between slow responses and actual failures.
  • False positives (mistakenly declaring a node as failed) and false negatives (failing to detect a real failure) can have serious consequences.

2. Types of Failures

  • Crash Failures: A node halts and stops responding completely.
  • Omission Failures: A node keeps running but fails to send or receive some of its messages.
  • Byzantine Failures: A node behaves arbitrarily or maliciously, for example by sending conflicting or corrupted messages.

3. Failure Detection Mechanisms

  • Heartbeating: Nodes periodically send "I'm alive" messages to a central monitor or to each other. A run of missed heartbeats indicates a potential failure.
  • Ping/Echo: A node sends a ping message, and the other node replies with an echo. Lack of response suggests a failure.
  • Gossip Protocols: Nodes exchange information about the health of other nodes, spreading failure information quickly.
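  • Timeout-Based Detection: If a node doesn't respond within a configured time limit, it is assumed to have failed. Heartbeating and ping/echo both rely on a timeout to reach a verdict.
  • Anti-Entropy Protocols: Nodes periodically compare and reconcile state with their peers; a peer that repeatedly fails to take part in these exchanges can be suspected of having failed.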
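
As an illustration of the heartbeat and timeout mechanisms above, here is a minimal sketch of a central monitor. The names HeartbeatMonitor, FakeClock, and demo, and the 3-second timeout, are hypothetical choices made for this example; a real system would receive heartbeats over the network rather than through direct method calls.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and suspects nodes that go quiet."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the demo does not need to sleep
        self.last_seen = {}

    def heartbeat(self, node_id):
        """Record an "I'm alive" message from node_id."""
        self.last_seen[node_id] = self.clock()

    def suspected(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


class FakeClock:
    """Manually advanced clock, used only to make the demo deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def demo():
    clock = FakeClock()
    monitor = HeartbeatMonitor(timeout_s=3.0, clock=clock)
    monitor.heartbeat("a")
    monitor.heartbeat("b")
    clock.advance(2.0)
    monitor.heartbeat("a")      # "a" keeps reporting, "b" goes silent
    clock.advance(2.0)
    return monitor.suspected()  # "b" was last seen 4 s ago, "a" only 2 s ago
```

Note that the monitor only ever reports a suspicion: a node that is slow or cut off by a partition looks the same to it as a node that has crashed.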

4. Failure Detector Properties

  • Completeness: Every actual failure is eventually detected.
  • Accuracy: No non-faulty node is incorrectly suspected of failing.
  • Speed: The delay between a failure occurring and its detection is short.
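
To make completeness and accuracy concrete, the sketch below compares a detector's current suspect list against a known set of crashed nodes, as one might do in a test harness. The function name evaluate and its return format are assumptions made for this example. Completeness is an eventual property, so a single snapshot can only show that it has not been reached yet.

```python
def evaluate(suspected, crashed):
    """Compare a detector's suspect list with the ground-truth crashed set."""
    suspected, crashed = set(suspected), set(crashed)
    false_positives = suspected - crashed  # healthy nodes wrongly suspected
    false_negatives = crashed - suspected  # real failures not (yet) detected
    return {
        "complete": not false_negatives,   # every crashed node is suspected
        "accurate": not false_positives,   # no healthy node is suspected
        "false_positives": sorted(false_positives),
        "false_negatives": sorted(false_negatives),
    }
```

For example, calling evaluate(["b", "c"], ["c", "d"]) reports "b" as a false positive and "d" as a false negative, so that snapshot is neither complete nor accurate.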

5. Practical Considerations

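  • Trade-offs must be made between accuracy and speed: a shorter timeout detects failures sooner but raises the risk of false positives, while a longer timeout does the opposite.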
  • Timeout values and other parameters need to be adjusted based on network conditions and application requirements.
  • Failure detection is often probabilistic: it provides a likelihood of failure rather than absolute certainty.
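
One way to express a likelihood instead of a yes/no verdict is to output a suspicion score that grows the longer a heartbeat is overdue, scaled by the intervals actually observed. The sketch below is a simplified illustration, not a reference implementation: it assumes heartbeat inter-arrival times are exponentially distributed, and the names SuspicionEstimator, phi, and demo are hypothetical.

```python
import math
from collections import deque


class SuspicionEstimator:
    """Turns heartbeat arrival history into a continuous suspicion score."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Return -log10 of the estimated chance the node is merely late."""
        if not self.intervals:
            return 0.0  # no history yet, so no basis for suspicion
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Assumption: P(next heartbeat arrives after `elapsed`) = exp(-elapsed / mean),
        # so -log10 of that probability is elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


def demo():
    est = SuspicionEstimator()
    for t in (0.0, 1.0, 2.0, 3.0):  # heartbeats arriving once per second
        est.heartbeat(t)
    return est.phi(4.0), est.phi(10.0)  # 1 s overdue vs. 7 s overdue
```

The application then picks a threshold on the score to suit its own tolerance for false positives. Because the score is scaled by the observed mean interval, it adapts to network conditions without retuning a fixed timeout.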
Tags::cs: