Failure Detection
- Identifying when a component (e.g., a server or network link) in a distributed system has stopped functioning correctly.
- Essential for triggering recovery mechanisms such as failover, replication, or reconfiguration.
1. Challenges
- Network delays and partitions can make it difficult to distinguish between slow responses and actual failures.
- False positives (mistakenly declaring a healthy node as failed) can trigger unnecessary failovers, while false negatives (failing to detect a real failure) leave requests routed to a dead node.
2. Types of Failures
- Crash Failures: A node halts and stops responding completely.
- Omission Failures: A node fails to send or receive some messages while otherwise continuing to run.
- Byzantine Failures: A node behaves arbitrarily or maliciously.
3. Failure Detection Mechanisms
- Heartbeating: Nodes periodically send "I'm alive" messages to a central monitor or to each other. A run of missed heartbeats indicates a potential failure.
- Ping/Echo: A node sends a ping message, and the other node replies with an echo. Lack of response suggests a failure.
- Gossip Protocols: Nodes exchange information about the health of other nodes, spreading failure information quickly.
- Timeout-Based Detection: If a node doesn't respond within a certain time, it's assumed to have failed.
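- Anti-Entropy Protocols: Nodes periodically compare state with peers and reconcile any differences, so health and membership information eventually converges even if earlier messages were lost.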
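The heartbeating and timeout-based mechanisms above can be combined in a few lines. The following is a minimal sketch, not a production design: the class name `HeartbeatMonitor`, its methods, and the 3-second timeout are all hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Central monitor that suspects nodes whose heartbeats stop arriving."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated (assumed value)
        self.last_seen = {}      # node id -> time of the most recent heartbeat

    def heartbeat(self, node, now):
        # Record an "I'm alive" message from `node` received at time `now`.
        self.last_seen[node] = now

    def suspected(self, now):
        # A node is only *suspected*: it may be slow or partitioned, not dead.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("a", now=0.0)
monitor.heartbeat("b", now=0.0)
monitor.heartbeat("a", now=2.0)     # "b" has gone silent
print(monitor.suspected(now=4.0))   # ['b']
```

If "b" was merely delayed and its heartbeat arrives later, the next call to `heartbeat` clears the suspicion, which is how a false positive would be corrected in this sketch.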
4. Failure Detector Properties
- Completeness: Every actual failure is eventually detected.
- Accuracy: No non-faulty node is incorrectly suspected of failing.
- Speed: Failures are detected quickly.
5. Practical Considerations
- Trade-offs must be made between accuracy and speed: shorter timeouts detect failures sooner but produce more false positives.
- Timeout values and other parameters need to be adjusted based on network conditions and application requirements.
- Failure detection is often probabilistic: it provides a likelihood of failure rather than absolute certainty.
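One way to make the probabilistic view concrete is an accrual-style detector that outputs a suspicion level instead of a yes/no answer. The sketch below is illustrative only: it assumes heartbeat gaps are roughly exponentially distributed, and the names `AccrualDetector`, `phi`, and `THRESHOLD` are hypothetical. The suspicion level is -log10 of the estimated probability that a live node would stay silent this long, so it grows as silence continues and adapts to the observed heartbeat interval.

```python
import math
from collections import deque


class AccrualDetector:
    """Suspicion level for one node, based on recent heartbeat gaps."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent gaps between heartbeats
        self.last = None                       # time of the latest heartbeat

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        # With no history there is nothing to base a suspicion on.
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last
        # Assumed model: P(silence >= elapsed) = exp(-elapsed / mean),
        # so phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = AccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)

THRESHOLD = 2.0  # hypothetical; higher means fewer false positives, slower detection
print(detector.phi(3.5) > THRESHOLD)  # False
print(detector.phi(8.0) > THRESHOLD)  # True
```

Tuning then reduces to choosing the threshold per application rather than a fixed timeout per network, since the mean interval tracks current conditions.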
Tags::cs: