Failure Detection

Failure detection is the process of identifying when a component (e.g., a server or network link) in a distributed system stops functioning correctly.
It is essential for triggering recovery mechanisms such as failover, re-replication, or reconfiguration.

1. Challenges

Network delays and partitions can make it difficult to distinguish between slow responses and actual failures.
False positives (mistakenly declaring a healthy node failed) can trigger unnecessary failovers, while false negatives (failing to detect a real failure) leave requests routed to a node that cannot serve them.

Heartbeating: Nodes periodically send "I'm alive" messages to a central monitor or to each other. The absence of expected heartbeats indicates a potential failure.
Ping/Echo: A node sends a ping message, and the other node replies with an echo. Lack of response suggests a failure.
Gossip Protocols: Nodes exchange information about the health of other nodes, spreading failure information quickly.
Timeout-Based Detection: If a node doesn't respond within a certain time, it's assumed to have failed.
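Anti-Entropy Protocols: Nodes periodically compare and reconcile state with their peers. A peer that repeatedly fails to take part in these exchanges can be treated as suspect.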
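
The heartbeat and timeout ideas above can be combined in a few lines. The following is a minimal sketch, not a production design: the class name HeartbeatMonitor, its methods, and the 3-second default timeout are all hypothetical choices made for illustration, and the clock is injectable only so the logic can be exercised without waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that have gone quiet."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # hypothetical default; tune per deployment
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # node id -> time of last heartbeat

    def heartbeat(self, node_id):
        """Record an "I'm alive" message from node_id."""
        self.last_seen[node_id] = self.clock()

    def suspected(self):
        """Return sorted ids of nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Usage with a fake clock: node "b" stops sending heartbeats after t=0.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.heartbeat("a")
monitor.heartbeat("b")
fake_now[0] = 2.0
monitor.heartbeat("a")
fake_now[0] = 4.0
print(monitor.suspected())  # ['b']
```

Note that a node in the suspected list is only suspected: a slow network produces the same silence as a crash, which is why the timeout value matters so much.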

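Trade-offs must be made between accuracy and speed: short timeouts detect failures quickly but produce more false positives, while long timeouts are more accurate but delay recovery.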
Timeout values and other parameters need to be adjusted based on network conditions and application requirements.
Failure detection is often probabilistic: it provides a likelihood of failure rather than absolute certainty.
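
One way to make the probabilistic view concrete is to report a suspicion level instead of a yes/no answer, in the spirit of accrual-style failure detectors. The sketch below is an illustration under a strong simplifying assumption, namely that heartbeat inter-arrival times are exponentially distributed around their observed mean. The class name SuspicionEstimator, the window size, and the method names are hypothetical.

```python
import math
from collections import deque


class SuspicionEstimator:
    """Turns the silence since the last heartbeat into a suspicion level (phi)."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Return -log10 of the estimated probability that the node is still alive."""
        if not self.intervals:
            return 0.0  # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Assumed model: P(heartbeat arrives later than `elapsed`) = exp(-elapsed / mean).
        # Taking -log10 of that probability gives elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))
```

Under this model, phi = 1 means roughly a 10% chance that the silence is just a late heartbeat, and phi = 2 roughly a 1% chance. The application picks the threshold that matches its tolerance for false positives, and because the mean is learned from recent intervals, the effective timeout adjusts to network conditions.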