Determining server liveness

There are two distinct parts to determining server liveness: data collection and decision making. Server liveness data is collected using a process called servermonitor that is running on liveness testing agents. Liveness scores are sent from the liveness testing agents to a component called DatacenterStateAgent (DSA) every ten seconds. DSA makes the liveness decisions, and sends updates to GTM nameservers every five seconds.

Each liveness testing agent periodically performs one or more liveness tests on each of your servers. The liveness testing agent then computes a score that is either the download time in seconds, or a penalty score if the download request times out or encounters an error such as a 404 error. The default penalty for a timeout is 25 and the default penalty for an error is 75. These penalties are configurable on the back end. The time to allow before declaring a timeout (by default, 25 seconds) is configurable in the portal (see the Liveness Test section).

Note: Connection timeouts are treated as errors (penalty 75), not timeouts. Only timeouts that occur in the data transfer stage receive the timeout penalty (25). This prevents servers that are disconnected or shut off from being preferred over servers that are returning errors or refusing connections.
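
The scoring rule described above can be sketched as follows. This is a minimal illustration, not the servermonitor implementation: the Outcome enum and the liveness_score function are hypothetical names introduced here, and the penalties are the defaults quoted in the text.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical result categories for a single liveness test."""
    OK = "ok"
    CONNECT_TIMEOUT = "connect_timeout"    # could not connect in time
    TRANSFER_TIMEOUT = "transfer_timeout"  # connected, but data transfer timed out
    ERROR = "error"                        # for example, a 404 response


TIMEOUT_PENALTY = 25.0  # default penalty for a timeout
ERROR_PENALTY = 75.0    # default penalty for an error


def liveness_score(outcome, download_seconds=0.0):
    """Return the download time on success, otherwise a penalty score."""
    if outcome is Outcome.OK:
        return download_seconds
    if outcome is Outcome.TRANSFER_TIMEOUT:
        return TIMEOUT_PENALTY
    # Connection timeouts are scored as errors, not as timeouts.
    return ERROR_PENALTY
```

Because a connection timeout scores 75 rather than 25, a server that is switched off never scores better than one that is actively refusing connections.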

The scores are collected from each liveness testing agent; for each server, the median of the scores from all liveness testing agents is used for the remainder of the calculation.
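
As a small sketch of this step, the per-server score can be taken with the standard library median. The function name server_score is hypothetical. The text does not say how an even number of agents is handled; statistics.median averages the two middle values in that case, which is an assumption here.

```python
from statistics import median


def server_score(agent_scores):
    """Median of the scores reported by all liveness testing agents for one server."""
    return median(agent_scores)


# One of three agents saw an error (75); the median ignores the outlier.
print(server_score([1.1, 1.3, 75.0]))  # 1.3
```

Using the median means that a single agent with a bad network path cannot, on its own, push a server over the cutoff.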

In addition to the instantaneous scores, servermonitor also computes and reports an exponentially decaying average. In the calculation that follows, the score used is the greater of the instantaneous score and the average score. This means that when a server goes down, GTM stops handing it out immediately, but when it comes back up, GTM does not hand it out again until the liveness testing agents have had several successful downloads. If a server is intermittent, it will be declared down when several liveness testing agents get errors within a few minutes of each other: although the averages left over from the earlier errors are falling, they are still above the cutoff when the new errors occur.
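
The effect of taking the greater of the two scores can be illustrated with a short sketch. The function name effective_scores and the smoothing factor alpha are assumptions made for illustration; the text does not state the decay rate that servermonitor actually uses.

```python
def effective_scores(samples, alpha=0.5):
    """For each sample, return max(instantaneous score, decaying average).

    alpha is a hypothetical smoothing factor, not a documented GTM setting.
    """
    average = None
    result = []
    for score in samples:
        if average is None:
            average = score
        else:
            average = alpha * score + (1 - alpha) * average
        result.append(max(score, average))
    return result


# Two good downloads, one error (75), then two good downloads again.
print(effective_scores([1.0, 1.0, 75.0, 1.0, 1.0]))
# [1.0, 1.0, 75.0, 19.5, 10.25]
```

The score jumps to 75 as soon as the error occurs, but after recovery it falls only gradually, so with a cutoff of 4 the server in this sketch would still be considered down after two successful downloads.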

A cutoff value is computed from the median scores. Any server with a score over the cutoff value is considered dead, and load is not sent to it. The cutoff is computed from the minimum score across all servers (for a given property) and two parameters: health_multiplier and health_threshold. The cutoff is either health_multiplier (default value: 1.5) times the minimum score or health_threshold (default value: 4), whichever is greater.

The following examples describe how the cutoff is determined.

Example 1

In this example (Table 16), server A has the best score of 1.0 seconds. Because 1.5 times the best score (1.5) is less than the health_threshold of 4, the cutoff is 4. Servers A, B, and C are declared up, while server D is declared down, as its score (15) is greater than 4.

Table 16. Example 1

Server  Score  Status  Cutoff
A       1.0    Up      4
B       1.2    Up
C       3.0    Up
D       15     Down

Example 2

This example (Table 17) shows a high load situation, where the servers are slow, but still responding. Server A has the best score of 8 seconds so the cutoff is 12 (1.5 * 8). Servers A, B, and D are declared up, while server C is declared down, as its score (15) is greater than 12.

Table 17. Example 2

Server  Score  Status  Cutoff
A       8      Up      12
B       11     Up
C       15     Down
D       10     Up

Example 3

In this example (Table 18), server A is timing out and the other servers are returning errors. Server A has the best score of 25 (the timeout penalty), so the cutoff is 37.5 (1.5 * 25). Server A is declared up, while servers B, C, and D are declared down, as their scores (75) are greater than 37.5.

Table 18. Example 3

Server  Score         Status  Cutoff
A       25 (timeout)  Up      37.5
B       75 (error)    Down
C       75 (error)    Down
D       75 (error)    Down
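
The three examples can be reproduced with a short sketch of the cutoff rule. The function names cutoff and classify are hypothetical; the parameter names and default values are the ones given in the text.

```python
def cutoff(scores, health_multiplier=1.5, health_threshold=4.0):
    """Greater of health_multiplier * best score and health_threshold."""
    return max(health_multiplier * min(scores.values()), health_threshold)


def classify(scores, health_multiplier=1.5, health_threshold=4.0):
    """Return (cutoff, status per server); a score over the cutoff means Down."""
    limit = cutoff(scores, health_multiplier, health_threshold)
    status = {name: "Up" if score <= limit else "Down"
              for name, score in scores.items()}
    return limit, status


example_1 = {"A": 1.0, "B": 1.2, "C": 3.0, "D": 15.0}
example_2 = {"A": 8.0, "B": 11.0, "C": 15.0, "D": 10.0}
example_3 = {"A": 25.0, "B": 75.0, "C": 75.0, "D": 75.0}

for scores in (example_1, example_2, example_3):
    print(classify(scores))
```

The cutoffs printed are 4.0, 12.0, and 37.5, matching the three tables above.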

The algorithm is modified slightly if a backup CNAME exists. If the cutoff score computed as described above is greater than 0.9 times the timeout penalty (22.5 with the default timeout penalty of 25), then 0.9 times the timeout penalty is used as the cutoff value instead. This guarantees that, if all servers are timing out or returning errors, GTM declares them down so that the backup CNAME is handed out. (Normally, if all servers are timing out or all are returning errors, the standard algorithm declares them up so that they are still handed out; you do not want this if there is a backup CNAME.)
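
A minimal sketch of this modification, assuming the default penalties. The function name cutoff_with_backup and the has_backup_cname flag are hypothetical names chosen for illustration.

```python
def cutoff_with_backup(scores, has_backup_cname,
                       health_multiplier=1.5, health_threshold=4.0,
                       timeout_penalty=25.0):
    """Standard cutoff, capped at 0.9 * timeout_penalty when a backup CNAME exists."""
    limit = max(health_multiplier * min(scores), health_threshold)
    if has_backup_cname:
        limit = min(limit, 0.9 * timeout_penalty)
    return limit


scores = [25.0, 75.0, 75.0, 75.0]  # one server timing out, three returning errors

standard = cutoff_with_backup(scores, has_backup_cname=False)
capped = cutoff_with_backup(scores, has_backup_cname=True)

print([s <= standard for s in scores])  # [True, False, False, False]
print([s <= capped for s in scores])    # [False, False, False, False]
```

With the cap in place, the server that is timing out (score 25) no longer falls under the cutoff, so every server is declared down and the backup CNAME can be handed out. A cutoff that is already below the cap, such as 12 in a high-load case, is left unchanged.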