Resilient Network Infrastructure

The network is a critical resource to nearly every enterprise IT solution. While many network resiliency considerations are inherently part of modern networking, some of these features actually undermine resiliency. For example:

  • the availability of multiple routes can sometimes introduce unpredictable behavior in applications if a preferred route is unavailable and alternate routes have higher latency;
  • switch ports and server network interfaces that have worked correctly in “auto-negotiate” mode suddenly stop working;
  • unintentional asymmetric routing results in packets trying to return to the customer through a firewall that never saw the incoming request, causing the packets to be dropped.

In this section, we’ll examine how to ensure that the network layer of new solutions is built for resiliency and avoid these types of problems.

Consider the Resiliency and Effects of Failure of “Core” Network Services

Some network services fail so infrequently (and are so catastrophic when they do fail) that they are rarely considered explicitly as a failure mode. DNS is one such service: most architects, software developers and server administrators are so conditioned to DNS working that they never consider what happens when it fails. In many cases, DNS is such a foundational service that a failure of the service is fatal to the solution. Insulating against DNS failures requires a two-pronged approach.

First, it’s important to understand the resiliency of the DNS service. How much redundancy exists in the DNS infrastructure? How frequently has it failed in the past? Is the scope of a DNS failure limited in some way, such as by location on the network or by zone? In a future post, I’ll discuss some tools that can be used to help assess the resiliency needs of a solution and assess failure modes. If the network team supporting DNS cannot commit to the level of availability required for the solution, or if previously observed failure rates exceed the tolerance of the solution, the DNS service needs to be improved.

Second, the solution must be analyzed to determine which services, components and transactions will fail if DNS is not available. This can be a difficult task because DNS is generally assumed to “just work”, so identifying every place that relies on DNS can be time-consuming. One approach to quickly identify DNS dependencies is to intentionally misconfigure the DNS settings on a machine-by-machine or component-by-component basis in a test environment. Since DNS failures usually cause such widespread failures, performing this kind of “negative test” in a very controlled fashion on small pieces of the solution at a time is usually much more manageable. It’s also important to remember that DNS dependencies are not just introduced in application code; web, application and database server configurations often rely on DNS for their internal operation.
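The same negative-test idea can be tried at a much smaller scale inside a single Python process before misconfiguring a whole machine. The sketch below is only an illustration under stated assumptions: `connect_to_backend` and the host name are hypothetical stand-ins for a real component, and the simulated outage only intercepts `socket.getaddrinfo`, so code that resolves names some other way will not be caught. It is also coarse in that names normally answered from a local hosts file fail too.

```python
import ipaddress
import socket
from contextlib import contextmanager
from unittest import mock

_real_getaddrinfo = socket.getaddrinfo


def _fake_getaddrinfo(host, *args, **kwargs):
    """Fail every lookup of a name; let literal IP addresses through."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        raise socket.gaierror("simulated DNS outage") from None
    return _real_getaddrinfo(host, *args, **kwargs)


@contextmanager
def dns_unavailable():
    """Within this block, name resolution behaves as if DNS were down."""
    with mock.patch.object(socket, "getaddrinfo", _fake_getaddrinfo):
        yield


def connect_to_backend(host, port):
    """Hypothetical component under test: it resolves its target first."""
    return socket.getaddrinfo(host, port)


def depends_on_dns(component, *args):
    """Return True if the component fails once name resolution is removed."""
    with dns_unavailable():
        try:
            component(*args)
        except socket.gaierror:
            return True
    return False


if __name__ == "__main__":
    print(depends_on_dns(connect_to_backend, "app1.example.internal", 8080))
    print(depends_on_dns(connect_to_backend, "127.0.0.1", 8080))
```

Running each component through a check like this, one at a time, gives a first inventory of where names are resolved. It does not replace the machine-level test, because server configurations outside the application code can depend on DNS as well.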

Ultimately, it may be infeasible or impractical to eliminate dependencies on DNS. However, understanding how DNS-related failure modes manifest themselves is useful for quickly identifying a DNS problem when it occurs. There is a saying in the medical field that is apropos: “When you hear hoofbeats, think horses, not zebras.” This is how most technicians successfully troubleshoot failures. A DNS failure is a zebra, and in the rare instances when DNS failures occur, it can take hours to identify the root cause of the problem if operations teams aren’t familiar with the symptoms.

Ironically, the resiliency provided by DNS can present problems in a way unrelated to the availability of DNS itself. Many applications will cache DNS responses or ignore DNS time-to-live (TTL) settings. Cached DNS responses can cause significant problems when DNS load-balancing solutions are used to respond to failures in the environment, because clearing the cache sometimes requires restarting a process, often resulting in even more failures. Applications that fail to honor DNS TTL experience a similar failure and sometimes are outside of a technologist’s control. One example of this is ISPs whose proxy servers are configured to override DNS TTL. Assume that a global DNS load balancer is load-balancing mydomain.com between two IP addresses with a TTL of zero. It is tempting to assume that removing one IP address from the responses will immediately stop new requests from hitting that address, but requests may continue for quite some time.
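To make the TTL point concrete, here is a minimal sketch of a cache that does honor TTL, so that an address removed by the load balancer stops being used once the record expires rather than when the process restarts. The `resolve` callable is an assumption: Python's standard `socket` functions do not expose a record's TTL, so a real implementation would need a resolver library that does. The addresses come from a documentation-only range.

```python
import time


class TtlDnsCache:
    """Cache lookups, but never keep an answer past its TTL."""

    def __init__(self, resolve, clock=time.monotonic):
        # resolve: hypothetical callable, name -> (addresses, ttl_seconds)
        self._resolve = resolve
        self._clock = clock
        self._entries = {}  # name -> (addresses, expires_at)

    def lookup(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # still within TTL
        addresses, ttl = self._resolve(name)
        if ttl > 0:
            self._entries[name] = (addresses, now + ttl)
        else:
            # A TTL of zero means "do not cache": ask again next time.
            self._entries.pop(name, None)
        return addresses


if __name__ == "__main__":
    def static_resolver(name):
        # Stand-in for a real resolver; always answers with TTL 0.
        return (["192.0.2.10", "192.0.2.20"], 0)

    cache = TtlDnsCache(static_resolver)
    print(cache.lookup("mydomain.com"))
```

A cache like this only fixes the layers you control. Intermediaries that override TTL, such as the ISP proxies mentioned above, can still send traffic to a removed address.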

Routing protocols present another unique challenge to resiliency. In most cases, routing protocols ensure that a route is available between the source and destination of a connection. This does not guarantee that the available route should actually be used, however. For example, consider the scenario shown below, a simplified web and application server configuration hosted in two data centers. For the sake of clarity, global and local load balancers have been omitted from the diagram, as have redundant web and application servers in each data center.

Network routing using preferred (solid lines) and sub-optimal (dashed lines) alternate route

Assume that customer requests are routed to the web server in the internet-facing DMZ and that the web server acts as a proxy for the application servers on the internal network. If the customer’s request enters data center A, the preferred route from web server 1 to application server 1 is shown in solid black lines. Let us also assume that the alternate route shown in the dashed line is available, but is suboptimal: it involves several more network hops and additional latency.

As long as the preferred route is available, this network topology works as expected. Now consider a failure mode where the connection between the internet DMZ and the internal network in data center A fails (for example, because an incorrect firewall rule is created). The alternate route between web server 1 and application server 1 becomes active, routing the request through the second data center and adding significant latency to the request. Because requests are taking longer, active (open) connections to the web server accumulate, which causes performance problems and may cause some connections to be denied as the web server runs out of resources. This failure mode results in many customers having a degraded experience due to the latency between the web and application server, and it will eventually cause failures.
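The way open connections accumulate follows from Little’s law: the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The figures in this sketch are invented purely for illustration (200 requests per second, 50 ms on the preferred route, 750 ms on the alternate route, and a web server limited to 100 connections); none of them come from a real environment.

```python
def open_connections(requests_per_second, latency_ms):
    """Little's law: average in-flight requests = arrival rate x time in system."""
    return requests_per_second * latency_ms / 1000


# All values below are assumed for the sake of the example.
ARRIVAL_RATE = 200        # requests per second reaching web server 1
MAX_CONNECTIONS = 100     # connection limit configured on the web server
ROUTES = {
    "preferred": 50,      # milliseconds per request
    "alternate": 750,     # milliseconds per request via the other data center
}

if __name__ == "__main__":
    for route, latency_ms in ROUTES.items():
        needed = open_connections(ARRIVAL_RATE, latency_ms)
        verdict = "fits" if needed <= MAX_CONNECTIONS else "exceeds the limit"
        print(f"{route}: about {needed:.0f} open connections ({verdict})")
```

With these assumed numbers the same traffic needs about 10 open connections on the preferred route and about 150 on the alternate route, so the web server runs out of connections even though no request has actually failed.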

If the suboptimal alternate route had not been available, the connection between the web and application server would have been broken immediately. As we will see in a future post, workload management tools like load balancers combined with intelligent “health checks” in the application can quickly detect this condition and stop traffic from being routed to the failing components. In the scenario where web server 1 cannot connect to application server 1, “fast failure” is preferred to the prolonged degraded connectivity condition that exists when traffic is routed through the second data center.

This is another example of simple being better. Intelligent routing adds complexity without reducing the risk of failure and probably increases the amount of time needed to troubleshoot the problem. The degraded connectivity condition seen above is an example of a problem we will see frequently that I call “sick but not dead”. This class of failure is always problematic: it can trick monitoring that should detect failures into believing the system is healthy, and it makes identifying the root cause of a failure much more difficult.
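One way to keep a “sick but not dead” component from fooling a health check is to treat a slow answer as a failed answer. The sketch below is a rough illustration, not a description of any particular load balancer: the 200 ms latency budget, the host name and the function names are all assumptions, and the probe is only a TCP connect made with the standard `socket.create_connection` call.

```python
import socket
import time


def tcp_probe(host, port, timeout):
    """Build a probe that opens, then closes, a TCP connection."""
    def probe():
        with socket.create_connection((host, port), timeout=timeout):
            pass
    return probe


def check_backend(probe, latency_budget, clock=time.monotonic):
    """Classify a backend as 'healthy', 'slow' or 'down'."""
    started = clock()
    try:
        probe()
    except OSError:  # refused, unreachable, timed out, name lookup failed
        return "down"
    elapsed = clock() - started
    return "healthy" if elapsed <= latency_budget else "slow"


def should_receive_traffic(status):
    """Fast failure: a slow backend is taken out just like a dead one."""
    return status == "healthy"


if __name__ == "__main__":
    # Hypothetical target; the short timeout keeps the check itself fast.
    probe = tcp_probe("app1.example.internal", 8080, timeout=1.0)
    status = check_backend(probe, latency_budget=0.2)
    print(status, should_receive_traffic(status))
```

The design choice that matters is in `should_receive_traffic`: a check that only asks “did it answer?” would report the detour through the second data center as healthy.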

Network redundancy solutions can also cause problems at the server and network interface. Different operating system vendors use different terms, but most operating systems allow for the pairing of network interfaces for redundancy. (In Solaris this is called IPMP; in Linux, “bonding” or “teaming”; on AIX, “EtherChannel” or “Network Interface Backup”.) All of these solutions are valuable for improving resiliency, as long as care is taken to ensure that the paired network interfaces are cabled to physically distinct switches. All too often, redundant, paired network interfaces are cabled to the same switch even though redundant switches are available with a shared VLAN. Similar to the testing of redundant PDUs by unplugging server power supplies, redundant network interfaces should be tested by unplugging network cables. Again, organizing such a test is difficult after a server is in production, so always plan for failure testing before a solution goes live.
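Before anyone unplugs a cable, a quick audit of the cabling records can catch the most common mistake. The sketch below assumes the records are available as two simple mappings; the interface, pair and switch names are made up, and in practice the data might come from a CMDB export or switch neighbor tables.

```python
def pairs_sharing_a_switch(pairs, cabling):
    """Return the interface pairs whose members all land on one switch.

    pairs:   pair name -> list of member interface names
    cabling: interface name -> name of the switch it is cabled to
    """
    problems = {}
    for pair_name, members in pairs.items():
        switches = {cabling[member] for member in members}
        if len(switches) < 2:
            problems[pair_name] = sorted(switches)
    return problems


if __name__ == "__main__":
    # Hypothetical inventory for two servers.
    pairs = {
        "web1-bond0": ["web1-eth0", "web1-eth1"],
        "app1-bond0": ["app1-eth0", "app1-eth1"],
    }
    cabling = {
        "web1-eth0": "switch-a",
        "web1-eth1": "switch-b",
        "app1-eth0": "switch-a",
        "app1-eth1": "switch-a",  # both legs on the same switch
    }
    print(pairs_sharing_a_switch(pairs, cabling))
```

An audit like this only checks what the records say. The cable-pull test is still what proves that failover works.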

DNS, routing protocols and redundant network interfaces are instructive examples of why it is important to consider core network services that may otherwise go unnoticed. When building a new solution, it is very important to examine the configuration of the network itself and question whether network resiliency features will improve the solution. Wherever possible, simplify the network configuration and ensure that “sick but not dead” scenarios are avoided in favor of fast failure.
