Resilient Network Infrastructure, Part 2

While the effect the network has on solutions is usually understood in general terms, specific details are often unavailable. One reason is the difficulty of replicating production network conditions in test environments. Subtle changes in network performance that are difficult to detect can have significant effects on the performance and stability of systems. Likewise, seemingly insignificant changes in the operating environment of a solution (such as changes in customer behavior or scheduled operations like backups) can have drastic effects on the network. Just as a sub-optimal route degraded the performance of the system in the example in Part 1, a change in network performance that introduces the same degree of latency would have the same effect.

Understand the Effect of the Network on Solution Resiliency

Problems frequently arise in new solutions because production networks are inherently more complex than the networks used for testing: they have more devices, more variability in usage and performance, and are often more distributed. One of the most important – and difficult – characteristics to understand is the effect of latency on a solution. The easiest approach is also the most expensive because it requires building out an exact replica of the production network. Since this is often prohibitively expensive, creative approaches may be used to simulate production.

The most difficult network configuration to simulate is geographical distribution. If a solution on the production network will be distributed between data centers in Colorado and Massachusetts but you have only one test environment in Kansas, your test results will not accurately reflect production. One way to approximate the effect of this distribution is to mirror the production VLAN configuration. Assuming you have an application server VLAN in Colorado and another in Massachusetts, one option is to create two VLANs in Kansas and divide application servers between them. To accurately simulate performance, measure the production latency during peak network utilization, then simulate that latency between the two VLANs. (Open source tools exist to do this at the network interface level, and network appliances and traffic shapers can perform the same function at the switch level.) Database connections – including those that support replication – are particularly sensitive to this kind of latency, so if only one component can be tested in this manner, the database is usually the best bet.
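The measure-then-simulate step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the choice of the 95th percentile, and the sample values are all hypothetical. It estimates how much delay to add between the two test VLANs from round-trip times sampled in production at peak and in the test environment.

```python
import math


def percentile(samples, pct):
    """Return the pct-th percentile of samples (nearest-rank method)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def delay_to_inject_ms(prod_rtt_ms, test_rtt_ms, pct=95):
    """Estimate the extra one-way delay (ms) to add between the test VLANs.

    Assumes the delay is applied in both directions, so the round-trip
    gap between production and test is split in half. Never negative.
    """
    gap = percentile(prod_rtt_ms, pct) - percentile(test_rtt_ms, pct)
    return max(0.0, gap / 2)


# Hypothetical round-trip samples in milliseconds.
prod_samples = [48, 50, 52, 55, 61, 49, 51, 53, 70, 50]  # Colorado <-> Massachusetts at peak
test_samples = [1, 1, 2, 1, 1, 2, 1, 1, 1, 2]            # between the two Kansas VLANs

extra_delay_ms = delay_to_inject_ms(prod_samples, test_samples)
print(f"Add about {extra_delay_ms:.1f} ms of delay in each direction")
```

The resulting figure would then be handed to whatever delay tool is in use; Linux `tc` with the `netem` queueing discipline is one open source option at the interface level. Using a high percentile rather than the mean is a design choice that biases the test toward peak conditions.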

Another common cause of network resiliency failures is the unexpected impact of firewalls. These effects usually fall into one of the following categories:

  • Unexpected latency introduced by firewall interfaces (a variation of the theme above)
  • Inconsistent rule configuration causing failures (usually closely correlated with poorly tested or managed infrastructure changes)
  • Unusual interactions between the firewall, network and application

While the first two issues are relatively straightforward, “unusual interactions” are (as you might imagine!) hard to predict. A real-world example illustrates how these interactions may manifest themselves. If an application makes a network connection that traverses a firewall but is not very “chatty”, the firewall may time out the connection even though the application believes the connection is still alive. This may lead to intermittent problems that are difficult to diagnose: database connections that fail unexpectedly, network-mounted file systems that suddenly become unavailable, and so on. Frequent activity on the system masks these types of failures, so in many cases this type of problem will crop up during periods of low utilization. It is difficult to predict exactly where and when this type of problem will occur, so a good practice when working on the physical deployment of a solution is to analyze all of the network devices that critical connections traverse and ask what may lead to those connections being unexpectedly terminated. Ways to approach this problem will be discussed in a future blog post.
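As a small illustration of the idle-connection case (not necessarily the approach the later post takes), one widely used mitigation is TCP keepalive, which makes the operating system send periodic probes on a quiet connection so the firewall keeps seeing traffic. The sketch below uses Python's standard `socket` module. The helper name and the timing values are assumptions, and the idle time would need to be shorter than the idle timeout of the firewall in question.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Ask the OS to send TCP keepalive probes on an otherwise idle socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are platform specific (all three exist on
    # Linux); each one is applied only where Python exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


# Configure a socket before connecting it; no network access is needed here.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Many database drivers and connection pools expose similar keepalive or validation settings of their own, which may be a better fit than setting socket options directly.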

One way to prevent these problems from reaching production is to test with the same firewall configuration as production. Software-based firewalls are usually inexpensive and easy to simulate in test environments, but network appliance (dedicated hardware-based) firewalls can be expensive to purchase and maintain in a test environment. Having a test configuration that mirrors production is valuable because it also enables infrastructure teams to test firewall rules before they are implemented in production. While substituting a software firewall for a network appliance firewall in a test environment is better than nothing at all, different firewall products behave differently and such a configuration doesn’t enable true testing of firewall rules.

Network Performance Is Not Static

Even the most comprehensive analysis of the effect of the network on a solution can be undone by the dynamic nature of production networks. User volumes grow, backup schedules change, new devices are added to the network, and suddenly a database transaction that took 100ms when the solution went live takes 300ms only a few months later. While 200ms may not sound like much, a three-fold increase in response time at the database layer can ripple through the system: application threads spend longer in a wait state, server resources are held for longer periods, and the resulting resource starvation and timeouts cause failures.
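The ripple effect can be made concrete with Little's law, which says the average number of requests in progress equals the arrival rate multiplied by the time each request takes. The sketch below applies it to the 100ms and 300ms figures above. The request rate and thread pool size are hypothetical values chosen only for illustration.

```python
import math


def busy_threads(requests_per_second, response_time_ms):
    """Average threads occupied, by Little's law (rate x time in system)."""
    return math.ceil(requests_per_second * response_time_ms / 1000)


POOL_SIZE = 10  # hypothetical application thread pool size
RATE = 50       # hypothetical steady load in requests per second

at_launch = busy_threads(RATE, 100)     # database call takes 100 ms
months_later = busy_threads(RATE, 300)  # same call now takes 300 ms
pool_exhausted = months_later > POOL_SIZE

print(f"Threads busy at launch: {at_launch}")
print(f"Threads busy months later: {months_later}")
print(f"Pool of {POOL_SIZE} exhausted: {pool_exhausted}")
```

Under these assumed numbers the load itself never changed, yet the pool that was half idle at launch can no longer keep up, and that is the point at which requests start queueing and timing out.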

Early warning of changing network conditions is a necessity. As we will see in future posts, understanding changes in the environment is an element of operational resiliency. In the spirit of an integrated approach to resiliency, it is not enough to assume that a network operations team knows what a particular solution requires to be resilient. This is the benefit of a holistic approach to resiliency: through the analysis and testing described above and later in this book, the resiliency practitioner can determine what parameters influence the resiliency of a solution and what ranges of performance are acceptable. This information will enable operations teams and monitoring applications to know when (and, in many cases, before) a system is approaching a dangerous threshold.
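Once acceptable ranges are known, the "when and before" idea can be expressed as a simple two-level check. The following is a minimal sketch rather than a monitoring product: the function name, the status labels, and the 200ms and 300ms limits are assumptions made for illustration.

```python
def classify(value, warn_at, fail_at):
    """Classify a measurement against a warning level and a failure level."""
    if fail_at <= warn_at:
        raise ValueError("fail_at must be greater than warn_at")
    if value >= fail_at:
        return "critical"  # the dangerous threshold has been reached
    if value >= warn_at:
        return "warning"   # approaching the threshold: time to act
    return "ok"


# Hypothetical limits for database response time in milliseconds.
WARN_MS, FAIL_MS = 200, 300

for reading_ms in (100, 250, 300):
    print(reading_ms, classify(reading_ms, WARN_MS, FAIL_MS))
```

The warning level is what gives operations teams notice before the failure level is hit, so it is worth deriving from the testing described above rather than picking arbitrarily.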
