
Elements of Resilient Infrastructure

In this series of posts I’ll discuss the foundations of resilient infrastructure: facilities, network, server hardware, workload management, monitoring, and presentation, application, messaging and database servers. I’ll also look at why seemingly resilient implementations of these components can fail to provide resilient solutions. The focus will be on the patterns used when implementing these components rather than on vendor-specific details. Each vendor implementation may be a little different, but the methods for creating resiliency when designing and configuring a piece of infrastructure will be very similar. I’ll concentrate on which patterns are valuable and how those patterns relate to the other domains in the book to provide resilient solutions.

One recurring theme throughout is the notion that simple is usually better. A challenge when building and maintaining complex systems is striking the optimal balance between too much complexity (which makes failure modes harder to predict and troubleshoot) and too much simplicity (such that there is very little flexibility to deal with failures or maintain the system). Complexity can creep into our solution in insidious ways: a “high availability” clustering solution that makes troubleshooting intermittent failures difficult; dynamic routing in a network to provide redundancy that introduces variability that causes sporadic timeouts; and database failover approaches that make returning to the pre-failure configuration difficult. These approaches are not always bad, but the complexity introduced by these design decisions must be weighed against the value they provide and alternatives that are less sophisticated but easier to manage.

Facilities

The number of facilities (data centers or hosting locations) that host a solution is an obvious factor in the resiliency of an application. If a solution is hosted in only a single data center and the entire data center experiences a failure or disaster, the solution clearly fails. Assuming that a solution can scale horizontally across a small number (two to four) of facilities, the more relevant question from a resiliency perspective is how the application should be sized to maximize resiliency and minimize waste.

Questions to Consider When Determining Facility Needs for Resiliency

The number of facilities that are used to host a solution affects how systems and components will scale across sites, the amount of infrastructure needed for redundancy and, as we will see in later chapters, the application design. Some critical questions to ask when considering the number of facilities:

  1. Is there a range of the number of facilities in which this solution can be hosted? For example, is it possible to deploy the solution in two, three or four data centers? Or are two data centers the most that are available, thus constraining your choices?
  2. If there is a range of the number of facilities that can be used, how many simultaneous facility failures should the solution be able to tolerate? For example, if the solution will be hosted in three data centers, the solution should probably be able to tolerate the loss of at least one data center. Should it be also able to tolerate the loss of two?
  3. How does the data center topology affect other aspects of the solution? For example, presentation and application servers are typically much easier to scale horizontally than database servers.
  4. Will all the components scale equally well across data centers, or will some components, like a database server, scale only to two locations?
  5. How will load-balancing affect the number of facilities? Does load-balancing between facilities exist only at the “front door” (so once traffic comes to a particular data center it will never leave) or could traffic between system components be load-balanced to different data centers? How does this affect the application design?

Having control over the number of facilities allows for much greater control over the total number of components needed to obtain the same level of resiliency. Let’s consider a simplified solution that we want to deploy in multiple data centers. We know that we require four servers to be available to handle peak load on the system and we want to be able to tolerate the failure of one data center without impacting our customers.
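The arithmetic behind this scenario can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are mine rather than part of the scenario: every server is identical, load is spread evenly across data centers, and the surviving data centers can absorb all traffic from a failed one. The function names and the tolerated_failures parameter are hypothetical.

```python
import math


def servers_per_site(peak_servers, sites, tolerated_failures=1):
    """Servers each site needs so the surviving sites still cover peak load.

    Assumes identical servers and load spread evenly across sites.
    """
    surviving = sites - tolerated_failures
    if surviving < 1:
        raise ValueError("not enough sites to survive that many failures")
    return math.ceil(peak_servers / surviving)


def sizing_table(peak_servers, site_counts, tolerated_failures=1):
    """Return (sites, servers per site, total servers) for each site count."""
    rows = []
    for sites in site_counts:
        per_site = servers_per_site(peak_servers, sites, tolerated_failures)
        rows.append((sites, per_site, sites * per_site))
    return rows


if __name__ == "__main__":
    # Four servers needed at peak, one data center failure tolerated.
    for sites, per_site, total in sizing_table(4, [2, 3, 5]):
        print(f"{sites} data centers x {per_site} servers = {total} total")
```

Under these assumptions, two data centers need four servers each (eight in total), three need two each (six in total), and five need one each (five in total). Raising tolerated_failures shows how quickly the requirement grows: surviving the loss of two out of three data centers puts the full four servers in every site.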

By increasing the number of data centers we’ve reduced the total number of servers needed to provide the same level of resiliency – the ability to tolerate the failure of one data center. This is one example of the need to find the optimal balance for simplicity: two data centers with four servers each results in some server waste, but deploying the same solution in nine data centers with one server each probably introduces a lot of unnecessary overhead and is highly impractical.

The three data center choice offers a good balance in our example scenario. With two servers in each of three data centers we need six servers instead of eight, a 25% reduction in server cost, while only marginally increasing the complexity of the data center deployment.

Unfortunately, you may not always have the luxury of choosing how many facilities are available to deploy a solution. Perhaps your company or client only has two data centers, or constraints within the components you’re deploying limit your choices. Whatever the reason, it is still important to consider the questions above.

If there are multiple types of components being deployed (e.g. web servers, application servers, middleware servers), how does the number of data centers affect their configurations? Load balancing? Database replication? Messaging services? How will maintenance of the solution (configuration or code changes, software upgrades, etc.) be handled in the given topology? How does the number of data centers affect monitoring of the solution? I’ll explore these questions in more detail in a later post.

It’s also important to consider these questions in the context of third-party vendors on which your solution relies. While vendor contracts often specify penalties for failing to meet their service level agreement, these penalties rarely make up for the true cost of failure. Moreover, a vendor’s configuration may influence your design. If the solution you’re building will be deployed in three data centers at your company but needs to connect to two data centers at an external vendor’s site, how does this affect the connectivity between the sites? Many vendors may be reluctant to share implementation details of their infrastructure after contracts are signed, so it is critically important to negotiate for these details as part of your supplier management process.

It’s also important to consider how the environment within a facility may improve or degrade the resiliency of a solution. It is common for data centers to have multiple power distribution units (PDUs), receive power from multiple power grids, have redundant ISP connections, and so on. However, failures still can and do occur, often because the equipment in the facility is not set up to take advantage of this redundancy. For example, a rack full of servers with dual power supplies may have both power supplies on every server fed from the same PDU despite multiple PDUs being available. Likewise, circuits from multiple ISPs may be provisioned into the data center, but all of the servers in a solution that require ISP connectivity may only have access to one of those circuits.
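A check for the power example above is easy to automate if you keep an inventory of which PDU feeds each power supply. The sketch below is a minimal illustration, assuming such an inventory exists; the Server record, the pdu_feeds field and the server and PDU names are all hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    pdu_feeds: tuple  # one PDU identifier per power supply (hypothetical field)


def single_pdu_servers(servers):
    """Return the names of servers whose power supplies do not span two PDUs."""
    return [s.name for s in servers if len(set(s.pdu_feeds)) < 2]


if __name__ == "__main__":
    rack = [
        Server("web01", ("pdu-a", "pdu-a")),  # both supplies on the same PDU
        Server("web02", ("pdu-a", "pdu-b")),  # supplies split across PDUs
    ]
    for name in single_pdu_servers(rack):
        print(f"{name}: both power supplies depend on a single PDU")
```

The same idea applies to the ISP example: record which circuit each server can reach and flag any group of servers that depends on only one.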

In the next post, I’ll explore how the network may affect resiliency in unexpected ways.
