Infrastructure Resiliency, Server Hardware and Workload Management

Approaches to sizing and scaling server hardware vary from vendor to vendor and between distributed and mainframe technology, but some general principles apply. One area of frequent disagreement in some organizations is the choice between a few large-frame distributed systems and many smaller (commodity or near-commodity) servers. Google has proven the commodity hardware approach in the search domain, but the approach also works well in many other commercial domains where tasks are easily parallelized and not computationally intensive. Many commercial applications meet these criteria: rich internet applications, multi-user client-server applications, middleware/integration solutions and some batch processing applications can often be implemented in this way. Designing applications to work on small, commodity hardware has several benefits.

Server Hardware Resiliency

As with facilities, the more units of hardware that processing is spread across, the smaller the impact of any one of those units failing. While logical partitioning of large distributed systems makes it possible to simulate smaller systems, adding capacity to these systems is often more intrusive than simply adding another commodity server to a rack. Another difference is cost: scaling large distributed systems incurs a large step in cost each time an additional frame is needed, whereas the cost of scaling commodity hardware grows linearly with each server added.
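The cost and failure-impact arguments above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming load is spread evenly across units; the function names and every number passed to them are hypothetical, chosen only to show the shape of the two cost curves.

```python
import math


def commodity_cost(servers_needed: int, cost_per_server: float) -> float:
    """Linear cost: each additional server adds the same amount."""
    return servers_needed * cost_per_server


def large_frame_cost(partitions_needed: int, partitions_per_frame: int,
                     cost_per_frame: float) -> float:
    """Step cost: a whole new frame is bought once the current one is full."""
    frames = math.ceil(partitions_needed / partitions_per_frame)
    return frames * cost_per_frame


def capacity_lost_when_one_fails(unit_count: int) -> float:
    """Fraction of capacity lost if one of unit_count equal units fails."""
    return 1 / unit_count
```

With these assumptions, growing from 16 to 17 partitions on a 16-partition frame doubles the frame cost, while the commodity cost rises by the price of one server. Losing one of four units removes a quarter of the capacity; losing one of forty removes 2.5 percent.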

Another benefit of having many smaller units of hardware is the ability to use small groupings of hardware (and the software that runs on them) to pilot new functionality or changes. Alternatively, hardware and software can be segmented to insulate different groups of customers from affecting each other. (This is particularly useful when a solution provides a shared service to different internal business lines, to external customers that have dramatically different value to the business and are stratified by value, and so on.)

A Fortune 100 insurance company may have an online self-service application used both by individual insurance customers with relatively small policies and by institutional customers who pay very expensive premiums. The business may decide that individual customers are tolerant of the site being unavailable but institutional customers demand continuous availability. Having two completely isolated groupings of components, each used by only one type of customer, provides much more flexibility. This idea of “islands of functionality” has other benefits; we’ll revisit the topic shortly.

Commodity hardware can also be easier to manage than large-frame distributed systems. There is some overhead associated with managing more devices, but automating common management tasks (such as software deployments, configuration changes, and server maintenance tasks) minimizes this overhead, with the added benefit of reducing the likelihood of server configurations drifting out of sync with each other. People skilled in managing commodity hardware are also easier to find than people familiar with “big-iron” distributed systems.

When considering the size of hardware to use for a particular solution, it’s important to consider how the application will affect the physical deployment. A relatively stateless application may scale very well across many small servers by leveraging simple load balancing techniques, whereas a complex, stateful application may benefit from larger servers. Physical deployment options may, in turn, affect how an application is designed. In the case of the complex, stateful application, the effect of the application’s requirements on the physical topology may reveal that it is better to maintain the state of the customer’s session in a large distributed caching solution. Again, this highlights the benefits of a holistic approach to thinking about resiliency while designing the solution. The concept of leveraging a distributed cache will be addressed in a future post and may further influence decisions about server hardware.

Commodity hardware isn’t appropriate for every scenario, so identifying the ideal hardware needs to be an explicit consideration during the design process. There may be cases where commodity servers can be used for tasks that horizontally scale nearly linearly (like presentation, application and middleware servers) but larger hardware is needed for database servers or computationally intensive analytics that cannot be parallelized.

Workload Management

The design and implementation of workload management features greatly influence the resiliency of a solution. There are several mechanisms that can be used to manage workload: load balancers are one of the most common, but configuring application, database and messaging components to naturally distribute load is equally important. Under normal operating conditions, these mechanisms and techniques give the solution a way to manage the load handled by a particular component, usually by attempting to distribute the load equally or to send it to the component that can satisfy a request the fastest. In failure modes, however, workload management solutions are the first line of defense in minimizing the extent of the failure.

The combination of DNS load balancers (“global load balancers”) and local load balancers (which distribute load within a LAN environment) is fundamental to the multi-facility implementations discussed in my previous post. The ability of many of these products to detect failure is also critical to resiliency. For example, most load balancers have a variety of mechanisms (“health checks”) to determine the health of the resources they are load balancing across. These health checks vary in sophistication: at their most basic, a load balancer may ping a device, while more advanced implementations offer scripting capabilities that can look for specific content in the device’s response and adjust the load balancer’s behavior accordingly.

As a guideline, basic ICMP (ping) health checks are not sufficient. Many failure modes exist where a device’s TCP/IP stack is fully functional but the software that uses that stack is failing. As a result, every load balancer configuration should be tailored to obtain the most accurate information possible about the state of the components for which it is managing load. In many cases, this requires the creation of custom code within the application to give the load balancer information about the application’s health. Custom health check code in the application tier can also be used to indirectly monitor the health of other components such as the database, messaging server, cache, and so on.

For example, consider “application A” that relies on the availability of two databases and a JMS connection to “application B” to operate correctly. The developer of application A creates a health check that performs a simple (very fast!) query against each of the databases and produces a test JMS message to application B. The developer of application B creates a similar health check capability that consumes the test JMS message, checks its own internal resources, and responds. This health check is used by the load balancer to periodically confirm that application A is operating correctly. The way the health check is created and exposed depends on the system: a J2EE application may expose a simple JSP page, a .NET application may expose a simple ASPX page, and a client-server application may provide a script that can be invoked over a terminal connection.
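The example above describes JSP or ASPX pages; the same idea can be sketched in Python for illustration. This is a minimal sketch, not a definitive implementation: `run_health_check` and the three probe functions are hypothetical names, and the probes are stubs standing in for the real database queries and the JMS round trip to application B.

```python
import time
from typing import Callable, Dict, Tuple


def run_health_check(probes: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every dependency probe and return (HTTP status, body)."""
    lines = []
    healthy = True
    for name, probe in probes.items():
        started = time.monotonic()
        try:
            ok = bool(probe())
        except Exception:
            # A probe that raises is treated as a failed dependency.
            ok = False
        elapsed_ms = (time.monotonic() - started) * 1000
        lines.append(f"{name}: {'OK' if ok else 'FAIL'} ({elapsed_ms:.0f} ms)")
        healthy = healthy and ok
    # The load balancer can match on the status code or on the first body line.
    status = 200 if healthy else 503
    body = "\n".join(["HEALTHY" if healthy else "UNHEALTHY"] + lines)
    return status, body


# Hypothetical stand-ins for application A's real checks.
def check_database_1() -> bool:
    return True  # e.g. run a trivial, very fast query


def check_database_2() -> bool:
    return True


def check_jms_round_trip() -> bool:
    return True  # e.g. send a test message to application B and await the reply


if __name__ == "__main__":
    status, body = run_health_check({
        "database-1": check_database_1,
        "database-2": check_database_2,
        "jms-to-application-b": check_jms_round_trip,
    })
    print(status)
    print(body)
```

Returning a single overall verdict plus one line per dependency keeps the load balancer rule simple while still telling support staff which dependency failed.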

The utility of this health check cannot be overstated. The load balancers that use these health checks are now far more informed about the internal health of the application than a simple ICMP check would allow. The health check can also provide status to monitoring tools and generate alerts to support teams when a failure occurs. Real-time management and reporting dashboards can be created to show the operating condition of the solution. The response time of the health check is often another indicator of health. Even if all of the checks in the example above are working, there is a significant difference between the health check responding in 0.1 seconds and 1.1 seconds. This type of response time trending can act as an early warning system to alert support staff to an impending failure, or be used to automatically route traffic away from a particular device.
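Response time trending can be sketched as a rolling average over recent health check timings. The class name, the window size, and the 0.5 and 1.0 second thresholds below are all assumptions made for illustration; real thresholds would come from observing the application under normal load.

```python
from collections import deque
from statistics import mean


class ResponseTimeTrend:
    """Track recent health check response times and flag a slowdown."""

    def __init__(self, window: int = 5, warn_seconds: float = 0.5,
                 fail_seconds: float = 1.0) -> None:
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)
        self.warn_seconds = warn_seconds
        self.fail_seconds = fail_seconds

    def record(self, seconds: float) -> str:
        """Add a sample and classify the rolling average."""
        self.samples.append(seconds)
        average = mean(self.samples)
        if average >= self.fail_seconds:
            return "route-away"  # candidate for automatic removal from rotation
        if average >= self.warn_seconds:
            return "warn"  # early warning for support staff
        return "ok"
```

Averaging over a window means a single slow response does not pull a server out of rotation, while a sustained slowdown does.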

These types of health checks are also easy to extend to provide additional operational controls to support teams. For example, it is trivial to add logic to the health check above to look at a file on the local file system of the application to determine if support teams may want to stop traffic from being routed to a server even though the application is healthy. This type of functionality can be used to route traffic away from servers for maintenance purposes. It could also be used to cause a global load balancer to stop distributing the IP address of a group of locally load balanced servers but permit customers who already have the address cached to complete their work. The flexibility provided by this feature is extremely valuable and should be part of every solution – there is virtually no reason not to do it.
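The file-based control described above might look like the following. This is a minimal sketch: the function name and the idea of returning an HTTP-style status code are assumptions, and the flag file path would be whatever the support team agrees on.

```python
import os


def effective_health(app_is_healthy: bool, flag_path: str) -> int:
    """Return 503 if the drain flag file exists or the app is unhealthy, else 200."""
    if os.path.exists(flag_path):
        # Support staff created the flag file: report unhealthy on purpose so
        # the load balancer stops routing new traffic to this server.
        return 503
    return 200 if app_is_healthy else 503
```

Creating the file takes a server out of rotation and deleting it puts the server back, with no change to the load balancer configuration.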

As important as load balancers and health checks are to workload management, they are just one factor in the resiliency of an application. Understanding (and, when possible, influencing) the application’s requirements with respect to persisting connections is also very important. Consider a web application that manages session state only on local servers. Session state is not replicated across every application server running the application. Let us assume for the moment that the session management configuration is deliberate because of performance concerns or because the technology used doesn’t support replication and cannot be changed. How might this influence our workload management approach?

The problem is that a request arriving at a data center that does not hold the customer’s session state cannot be served correctly. To solve it, we need an approach that keeps a session in one data center from start to finish. Since the behavior exhibited by ISP proxies and DNS caching makes this difficult in our current configuration, we need to eliminate the possibility of a session jumping from one data center to the other. One possible solution is to create “data center specific” domain names. Let us assume that the customers come to the domain www.highlyresilientsite.com. The application could generate links from the landing page of that domain that direct the customer’s browser to www1.highlyresilientsite.com or www2.highlyresilientsite.com, which correspond to data centers A and B respectively. Even if the customer’s ISP suddenly load balances the customer to another proxy server, the session will remain in the same data center. This affects our DNS configuration, but it also places a new requirement on the application: the ability to “know” which data center domain names are valid and should be served. (This will be a recurring theme: resiliency requires a holistic approach to design!)
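The requirement that the application “know” which data center domain names are valid can be sketched as a small lookup. The host names come from the example above; the function name, the `in_service` set, and the fallback rule (prefer the local data center, otherwise the first one still in service) are assumptions for illustration.

```python
from typing import Dict, Set

# Data-center-specific host names from the example above.
DATA_CENTER_HOSTS: Dict[str, str] = {
    "A": "www1.highlyresilientsite.com",
    "B": "www2.highlyresilientsite.com",
}


def session_host(local_dc: str, in_service: Set[str]) -> str:
    """Pick the host name used for every link generated in a new session."""
    valid = sorted(dc for dc in in_service if dc in DATA_CENTER_HOSTS)
    if not valid:
        raise RuntimeError("no data center is in service")
    # Prefer the data center that served the landing page, so the session
    # starts where its state will live.
    chosen = local_dc if local_dc in valid else valid[0]
    return DATA_CENTER_HOSTS[chosen]
```

The landing page served from data center A would call `session_host("A", {"A", "B"})` and build its links against the returned name; if A has been taken out of service, new sessions are pointed at B instead.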

Another aspect of workload management is the design of messaging. Messaging infrastructure generally needs to work in collaboration with the application workload management approach. For example, suppose the application tier of a solution is stateful and requires session persistence. If the application uses a messaging service (for example, IBM WebSphere MQ or Microsoft MSMQ) to communicate with another component that is also stateful, this implies a very different design than communication that is completely stateless. Part of the design process for a solution that involves messaging must be considering the nature of the messages being sent by the application and how the messaging service can be configured to provide resiliency without violating assumptions about the state of the component receiving the messages.

The configuration of databases also influences workload management. In some cases, workload on a database can be managed by using replication or redundancy features to spread load and insulate components from affecting one another. A common approach is replicating data from a production database to a reporting database. Another pattern for managing load is to create multiple database listeners for different segments of customers or purposes. As with the insurance company example, where distinct groups of hardware serve different types of customers, controlling access to the database through multiple listeners provides flexibility in how the database is used. For example, assume a database used to support the insurance application had three listeners: one for use by a segment of application servers servicing individual customers, one for use by a separate segment of application servers servicing institutional customers, and a third for use when real-time reporting is needed. If the database experienced performance problems, administrators could simply disable the listener for real-time reporting to attempt to relieve some load and maintain availability for customers.
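The three-listener arrangement can be modeled in a few lines to show how shedding one workload leaves the others untouched. This is only a model under stated assumptions: real databases manage listeners with their own administration tools, and the class names, workload keys, and listener names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Listener:
    name: str
    enabled: bool = True


class ListenerRegistry:
    """Model of one database reached through a separate listener per workload."""

    def __init__(self) -> None:
        # Hypothetical listener names for the three workloads in the example.
        self._listeners: Dict[str, Listener] = {
            "individual": Listener("lsnr_individual"),
            "institutional": Listener("lsnr_institutional"),
            "reporting": Listener("lsnr_reporting"),
        }

    def disable(self, workload: str) -> None:
        """Stop accepting new connections for one workload."""
        self._listeners[workload].enabled = False

    def connect(self, workload: str) -> str:
        """Return the listener name for a workload, or refuse if it is disabled."""
        listener = self._listeners[workload]
        if not listener.enabled:
            raise ConnectionRefusedError(f"{listener.name} is disabled")
        return listener.name
```

Calling `disable("reporting")` makes later reporting connections fail while `connect("individual")` and `connect("institutional")` keep working, which is the load-shedding behavior the paragraph describes.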
