Resilient solutions require resilient infrastructure and resilient application design and development practices. Earlier posts discussed factors in a solution’s infrastructure that help increase resiliency, but the application that runs in that infrastructure must also be resilient or else the solution will fail. Software architects and developers generally recognize that their applications must tolerate failure, but their approaches usually focus on conditions that exist within their own code rather than on external factors. Another consideration of resilient application design is how to limit the scope of failures; it is far better to have a fraction of functions fail quickly while the rest of the application’s functions continue to operate normally than to cause all functions to fail or operate in a degraded fashion.
The practices outlined in this series of posts are not substitutes for good code quality and common fault-tolerant development practices; rather, they augment those well-known patterns. For example, Pullum describes many techniques for fault avoidance, prevention, removal and prediction in software, while Koren and Krishna outline a systems-centric approach to hardware and software fault tolerance. Code construction practices like those described by McConnell also benefit resiliency. Preventing and eliminating defects is the obvious direct benefit, but high-quality code also speeds debugging of the problems that cause failures and makes adapting to new failure conditions easier.
In the next series of posts, I’ll assume that these foundational practices are being applied in the solution and turn our focus to a holistic approach to application design that considers interactions with other elements in the solution. These approaches are largely independent of each other but can be grouped into three general categories: instrumenting and reacting to operating conditions, coding for upstream and downstream component dependencies, and expecting the unexpected. Throughout these posts I’ll refer to “components” and “applications”. In this context, a component is an element of an application or solution that by itself may not serve a business function. For example, a component might be a logging subsystem in an application or a database server. Conversely, an application should be taken to be a collection of code (one or more components) that operates together to perform some function.
Instrumenting and Reacting to Operating Conditions
In order for an application to tolerate failures it must be able to measure and, in some cases, react to changes in operating conditions. If an application is unaware that a failure is occurring, there is little the application will be able to do itself to prevent or minimize the effects of the failure. Therefore, it is important that the application has the ability to measure operating characteristics that may affect its operation. These operating conditions include the number of users on the system, awareness of failures of “downstream” components (other applications or components on which the application is dependent), availability of critical resources, the rate of change of a particular operating characteristic, and so on. In some cases, the application may not be able to react, but the awareness of such a condition may enable logging of useful information. In other cases, the application may be able to incorporate “autonomic” responses to attempt to mitigate the failure. In this post, I’ll discuss approaches to implementing these types of features.
Instrumenting and Responding to Load
One of the most important factors affecting an application’s resiliency is the load it is under or the volume of requests it is receiving. In a user-interface-oriented application (like a client/server or web application), this may be measured in the number of active users, sessions, or connections. In a more batch-oriented, data processing application, this may be the workload waiting for processing. (Integration or middleware applications also experience this condition, but we will address integration resiliency in detail in future posts.) Whatever the primary measure is for the load on your application, it must be measured.
One important distinction to make when measuring load is to measure what matters. A client/server application with 100 clients sitting at a “login” screen is not the same as 100 users actively using the application after logging in. Likewise, an online banking web application with 100 customers browsing static marketing pages on the site is not the same as 100 authenticated customers viewing account details and making bill payments. This is why measuring load is an application design and development concern – merely counting the number of open TCP/IP connections or active sessions on an application server is not accurate for measuring the actual impact to your application.
As a result, you must make provisions for instrumenting load. It may be valuable to consider using a singleton pattern to create a common object for storing this kind of data. You can then create a simple counter that is incremented when an event occurs that adds load to the system and decremented when load is taken away from the system. For example, in a J2EE application using a typical model-view-controller pattern, the controller handling login requests could call a method to increment the counter when a login is successful. Upon logoff or a session timeout event, a similar controller would decrement the counter.
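A minimal sketch of such a counter in Java follows. The class name `LoadCounter` and its method names are hypothetical, chosen only for illustration; it assumes that "load" means successfully logged-in users on a single server instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical singleton holding the load metric for one server instance.
final class LoadCounter {
    private static final LoadCounter INSTANCE = new LoadCounter();

    // AtomicInteger keeps the count consistent when many request threads update it.
    private final AtomicInteger activeUsers = new AtomicInteger();

    private LoadCounter() { }

    static LoadCounter getInstance() {
        return INSTANCE;
    }

    // Intended to be called by the login controller after a successful login.
    int increment() {
        return activeUsers.incrementAndGet();
    }

    // Intended to be called on logoff or session timeout; never drops below zero.
    int decrement() {
        return activeUsers.updateAndGet(n -> n > 0 ? n - 1 : 0);
    }

    int current() {
        return activeUsers.get();
    }
}
```

Clamping the decrement at zero is a design choice in this sketch: it keeps a duplicate logoff or timeout event from driving the metric negative.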
One consideration with this type of approach is the scope of what is being measured. In many cases, the application can only measure how many events are occurring within its local instance. For example, in a client/server application you may have several different server instances load-balancing traffic, and each instance is only aware of logins on that instance. In this scenario it is necessary to aggregate the load information from each server instance. It is generally better to use application code to do this aggregation rather than trying to store and operate on an aggregated counter in a single location (like a database table). While the database approach is tempting because of its centralized location and ease of access, the value is updated so frequently that the risk of a deadlock or race condition affecting the accuracy of the measurement and the performance of the application is generally too high.
In a client/server application, one possible approach is to expose an administrative RPC call to get the number of active users. In a web application, each application server may have a servlet that returns the current count from the singleton object with the counter. It is then possible to write some code that calls each individual server instance and totals the number of active users. While aggregating the total number of users across all instances may seem frustrating, there is tremendous value in having the load metric at an instance level. By monitoring volume on an instance-by-instance basis, it is very easy to detect situations where there is an imbalance in load. This can be an early warning of a load balancing/workload management problem, or it may bring awareness to external factors that degrade the effectiveness of a particular load balancing scheme. (Having the load metric available on an instance-by-instance basis of an application server would help identify the fact that load is skewed to one particular server or a set of servers sharing a load-balancer. This type of information is very helpful to diagnose the root cause of failures related to DNS or load-balancing failure modes.)
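The aggregation and imbalance check described above might be sketched as follows. Each server instance is modeled as an `IntSupplier` that returns its local count; in a real deployment that supplier would wrap the administrative RPC call or the request to the counter servlet. The class name `LoadAggregator`, the method names, and the "more than `factor` times the mean" rule for flagging skew are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntSupplier;

// Hypothetical aggregator that polls every instance for its local load count.
final class LoadAggregator {
    private final Map<String, IntSupplier> instances = new LinkedHashMap<>();

    // The supplier stands in for a remote call to one server instance.
    void register(String name, IntSupplier source) {
        instances.put(name, source);
    }

    // Collects the current per-instance counts, keeping registration order.
    Map<String, Integer> snapshot() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        instances.forEach((name, source) -> counts.put(name, source.getAsInt()));
        return counts;
    }

    static int total(Map<String, Integer> counts) {
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }

    // Returns the instances whose load exceeds the mean load by more than the given factor.
    static List<String> skewed(Map<String, Integer> counts, double factor) {
        double mean = counts.isEmpty() ? 0.0 : (double) total(counts) / counts.size();
        List<String> result = new ArrayList<>();
        counts.forEach((name, load) -> {
            if (load > mean * factor) {
                result.add(name);
            }
        });
        return result;
    }
}
```

Keeping the per-instance snapshot, rather than only the total, is what makes the imbalance check possible.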
By making each application instance aware of its local load, it is also possible for the application to change its behavior based on load. If a particular application instance can handle no more than 500 active users, it may be preferable to allow those 500 users to continue using the application while denying access to any additional users until the load diminishes. While this may result in some failures for users trying to log in, it is a far better experience for the users who were already using the application. (This assumes that the application performance would begin to degrade significantly when more than 500 users were using it, such that all users would pay a performance penalty or experience errors.) Alternatively, it may be that the application can continue to allow users in but may restrict access to certain functions that require significantly more resources than the rest of the application.
I’ll refer to this functionality as “gating”, as it is conceptually the same as a gate to an amusement park that can hold a limited number of people. Adding gating functionality is trivial once the load “counting” mechanism is in place. Rather than blindly incrementing the counter, the application first checks the current value of the counter and compares it to a configurable parameter indicating the maximum value before turning away users. If additional load will not exceed the maximum, the application continues normally. If additional load would exceed the maximum, then some other action is taken (either attempting to send the request to a server with more free capacity for load or generating an error to the user). The maximum value parameter is set by stress testing the application and monitoring the load metric to determine what level of load causes the application to malfunction.
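One way to sketch the gate in Java is shown below. The class name `LoadGate` and its methods are hypothetical, and the maximum is assumed to come from configuration after stress testing. The check and the increment are combined into a single compare-and-set loop, because checking the counter and then incrementing it as two separate steps could let two concurrent logins both pass when only one slot remains.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical gate that admits new load only while the instance is below its maximum.
final class LoadGate {
    private final AtomicInteger active = new AtomicInteger();

    // Configurable limit, assumed to be determined by stress testing.
    private volatile int maxLoad;

    LoadGate(int maxLoad) {
        this.maxLoad = maxLoad;
    }

    void setMaxLoad(int maxLoad) {
        this.maxLoad = maxLoad;
    }

    // Returns true and counts the new load if there is capacity; otherwise returns false.
    boolean tryEnter() {
        while (true) {
            int current = active.get();
            if (current >= maxLoad) {
                return false; // caller redirects the request or reports an error to the user
            }
            if (active.compareAndSet(current, current + 1)) {
                return true;
            }
            // Another thread changed the count in between; re-read and retry.
        }
    }

    // Called on logoff or session timeout; never drops below zero.
    void leave() {
        active.updateAndGet(n -> n > 0 ? n - 1 : 0);
    }

    int current() {
        return active.get();
    }
}
```

A login controller would call `tryEnter()` before creating the session and `leave()` when the session ends.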
Exposing the load metric also enables monitoring of load on a near real-time basis. If an application has implemented gating, monitoring tools can be configured to alarm when volume is approaching maximum load. It is also possible to monitor the rate of change in load. Rapid changes in load may indicate some kind of failure (e.g. customers experiencing problems in their session and quickly logging back in) or something more nefarious like a denial of service attack.
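As a rough illustration of monitoring the rate of change, the sketch below compares each load sample with the previous one and raises a flag when the change per second exceeds a threshold. The class name `LoadRateMonitor`, the method `record`, and the threshold parameter are hypothetical; a real monitoring tool would likely smooth the samples over several readings rather than compare only two.

```java
// Hypothetical monitor that flags rapid changes between consecutive load samples.
final class LoadRateMonitor {
    private final double maxChangePerSecond;
    private boolean hasSample;
    private long lastMillis;
    private int lastLoad;

    LoadRateMonitor(double maxChangePerSecond) {
        this.maxChangePerSecond = maxChangePerSecond;
    }

    // Records a sample and returns true if load changed faster than the threshold
    // since the previous sample. The first sample never raises an alarm.
    synchronized boolean record(long nowMillis, int load) {
        boolean alarm = false;
        if (hasSample && nowMillis > lastMillis) {
            double seconds = (nowMillis - lastMillis) / 1000.0;
            double rate = Math.abs(load - lastLoad) / seconds;
            alarm = rate > maxChangePerSecond;
        }
        hasSample = true;
        lastMillis = nowMillis;
        lastLoad = load;
        return alarm;
    }
}
```

The absolute value means a sudden drop in load is flagged as well as a sudden spike, since either may indicate a failure.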
Many of the examples of gating I’ve discussed have defined load as the number of users, but applications that are not user or session based also benefit from this approach. If a data processing application must process a queue of files in a fixed length of time, the load may be the number of files and the maximum value may be the largest number of files that can be processed in time. A stateless web services application may measure the number of SOAP or REST requests over a short period of time. A number-crunching application may measure the size of a data set or be able to predict the complexity of the operation. Whatever the case, the fundamental approach is the same:
- Instrument the application to measure load
- Expose the load metric so that monitoring or operational tools have a near real-time measurement of load
- Determine the maximum load the application can handle without failing
- Implement gating functionality so the application can guard against unsafe loads
- Proactively monitor load to identify problems before they affect customers
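For applications that are not session based, the first step in the list above needs a different kind of counter. The sketch below shows one possible way to measure requests over a short period of time for a stateless service, using a sliding time window. The class name `RequestWindowCounter` and the window length are assumptions for illustration, and timestamps are passed in explicitly so the behavior is easy to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window counter: load is the number of requests in the last windowMillis.
final class RequestWindowCounter {
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    RequestWindowCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Called once per incoming request.
    synchronized void record(long nowMillis) {
        timestamps.addLast(nowMillis);
        evict(nowMillis);
    }

    // Number of requests recorded within the window ending at nowMillis.
    synchronized int count(long nowMillis) {
        evict(nowMillis);
        return timestamps.size();
    }

    // Drops timestamps that are a full window or more in the past.
    private void evict(long nowMillis) {
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
    }
}
```

The value returned by `count` can then be exposed, compared against a maximum, and monitored in the same way as a user count.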
In my next post, I’ll discuss developing code for effective logging, monitoring and troubleshooting.