The last few posts have focused on specific elements of infrastructure. This post will address two cross-cutting concerns that affect the resiliency of all infrastructure components: monitoring and the use of “islands of functionality” to minimize the effects of failures and speed troubleshooting.
Monitoring
Monitoring the health of every component at every layer of a solution is absolutely mandatory to create a resilient solution. If your solution is not instrumented so that its health can be determined in near real time, or so that alerts are generated automatically when its health changes, the solution is inherently not resilient; it’s that simple. As a result, defining the monitoring requirements of the solution is a critical part of solution design. In future posts I will discuss tools that help identify which aspects of a solution are important to monitor. From an infrastructure perspective, it’s important to know what monitoring capabilities exist and which gaps need to be addressed before a solution is implemented. In particular, the following questions may help identify critical needs and gaps:
- What tools exist to monitor the health of the network and servers?
- Is it possible to tune thresholds in those tools so that they match dangerous thresholds in this solution?
- How can dangerous thresholds be identified so that they can be set correctly in production? (For example, in a test environment? Using production equipment before going live?)
- Will I be able to test my monitoring thresholds and alerting in a test environment?
- Do the monitoring tools generate alerts? If so, who receives them? How do the recipients of monitoring alerts know how to respond?
- What mechanisms exist to instrument applications for errors? (For example, file-based log monitors that can “watch” for exceptions or platform-specific monitoring frameworks like JMX.) Do the application developers know which conditions should trigger monitoring events? Is there a standard approach or specification for the developers to implement in their code when those conditions occur?
- Are component-specific tools available to monitor for conditions like long running transactions on a database or high queue depth in a messaging backbone?
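As a rough illustration of the file-based log monitor mentioned in the questions above, the sketch below scans log lines for entries that look like errors or exceptions. The pattern and the function name `scan` are assumptions made for this example; a real solution would use whatever conditions the developers have agreed should trigger monitoring events.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical pattern: ERROR/FATAL log levels, or any class name ending in
# "Exception". Adjust to the conditions agreed with the application developers.
PATTERN = re.compile(r"\b(?:ERROR|FATAL)\b|\w+Exception\b")


def scan(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that matches PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]
```

Because `scan` accepts any iterable of strings, it could be pointed at an open log file or at lines read by a scheduled job, and its output fed into whatever alerting mechanism is already in place.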
Monitoring tools do not need to be expensive or elaborate. While many enterprise monitoring solutions are both expensive and elaborate, comprehensive monitoring can be accomplished without them. The health checks described in the workload management section are a good example; some simple scripting to invoke health checks and generate email alerts is inexpensive and ensures several failure modes are detected. Dashboards that quickly convey the health of the overall solution can be built using open source tools. Whatever technology is used, the goal is to have a single place to go to see the health of the solution. The view can be as simple as red/yellow/green indicators on a web page that shows every component’s health.
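The simple scripting described above might look something like the following minimal sketch. It runs a set of health checks, rolls the results up into red/yellow/green, and builds an email alert when anything is unhealthy. The function names, the component names, and the rule that a failing check counts as red are all assumptions made for illustration; actually sending the message (for example with `smtplib`) is left out.

```python
from email.message import EmailMessage
from typing import Callable, Dict, Optional

# Ordering used to pick the worst status across components.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def run_checks(checks: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run each check; a check that raises or returns an unknown value is red."""
    results = {}
    for name, check in checks.items():
        try:
            status = check()
        except Exception:
            status = "red"
        results[name] = status if status in SEVERITY else "red"
    return results


def overall(results: Dict[str, str]) -> str:
    """Worst status across all components (green when there are none)."""
    return max(results.values(), key=SEVERITY.__getitem__, default="green")


def build_alert(results: Dict[str, str], sender: str,
                recipient: str) -> Optional[EmailMessage]:
    """Build an alert listing unhealthy components, or None if all are green."""
    unhealthy = {n: s for n, s in results.items() if s != "green"}
    if not unhealthy:
        return None
    message = EmailMessage()
    message["Subject"] = (
        f"[{overall(results).upper()}] {len(unhealthy)} component(s) unhealthy"
    )
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        "\n".join(f"{n}: {s}" for n, s in sorted(unhealthy.items()))
    )
    return message
```

The same `results` dictionary could drive a red/yellow/green web page, so the alert and the dashboard always agree on each component’s health.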
As we will see throughout future posts, improving the resiliency of existing solutions depends on having comprehensive data about health and where failures are occurring. It is extremely important to get monitoring right from the beginning when creating new systems.
Islands of Functionality
At various points in the last few posts I’ve discussed the value of arranging groups of functionality in ways that can be discretely controlled. The ability to segment hardware (and, by extension, the software that runs on it) provides flexibility to control how those components are used. The configuration of database listeners to segment users of the database was shown as a way to manage failures. Creating data center specific domain names to control session persistence provides better resiliency. These are all examples of a more general concept of “islands of functionality”.
An island of functionality is a small collection of components of a solution that can be managed as a unit. As an example, consider again the online insurance application deployed in the configuration shown in the figure below. Boxes with labels beginning with W are web servers, those beginning with A are application servers, and those beginning with DBL are database listeners. Note that two web servers are grouped together with two application servers. For any given web server there are only two possible application servers to which a request can be routed, and for any given application server there are only two web servers from which a request could have originated. Two database listeners are configured on each database.
This configuration creates islands of functionality in several ways. First, small groupings of web and application servers can be managed as a group. If A1 or A3 needs to have maintenance performed on it, only two web servers are affected rather than four (the entire data center) or eight (both data centers, if we wanted to enable all web servers to send traffic to all application servers). Second, if a problem occurs on any server in this environment, the scope of the problem should be limited to one quarter of the environment. For example, if unusual errors are occurring on A3 but the root cause appears to be some other component, it’s very likely it could only be W1, W3, A1, or DBL1 causing the problem. Third, this configuration allows us to separate individual customers from institutional customers by employing a configuration that routes individual customers to the odd numbered servers and institutional customers to the even numbered servers. Finally, as we will see in future posts, this kind of configuration also lends itself to easier operational routines such as software upgrades and configuration changes.
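To make the troubleshooting benefit concrete, here is a minimal sketch that models the islands in one data center as plain data. The odd-numbered island (W1, W3, A1, A3, DBL1) comes from the example above; the even-numbered island and all function names are assumptions inferred from the odd/even routing description, not taken from the figure.

```python
from typing import Dict, List

# One data center, two islands. The "institutional" membership is assumed.
ISLANDS: Dict[str, Dict[str, List[str]]] = {
    "individual": {
        "web": ["W1", "W3"], "app": ["A1", "A3"], "db_listener": ["DBL1"],
    },
    "institutional": {
        "web": ["W2", "W4"], "app": ["A2", "A4"], "db_listener": ["DBL2"],
    },
}


def members(island: str) -> List[str]:
    """All components in an island, in web, app, listener order."""
    parts = ISLANDS[island]
    return parts["web"] + parts["app"] + parts["db_listener"]


def island_of(component: str) -> str:
    """Name of the island that contains the given component."""
    for name in ISLANDS:
        if component in members(name):
            return name
    raise KeyError(component)


def suspects(component: str) -> List[str]:
    """Other components that could be involved in a problem seen here."""
    return [m for m in members(island_of(component)) if m != component]


def affected_web_servers(app_server: str) -> List[str]:
    """Web servers that lose a target when this application server is down."""
    return ISLANDS[island_of(app_server)]["web"]
```

With this model, `suspects("A3")` yields W1, W3, A1, and DBL1, matching the reasoning in the paragraph above, and `affected_web_servers("A1")` shows that maintenance on A1 touches only two web servers.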
