In my first post on the promise of SOA, I mentioned the following as one of the constraining organizational factors that prevent full realization of the benefits of SOA investments:
Project management methodologies and software development processes that are based on waterfall approaches to building software and usually highly dependent on integrated testing.
Large enterprises usually have well-established project management processes, often built around a waterfall approach to the software development lifecycle. These processes serve a very valuable purpose in a non-SOA environment, as these organizations generally have complex, interdependent systems with a large amount of change occurring in parallel. Rigorous, formal waterfall development processes help mitigate the risk of change in this type of environment. Unfortunately, these processes also inhibit the flexibility SOA promises.
The last few posts have focused on specific elements of infrastructure. This post will address two cross-cutting concerns that affect the resiliency of all infrastructure components: monitoring and the use of “islands of functionality” to minimize the effects of failures and speed troubleshooting.
Monitoring
Monitoring the health of every component at every layer of a solution is absolutely mandatory to create a resilient solution. If your solution is not instrumented so that its health can be determined in near real-time and alerts are generated automatically when its health changes, the solution is inherently not resilient – it’s that simple. As a result, defining the monitoring requirements of the solution is a critical part of solution design. In future posts I will discuss tools that help identify which aspects of a solution are important to monitor. From an infrastructure perspective, it’s important to know what monitoring capabilities exist and any gaps that need to be addressed before a solution is implemented. In particular, the following questions may help identify critical needs and gaps:
- What tools exist to monitor the health of the network and servers?
- Is it possible to tune thresholds in those tools so that they match dangerous thresholds in this solution?
- How can dangerous thresholds be identified so that they can be set correctly in production? (For example, in a test environment? Using production equipment before going live?)
- Will I be able to test my monitoring thresholds and alerting in a test environment?
- Do the monitoring tools generate alerts? If so, who receives them? How do the recipients of monitoring alerts know how to respond?
- What mechanisms exist to instrument applications for errors? (For example, file-based log monitors that can “watch” for exceptions or platform-specific monitoring frameworks like JMX.) Do the application developers know which conditions should trigger monitoring events? Is there a standard approach or specification for the developers to implement in their code when those conditions occur?
- Are component-specific tools available to monitor for conditions like long running transactions on a database or high queue depth in a messaging backbone?
Monitoring tools do not need to be expensive or elaborate. While many enterprise monitoring solutions are both expensive and elaborate, comprehensive monitoring can be accomplished without them. The health checks described in the workload management section are a good example; some simple scripting to invoke health checks and generate email alerts is inexpensive and ensures several failure modes are detected. Dashboards can be built using open source tools that quickly convey the health of the overall solution. Whatever technology is used, the goal is to have a single place to go to see the health of the solution. The view can be as simple as red/yellow/green indicators on a web page that shows every component’s health.
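As a minimal sketch of the red/yellow/green view described above, the following Python rolls individual component states up into one overall state and renders a plain-text dashboard. The component names and the `ComponentStatus` type are hypothetical; how each state is obtained (health checks, log monitors, and so on) is left out.

```python
from dataclasses import dataclass

# Severity order used to roll component states up into one overall state.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


@dataclass
class ComponentStatus:
    name: str    # hypothetical component name, e.g. "web-1"
    state: str   # "green", "yellow" or "red"


def overall_state(statuses):
    """Return the worst state among all components (green if there are none)."""
    worst = "green"
    for status in statuses:
        if SEVERITY[status.state] > SEVERITY[worst]:
            worst = status.state
    return worst


def render_dashboard(statuses):
    """Render a plain-text view: overall state first, then one line per component."""
    lines = [f"OVERALL: {overall_state(statuses).upper()}"]
    for status in sorted(statuses, key=lambda s: s.name):
        lines.append(f"{status.name}: {status.state}")
    return "\n".join(lines)
```

The same roll-up could feed a web page or an email alert; the point is only that a single worst-case indicator is cheap to compute once every component reports a state.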
As we will see throughout future posts, improving the resiliency of existing solutions is dependent on having comprehensive data about health and where failures are occurring. It is extremely important to get monitoring right from the beginning when creating new systems.
Islands of Functionality
At various points in the last few posts I’ve discussed the value of arranging groups of functionality in ways that can be discretely controlled. The ability to segment hardware (and, by extension, the software that runs on it) provides flexibility to control how those components are used. The configuration of database listeners to segment users of the database was shown as a way to manage failures. Creating data center specific domain names to control session persistence provides better resiliency. These are all examples of a more general concept of “islands of functionality”.
An island of functionality is a small collection of components of a solution that can be managed as a unit. As an example, consider again the online insurance application deployed in the configuration shown in the figure below. Boxes with labels beginning with W are web servers, those beginning with A are application servers, and those beginning with DBL are database listeners. Note that two web servers are grouped together with two application servers. For any given web server there are only two possible application servers to which a request can be routed and for any given application server there are only two web servers from which a request could have originated. Two database listeners are configured on each database.
 Insurance application using islands of functionality
This configuration creates islands of functionality in several ways. First, small groupings of web and application servers can be managed as a group. If A1 or A3 needs to have maintenance performed on it, only two web servers are affected rather than four (the entire data center) or eight (both data centers, if we wanted to enable all web servers to send traffic to all application servers). Second, if a problem occurs on any server in this environment, the scope of the problem should be limited to one quarter of the environment. For example, if unusual errors are occurring on A3 but the root cause appears to be some other component, it’s very likely it could only be W1, W3, A1, or DBL1 causing the problem. Third, this configuration allows us to separate individual customers from institutional customers by employing a configuration that routes individual customers to the odd numbered servers and institutional customers to the even numbered servers. Finally, as we will see in future posts, this kind of configuration also lends itself to easier operational routines such as software upgrades and configuration changes.
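The troubleshooting benefit can be sketched in a few lines of Python. The layout below models only one data center and is partly an assumption: the odd-numbered island (W1, W3, A1, A3, DBL1) comes from the example above, while the even-numbered island is assumed to mirror it. The function names are hypothetical.

```python
# Hypothetical island layout for one data center. Only the odd-numbered island
# is described in the text; the even-numbered one is assumed to mirror it.
ISLANDS = {
    "individual": {"W1", "W3", "A1", "A3", "DBL1"},
    "institutional": {"W2", "W4", "A2", "A4", "DBL2"},
}


def island_of(component):
    """Return the name of the island that contains the component."""
    for name, members in ISLANDS.items():
        if component in members:
            return name
    raise KeyError(f"unknown component: {component}")


def suspects(component):
    """Other components in the same island: the places to look first."""
    return sorted(ISLANDS[island_of(component)] - {component})


def affected_web_servers(app_server):
    """Web servers that lose a target when the application server is taken down."""
    members = ISLANDS[island_of(app_server)]
    return sorted(m for m in members if m.startswith("W"))
```

With this layout, `suspects("A3")` returns the same four candidates named in the example, and `affected_web_servers("A1")` returns only W1 and W3.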
Approaches to sizing and scaling server hardware vary from vendor to vendor and between distributed and mainframe technology, but some general principles apply. One area of frequent disagreement in some organizations is the decision between a few large-frame distributed systems and many smaller (commodity or near-commodity) servers. Google has proven the commodity hardware approach in the search domain, but this approach also works well in many other commercial domains where tasks are easily parallelized and not computationally intensive. Many commercial applications meet these criteria: rich internet applications, multi-user client-server applications, middleware/integration solutions and some batch processing applications can often be implemented in this way. Designing applications to work on small, commodity hardware has several benefits.
Server Hardware Resiliency
As with facilities, the more units of hardware processing is spread across, the smaller the impact of any one of those units failing. While logical partitioning of large distributed systems enables simulating smaller systems, adding capacity to these systems can often be more intrusive than simply adding another commodity server to a rack. Another difference is that scaling large distributed systems is periodically costly when an additional frame is needed whereas the cost of scaling commodity hardware is linear.
Another benefit to having many smaller units of hardware is the ability to use small groupings of hardware (and the software that runs on them) to pilot new functionality or changes. Alternatively, hardware and software can be segmented to insulate different groups of customers from affecting each other. (This is particularly useful when a solution is providing a shared service to different internal business lines, external customers that have dramatically different value to the business and are stratified by value, and so on.)
A Fortune 100 insurance company may have an online self-service application used by individual insurance customers with relatively small policies and institutional customers who pay very expensive premiums. The business may decide that individual customers are tolerant of the site being unavailable but institutional customers demand continuous availability. Having two completely isolated groupings of components that are used only by one type of customer provides much more flexibility. This idea of “islands of functionality” has other benefits – we’ll revisit the topic shortly.
Commodity hardware can also be easier to manage than large-frame distributed systems. There is some overhead associated with managing more devices, but automating common management tasks (such as software deployments, configuration changes, and server maintenance tasks) minimizes this overhead with the added benefit of minimizing the likelihood of server configurations drifting out of sync with each other. Managing commodity hardware is also an easier skill set to find than familiarity with “big-iron” distributed systems.
When considering the size of hardware to use for a particular solution, it’s important to consider how the application will affect the physical deployment. A relatively stateless application may scale very well across many small servers by leveraging simple load balancing techniques whereas a complex, stateful application may benefit from larger servers. Physical deployment options may, in turn, affect how an application is designed. In the case of the complex, stateful application, the effect of the application’s requirements on the physical topology may reveal that it is better to maintain the state of the customer’s session in a large distributed caching solution. Again, this highlights the benefits of a holistic approach to thinking about resiliency while designing the solution. The concept of leveraging a distributed cache will be addressed in a future post and may further influence decisions about server hardware.
Commodity hardware isn’t appropriate for every scenario, so identifying the ideal hardware needs to be an explicit consideration during the design process. There may be cases where commodity servers can be used for tasks that horizontally scale nearly linearly (like presentation, application and middleware servers) but larger hardware is needed for database servers or computationally intensive analytics that cannot be parallelized.
Workload Management
The design and implementation of workload management features greatly influences the resiliency of a solution. There are several mechanisms that can be used to manage workload: load balancers are one of the most common, but configuring application, database and messaging components to naturally distribute load is equally important. Under normal operating conditions, these mechanisms and techniques provide the solution with a way to manage the load handled by a particular component, usually by attempting to distribute the load equally or to the component which can satisfy a request the fastest. In failure modes, however, workload management solutions are the first line of defense in minimizing the extent of the failure.
The combination of DNS load balancers (“global load balancers”) and local load balancers (that distribute load within a LAN environment) is fundamental to the multi-facility implementations discussed in my previous post. The capabilities that many of these products provide to detect failure are also critical to resiliency. For example, most load balancers have a variety of mechanisms (“health checks”) to determine the health of the resources they are load balancing across. These health checks vary in sophistication: at their most basic, a load balancer may ping a device while more advanced implementations offer sophisticated scripting capabilities that can be used to look for specific content in the device’s response to adjust the load-balancer’s behavior.
As a guideline, using basic ICMP or ping health checks is not sufficient. Many failure modes exist where a device’s TCP/IP stack is fully functional but the software that uses that stack is failing. As a result, every load balancer configuration should be tailored to obtain the most accurate information possible about the state of the components for which it is managing load. In many cases, this requires the creation of custom code within the application to provide the load balancer information about the application’s health. Alternatively, custom health check code in the application tier can be used to indirectly monitor the health of other components like the database, messaging server, cache, and so on.
For example, consider “application A” that relies on the availability of two databases and a JMS connection to “application B” to operate correctly. The developer of application A creates a health check that performs a simple (very fast!) query against each of the databases and produces a test JMS message to application B. The developer of application B creates a similar health check capability that consumes the test JMS message, checks its own internal resources, and responds. This health check is used by the load balancer to periodically confirm that application A is operating correctly. The way the health check is created and exposed depends on the system: a J2EE application server may create a simple JSP page, a .NET application may create a simple ASPX page, and a client-server application may create a script that could be invoked over a terminal connection.
The utility of this health check cannot be overstated. The load balancers that use these health checks are now far more informed about the internal health of the application than a simple ICMP check would provide. The health check can also provide status to monitoring tools and generate alerts to support teams when a failure occurs. Real-time management and reporting dashboards can be created to show the operating condition of the solution. The response time of the health check is often another indicator of health. Even if all of the checks in the example above are working, there is a significant difference between the health check responding in 0.1 seconds and 1.1 seconds. This type of response time trending can act as an early warning system to alert support staff to an impending failure or be used to automatically route traffic away from a particular device.
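A minimal Python sketch of such a health check follows. It is language-neutral in spirit (the JSP or ASPX page mentioned above would do the same work) and every name in it is hypothetical: each dependency check is passed in as a callable, and the one-second "slow" threshold is an assumption, not a recommendation.

```python
import time


def run_health_check(checks, slow_after_seconds=1.0, clock=time.monotonic):
    """Run every dependency check and summarize the result.

    `checks` maps a dependency name to a zero-argument callable that returns
    True when the dependency is healthy. A check that raises counts as failed.
    """
    started = clock()
    failed = []
    for name, check in checks.items():
        try:
            healthy = bool(check())
        except Exception:
            healthy = False
        if not healthy:
            failed.append(name)
    elapsed = clock() - started

    if failed:
        status = "FAIL"
    elif elapsed > slow_after_seconds:
        status = "SLOW"   # all checks pass, but the response time is a warning sign
    else:
        status = "OK"
    return {"status": status, "failed": failed, "elapsed_seconds": elapsed}
```

For application A, `checks` would hold one entry per database query and one for the test JMS message. A load balancer could treat anything other than `OK` as a reason to reduce or stop traffic, while a monitoring tool could trend `elapsed_seconds` over time.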
These types of health checks are also easy to extend to provide additional operational controls to support teams. For example, it is trivial to add logic to the health check above to look at a file on the local file system of the application to determine if support teams may want to stop traffic from being routed to a server even though the application is healthy. This type of functionality can be used to route traffic away from servers for maintenance purposes. It could also be used to cause a global load balancer to stop distributing the IP address of a group of locally load balanced servers but permit customers who already have the address cached to complete their work. The flexibility provided by this feature is extremely valuable and should be part of every solution – there is virtually no reason not to do it.
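The file-based control described above might look like the following sketch. The flag file path is hypothetical; the only assumption is that support teams can create and delete a file on the server's local file system.

```python
from pathlib import Path


def should_receive_traffic(app_is_healthy, maintenance_flag):
    """Report whether the load balancer should keep routing to this server.

    `maintenance_flag` is a hypothetical file path; support teams create the
    file to drain the server and delete it to put the server back in rotation.
    """
    if Path(maintenance_flag).exists():
        return False          # drained on purpose, even if the app is healthy
    return app_is_healthy
```

Because the decision lives in the health check rather than in the load balancer configuration, taking a server out of rotation needs no change to the load balancer itself.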
As important as load balancers and health checks are to workload management, they are just one factor in the resiliency of an application. Understanding (and, when possible, influencing) the application’s requirements with respect to persisting connections is also very important. Consider a web application that manages session state only on local servers. Session state is not replicated across every application server running the application. Let us assume for the moment that the session management configuration is deliberate because of performance concerns or because the technology used doesn’t support replication and cannot be changed. How might this influence our workload management approach?
To solve the problem, we need an approach to keep a session in one data center from start to finish. Since the behavior exhibited by ISP proxies and DNS caching makes this difficult in our current configuration, we need to eliminate the possibility of a session jumping from one data center to the other. One possible solution is to create “data center specific” domain names. Let us assume that the customers come to the domain www.highlyresilientsite.com. The application could generate links from the landing page of that domain that direct the customer’s browser to www1.highlyresilientsite.com or www2.highlyresilientsite.com which correspond to data center A and B respectively. Even if the customer’s ISP suddenly load balances the customer to another proxy server, the session will remain in the same data center. This affects our DNS configuration but it also places a new requirement on the application: the ability to “know” which data center domain names are valid and should be served. (This will be a recurring theme: resiliency requires a holistic approach to design!)
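One way the application could "know" which data center names are valid is sketched below in Python. The hostnames come from the example above; the session identifier, the use of HTTPS, and the way the set of valid data centers is supplied are all assumptions made for illustration.

```python
import zlib

# Hostnames from the example above; which data centers are currently valid is
# assumed to come from configuration or monitoring.
DATA_CENTER_HOSTS = {
    "A": "www1.highlyresilientsite.com",
    "B": "www2.highlyresilientsite.com",
}


def pinned_host(session_id, valid_data_centers):
    """Pick one data center host for a new session, deterministically."""
    candidates = sorted(dc for dc in valid_data_centers if dc in DATA_CENTER_HOSTS)
    if not candidates:
        raise RuntimeError("no data center is available to serve traffic")
    # crc32 is stable across processes, unlike the built-in hash() for strings.
    index = zlib.crc32(session_id.encode("utf-8")) % len(candidates)
    return DATA_CENTER_HOSTS[candidates[index]]


def landing_page_link(session_id, path, valid_data_centers):
    """Build a link from the landing page to the pinned data center host."""
    return f"https://{pinned_host(session_id, valid_data_centers)}{path}"
```

When only one data center is valid, every new session is sent there; sessions that already hold a link to the other data center are unaffected by this function, which matches the intent of keeping a session in one place from start to finish.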
Another aspect of workload management is the design of messaging. Messaging infrastructure generally needs to work in collaboration with the application workload management approach. For example, suppose the application tier of a solution is stateful and requires session persistence. If the application uses a messaging service (for example, IBM WebSphere MQ or Microsoft MSMQ) to communicate with another component that is also stateful, this implies a very different design than communication that is completely stateless. Part of the design process of a solution that involves messaging must be considering the nature of the messages being sent by the application and how the messaging service can be configured to provide resiliency without violating assumptions about the state of the component receiving the messages.
The configuration of databases also influences workload management. In some cases, managing workload on a database can be done by using replication or redundancy features to manage load and insulate components from affecting one another. A common approach to this is replicating data from a production database to a reporting database. Another pattern for managing load is to create multiple database listeners used for different segments of customers or purposes. Similar to our example of the insurance company using distinct groups of hardware for different types of customers, controlling access to the database through multiple listeners provides flexibility in how the database is used. For example, assume a database used to support the insurance application had three listeners: one for use by a segment of application servers servicing individual customers, one for use by a separate segment of application servers servicing institutional customers, and a third for use when real-time reporting is needed. If the database experienced performance problems, administrators could simply disable the listener for real-time reporting to attempt to relieve some load and maintain availability for customers.
While the effect the network has on solutions is usually understood in general terms, specific details are often unavailable. One reason is the difficulty of replicating production network conditions in test environments. Subtle changes in network performance that are difficult to detect can have significant effects on the performance and stability of systems. Likewise, seemingly insignificant changes in the operating environment of a solution (such as changes in customer behavior or scheduled operations like backups) can have drastic effects on the network. For the same reason a sub-optimal route affected the performance of the system shown in the example in part 1, a change in network performance that introduced the same degree of latency would have the same effect.
Understand the Effect of the Network on Solution Resiliency
Frequently problems arise in new solutions because production networks are inherently more complex than the network used for testing: more devices, more variability in usage and performance, and often more distributed. One of the most important – and difficult – characteristics to understand is the effect of latency on a solution. The easy solution is also the most expensive because it requires building out an exact replica of the production network. Since this is often prohibitively expensive, creative solutions may be used to simulate production.
The most difficult network configuration to simulate is geographical distribution. If a solution on the production network will be distributed between data centers in Colorado and Massachusetts but you have only one test environment in Kansas, your test results will not accurately reflect production. One way to approximate the effect of this distribution is to mirror the production VLAN configuration. Assuming you have an application server VLAN in Colorado and another in Massachusetts, one option is to create two VLANs in Kansas and divide application servers between them. To accurately simulate performance, measure the production latency during peak network utilization then simulate that latency between the two VLANs. (Open source tools exist to do this at the network interface level and network appliances and traffic shapers can perform the same function at the switch level.) Database connections – including those that support replication – are particularly sensitive to this kind of latency, so if only one component can be tested in this manner, the database is usually the best bet.
Another culprit in network resiliency failures is the unexpected impact of firewalls. These effects usually fall into one of the following categories:
- Unexpected latency introduced by firewall interfaces (a variation of the theme above)
- Inconsistent rule configuration causing failures (usually closely correlated with poorly tested or managed infrastructure changes)
- Unusual interactions between the firewall, network and application
While the first two issues are relatively straightforward, “unusual interactions” are (as you might imagine!) hard to predict. A real-world example is in order to illustrate how these unusual interactions may manifest themselves. If an application makes a network connection that traverses a firewall but is not very “chatty”, the firewall may time out the connection even though the application believes the connection is still alive. This may lead to intermittent problems that are difficult to diagnose: database connections that fail unexpectedly, network mounted file systems that suddenly become unavailable, and so on. Frequent activity on the system masks these types of failures, so in many cases this type of problem will crop up during low periods of utilization. It is difficult to predict exactly where and when this type of problem will occur, so a good practice when working on the physical deployment of a solution is to analyze all of the network devices that critical connections traverse and ask what may lead to a connection being unexpectedly terminated. Ways to approach this problem will be discussed in a future blog post.
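One commonly used mitigation for idle connections, offered here only as an illustration and not necessarily the approach a later post would take, is to enable TCP keepalive so that a quiet connection still produces occasional traffic. The Python sketch below uses the standard `socket` module; the timing values are placeholders, and the fine-grained options are applied only where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle_seconds=300, interval_seconds=60, probes=5):
    """Ask the OS to send TCP keepalive probes on an otherwise idle connection.

    The timing values are placeholders; they should be shorter than the idle
    timeout of every firewall the connection traverses.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained options are platform specific, so set only those that exist.
    for option, value in (
        ("TCP_KEEPIDLE", idle_seconds),
        ("TCP_KEEPINTVL", interval_seconds),
        ("TCP_KEEPCNT", probes),
    ):
        if hasattr(socket, option):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, option), value)
    return sock
```

Many database drivers and connection pools offer an equivalent setting or a periodic validation query, which serves the same purpose at the application level.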
One way to prevent these problems from reaching production is to test with the same firewall configuration as production. Software-based firewalls are usually inexpensive and easy to simulate in test environments, but network appliance (dedicated hardware-based) firewalls can be expensive to purchase and maintain in a test environment. Having a test configuration that mirrors production is valuable because it also enables infrastructure teams to test firewall rules before they are implemented in production. While substituting a network appliance firewall with a software firewall in a test environment is better than nothing at all, different firewall products behave differently and such a configuration doesn’t enable true testing of firewall rules.
Network Performance Is Not Static
Even the most comprehensive analysis of the effect of the network on a solution can be undone by the dynamic nature of production networks. User volumes grow, backup schedules change, new devices are added to the network and suddenly a database transaction that took 100ms when the solution went live takes 300ms only a few months later. While 200ms may not sound like much, a three-fold increase in response time at the database layer can ripple through the system: a thread in your application spins in a wait state longer, so resources on the server are consumed for longer periods of time, leading to resource starvation and timeouts that cause failures.
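The ripple effect can be estimated with simple arithmetic. By Little's law, the average number of requests in flight equals the arrival rate multiplied by the time each request takes. The Python sketch below applies that to the 100ms and 300ms figures above; the request rate and thread pool size are assumptions chosen only to make the numbers concrete.

```python
def busy_threads(requests_per_second, latency_ms):
    """Average number of threads tied up at once (Little's law: L = rate * time)."""
    return requests_per_second * latency_ms / 1000


def pool_is_exhausted(requests_per_second, latency_ms, pool_size):
    """True when the average demand for threads exceeds the pool size."""
    return busy_threads(requests_per_second, latency_ms) > pool_size
```

At an assumed 50 requests per second, a 100ms transaction keeps an average of 5 threads busy, while a 300ms transaction keeps 15 busy. A pool of 10 threads that was comfortable at go-live is therefore oversubscribed a few months later, even though the request volume has not changed.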
Early warning of changing network conditions is a necessity. As we will see in future posts, understanding changes in the environment is an element of operational resiliency. In the spirit of an integrated approach to resiliency, it is not enough to assume that a network operations team knows what a particular solution requires to be resilient. This is the benefit of a holistic approach to resiliency: through the analysis and testing described above and later in this book, the resiliency practitioner can determine what parameters influence the resiliency of a solution and what ranges of performance are acceptable. This information will enable operations teams and monitoring applications to know when (and, in many cases, before) a system is approaching a dangerous threshold.
Service oriented architecture (SOA) is going to save us all. We all know the drill: Faster time-to-market. Lower total cost of ownership. Loose coupling makes integrating or re-wiring existing capabilities faster and adding new features easier.
Except that in most large companies, just doing SOA doesn’t do any of these things. In reality, sub-optimal time-to-market and high operational costs are caused by many factors; older approaches to system integration are just one of them. While an SOA approach certainly helps, it isn’t a silver bullet. It’s easy for technologists to get caught up in the promise of SOA as a solution to common IT challenges. Even active SOA practitioners in growing companies believe – with good reason – that their existing SOA approach will scale with growth. Most large companies have one or more of the following constraining factors that will limit the success of narrowly focused SOA initiatives:
- A lack of maturity with governing shared technology services.
- Project management methodologies and software development processes that are based on waterfall approaches to building software and usually highly dependent on integrated testing.
- Legacy systems that are difficult to service-enable simultaneously.
- Infrastructure delivery constraints that make adding or changing hardware time-consuming.
- Testing methodologies that do not take advantage of a highly service-oriented environment.
- Technology solutions that require a mixture of technology platforms with varying capabilities for development methodologies (e.g. waterfall vs. agile) supporting many different projects within the organization.
It is very difficult for SOA to deliver on its promises when one or more of these conditions exist within an organization. Furthermore, changing these conditions while also trying to become more service oriented can add time and cost to the effort. If the stakeholders and sponsors of an SOA initiative have unrealistic expectations about the process, these added wrinkles can result in an otherwise beneficial project being canceled.
While most companies have already jumped on some form of the SOA bandwagon, many are not realizing the benefits because these fundamental challenges have not been addressed or even considered. What most companies really want from SOA is a flexible IT environment. To achieve this flexibility, more than SOA is needed.
I’ll explore each of these in more detail over a series of blog posts, beginning today with governance.
Lack of Maturity in Governing Shared Technology Services
Consider the problem that SOA intends to solve: proliferation of varying technologies implementing different standards that make integrating and changing these “solutions” slow and difficult. The intent of SOA is to decouple service consumers from service providers, abstract knowledge of the implementation of the service from the service’s consumers, and provide common methods for interacting between consumers and providers. In doing so, SOA must be (or become) a shared service to the entire organization, not something individual SOA practitioners create in silos.
Unfortunately, implementing SOA without an SOA governance program just recreates the original problem. A practical, pragmatic approach to governance solves problems like:
- Setting a direction for how services are implemented. Should services be built using SOAP or RESTful implementations? If SOAP is used, what WS-* standards are all service providers and consumers expected to support? Can service providers assume all consumers will implement WS-Security? For either SOAP or RESTful implementations, even basic questions like the use of HTTP vs. HTTPS can create road-blocks to adoption.
- Standardizing schemas, data and message formats. The whole point of SOA is to enable a variety of service providers to expose their operations in a common way. While SOAP/XML and REST provide a mechanism to do this, message format differences can be a significant barrier to reuse. Consider the following:
  - Service Provider A implements a complex type called “Address” that has one address line, a city, state and five-digit zipcode
  - Service Provider B implements the same complex type using two address lines, a city, state/province, ten-byte alphanumeric postal code, and country code
  - Service Provider C provides a service that relies on both service providers A and B to operate on data including addresses

  Integrating these “service oriented” applications is no easier than if we were trying to do an EDI integration or a CORBA integration or a COM+ integration or a COBOL integration; in fact, it may even be worse because the business partners may believe that SOA should’ve fixed this problem.
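The address mismatch can be made concrete with a small Python sketch. The field names, the `b_to_a` function, and the rule that only "US" addresses fit Provider A's schema are assumptions for illustration; the shapes of the two types follow the descriptions of Service Providers A and B.

```python
from dataclasses import dataclass


@dataclass
class AddressA:          # Provider A: one line, five-digit ZIP
    line1: str
    city: str
    state: str
    zipcode: str


@dataclass
class AddressB:          # Provider B: two lines, postal code, country
    line1: str
    line2: str
    city: str
    state_or_province: str
    postal_code: str
    country_code: str


def b_to_a(address):
    """Convert B's address to A's, or raise if A's schema cannot hold it."""
    if address.country_code != "US":
        raise ValueError("Provider A has no country field")
    zipcode = address.postal_code[:5]
    if not (len(zipcode) == 5 and zipcode.isdigit()):
        raise ValueError("postal code is not a five-digit ZIP")
    # The second address line has nowhere to go, so it is squeezed into line1.
    line1 = " ".join(part for part in (address.line1, address.line2) if part)
    return AddressA(line1, address.city, address.state_or_province, zipcode)
```

Service Provider C would need this kind of lossy, error-prone mapping for every call that touches both providers, which is exactly the integration cost a shared schema is meant to remove.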
- Ensuring consistent service/operation packaging and granularity. Consider an example similar to the one above, except now service providers vary widely in the granularity of the services or operations they provide. One provider always operates on a “customer” entity (encompassing all attributes of the customer) while another always operates on individual customer entity attributes. Again, the reuse of SOA services is undone because the services are incompatible in implementation.
- Preventing duplicate or overlapping services from being created. Yet another variation on the above theme: two service providers create very similar services. The differences between the two may be something as simple as performing an update to some customer data, but the service providers have one or two fields that differ between the interfaces. Why maintain two heavily overlapping services? In most cases, one service should exist and simply be enhanced to add the missing fields.
SOA governance cannot and should not be draconian. The objective of governance is not to implement rules from an ivory tower or deliberate on academic issues of building systems using SOA techniques. Good SOA governance can be measured by how well it promotes service reuse, how much (or little) service duplication exists, and how time-to-market for projects using shared services trends over time (SOA is a long-term investment). Practical SOA governance comes in the form of:
- Reasonable, attainable standards with respect to how services are implemented:
- SOAP vs. REST
- Practical use of WS-* standards where they add value
- Common and consistent schemas
- A system of incentives for compliance with governance policies
- A system of record (like a UDDI registry/repository) for recording which services exist and who consumes them
- Agreed upon metrics and goals that quantify the quality of the governance system and the compliance of services
- Sponsorship from technology and business leadership that governance is important
I’ll continue to explore the other constraining factors in future blog posts and how to address them. While the challenges are significant, successfully meeting those challenges enables organizations to realize the full potential of SOA.
The network is a critical resource to nearly every enterprise IT solution. While many network resiliency considerations are inherently part of modern networking, some of these features actually undermine resiliency. For example:
- the availability of multiple routes can sometimes introduce unpredictable behavior in applications if a preferred route is unavailable and alternate routes have higher latency;
- switch ports and server network interfaces that have worked correctly in “auto-negotiate” mode suddenly stop working;
- unintentional asymmetric routing results in packets trying to return to the customer through a firewall that never saw the incoming request, causing the packets to be dropped
In this section, we’ll examine how to ensure that the network layer of new solutions is built for resiliency and avoid these types of problems.
Consider the Resiliency and Effects of Failure of “Core” Network Services
Some network services fail so infrequently (and are so catastrophic when they do fail) that they are rarely considered explicitly as a failure mode. DNS is one such service: most architects, software developers and server administrators are so conditioned to DNS working that they never consider what happens when it fails. In many cases, DNS is such a foundational service that a failure of the service is fatal to the solution. Insulating against DNS failures requires a two-pronged approach.
First, it’s important to understand the resiliency of the DNS service. How much redundancy exists in the DNS infrastructure? How frequently has it failed in the past? Is the scope of a DNS failure limited in some way such as location on the network or zone? In a future post, I’ll discuss some tools that can be used to help assess the resiliency needs of a solution and assess failure modes. If the network team supporting DNS cannot commit to the level of availability required for the solution, or if previously observed failure rates exceed the tolerance of the solution, the DNS service needs to be improved.
Second, the solution must be analyzed to determine which services, components and transactions will fail if DNS is not available. This can be a difficult task because DNS is generally assumed to “just work”, so identifying every place that relies on DNS can be time consuming. One approach to quickly identify DNS dependencies is to intentionally misconfigure the DNS settings on a machine-by-machine or component-by-component basis in a test environment. Since DNS failures usually cause such widespread failures, performing this kind of “negative test” in a very controlled fashion on small pieces of the solution at a time is usually much more manageable. It’s also important to remember that DNS dependencies are not just introduced in application code; oftentimes web, application and database server configurations rely on DNS for their internal operation.
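A static scan of configuration can complement the negative test by listing which configured endpoints use a host name rather than an IP literal. This is a minimal sketch and not a replacement for the test itself. The connection names and endpoints are hypothetical, and a real inventory would have to be pulled from your own server and application configuration.

```python
import ipaddress
from urllib.parse import urlsplit


def needs_dns(endpoint: str) -> bool:
    """Return True if the endpoint's host is a name rather than an IP literal."""
    # Accept both bare "host:port" strings and full URLs.
    parts = urlsplit(endpoint if "://" in endpoint else "//" + endpoint)
    host = parts.hostname
    if not host:
        raise ValueError(f"no host found in {endpoint!r}")
    try:
        ipaddress.ip_address(host)
        return False          # IPv4 or IPv6 literal: no lookup required
    except ValueError:
        return True           # anything else has to be resolved


# Hypothetical endpoints gathered from configuration files.
configured_endpoints = {
    "web -> app": "http://app01.internal.example:8080/health",
    "app -> database": "10.0.4.17:5432",
    "app -> message broker": "mq.internal.example:5672",
}

dns_dependencies = sorted(
    name for name, endpoint in configured_endpoints.items() if needs_dns(endpoint)
)
```

In this made-up inventory the web-to-app and app-to-broker connections depend on DNS, which makes them the first candidates for a controlled negative test.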
Ultimately, it may be infeasible or impractical to eliminate dependencies on DNS. However, understanding how DNS-related failure modes manifest themselves is useful for quickly identifying a DNS problem when it occurs. There is a saying in the medical field that is apropos: “When you hear hoofbeats, think horses, not zebras.” This is how most technicians successfully troubleshoot failures. A DNS failure is a zebra, and on the rare instances DNS failures occur, it can take hours to identify the root cause of the problem if operations teams aren’t familiar with the symptoms.
Ironically, the resiliency provided by DNS can present problems in a way unrelated to the availability of DNS itself. Many applications will cache DNS responses or ignore DNS time-to-live (TTL) settings. Cached DNS responses can cause significant problems when DNS load-balancing solutions are used to respond to failures in the environment, because clearing the cache sometimes requires restarting a process, often resulting in even more failures. Applications that fail to honor DNS TTL experience a similar failure and are sometimes outside of a technologist’s control. One example of this is ISPs that have proxy servers configured to override DNS TTL. Assume that a global DNS load balancer is load-balancing mydomain.com between two IP addresses with a TTL of zero. It is tempting to assume that removing one IP address from being returned will immediately stop new requests from hitting that address, but requests may continue for quite some time.
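The effect can be illustrated with a toy resolver cache. This is a simulation under stated assumptions, not how any particular ISP proxy or runtime is implemented. The `min_ttl` parameter, the class name and the addresses (taken from the documentation range) are invented for the example.

```python
from typing import Callable


class CachingResolver:
    """Caches lookups; min_ttl models a client or proxy that overrides the record's TTL."""

    def __init__(self, lookup: Callable[[str], tuple[str, float]],
                 clock: Callable[[], float], min_ttl: float = 0.0) -> None:
        self._lookup = lookup      # returns (ip_address, ttl_seconds)
        self._clock = clock
        self._min_ttl = min_ttl
        self._cache: dict[str, tuple[str, float]] = {}  # name -> (ip, expires_at)

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._cache.get(name)
        if cached and now < cached[1]:
            return cached[0]       # still considered fresh by this client
        ip, ttl = self._lookup(name)
        self._cache[name] = (ip, now + max(ttl, self._min_ttl))
        return ip


def address_seen_after_failover(min_ttl: float, seconds_later: float) -> str:
    """Resolve once, withdraw the first address, then resolve again later."""
    state = {"now": 0.0, "ip": "192.0.2.10"}
    resolver = CachingResolver(
        lookup=lambda name: (state["ip"], 0.0),   # authoritative TTL of zero
        clock=lambda: state["now"],
        min_ttl=min_ttl,
    )
    resolver.resolve("mydomain.com")      # client caches the first answer
    state["ip"] = "192.0.2.20"            # load balancer withdraws the failed address
    state["now"] += seconds_later
    return resolver.resolve("mydomain.com")
```

A client that honors the zero TTL sees the new address on its next lookup. A client that enforces a five-minute floor (`min_ttl=300`) is still sending traffic to the withdrawn address thirty seconds after the change.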
Routing protocols present another unique challenge to resiliency. In most cases, routing protocols ensure that a route is available between the source and destination of a connection. This does not guarantee that the available route should actually be used, however. For example, consider the scenario shown below, a simplified web and application server configuration hosted in two data centers. For the sake of clarity, global and local load balancers have been omitted from the diagram, as have redundant web and application servers in each data center.
 Network routing using preferred (solid lines) and sub-optimal (dashed lines) alternate route
Assume that customer requests are routed to the web server in the internet facing DMZ and that the web server acts as a proxy for the application servers on the internal network. If the customer’s request enters data center A, the preferred route from web server 1 to application server 1 is shown in solid black lines. Let us also assume that the alternate route shown in the dashed line is available, but is suboptimal – it involves several more network hops and additional latency.
As long as the preferred route is available this network topology works as expected. Alternatively, consider a failure mode where the connection between the internet DMZ and internal network in data center A fails (for example, an incorrect firewall rule is created). The alternate route between web server 1 and application server 1 becomes active, routing the request through data center 2 and adding significant latency to the request. Because requests are taking longer, active (open) connections to the web server accumulate, which causes performance problems and possibly some connections to be denied as the web server runs out of resources. This failure mode results in many customers having a degraded experience due to the latency between the web and application server and eventually will cause failures.
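The accumulation of open connections follows Little’s Law: the average number of requests in flight equals the arrival rate multiplied by the time each request takes. The numbers below are purely hypothetical (200 requests per second, 50 ms versus 2 s response times, and a 256-connection limit), but they show how quickly added latency exhausts a web server.

```python
def open_connections(requests_per_second: float, response_time_seconds: float) -> float:
    """Little's Law: average concurrency = arrival rate * time in system."""
    return requests_per_second * response_time_seconds


MAX_CONNECTIONS = 256                      # assumed web server connection limit

normal = open_connections(200, 0.05)       # preferred route
degraded = open_connections(200, 2.0)      # alternate route through the second data center

overloaded = degraded > MAX_CONNECTIONS
```

With the same traffic, roughly ten open connections under normal conditions become about four hundred on the slow route, well past the assumed limit, even though no component has actually “failed”.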
If the suboptimal alternate route had not been available, the connection between the web and application server would have been broken immediately. As we will see in a future post, workload management tools like load balancers combined with intelligent “health checks” in the application can quickly detect this condition and stop traffic from being routed to the failing components. In the scenario where web server 1 cannot connect to application server 1, “fast failure” is preferred to the prolonged degraded connectivity condition that exists when traffic is routed through data center 2.
This is another example of simple being better. Intelligent routing adds complexity without reducing the risk of failure and probably increases the amount of time needed to troubleshoot the problem. The degraded connectivity condition seen above is an example of a problem we will see frequently that I call “sick but not dead”. This class of failure is always problematic – it can trick monitoring that should detect failures into believing the system is healthy and it makes identifying the root cause of a failure much more difficult.
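One way to keep “sick but not dead” components from fooling monitoring is to have health checks judge latency as well as reachability. The sketch below assumes you already collect recent probe results. The 250 ms threshold and the three status names are illustrative choices, not values from any particular load balancer.

```python
from statistics import median


def classify(latencies_ms: list[float], errors: int,
             max_median_ms: float = 250.0) -> str:
    """Classify a component from recent probe results.

    latencies_ms holds response times of successful probes;
    errors counts probes that failed outright.
    """
    if not latencies_ms:
        return "dead"            # nothing answered at all
    if errors > 0 or median(latencies_ms) > max_median_ms:
        return "sick"            # answering, but not well enough to take traffic
    return "healthy"


def in_rotation(status: str) -> bool:
    """Fast failure: only fully healthy components receive traffic."""
    return status == "healthy"


fast_path = classify([40.0, 45.0, 50.0], errors=0)
slow_path = classify([900.0, 1200.0, 1100.0], errors=0)
no_path = classify([], errors=3)
```

The slow path answers every probe, so a simple up/down check would report it as healthy. Treating it the same as a dead component converts the degraded condition into a fast failure.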
Network redundancy solutions can also cause problems at the server and network interface. Different operating system vendors use different terms, but most operating systems allow for the pairing of network interfaces for redundancy. (In Solaris this is called IPMP; in Linux, “teaming”; on AIX, “Etherchannel” or “Network Interface Backup”.) All of these solutions are valuable to improve resiliency as long as care is taken to ensure that the paired network interfaces are cabled to physically distinct switches. All too often, redundant, paired network interfaces are cabled to the same switch even though redundant switches are available with a shared VLAN. Similar to the testing of redundant PDUs by unplugging server power supplies, redundant network interfaces should be tested by unplugging network cables. Again, organizing such a test is difficult after a server is in production, so always plan for failure testing before a solution goes live.
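Before pulling cables, a cabling inventory can be checked for paired interfaces that land on the same switch. This is a sketch over a hypothetical inventory format. The server, interface and switch names are invented, and a real check would read from whatever your network team uses to track port assignments.

```python
from collections import defaultdict

# Hypothetical cabling inventory: (server, paired interface group, NIC, switch).
cabling = [
    ("web01", "bond0", "eth0", "switch-a"),
    ("web01", "bond0", "eth1", "switch-b"),
    ("app01", "bond0", "eth0", "switch-a"),
    ("app01", "bond0", "eth1", "switch-a"),   # both legs on the same switch
]


def shared_switch_pairs(records):
    """Return (server, group) pairs whose member NICs all land on one switch."""
    switches = defaultdict(set)
    for server, group, _nic, switch in records:
        switches[(server, group)].add(switch)
    return sorted(key for key, seen in switches.items() if len(seen) < 2)


at_risk = shared_switch_pairs(cabling)
```

In this example only `app01` is flagged. A clean report does not replace the unplug test, since the inventory itself may be wrong.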
DNS, routing protocols and redundant network interfaces are instructive examples of why it is important to consider core network services that may otherwise go unnoticed. When building a new solution, it is very important to examine the configuration of the network itself and question whether network resiliency features will improve the solution. Wherever possible, simplify the network configuration and ensure that sick but not dead scenarios are avoided in favor of fast failure.
Joel Spolsky recently wrote an article about “The Duct Tape Programmer” in which he espouses the benefits of a pragmatic approach to creating (and thus shipping) software:
Duct tape programmers are pragmatic. Zawinski popularized Richard Gabriel’s precept of Worse is Better. A 50%-good solution that people actually have solves more problems and survives longer than a 99% solution that nobody has because it’s in your lab where you’re endlessly polishing the damn thing. Shipping is a feature. A really important feature. Your product must have it.
I think this approach makes sense to a degree, though there are certainly some good counter-arguments that have been made. What interested me about his post, though, was how approaches to “shipping software” can sometimes differ in large enterprises when compared to shipping commercial software to end-users or to producing applications in small- or medium-sized businesses.
This isn’t to say that creating a good product – for any audience, in any setting – is easy. It’s not. The types of challenges, though, are different. This is why Joel’s comments help highlight the differences between developing “enterprise applications” – by which I mean applications, increasingly in the form of rich internet applications, that are consumed by millions of customers – and more traditional COTS applications or smaller internet-based offerings. Ability to manage scale is a feature just like shipping software.
The reason for the difference is that the scale (in terms of users) amplifies all aspects of an application: obviously its ability to handle increasing volume, but also bad (and good) design decisions, the ability to react to new requirements/features, and the quality of all those “pragmatic” decisions. The latitude to make misjudgments when trying to be pragmatic with a web-based application supporting a million concurrent users is much more constrained than with one that supports three hundred concurrent users.
The challenge for the managers, developers and architects of enterprise-scale applications is how to avoid having to make these kinds of decisions in the first place. This is one area where Joel absolutely nails it: simplicity is key. The number one way to avoid having to make difficult, pragmatic trade-offs is to keep the solution simple. In a series of upcoming posts on resiliency, I’ll explore this theme in the context of infrastructure, software development and integration services.
But what if you already have a complex solution? What if an aspect of your solution is complex by necessity? How do you know which trade-offs are “safe” and which will cause failures, customer frustration, or slow time-to-market for future features? Experience counts for a lot, but is it good business to make these decisions on instinct? No – but lots of companies do it.
While there are very experienced, talented folks who can make these decisions by gut feel, they’re few and far between. It’s not a repeatable process. It can’t be explained to shareholders. It can’t be quantified. Analysis and data are required to make informed trade-offs rather than instinctual gambles on what will work. This is why an integrated approach to solution architecture, software design and development, infrastructure support, people management and process control is required. These decisions get made based on data from detailed failure-mode analysis. They’re supported by data collected from the operating environment about user behavior. They’re mitigated through tightly controlled processes and a quality of communication that is difficult to achieve in your typical Fortune 100 company.
Ideally, the people making trade-offs about features, functionality, and complexity know the following for each and every transaction or feature available to a user:
- The business “value” (revenue generated, costs avoided, etc.) of each transaction/feature
- The frequency of use of each transaction/feature
- The likelihood of a particular failure mode occurring, which transactions/features are affected, how it will be detected, and how long it takes to fix
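These three inputs can be combined into a rough expected-loss figure for ranking failure modes. This is a back-of-the-envelope sketch with made-up features and numbers. It also assumes the affected feature is completely unavailable from the moment the failure begins until it is fixed, which will overstate the loss for partial failures.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    feature: str
    value_per_use: float        # business value of one successful use
    uses_per_hour: float        # frequency of use
    failures_per_year: float    # likelihood of this failure mode
    hours_to_detect: float
    hours_to_fix: float

    def expected_annual_loss(self) -> float:
        """Value lost per year if the feature is down for the whole outage."""
        outage_hours = self.hours_to_detect + self.hours_to_fix
        return (self.failures_per_year * outage_hours
                * self.uses_per_hour * self.value_per_use)


# Hypothetical failure modes for two features.
modes = [
    FailureMode("profile photo upload", 0.5, 50, 12, 1.0, 3.0),
    FailureMode("checkout", 12.0, 500, 2, 0.25, 1.75),
]

ranked = sorted(modes, key=lambda m: m.expected_annual_loss(), reverse=True)
```

Even though the hypothetical upload feature fails six times as often, the checkout failure ranks first (24,000 versus 1,200 per year in these invented units), which is the kind of comparison that replaces a gut-feel trade-off with data.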
While this pinnacle of knowledge cannot always be achieved, it can be approximated more easily than many people believe. It requires changing how design is approached, buy-in from business partners, and the ability to spend time during the design process to perform the necessary analysis. It’s not easy, but neither is competing at enterprise-scale.
This is why simplicity is important: the less complex a solution is, the easier it is to gain this insight. Spending time on simplicity pays off.
In this series of posts I’ll discuss the foundations of resilient infrastructure, including facilities, network, server hardware, workload management, monitoring, and presentation, application, messaging and database servers as well as why seemingly resilient implementations of these components can fail to provide resilient solutions. Our focus will be on the patterns used when implementing these components rather than vendor-specific details. Each vendor implementation may be a little different, but the methods for creating resiliency when designing and configuring a piece of infrastructure will be very similar. The focus will be on which patterns are valuable and how those patterns relate to the other domains in the book to provide resilient solutions.
One recurring theme throughout is the notion that simple is usually better. A challenge when building and maintaining complex systems is striking the optimal balance between too much complexity (which makes failure modes harder to predict and troubleshoot) and too much simplicity (such that there is very little flexibility to deal with failures or maintain the system). Complexity can creep into our solution in insidious ways: a “high availability” clustering solution that makes troubleshooting intermittent failures difficult; dynamic routing in a network to provide redundancy that introduces variability that causes sporadic timeouts; and database failover approaches that make returning to the pre-failure configuration difficult. These approaches are not always bad, but the complexity introduced by these design decisions must be weighed against the value they provide and alternatives that are less sophisticated but easier to manage.
Facilities
The number of facilities (data centers or hosting locations) that a solution is hosted in is an obvious factor in the resiliency of an application. If a solution is only hosted in a single data center and the entire data center experiences a failure or disaster, the solution clearly fails. Assuming that a solution can horizontally scale across a small number (two to four) facilities, the more relevant question from a resiliency perspective is how the application should be sized to maximize resiliency and minimize waste.
Questions to Consider When Determining Facility Needs for Resiliency
The number of facilities that are used to host a solution affects how systems and components will scale across sites, the amount of infrastructure needed for redundancy and, as we will see in later chapters, the application design. Some critical questions to ask when considering the number of facilities:
- Is there a range of the number of facilities in which this solution can be hosted? For example, is it possible to deploy the solution in two, three or four data centers? Or are two data centers the most that are available, thus constraining your choices?
- If there is a range of the number of facilities that can be used, how many simultaneous facility failures should the solution be able to tolerate? For example, if the solution will be hosted in three data centers, the solution should probably be able to tolerate the loss of at least one data center. Should it be also able to tolerate the loss of two?
- How does the data center topology affect other aspects of the solution? For example, presentation and application servers are typically much easier to scale horizontally than database servers.
- Will all the components scale equally well across data centers, or will some components, like a database server, scale only to two locations?
- How will load-balancing affect the number of facilities? Does load-balancing between facilities exist only at the “front door” (so once traffic comes to a particular data center it will never leave) or could traffic between system components be load-balanced to different data centers? How does this affect the application design?
Having control over the number of facilities allows for much greater control over the total number of components needed to obtain the same level of resiliency. Let’s consider a simplified solution that we want to deploy in multiple data centers. We know that we require four servers to be available to handle peak load on the system and we want to be able to tolerate the failure of one data center without impacting our customers.
Solution requiring four servers to handle peak traffic in a three data center configuration
Solution requiring four servers to handle peak traffic in a two data center configuration
By increasing the number of data centers we’ve reduced the total number of servers needed to provide the same level of resiliency – the ability to tolerate the failure of one data center. This is one example of the need to find the optimal balance for simplicity: two data centers with four servers each results in some server waste, but deploying the same solution in nine data centers with one server each probably introduces a lot of unnecessary overhead and is highly impractical.
The three data center choice offers a good balance in our example scenario. We gain 25% efficiency in our server cost while only marginally increasing the complexity from a data center deployment view. Unfortunately, you may not always have the luxury of choosing how many facilities are available to deploy a solution. Perhaps your company or client only has two data centers, or constraints within the components you’re deploying limit your choices. Whatever the reason, it is still important to consider the questions above.
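The arithmetic behind the comparison is simple enough to sketch. The function name is mine, and it assumes load is spread evenly and that every data center gets the same number of servers.

```python
import math


def servers_needed(peak_servers: int, data_centers: int, tolerated_failures: int = 1):
    """Return (servers per data center, total servers) so that the surviving
    data centers can still carry peak load."""
    surviving = data_centers - tolerated_failures
    if surviving < 1:
        raise ValueError("no data centers would survive")
    per_site = math.ceil(peak_servers / surviving)
    return per_site, per_site * data_centers


two_sites = servers_needed(4, 2)      # (4, 8): four servers in each of two data centers
three_sites = servers_needed(4, 3)    # (2, 6): two servers in each of three data centers

savings = 1 - three_sites[1] / two_sites[1]   # 0.25, the 25% figure above
```

The same function shows the diminishing returns: five data centers need five servers in total, and nine data centers need nine, so past a certain point adding sites only adds overhead.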
If there are multiple types of components being deployed (e.g. web servers, application servers, middleware servers), how does the number of data centers affect their configurations? Load balancing? Database replication? Messaging services? How will maintenance (configuration or code changes, software upgrades, etc.) to the solution be handled in the given topology? How does the number of data centers affect monitoring of the solution? I explore these questions in more detail in a later posting.
It’s also important to consider these questions in the context of third-party vendors on which your solution relies. While vendor contracts often specify penalties for failing to meet their service level agreement, these penalties rarely make up for the true cost of failure. Moreover, a vendor’s configuration may influence your design. If the solution you’re building will be deployed in three data centers at your company but needs to connect to two data centers at an external vendor’s site, how does this affect the connectivity between the sites? Many vendors may be reluctant to share implementation details of their infrastructure after contracts are signed, so it is critically important to negotiate for these details as part of your supplier management process.
It’s also important to consider how the environment within a facility may improve or degrade the resiliency of a solution. It is common for data centers to have multiple power distribution units (PDUs), receive power from multiple power grids, have redundant ISP connections, and so on. However, failures still can and do occur, often because the equipment in the facility is not implemented to take advantage of this redundancy. For example, a rack full of servers with dual power supplies may have each power supply receiving power from the same PDU despite multiple PDUs being available. Likewise, circuits from multiple ISPs may be provisioned into the data center, but all of the servers in a solution that require ISP connectivity may only have access to one of those circuits.
In the next post, I’ll explore how the network may affect resiliency in unexpected ways.
As technology solutions continue to increase in complexity, organizations often respond by creating teams with deep technical expertise to design, build and maintain their technology assets. One side effect of deep technical expertise is narrowing breadth of knowledge. While most IT professionals start their careers with broad technical knowledge (though perhaps not experience), as one’s experience and interest deepens in one particular domain, the breadth of knowledge – by necessity – shrinks. This side effect is rarely seen as negative; in fact, deep technical expertise is often – and rightly – held in high regard.
Unfortunately, lack of breadth presents serious risks to the quality of our solutions. The complexity within technology domains (driving us to create deep technical expertise) also generates complexity in the interfaces between these domains. For the purposes of this discussion, I’m referring to domains in a high level, coarse-grained sense like application development, application servers, network infrastructure, supporting application components (databases, middleware, and the like), storage solutions, and so on. These domains are individually complex, often to the point where there are sub-specialties within them. (Case in point: most large organizations have some network engineers who specialize in load balancing while others are experts in network design/engineering.)
Some will argue that while these individual domains are complex, the internal complexity is abstracted from the interfaces to other components thus hiding the “inner workings”. While this is often a design goal, it is rarely fully realized. It’s easy to believe this well-meaning but dangerous fallacy, especially since so many IT professionals are “classically educated” as software developers. Any college educated CompSci or CIS professional learned the importance of object-orientation, interfaces, abstraction, and so on. Our trust in abstraction is so conditioned that it just feels like it should work in other domains. It seems logical, but it just doesn’t scale to the breadth of systems and degree of complexity outside of a pure software engineering paradigm.
An exploration of the reasons why abstraction doesn’t scale could be an entire series of articles in its own right, but a cursory treatment may help convince skeptics. First, abstraction in an object-oriented software engineering context is completely implemented within a single system (like a programming language). The layer of abstraction and the complexity on either side of that abstraction share common tools, semantics and structure. This commonality reduces the overall complexity of the solution and the difficulty associated with making the abstraction work. Purists will argue that commonality doesn’t matter that much. Evidence that this is not true can be found by comparing the difficulty in integrating two software components both written in Java with reasonable software standards with the difficulty in integrating a software component written in Java with another component written in .NET using web services. The myriad of “standards” for web services highlight the difference between these scenarios.
Second, abstraction within software engineering is rooted in programming languages that exhibit a high degree of precision with respect to their semantics and syntax and, in comparison to broad IT “solutions” are relatively simple. Java, for example, has only 50 keywords and a handful of syntax rules that can be used to implement abstraction. Compare this to the average load-balancing solution, network switching infrastructure, or application server configuration, all of which can be configured in highly variable (and novel) ways. This difference in complexity makes abstraction a much more difficult task. In a software engineering world, it’s the difference between abstraction for a simple framework (something like implementing MVC) and abstraction for an operating system’s threading and memory management libraries.
Third, standards and patterns for abstraction in software engineering are well established and commonly understood. Creating standards and identifying patterns is easier in software because of the previous two points. Standards and patterns for integrating components from different domains (e.g. making load-balancers, web servers, application servers, and database servers work together) do exist and may be commonly understood, but are not so detailed or so precise that they reduce complexity or completely hide the inner workings of each individual component.
If we can agree that abstraction doesn’t really work between domains and that individual domains are so complex that they require deep technical expertise, we must then acknowledge that the integration of these components is a significant concern in its own right. This is what architecture – especially solution architecture – is really about. So called “PowerPoint architecture” or domain specific architecture (provided by software architects, storage solution architects, etc.) is not a substitute for holistic solution architecture that defines how disparate components from different domains will interact. Make no mistake: PowerPoint architecture and domain-specific architecture have their place. Domain-specific architecture must be part of the solution delivery process. Unfortunately, it is too often the focus, usually at the expense of good solution architecture.
Solution architects need to balance breadth and depth of technical knowledge to be effective. This means that not every solution architect need come from a heavy software architecture or software engineering background. Instead, a good solution architect understands a wide range of technology domains and has experience putting them together in a variety of settings.
A classic example of a problem where this kind of solution architecture really makes a difference is in geographically distributed web applications that are transactional and stateful in nature. The developer or software architect will rely on the application server’s services for managing session state and persistence. The infrastructure/hosting teams will rely on load-balancing solutions for “session stickiness” to keep a customer “stuck” to a particular web and application server for the duration of the session. Easy enough, except as soon as you launch the application in production, you start getting reports of customers complaining that they’re losing their sessions, having to “start over” in multi-step transaction flows, or other intermittent, unpredictable behavior. What happened?
Large ISPs like Comcast or AOL have multiple proxy servers from which a customer’s HTTP session may originate. During the customer’s session, the ISP may internally load-balance the customer to a different proxy server, causing the source IP address to change. Your load-balancer session stickiness didn’t account for this, the user got load-balanced to a different web or application server, and the session state couldn’t be rebuilt.
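A toy model makes the failure easy to see. The backend names, the octet-sum “hash” and the proxy addresses below are all invented for illustration, and real load balancers hash quite differently. Cookie-based affinity is included only as one commonly used alternative for comparison, not as a complete fix for the scenarios discussed here.

```python
BACKENDS = ["app-1", "app-2", "app-3"]


def backend_by_source_ip(ip: str) -> str:
    """Toy source-IP affinity: sum of octets modulo pool size."""
    return BACKENDS[sum(int(octet) for octet in ip.split(".")) % len(BACKENDS)]


class CookieAffinity:
    """Pins a session identifier, not an address, to a backend."""

    def __init__(self) -> None:
        self._pinned: dict[str, str] = {}

    def backend_for(self, session_id: str, source_ip: str) -> str:
        # The first request picks a backend; later requests reuse it
        # regardless of which proxy address they arrive from.
        return self._pinned.setdefault(session_id, backend_by_source_ip(source_ip))


# The same customer, moved between two ISP proxies mid-session.
first_proxy, second_proxy = "203.0.113.10", "203.0.113.11"

ip_sticky = (backend_by_source_ip(first_proxy), backend_by_source_ip(second_proxy))

affinity = CookieAffinity()
cookie_sticky = (affinity.backend_for("session-42", first_proxy),
                 affinity.backend_for("session-42", second_proxy))
```

With source-IP affinity the customer moves from `app-3` to `app-1` the moment the ISP switches proxies, and the session state on `app-3` is stranded. The session-keyed table keeps the customer on `app-3` for both requests.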
There are many variations on this theme… Maybe the load-balancer uses SSL ID, but the ISP’s proxy had a different A record cached for your site, so the user ended up in another data center. Perhaps your web servers can’t route traffic to the application server that “knows” the customer’s state. Or you’ve really done your homework and built a global cache to manage state, but the cache didn’t replicate fast enough. The bottom line is that there are a variety of scenarios in which the interaction between the load-balancing infrastructure, application server configuration, and application code determines the actual customer experience.
This is where a good solution architect will save the day. Your deep technical SMEs are still invaluable, but detecting the possibility of scenarios like the one above requires breadth of knowledge and thorough understanding of the characteristics the solution must have rather than any individual component. The need for solution architecture is very real. Doing it well has a tangible effect on the quality of technology solutions and mitigates the risks from deep technical expertise creating silos of domains. We cannot get away from the need for our really sharp SMEs, nor should we want to. However, we must acknowledge that our solutions demand attention to integrating disparate components in increasingly complex ways.
Everyone in the IT business – particularly developers – is familiar with testing. Testing is the mechanism by which organizations perform quality assurance. The good news is that testing is so ingrained in software development organizations that some level of testing is almost always performed. The bad news is that software testing is really just one aspect of testing the entire solution; the things you’re not testing when you do software QA can just as easily sink the ship. Two other aspects of testing are critical to ensuring a solution is resilient.
First, the infrastructure must be tested. “But,” you say, “I test my infrastructure in the process of testing the software!” This may be true to varying degrees depending on the types of software testing that are performed. However, many details of the infrastructure are difficult to test or unique to a particular environment. Server configurations may match between your test environment and production, but firewall rules are most certainly different. It’s very difficult to know that a test server is configured exactly the same as a production server – do you know with absolute certainty that your test server and your production server have exactly the same startup configuration? Kernel tuning parameters? Fibre Channel storage configuration? If you audit your environment, I promise the vast majority of organizations will find differences.
These factors may not seem that important, and in many “sunny day” scenarios, they’re probably not. It’s when conditions inevitably vary from normal that these variations rear their ugly heads. It’s precisely these times when you don’t want to be left wondering why your application is suddenly failing, only to discover after hours of your sysadmin pulling her hair out that one production server’s NIC has a default gateway configured incorrectly.
Dealing with this situation requires a multi-pronged approach. First, periodic audits of configurable items on all servers need to be standard operating procedure. Second, new production environments need to be tested in the same way you would test in a performance testing environment. Third, existing production environments undergoing change should have predefined methods for periodic verification. For example, if a production environment has a change (e.g. new code, new server configuration, patches), there should be a way to “test” these changes on a small subset of all the production servers. This requires planning in advance, which is why architecture and planning for resiliency is so important. When two or more identical production environments exist (hopefully always!), take each one offline periodically and test them.
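The audit itself can start as something very small: collect the configurable items from each server into key/value form and diff them. The sketch below assumes that collection has already happened. The keys and values are hypothetical examples, and `"<absent>"` is just a marker I chose for items that exist on one server only.

```python
def config_drift(reference: dict, candidate: dict) -> dict:
    """Return {item: (reference_value, candidate_value)} for every differing item."""
    absent = "<absent>"
    drift = {}
    for key in sorted(reference.keys() | candidate.keys()):
        ref_value = reference.get(key, absent)
        cand_value = candidate.get(key, absent)
        if ref_value != cand_value:
            drift[key] = (ref_value, cand_value)
    return drift


# Hypothetical settings gathered from two supposedly identical production servers.
prod_a = {"default_gateway": "10.0.0.1", "net.core.somaxconn": "1024", "ntp_server": "10.0.0.5"}
prod_b = {"default_gateway": "10.0.1.1", "net.core.somaxconn": "1024"}

drift = config_drift(prod_a, prod_b)
```

Here the audit surfaces the mismatched default gateway and a missing NTP setting before they show up as an unexplained failure. Run on a schedule, the same comparison works between test and production servers.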
Similar to infrastructure, architectural items also need to be verified. It would be unthinkable to not test functional requirements of your application, so why wouldn’t you also test architectural requirements? In particular, architectural requirements that affect the quality of your application are absolutely critical. For example, if you have redundancy, failover, or the ability for a component to run in a degraded mode built into the structure of your solution, they must be tested. Similar to functional requirements, these architectural requirements need to have test plans and have traceability through design artifacts.
With so much focus on “functional” requirements, many organizations lose sight of the “non-functional” requirements. Calling the latter non-functional does a great disservice to these important details; they’re really quality requirements. The overall quality of the environment is a function of many inputs: software, infrastructure, architecture, and testing of all three.