
Resiliency, Architecture and the Importance of Testing

Everyone in the IT business – particularly developers – is familiar with testing. Testing is the mechanism by which organizations perform quality assurance. The good news is that testing is so ingrained in software development organizations that some level of testing is almost always performed. The bad news is that software testing is really just one aspect of testing the entire solution; the things you’re not testing when you do software QA can just as easily sink the ship. Two other aspects, infrastructure and architecture, must also be tested to ensure a solution is resilient.

First, the infrastructure must be tested. “But,” you say, “I test my infrastructure in the process of testing the software!” This may be true to varying degrees depending on the types of software testing that are performed. However, many details of the infrastructure are difficult to test or unique to a particular environment. Server configurations may match between your test environment and production, but firewall rules are most certainly different. It’s very difficult to know that a test server is configured exactly the same as a production server – do you know with absolute certainty that your test server and your production server have exactly the same startup configuration? Kernel tuning parameters? Fibre Channel storage configuration? I promise that the vast majority of organizations that audit their environments will find differences.

These factors may not seem that important, and in many “sunny day” scenarios, they’re probably not. It’s when conditions inevitably vary from normal that these variations rear their ugly heads. It’s precisely at these times that you don’t want to be left wondering why your application is suddenly failing, only to discover after hours of your sysadmin pulling her hair out that one production server’s NIC has a default gateway configured incorrectly.

Dealing with this situation requires a multi-pronged approach. First, periodic audits of configurable items on all servers need to be standard operating procedure. Second, new production environments need to be tested in the same way you would test in a performance testing environment. Third, existing production environments undergoing change should have predefined methods for periodic verification. For example, if a production environment has a change (e.g. new code, new server configuration, patches), there should be a way to “test” these changes on a small subset of all the production servers. This requires planning in advance, which is why architecture and planning for resiliency are so important. When two or more identical production environments exist (hopefully always!), take each one offline periodically and test it.
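To make the first prong a little more concrete, here is a minimal sketch of a configuration audit in Python. It assumes that some collection step (not shown) has already gathered each server’s settings into a dictionary of strings; the function name, the server names and the setting keys are all hypothetical, chosen only for illustration. The idea is simply to treat the most common value of each setting as the expected one and flag any server that disagrees.

```python
from collections import Counter

# Placeholder used when a server does not report a setting at all (an assumption
# of this sketch, not a convention of any particular tool).
MISSING = "<missing>"


def find_config_drift(servers):
    """Compare settings across servers and report the outliers.

    `servers` maps a server name to a dict of {setting: value}, with values
    given as strings. For each setting, the most common value is treated as
    the expected one. Returns {setting: {"expected": value, "outliers": {server: value}}}
    containing only the settings on which at least one server disagrees.
    """
    all_settings = set()
    for config in servers.values():
        all_settings.update(config)

    drift = {}
    for setting in sorted(all_settings):
        values = {name: config.get(setting, MISSING) for name, config in servers.items()}
        expected, _count = Counter(values.values()).most_common(1)[0]
        outliers = {name: value for name, value in values.items() if value != expected}
        if outliers:
            drift[setting] = {"expected": expected, "outliers": outliers}
    return drift


# Hypothetical fleet: the third server has a different default gateway.
fleet = {
    "prod-web-01": {"default_gateway": "10.0.0.1", "vm.swappiness": "10"},
    "prod-web-02": {"default_gateway": "10.0.0.1", "vm.swappiness": "10"},
    "prod-web-03": {"default_gateway": "10.0.1.1", "vm.swappiness": "10"},
}

for setting, report in find_config_drift(fleet).items():
    for server, value in report["outliers"].items():
        print(f"{setting}: expected {report['expected']}, but {server} has {value}")
```

A majority vote is only one possible baseline. If you maintain an approved reference configuration, comparing every server against that reference is the stricter check, since a majority of servers can all be wrong in the same way.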

As with infrastructure, architectural items also need to be verified. It would be unthinkable not to test the functional requirements of your application, so why wouldn’t you also test architectural requirements? In particular, architectural requirements that affect the quality of your application are absolutely critical. For example, if you have redundancy, failover, or the ability for a component to run in a degraded mode built into the structure of your solution, those capabilities must be tested. Like functional requirements, these architectural requirements need test plans and traceability through design artifacts.
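As a rough illustration of what testing an architectural requirement can look like, the sketch below uses Python’s standard unittest module to inject backend failures and check both failover and degraded-mode behavior. Everything in it is hypothetical: the BalanceService class, the BackendDown exception and the in-memory cache stand in for whatever redundancy your solution actually provides.

```python
import unittest


class BackendDown(Exception):
    """Raised by a backend that cannot serve the request."""


class BalanceService:
    """Hypothetical service with a primary backend, a secondary, and a cache."""

    def __init__(self, primary, secondary, cache):
        self.primary = primary
        self.secondary = secondary
        self.cache = cache

    def get_balance(self, account):
        # Failover: try each backend in order and use the first that answers.
        for backend in (self.primary, self.secondary):
            try:
                balance = backend(account)
            except BackendDown:
                continue
            self.cache[account] = balance
            return {"balance": balance, "stale": False}
        # Degraded mode: serve the last known value, clearly marked as stale.
        if account in self.cache:
            return {"balance": self.cache[account], "stale": True}
        raise BackendDown(f"no backend or cached value for {account}")


def healthy(account):
    """Stand-in for a backend that is up."""
    return 100


def failed(account):
    """Stand-in for a backend that is down (the injected failure)."""
    raise BackendDown(account)


class ArchitecturalRequirementTests(unittest.TestCase):
    def test_fails_over_to_secondary(self):
        service = BalanceService(failed, healthy, {})
        self.assertEqual(service.get_balance("acct-1"), {"balance": 100, "stale": False})

    def test_degraded_mode_serves_cached_value(self):
        service = BalanceService(failed, failed, {"acct-1": 90})
        self.assertEqual(service.get_balance("acct-1"), {"balance": 90, "stale": True})

    def test_total_failure_is_reported(self):
        service = BalanceService(failed, failed, {})
        with self.assertRaises(BackendDown):
            service.get_balance("acct-1")
```

Saved to a file, these tests can be run with `python -m unittest`. Each test corresponds to a single architectural requirement (failover, degraded mode, explicit failure), which is what makes the traceability back to design artifacts possible.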

With so much focus on “functional” requirements, many organizations lose sight of the “non-functional” requirements. Calling the latter non-functional does a great disservice to these important details; they’re really quality requirements. The overall quality of the environment is a function of many inputs: software, infrastructure, architecture, and testing of all three.

What is resiliency?

One of the subjects that I deal with frequently is resiliency; specifically, the resiliency of technology solutions.  But what does it mean to be resilient?  Fundamentally, it means that a system or solution needs to be engineered with these goals in mind:

  1. The entire solution is designed to continue to function as normally as possible in the face of failure.
  2. When failures occur, they are invisible to the customer.
  3. If a failure must be visible to the customer, the solution provides the highest level of service possible (in other words, compartmentalize failures).

This sounds straightforward in theory but is rarely so in practice. Why? There are many contributing factors, and I’ll be dealing with these in detail in subsequent posts. Some are obvious: resiliency adds cost, implementation costs must be balanced with business value and time-to-market pressures, and future failures are much more abstract than current business needs. Despite these challenges, many organizations try to do the right thing and invest in building resilient solutions, only to see those solutions ultimately fail.
 
These scenarios are particularly frustrating, leaving very knowledgeable technologists wondering why such a robust system failed. In such cases, the answers are usually much more subtle: the complexity of systems makes failure modes difficult to identify, specific resiliency needs are rarely quantified systematically, and control plans are inadequate or absent, which allows new and unpredictable types of failures to develop. If you’ve ended up in such a scenario, it’s difficult or impossible to even quantify the operational, reputational and financial risks posed to the business – you just don’t know what you don’t know.
 
On this site, I’ll discuss these and other quandaries that threaten the stability of critical enterprise infrastructure. Businesses no longer have the luxury of tolerating unreliable technology. Five to ten years ago, the internet and related technologies were seen as new and unique – the virtual “wild west”. Because these “enabling technologies” were viewed as somehow separate from the services and products that businesses provided, failure of the technology was not a direct reflection on the quality of the product or the capability of the provider. Now, those enabling technologies have faded into the background – they are no longer new and exotic. Customers expect mobile banking solutions on their cell phones to “just work”, just as landline telephone customers expect a dial tone or homeowners expect power from their electrical outlets. Failure of technology now equates to failure of the business.
 
Resiliency is the mechanism to ensure that our solutions meet these demands.  Resiliency may not be easy, but it is necessary.