
Resiliency, Architecture and the Importance of Testing

Everyone in the IT business – particularly developers – is familiar with testing.  Testing is the mechanism by which organizations perform quality assurance.  The good news is that testing is so ingrained in software development organizations that some level of testing is almost always performed.  The bad news is that software testing is really just one aspect of testing the entire solution; the things you’re not testing when you do software QA can just as easily sink the ship.  Two other aspects, infrastructure and architecture, must also be tested to ensure a solution is resilient.

First, the infrastructure must be tested.  “But,” you say, “I test my infrastructure in the process of testing the software!”  This may be true to varying degrees depending on the types of software testing that are performed.  However, many details of the infrastructure are difficult to test or unique to a particular environment.  Server configurations may match between your test environment and production, but firewall rules are almost certainly different.  It’s very difficult to know that a test server is configured exactly the same as a production server – do you know with absolute certainty that your test server and your production server have exactly the same startup configuration?  Kernel tuning parameters?  Fibre Channel storage configuration?  If you audit your environment, I expect you will find differences, as the vast majority of organizations do.

These factors may not seem that important, and in many “sunny day” scenarios, they’re probably not.  It’s when conditions inevitably vary from normal that these variations rear their ugly heads.  It’s precisely at these times that you don’t want to be left wondering why your application is suddenly failing, only to discover after hours of your sysadmin pulling her hair out that one production server’s NIC has a default gateway configured incorrectly.

Dealing with this situation requires a multi-pronged approach.  First, periodic audits of configurable items on all servers need to be standard operating procedure.  Second, new production environments need to be tested in the same way you would test in a performance testing environment.  Third, existing production environments undergoing change should have predefined methods for periodic verification.  For example, if a production environment has a change (e.g. new code, new server configuration, patches), there should be a way to “test” these changes on a small subset of all the production servers.  This requires planning in advance, which is why architecture and planning for resiliency are so important.  When two or more identical production environments exist (hopefully always!), take each one offline periodically and test it.
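The periodic audit described above can be sketched in a few lines of Python.  This is only an illustration, not a real tool: it assumes each server's configurable items have already been collected into a simple key/value mapping, and the names `find_drift` and `MISSING`, the server names, and the sample settings are all hypothetical.

```python
from typing import Dict, Mapping, Tuple

# Placeholder used when a setting exists on one side only (hypothetical name).
MISSING = "<missing>"


def find_drift(
    baseline: Mapping[str, str],
    servers: Mapping[str, Mapping[str, str]],
) -> Dict[str, Dict[str, Tuple[str, str]]]:
    """Return {server: {setting: (expected, actual)}} for every difference.

    Servers that match the baseline exactly are left out of the report.
    """
    report: Dict[str, Dict[str, Tuple[str, str]]] = {}
    for name, config in servers.items():
        diffs: Dict[str, Tuple[str, str]] = {}
        # Check every key seen on either side so extra and missing
        # settings are both reported.
        for key in sorted(set(baseline) | set(config)):
            expected = baseline.get(key, MISSING)
            actual = config.get(key, MISSING)
            if expected != actual:
                diffs[key] = (expected, actual)
        if diffs:
            report[name] = diffs
    return report


if __name__ == "__main__":
    # Illustrative data only; real values would come from the servers.
    baseline = {"default_gateway": "10.0.0.1", "vm.swappiness": "10"}
    servers = {
        "prod-01": {"default_gateway": "10.0.0.1", "vm.swappiness": "10"},
        "prod-02": {"default_gateway": "10.0.1.1", "vm.swappiness": "10"},
    }
    for server, diffs in find_drift(baseline, servers).items():
        for key, (expected, actual) in diffs.items():
            print(f"{server}: {key} expected {expected!r}, found {actual!r}")
```

Run on a schedule against every server, a comparison like this would surface the misconfigured default gateway from the earlier example before it causes an outage rather than hours into one.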

Second, like infrastructure, architectural items also need to be verified.  It would be unthinkable not to test the functional requirements of your application, so why wouldn’t you also test architectural requirements?  In particular, architectural requirements that affect the quality of your application are absolutely critical.  For example, if you have redundancy, failover, or the ability for a component to run in a degraded mode built into the structure of your solution, those capabilities must be tested.  Like functional requirements, these architectural requirements need test plans and need to be traceable through design artifacts.
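As a rough sketch of what testing an architectural requirement might look like, the Python below models failover and degraded mode with plain functions and then checks both behaviors explicitly.  Everything here is an assumption made for illustration: `call_with_failover`, `AllBackendsFailed`, and the stand-in backends are hypothetical, and a real test would exercise actual components by taking them out of service.

```python
from typing import Callable, Optional, Sequence


class AllBackendsFailed(Exception):
    """Raised when every backend fails and no degraded mode is available."""


def call_with_failover(
    backends: Sequence[Callable[[str], str]],
    request: str,
    fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Try each backend in order; use the degraded-mode fallback last."""
    for backend in backends:
        try:
            return backend(request)
        except ConnectionError:
            continue  # Fail over to the next redundant backend.
    if fallback is not None:
        return fallback(request)  # Degraded mode: reduced but working.
    raise AllBackendsFailed(f"no backend could serve {request!r}")


# Stand-ins for real components (hypothetical).
def failed_backend(request: str) -> str:
    raise ConnectionError("backend is down")


def healthy_backend(request: str) -> str:
    return f"full response for {request}"


def cached_fallback(request: str) -> str:
    return f"cached response for {request}"


def test_failover_to_secondary() -> None:
    result = call_with_failover([failed_backend, healthy_backend], "order-1")
    assert result == "full response for order-1"


def test_degraded_mode_when_all_backends_fail() -> None:
    result = call_with_failover(
        [failed_backend, failed_backend], "order-1", fallback=cached_fallback
    )
    assert result == "cached response for order-1"


if __name__ == "__main__":
    test_failover_to_secondary()
    test_degraded_mode_when_all_backends_fail()
    print("failover and degraded-mode checks passed")
```

The point of the sketch is that each architectural requirement (failover, degraded mode) maps to a named test, which gives it the same traceability a functional requirement would have.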

With so much focus on “functional” requirements, many organizations lose sight of the “non-functional” requirements.  Calling the latter non-functional does a great disservice to these important details; they’re really quality requirements.  The overall quality of the environment is a function of many inputs: software, infrastructure, architecture, and testing of all three.
