Enterprise Application Design vs. Shipping Consumer Software: The Case for Simplicity

Joel Spolsky recently wrote an article about “The Duct Tape Programmer” in which he espouses the benefits of a pragmatic approach to creating (and thus shipping) software:

Duct tape programmers are pragmatic. Zawinski popularized Richard Gabriel’s precept of Worse is Better. A 50%-good solution that people actually have solves more problems and survives longer than a 99% solution that nobody has because it’s in your lab where you’re endlessly polishing the damn thing. Shipping is a feature. A really important feature. Your product must have it.

I think this approach makes sense to a degree, though there are certainly some good counter-arguments that have been made.  What interested me about his post, though, was how approaches to “shipping software” can sometimes differ in large enterprises compared with shipping commercial software to end users or producing applications in small or medium-sized businesses.

This isn’t to say that creating a good product – for any audience, in any setting – is easy.  It’s not.  The types of challenges, though, are different.  This is why Joel’s comments help highlight the differences between developing “enterprise applications” – by which I mean applications, increasingly in the form of rich internet applications, that are consumed by millions of customers – and more traditional COTS applications or smaller internet-based offerings.  Ability to manage scale is a feature just like shipping software.

The reason for the difference is that the scale (in terms of users) amplifies all aspects of an application: obviously its ability to handle increasing volume, but also bad (and good) design decisions, the ability to react to new requirements/features, and the quality of all those “pragmatic” decisions.  The latitude to make misjudgments when trying to be pragmatic with a web-based application supporting a million concurrent users is much more constrained than with one that supports three hundred concurrent users.

The challenge for the managers, developers and architects of enterprise-scale applications is how to avoid having to make these kinds of decisions in the first place.  This is one area where Joel absolutely nails it: simplicity is key.  The number one way to avoid having to make difficult, pragmatic trade-offs is to keep the solution simple.  In a series of upcoming posts on resiliency, I’ll explore this theme in the context of infrastructure, software development and integration services.

But what if you already have a complex solution?  What if an aspect of your solution is complex by necessity?  How do you know which trade-offs are “safe” and which will cause failures, customer frustration, or slow time-to-market for future features?  Experience counts for a lot, but is it good business to make these decisions on instinct?  No – but lots of companies do it.

While there are very experienced, talented folks who can make these decisions by gut feel, they’re few and far between.  It’s not a repeatable process.  It can’t be explained to shareholders.  It can’t be quantified.  Analysis and data are required to make informed trade-offs rather than instinctual gambles on what will work.  This is why an integrated approach to solution architecture, software design and development, infrastructure support, people management and process control is required.  These decisions get made based on data from detailed failure-mode analysis.  They’re supported by data collected from the operating environment about user behavior.  They’re mitigated through tightly controlled processes and a quality of communication that is difficult to achieve in your typical Fortune 100 company.

Ideally, the people making trade-offs about features, functionality, and complexity know exactly the following for each and every transaction or feature available to a user:

  • The business “value” (revenue generated, costs avoided, etc.) of each transaction/feature
  • The frequency of use of each transaction/feature
  • The likelihood of a particular failure mode occurring, which transactions/features are affected, how it will be detected, and how long it takes to fix

While this pinnacle of knowledge cannot always be achieved, it can be approximated more easily than many people believe.  It requires changing how design is approached, buy-in from business partners, and the ability to spend time during the design process to perform the necessary analysis.  It’s not easy, but neither is competing at enterprise-scale.
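As a rough illustration of how the three inputs listed above might be combined, the sketch below ranks features by an expected annual loss. The feature names and every number are invented for the example, and the formula (value per use × uses per hour × hours to restore × failures per year) is a simplifying assumption, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    value_per_use: float      # business value of one transaction (hypothetical dollars)
    uses_per_hour: float      # observed frequency of use
    failures_per_year: float  # estimated rate of the failure mode that affects it
    hours_to_restore: float   # time to detect the failure plus time to fix it


def expected_annual_loss(feature: Feature) -> float:
    """Value lost during one outage times the expected number of outages per year."""
    loss_per_outage = (
        feature.value_per_use * feature.uses_per_hour * feature.hours_to_restore
    )
    return loss_per_outage * feature.failures_per_year


def rank_by_exposure(features: list[Feature]) -> list[Feature]:
    """Order features from most to least exposed, to show where design effort pays off."""
    return sorted(features, key=expected_annual_loss, reverse=True)


# Illustrative numbers only; real figures would come from business partners
# and from data collected in the operating environment.
FEATURES = [
    Feature("balance_inquiry", 0.05, 80_000, failures_per_year=4, hours_to_restore=0.5),
    Feature("funds_transfer", 2.00, 5_000, failures_per_year=2, hours_to_restore=1.5),
    Feature("statement_download", 0.10, 1_000, failures_per_year=1, hours_to_restore=4.0),
]

if __name__ == "__main__":
    for feature in rank_by_exposure(FEATURES):
        print(f"{feature.name}: {expected_annual_loss(feature):,.0f}")
```

Even a crude ranking like this gives the people making trade-offs something to argue about other than instinct.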

This is why simplicity is important: the less complex a solution is, the easier it is to gain this insight.  Spending time on simplicity pays off.

Elements of Resilient Infrastructure

In this series of posts I’ll discuss the foundations of resilient infrastructure, including facilities, network, server hardware, workload management, monitoring, and presentation, application, messaging and database servers as well as why seemingly resilient implementations of these components can fail to provide resilient solutions. Our focus will be on the patterns used when implementing these components rather than vendor-specific details. Each vendor implementation may be a little different, but the methods for creating resiliency when designing and configuring a piece of infrastructure will be very similar. The focus will be on which patterns are valuable and how those patterns relate to the other domains in the book to provide resilient solutions.

One recurring theme throughout is the notion that simple is usually better. A challenge when building and maintaining complex systems is striking the optimal balance between too much complexity (which makes failure modes harder to predict and troubleshoot) and too much simplicity (such that there is very little flexibility to deal with failures or maintain the system). Complexity can creep into our solution in insidious ways: a “high availability” clustering solution that makes troubleshooting intermittent failures difficult; dynamic routing in a network to provide redundancy that introduces variability that causes sporadic timeouts; and database failover approaches that make returning to the pre-failure configuration difficult. These approaches are not always bad, but the complexity introduced by these design decisions must be weighed against the value they provide and alternatives that are less sophisticated but easier to manage.

Facilities

The number of facilities (data centers or hosting locations) that a solution is hosted in is an obvious factor in the resiliency of an application. If a solution is only hosted in a single data center and the entire data center experiences a failure or disaster, the solution clearly fails. Assuming that a solution can horizontally scale across a small number (two to four) of facilities, the more relevant question from a resiliency perspective is how the application should be sized to maximize resiliency and minimize waste.

Questions to Consider When Determining Facility Needs for Resiliency

The number of facilities that are used to host a solution affects how systems and components will scale across sites, the amount of infrastructure needed for redundancy and, as we will see in later chapters, the application design. Some critical questions to ask when considering the number of facilities:

  1. Is there a range of the number of facilities in which this solution can be hosted? For example, is it possible to deploy the solution in two, three or four data centers? Or are two data centers the most that are available, thus constraining your choices?
  2. If there is a range of the number of facilities that can be used, how many simultaneous facility failures should the solution be able to tolerate? For example, if the solution will be hosted in three data centers, the solution should probably be able to tolerate the loss of at least one data center. Should it also be able to tolerate the loss of two?
  3. How does the data center topology affect other aspects of the solution? For example, presentation and application servers are typically much easier to scale horizontally than database servers.
  4. Will all the components scale equally well across data centers, or will some components, like a database server, scale only to two locations?
  5. How will load-balancing affect the number of facilities? Does load-balancing between facilities exist only at the “front door” (so once traffic comes to a particular data center it will never leave) or could traffic between system components be load-balanced to different data centers? How does this affect the application design?

Having control over the number of facilities allows for much greater control over the total number of components needed to obtain the same level of resiliency. Let’s consider a simplified solution that we want to deploy in multiple data centers. We know that we require four servers to be available to handle peak load on the system and we want to be able to tolerate the failure of one data center without impacting our customers.

By increasing the number of data centers we’ve reduced the total number of servers needed to provide the same level of resiliency – the ability to tolerate the failure of one data center. This is one example of the need to find the optimal balance for simplicity: two data centers with four servers each results in some server waste, but deploying the same solution in nine data centers with one server each probably introduces a lot of unnecessary overhead and is highly impractical.
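The arithmetic behind this comparison can be sketched in a few lines. The function names are made up for illustration, and the model assumes identical servers and evenly sized sites, and counts nothing but servers (no per-site overhead).

```python
import math


def servers_per_site(peak_servers: int, sites: int, tolerated_site_failures: int = 1) -> int:
    """Servers each site needs so the surviving sites can still carry peak load."""
    surviving_sites = sites - tolerated_site_failures
    if surviving_sites < 1:
        raise ValueError("at least one site must survive the tolerated failures")
    return math.ceil(peak_servers / surviving_sites)


def total_servers(peak_servers: int, sites: int, tolerated_site_failures: int = 1) -> int:
    """Total servers deployed across all sites."""
    return sites * servers_per_site(peak_servers, sites, tolerated_site_failures)


if __name__ == "__main__":
    # Four servers are needed at peak; one data center may be lost.
    for sites in (2, 3, 4, 9):
        per_site = servers_per_site(4, sites)
        print(f"{sites} sites: {per_site} per site, {total_servers(4, sites)} total")
```

With four servers needed at peak, two sites require eight servers in total and three sites require six. Four sites need eight again, because each site still has to be rounded up to two servers.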

The three data center choice offers a good balance in our example scenario. We gain 25% efficiency in our server cost while only marginally increasing the complexity from a data center deployment view. Unfortunately, you may not always have the luxury of choosing how many facilities are available to deploy a solution. Perhaps your company or client only has two data centers, or constraints within the components you’re deploying limit your choices. Whatever the reason, it is still important to consider the questions above.

If there are multiple types of components being deployed (e.g. web servers, application servers, middleware servers), how does the number of data centers affect their configurations? Load balancing? Database replication? Messaging services? How will maintenance (configuration or code changes, software upgrades, etc.) to the solution be handled in the given topology? How does the number of data centers affect monitoring of the solution? I explore these questions in more detail in a later posting.

It’s also important to consider these questions in the context of third-party vendors on which your solution relies. While vendor contracts often specify penalties for failing to meet their service level agreement, these penalties rarely make up for the true cost of failure. Moreover, a vendor’s configuration may influence your design. If the solution you’re building will be deployed in three data centers at your company but needs to connect to two data centers at an external vendor’s site, how does this affect the connectivity between the sites? Many vendors may be reluctant to share implementation details of their infrastructure after contracts are signed, so it is critically important to negotiate for these details as part of your supplier management process.

It’s also important to consider how the environment within a facility may improve or degrade the resiliency of a solution. It is common for data centers to have multiple power distribution units (PDUs), receive power from multiple power grids, have redundant ISP connections, and so on. However, failures still can and do occur, often because the equipment in the facility is not implemented to take advantage of this redundancy. For example, a rack full of servers with dual power supplies may have each power supply receiving power from the same PDU despite multiple PDUs being available. Likewise, circuits from multiple ISPs may be provisioned into the data center, but all of the servers in a solution that require ISP connectivity may only have access to one of those circuits.
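A check for the power example above is easy to automate if cabling is recorded anywhere. The sketch below assumes a hypothetical inventory that maps each server to the PDU feeding each of its power supplies. The server and PDU names are invented.

```python
def single_pdu_servers(power_feeds: dict[str, list[str]]) -> list[str]:
    """Return servers whose power supplies all draw from the same PDU."""
    return sorted(
        server
        for server, pdus in power_feeds.items()
        if len(set(pdus)) < 2  # fewer than two distinct PDUs means no power redundancy
    )


# Hypothetical cabling inventory: server -> PDU behind each power supply.
RACK_12 = {
    "web-01": ["PDU-A", "PDU-B"],  # correctly split across two PDUs
    "web-02": ["PDU-A", "PDU-A"],  # dual supplies, but both on one PDU
    "db-01": ["PDU-B"],            # single power supply
}

if __name__ == "__main__":
    for server in single_pdu_servers(RACK_12):
        print(f"{server} loses power if its only PDU fails")
```

The same shape of check applies to ISP circuits: list which circuits each server can actually reach and flag any server with only one.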

In the next post, I’ll explore how the network may affect resiliency in unexpected ways.

The dangers of narrow subject matter expertise and the case for solution architecture

As technology solutions continue to increase in complexity, organizations often respond by creating teams with deep technical expertise to design, build and maintain their technology assets.  One side effect of deep technical expertise is narrowing breadth of knowledge.  While most IT professionals start their careers with broad technical knowledge (though perhaps not experience), as one’s experience and interest deepens in one particular domain, the breadth of knowledge – by necessity – shrinks.  This side effect is rarely seen as negative; in fact, deep technical expertise is often – and rightly – held in high regard.  

Unfortunately, lack of breadth presents serious risks to the quality of our solutions.  The complexity within technology domains (driving us to create deep technical expertise) also generates complexity in the interfaces between these domains.  For the purposes of this discussion, I’m referring to domains in a high level, coarse-grained sense like application development, application servers, network infrastructure, supporting application components (databases, middleware, and the like), storage solutions, and so on. These domains are individually complex, often to the point where there are sub-specialties within them.  (Case in point: most large organizations have some network engineers who specialize in load balancing while others are experts in network design/engineering.)

Some will argue that while these individual domains are complex, the internal complexity is abstracted from the interfaces to other components thus hiding the “inner workings”.  While this is often a design goal, it is rarely fully realized.  It’s easy to believe this well-meaning but dangerous fallacy, especially since so many IT professionals are “classically educated” as software developers.  Any college educated CompSci or CIS professional learned the importance of object-orientation, interfaces, abstraction, and so on.  Our trust in abstraction is so conditioned that it just feels like it should work in other domains.  It seems logical, but it just doesn’t scale to the breadth of systems and degree of complexity outside of a pure software engineering paradigm.

An exploration of the reasons why abstraction doesn’t scale could be an entire series of articles in its own right, but a cursory treatment may help convince skeptics.  First, abstraction in an object-oriented software engineering context is completely implemented within a single system (like a programming language).  The layer of abstraction and the complexity on either side of that abstraction share common tools, semantics and structure.  This commonality reduces the overall complexity of the solution and the difficulty associated with making the abstraction work.  Purists will argue that commonality doesn’t matter that much.  Evidence that this is not true can be found by comparing the difficulty of integrating two software components both written in Java with reasonable software standards with the difficulty of integrating a software component written in Java with another component written in .NET using web services.  The myriad of “standards” for web services highlights the difference between these scenarios.

Second, abstraction within software engineering is rooted in programming languages that exhibit a high degree of precision with respect to their semantics and syntax and, in comparison to broad IT “solutions”, are relatively simple.  Java, for example, has only 50 keywords and a handful of syntax rules that can be used to implement abstraction.  Compare this to the average load-balancing solution, network switching infrastructure, or application server configuration, all of which can be configured in highly variable (and novel) ways.  This difference in complexity makes abstraction a much more difficult task.  In a software engineering world, it’s the difference between abstraction for a simple framework (something like implementing MVC) and abstraction for an operating system’s threading and memory management libraries.

Third, standards and patterns for abstraction in software engineering are well established and commonly understood.  Creating standards and identifying patterns is easier in software because of the previous two points.  Standards and patterns for integrating components from different domains (e.g. making load-balancers, web servers, application servers, and database servers work together) do exist and may be commonly understood, but are not so detailed or so precise that they reduce complexity or completely hide the inner workings of each individual component.

If we can agree that abstraction doesn’t really work between domains and that individual domains are so complex that they require deep technical expertise, we must then acknowledge that the integration of these components is a significant concern in its own right.  This is what architecture – especially solution architecture – is really about.  So called “PowerPoint architecture” or domain specific architecture (provided by software architects, storage solution architects, etc.) is not a substitute for holistic solution architecture that defines how disparate components from different domains will interact.  Make no mistake: PowerPoint architecture and domain-specific architecture have their place. Domain-specific architecture must be part of the solution delivery process.  Unfortunately, it is too often the focus, usually at the expense of good solution architecture.

Solution architects need to balance breadth and depth of technical knowledge to be effective.  This means that not every solution architect need come from a heavy software architecture or software engineering background.  Instead, a good solution architect understands a wide range of technology domains and has experience putting them together in a variety of settings.

A classic example of a problem where this kind of solution architecture really makes a difference is in geographically distributed web applications that are transactional and stateful in nature.  The developer or software architect will rely on the application server’s services for managing session state and persistence.  The infrastructure/hosting teams will rely on load-balancing solutions for “session stickiness” to keep a customer “stuck” to a particular web and application server for the duration of the session.  Easy enough, except as soon as you launch the application in production, you start getting reports of customers complaining that they’re losing their sessions, having to “start over” in multi-step transaction flows, or other intermittent, unpredictable behavior.  What happened?

Large ISPs like Comcast or AOL have multiple proxy servers from which a customer’s HTTP session may originate.  During the customer’s session, the ISP may internally load-balance the customer to a different proxy server, causing the source IP address to change.  Your load-balancer session stickiness didn’t account for this, the user got load-balanced to a different web or application server, and the session state couldn’t be rebuilt.  
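The failure is easy to reproduce in miniature. The sketch below is an illustration, not a model of any particular load balancer: the backend names are invented, the modulo on the address is a simplified stand-in for whatever hashing a real device uses, and keying on a session cookie is shown as just one alternative.

```python
import hashlib
import ipaddress

BACKENDS = ["app-1", "app-2", "app-3", "app-4"]  # hypothetical server names


def route_by_source_ip(source_ip: str) -> str:
    """Pick a backend from the client address (simplified stand-in for an IP hash)."""
    return BACKENDS[int(ipaddress.ip_address(source_ip)) % len(BACKENDS)]


def route_by_session_cookie(session_id: str) -> str:
    """Pick a backend from an identifier the browser sends on every request."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]


# One customer session whose requests leave the ISP through two different proxies.
SESSION = [
    {"source_ip": "198.51.100.10", "session_id": "abc123"},
    {"source_ip": "198.51.100.11", "session_id": "abc123"},
]

if __name__ == "__main__":
    by_ip = {route_by_source_ip(r["source_ip"]) for r in SESSION}
    by_cookie = {route_by_session_cookie(r["session_id"]) for r in SESSION}
    print("source-IP stickiness reached:", sorted(by_ip))
    print("cookie stickiness reached:", sorted(by_cookie))
```

Keyed on source IP, the two requests land on two different servers. Keyed on the cookie, they land on one. The cookie only helps within a single load-balanced pool, though; it does nothing if the customer arrives at a different data center.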

There are many variations on this theme…  Maybe the load-balancer uses SSL ID, but the ISP’s proxy had a different A record cached for your site, so the user ended up in another data center.  Perhaps your web servers can’t route traffic to the application server that “knows” the customer’s state.  Or you’ve really done your homework and built a global cache to manage state, but the cache didn’t replicate fast enough.  The bottom line is that there are a variety of scenarios in which the interaction between the load-balancing infrastructure, application server configuration, and application code determine the actual customer experience.

This is where a good solution architect will save the day.  Your deep technical SMEs are still invaluable, but detecting the possibility of scenarios like the one above requires breadth of knowledge and a thorough understanding of the characteristics the solution as a whole must have, rather than those of any individual component.  The need for solution architecture is very real.  Doing it well has a tangible effect on the quality of technology solutions and mitigates the risks from deep technical expertise creating silos of domains.  We cannot get away from the need for our really sharp SMEs, nor should we want to.  However, we must acknowledge that our solutions demand attention to integrating disparate components in increasingly complex ways.

Resiliency, Architecture and the Importance of Testing

Everyone in the IT business – particularly developers – is familiar with testing.  Testing is the mechanism by which organizations perform quality assurance.  The good news is that testing is so ingrained in software development organizations that some level of testing is almost always performed.  The bad news is that software testing is really just one aspect of testing the entire solution; the things you’re not testing when you do software QA can just as easily sink the ship.  There are two other aspects of testing that are critical to ensuring a solution is resilient.

First, the infrastructure must be tested.  “But,” you say, “I test my infrastructure in the process of testing the software!”  This may be true to varying degrees depending on the types of software testing that are performed.  However, many details of the infrastructure are difficult to test or unique to a particular environment.  Server configurations may match between your test environment and production, but firewall rules are almost certainly different.  It’s very difficult to know that a test server is configured exactly the same as a production server – do you know with absolute certainty that your test server and your production server have exactly the same startup configuration?  Kernel tuning parameters?  Fibre Channel storage configuration?  I promise that the vast majority of organizations that audit their environments will find differences.

These factors may not seem that important, and in many “sunny day” scenarios, they’re probably not.  It’s when conditions inevitably vary from normal that these variations rear their ugly heads.  It’s precisely these times when you don’t want to be left wondering why your application is suddenly failing, only to discover after hours of your sysadmin pulling her hair out that one production server’s NIC has a default gateway configured incorrectly.

Dealing with this situation requires a multi-pronged approach.  First, periodic audits of configurable items on all servers need to be standard operating procedure.  Second, new production environments need to be tested in the same way you would test in a performance testing environment.  Third, existing production environments undergoing change should have predefined methods for periodic verification.  For example, if a production environment has a change (e.g. new code, new server configuration, patches), there should be a way to “test” these changes on a small subset of all the production servers.  This requires planning in advance, which is why architecture and planning for resiliency are so important.  When two or more identical production environments exist (hopefully always!), take each one offline periodically and test it.
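The periodic audit can start as something very small. The sketch below assumes each server's configurable items have already been collected into a flat dictionary; how they are collected is out of scope, and the setting names and values are invented for illustration.

```python
from typing import Dict, Optional, Tuple

Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def find_drift(baseline: Dict[str, str], candidate: Dict[str, str]) -> Drift:
    """Return every setting that differs, including settings missing on one side."""
    drift: Drift = {}
    for key in sorted(baseline.keys() | candidate.keys()):
        expected = baseline.get(key)  # None when the baseline lacks the setting
        actual = candidate.get(key)   # None when the candidate lacks the setting
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Hypothetical snapshots of two servers that are supposed to be identical.
PROD_01 = {
    "default_gateway": "192.0.2.1",
    "net.core.somaxconn": "1024",
    "jvm.heap_max": "4g",
}
PROD_02 = {
    "default_gateway": "192.0.2.254",  # the kind of mistake that hides until a failure
    "net.core.somaxconn": "1024",
    "startup.autostart": "false",      # present on this server only
}

if __name__ == "__main__":
    for setting, (expected, actual) in find_drift(PROD_01, PROD_02).items():
        print(f"{setting}: expected {expected!r}, found {actual!r}")
```

Run on a schedule against every server in a pool, a comparison like this turns "I think they match" into a report someone has to act on.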

Similar to infrastructure, architectural items also need to be verified.  It would be unthinkable to not test functional requirements of your application, so why wouldn’t you also test architectural requirements?  In particular, architectural requirements that affect the quality of your application are absolutely critical.  For example, if you have redundancy, failover, or the ability for a component to run in a degraded mode built into the structure of your solution, they must be tested.  Similar to functional requirements, these architectural requirements need to have test plans and have traceability through design artifacts.
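A test for an architectural requirement looks much like a functional test, except that it injects the failure on purpose. The sketch below uses a hypothetical `QuoteService` whose requirement is "serve the last known value when the primary dependency is down". The class, its fallback rule, and the values are assumptions made for the example.

```python
class QuoteService:
    """Hypothetical component that serves live data and falls back to a cached copy."""

    def __init__(self, fetch_live, cached_value):
        self._fetch_live = fetch_live
        self._cached_value = cached_value

    def get(self):
        try:
            return {"value": self._fetch_live(), "degraded": False}
        except ConnectionError:
            # Degraded mode: answer from the last known value instead of failing.
            return {"value": self._cached_value, "degraded": True}


def healthy_backend():
    return 42


def failed_backend():
    raise ConnectionError("simulated outage of the primary dependency")


def test_normal_mode():
    result = QuoteService(healthy_backend, cached_value=40).get()
    assert result == {"value": 42, "degraded": False}


def test_degraded_mode_is_exercised():
    result = QuoteService(failed_backend, cached_value=40).get()
    assert result == {"value": 40, "degraded": True}


if __name__ == "__main__":
    test_normal_mode()
    test_degraded_mode_is_exercised()
    print("degraded-mode requirement verified")
```

The second test is the one that tends to be missing: the fallback path exists in the design but is never executed until the day it is needed.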

With so much focus on “functional” requirements, many organizations lose sight of the “non-functional” requirements.  Calling the latter non-functional does a great disservice to these important details; they’re really quality requirements.  The overall quality of the environment is a function of many inputs: software, infrastructure, architecture, and testing of all three.

What is resiliency?

One of the subjects that I deal with frequently is resiliency; specifically, the resiliency of technology solutions.  But what does it mean to be resilient?  Fundamentally, it means that a system or solution needs to be engineered with these goals in mind:

  1. The entire solution is designed to continue to function as normally as possible in the face of failure.
  2. When failures occur, they are invisible to the customer.
  3. If a failure must be visible to the customer, the solution provides the highest level of service possible (in other words, compartmentalize failures).

This sounds straightforward in theory but is rarely so in practice.  Why?  There are many contributing factors, and I’ll be dealing with these in detail in subsequent posts.  Some are obvious: resiliency adds cost, implementation costs must be balanced with business value and time-to-market pressures, and future failures are much more abstract than current business needs.  Despite these challenges, many organizations try to do the right thing by investing in the construction of resilient solutions that ultimately fail.
 
These scenarios are the ones that are particularly frustrating, leaving very knowledgeable technologists wondering why such a robust system failed.  In such cases, the answers are usually much more subtle: the complexity of systems leads to difficulty identifying failure modes, quantifying specific resiliency needs is rarely systematic, and control plans are inadequate or absent, leading to the development of new and unpredictable types of failures.  In all these cases, if you’ve ended up in such a scenario, it’s difficult or impossible to even quantify the operational, reputational and financial risks posed to the business – you just don’t know what you don’t know.
 
On this site, I’ll discuss these and other quandaries that threaten the stability of critical enterprise infrastructure.  Businesses no longer have the luxury of tolerating unreliable technology.  Five to ten years ago, the internet and related technologies were seen as new and unique – the virtual “wild west”.  Because these “enabling technologies” were viewed as somehow separate from the services and products that businesses provided, failure of the technology was not a direct reflection on the quality of the product or the capability of the provider.  Now, those enabling technologies have faded into the background – they are no longer new and exotic.  Customers expect mobile banking solutions on their cell phone to “just work”, just as land line telephone customers expect dial-tone or homeowners expect power from their electrical outlets.  Failure of technology now equates to failure of the business.
 
Resiliency is the mechanism to ensure that our solutions meet these demands.  Resiliency may not be easy, but it is necessary.