Validating existing model records in Rails

One problem that crops up as a Ruby on Rails application evolves is that new attributes are added to existing models, sometimes with new validation rules. For example, you may have a Customer model with name, address, email and phone number, then realize you want to add an attribute to store email marketing preferences many months later. If you add a validation to the new attribute, any new records will be appropriately validated, but the old records won’t have any value. In some cases you may be able to provide a default value in the database migration but this might not always be desirable or possible. These old model records that no longer pass validation can cause serious problems in an application, especially if you validate records through model associations and you have validation errors that aren’t expected.

To work around this, I created a quick little rake task that will find all your model classes, find all the model records and run validation on each record. In instances where the validation fails it will output the model name and the ID of the record that failed validation.

namespace :validate do
  desc "Validates all model records"
  task :models => :environment do
    # Load every model file so each class is defined before we look for subclasses
    Dir.glob(Rails.root.to_s + '/app/models/*.rb').each { |file| require file }
    ObjectSpace.each_object(Class) do |klass|
      # Only consider classes that inherit from ActiveRecord::Base
      next unless klass < ActiveRecord::Base
      begin
        # find_each loads records in batches rather than all at once
        klass.find_each do |r|
          puts "Invalid record in model #{klass} with id #{r.id}!" unless r.valid?
        end
      rescue StandardError => e
        puts "Error querying model #{klass} - maybe the table doesn't exist? (#{e.message})"
      end
    end
  end
end

Also available here on Gist.

Testing RESTful Rails controllers by POSTing XML

I ran into a problem this morning trying to test a Rails controller with an XML document that needed to be sent in an HTTP POST request. It turns out that it’s actually pretty easy using Rails’ IntegrationTest and a couple of small tweaks:

class MyXmlPostClass < ActionController::IntegrationTest
  test "should successfully post XML" do
    # assume you have an XML object named xml_request
    @headers ||= {}
    @headers['HTTP_ACCEPT'] = @headers['CONTENT_TYPE'] = 'application/xml'
    post '/controller_path', xml_request.to_s, @headers
    # use standard assert on response object
  end
end

That’s it!

Five Things I Wish I'd Known About Ruby & Rails

Six months ago, I decided to learn Ruby and the Ruby on Rails framework. It was a whim. I dove in head-first without spending much time studying the language. To force myself to solve real problems, I decided to write a real application. Now, with six months of experience, here are the things I wish I had known when I started:

1. Naming conventions matter.

Rails is built around the idea of convention over configuration. To that end, naming conventions matter. Rails will do a lot for you, but it needs some hints and relies on well-established Ruby conventions. For example, class names begin with an uppercase letter and generally separate distinct words by changing case (rather than underscores). So, a class for a service request might be named ServiceRequest. If the class represents a model, Rails must map the model to the underlying RDBMS – in doing so, it uses underscores to identify distinct words and pluralizes the name. Thus, the model implemented in class ServiceRequest is persisted to a table called service_requests. If you inadvertently start by naming your database tables without underscores, you’d end up with a model with the rather ugly name of Servicerequest.

Similarly, Rails uses language inflection rules to attempt to use plural or singular versions of the nouns in your models. So, a singular representation of a service request would be ServiceRequest, whereas a plural representation would be ServiceRequests. This works well for English words with standard inflections, but not so well for irregular or uncountable words. For example, Rails by default pluralizes inventory to inventories, which is not what you want if your application treats inventory as a single uncountable noun. These inflections are important in the naming of controllers and in describing the relationships between models. In the case of the inventory model, it’s the difference between saying belongs_to :inventory and has_many :inventories.

Knowing where to define Rails inflection rules is important in cases like this – you add a file in your config/initializers directory with something like this:

ActiveSupport::Inflector.inflections do |inflect|
  inflect.uncountable %w( inventory )
end

(This tells Rails that inventory is an uncountable word, so both the plural and singular versions are represented as just plain “inventory”.)

A brief overview of Ruby and Rails naming conventions is here. More information about Rails inflection rules is in this blog entry.

2. Knowing the idioms of Ruby saves time.

Probably more than in most other modern languages, with the exception of Perl, knowing the idioms of Ruby will save you time. A lot of time. Since most programmers think in their most recent language for the first few months of writing in a new language, it’s natural to fall into the habit of writing code in Ruby the way you would’ve done it in whatever you’re used to. This can lead to some unnecessarily complicated code that is very un-Ruby.

A basic example is string concatenation. You may be tempted to write:

message_body = "Your user id " + user_id + " is now ready for use at " + site_name + "."

Idiomatic Ruby would use:

message_body = "Your user id #{user_id} is now ready for use at #{site_name}."

It’s tempting to think that this isn’t that big of a deal. String concatenation the Ruby way provides only marginal benefit in terms of readability and code length, right? That may be true, but there are dozens of examples of similar time savers. Things like built-in methods to iterate over collections, using Proc and lambda functions as anonymous procedures, accessor helpers built into Ruby’s classes to create getter/setter-like functions automatically, and the ||= operator, which is essentially a “short-circuit” operator that returns the left-hand side if it is truthy and otherwise assigns the right-hand side to the left-hand side.
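To make those idioms concrete, here is a minimal sketch that touches each one. The Account class and the sample data are hypothetical and exist only for illustration.

```ruby
# Hypothetical class used only to demonstrate the idioms.
class Account
  # attr_accessor generates the getter and setter methods for us
  attr_accessor :owner, :balance

  def initialize(owner, balance)
    @owner = owner
    @balance = balance
  end

  # ||= assigns only when the left-hand side is nil or false,
  # so the value is computed once and then reused
  def nickname
    @nickname ||= owner.downcase
  end
end

accounts = [Account.new("Alice", 120), Account.new("Bob", 45)]

# Built-in collection methods replace index-based loops
owners = accounts.map { |account| account.owner }
large  = accounts.select { |account| account.balance > 100 }

# A lambda is an anonymous procedure that can be stored and called later
apply_fee = lambda { |account| account.balance -= 5 }
accounts.each { |account| apply_fee.call(account) }

puts "Owners: #{owners.join(', ')}"
```

None of these is dramatic on its own, but together they remove a lot of the boilerplate you would carry over from other languages.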

The bottom line is that it really does pay to spend some time learning the subtleties of Ruby that make it different from other languages. While it’s difficult to teach idioms, there is a great overview in this StackOverflow question.

3. Don’t get too far without learning about testing in Rails.

When I started with Ruby and Rails, I dismissed the idea of learning testing first. I knew about unit and functional testing from other languages and I was familiar with Test Driven Development methodologies. I just figured I’d learn the language and frameworks first and worry about testing later.

I did learn about testing and, in retrospect, I really should’ve taken the time to learn this first. There are lots of testing options with Ruby and Rails and the right one for a project can vary quite a bit. When you’re learning though, just use what comes out of the box. It’s very good. And while you may already know about testing and TDD, it’s instructive to try it the Rails way. Fixtures provide test data for you to use. Tests are split up between functional, integration and performance tests. The Rails Guide on the subject is a good primer – it won’t give you everything you need to know, but it will absolutely give you what you need to start.

There are lots of options when it comes to testing in Rails. My personal recommendation is to read up on shoulda, RSpec, Selenium and Cucumber. Don’t get too caught up in the philosophical debates on the tools. Just understand what is out there, write some tests as you learn, and you’ll be much better equipped to experiment with the wide range of testing frameworks.

There are two tools that are useful no matter what testing framework or approach you use: autotest and rcov. Autotest makes continuous testing painless and it has fantastic integration on Mac platforms with Growl. The rcov tool provides test coverage reports on a source file-by-file level as well as aggregate across your project.

4. Do not write a single line of code until you’ve installed RVM.

RVM, the Ruby Version Manager, is a simple-to-use tool for managing separate Ruby environments, including the version of Ruby and Ruby gems. This makes switching between different configurations of your Ruby and Rails environments painless. It doesn’t seem like that big of a deal, but as versions of Ruby or Rails change, and as you want to experiment with different Ruby libraries, this will make life much easier.

The Ruby and Rails community continues to evolve quickly and being able to keep independent environments is also helpful so you can try new releases without risking negative consequences on an existing installation. RVM is easy to use, even for a beginner, and the 15-30 minutes you spend setting it up will pay off in the long run.

5. Know all the places Rails stashes code so that you Don’t Repeat Yourself.

One of the principal tenets of Ruby and Rails is “Don’t Repeat Yourself” or just DRY. In line with this principle, Rails has a variety of places for code to live so it can be reused. It’s helpful to know what your options are so that you can know the best place to put your code as well as know all the places to look in other people’s code for little nuggets. I spent hours digging through code on Github trying to figure out how something worked because of this.

Here are a couple examples of where code hides out in Rails applications:

  • helpers/ – Helper modules are used to augment views. When you find that you’re repeating code in different actions of the same controller or even within the same view, helpers are a good way to abstract code from the view.
  • config/ – While not specifically a place to store reusable code, some applications will tweak or override default behavior of Ruby or Rails in the config directory. One common example of this is Devise, the gem that is used for providing a rich authentication framework.
  • vendor/ – Used for some Rails libraries.
  • public/javascripts/application.js – This JavaScript file is included with the javascript_include_tag :defaults tag in a view. In addition, Rails will include other JavaScript libraries when you use :defaults though the specifics vary by version of Rails. Generally speaking, you don’t want to do this – it can lead to pages needing to load JavaScript that isn’t necessary, and in some cases, can cause you to load JavaScript that conflicts with your own code. (In Rails 2.x, this happens frequently with conflicts between Prototype.js and jQuery libraries.)

Another confusing aspect for beginners is the inheritance that happens with controllers and helpers: everything in your controllers/application_controller.rb and helpers/application_helper.rb files is automatically included in all controllers and helpers throughout your application while all others are specific to a particular controller.

Conclusion

I’m only six months into my Ruby and Rails adventure but I’ve really enjoyed learning. Struggling through challenges is a great way to learn so while it may have been helpful to know the things above, I probably wouldn’t have the same appreciation for the language, framework and community as I do now. While I hope this is helpful to someone who is just getting started, it can’t replace hard-earned experience. Have patience, don’t give up when you feel like you can’t figure something out, and in six months you’ll be blown away by what you’ve learned.

Elements of Resilient Software

Resilient solutions require resilient infrastructure and resilient application design and development practices. Earlier posts discussed factors in a solution’s infrastructure that help increase resiliency, but the application that runs in that infrastructure must also be resilient or else the solution will fail. Software architects and developers generally recognize that their applications must tolerate failure, but these approaches are usually focused on conditions that exist within their code rather than external factors. Another consideration of resilient application design is how to limit the scope of failures; it is far better to have a fraction of functions fail quickly while the rest of the application’s functions continue to operate normally than to cause all functions to fail or operate in a degraded fashion.

The practices outlined in this series of posts are not substitutes for good code quality and common fault tolerant development practices; rather they augment those already known patterns. For example, Pullum describes many techniques for fault avoidance, prevention, removal and prediction in software while Koren and Krishna outline a systems-centric approach to hardware and software fault-tolerance. Code construction practices like those described by McConnell benefit resiliency. While preventing and eliminating defects is an obvious direct benefit, the advantages of creating high quality code also speed debugging of problems that cause failures and make adapting to new failure conditions easier.

In the next series of posts, I’ll assume that these foundational practices are being applied in the solution and turn our focus to a holistic approach to application design that considers interactions with other elements in the solution. These approaches are largely independent of each other but can be grouped into three general categories: instrumenting and reacting to operating conditions, coding for upstream and downstream component dependencies, and expecting the unexpected. Throughout these posts I’ll refer to “components” and “applications”. In this context, a component is an element of an application or solution that by itself may not serve a business function. For example, a component might be a logging subsystem in an application or a database server. Conversely, an application should be taken to be a collection of code (one or more components) that operates together to perform some function.

Instrumenting and Reacting to Operating Conditions

In order for an application to tolerate failures it must be able to measure and, in some cases, react to changes in operating conditions. If an application is unaware that a failure is occurring, there is little the application will be able to do itself to prevent or minimize the effects of the failure. Therefore, it is important that the application has the ability to measure operating characteristics that may affect its operation. These operating conditions include the number of users on the system, awareness of failures of “downstream” components (other applications or components on which the application is dependent), availability of critical resources, the rate of change of a particular operating characteristic, and so on. In some cases, the application may not be able to react, but the awareness of such a condition may enable logging of useful information. In other cases, the application may be able to incorporate “autonomic” responses to attempt to mitigate the failure. In this post, I’ll discuss approaches to implementing these types of features.

Instrumenting and Responding to Load

One of the most important factors affecting an application’s resiliency is the load it is under or volume of requests it is receiving. In a user interface oriented application (like a client/server or web application), this may be measured in the number of active users, sessions, or connections. In a more batch oriented, data processing application, this may be the workload waiting for processing. (Integration or middleware applications also experience this condition, but we will address integration resiliency in detail in future posts.) Whatever the primary measure is for the load on your application, it must be measured.

One important distinction to make when measuring load is to measure what matters. A client/server application with 100 clients sitting at a “login” screen is not the same as 100 users actively using the application after logging in. Likewise, an online banking web application with 100 customers browsing static marketing pages on the site is not the same as 100 authenticated customers viewing account details and making bill payments. This is why measuring load is an application design and development concern – merely counting the number of open TCP/IP connections or active sessions on an application server is not accurate for measuring the actual impact to your application.

As a result, you must make provisions for instrumenting load. It may be valuable to consider using a singleton pattern to create a common object for storing this kind of data. You can then create a simple counter that is incremented when an event occurs that adds load to the system and decremented when load is taken away from the system. For example, in a J2EE application using a typical model-view-controller pattern, the controller handling login requests could call a method to increment the counter when a login is successful. Upon logoff or a session timeout event, a similar controller would decrement the counter.
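A minimal sketch of such a counter, written in Ruby for brevity rather than J2EE. The class name LoadCounter and its method names are assumptions made for illustration; the mutex is there because login and logoff events may arrive on different threads.

```ruby
require 'singleton'

# Hypothetical process-wide load counter for one application instance.
class LoadCounter
  include Singleton

  def initialize
    @count = 0
    @lock = Mutex.new # guards the counter against concurrent updates
  end

  # Call when an event adds load, e.g. a successful login
  def increment
    @lock.synchronize { @count += 1 }
  end

  # Call when load goes away, e.g. a logoff or session timeout;
  # the counter is never allowed to drop below zero
  def decrement
    @lock.synchronize { @count -= 1 if @count > 0 }
  end

  def current
    @lock.synchronize { @count }
  end
end

counter = LoadCounter.instance
3.times { counter.increment } # three successful logins
counter.decrement             # one logoff
puts counter.current
```

Because the counter lives in a singleton, any controller in the same instance can reach it without passing a reference around.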

One consideration with this type of approach is the scope of what is being measured. In many cases, the application can only measure how many events are occurring within its local instance. For example, in a client/server application you may have several different server instances load-balancing traffic and each instance is only aware of logins on that instance. In this scenario it is necessary to aggregate the load information from each server instance. It is generally better to use application code to do this aggregation rather than trying to store and operate on an aggregated counter in a single location (like a database table). While the database approach is tempting because of its centralized location and ease of access, the frequency of updates to this value makes the risk of a deadlock or race condition affecting the accuracy of the measurement and the performance of the application generally too high.

In a client/server application, one possible approach is to expose an administrative RPC call to get the number of active users. In a web application, each application server may have a servlet that returns the current count from the singleton object with the counter. It is then possible to write some code that calls each individual server instance and totals the number of active users. While aggregating the total number of users across all instances may seem frustrating, there is tremendous value in having the load metric at an instance level. By monitoring volume on an instance-by-instance basis, it is very easy to detect situations where there is an imbalance in load. This can be an early warning of a load balancing/workload management problem, or it may bring awareness to external factors that degrade the effectiveness of a particular load balancing scheme. (Having the load metric available on an instance-by-instance basis of an application server would help identify the fact that load is skewed to one particular server or a set of servers sharing a load-balancer. This type of information is very helpful to diagnose the root cause of failures related to DNS or load-balancing failure modes.)
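The aggregation and imbalance check can be sketched as follows. The instance names, the counts, and the 50% tolerance are all hypothetical; in practice the counts would come from calling each instance’s administrative endpoint, which is outside this sketch.

```ruby
# Hypothetical per-instance active-user counts collected from each server.
instance_loads = { "app1" => 240, "app2" => 255, "app3" => 610, "app4" => 15 }

# Total load across all instances
def total_load(loads)
  loads.values.inject(0, :+)
end

# Names of instances whose load deviates from the average by more than
# the given fraction (tolerance) of that average
def skewed_instances(loads, tolerance)
  return [] if loads.empty?
  average = total_load(loads).to_f / loads.size
  loads.select { |_name, load| (load - average).abs > average * tolerance }.keys
end

puts "Total active users: #{total_load(instance_loads)}"
puts "Possibly imbalanced: #{skewed_instances(instance_loads, 0.5).join(', ')}"
```

Keeping the per-instance numbers around, rather than only the total, is what makes the skew visible.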

By enabling each application instance to have an awareness about its local load it is also possible for the application to change its behavior based on load. If a particular application instance can handle no more than 500 active users, it may be preferable to allow those 500 users to continue using the application while denying access to any additional users until the load diminishes. While this may result in some failures for users trying to login, it is a far better experience for the users who were already using the application. (This assumes that the application performance would begin to degrade significantly when more than 500 users were using it, such that all users would pay a performance penalty or experience errors.) Alternatively, it may be that the application can continue to allow users in but may restrict access to certain functions that require significantly more resources than the rest of the application.

I’ll refer to this functionality as “gating”, as it is conceptually the same as a gate to an amusement park that can hold a limited number of people. Adding gating functionality is trivial once the load “counting” mechanism is in place. Rather than blindly incrementing the counter, the application first checks the current value of the counter and compares it to a configurable parameter indicating the maximum value before turning away users. If additional load will not exceed the maximum, the application continues normally. If additional load would exceed the maximum, then some other action is taken (either attempting to send the request to a server with more free capacity for load or generating an error to the user). The maximum value parameter is set by stress testing the application and monitoring the load metric to determine what level of load causes the application to malfunction.
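A minimal sketch of gating built on a counter, assuming a hypothetical LoadGate class with a configurable maximum. The check and the increment happen under the same lock so two simultaneous logins cannot both take the last slot.

```ruby
# Hypothetical gate that admits users only while capacity remains.
class LoadGate
  def initialize(max_load)
    @max_load = max_load # determined by stress testing, per the text
    @count = 0
    @lock = Mutex.new
  end

  # Returns true and counts the user if capacity remains; otherwise
  # returns false so the caller can redirect the request or show an error
  def try_enter
    @lock.synchronize do
      return false if @count >= @max_load
      @count += 1
      true
    end
  end

  # Call on logoff or session timeout
  def leave
    @lock.synchronize { @count -= 1 if @count > 0 }
  end

  def current
    @lock.synchronize { @count }
  end
end

gate = LoadGate.new(2)                    # tiny maximum, for demonstration only
results = 3.times.map { gate.try_enter }  # the third user is turned away
gate.leave                                # someone logs off
readmitted = gate.try_enter               # capacity is available again
puts results.inspect
```

What to do when try_enter returns false (send the request elsewhere or return an error) is a policy decision left to the caller.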

Exposing the load metric also enables monitoring of load on a near real-time basis. If an application has implemented gating, monitoring tools can be configured to alarm when volume is approaching maximum load. It is also possible to monitor the rate of change in load. Rapid changes in load may indicate some kind of failure (e.g. customers experiencing problems in their session and quickly logging back in) or something more nefarious like a denial of service attack.
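As an illustration, a monitoring script could compare two successive samples of the load metric. The sample format, the 90% warning ratio, and the rate threshold below are assumptions for this sketch, not values from any particular monitoring tool.

```ruby
# Change in load per second between two samples.
# Each sample is a hash with :at (seconds) and :load (current count).
def load_rate(previous, latest)
  elapsed = latest[:at] - previous[:at]
  return 0.0 if elapsed <= 0
  (latest[:load] - previous[:load]).to_f / elapsed
end

# Returns the list of alarm conditions triggered by the latest sample
def load_alerts(previous, latest, max_load, warn_ratio, max_rate)
  alerts = []
  alerts << :near_capacity if latest[:load] >= max_load * warn_ratio
  alerts << :rapid_change if load_rate(previous, latest).abs > max_rate
  alerts
end

previous = { :at => 0,  :load => 300 }
latest   = { :at => 60, :load => 460 }

# Gate maximum of 500 users, warn at 90%, alarm above 2 users/second
puts load_alerts(previous, latest, 500, 0.9, 2.0).inspect
```

Using the absolute value of the rate means a sudden drop in load, which can also signal a failure, raises the same alarm as a sudden spike.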

Many of the examples of gating I’ve discussed have defined load as the number of users, but applications that are not user or session based also benefit from this approach. If a data processing application must process a queue of files in a fixed length of time, the load may be the number of files and the maximum value may be the largest number of files that can be processed in time. A stateless web services application may measure the number of SOAP or REST requests over a short period of time. A number-crunching application may measure the size of a data set or be able to predict the complexity of the operation. Whatever the case, the fundamental approach is the same:

  1. Instrument the application to measure load
  2. Expose the load metric so that monitoring or operational tools have a near real-time measurement of load
  3. Determine the maximum load the application can handle without failing
  4. Implement gating functionality so the application can guard against unsafe loads
  5. Proactively monitor load to identify problems before they affect customers

In my next post, I’ll discuss developing code for effective logging, monitoring and troubleshooting.

The Promise of SOA Continued: Project Management and SDLC Practices to Utilize SOA

In my first post on the promise of SOA, I mentioned several constraining organizational factors that prevent full realization of the benefits of SOA investments. One of them was:

Project management methodologies and software development processes that are based on waterfall approaches to building software and usually highly dependent on integrated testing.

Large enterprises usually have well-established project management processes, often built around a waterfall approach to the software development lifecycle. These processes serve a very valuable purpose in a non-SOA environment, as these organizations generally have complex, interdependent systems with a large amount of change occurring in parallel. Rigorous, formal waterfall development processes help mitigate the risk of change in this type of environment. Unfortunately, these processes also inhibit the flexibility SOA promises.

Bolster Infrastructure Resiliency with Monitoring and Islands of Functionality

The last few posts have focused on specific elements of infrastructure. This post will address two cross-cutting concerns that affect the resiliency of all infrastructure components: monitoring and the use of “islands of functionality” to minimize the effects of failures and speed troubleshooting.

Monitoring

Monitoring the health of every component at every layer of a solution is absolutely mandatory to create a resilient solution. If your solution is not instrumented so that its health can be determined in near real-time, or so that alerts are generated automatically when its health changes, the solution is inherently not resilient – it’s that simple. As a result, defining the monitoring requirements of the solution is a critical part of solution design. In future posts I will discuss tools that help identify which aspects of a solution are important to monitor. From an infrastructure perspective, it’s important to know what monitoring capabilities exist and any gaps that need to be addressed before a solution is implemented. In particular, the following questions may help identify critical needs and gaps:

  1. What tools exist to monitor the health of the network and servers?
  2. Is it possible to tune thresholds in those tools so that they match dangerous thresholds in this solution?
  3. How can dangerous thresholds be identified so that they can be set correctly in production? (For example, in a test environment? Using production equipment before going live?)
  4. Will I be able to test my monitoring thresholds and alerting in a test environment?
  5. Do the monitoring tools generate alerts? If so, who receives them? How do the recipients of monitoring alerts know how to respond?
  6. What mechanisms exist to instrument applications for errors? (For example, file-based log monitors that can “watch” for exceptions or platform-specific monitoring frameworks like JMX.) Do the application developers know which conditions should trigger monitoring events? Is there a standard approach or specification for the developers to implement in their code when those conditions occur?
  7. Are component-specific tools available to monitor for conditions like long running transactions on a database or high queue depth in a messaging backbone?

Monitoring tools do not need to be expensive or elaborate. While many enterprise monitoring solutions are both expensive and elaborate, comprehensive monitoring can be accomplished without them. The health checks described in the workload management section are a good example; some simple scripting to invoke health checks and generate email alerts is inexpensive and ensures several failure modes are detected. Dashboards can be built using open source tools that quickly convey the health of the overall solution. Whatever technology is used, the goal is to have a single place to go to see the health of the solution. The view can be as simple as red/yellow/green indicators on a web page that shows every component’s health.

As we will see throughout future posts, improving the resiliency of existing solutions is dependent on having comprehensive data about health and where failures are occurring. It is extremely important to get monitoring right from the beginning when creating new systems.
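The red/yellow/green view mentioned above does not need much code. The sketch below assumes each health check has already reported a status; the component names and statuses are hypothetical, and running the checks and sending email alerts are left out.

```ruby
# Roll individual component statuses up into one overall status:
# any red makes the solution red, otherwise any yellow makes it yellow.
def overall_status(statuses)
  return :red if statuses.value?(:red)
  return :yellow if statuses.value?(:yellow)
  :green
end

# Render one line per component plus an overall summary line
def render_dashboard(statuses)
  lines = statuses.sort.map do |name, status|
    "#{name.ljust(12)} #{status.to_s.upcase}"
  end
  lines << "#{'OVERALL'.ljust(12)} #{overall_status(statuses).to_s.upcase}"
  lines.join("\n")
end

# Hypothetical results reported by simple health check scripts
statuses = {
  "web"       => :green,
  "app"       => :yellow,
  "database"  => :green,
  "messaging" => :green
}

puts render_dashboard(statuses)
```

The same roll-up could feed a web page or an alerting script; the point is a single place that answers whether the solution is healthy.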

Islands of Functionality

At various points in the last few posts I’ve discussed the value of arranging groups of functionality in ways that can be discretely controlled. The ability to segment hardware (and, by extension, the software that runs on it) provides flexibility to control how those components are used. The configuration of database listeners to segment users of the database was shown as a way to manage failures. Creating data center specific domain names to control session persistence provides better resiliency. These are all examples of a more general concept of “islands of functionality”.

An island of functionality is a small collection of components of a solution that can be managed as a unit. As an example, consider again the online insurance application deployed in the configuration shown in the figure below. Boxes whose labels begin with W are web servers, those beginning with A are application servers, and those beginning with DBL are database listeners. Note that two web servers are grouped together with two application servers. For any given web server there are only two possible application servers to which a request can be routed and for any given application server there are only two web servers from which a request could have originated. Two database listeners are configured on each database.

Insurance application using islands of functionality

This configuration creates islands of functionality in several ways. First, small groupings of web and application servers can be managed as a group. If A1 or A3 needs maintenance, only two web servers are affected rather than four (the entire data center) or eight (both data centers, if we wanted to enable all web servers to send traffic to all application servers). Second, if a problem occurs on any server in this environment, the scope of the problem should be limited to one quarter of the environment. For example, if unusual errors are occurring on A3 but the root cause appears to be some other component, it’s very likely it could only be W1, W3, A1, or DBL1 causing the problem. Third, this configuration allows us to separate individual customers from institutional customers by employing a configuration that routes individual customers to the odd numbered servers and institutional customers to the even numbered servers. Finally, as we will see in future posts, this kind of configuration also lends itself to easier operational routines such as software upgrades and configuration changes.
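One way to picture this is as a small lookup table. The sketch below models a single data center; the first island follows the A3 example in the text, while the makeup of the even-numbered island (including DBL2) is an assumption, since the figure itself is not reproduced here.

```ruby
# Islands for one data center. The second island is assumed for illustration.
ISLANDS = [
  { :web => %w[W1 W3], :app => %w[A1 A3], :db_listeners => %w[DBL1],
    :customers => :individual },
  { :web => %w[W2 W4], :app => %w[A2 A4], :db_listeners => %w[DBL2],
    :customers => :institutional }
]

# Every component that belongs to an island
def components(island)
  island[:web] + island[:app] + island[:db_listeners]
end

# The island that contains a given component, or nil if none does
def island_for(component)
  ISLANDS.find { |island| components(island).include?(component) }
end

# When a component misbehaves, the other members of its island are the
# only places the root cause is likely to be
def suspects_for(component)
  island = island_for(component)
  return [] unless island
  components(island) - [component]
end

# The island that serves a given type of customer
def island_for_customer(type)
  ISLANDS.find { |island| island[:customers] == type }
end

puts suspects_for("A3").join(", ")
```

The same table answers the maintenance question as well: taking an application server out of service only touches the web servers listed in its island.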

Infrastructure Resiliency, Server Hardware and Workload Management

Approaches to sizing and scaling server hardware vary from vendor to vendor and between distributed and mainframe technology, but some general principles apply. One area of frequent disagreement in some organizations is the decision between a few large-frame distributed systems or many smaller (commodity or near-commodity) servers. Google has proven the commodity hardware approach in the search domain, and this approach also works well in many other commercial domains where tasks are easily parallelized and not computationally intensive. Many commercial applications meet these criteria: rich internet applications, multi-user client-server applications, middleware/integration solutions and some batch processing applications can often be implemented in this way. Designing applications to work on small, commodity hardware has several benefits.

Server Hardware Resiliency

As with facilities, the more units of hardware across which processing is spread, the smaller the impact of any one of those units failing. While logical partitioning of large distributed systems enables simulating smaller systems, adding capacity to these systems can often be more intrusive than simply adding another commodity server to a rack. Another difference is cost: scaling large distributed systems becomes periodically expensive whenever an additional frame is needed, whereas the cost of scaling commodity hardware is roughly linear.

Another benefit to having many smaller units of hardware is the ability to use small groupings of hardware (and the software that runs on them) to pilot new functionality or changes. Alternatively, hardware and software can be segmented to insulate different groups of customers from affecting each other. (This is particularly useful when a solution is providing a shared service to different internal business lines, external customers that have dramatically different value to the business and are stratified by value, and so on.)

A Fortune 100 insurance company may have an online self-service application used by individual insurance customers with relatively small policies and institutional customers who pay very expensive premiums. The business may decide that individual customers are tolerant of the site being unavailable but institutional customers demand continuous availability. Having two completely isolated groupings of components that are used only by one type of customer provides much more flexibility. This idea of “islands of functionality” has other benefits – we’ll revisit the topic shortly.

Commodity hardware can also be easier to manage than large-frame distributed systems. There is some overhead associated with managing more devices, but automating common management tasks (such as software deployments, configuration changes, and server maintenance tasks) minimizes this overhead, with the added benefit of minimizing the likelihood of server configurations drifting out of sync with each other. The skills needed to manage commodity hardware are also easier to find than familiarity with “big-iron” distributed systems.

When considering the size of hardware to use for a particular solution, it’s important to consider how the application will affect the physical deployment. A relatively stateless application may scale very well across many small servers by leveraging simple load balancing techniques, whereas a complex, stateful application may benefit from larger servers. Physical deployment options may, in turn, affect how an application is designed. In the case of the complex, stateful application, the effect of the application’s requirements on the physical topology may reveal that it is better to maintain the state of the customer’s session in a large distributed caching solution. Again, this highlights the benefits of a holistic approach to thinking about resiliency while designing the solution. The concept of leveraging a distributed cache will be addressed in a future post and may further influence decisions about server hardware.

Commodity hardware isn’t appropriate for every scenario, so identifying the ideal hardware needs to be an explicit consideration during the design process. There may be cases where commodity servers can be used for tasks that horizontally scale nearly linearly (like presentation, application and middleware servers) but larger hardware is needed for database servers or computationally intensive analytics that cannot be parallelized.

Workload Management

The design and implementation of workload management features greatly influence the resiliency of a solution. There are several mechanisms that can be used to manage workload: load balancers are one of the most common, but configuring application, database and messaging components to naturally distribute load is equally important. Under normal operating conditions, these mechanisms and techniques provide the solution with a way to manage the load handled by a particular component, usually by attempting to distribute the load equally or to send it to the component that can satisfy a request the fastest. In failure modes, however, workload management solutions are the first line of defense in minimizing the extent of the failure.

The combination of DNS load balancers (“global load balancers”) and local load balancers (that distribute load within a LAN environment) is fundamental to the multi-facility implementations discussed in my previous post. The capabilities that many of these products provide for detecting failure are also critical to resiliency. For example, most load balancers have a variety of mechanisms (“health checks”) to determine the health of the resources they are load balancing across. These health checks vary in sophistication: at their most basic, a load balancer may ping a device, while more advanced implementations offer sophisticated scripting capabilities that can be used to look for specific content in the device’s response and adjust the load balancer’s behavior accordingly.

As a guideline, using basic ICMP or ping health checks is not sufficient. Many failure modes exist where a device’s TCP/IP stack is fully functional but the software that uses that stack is failing. As a result, every load balancer configuration should be tailored to obtain the most accurate information possible about the state of the components for which it is managing load. In many cases, this requires the creation of custom code within the application to give the load balancer information about the application’s health. Alternatively, custom health check code in the application tier can be used to indirectly monitor the health of other components like the database, messaging server, cache, and so on.

For example, consider “application A” that relies on the availability of two databases and a JMS connection to “application B” to operate correctly. The developer of application A creates a health check that performs a simple (very fast!) query against each of the databases and produces a test JMS message to application B. The developer of application B creates a similar health check capability that consumes the test JMS message, checks its own internal resources, and responds. This health check is used by the load balancer to periodically confirm that application A is operating correctly. The way the health check is created and exposed depends on the system: a J2EE application may expose a simple JSP page, a .NET application may expose a simple ASPX page, and a client-server application may provide a script that can be invoked over a terminal connection.
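The shape of such a health check can be sketched in a few lines. This is a minimal illustration in Python, not the JSP or ASPX page described above: the three check functions are hypothetical stand-ins for the real database queries and the JMS round trip, and the 200/503 status codes are an assumed convention for what the load balancer looks for.

```python
from typing import Callable, Dict, Tuple


def run_health_check(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check; return an HTTP-style status and per-check detail."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that raises is treated as a failed dependency.
            results[name] = f"error: {exc}"
    healthy = all(state == "ok" for state in results.values())
    return (200 if healthy else 503), results


# Hypothetical stand-ins for the real probes used by application A.
def check_database_1() -> bool:
    return True  # e.g. a trivial query with a short timeout


def check_database_2() -> bool:
    return True


def check_jms_to_application_b() -> bool:
    # Simulates application B never answering the test message.
    raise TimeoutError("no reply from application B")


status, details = run_health_check({
    "database_1": check_database_1,
    "database_2": check_database_2,
    "jms_application_b": check_jms_to_application_b,
})
print(status, details)
```

Reporting each dependency separately, not just a single pass/fail, is a design choice that makes the same output useful to monitoring tools as well as to the load balancer.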

The utility of this health check cannot be overstated. The load balancers that use these health checks are now far more informed about the internal health of the application than a simple ICMP check would allow. The health check can also provide status to monitoring tools and generate alerts to support teams when a failure occurs. Real-time management and reporting dashboards can be created to show the operating condition of the solution. The response time of the health check is often another indicator of health. Even if all of the checks in the example above are working, there is a significant difference between the health check responding in 0.1 seconds and 1.1 seconds. This type of response time trending can act as an early warning system to alert support staff to an impending failure, or be used to automatically route traffic away from a particular device.
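Treating response time as a health signal might look like the sketch below. The two thresholds are assumptions chosen only so that the 0.1 second and 1.1 second responses mentioned above land in different states; real values would come from measuring the solution.

```python
import time

# Assumed thresholds, for illustration only.
WARN_AFTER_SECONDS = 0.5
FAIL_AFTER_SECONDS = 2.0


def classify(check_passed: bool, elapsed_seconds: float) -> str:
    """Combine the pass/fail result with how long the check took."""
    if not check_passed or elapsed_seconds >= FAIL_AFTER_SECONDS:
        return "failed"
    if elapsed_seconds >= WARN_AFTER_SECONDS:
        return "degraded"  # working, but slow enough to warn support staff
    return "healthy"


def timed_health_check(check):
    """Run a check callable and classify it by result and duration."""
    start = time.monotonic()
    passed = bool(check())
    elapsed = time.monotonic() - start
    return classify(passed, elapsed), elapsed


print(classify(True, 0.1))  # healthy
print(classify(True, 1.1))  # degraded
```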

These types of health checks are also easy to extend to provide additional operational controls to support teams. For example, it is trivial to add logic to the health check above to look at a file on the local file system of the application to determine if support teams may want to stop traffic from being routed to a server even though the application is healthy. This type of functionality can be used to route traffic away from servers for maintenance purposes. It could also be used to cause a global load balancer to stop distributing the IP address of a group of locally load balanced servers but permit customers who already have the address cached to complete their work. The flexibility provided by this feature is extremely valuable and should be part of every solution – there is virtually no reason not to do it.
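A sketch of the file-based control described above, assuming the load balancer treats a 503 response as "stop sending traffic here". The function name and the flag file name are hypothetical.

```python
import os
import tempfile


def health_status(application_healthy: bool, drain_flag_path: str) -> int:
    """Report unhealthy whenever the operator's flag file exists."""
    if os.path.exists(drain_flag_path):
        return 503  # deliberately taken out of rotation, e.g. for maintenance
    return 200 if application_healthy else 503


# Demonstration using a temporary directory in place of a real server path.
with tempfile.TemporaryDirectory() as directory:
    flag = os.path.join(directory, "out_of_rotation")
    before = health_status(True, flag)   # no flag file yet
    open(flag, "w").close()              # support team creates the file
    during = health_status(True, flag)
    os.remove(flag)                      # maintenance finished
    after = health_status(True, flag)

print(before, during, after)
```

Because the control is just a file, support teams can take a server out of rotation without touching the load balancer configuration or restarting the application.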

As important as load balancers and health checks are to workload management, they are just one factor in the resiliency of an application. Understanding (and, when possible, influencing) the application’s requirements with respect to persisting connections is also very important. Consider a web application that manages session state only on local servers. Session state is not replicated across every application server running the application. Let us assume for the moment that the session management configuration is deliberate because of performance concerns or because the technology used doesn’t support replication and cannot be changed. How might this influence our workload management approach?

To solve the problem, we need an approach to keep a session in one data center from start to finish. Since the behavior exhibited by ISP proxies and DNS caching makes this difficult in our current configuration, we need to eliminate the possibility of a session jumping from one data center to the other. One possible solution is to create “data center specific” domain names. Let us assume that the customers come to the domain www.highlyresilientsite.com. The application could generate links from the landing page of that domain that direct the customer’s browser to www1.highlyresilientsite.com or www2.highlyresilientsite.com which correspond to data center A and B respectively. Even if the customer’s ISP suddenly load balances the customer to another proxy server, the session will remain in the same data center. This affects our DNS configuration but it also places a new requirement on the application: the ability to “know” which data center domain names are valid and should be served. (This will be a recurring theme: resiliency requires a holistic approach to design!)
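The application-side requirement can be sketched as follows. The host names come from the example above; the mapping structure, the function name, and the use of https are assumptions made for illustration.

```python
# Data-center-specific host names from the example above.
DATA_CENTER_HOSTS = {
    "A": "www1.highlyresilientsite.com",
    "B": "www2.highlyresilientsite.com",
}


def session_link(data_center: str, path: str, active_data_centers: set) -> str:
    """Build a link that keeps the customer's session in one data center."""
    if data_center not in active_data_centers:
        # The application must "know" which data center domains should be served.
        raise ValueError(f"data center {data_center} is not currently serving traffic")
    return f"https://{DATA_CENTER_HOSTS[data_center]}{path}"


print(session_link("A", "/account", {"A", "B"}))
```

The set of active data centers would need to come from configuration or from the same health information the load balancers use, which is exactly the new requirement placed on the application.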

Another aspect of workload management is the design of messaging. Messaging infrastructure generally needs to work in collaboration with the application workload management approach. For example, suppose the application tier of a solution is stateful and requires session persistence. If the application uses a messaging service (for example, IBM WebSphere MQ or Microsoft MSMQ) to communicate with another component that is also stateful, this implies a very different design than communication that is completely stateless. Part of the design process of a solution that involves messaging must be considering the nature of the messages being sent by the application and how the messaging service can be configured to provide resiliency without violating assumptions about the state of the component receiving the messages.

The configuration of databases also influences workload management. In some cases, managing workload on a database can be done by using replication or redundancy features to manage load and insulate components from affecting one another. A common approach to this is replicating data from a production database to a reporting database. Another pattern for managing load is to create multiple database listeners used for different segments of customers or purposes. Similar to our insurance company example of using distinct groups of hardware for different types of customers, controlling access to the database through multiple listeners provides flexibility in how the database is used. For example, assume a database used to support the insurance application had three listeners: one for use by a segment of application servers servicing individual customers, one for use by a separate segment of application servers servicing institutional customers, and a third for use when real-time reporting is needed. If the database experienced performance problems, administrators could simply disable the listener for real-time reporting to attempt to relieve some load and maintain availability for customers.

Resilient Network Infrastructure, Part 2

While the effects the network has on solutions are usually understood in general terms, specific details are often unavailable. One reason is the difficulty of replicating production network conditions in test environments. Subtle changes in network performance that are difficult to detect can have significant effects on the performance and stability of systems. Likewise, seemingly insignificant changes in the operating environment of a solution (such as changes in customer behavior or scheduled operations like backups) can have drastic effects on the network. Just as a sub-optimal route affected the performance of the system shown in the example in part 1, a change in network performance that introduced the same degree of latency would have the same effect.

Understand the Effect of the Network on Solution Resiliency

Problems frequently arise in new solutions because production networks are inherently more complex than the network used for testing: more devices, more variability in usage and performance, and often more distributed. One of the most important – and difficult – characteristics to understand is the effect of latency on a solution. The easy solution is also the most expensive because it requires building out an exact replica of the production network. Since this is often prohibitively expensive, creative solutions may be used to simulate production.

The most difficult network configuration to simulate is geographical distribution. If a solution on the production network will be distributed between data centers in Colorado and Massachusetts but you have only one test environment in Kansas, your test results will not accurately reflect production. One way to approximate the effect of this distribution is to mirror the production VLAN configuration. Assuming you have an application server VLAN in Colorado and another in Massachusetts, one option is to create two VLANs in Kansas and divide application servers between them. To accurately simulate performance, measure the production latency during peak network utilization then simulate that latency between the two VLANs. (Open source tools exist to do this at the network interface level and network appliances and traffic shapers can perform the same function at the switch level.) Database connections – including those that support replication – are particularly sensitive to this kind of latency, so if only one component can be tested in this manner, the database is usually the best bet.
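A little arithmetic shows why database connections are so sensitive to this kind of latency. The figures below are invented for illustration (20 sequential round trips per transaction, 2 ms of database work per trip, 1 ms of latency within one site and 45 ms between sites); they are not measurements.

```python
def transaction_ms(round_trips: int, work_ms_per_trip: int, network_rtt_ms: int) -> int:
    """Total time for a transaction whose round trips happen one after another."""
    return round_trips * (work_ms_per_trip + network_rtt_ms)


# Hypothetical numbers: the same transaction tested locally and across sites.
same_site = transaction_ms(round_trips=20, work_ms_per_trip=2, network_rtt_ms=1)
cross_site = transaction_ms(round_trips=20, work_ms_per_trip=2, network_rtt_ms=45)

print(same_site, cross_site)  # 60 940
```

Because every sequential round trip pays the latency again, a test environment without the simulated delay would report this transaction as more than fifteen times faster than it would be in the distributed deployment.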

Another common cause of network resiliency failures is the unexpected impact of firewalls. These effects usually fall into one of the following categories:

  • Unexpected latency introduced by firewall interfaces (a variation of the theme above)
  • Inconsistent rule configuration causing failures (usually closely correlated with poorly tested or managed infrastructure changes)
  • Unusual interactions between the firewall, network and application

While the first two issues are relatively straightforward, “unusual interactions” are (as you might imagine!) hard to predict. A real-world example is in order to illustrate how these unusual interactions may manifest themselves. If an application makes a network connection that traverses a firewall but is not very “chatty”, the firewall may time out the connection even though the application believes the connection is still alive. This may lead to intermittent problems that are difficult to diagnose: database connections that fail unexpectedly, network mounted file systems that suddenly become unavailable, and so on. Frequent activity on the system masks these types of failures, so in many cases this type of problem will crop up during periods of low utilization. It is difficult to predict exactly where and when this type of problem will occur, so a good practice when working on the physical deployment of a solution is to analyze all of the network devices that critical connections traverse and ask what may lead to those connections being unexpectedly terminated. Ways to approach this problem will be discussed in a future blog post.
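One commonly used mitigation, shown here only as a sketch and not necessarily the approach a later post will take, is TCP keepalive: the operating system sends small probes on an idle connection so that the firewall continues to see activity. The timing values are assumptions and would need to be shorter than the firewall’s idle timeout; the finer-grained options are platform specific, so the sketch sets them only where Python exposes them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_seconds: int = 300,
                     interval_seconds: int = 60, probes: int = 5) -> socket.socket:
    """Ask the OS to probe an idle connection so it is not silently timed out."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timing options are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


# Demonstration on an unconnected socket: only the option itself is checked.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(keepalive_enabled)
```

Many database drivers and connection pools offer their own equivalent settings, which are usually the better place to configure this.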

One way to prevent these problems from reaching production is to test with the same firewall configuration as production. Software-based firewalls are usually inexpensive and easy to simulate in test environments, but network appliance (dedicated hardware-based) firewalls can be expensive to purchase and maintain in a test environment. Having a test configuration that mirrors production is valuable because it also enables infrastructure teams to test firewall rules before they are implemented in production. While substituting a software firewall for a network appliance firewall in a test environment is better than nothing at all, different firewall products behave differently and such a configuration doesn’t enable true testing of firewall rules.

Network Performance Is Not Static

Even the most comprehensive analysis of the effect of the network on a solution can be undone by the dynamic nature of production networks. User volumes grow, backup schedules change, new devices are added to the network and suddenly a database transaction that took 100ms when the solution went live takes 300ms only a few months later. While 200ms may not sound like much, a three-fold increase in response time at the database layer can ripple through the system: application threads spend longer in a wait state, server resources are held for longer periods, and the resulting resource starvation and timeouts cause failures.
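The ripple effect can be made concrete with Little’s law: the average number of requests in flight equals the arrival rate multiplied by the time each request takes. The 100ms and 300ms figures are from the paragraph above; the request rate and thread pool size are assumptions picked for illustration.

```python
def busy_threads(requests_per_second: int, latency_ms: int) -> float:
    """Little's law: average requests in flight = arrival rate x time in system."""
    return requests_per_second * latency_ms / 1000


# Assumed workload: 200 requests per second against a pool of 50 worker threads.
POOL_SIZE = 50
at_go_live = busy_threads(200, 100)    # 20.0 threads busy on average
months_later = busy_threads(200, 300)  # 60.0 threads needed, more than the pool holds

print(at_go_live, months_later, months_later > POOL_SIZE)
```

With nothing else changing, the same traffic that once used less than half of the pool now needs more threads than exist, so requests queue and begin to time out.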

Early warning of changing network conditions is a necessity. As we will see in future posts, understanding changes in the environment is an element of operational resiliency. In the spirit of an integrated approach to resiliency, it is not enough to assume that a network operations team knows what a particular solution requires to be resilient. This is the benefit of a holistic approach to resiliency: through the analysis and testing described above and later in this book, the resiliency practitioner can determine what parameters influence the resiliency of a solution and what ranges of performance are acceptable. This information will enable operations teams and monitoring applications to know when (and, in many cases, before) a system is approaching a dangerous threshold.

Flexible IT, SOA and Solving the Real Problem

Service oriented architecture (SOA) is going to save us all.  We all know the drill:  Faster time-to-market.  Lower total cost of ownership.  Loose coupling makes integrating or re-wiring existing capabilities faster and adding new features easier.

Except that in most large companies, just doing SOA doesn’t do any of these things.  In reality, sub-optimal time-to-market and high operational costs are caused by many factors; older approaches to system integration are just one of them.  While an SOA approach certainly helps, it isn’t a silver bullet.  It’s easy for technologists to get caught up in the promise of SOA as a solution to common IT challenges.  Even active SOA practitioners in growing companies believe – with good reason – that their existing SOA approach will scale with growth.  Most large companies have one or more of the following constraining factors that will limit the success of narrowly focused SOA initiatives:

  1. A lack of maturity with governing shared technology services.
  2. Project management methodologies and software development processes that are based on waterfall approaches to building software and usually highly dependent on integrated testing.
  3. Legacy systems that are difficult to service-enable simultaneously.
  4. Infrastructure delivery constraints that make adding or changing hardware time-consuming.
  5. Testing methodologies that do not take advantage of a highly service-oriented environment.
  6. Technology solutions that require a mixture of technology platforms with varying capabilities for development methodologies (e.g. waterfall vs. agile) supporting many different projects within the organization.

It is very difficult for SOA to deliver on its promises when one or more of these conditions exist within an organization.  Furthermore, changing these conditions while also trying to become more service oriented can add time and cost to the effort.  If the stakeholders and sponsors of an SOA initiative have unrealistic expectations about the process, these added wrinkles can result in an otherwise beneficial project being canceled.

While most companies have already jumped on some form of the SOA bandwagon, many are not realizing the benefits because these fundamental challenges have not been addressed or even considered.  What most companies really want from SOA is a flexible IT environment.  To achieve this flexibility, more than SOA is needed.

I’ll explore each of these in more detail over a series of blog posts, beginning today with governance.

Lack of Maturity in Governing Shared Technology Services

Consider the problem that SOA intends to solve: proliferation of varying technologies implementing different standards that make integrating and changing these “solutions” slow and difficult.  The intent of SOA is to decouple service consumers from service providers, abstract knowledge of the implementation of the service from the service’s consumers, and provide common methods for interacting between consumers and providers.  In doing so, SOA must be (or become) a shared service to the entire organization, not something individual SOA practitioners create in silos.

Unfortunately, implementing SOA without an SOA governance program just recreates the original problem.  A practical, pragmatic approach to governance solves problems like:

  • Setting a direction for how services are implemented. Should services be built using SOAP or RESTful implementations?  If SOAP is used, what WS-* standards are all service providers and consumers expected to support?  Can service providers assume all consumers will implement WS-Security?  For either SOAP or RESTful implementations, even basic questions like the use of HTTP vs. HTTPS can create road-blocks to adoption.
  • Standardizing schemas, data and message formats. The whole point of SOA is to enable a variety of service providers to expose their operations in a common way.  While SOAP/XML and REST provide a mechanism to do this, message format differences can be a significant barrier to reuse.  Consider the following:
    • Service Provider A implements a complex type called “Address” that has one address line, a city, state and five-digit zipcode
    • Service Provider B implements the same complex type using two address lines, a city, state/province, ten-byte alphanumeric postal code, and country code
    • Service Provider C provides a service that relies on both service providers A and B to operate on data including addresses
    • Integrating these “service oriented” applications is no easier than if we were trying to do an EDI integration or a CORBA integration or a COM+ integration or a COBOL integration; in fact, it may even be worse because the business partners may believe that SOA should’ve fixed this problem.
  • Ensuring consistent service/operation packaging and granularity. Consider an example similar to the one above, except now service providers vary widely in the granularity of the services or operations they provide.  One provider always operates on a “customer” entity (encompassing all attributes of the customer) while another always operates on individual customer entity attributes.  Again, the reuse of SOA services is undone because the services are incompatible in implementation.
  • Preventing duplicate or overlapping services from being created. Yet another variation on the above theme: two service providers create very similar services.  Both may do something as simple as performing an update to some customer data, but one or two fields differ between the two interfaces.  Why maintain two heavily overlapping services?  In most cases, one service should exist and simply be enhanced to add the missing fields.
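The “Address” mismatch in the list above can be sketched to show the kind of glue code Service Provider C ends up writing. The field names are hypothetical, as is the choice to reject anything that does not fit Provider A’s format; the point is that the conversion is lossy in one direction.

```python
import re
from dataclasses import dataclass


@dataclass
class AddressA:
    """Provider A: one address line and a five-digit zip code."""
    line1: str
    city: str
    state: str
    zipcode: str


@dataclass
class AddressB:
    """Provider B: two address lines, state/province, postal code, country code."""
    line1: str
    line2: str
    city: str
    state_province: str
    postal_code: str
    country_code: str


def b_to_a(address: AddressB) -> AddressA:
    """Convert B's format to A's, refusing addresses that A cannot represent."""
    if address.country_code != "US" or not re.fullmatch(r"\d{5}", address.postal_code):
        raise ValueError("address cannot be represented in Provider A's format")
    # A has only one address line, so the second line has to be folded in.
    line1 = f"{address.line1}, {address.line2}" if address.line2 else address.line1
    return AddressA(line1, address.city, address.state_province, address.postal_code)


domestic = AddressB("1 Main St", "Suite 4", "Denver", "CO", "80202", "US")
print(b_to_a(domestic))
```

Every consumer that touches both providers needs its own copy of this translation, along with a decision about what to do with the addresses that do not fit. A shared schema agreed through governance removes the need for it.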

SOA governance cannot and should not be draconian.  The objective of governance is not to implement rules from an ivory tower or deliberate on academic issues of building systems using SOA techniques.  Good SOA governance can be measured by how well it promotes service reuse, how much (or little) service duplication exists, and how time-to-market for projects using shared services trends over time (SOA is a long-term investment).  Practical SOA governance comes in the form of:

  • Reasonable, attainable standards with respect to how services are implemented:
    • SOAP vs. REST
    • Practical use of WS-* standards where they add value
    • Common and consistent schemas
  • A system of incentives for compliance with governance policies
  • A system of record (like a UDDI registry/repository) for recording which services exist and who consumes them
  • Agreed upon metrics and goals that quantify the quality of the governance system and the compliance of services
  • Sponsorship from technology and business leadership that governance is important

I’ll continue to explore the other constraining factors and how to address them in future blog posts.  While the challenges are significant, successfully meeting those challenges enables organizations to realize the full potential of SOA.

Resilient Network Infrastructure

The network is a critical resource to nearly every enterprise IT solution. While many network resiliency considerations are inherently part of modern networking, some of these features actually undermine resiliency. For example:

  • the availability of multiple routes can sometimes introduce unpredictable behavior in applications if a preferred route is unavailable and alternate routes have higher latency;
  • switch ports and server network interfaces that have worked correctly in “auto-negotiate” mode suddenly stop working;
  • unintentional asymmetric routing results in packets trying to return to the customer through a firewall that never saw the incoming request, causing the packets to be dropped.

In this section, we’ll examine how to ensure that the network layer of new solutions is built for resiliency and avoid these types of problems.

Consider the Resiliency and Effects of Failure of “Core” Network Services

Some network services fail so infrequently (and are so catastrophic when they do fail) that they are rarely considered explicitly as a failure mode. DNS is one such service: most architects, software developers and server administrators are so conditioned to DNS working that they never consider what happens when it fails. In many cases, DNS is such a foundational service that a failure of the service is fatal to the solution. Insulating against DNS failures requires a two pronged approach.

First, it’s important to understand the resiliency of the DNS service. How much redundancy exists in the DNS infrastructure? How frequently has it failed in the past? Is the scope of a DNS failure limited in some way such as location on the network or zone?  In a future post, I’ll discuss some tools that can be used to help assess the resiliency needs of a solution and assess failure modes.  If the network team supporting DNS cannot commit to the level of availability required for the solution, or if previously observed failure rates exceed the tolerance of the solution, the DNS service needs to be improved.

Second, the solution must be analyzed to determine which services, components and transactions will fail if DNS is not available. This can be a difficult task because DNS is generally assumed to “just work”, so identifying every place that relies on DNS can be time consuming. One approach to quickly identify DNS dependencies is to intentionally misconfigure the DNS settings on a machine-by-machine or component-by-component basis in a test environment. Since DNS failures usually cause such widespread failures, performing this kind of “negative test” in a very controlled fashion on small pieces of the solution at a time is usually much more manageable. It’s also important to remember that DNS dependencies are not just introduced in application code; web, application and database server configurations often rely on DNS for their internal operation.
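A static scan of configuration can complement the negative test by listing the endpoints that are configured by name, not by IP address. This is only a sketch: the setting names and URLs are invented, and it will not find host names embedded in code or in server-internal configuration, which is why the negative test is still needed.

```python
import ipaddress
from urllib.parse import urlsplit


def needs_dns(endpoint_url: str) -> bool:
    """Return True if the endpoint's host is a name, not an IP literal."""
    host = urlsplit(endpoint_url).hostname
    if host is None:
        raise ValueError(f"no host found in {endpoint_url!r}")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal, so it must be resolved
    return False


# Hypothetical configuration values for illustration.
settings = {
    "database_url": "postgresql://db01.internal.example:5432/policies",
    "cache_url": "redis://10.0.4.17:6379/0",
    "broker_url": "tcp://mq.internal.example:1414",
}

dns_dependencies = sorted(name for name, url in settings.items() if needs_dns(url))
print(dns_dependencies)  # ['broker_url', 'database_url']
```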

Ultimately, it may be infeasible or impractical to eliminate dependencies on DNS. However, understanding how DNS-related failure modes manifest themselves is useful for quickly identifying a DNS problem when it occurs. There is a saying in the medical field that is apropos: “When you hear hoofbeats, think horses, not zebras.” This is how most technicians successfully troubleshoot failures. A DNS failure is a zebra, and in the rare instances when DNS failures occur, it can take hours to identify the root cause of the problem if operations teams aren’t familiar with the symptoms.

Ironically, the resiliency provided by DNS can present problems in a way unrelated to the availability of DNS itself. Many applications will cache DNS responses or ignore DNS time-to-live (TTL) settings. Cached DNS responses can cause significant problems when DNS load-balancing solutions are used to respond to failures in the environment, because clearing the cache sometimes requires restarting a process, often resulting in even more failures. Applications that fail to honor DNS TTL experience a similar failure and are sometimes outside of a technologist’s control. One example of this is ISPs that have proxy servers configured to override DNS TTL. Assume that a global DNS load balancer is load-balancing mydomain.com between two IP addresses with a TTL of zero. It is tempting to assume that removing one IP address from being returned will immediately stop new requests from hitting that address, but requests may continue for quite some time.
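The difference between a client that honors the TTL and one that ignores it can be shown with a small simulation. Nothing here performs a real DNS lookup: the resolver is a stand-in dictionary, the class is hypothetical, and the addresses come from the range reserved for documentation.

```python
class DnsCache:
    """A tiny client-side cache that can either honor or ignore record TTLs."""

    def __init__(self, resolve, clock, honor_ttl=True):
        self._resolve = resolve      # callable: name -> (address, ttl_seconds)
        self._clock = clock          # callable returning the current time
        self._honor_ttl = honor_ttl
        self._entries = {}

    def lookup(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None:
            address, expires_at = entry
            # A client that ignores TTL never considers its entry expired.
            if not self._honor_ttl or now < expires_at:
                return address
        address, ttl = self._resolve(name)
        self._entries[name] = (address, now + ttl)
        return address


# Stand-in for the global load balancer, answering with a TTL of zero.
published = {"mydomain.com": "192.0.2.10"}


def resolve(name):
    return published[name], 0


def clock():
    return 0.0  # time does not need to advance for this demonstration


honoring = DnsCache(resolve, clock, honor_ttl=True)
ignoring = DnsCache(resolve, clock, honor_ttl=False)
honoring.lookup("mydomain.com")
ignoring.lookup("mydomain.com")

# The load balancer stops returning the failed address.
published["mydomain.com"] = "192.0.2.20"
fresh = honoring.lookup("mydomain.com")
stale = ignoring.lookup("mydomain.com")
print(fresh, stale)  # 192.0.2.20 192.0.2.10
```

The client that ignores the TTL keeps sending requests to the withdrawn address until its process is restarted, which is the behavior described above.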

Routing protocols present another unique challenge to resiliency. In most cases, routing protocols ensure that a route is available between the source and destination of a connection. This does not guarantee that the available route should actually be used, however. For example, consider the scenario shown below, a simplified web and application server configuration hosted in two data centers. For the sake of clarity, global and local load balancers have been omitted from the diagram, as have redundant web and application servers in each data center.

Network routing using preferred (solid lines) and sub-optimal (dashed lines) alternate route

Assume that customer requests are routed to the web server in the internet facing DMZ and that the web server acts as a proxy for the application servers on the internal network. If the customer’s request enters data center A, the preferred route from web server 1 to application server 1 is shown in solid black lines. Let us also assume that the alternate route shown in the dashed line is available, but is suboptimal – it involves several more network hops and additional latency.

As long as the preferred route is available, this network topology works as expected. Now consider a failure mode where the connection between the internet DMZ and internal network in data center A fails (for example, because an incorrect firewall rule is created). The alternate route between web server 1 and application server 1 becomes active, routing the request through data center 2 and adding significant latency to the request. Because requests are taking longer, active (open) connections to the web server accumulate, which causes performance problems and possibly causes some connections to be denied as the web server runs out of resources. This failure mode results in many customers having a degraded experience due to the latency between the web and application server, and it will eventually cause failures.

If the suboptimal alternate route had not been available, the connection between the web and application server would have been broken immediately. As we will see in a future post, workload management tools like load balancers combined with intelligent “health checks” in the application can quickly detect this condition and stop traffic from being routed to the failing components. In the scenario where web server 1 cannot connect to application server 1, “fast failure” is preferred to the prolonged degraded connectivity condition that exists when traffic is routed through data center 2.

This is another example of simple being better. Intelligent routing adds complexity without reducing the risk of failure and probably increases the amount of time needed to troubleshoot the problem. The degraded connectivity condition seen above is an example of a problem we will see frequently that I call “sick but not dead”. This class of failure is always problematic: it can trick monitoring that should detect failures into believing the system is healthy, and it makes identifying the root cause of a failure much more difficult.

Network redundancy solutions can also cause problems at the server and network interface. Different operating system vendors use different terms, but most operating systems allow for the pairing of network interfaces for redundancy. (In Solaris this is called IPMP; in Linux, “teaming”; on AIX, “Etherchannel” or “Network Interface Backup”.) All of these solutions are valuable to improve resiliency as long as care is taken to ensure that the paired network interfaces are cabled to physically distinct switches. All too often, redundant, paired network interfaces are cabled to the same switch even though redundant switches are available with a shared VLAN. Similar to the testing of redundant PDUs by unplugging server power supplies, redundant network interfaces should be tested by unplugging network cables. Again, organizing such a test is difficult after a server is in production, so always plan for failure testing before a solution goes live.

DNS, routing protocols and redundant network interfaces are instructive examples of why it is important to consider core network services that may otherwise go unnoticed. When building a new solution, it is very important to examine the configuration of the network itself and question whether network resiliency features will improve the solution. Wherever possible, simplify the network configuration and ensure that sick but not dead scenarios are avoided in favor of fast failure.