Production behaviour testing
Thursday, January 28th, 2010
I’ve been thinking about monitoring production sites and applications for a while now. Network equipment hands out more stats than you can shake a stick at over SNMP, and that’s been the basis of many a successful monitoring setup.
Unfortunately the state of the network does not directly help if you’re really trying to understand issues that an end-user may be having. So we all got on board with monitoring the state of the machines the apps are running on, tracking memory, CPU, interrupts and so on.
Bad app, good system
Oddly enough, this doesn’t help too much either. One of the first profiling jobs I did for a customer consisted of collecting months’ worth of stats from an ICL machine running SVR4, and throwing it all into a spreadsheet to analyse. Luckily it graphed easily, and we could see memory maxing out as users came online, recovering partially at lunchtime, and completely at the end of the day. Swap space followed the same pattern, so the customer bought more memory (which wasn’t a cheap option in those days). That prevented the machine from dipping into swap, but didn’t help the application performance. Other indicators seemed to show that the system was in a reasonable state, except that load average was high (which in those days was basically IO wait related).
Eventually I called on friends in the ICL performance centre; they knew the application in question to be troublesome, as it had been naïvely ported from another platform and was basically just inefficient with resources; from memory I think it was distrusting the filesystem to do its job and repeatedly opening & closing files for frequent small writes. No tuning on the host could accommodate that (although these days I’d be tempted to give the application a virtual filesystem, or something).
Of course, we had insufficient resources (tools, time, money & indeed knowledge) to examine this further, leaving the customer’s problems unresolved.
When do we debug an app?
Now, with access to the source and a debugger, or possibly these days with just DTrace in hand, we can prod and poke at the internals of a running application to see just what is happening. But we only do this when enough users are unhappy enough to complain. What we should be doing is monitoring the performance of applications themselves, as they are running on real production systems. That way we can see when acceptable performance turns into unacceptable performance, as the real environment around the app changes.
Given that a good application release should include simulated load testing, our interest should lie in the actual user-experienced performance on the live production systems. This is affected by all sorts of factors that are not under the control of the app: database delays, other processes on the server hogging resources …
Web application performance use-case testing
As an extension of the current thinking about Behaviour Driven Infrastructure testing (go and read up on Lindsay Holmwood’s thoughts on this, thanks for your presentation at LCA2010), we should be looking at driving use-case tests through the production infrastructure, treating it as a black box (the way the user does), and gathering performance stats on each step.
I’ve been looking at cucumber-nagios as a way to get started on this, given that Nagios is the canonical status monitoring tool, and the domain-specific language that cucumber uses is highly business-readable. However, although Nagios is often good enough as a performance collecting service, I need more granularity than a single test scenario. I want per-step timings …
Per-step performance
A typical cucumber test run of a web application looks like this :-
$ cucumber inodepages.feature
Feature: inode.co.nz
  It should be up
  And the home page should introduce Inode
  And the Contact page should have my PGP key fingerprint

  Scenario: Visiting home and contact pages # webhome.feature:6
    When I go to http://inode.co.nz # _steps.rb:1
    Then I should see "Inode is a small IT consultancy & services company based in Dunedin NZ. " # _steps.rb:1

  Scenario: Visiting the contact page via the home page # webhome.feature:11
    When I go to http://inode.co.nz # _steps.rb:1
    And I follow "Contact" # _steps.rb:9
    Then I should see "B50F\302\240BE3B\302\240D49B\302\2403A8A\302\2409CC3 8966\302\2409374\302\24082CD\302\240C982\302\2400605" # _steps.rb:1

2 scenarios (2 passed)
6 steps (6 passed)
0m0.790s
This gives a total time value, which fits in with performance targets, but it doesn’t show us which individual steps have changed over time. Looking into the guts of the cucumber steps (which are inherently unique per project), it seems reasonable that we can record our own timing values. Here’s a part of a steps file from a cucumber-nagios project :-
When /^I go to (.*)$/ do |path|
  visit path
end

When /^I submit the form named "(.*)"$/ do |name|
  submit_form(name)
end
The visit path and submit_form calls are methods of Webrat. It’s not especially difficult to wrap each of these calls with some extra code to collect the delta time for each call into Webrat, and then to spit that out somewhere.
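As a rough sketch of that wrapping idea, something like the following might do. The StepTimer module and its timed and report methods are hypothetical names of my own, not part of Cucumber, Webrat or cucumber-nagios, and the sleep is only a stand-in for a real Webrat call so that the snippet runs by itself.

```ruby
# Hypothetical helper: not part of Cucumber, Webrat or cucumber-nagios.
module StepTimer
  # Collected [label, seconds] pairs, in the order the steps ran.
  TIMINGS = []

  # Run the block, record how long it took under the given label,
  # and return the block's own result so the step behaves as before.
  # The timing is recorded even if the block raises.
  def self.timed(label)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    TIMINGS << [label, elapsed]
  end

  # Print one "label seconds" line per recorded step.
  def self.report(io = $stdout)
    TIMINGS.each { |label, secs| io.puts format("%s %.3f", label, secs) }
  end
end

# In a steps file the Webrat call would go inside the block, e.g.:
#
#   When /^I go to (.*)$/ do |path|
#     StepTimer.timed("visit #{path}") { visit path }
#   end

# Stand-in for a Webrat call so the sketch runs on its own.
StepTimer.timed("visit http://inode.co.nz") { sleep 0.01 }
StepTimer.report
```

Using a monotonic clock rather than Time.now keeps the deltas sane if the system clock is adjusted mid-run. Where the report ends up (stdout, a file, a socket) is left open here.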
Of course it would take someone more aware of what’s going on inside the various modules to know whether this is the best place to take such action, but to my mind it’s the beginnings of great performance data … now we have to decide what to do with it. I’m thinking it will make an interesting plugin for collectd …
