Wednesday, July 7th, 2010
It is quite possible to run Nagios3’s web interface directly from Nginx and a FastCGI server, rather than having to involve a heavier web server such as Apache. This is useful if you want to save memory on your machine, for example.
First of all, we ask Nginx to serve the static files for the Nagios web interface. In Debian/Ubuntu, these live in /usr/share/nagios3/htdocs and /usr/share/nagios3/stylesheets, which is a little awkward, but just the sort of thing that the rewrite command is for …
location / {
    root /usr/share/nagios3/htdocs;
    index index.html;
    rewrite ^/nagios3/stylesheets/(.*)$ /../stylesheets/$1 break;
    rewrite ^/nagios3/(.*)$ /$1 break;
}
Next, we tell Nginx to send requests for CGI pages down to a FastCGI server :-
location ~ \.cgi$ {
    root /usr/lib/cgi-bin/nagios3;
    include /etc/nginx/fastcgi_params;
    rewrite ^/cgi-bin/nagios3/(.*)$ /$1;
    auth_basic "Nagios";
    auth_basic_user_file /etc/nagios3/htpasswd.users;
    fastcgi_pass 127.0.1.1:8998;
    fastcgi_param SCRIPT_FILENAME /usr/lib/cgi-bin/nagios3$fastcgi_script_name;
    fastcgi_param AUTH_USER $remote_user;
    fastcgi_param REMOTE_USER $remote_user;
}
We need to make sure that these requests are under authentication, and that we pass the authenticated username to the CGI script properly, hence the auth_basic and fastcgi_param AUTH_USER lines.
That’s Nginx taken care of, but we also need to make sure there’s a generic FastCGI server running on the specified address/port. No configuration is necessary, as we’re passing everything we need, including the script name. fcgiwrap comes recommended on the Nginx wiki.
/usr/bin/spawn-fcgi -a 127.0.1.1 -p 8998 \
    -u www-data -g www-data \
    -f /usr/local/bin/fcgiwrap -P /var/run/fcgiwrap.pid
And that’s all you need!
Posted in Uncategorized | No Comments »
Tuesday, March 30th, 2010
By default, the statusmap of Nagios does not include any useful icons for your equipment. There are some icon collections on http://www.monitoringexchange.org/, and a handy set can be installed with the nagios-images package.
However, there will come a time when you want your own customised icons, and information on how to produce them is hard to come by. The status map is assembled with the GD library as Nagios starts, so the recommended format for icons is .gd2; strictly speaking, though, that’s an intermediate format, and most tools don’t give you any option to create it.
Start by creating an SVG file in Inkscape. Make the icon around 40×40 pixels; this small size won’t matter at all while editing in Inkscape, as vectors scale very well! Avoid using black, because the next steps in the toolchain will treat black as transparent. Instead use a blackish colour, such as #010101.
Just before saving the file, go to “Document Properties” and set the page size to fit the selected icon.
When the SVG file is ready, save it. We’re finished with Inkscape, so now we drop to the command line and use rsvg-convert from the librsvg2-bin package. By default this creates a PNG file, which we can use as the Nagios icon_image. Then we use the netpbm tools to mark the black areas of the PNG as transparent, and write out a GD2 image :-
$ rsvg-convert example.svg > example.png
$ pngtopnm example.png | pnmtopng -transparent =rgb:00/00/00 2>/dev/null | pngtogd2 /proc/self/fd/0 example.gd2 0 1
The pngtogd2 command comes from libgd-tools, and is slightly annoying in that it won’t take STDIN … so we fake it using /proc/self/fd/0. The “0 1” parameters set the chunk size and format, and are cargo-culted from the web and Nagios documentation. The black areas in the PNG are marked in GD2 as being transparent.
So now you can create nice Nagios icons easily. I’ll be publishing a set of basic icons and their source SVG files on code.inode.co.nz soon.
Posted in Technology | No Comments »
Wednesday, March 17th, 2010
Nagios is great, but I still get annoyed with the standard configuration setup. Hosts, Hostgroups, Services … whenever you add a new host, there are a lot of places to touch to get it configured properly.
Most customers think of their network in terms of “hosts” — they add a physical host, they run services on that host, they want to monitor that host. So how can we reduce the amount of config work to be done?
I’ve started to play with a scheme I’m calling ‘capabilities’. A capability is a combination of a hostgroup and a service definition, both with the same name. Here’s the “ssh” capability :-
### SSH {{{
define hostgroup{
    hostgroup_name       ssh
    alias                Hosts with ssh on TCP:22
}
define service{
    hostgroup_name       ssh
    service_description  ssh
    check_command        check_ssh
    use                  service.template
}
# }}} ssh
All I have to do in order to have checks against this capability is to add the host to the hostgroup ‘ssh’, and I can do this directly from the host object itself …
define host{
    host_name   benjamin
    alias       Mail Server
    address     10.11.12.13
    use         host.template
    hostgroups  ssh,smtp
}
No messing around with specifying this host anywhere else: all the references it needs in order to exist are in one file, and so is the complete set of references to the service checks that will be performed.
Now admittedly, this does add extra hostgroups to the CGI interface, and if you are using that on a regular basis you may not appreciate the extra clutter. However, for the sort of thing I’m doing, it’s more important to be able to quickly and accurately describe a host & associated checks.
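Since every capability has the same shape, the hostgroup/service pair could be stamped out by a few lines of Ruby rather than typed by hand. This is only a sketch: the capability helper is a hypothetical name of my own, and the check_smtp command and the alias text are assumptions about what your configuration defines. Only service.template and the fold markers come from the ssh example above.

```ruby
# Hypothetical helper: render one "capability", i.e. a hostgroup and a
# service definition that share the same name.
def capability(name, check_command, description)
  <<~CFG
    ### #{name.upcase} {{{
    define hostgroup{
        hostgroup_name       #{name}
        alias                #{description}
    }
    define service{
        hostgroup_name       #{name}
        service_description  #{name}
        check_command        #{check_command}
        use                  service.template
    }
    # }}} #{name}
  CFG
end

# Assumed example: an "smtp" capability checked with check_smtp.
puts capability('smtp', 'check_smtp', 'Hosts with smtp on TCP:25')
```

Redirect the output into a file that Nagios reads, and a host only needs smtp in its hostgroups line to pick up the check.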
Thursday, January 28th, 2010
I’ve been thinking about monitoring production sites and applications for a while now. Network equipment hands out more stats than you can shake a stick at over SNMP, and that’s been the basis of many a successful monitoring setup.
Unfortunately the state of the network does not directly help if you’re really trying to understand issues that an end-user may be having. So we all got on board with monitoring the state of the machines the apps are running on, tracking memory, CPU, interrupts and so on.
Bad app, good system
Oddly enough, this doesn’t help too much either. One of the first profiling jobs I did for a customer consisted of collecting months’ worth of stats from an ICL machine running SVR4, and throwing it all into a spreadsheet to analyse. Luckily it graphed easily, and we could see memory maxing out as users came online, recovering partially at lunchtime, and completely at the end of the day. Swap space followed the same pattern; so the customer bought more memory (which wasn’t a cheap option in those days). That prevented the machine from dipping into swap, but didn’t help the application performance. Other indicators seemed to show that the system was in a reasonable state, except load average was high (which in those days was basically IO wait related).
Eventually I called on friends in the ICL performance centre; they knew the application in question to be troublesome, as it had been naïvely ported from another platform and was basically just inefficient with resources; from memory I think it was distrusting the filesystem to do its job and repeatedly opening & closing files for frequent small writes. No tuning on the host could accommodate that (although these days I’d be tempted to give the application a virtual filesystem, or something).
Of course, we had insufficient resources (tools, time, money & indeed knowledge) to examine this further, leaving the customer’s problems unresolved.
When do we debug an app?
Now, with access to the source and a debugger, or possibly these days with just DTrace in hand, we can prod and poke at the internals of a running application to see just what is happening. But we only do this when enough users are unhappy enough to complain. What we should be doing is monitoring the performance of applications themselves, as they are running on real production systems. That way we can see when acceptable performance turns into unacceptable performance, as the real environment around the app changes.
Given that a good application release should include simulated load testing, our interest should lie in the actual user-experienced performance on the live production systems. This is affected by all sorts of factors that are not under the control of the app: database delays, other processes on the server hogging resources …
Web application performance use-case testing
As an extension of the current thinking about Behaviour Driven Infrastructure testing (go and read up on Lindsay Holmwood’s thoughts on this, thanks for your presentation at LCA2010), we should be looking at driving use-case tests through the production infrastructure, treating it as a black box (the way the user does), and gathering performance stats on each step.
I’ve been looking at cucumber-nagios as a way to get started on this, given that Nagios is the canonical status monitoring tool, and the domain-specific language that cucumber uses is highly business-readable. However, although Nagios is often good-enough as a performance collecting service, I need more granularity than a single test scenario. I want per-step timings …
Per-step performance
A typical cucumber test run of a web application looks like this :-
$ cucumber inodepages.feature
Feature: inode.co.nz
  It should be up
  And the home page should introduce Inode
  And the Contact page should have my PGP key fingerprint
  Scenario: Visiting home and contact pages # webhome.feature:6
    When I go to http://inode.co.nz # _steps.rb:1
    Then I should see "Inode is a small IT consultancy & services company based in Dunedin NZ. " # _steps.rb:1
  Scenario: Visiting the contact page via the home page # webhome.feature:11
    When I go to http://inode.co.nz # _steps.rb:1
    And I follow "Contact" # _steps.rb:9
    Then I should see "B50F\302\240BE3B\302\240D49B\302\2403A8A\302\2409CC3 8966\302\2409374\302\24082CD\302\240C982\302\2400605" # _steps.rb:1
2 scenarios (2 passed)
6 steps (6 passed)
0m0.790s
This gives a total time value, which fits in with performance targets, but doesn’t do so well with actually showing us which steps have changed over time. Looking into the guts of the cucumber steps (which inherently are unique per project), it seems reasonable that we can record our own timing values. Here’s a part of a steps file from a cucumber-nagios project :-
When /^I go to (.*)$/ do |path|
  visit path
end
When /^I submit the form named "(.*)"$/ do |name|
  submit_form(name)
end
The visit path and submit_form calls are methods of Webrat. It’s not especially difficult to wrap each of these calls with some extra code to collect the delta time for each call into Webrat, and then to spit that out somewhere.
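As a rough sketch of what that wrapping might look like: the StepTimer module below is entirely hypothetical (it is not part of Webrat, cucumber or cucumber-nagios), and it simply measures the wall-clock time around whatever block it is given. The comments show how I imagine the step definitions would call it.

```ruby
# Hypothetical helper, not part of Webrat or cucumber-nagios.
module StepTimer
  # One [label, seconds] pair per timed call, in call order.
  SAMPLES = []

  # Run the block, record how long it took, and return the block's value.
  # The sample is recorded even if the block raises.
  def self.timed(label)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    SAMPLES << [label, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
  end

  # Spit the collected timings out, one per line.
  def self.report(io = $stdout)
    SAMPLES.each { |label, seconds| io.puts format('%s %.3fs', label, seconds) }
  end
end

# Assumed usage inside a steps file:
#
#   When /^I go to (.*)$/ do |path|
#     StepTimer.timed("visit #{path}") { visit path }
#   end

# Stand-in for a Webrat call, so the sketch runs on its own.
StepTimer.timed('demo step') { sleep 0.05 }
StepTimer.report
```

A monotonic clock is used so that a clock adjustment in the middle of a run can’t produce a negative or inflated delta.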
Of course it would take someone more aware of what’s going on inside the various modules to know whether this is the best place to take such action, but to my mind it’s the beginnings of great performance data … now we have to decide what to do with it. I’m thinking it will make an interesting plugin for collectd …
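One way the collectd side might go, purely as a sketch: collectd’s exec plugin reads lines in its plain-text protocol from a script’s standard output, so the per-step timings could be printed as PUTVAL lines. The putval_lines helper, the “cucumber” plugin name and the 60 second interval are all assumptions of mine, and I’m assuming the response_time type is defined in your types.db.

```ruby
require 'socket'

# Hypothetical helper: turn [label, seconds] pairs into collectd
# plain-text protocol lines, as read by collectd's exec plugin.
def putval_lines(samples, host: Socket.gethostname, interval: 60)
  samples.map do |label, seconds|
    # Identifiers are slash-separated, so squash anything unusual in the label.
    instance = label.gsub(/[^A-Za-z0-9_]+/, '_')
    format('PUTVAL "%s/cucumber/response_time-%s" interval=%d N:%.3f',
           host, instance, interval, seconds)
  end
end

# Made-up timings, for illustration only.
puts putval_lines([['visit home', 0.412], ['follow Contact', 0.236]],
                  host: 'monitor1')
```

Each step then becomes its own series in collectd, which is exactly the per-step granularity that the single total from cucumber doesn’t give.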