Hello fine people of the Ganglia community!

I recently joined the team at a company with a long-running Ganglia
installation, set up many years ago by persons unknown to me, and not
documented in any manner that we can find.  It's still working, but the
server it's on is failing, with some bad RAM (we think), and we get gaps
in the graphs.  Aside from this being annoying, we're a little bit
afraid that if we power the machine off, it may never come back on - as
happens.

So, one of my tasks in recent weeks has been to rebuild our Nagios and
Ganglia setups.  And I'm running into a weird problem, which I will
explain after a brief overview of how we use Ganglia - which isn't
likely to change soon, for a number of reasons, though I have fielded
some suggestions on how we might do things differently and why.

So, there are roughly a hundred machines, each running gmond
configured to send data unicast to the collector.  I've modified their
configuration so that each currently sends data to both our existing
host and the new host.  I am receiving at least some data for all
machines, but I am missing quite a bit of data, especially load_one
from almost everything, resulting in lots of broken images where I'd
like to see graphs.
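
To make "missing load_one" concrete, here is a minimal Python sketch of
how the gap could be listed per cluster.  It relies on the fact that
gmond writes an XML dump (CLUSTER, HOST and METRIC elements) to anyone
who connects to its TCP accept channel.  The function names are made up
for illustration, and port 8649 is only the stock default - the
collector gmonds in this setup would each listen on their own port.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump gmond sends on its TCP accept channel.

    The port is an assumption (the stock default); adjust per gmond.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_missing_metric(xml_bytes, metric="load_one"):
    """Return {cluster name: sorted host names} for hosts lacking `metric`."""
    root = ET.fromstring(xml_bytes)
    missing = {}
    for cluster in root.iter("CLUSTER"):
        names = [
            host.get("NAME")
            for host in cluster.findall("HOST")
            if host.find(f"METRIC[@NAME='{metric}']") is None
        ]
        if names:
            missing[cluster.get("NAME")] = sorted(names)
    return missing
```

Running hosts_missing_metric(read_gmond_xml(...)) against the same
cluster's gmond on the old and the new collector should show whether the
metric never arrives at the new host or arrives and is lost later.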

These machines are split into clusters and grids with clusters in
them, and it's .. well .. that's how it is.  It looks something like
this:

 SH Grid
   - Content Grid (clusters: crawl, workflow)
   - Production Grid (clusters: web, db, search, misc)
   - Dev QA (cluster)
   - Corp Xen (cluster)
   - Infrastructure (cluster)

So, on the collector host, there are three gmetad processes running:

 gmetad: SH Grid
 gmetad: Content Grid
 gmetad: Production Grid

As well as numerous gmond processes:

 gmond: crawl
 gmond: workflow
 gmond: web
 gmond: db
 gmond: search
 gmond: misc
 gmond: dev/qa
 gmond: xen
 gmond: infra

The configuration is duplicated exactly from the existing, "working"
host, by which I mean that I am actually using the same configuration
files.  I was using gmetad 3.1 with gmond 3.0, but I decided that even
though that should work and did not seem to be the problem, it wouldn't
hurt to align the versions, and I am now running both from 3.0.

I have a few problems with this new setup:

 * Grids disappear and reappear sporadically - e.g. the Production
Grid is often not on the page, and today when I click through to
Production Grid it takes me directly to the web cluster, because it is
apparently not aware of any other clusters.
 * Weird things happen - I know this is vague, but I'll lead with an
example: when I click "Dev QA", sometimes it is reported as part of
Production Grid, other times as part of Content Grid, when in fact it
is a part of the top-level "SH Grid".
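
The second symptom could be pinned down by saving gmetad's XML dump a
few times and comparing where each cluster shows up.  The sketch below
assumes the dump nests CLUSTER elements inside GRID elements; the
function names are hypothetical, and fetching the dumps is left out.

```python
import xml.etree.ElementTree as ET


def cluster_locations(xml_bytes):
    """Map each CLUSTER name to the list of GRID paths it appears under."""
    found = {}

    def walk(node, path):
        for child in node:
            if child.tag == "GRID":
                walk(child, path + [child.get("NAME")])
            elif child.tag == "CLUSTER":
                found.setdefault(child.get("NAME"), []).append(" / ".join(path))

    walk(ET.fromstring(xml_bytes), [])
    return found


def inconsistent_clusters(snapshots):
    """Return clusters seen under more than one grid path across snapshots."""
    seen = {}
    for xml_bytes in snapshots:
        for name, paths in cluster_locations(xml_bytes).items():
            seen.setdefault(name, set()).update(paths)
    return {name: sorted(paths) for name, paths in seen.items() if len(paths) > 1}
```

A cluster such as "Dev QA" appearing in the result would confirm that
its parent grid really changes between dumps, rather than this being a
quirk of the web frontend.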

I'm sure there is other weirdness, but some of it may come into focus
once I get past these more pressing problems.

Thanks in advance for any help that any of you can offer!