- Hi BSDNow-Team,
- Allan told me to share a story with you, so here it is:
- BEGIN OF STORY
- I manage a cluster of machines (Dell C6220) at our computer science
- department that are intended for compute-intensive jobs in the big
- data area (i.e. NoSQL "databases"). I took over the cluster from a
- colleague who had it running under Ubuntu 12.04. Now, I'm slowly but
- steadily converting it to FreeBSD on ZFS, bhyve and other such niceties.
- I had established a "bridge-head" node for testing purposes and it was
- running very well so far. Logins to the cluster are done via the
- department's LDAP server (which is out of my control).
- The machines in the cluster are pretty popular with students for their
- thesis work, with professors for their research, and with a couple of
- labs that use the computing resources. At the moment, cluster machines
- are given to users exclusively (a policy I am trying to change) for a
- certain period of time, so that no one else interferes with their
- measurements. As you can imagine, nodes become scarce rather quickly.
- Recently, we got a request for a node from a student project in the
- middle of the semester. Apparently, they were using a VM from our
- central IT department (VMware vSphere with a moderate NetApp filer for
- storage) and were having performance problems (not sure what kind).
- They asked whether they could use a node in the cluster for their
- project. They sent me their requirements in an email:
- * OS: Ubuntu 15.04, Debian 7, or CentOS 7.1 (CentOS preferred)
- * minimum 8 GB memory
- * at least 320 GB disk space
- * at least Intel Xeon E3 (the more dedicated, the better)
- * at least 100 MBit network connectivity
- This clearly indicates that they had no idea about the real hardware of
- even one of our cluster nodes and were probably just citing what they
- had previously gotten from central IT. Since a single node has several
- times those hardware resources, I thought it would be a waste to give
- them a whole node when their requirements are so low anyway.
- So I set up a bhyve VM (my very first actually) on my bridge-head
- FreeBSD node with a 400 GB ZFS volume. I configured bhyve to have 2
- CPUs (out of 8) and 8 GB RAM (out of 64 GB). After installing and
- configuring CentOS (not a very nice experience for someone used to BSD
- systems) for them, I gave them the IP address of the bhyve VM and a
- local user account. They were happy with it and have been using the
- machine for roughly three weeks now without a single complaint about
- performance. As far as my monitoring tells me, they are using the
- machine, and the zvol holds a couple of gigabytes of data (not nearly
- as much as they requested, though).
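- The setup above can be sketched roughly as follows. All names here
- (pool, zvol, VM name, tap device) are made up, and a Linux guest needs
- grub2-bhyve from ports, since bhyveload(8) only boots FreeBSD guests:

```shell
# Hypothetical sketch; pool "tank", VM "centos0" and tap0 are assumptions.
# Back the guest disk with a 400 GB ZFS volume:
zfs create -V 400G tank/vm/centos0

# Load the Linux kernel with grub2-bhyve (sysutils/grub2-bhyve);
# device.map points hd0 at the zvol:
grub-bhyve -m device.map -r hd0,msdos1 -M 8192M centos0

# Run the guest with 2 vCPUs and 8 GB RAM on virtio devices:
bhyve -c 2 -m 8192M -A -H -P \
  -s 0:0,hostbridge -s 1:0,lpc \
  -s 2:0,virtio-net,tap0 \
  -s 3:0,virtio-blk,/dev/zvol/tank/vm/centos0 \
  -l com1,stdio \
  centos0
```

- (These days a wrapper like vm-bhyve hides most of this plumbing.)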
- One morning, I decided to upgrade a couple of packages on the
- bridge-head system (which is running CURRENT). Next thing I know, the
- openldap24-client package upgraded fine, but then errors started
- appearing saying that libssl.so.8 could not be found. Remember that
- users on these nodes are authenticated via LDAP only, so the PAM module
- stopped working, which meant that no one (not even root) could log into
- the machine anymore, the very machine running that bhyve VM. Oops!
- I figured the best way was to reboot the node, risk the downtime of the
- bhyve VM and try to fix things in single user mode. Unfortunately, I
- had no recent ZFS boot environment to roll back to. After booting a
- CURRENT snapshot ISO (great that FreeBSD has those!) into the Shell
- environment, I was able to zpool import the pool with the altroot=/mnt
- option. Then I chrooted into /mnt, successfully rebuilt the
- openldap24-client package, and rebooted again. Luckily, I was able to
- log in again, and even the bhyve VM was unaffected. No one even
- complained about the downtime or that the CentOS VM had been rebooted.
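- For anyone curious, the recovery boiled down to a handful of commands.
- A hedged sketch, assuming the usual root-on-ZFS pool name "zroot":

```shell
# From the live shell of the CURRENT snapshot ISO:
# import the installed system's pool mounted under /mnt instead of /
zpool import -f -o altroot=/mnt zroot

# chroot into the installed system and rebuild the broken package:
chroot /mnt /bin/sh
cd /usr/ports/net/openldap24-client
make reinstall clean
exit

# export the pool cleanly and reboot into the repaired system:
zpool export zroot
reboot
```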
- There are certainly better ways to handle this failure scenario, but it
- worked nonetheless. I plan to expand this setup further with more
- FreeBSD nodes and even more bhyve instances. ZFS and boot environments
- (which I now create before running updates) make disaster recovery very
- easy and painless. Also, I found that although bhyve is still young
- (younger than its competitors in the hypervisor space), it is mature
- enough to run VMs for your users. If a student project has performance
- problems on a big vSphere installation but no complaints at all when
- running on bhyve, that says a lot.
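- Creating a boot environment before an update is a one-liner. A sketch
- using beadm from ports (newer FreeBSD releases ship bectl(8) in base;
- the environment name here is arbitrary):

```shell
# Snapshot the current root filesystem into a new boot environment
# before upgrading anything:
beadm create pre-upgrade

# If the upgrade breaks something, roll back by activating the old
# environment and rebooting:
beadm list
beadm activate pre-upgrade
reboot
```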
- I think they will never figure out that they are not using a real
- cluster node but instead a VM with only a fraction of one node's
- resources. This reminds me of the quote from Scotty in
- the Star Trek TNG episode "Relics":
- "Do you mind a little advice? Starfleet captains are like children. They
- want everything right now and they want it their way. But the secret is
- to give them only what they need, not what they want."
- I think that does apply to a lot more people than just Starfleet
- captains.
- END OF STORY
- Can you publish this story under an anonymous name, just in case they
- find out by watching your show? We probably won't get into any trouble
- over it; it's just a precaution.
- Thanks and I look forward to your show every week (now even more)!