Recently in SysAdmin Category

Excellent reference and a good document to review once a year:

http://www.yellow-bricks.com/2008/12/11/enableresignature-andor-disallowsnapshotlun/

Gone are the days when every sysadmin had to carry a pager because cell phones just didn't cut it.  These days, SMS is just about as reliable in most environments and essentially free compared to the cost of a paging service and dealing with a 2nd device.

The hiccup, as nearly everyone knows by now, is that you have to find a way to get your monitoring server to talk to your cell phone company, and, honestly, your cell phone company does not want to make this easy. They'll try to upgrade you to various unlimited SMS/messaging plans and then offer an internet SMTP -> SMS gateway that fails constantly.

I've switched from provider to provider over the last 4-5 years and they are all the same.  Alerts may be reliable for a few months on one or the other, but eventually there comes a time when you don't get the call because the email -> SMS gateway was down or, worse, delivered the message a day later.

So, screw the cell phone company, let's use Skype:

Steps I Used to Teach The Hobbit Monitor to Use Skype:

  • Set up Hobbit in a dedicated Linux virtual machine
  • Ensure Hobbit is running under its own non-root userid (actually, there is no reason not to just use the apache userid, since Hobbit and Apache are going to share a great many files and the entire VM is dedicated to Hobbit).
  • Set up the VM to boot automatically into runlevel 5 and autostart a new GNOME session as the apache user
  • Install Skype for Linux and the Skype command line tools from: http://www.oberle.org/blog/2007/06/11/sending-sms-with-skype-on-linux/
  • Do not use the standard init system to start Hobbit.  Instead, set up GNOME to launch the Skype client followed by a Hobbit restart at the start of each session.  This ensures that Hobbit has full access to the X authentication environment, which it needs to communicate with the X Window Skype client.
  • Create a new script in /usr/local/bin, owned by apache, that executes the Skype command line SMS send tool and passes it the phone number and the BB environment variable holding the error message itself.
  • Follow the instructions in hobbit-alerts.cfg to tell Hobbit under what conditions it should send SMS alerts
  • Make sure you set up a dedicated Skype account for the server, use SkypeOut credit, and have it auto-deposit another $10 into the account whenever it runs out of funds.  Also, modify the Skype client settings to not accept any incoming calls/messages/etc.
Voila! 
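
For the wrapper script step above, here is a minimal sketch of what such a script might look like. It assumes Hobbit's SCRIPT alert method exports RCPT (the recipient from the alert rule), BBHOSTSVC, BBCOLORLEVEL and BBALPHAMSG, as its alert documentation describes. The script name and the SMS_TOOL path are placeholders for whichever Skype command line SMS sender you installed, and the "number, then text" argument order is an assumption as well.

```shell
#!/bin/sh
# Hypothetical /usr/local/bin/hobbit-skype-sms (the name is an assumption).
# SMS_TOOL is a placeholder: point it at the Skype command line SMS sender
# you installed. Its "number, then text" argument order is assumed here.
SMS_TOOL="${SMS_TOOL:-/usr/local/bin/skype-sms-send}"
MAX_LEN=160   # keep the alert within a single SMS

# Build "<host.service> <color>: <first line of the alert text>".
build_sms_text() {
    first_line=$(printf '%s\n' "${BBALPHAMSG:-}" | head -n 1)
    printf '%s\n' "${BBHOSTSVC:-unknown} ${BBCOLORLEVEL:-unknown}: ${first_line}" |
        cut -c "1-${MAX_LEN}"
}

# $1 = phone number to alert.
send_sms() {
    "$SMS_TOOL" "$1" "$(build_sms_text)"
}

# Hobbit sets RCPT when it runs the script; do nothing when it is unset.
if [ -n "${RCPT:-}" ]; then
    send_sms "$RCPT"
fi
```

With something like this in place, the alert rule would name the script and pass the phone number as the recipient. Only the first line of the alert text is sent, since the full Hobbit status message rarely fits in one SMS.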

ESX 3.5 Update 3 seems to be mostly a bugfix and new hardware support release, although I did see the following nuggets:

  • Intel Pro/1000 gigabit Ethernet device drivers (e1000) in some guests allocate MTU bytes for rx buffers, but tell the device the size of the rx buffer is 2048 bytes. If these buffers fall on the edge of the guest physical memory range, the virtual e1000 device could wedge during rx with the following messages in the VMkernel logs:

    WARNING: Alloc: ppn=0xc0000 out of range: 0x0-0xc0000 (count=3)
    WARNING: P2MCache: GetPhysMemRange failed: PPN 0xc0000 canBlock 0 status Bad parameter.

    This patch fixes this problem.

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007041


  • Add experimental support for a new utility, the VMDK Recovery Tool.

    More information available at: http://kb.vmware.com/kb/1007243


The official release notes are available at: http://www.vmware.com/support/vi3/doc/vi3_esx35u3_rel_notes.html


As of tomorrow morning, VMs on any host running ESX 3.5U2 in an enterprise configuration will not power on. VMotion/HA/DRS will probably not work either.

Boom.

Apparently, there is a bug in the VMware license management code. VMware is scrambling to figure out what happened and put out a patch.

VMs that are already running will not be immediately impacted.

There is a major discussion going on in the vmware communities about the issue: http://communities.vmware.com/thread/162377?tstart=0

OK, while we're all remaining calm... just imagine the implications if bugs like this can occur and get past QA testing. Five years down the road, nearly all server apps worldwide are running in VMs, if you believe a lot of forecasts. Some country decides to initiate cyberwarfare and manages to get a backdoor into whatever is the prevailing hypervisor of the day... boom. All your VM's are belong to us.

I honestly think a lot of the hype from those who want to build a vm security industry is crap, but god protect us if the baseline code for critical hypervisors like ESX isn't kept secure and regularly audited.

I'd love to find out what happened here.

What regression testing does VMware do on new releases to check for date-based bugs? I'd think they'd at least check for simple things like changing the date to 1 month or 1 year in the future.

UPDATE: Frank Wegner has posted the following suggestions:

You can see the latest status here: http://kb.vmware.com/kb/1006716 Please check back often, because it will notify you when this issue has been fixed. Until then the best workaround I can think of is:

* Do nothing
* Turn DRS off
* Avoid VMotion
* Avoid to power off VM's

I'd counsel against turning DRS off, as that actually deletes resource pool settings. Instead, set sensitivity to 5, which should effectively disable it w/ minimal impact.

UPDATE 2: VMware Website appears to be having trouble keeping up with people requesting updates.

UPDATE 3: VMware has stated they will have fixes available in 36hrs at the earliest.

UPDATE 4: Anand Mewalal comments:

We used the following workaround to power on the VM's.
Find the host where a VM is located
run ' vmware-cmd -l ' to list the vms.
issue the commands:
service ntpd stop
date -s 08/01/2008
vmware-cmd /vmfs/volumes/vm path/vmname.vmx start
service ntpd start
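
The workaround in the comment above could be wrapped in a small script so that the four steps always run in the same order. This is only a sketch: the function and variable names are made up for illustration, it defaults to a dry run that prints the commands instead of executing them, and it assumes it runs as root on the service console of the host that owns the VM.

```shell
#!/bin/sh
# Sketch of the clock-rollback workaround quoted above.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
ROLLBACK_DATE="08/01/2008"   # the date used in the comment above

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"     # note: the printed form does not show quoting
    else
        "$@"
    fi
}

# $1 = full path to the VM's .vmx file, as printed by `vmware-cmd -l`.
start_vm_with_old_date() {
    run service ntpd stop           # stop NTP so it does not correct the clock
    run date -s "$ROLLBACK_DATE"    # move the host clock back before the expiry
    run vmware-cmd "$1" start       # power on the VM
    run service ntpd start          # restart NTP, as in the original steps
}
```

Calling start_vm_with_old_date with one of the paths listed by vmware-cmd -l prints the four commands for review; rerunning with DRY_RUN=0 would execute them. Check the host clock afterwards rather than assuming NTP has already brought it back to the real time.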

UPDATE 5: Apparently, there are no easily seen warnings in logs/etc or VC prior to hitting the bug. VC will continue to show the hosts as licensed and no errors will appear in vmkernel log file until you try to start up a new vm, reboot a vm, or reboot the host.

UPDATE 6: Welcome Slashdot readers! I've temporarily disabled comments to allow the server VM to handle the load. Apparently Movable Type 4.1 executes a separate Perl CGI script to handle comments on each page load. Load times might have been slow for the last 45 minutes, but should be OK now.

UPDATE 7: I made some minor corrections to this entry that others have requested.

UPDATE 8: VMware has provided a FAQ about the bug with an updated ETA and a few responses to common complaints.

UPDATE 9: Here are some additional external links to blogs/media covering the bug:

SearchServerVirtualization.com
The Register - Date bugs kills VMware Systems
blog.scottlowe.org
virtualization.info

VMware continues to make a reasonable argument that all software will have major bugs at some point, and customers keeping calm and being patient is the best course for everyone to take at this time.

Most of the blogs are, naturally, quite upset about the lack of QA, but I am seeing just as much anger, if not more, about the long wait for a patch.  In the discussion forums, VMware points out that they want to perform QA on the patch itself prior to release and that they have to distribute the patch in many, many languages/locales. Still, I have to think that if a bug like this occurred in a major open source project, we'd have a public patch from someone, even if not the vendor, within 12-24hrs.  Of course, it would not be QA'd and might cause more damage than the bug it is trying to fix.   It always comes down to making your choices and taking your chances.

UPDATE 10:  Feedback from Slashdot:

  • "What would be the more correct course of action, would be to set your DRS Cluster to Manual, which is indicating no automation, DRS will not place or move VMs."  Yeah, that sounds better.
  • "Misleading headline.  Nothing gets shutdown."   The "shut down" part was added by the Slashdot editors....if you look at firehose, you'll see the original headline.
  • "Hey douchebag, this is not SMS, is it so hard to hit another 2 keys on your keyboard? Oh and for the love of $DEITY$, please learn basic HTML and use links so I don't have to copy paste text into the address bar."  Uhm, sorry?
We're going to try and turn comments back on for the blog now since it's been 6hrs and traffic has died down quite a bit.

UPDATE 11:  From VMware - "we are on track to deliver an express patch by 6pm, August 12, 2008 PST."   Excellent news!  It's currently almost 3pm here in California, so we're looking at a little more than 3hrs from now.

UPDATE 12: From VMware - "There are two express patches: one for ESX 3.5 Update 2 and one for ESXi 3.5 Update 2. They are specifically targeted for customers who have installed or fully upgraded to ESX/ESXi 3.5 Update 2 or who have applied the ESX350-200806201-UG patch to ESX/ESXi 3.5 or ESX/ESX 3.5 Update 1 hosts. For customers who haven't done either, these express patches should not be applied.

To be noted is that these patches have been validated to work with esxupdate. However, testing with the VMware Update Manager is still under way. In subsequent communications, we will provide confirmation whether the patches work with VMware Update Manager or if a re-spin is required.

To apply the patches, no reboot of ESX/ESXi hosts is required. One can VMotion off running VMs, apply the patches and VMotion the VMs back. If VMotion capability is not available, VMs need to be powered off before the patches are applied and powered back on afterwards."

I suspect that the last sentence above is going to give a lot of admins pause, as there are quite a lot of sites that have converted their entire clusters to 3.5U2.


UPDATE 13:  The official patch availability notice arrives from VMware shortly after 8pm:

"The express patches are now available for download that will resolve the ESX/ESXi 3.5 Update 2 issue which causes the product license to expire as of August 12, 2008. Please go to http://www.vmware.com/go/esxexpresspatches for more information."

Note that they seem to have created a new URL, which I assume points to a CDN or similar that will be able to handle the load of everyone trying to download the patches.

UPDATE 14: A letter from VMware CEO appears at 9pm at http://blogs.vmware.com/console/2008/08/letter-from-vmw.html

UPDATE 15: VMware is requesting that technical responses to the new patches be posted at:  http://communities.vmware.com/thread/162714

Most people seem to be installing the patches fine, although those with fully converted clusters are having to set the time back on all ESX hosts to re-enable VMotion, so that VMs can be migrated to other hosts and each host can briefly enter maintenance mode to apply the patch.  If the time is set back, please make sure that syncing of VM time with host time is disabled for all VMs.
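
One way to check the time-sync setting is to look at each VM's .vmx file. The sketch below assumes the setting is stored as tools.syncTime = "TRUE" or "FALSE" in the .vmx, and it treats a missing line as disabled, which is an assumption about the default. The function name is made up for illustration.

```shell
#!/bin/sh
# Report whether a .vmx file has VMware Tools host time sync switched on.
# Assumes the option is written as: tools.syncTime = "TRUE"
# $1 = path to the .vmx file. Prints "enabled" or "disabled".
vm_time_sync_state() {
    if grep -qi '^[[:space:]]*tools\.syncTime[[:space:]]*=[[:space:]]*"true"' "$1"; then
        echo enabled
    else
        echo disabled
    fi
}

# Possible use on a host, one line per registered VM:
#   vmware-cmd -l | while read -r vmx; do
#       echo "$vmx: $(vm_time_sync_state "$vmx")"
#   done
```

This only reads the configuration file. A VM whose sync setting was changed from inside the guest while running may not match what the file says, so treat the output as a first pass and not as proof.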

A small but nice set of recommendations to optimize SAN performance under ESX.

If you can get past the first few minutes and the rather downbeat portrayal of the techs involved, this lengthy video does an absolutely brilliant job of exploring dozens of the absurd situations that almost all sysadmins find themselves in at one point or another.

The Website Is Down
