As of tomorrow morning, VM's on all hosts running ESX 3.5U2 in enterprise configurations will not power on. VMotion/HA/DRS will probably not work either.
Boom.
Apparently, there is a bug in the VMware license management code. VMware is scrambling to figure out what happened and put out a patch.
Running VM's will not be immediately impacted.
There is a major discussion going on in the vmware communities about the issue: http://communities.vmware.com/thread/162377?tstart=0
OK, while we're all remaining calm... just imagine the implications of bugs like this getting past QA testing. Five years down the road, if you believe a lot of forecasts, nearly all server apps worldwide will be running in VM's. Some country decides to initiate cyberwarfare and manages to get a backdoor into whatever is the prevailing hypervisor of the day... boom. All your VM's are belong to us.
I honestly think a lot of the hype from those who want to build a vm security industry is crap, but god protect us if the baseline code for critical hypervisors like ESX isn't kept secure and regularly audited.
I'd love to find out what happened here.
What regression testing does VMware do on new releases to check for date-based bugs? I'd think they'd at least check simple things like setting the date 1 month or 1 year into the future.
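To make the date-check idea concrete, here is a minimal sketch of that kind of test. Everything in it is an assumption for illustration: `license_valid` is a hypothetical stand-in for whatever check a product really performs, and `check_future_dates` simply re-runs it with the clock pushed 0, 30 and 365 days ahead. It says nothing about how VMware's licensing code or test suite actually works.

```shell
#!/bin/sh
# Hypothetical license check: succeeds while "now" is before "expiry".
# Both arguments are Unix epoch seconds. This is a stand-in, not VMware's code.
license_valid() {
    [ "$1" -lt "$2" ]
}

# Re-run the check with the clock pushed 0, 30 and 365 days ahead.
# Prints one line per offset; exit status is the number of failing offsets.
check_future_dates() {
    now_epoch=$1
    expiry_epoch=$2
    failures=0
    for days in 0 30 365; do
        simulated=$((now_epoch + days * 86400))
        if license_valid "$simulated" "$expiry_epoch"; then
            echo "ok   +${days}d"
        else
            echo "FAIL +${days}d"
            failures=$((failures + 1))
        fi
    done
    return "$failures"
}
```

Something like `check_future_dates "$(date +%s)" "$expiry"` in a release checklist would flag an expiry that lands inside the next year before customers find it the hard way.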
UPDATE: Frank Wegner has posted the following suggestions:
You can see the latest status here: http://kb.vmware.com/kb/1006716 Please check back often, because it will notify you when this issue has been fixed. Until then the best workaround I can think of is:
* Do nothing
* Turn DRS off
* Avoid VMotion
* Avoid to power off VM's
I'd counsel against turning DRS off, as that actually deletes resource pool settings. Instead, set sensitivity to 5, which should effectively disable it w/ minimal impact.
UPDATE 2: VMware Website appears to be having trouble keeping up with people requesting updates.
UPDATE 3: VMware has stated they will have fixes available in 36hrs at the earliest.
UPDATE 4: Anand Mewalal comments:
We used the following workaround to power on the VM's.
Find the host where a VM is located
run ' vmware-cmd -l ' to list the vms.
issue the commands:
service ntpd stop
date -s 08/01/2008
vmware-cmd /vmfs/volumes/vm path/vmname.vmx start
service ntpd start
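The steps in that comment can be wrapped into a small script. This is only a sketch: the `run` helper, the `DRY_RUN` switch and the `power_on_with_old_date` function name are my own additions, and the .vmx path you pass in is whatever `vmware-cmd -l` reports on your host. With `DRY_RUN=1` (the default here) it only prints the commands, so you can see what would happen before touching the clock.

```shell
#!/bin/sh
# Sketch of the workaround above. DRY_RUN=1 (the default here) only prints
# the commands; set DRY_RUN=0 on an ESX service console to execute them.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# power_on_with_old_date VMX_PATH [DATE]
# DATE defaults to the 08/01/2008 value used in the comment above.
power_on_with_old_date() {
    vmx=$1
    past_date=${2:-08/01/2008}
    run service ntpd stop          # keep NTP from correcting the clock
    run date -s "$past_date"       # set the clock back before the expiry
    run vmware-cmd "$vmx" start    # power on the VM
    run service ntpd start         # resume time sync
}
```

Call it once per VM, passing a path taken from the `vmware-cmd -l` listing. Keep in mind that the host clock is wrong while this runs, so anything that depends on host time is affected until ntpd catches up.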
UPDATE 5: Apparently, there are no easily seen warnings in logs/etc or VC prior to hitting the bug. VC will continue to show the hosts as licensed and no errors will appear in vmkernel log file until you try to start up a new vm, reboot a vm, or reboot the host.
UPDATE 6: Welcome Slashdot readers! I've temporarily disabled comments to allow the server VM to handle the load. Apparently Movable Type 4.1 executes a separate Perl CGI script to handle comments on each page load. Load times might have been slow for the last 45 minutes, but should be OK now.
UPDATE 7: I made some minor corrections to this entry that others have requested.
UPDATE 8: VMware has provided a FAQ about the bug with an updated ETA and a few responses to common complaints.
UPDATE 9: Here are some additional external links to blogs/media covering the bug:
SearchServerVirtualization.com - VMware continues to make a reasonable argument that all software will have major bugs at some point, and customers keeping calm and being patient is the best course for everyone to take at this time.
The Register - Date bugs kills VMware Systems
blog.scottlowe.org
virtualization.info
Most of the blogs are naturally quite upset about the lack of QA, but I am seeing just as much anger, if not more, over the long wait for a patch. In the discussion forums, VMware points out that they want to perform QA on the patch itself prior to release and that they have to distribute the patch in many languages/locales. Still, I have to think that if a bug like this occurred in a major open source project, we'd have a public patch from someone, even if not the vendor, within 12-24hrs. Of course, it would not be QA'd and might cause more damage than the bug it is trying to fix. It always comes down to making your choices and taking your chances.
UPDATE 10: Feedback from Slashdot:
- "What would be the more correct course of action, would be to set your DRS Cluster to Manual, which is indicating no automation, DRS will not place or move VMs." Yeah, that sounds better.
- "Misleading headline. Nothing gets shutdown." The "shut down" part was added by the Slashdot editors....if you look at firehose, you'll see the original headline.
- "Hey douchebag, this is not SMS, is it so hard to hit another 2 keys on your keyboard? Oh and for the love of $DEITY$, please learn basic HTML and use links so I don't have to copy paste text into the address bar." Uhm, sorry?
UPDATE 11: From VMware - "we are on track to deliver an express patch by 6pm, August 12, 2008 PST." Excellent news! It's currently almost 3pm here in California, so we're looking at a little more than 3hrs from now.
UPDATE 12: From VMware - "There are two express patches: one for ESX 3.5 Update 2 and one for ESXi 3.5 Update 2. They are specifically targeted for customers who have installed or fully upgraded to ESX/ESXi 3.5 Update 2 or who have applied the ESX350-200806201-UG patch to ESX/ESXi 3.5 or ESX/ESXi 3.5 Update 1 hosts. For customers who haven't done either, these express patches should not be applied.
To be noted is that these patches have been validated to work with esxupdate. However, testing with the VMware Update Manager is still under way. In subsequent communications, we will provide confirmation whether the patches work with VMware Update Manager or if a re-spin is required.
To apply the patches, no reboot of ESX/ESXi hosts is required. One can VMotion off running VMs, apply the patches and VMotion the VMs back. If VMotion capability is not available, VMs need to be powered off before the patches are applied and powered back on afterwards"
I suspect that the last sentence above is going to give a lot of admins pause, as there are quite a lot of sites which have converted their entire clusters to 3.5U2.
UPDATE 13: The official patch availability notice arrives from VMware shortly after 8pm:
"The express patches are now available for download that will resolve the ESX/ESXi 3.5 Update 2 issue which causes the product license to expire as of August 12, 2008. Please go to http://www.vmware.com/go/esxexpresspatches for more information."
Note that they seem to have created a new URL, which I assume points to a CDN or something similar that will be able to handle the load of everyone trying to download the patches.
UPDATE 14: A letter from VMware CEO appears at 9pm at http://blogs.vmware.com/console/2008/08/letter-from-vmw.html
UPDATE 15: VMware is requesting that technical responses to the new patches be posted at: http://communities.vmware.com/thread/162714
Most people seem to be installing the patches fine, although those with fully converted clusters are having to set the time back on all ESX hosts to enable VMotion, so that VM's can be migrated to other hosts and each host can briefly enter maintenance mode to apply the patch. If time is set back, please make sure that syncing of VM time and host time is disabled for all VM's.
