As of tomorrow morning, VMs on all hosts running ESX 3.5U2 in enterprise configurations will not power on. VMotion/HA/DRS will probably not work either.
Boom.
Apparently, there is a bug in the VMware license management code. VMware is scrambling to figure out what happened and put out a patch.
Running VMs will not be immediately impacted.
There is a major discussion going on in the VMware Communities about the issue: http://communities.vmware.com/thread/162377?tstart=0
OK, while we're all remaining calm... just imagine the implications if bugs like this can occur and get past QA testing. Five years down the road, nearly all server apps worldwide are running in VMs, if you believe a lot of forecasts. Some country decides to initiate cyberwarfare and manages to get a backdoor into whatever is the prevailing hypervisor of the day... boom. All your VM's are belong to us.
I honestly think a lot of the hype from those who want to build a vm security industry is crap, but god protect us if the baseline code for critical hypervisors like ESX isn't kept secure and regularly audited.
I'd love to find out what happened here.
What regression testing on new releases does VMware do to check for date-based bugs? I'd think they'd at least check for simple things like changing the date to 1 year or 1 month in the future.
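As a rough illustration of the kind of check I mean, here is a minimal Python sketch. Everything in it is hypothetical: `license_is_valid`, its `build_expiry` parameter, and the dates are invented for illustration and say nothing about how VMware's code actually works. The idea is just to run a license check against a handful of future dates and flag the first one on which a license that should still be good gets rejected.

```python
from datetime import date, timedelta


def license_is_valid(today, license_expiry, build_expiry=None):
    """Hypothetical license check, invented for this sketch.

    `build_expiry` stands in for a leftover hard-coded cutoff date;
    when set, the check fails after it even if the license is good.
    """
    if build_expiry is not None and today > build_expiry:
        return False
    return today <= license_expiry


def future_dates(start, horizon_days=365, step_days=30):
    """Yield `start` plus roughly monthly steps out to `horizon_days`."""
    for offset in range(0, horizon_days + 1, step_days):
        yield start + timedelta(days=offset)


def first_failure(start, license_expiry, build_expiry=None):
    """Return the first probed date on which a still-valid license is rejected."""
    for day in future_dates(start):
        if day <= license_expiry and not license_is_valid(
            day, license_expiry, build_expiry
        ):
            return day
    return None


if __name__ == "__main__":
    release_day = date(2008, 7, 25)     # illustrative starting date
    license_expiry = date(2010, 1, 1)   # assumed: license good for over a year
    # Clean build: no probed date should fail.
    print(first_failure(release_day, license_expiry))
    # Build with a stray cutoff: the monthly probe trips over it.
    print(first_failure(release_day, license_expiry,
                        build_expiry=date(2008, 8, 12)))
```

A dozen probes like this cost next to nothing to run, which is the point: a date sweep does not need to cover every hour to catch a cutoff that lands a few weeks after release.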
UPDATE: Frank Wegner has posted the following suggestions:
You can see the latest status here: http://kb.vmware.com/kb/1006716 Please check back often, because it will notify you when this issue has been fixed. Until then the best workaround I can think of is:
* Do nothing
* Turn DRS off
* Avoid VMotion
* Avoid to power off VM's
I'd counsel against turning DRS off, as that actually deletes resource pool settings. Instead, set sensitivity to 5, which should effectively disable it with minimal impact.
UPDATE 2: VMware Website appears to be having trouble keeping up with people requesting updates.
UPDATE 3: VMware has stated they will have fixes available in 36hrs at the earliest.
UPDATE 4: Anand Mewalal comments:
We used the following workaround to power on the VM's.
Find the host where a VM is located
run ' vmware-cmd -l ' to list the vms.
issue the commands:
service ntpd stop
date -s 08/01/2008
vmware-cmd /vmfs/volumes/vm path/vmname.vmx start
service ntpd start
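For anyone wanting to repeat Anand's steps for several VMs, here is a rough Python sketch that only builds the command strings and prints them; it does not run anything. The helper name `workaround_commands` and the example path are made up for illustration, while the four commands themselves are the ones quoted above. The one thing it adds is shell quoting, since .vmx paths often contain spaces.

```python
import shlex


def workaround_commands(vmx_path, fake_date="08/01/2008"):
    """Return the workaround steps for one VM as shell command strings.

    Hypothetical helper: it only builds the commands quoted above,
    it does not execute them.
    """
    return [
        # Stop NTP so the clock is not corrected straight away.
        "service ntpd stop",
        # Set the host clock back to the date used in the workaround.
        "date -s " + shlex.quote(fake_date),
        # Power on the VM; quoting protects paths that contain spaces.
        "vmware-cmd " + shlex.quote(vmx_path) + " start",
        # Restart NTP so the host clock returns to the real time.
        "service ntpd start",
    ]


if __name__ == "__main__":
    # Placeholder path; substitute one reported by `vmware-cmd -l`.
    for command in workaround_commands("/vmfs/volumes/datastore1/my vm/my vm.vmx"):
        print(command)
```

Printing the commands first and reviewing them before pasting them into the service console seems safer than wiring this straight into a subprocess call on a production host.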
UPDATE 5: Apparently, there are no easily seen warnings in logs/etc or VC prior to hitting the bug. VC will continue to show the hosts as licensed and no errors will appear in vmkernel log file until you try to start up a new vm, reboot a vm, or reboot the host.
UPDATE 6: Welcome Slashdot readers! I've temporarily disabled comments to allow the server VM to handle the load. Apparently Movable Type 4.1 executes a separate Perl CGI script to handle comments on each page load. Load times might have been slow for the last 45 minutes, but should be OK now.
UPDATE 7: I made some minor corrections to this entry that others have requested.
UPDATE 8: VMware has provided a FAQ about the bug with an updated ETA and a few responses to common complaints.
UPDATE 9: Here are some additional external links to blogs/media covering the bug:
SearchServerVirtualization.com - VMware continues to make a reasonable argument that all software will have major bugs at some point, and customers keeping calm and being patient is the best course for everyone to take at this time.
The Register - Date bugs kills VMware Systems
blog.scottlowe.org
virtualization.info
Most of the blogs are naturally quite upset about the lack of QA, but I am seeing just as much anger, if not more, over the long wait for a patch. In the discussion forums, VMware points out that they want to perform QA on the patch itself prior to release and that they have to distribute the patch in many languages/locales. Still, I have to think that if a bug like this occurred in a major open source project, we'd have a public patch from someone, even if not the vendor, within 12-24hrs. Of course, it would not be QA'd and might cause more damage than the bug it is trying to fix. It always comes down to making your choices and taking your chances.
UPDATE 10: Feedback from Slashdot:
- "What would be the more correct course of action, would be to set your DRS Cluster to Manual, which is indicating no automation, DRS will not place or move VMs." Yeah, that sounds better.
- "Misleading headline. Nothing gets shutdown." The "shut down" part was added by the Slashdot editors....if you look at firehose, you'll see the original headline.
- "Hey douchebag, this is not SMS, is it so hard to hit another 2 keys on your keyboard? Oh and for the love of $DEITY$, please learn basic HTML and use links so I don't have to copy paste text into the address bar." Uhm, sorry?
UPDATE 11: From VMware - "we are on track to deliver an express patch by 6pm, August 12, 2008 PST." Excellent news! It's currently almost 3pm here in California, so we're looking at a little more than 3hrs from now.
UPDATE 12: From VMware - "There are two express patches: one for ESX 3.5 Update 2 and one for ESXi 3.5 Update 2. They are specifically targeted for customers who have installed or fully upgraded to ESX/ESXi 3.5 Update 2 or who have applied the ESX350-200806201-UG patch to ESX/ESXi 3.5 or ESX/ESX 3.5 Update 1 hosts. For customers who haven't done either, these express patches should not be applied.
To be noted is that these patches have been validated to work with esxupdate. However, testing with the VMware Update Manager is still under way. In subsequent communications, we will provide confirmation whether the patches work with VMware Update Manager or if a re-spin is required.
To apply the patches, no reboot of ESX/ESXi hosts is required. One can VMotion off running VMs, apply the patches and VMotion the VMs back. If VMotion capability is not available, VMs need to be powered off before the patches are applied and powered back on afterwards."
I suspect that the last sentence above is going to give a lot of admins pause, as there are quite a lot of sites which have converted their entire clusters to 3.5U2.
UPDATE 13: The official patch availability notice arrives from VMware shortly after 8pm:
"The express patches are now available for download that will resolve the ESX/ESXi 3.5 Update 2 issue which causes the product license to expire as of August 12, 2008. Please go to http://www.vmware.com/go/esxexpresspatches for more information."
Note that they seem to have created a new URL, which I assume points to a CDN or something similar that will be able to handle the load of everyone trying to download the patches.
UPDATE 14: A letter from VMware CEO appears at 9pm at http://blogs.vmware.com/console/2008/08/letter-from-vmw.html
UPDATE 15: VMware is requesting that technical responses to the new patches be posted at: http://communities.vmware.com/thread/162714
Most people seem to be installing the patches fine, although those with fully converted clusters are having to set the time back on all ESX hosts to enable VMotion, so that VMs can be migrated to other hosts and each host can briefly enter maintenance mode to apply the patch. If time is set back, please make sure that syncing of VM time and host time is disabled for all VMs.

Have not used VMware for a while, but in the past they had some solid products. Used to run Linux on a laptop while running Windows XP as the virtual machine. Anyway, I am sure VMware will have a fix soon. In the meantime: All your base are belong to us!
Well, I guess I'll go tell people not to shut down any of their VMs today...
-- Dave
You got the meme wrong. The title of the article should have been "All your VM are belong to us".
Vlad, gandeste pozitiv nu se va intampla nimic [Romanian: "Vlad, think positive, nothing will happen"]
"...I'd think they'd at least check for simple things like changing the date to 1 year or 1 month in the future..."
Are you serious? Do you write code? Do you write test plans? Do you really think that a test like that is going to be at the top of your list?
Let's think about this test if we are going to be thorough:
for ( every hour from now to 2037 ) {
reboot()
}
How long will that test take to run? Some stuff just has to slip through the cracks.
Test plans focus on
- known bugs
- things we thought could go wrong
Test plans ignore
- things we assume other people already tested
- all the stuff we didn't know could go wrong
(yesh, that's a /big/ list)
Next time you develop an app, write the most detailed test plan you can think of and re-read it.
If it has fewer than 10,000 tests, it is probably incomplete.
esx issues
Right before your update 10 you say, "It always comes down to making you're making your choices and taking your chances."
Am I reading this correctly or are there too many "making" and "you're/your" words in there due to lack of QA? ;)
voodoobones: you got me. I've corrected it. Patch out in 10 minutes :)
Geoff: I've actually written test plans. If one of the critical software routines in ESX is evaluating license server checks, I'd expect at least 1 or 2 simple test routines for it before any major release, yes? Yeah, I know it's hard to even guess all the possible issues that can come up but it's not that uncommon for date issues to hurt software so hopefully the devs would have thought about it.
Eu: I don't speak whatever language that was.
What is a virtual machine? Thanks. You're not the only one that pointed that out.
Dave: good idea.
Eu says "positive thinking will not happen anything" in Romanian
Software vendors need to get over it and realize that all this licensing crap does nothing except make things complicated and cause things to break. It does absolutely nothing to prevent piracy, and probably costs them 10 times more to design and implement that trash than to just allow the product to be pirated.
PATCH AVAILABLE (00:04h GMT-03)
http://kb.vmware.com/selfservice/dynamickc.do?externalId=1006721&sliceId=1&command=show&forward=nonthreadedKC&kcId=1006721
I'm just wondering what "kept secure" is supposed to mean in this phrase:
"... baseline code for critical hypervisors like ESX isn't kept secure and regularly audited"
A. Cryptographically signed by the publisher?
B. Source never released from the black room?
C. Wrapped up with an integrity watcher, a la the NT Workstation/NT Server license daemon?
D. All of the above?
Seems muddled to me. We invoke the names of things that make us feel better when we're scared, I guess.
Having time-bombed software deployed into production is enough to get people fired. I bet those fired people won't be using ESX in their next job.
In your first update, you say, "I'd council against turning DRS off...." That should be, "I'd counsel against turning DRS off...." For your QA, you probably rely on proofreading, clearly inadequate. You could rely on an automated process, i.e. a spell checker; but this is a bad idea because it cannot catch subtle errors such as homonyms, the error here. Therefore, I suggest adding to your QA process not only automatically checking the spelling of the words you type but also manually checking their meanings, both denotative and connotative.
Glad I could help,
A high school drop-out who feels better about himself by pointing out elementary grammar errors made by CEOs
I thought that only Microsoft made big mistakes like this!
Licensing? Linux? Faking it out with incorrect date settings?
Makes Microsoft look good for a change!
Thanks,
Bubba
It comes down to doing your own QA testing. I work for a large company, and we too have seen the issues from 3.5U2. The difference is that we saw them in our test environment. We never let updates run in a production environment unless they have been running in our test environment for a while. We need to be confident that an update is good for production.