I have ESXi 4.1.0 GA (build-260247), Kernel 4.1.0 (x86_64), installed on a Dell PowerEdge 2950 server (two Xeon 5160 CPUs @ 3.00GHz with 24GB memory). The datastore is on local storage (PERC 5/i Integrated RAID Controller). Dell Server Administrator 6.3.0 System Management Software is installed on ESXi.
The Dell server was thoroughly tested before installation, so a hardware malfunction is very unlikely.
Ten virtual machines are running on this server. The guest OSes on these machines are Debian Lenny and CentOS 5.5 with the latest updates, and the latest VMware Tools with matching version are installed.
Everything seemed to be fine until we configured ESXi to send syslog messages and SNMP traps to our syslog and SNMP trap server. Since then we have been receiving disturbing messages from ESXi. A few times a day ESXi sends messages about one of the virtual machines. For instance (syslog wrongly shows the time one hour later):
********** syslog (grep -i heartbeat vmware.log) ***********
Feb 24 03:10:50 <local4.info> 10.5.0.41 Hostd: [2011-02-24 03:10:50.718 707DAB90 verbose 'vm:/vmfs/volumes/4c91ff62-f82b4bf1-5081-00188b2f6167/aaa2.oktv.hr/aaa2.oktv.hr.vmx'] Updating current heartbeatStatus: yellow
Feb 24 03:10:50 <local4.info> 10.5.0.41 Vpxa: [2011-02-24 03:10:50.720 1EEF5B90 verbose 'App'] [VpxaHalVmHostagent] 304: guestHeartbeatStatus changed to yellow
Feb 24 03:10:50 <local4.info> 10.5.0.41 Vpxa: [2011-02-24 03:10:50.720 1EEF5B90 verbose 'App'] [VpxaHalServices] VmHeartbeatChange Event for vm(17) 304
Feb 24 03:10:50 <local4.info> 10.5.0.41 Vpxa: [2011-02-24 03:10:50.720 1EEF5B90 verbose 'App'] [VpxaInvtVmChangeListener] Guest HeartbeatStatus Changed
Feb 24 03:11:20 <local4.info> 10.5.0.41 Hostd: [2011-02-24 03:11:20.718 70CD2B90 verbose 'vm:/vmfs/volumes/4c91ff62-f82b4bf1-5081-00188b2f6167/aaa2.oktv.hr/aaa2.oktv.hr.vmx'] Updating current heartbeatStatus: green
Feb 24 03:11:20 <local4.info> 10.5.0.41 Vpxa: [2011-02-24 03:11:20.720 1EF77B90 verbose 'App'] [VpxaHalVmHostagent] 304: guestHeartbeatStatus changed to green
Feb 24 03:11:20 <local4.info> 10.5.0.41 Vpxa: [2011-02-24 03:11:20.721 1EF77B90 verbose 'App'] [VpxaHalServices] VmHeartbeatChange Event for vm(17) 304
Feb 24 03:11:20 <local4.info> 10.5.0.41 Vpxa: [2011-02-24 03:11:20.721 1EF77B90 verbose 'App'] [VpxaInvtVmChangeListener] Guest HeartbeatStatus Changed
Feb 24 03:11:20 <local4.info> 10.5.0.41 Vpxa: [2011-02-24 03:11:20.721 1EF77B90 verbose 'App'] [VpxaInvtHost] Increment master gen. no to (88663): VmRuntime:GuestHeartbeatStatusChanged
******************************************************************************
********************************** SNMP trap **********************************
Feb 24 04:10:50 <local3.warn> nstorage1 snmptrapd[26851]: 10.5.0.41: Enterprise Specific Trap (VMWARE-VMINFO-MIB::vmwVmHBLost) Uptime: 19 days, 11:51:45.87, VMWARE-VMINFO-MIB::vmwVmID = INTEGER: 1, VMWARE-VMINFO-MIB::vmwVmConfigFilePath = STRING: /vmfs/volumes/4c91ff62-f82b4bf1-5081-00188b2f6167/aaa2.oktv.hr/aaa2.oktv.hr.vmx, VMWARE-VMINFO-MIB::vmwVmDisplayName.1 = STRING: aaa2.oktv.hr-2
Feb 24 04:11:20 <local3.warn> nstorage1 snmptrapd[26851]: 10.5.0.41: Enterprise Specific Trap (VMWARE-VMINFO-MIB::vmwVmHBDetected) Uptime: 19 days, 11:52:15.88, VMWARE-VMINFO-MIB::vmwVmID = INTEGER: 1, VMWARE-VMINFO-MIB::vmwVmConfigFilePath = STRING: /vmfs/volumes/4c91ff62-f82b4bf1-5081-00188b2f6167/aaa2.oktv.hr/aaa2.oktv.hr.vmx, VMWARE-VMINFO-MIB::vmwVmDisplayName.1 = STRING: aaa2.oktv.hr-2
******************************************************************************
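To put a number on how long each episode lasts, the Hostd lines above can be paired up (status leaves green, then returns to green). This is only a minimal sketch: the regular expression and the name heartbeat_gaps are my own assumptions, derived solely from the log format in the excerpt above, not from any VMware documentation.

```python
import re
from datetime import datetime

# Assumed pattern, derived only from the Hostd lines shown above:
# [<timestamp> <thread id> verbose 'vm:<path to .vmx>'] Updating current heartbeatStatus: <status>
HEARTBEAT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \S+ verbose "
    r"'vm:(?P<vmx>[^']+)'\] Updating current heartbeatStatus: (?P<status>\w+)"
)


def heartbeat_gaps(lines):
    """Return (vmx_path, start_time, seconds) for each non-green -> green episode."""
    degraded_since = {}  # vmx path -> time the VM first left green
    gaps = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match is None:
            continue  # not a Hostd heartbeat line (for example, a Vpxa line)
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        vmx, status = match["vmx"], match["status"]
        if status != "green":
            # Remember only the first non-green transition of an episode.
            degraded_since.setdefault(vmx, ts)
        elif vmx in degraded_since:
            start = degraded_since.pop(vmx)
            gaps.append((vmx, start, (ts - start).total_seconds()))
    return gaps
```

Called with the lines of the syslog excerpt above (for example heartbeat_gaps(open(path)) for a saved copy, where path is a placeholder), it should report a single episode of 30 seconds for aaa2.oktv.hr.vmx, which matches the 30-second spacing between the vmwVmHBLost and vmwVmHBDetected traps.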
The vSphere Client shows that server CPU usage is only 1/4 of maximum and memory usage is only 1/2 of maximum, without significant peaks. The guest OS system logs don't note any problems. VMware says that heartbeat loss can be caused by improperly installed VMware Tools or an unresponsive guest OS. I'm pretty sure that the guest OSes and VMware Tools are configured and updated properly.
I have found a few posts on VMware Communities describing a similar problem: "Lots of Lost VM heartbeat snmp alerts" (http://communities.vmware.com/thread/196092) and "Problem with VM Heartbeat / Heartbeat Alarms - Alerts all the time?!" (http://communities.vmware.com/thread/231717), but neither thread offered a solution. At the end of the second thread, tsc09 claims: "VMware have finally acknowledged the problem as a bug (#616568). Apparently it affect 4.1 too. No expected release date for a fix yet." I could not find an official description of this bug on VMware's web pages.
I wonder whether VMware has officially acknowledged this problem as a bug and, if so, where on the VMware site the bug description can be found. If it is not a bug, what is the explanation for this behavior, and how can the problem be solved? My biggest concern is the VM running a VoIP server on this ESXi host. We have experienced some problems with VoIP that we could not explain. These messages could mean that virtual machines sometimes freeze for a short period of time and become unresponsive, even though ESXi has enough resources to run the VMs without problems!