Hi guys, I was wondering if I could get some help in trying to track down what's causing my ESXi host to crash.
I've run ESXi for a number of years, but this hardware setup for probably only a year and a half. The hardware is pretty simple, off-the-shelf stuff: Core i7, 16 GB RAM, ASRock motherboard with support for VT-d, a pair of 1Gb Intel NICs, an IBM RAID controller converted to LSI2008 in passthrough, a bunch of HDDs for a NAS and an Intel SSD for the datastore.
Every once in a while I get the purple screen. The worst part is that it doesn't happen with any regularity, maybe once every couple of months, and it's driving me nuts. I recently swapped the PSU and have swapped CPUs as well. I'm out of ideas other than swapping hardware one piece at a time and waiting 6 months, which is a pretty crummy way to troubleshoot this.
Unfortunately I don't have a dump from the last crash, just a picture of the screen, and the screen indicates that dumping failed anyway. I also grabbed everything from the /var/log directory, so if there is anything in there that can help, please ask and I'll post it ASAP.
If all that can be done is for you guys to tell me what to set up to capture more useful info for next time, please pass that along as well.
Here's something from vmkwarning.log:
2014-04-04T03:00:01.759Z cpu2:1647387)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba0:0:0:0 (driver name: ahci) - Message repeated 1 time
2014-04-04T07:09:01.873Z cpu3:32868)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba0:0:0:0 (driver name: ahci) - Message repeated 1 time
0:00:00:00.000 cpu0:1)WARNING: Serial: 806: Serial port com1 failed during initialization.
0:00:00:00.000 cpu0:1)WARNING: Serial: 807: Serial port com1 will be disabled.
0:00:00:00.000 cpu0:1)WARNING: Serial: 806: Serial port com2 failed during initialization.
0:00:00:00.000 cpu0:1)WARNING: Serial: 807: Serial port com2 will be disabled.
0:00:00:04.435 cpu0:32768)WARNING: Cpu: 2145: Cache latency measurement may be inaccurate min= 196 max= 792 avg= 276
0:00:00:04.507 cpu0:32768)WARNING: VMKAcpi: 780: No IPMI PNP id found
0:00:00:04.530 cpu0:32768)WARNING: PCI: 764: ARI-capable device 0000:02:00.0 under non-ARI-capable bridge 0000:00:1c.0
2014-04-05T15:04:56.671Z cpu2:33202)WARNING: PCI: 157: 0000:02:00.0: Bypassing non-ACS capable device in hierarchy
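In case it helps with sifting through the rest of the log, here is a rough Python sketch that tallies these warnings by subsystem (LinScsi, Serial, PCI and so on). It assumes every warning line follows the same layout as the ones pasted above: a timestamp, then `cpuN:ID)WARNING:`, then the subsystem name. The regex and the function names `parse_warning` and `count_by_subsystem` are just my own illustration, not anything that ships with ESXi, and I'm only guessing that the number after the colon is the world ID.

```python
import re
from collections import Counter

# Layout assumed from the lines pasted above:
#   <timestamp> cpu<N>:<id>)WARNING: <subsystem>: <message>
# The "world" label for <id> is an assumption.
WARNING_RE = re.compile(
    r"^(?P<timestamp>\S+) cpu(?P<cpu>\d+):(?P<world>\d+)\)"
    r"WARNING: (?P<subsystem>[^:]+): (?P<message>.*)$"
)


def parse_warning(line):
    """Return the fields of one warning line as a dict, or None if it doesn't match."""
    match = WARNING_RE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


def count_by_subsystem(lines):
    """Count how many warning lines each subsystem produced."""
    counts = Counter()
    for line in lines:
        parsed = parse_warning(line)
        if parsed is not None:
            counts[parsed["subsystem"]] += 1
    return counts
```

Passing it an open file, e.g. `count_by_subsystem(open("vmkwarning.log"))`, should give a quick picture of which subsystem is the noisiest. Lines that don't fit the assumed layout are simply skipped.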
I've also seen messages indicating that the CPU has not performed a heartbeat in 8 seconds, but I don't know what that means.
Much appreciated