Purple Screen on Proliant DL380G9 vSAN Free

Posted: Wed Nov 15, 2023 10:05 am
by Muecke_63
Hello Community,

I am looking for help, or for first-hand experience, with what may be a hardware or kernel problem.
I have two ProLiant DL360 G9 servers running vSphere ESXi 7.0.3.
Each ESXi host runs one node of a VSAN Free HA cluster.
The NVMe drives are mapped via passthrough directly to the VSAN VMs.

Everything ran for about 4 months without an issue.

Then, 4 weeks ago, the first ESXi host got a purple screen (PSOD) triggered by an NMI.
No core dump was created, but the screenshot of the PSOD points to a problem with the PCI Express controller or an attached NVMe drive.
The PCI Express controller comes from the official HP enablement kit for the G9 servers.
The high performance fan kit is also installed.
The NVMe drive is an official HPE Kioxia CM5 with 6.84 TB.

The problem occurs across different ESXi releases (HPE OEM builds).
It seems to be related to higher load: on two occasions it occurred while Veeam Backup was creating backups from the datastore in hot-add mode.
However, one crash happened during normal operation with low load on the datastore.

The second ESXi node simply has a Samsung PM9A3 in a StarTech U.2 to PCIe adapter, with the standard fan kit.
It has caused no problems so far.

I have attached 2 screenshots.

As a next step, I have ordered an additional StarTech adapter and a PM9A3 to test this configuration in ESXi node 1.

Has anyone had a similar problem?

Help and suggestions are appreciated.

Best regards and have a nice day,

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Wed Nov 15, 2023 11:04 am
by yaroslav (staff)
Hi,

Sorry to read about this incident. I'd also suggest reaching out to the VMware community on Reddit or on their forum. If you have active support, please contact VMware support.
This could be a driver/firmware mismatch. Consider redeploying the ESXi hosts using the HPE-customized ISO (if one is available), and check the firmware on your PCIe devices.
Last but not least, it is highly unlikely that the PSOD here is caused by StarWind-related components, as they run inside a VM.

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Thu Nov 16, 2023 2:50 am
by panoramicvelvety
I had that problem too. :(

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Thu Nov 16, 2023 10:28 am
by yaroslav (staff)
Hi,

Thanks for your input. Could you please share how you were able to solve it?

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Thu Nov 16, 2023 12:16 pm
by yaroslav (staff)
What NIC do you use? And do you have SR-IOV enabled for it?

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Mon Nov 20, 2023 5:14 pm
by Muecke_63
OK, since I cannot believe that a 3 DWPD NVMe drive is defective after about 5% usage, I recreated the HA node, this time based on an eager-zeroed VMDK.
Access to the NVMe drive now goes through the ESXi kernel.
I will post an update on how it goes.

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Mon Nov 20, 2023 6:39 pm
by yaroslav (staff)
Hi,

Please make sure to use NVMe controllers and apply this fix https://knowledgebase.starwindsoftware. ... ontroller/

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Wed Jan 17, 2024 10:17 pm
by Muecke_63
A short update on this thread: since I started serving the storage to the VSAN appliance as a VMDK, no more crashes have happened.

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Wed Jan 17, 2024 10:42 pm
by Muecke_63
yaroslav (staff) wrote: Hi,

Please make sure to use NVMe controllers and apply this fix https://knowledgebase.starwindsoftware. ... ontroller/
Hello yaroslav,

Thank you for your advice. I will implement the change during the next downtime.

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Wed Jan 17, 2024 10:50 pm
by yaroslav (staff)
Great news. You are always welcome :)
Let me know how it goes.

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Thu Jan 18, 2024 5:08 pm
by Muecke_63
Hello yaroslav,

For the CentOS 7 StarWind VSAN appliance, I can't choose NVMe as the disk controller.
I have now selected VMware Paravirtual instead.

Best Regards

Re: Purple Screen on Proliant DL380G9 vSAN Free

Posted: Thu Jan 18, 2024 6:36 pm
by yaroslav (staff)
Can you try with LSI Logic SAS, please? Make sure to have this fix in place: https://knowledgebase.starwindsoftware. ... ontroller/