Purple Screen on Proliant DL380G9 vSAN Free

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

Muecke_63
Posts: 8
Joined: Wed Nov 15, 2023 9:36 am

Wed Nov 15, 2023 10:05 am

Hello Community,

I'm looking for help or first-hand experience with what may be a hardware or kernel problem.
I have two ProLiant DL360 Gen9 servers installed with vSphere ESXi 7.0.3.
Each ESXi host runs one node of a VSAN Free HA cluster.
The NVMe drives are mapped via passthrough directly to the VSAN VMs.
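For reference, the passthrough state of the drives can be checked from the ESXi shell. This is a sketch only: the device address 0000:84:00.0 is a placeholder, and the pcipassthru namespace depends on your ESXi 7.0 build.

```shell
# List PCI devices to find the NVMe controller's address
esxcli hardware pci list | less

# Show which devices are currently configured for passthrough
esxcli hardware pci pcipassthru list

# Enable passthrough for a device (the address below is a placeholder)
esxcli hardware pci pcipassthru set -d 0000:84:00.0 -e true
```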

Everything ran for about four months without an issue.

Then, four weeks ago, the first ESXi host got a purple screen (PSOD) triggered by an NMI.
No core dump was created, but the PSOD screenshot points to a problem with the PCI Express controller or an attached NVMe drive.
The PCI Express controller comes from the official HP enablement kit for the Gen9 servers.
The high-performance fan kit is also installed.
The NVMe drive is an official HPE Kioxia CM5 with 6.84 TB.
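Since no core dump was captured, it may be worth verifying that a coredump target is configured before the next PSOD. A sketch using standard esxcli commands (the datastore name is a placeholder):

```shell
# Check whether a coredump partition or file is already active
esxcli system coredump partition get
esxcli system coredump file list

# If none is active, create a dump file on a datastore and enable it
esxcli system coredump file add -d datastore1 -f vmkdump
esxcli system coredump file set --smart --enable true
```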

The problem occurs across different ESXi releases (HPE OEM builds).
It seems to correlate with higher load: both times it occurred while Veeam Backup was creating backups from the datastore in hot-add mode.
One crash, however, happened during normal operation with low load on the datastore.

The second ESXi node simply has a Samsung PM9A3 in a StarTech U.2-to-PCIe adapter with the standard fan kit.
This has caused no problems so far.

I have attached 2 Screenshots.

As a next step, I have ordered an additional StarTech adapter and a PM9A3 to test this configuration in ESXi node 1.

Has anyone had a similar problem?


Help and suggestions appreciated.

Best regards and have a nice day,
Attachments
PSOD-1: Screenshot 2023-10-15 200540.png (139.41 KiB)
PSOD-2: Screenshot 2023-11-14 215214.png (133.75 KiB)
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Wed Nov 15, 2023 11:04 am

Hi,

Sorry to read about this incident. I'd also suggest reaching out to the VMware community on Reddit or their forum, and if you have active support, please contact VMware support.
These could be driver/firmware mismatches. Consider redeploying the ESXi hosts using the HPE-customized ISO (if available), and have a look at the firmware you use for your PCIe devices.
Last but not least, I'd like to mention that it is highly unlikely that the PSOD is caused by StarWind-related components here, as they run inside a VM.
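To compare driver and firmware versions between the two hosts, something like the following can be run on each ESXi host (a sketch; the nvme esxcli namespace may vary between builds):

```shell
# Which driver claims each storage adapter
esxcli storage core adapter list

# Installed NVMe-related driver VIBs and their versions
esxcli software vib list | grep -i nvme

# NVMe devices as seen by the host (availability varies by build)
esxcli nvme device list
```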
panoramicvelvety
Posts: 1
Joined: Thu Nov 16, 2023 2:47 am

Thu Nov 16, 2023 2:50 am

Muecke_63 wrote:Has anyone had a similar problem?
I had that problem too. :(
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Thu Nov 16, 2023 10:28 am

Hi,

Thanks for your input. Could you please share how you were able to solve it?
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Thu Nov 16, 2023 12:16 pm

What NIC do you use? And, do you have SRIOV enabled for it?
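For reference, SR-IOV status can be checked on the host with standard esxcli commands:

```shell
# NICs with SR-IOV enabled (empty output means SR-IOV is not active)
esxcli network sriovnic list

# All physical NICs with their drivers and link state
esxcli network nic list
```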
Muecke_63
Posts: 8
Joined: Wed Nov 15, 2023 9:36 am

Mon Nov 20, 2023 5:14 pm

OK, since I can't believe that a 3 DWPD NVMe drive is defective after about 5% usage, I have recreated the HA node, this time based on an eager-zeroed VMDK.
Access to the NVMe is now handled by the ESXi kernel.
I will post an update on how it goes.
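For anyone following along, an eager-zeroed VMDK can be created with vmkfstools; the path and size below are placeholders:

```shell
# Create an eager-zeroed thick VMDK on a VMFS datastore
vmkfstools -c 500g -d eagerzeroedthick /vmfs/volumes/datastore1/vsan-node1/data.vmdk

# A lazy-zeroed thick disk can also be zeroed out in place
vmkfstools -k /vmfs/volumes/datastore1/vsan-node1/data.vmdk
```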
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Mon Nov 20, 2023 6:39 pm

Hi,

Please make sure to use NVMe controllers and apply this fix https://knowledgebase.starwindsoftware. ... ontroller/
Muecke_63
Posts: 8
Joined: Wed Nov 15, 2023 9:36 am

Wed Jan 17, 2024 10:17 pm

A short update on this thread: since I started serving the storage to the VSAN appliance as a VMDK, no more crashes have happened.
Muecke_63
Posts: 8
Joined: Wed Nov 15, 2023 9:36 am

Wed Jan 17, 2024 10:42 pm

yaroslav (staff) wrote:Hi,

Please make sure to use NVMe controllers and apply this fix https://knowledgebase.starwindsoftware. ... ontroller/
Hello yaroslav,

thank you for your advice. I will implement the change during the next downtime.
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Wed Jan 17, 2024 10:50 pm

Great news. You are always welcome :)
Let me know how it goes.
Muecke_63
Posts: 8
Joined: Wed Nov 15, 2023 9:36 am

Thu Jan 18, 2024 5:08 pm

Hello yaroslav,

for the CentOS 7 StarWind VSAN appliance I can't choose NVMe as the disk controller.
I have selected VMware Paravirtual instead.
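As a quick sanity check inside the CentOS 7 appliance, you can confirm the paravirtual SCSI driver is loaded after the change (a sketch):

```shell
# The PVSCSI guest driver module
lsmod | grep vmw_pvscsi

# Kernel messages confirming the controller was detected
dmesg | grep -i pvscsi
```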

Best Regards
yaroslav (staff)
Staff
Posts: 2361
Joined: Mon Nov 18, 2019 11:11 am

Thu Jan 18, 2024 6:36 pm

Can you try with LSI Logic SAS, please? Make sure to have this fix in place https://knowledgebase.starwindsoftware. ... ontroller/