The VM with MxGPU gets "Kernel panic" after multiple resets #9

@yanghangliu

Test environment:
- kernel 5.14.0
- qemu-kvm-9.1
- libvirt-10.10.0
- gim-dkms-8.1.0.K-0.noarch
- MI210 or MI300X

How to reproduce the issue:

  1. Create the MxGPU from the MI210 or MI300X by loading the gim module on the host
# modprobe gim
  2. Make sure the VM has the AMD GPU/vGPU driver installed
# cat /etc/yum.repos.d/amdgpu.repo 
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/latest/rhel/9.6/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
# cat rocm.repo
[ROCm]
name=ROCm
baseurl=https://repo.radeon.com/rocm/el9/latest/main/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
# dnf -y install amdgpu-dkms rocm
# rpm -q amdgpu-dkms rocm
  3. Start a VM with the AMD MxGPU
  4. Check the vGPU status in the VM
# amd-smi list
  5. Reset the VM 5 times
# /bin/virsh reset --domain rhel96
  6. Check the VM dmesg via the console
[   10.912244] [drm] amdgpu kernel modesetting enabled.
[   10.912655] amdgpu: Virtual CRAT table created for CPU
[   10.912812] amdgpu: Topology: Add CPU node
[   10.913253] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x74B5 0x1002:0x74A1 0x00).
[   10.913920] [drm] register mmio base: 0x82400000
[   10.914054] [drm] register mmio size: 2097152
[   10.919517] [drm] host supports REQ_INIT_DATA handshake
[   10.919782] [drm] MCBP is enabled
[   25.950941] [drm] add ip block number 0 <soc15_common>
[   25.951334] [drm] add ip block number 1 <gmc_v9_0>
[   25.951645] [drm] add ip block number 2 <psp>
[   25.951908] [drm] add ip block number 3 <vega20_ih>
[   25.952162] [drm] add ip block number 4 <smu>
[   25.952391] [drm] add ip block number 5 <gfx_v9_4_3>
[   25.952682] [drm] add ip block number 6 <sdma_v4_4_2>
[   25.952926] [drm] add ip block number 7 <vcn_v4_0_3>
[   25.953179] [drm] add ip block number 8 <jpeg_v4_0_3>
[   25.958915] amdgpu 0000:04:00.0: amdgpu: Fetched VBIOS from VRAM BAR
[   25.959159] amdgpu: ATOM BIOS: 113-M3000100-102
[   25.959450] amdgpu 0000:04:00.0: Direct firmware load for amdgpu/psp_13_0_6_cap.bin failed with error -2
[   25.959727] amdgpu 0000:04:00.0: amdgpu: cap microcode does not exist, skip
[   25.962247] amdgpu 0000:04:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   25.962785] amdgpu 0000:04:00.0: amdgpu: MEM ECC is active.
[   25.963014] amdgpu 0000:04:00.0: amdgpu: SRAM ECC is active.
[   25.963226] amdgpu 0000:04:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[18127] ras_mask[18127]
[   25.963463] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   25.963739] amdgpu 0000:04:00.0: amdgpu: VRAM: 196288M 0x0000020000000000 - 0x0000022FEBFFFFFF (196288M used)
[   25.963955] amdgpu 0000:04:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   25.964298] [drm] Detected VRAM RAM=196288M, BAR=262144M
[   25.964513] [drm] RAM width 8192bits HBM
[   25.965383] [drm] amdgpu: 196288M of VRAM memory ready
[   25.965644] [drm] amdgpu: 515584M of GTT memory ready.
[   25.965889] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   26.087915] [drm] PCIE GART of 512M enabled.
[   26.088153] [drm] PTB located at 0x0000020000100000
[   26.160930] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 9
[   26.165454] [drm] MM table gpu addr = 0x200007a1000, cpu addr = 00000000124a76da.
[   27.228916] amdgpu 0000:04:00.0: amdgpu: smu driver if version = 0x08042024, smu fw if version = 0x08042027, smu fw program = 0, smu fw version = 0x0055708e (85.112.142)
[   27.229471] amdgpu 0000:04:00.0: amdgpu: SMU driver if version not matched
[   27.231535] amdgpu 0000:04:00.0: amdgpu: SMU is initialized successfully!
[   27.385901] [drm] kiq ring mec 2 pipe 1 q 0
[   27.387025] [drm] kiq ring mec 2 pipe 1 q 0
[   27.388196] [drm] kiq ring mec 2 pipe 1 q 0
[   27.389377] [drm] kiq ring mec 2 pipe 1 q 0
[   27.390611] [drm] kiq ring mec 2 pipe 1 q 0
[   27.391815] [drm] kiq ring mec 2 pipe 1 q 0
[   27.393076] [drm] kiq ring mec 2 pipe 1 q 0
[   27.394326] [drm] kiq ring mec 2 pipe 1 q 0
[   27.417626] amdgpu 0000:04:00.0: amdgpu: XGMI: Add node 0, hive 0x2101b92d193f83b3.
[   28.064738] amdgpu: HMM registered 196288MB device memory
[   28.069929] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[   28.070134] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[   28.070296] kfd kfd: amdgpu: KFD node 0 partition 0 size 196288M
[   28.070856] kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 7777
[   28.075888] BUG: kernel NULL pointer dereference, address: 00000000000000d7
[   28.075889] #PF: supervisor read access in kernel mode
[   28.075890] #PF: error_code(0x0000) - not-present page
[   28.075892] PGD 119622067 P4D 0 
[   28.075894] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   28.075896] CPU: 3 PID: 738 Comm: systemd-udevd Not tainted 5.14.0-xxx.el9.x86_64 #1
[   28.075898] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20241117-2.el9 11/17/2024
[   28.075899] RIP: 0010:kgd_gfx_v9_hiq_mqd_load+0xc1/0x450 [amdgpu]
[   28.076288] Code: c3 78 03 00 00 45 8b 9c 07 68 6f 01 00 45 85 db 0f 8e 23 03 00 00 48 69 c3 78 03 00 00 c1 e5 0d 4c 01 f8 48 8b 90 50 6f 01 00 <a0> d7 00 00 00 00 00 00 00 00 00 00 00 00 00 01 53 e0 0b 00 00 00
[   28.076289] RSP: 0018:ff67ee13811775e8 EFLAGS: 00010286
[   28.076290] RAX: ff2ca365f0900000 RBX: 0000000000000000 RCX: ff67ee1381600000
[   28.076291] RDX: 0000000000000200 RSI: 0000000000000106 RDI: ff2ca365f0916d68
[   28.076292] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[   28.076292] R10: ffffffffc0aa3220 R11: 0000000000000100 R12: ff67ee1381539000
[   28.076293] R13: 0000000000000000 R14: ff2ca365f0916d60 R15: ff2ca365f0900000
[   28.076294] FS:  00007f96628a9b40(0000) GS:ff2ca4613f6c0000(0000) knlGS:0000000000000000
[   28.076295] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.076296] CR2: 00000000000000d7 CR3: 0000000119640003 CR4: 0000000000771ef0
[   28.076300] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   28.076301]
[   28.076301] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   28.076301] PKRU: 55555554
[   28.076302] Call Trace:
[   28.076303]  <TASK>
[   28.076304]  ? show_trace_log_lvl+0x1c4/0x2df
[   28.076310]  ? show_trace_log_lvl+0x1c4/0x2df
[   28.076312]  ? hiq_load_mqd_kiq_v9_4_3+0xbc/0x120 [amdgpu]
[   28.076591]  ? __die_body.cold+0x8/0xd
[   28.076594]  ? page_fault_oops+0x132/0x170
[   28.076599]  ? exc_page_fault+0x61/0x150
[   28.076601]  ? asm_exc_page_fault+0x22/0x30
[   28.076605]  ? __pfx_hiq_load_mqd_kiq_v9_4_3+0x10/0x10 [amdgpu]
[   28.076885]  ? kgd_gfx_v9_hiq_mqd_load+0xc1/0x450 [amdgpu]
[   28.077188]  ? kgd_gfx_v9_hiq_mqd_load+0x89/0x450 [amdgpu]
[   28.077476]  hiq_load_mqd_kiq_v9_4_3+0xbc/0x120 [amdgpu]
[   28.077759]  kq_initialize.constprop.0+0x312/0x450 [amdgpu]
[   28.078039]  kernel_queue_init+0x3c/0x60 [amdgpu]
[   28.078306]  pm_init+0x64/0xd0 [amdgpu]
[   28.078571]  start_cpsch+0x1a4/0x2c0 [amdgpu]
[   28.078849]  kfd_resume+0x18/0x36 [amdgpu]
[   28.079166]  kfd_init_node+0x15e/0x1de [amdgpu]
[   28.079460]  kgd2kfd_device_init.cold+0x46f/0x6ce [amdgpu]
[   28.079748]  amdgpu_amdkfd_device_init+0x141/0x1e0 [amdgpu]
[   28.080044]  amdgpu_device_ip_init+0x4b4/0x4cc [amdgpu]
[   28.080347]  amdgpu_device_init.cold+0x6ef/0xbd6 [amdgpu]
[   28.080641]  amdgpu_driver_load_kms+0x15/0x70 [amdgpu]
[   28.080877]  amdgpu_pci_probe+0x18d/0x3d0 [amdgpu]
[   28.081107]  ? rpm_resume+0x28e/0x770
[   28.081112]  local_pci_probe+0x4c/0xa0
[   28.081116]  pci_call_probe+0x56/0x160
[   28.081118]  pci_device_probe+0x7c/0x100
[   28.081120]  ? driver_sysfs_add+0x59/0xc0
[   28.081124]  really_probe+0xde/0x390
[   28.081126]  ? pm_runtime_barrier+0x50/0x90
[   28.081128]  __driver_probe_device+0xd6/0x130
[   28.081130]  driver_probe_device+0x1e/0x90
[   28.081132]  __driver_attach+0xd2/0x1c0
[   28.081134]  ? __pfx___driver_attach+0x10/0x10
[   28.081136]  bus_for_each_dev+0x75/0xd0
[   28.081139]  bus_add_driver+0xc2/0x1f0
[   28.081141]  driver_register+0x70/0xd0
[   28.081142]  ? __pfx_init_module+0x10/0x10 [amdgpu]
[   28.081360]  do_one_initcall+0x41/0x210
[   28.081365]  do_init_module+0x64/0x230
[   28.081368]  __do_sys_init_module+0x12e/0x1b0
[   28.081371]  do_syscall_64+0x5c/0xe0
[   28.081374]  ? __mod_memcg_lruvec_state+0x8a/0x120
[   28.081379]  ? __mod_lruvec_page_state+0x97/0x150
[   28.081381]  ? folio_add_new_anon_rmap+0x41/0xb0
[   28.081384]  ? _raw_spin_unlock+0xa/0x30
[   28.081388]  ? do_anonymous_page+0x1bb/0x3e0
[   28.081391]  ? __handle_mm_fault+0x2fe/0x650
[   28.081394]  ? __count_memcg_events+0x4f/0xb0
[   28.081395]  ? mm_account_fault+0x6c/0x100
[   28.081397]  ? handle_mm_fault+0x120/0x250
[   28.081398]  ? do_user_addr_fault+0x35d/0x620
[   28.081399]  ? clear_bhb_loop+0x25/0x80
[   28.081402]  ? clear_bhb_loop+0x25/0x80
[   28.081404]  ? clear_bhb_loop+0x25/0x80
[   28.081406]  ? clear_bhb_loop+0x25/0x80
[   28.081408]  ? clear_bhb_loop+0x25/0x80
[   28.081409]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   28.081412] RIP: 0033:0x7f96635c24ae
[   28.081415] Code: 48 8b 0d 6d 99 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3a 99 0e 00 f7 d8 64 89 01 48
[   28.081416] RSP: 002b:00007ffdcf018ba8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[   28.081417] RAX: ffffffffffffffda RBX: 000055fb6d28f750 RCX: 00007f96635c24ae
[   28.081418] RDX: 00007f966372032c RSI: 0000000001d83408 RDI: 00007f9660689010
[   28.081419] RBP: 00007f9660689010 R08: 000055fb6d2c6d70 R09: 0000000001d83000
[   28.081419] R10: 0000000000000005 R11: 0000000000000246 R12: 00007f966372032c
[   28.081420] R13: 000055fb6d2c0990 R14: 0000000000000007 R15: 000055fb6d2c50d0
[   28.081421]  </TASK>
[   28.081421] Modules linked in: amdgpu(+) video wmi amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy drm_display_helper drm_kms_helper nvme_tcp nvme_fabrics drm nvme_core ahci crct10dif_pclmul libahci crc32_pclmul crc32c_intel nvme_keyring nvme_auth libata virtio_net ghash_clmulni_intel virtio_blk cec net_failover failover serio_raw dm_mirror dm_region_hash dm_log dm_mod
[   28.081438] CR2: 00000000000000d7
[   28.097518] ---[ end trace 0000000000000000 ]---
[   28.097519] RIP: 0010:kgd_gfx_v9_hiq_mqd_load+0xc1/0x450 [amdgpu]
[   28.098140] Code: c3 78 03 00 00 45 8b 9c 07 68 6f 01 00 45 85 db 0f 8e 23 03 00 00 48 69 c3 78 03 00 00 c1 e5 0d 4c 01 f8 48 8b 90 50 6f 01 00 <a0> d7 00 00 00 00 00 00 00 00 00 00 00 00 00 01 53 e0 0b 00 00 00
[   28.098512] RSP: 0018:ff67ee13811775e8 EFLAGS: 00010286
[   28.098702] RAX: ff2ca365f0900000 RBX: 0000000000000000 RCX: ff67ee1381600000
[   28.098895] RDX: 0000000000000200 RSI: 0000000000000106 RDI: ff2ca365f0916d68
[   28.099094] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[   28.099287] R10: ffffffffc0aa3220 R11: 0000000000000100 R12: ff67ee1381539000
[   28.099481] R13: 0000000000000000 R14: ff2ca365f0916d60 R15: ff2ca365f0900000
[   28.099677] FS:  00007f96628a9b40(0000) GS:ff2ca4613f6c0000(0000) knlGS:0000000000000000
[   28.099876] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.100080] CR2: 00000000000000d7 CR3: 0000000119640003 CR4: 0000000000771ef0
[   28.100285] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   28.100489] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   28.100690] PKRU: 55555554
[   28.100890] Kernel panic - not syncing: Fatal exception
[   29.195807] Shutting down cpus with NMI
[   29.196242] Kernel Offset: 0x20400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   29.197902] ---[ end Kernel panic - not syncing: Fatal exception ]---
  7. Check the host dmesg
[43998.754088] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[43998.819541] gim error libgv: [0:bd:0:0][PF][amdgv_ecc_check_global_ras_errors:388] GPU detected ECC Fatal Error.
[43998.828302] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829235] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829478] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829633] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829661] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829680] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829680] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43999.312056] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312187] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312286] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312386] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312484] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312584] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312683] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312786] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312874] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.313088] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.313306] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.313556] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.313857] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.314295] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.314671] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.315122] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.317562] gim error libgv: [0:3d:0:0][PF][mi300_mca_push_unknown_bank_count:236] socket: 0, 1 new hardware errors detected in UNKNOWN Block. 1 total UNKNOWN Block ECC errors since GPU load.
[43999.319990] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[43999.322483] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44057.809137] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[44058.367650] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.368027] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.368369] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.368677] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.368989] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.369321] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.369650] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.369992] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.370291] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.370903] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.371762] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.372623] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.373596] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.374508] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.375458] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.376424] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.377830] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[44058.379262] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379340] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379384] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379401] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379410] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379412] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.382356] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.383075] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.067072] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[44119.624647] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.625140] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.625549] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.625922] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.626405] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.626767] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.627158] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.627477] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.627767] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.628510] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.629174] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.629797] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.630653] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.631513] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.632367] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.633372] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.634631] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[44119.635910] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637180] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637208] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637211] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637864] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637889] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.638093] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.641235] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.056990] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[44180.121605] gim error libgv: [0:5f:0:0][PF][amdgv_ecc_check_global_ras_errors:388] GPU detected ECC Fatal Error.
[44180.131195] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.132200] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.132489] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.132549] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.134103] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.134194] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.134327] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.614942] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.615239] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.615451] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.615631] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.615817] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.616002] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.616182] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.616352] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.616508] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.616820] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.617227] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.617549] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.617888] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.618312] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.618649] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.618994] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.620580] gim error libgv: [0:3d:0:0][PF][mi300_mca_push_unknown_bank_count:236] socket: 0, 1 new hardware errors detected in UNKNOWN Block. 2 total UNKNOWN Block ECC errors since GPU load.
[44180.622399] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[44180.623823] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.259070] gim error libgv: [0:3d:0:0][VF00][amdgv_sched_exit_full_access_timeout:1962] VF 0 full access timeout. |start time: 0| - |end time: 44215251092|
[44215.271296] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[44215.343528] gim error libgv: [0:9d:0:0][PF][amdgv_ecc_check_global_ras_errors:388] GPU detected ECC Fatal Error.
[44215.353289] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354194] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354484] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354611] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354647] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354651] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.355866] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.836164] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.836370] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.836565] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.836751] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.836938] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.837179] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.837367] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.837552] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.837720] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.838082] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.838429] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.838774] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.839222] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.839615] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.840018] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.840411] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44217.850044] gim error libgv: [0:3d:0:0][PF][amdgv_psp_cmd_km_submit:525] PSP command fence wait failed.
[44217.851135] gim error libgv: [0:3d:0:0][PF][mi300_psp_set_mb_int:474] Failed to execute VF gate command.
[44217.854317] gim error libgv: [0:3d:0:0][PF][mi300_mca_push_unknown_bank_count:236] socket: 0, 1 new hardware errors detected in UNKNOWN Block. 3 total UNKNOWN Block ECC errors since GPU load.
[44217.857788] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[44217.861241] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
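
The VM reset step in the reproduction above can be scripted. The following is a minimal sketch, not the reporter's actual procedure: the `reset_loop` function name, the `VIRSH` override variable, and the 60-second pause are assumptions (the pause is loosely based on the roughly 60-second spacing of most FLR messages in the host dmesg). Only `virsh reset --domain rhel96` and the count of 5 come from the report.

```shell
#!/usr/bin/env bash
# Sketch: repeat the `virsh reset` used in the reproduction steps.
# reset_loop, VIRSH and the pause value are illustrative assumptions.

reset_loop() {
    local domain="$1"   # libvirt domain name, e.g. rhel96
    local count="$2"    # number of resets to issue
    local pause="$3"    # seconds to wait after each reset
    local virsh_bin="${VIRSH:-/bin/virsh}"   # set VIRSH=echo for a dry run
    local i=1

    while [ "$i" -le "$count" ]; do
        echo "reset ${i}/${count} of ${domain}"
        # Stop at the first failed reset instead of continuing blindly.
        "$virsh_bin" reset --domain "$domain" || return 1
        sleep "$pause"
        i=$((i + 1))
    done
}

# Example invocation (run as root on the host):
#   reset_loop rhel96 5 60
```

Running it with `VIRSH=echo reset_loop rhel96 5 0` prints the commands without touching the VM, which is a quick way to check the loop before using it on a real host.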
