NVSM Health
Command Details¶
bash_hello_world¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/hello.bash
Timeout¶
300 seconds.
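Every command on this page lists the same 300-second timeout. This page does not say how NVSM Health enforces it. As an illustration only, a comparable limit can be applied by hand with the coreutils `timeout` utility, which exits with status 124 when the limit is reached. The helper name `run_with_limit` is hypothetical and not part of NVSM.

```bash
#!/usr/bin/env bash
# Illustration only: run a command under a time limit with coreutils `timeout`.
# run_with_limit is a hypothetical helper name, not an NVSM tool.
run_with_limit() {
    local limit="$1"; shift
    local rc=0
    timeout "$limit" "$@" || rc=$?
    # coreutils `timeout` reports an expired limit with exit status 124.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# The documented limit is 300 seconds; a 1-second limit keeps this demo short.
run_with_limit 1 sleep 3 || echo "exit status: $?"
```

With the documented value, the call would look like `run_with_limit 300 ipmitool fru print`.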
collect_fru¶
Brief¶
Run ipmitool fru print command
Description¶
This runs the “ipmitool fru” command to obtain FRU (field replaceable unit) information from the BMC (baseboard management controller). FRU information is important for keeping inventory of the components installed on the system and their serial numbers.
Depends On¶
Command-line¶
ipmitool fru print
Timeout¶
300 seconds.
collect_nvsm¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/collect_nvsm.py
Timeout¶
300 seconds.
collect_usb_sysfs¶
Brief¶
Collect information for connected USB devices from sysfs
Description¶
None
Used By¶
Command-line¶
echo TODO
Timeout¶
300 seconds.
dcc_ipmitool_sel_writeraw¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool -I lanplus -H 192.168.1.42 -U nvsm-admin -P None sel writeraw \
bin_file
Timeout¶
300 seconds.
dcc_passgen¶
Brief¶
Run dcc_passgen tool
Description¶
Run dcc_passgen for DCC BMC. This command requires superuser privileges.
Module¶
Depends On¶
Used By¶
Command-line¶
dcc_passgen
Timeout¶
300 seconds.
dcs_cam_camera_mapping¶
Brief¶
None
Description¶
None
Command-line¶
python3 ${NVSMHEALTH_DUMP_TOOLS}/dcs_camera_info.py --cmd camera_mapping \
--display 0
Timeout¶
300 seconds.
dcs_cam_gpus_all¶
Brief¶
None
Description¶
None
Command-line¶
python3 ${NVSMHEALTH_DUMP_TOOLS}/dcs_camera_info.py --cmd gpus_all --display 0
Timeout¶
300 seconds.
dcs_cam_query_gpu_info¶
Brief¶
None
Description¶
None
Command-line¶
python3 ${NVSMHEALTH_DUMP_TOOLS}/dcs_camera_info.py --cmd query_gpu_info
Timeout¶
300 seconds.
ethtool¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/ethtool.sh
Timeout¶
300 seconds.
fru_dcc_version¶
Brief¶
Determine system version using DCC version stored in DCS FRU
Description¶
This command uses ipmitool to read the DCC version stored in the DCS FRU table. On C1.1 systems this will be “1.1”. This command does not require superuser privileges.
Used By¶
Command-line¶
ipmitool fru print 0 | grep -E 'Product Extra(\s+):' | head -n 3 | awk 'NR==3 \
{{print $4}}'
Timeout¶
300 seconds.
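The pipeline above keeps the first three “Product Extra” lines and prints the fourth whitespace-separated field of the third one. The sketch below runs the same filter over a made-up FRU listing so the field selection can be checked without a BMC. The sample text and the function name `extract_dcc_version` are assumptions for illustration, and `\s` in the pattern relies on GNU grep.

```bash
#!/usr/bin/env bash
# Same filter as the documented command, reading FRU text from stdin.
# extract_dcc_version is a hypothetical name used only for this sketch.
extract_dcc_version() {
    grep -E 'Product Extra(\s+):' | head -n 3 | awk 'NR==3 {print $4}'
}

# Made-up sample in the "key : value" layout that ipmitool prints for FRU fields.
sample_fru=' Product Name          : EXAMPLE
 Product Extra         : aaa
 Product Extra         : bbb
 Product Extra         : 1.1'

# Fields on the third match are: Product, Extra, ":", and the value.
printf '%s\n' "$sample_fru" | extract_dcc_version
```

On a real system the input would come from `ipmitool fru print 0` instead of the sample variable.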
gds_check¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_GDS_CUDA_PATH}/gds/tools/gdscheck.py -pvV
Timeout¶
300 seconds.
gds_stack_trace¶
Brief¶
None
Description¶
None
Command-line¶
for x in `nvidia-smi --query-compute-apps=pid --format=csv,noheader` ; do cat \
/proc/$x/task/*/stack; done
Timeout¶
300 seconds.
gds_stats¶
Brief¶
None
Description¶
None
Command-line¶
for x in `nvidia-smi --query-compute-apps=pid --format=csv,noheader` ; do \
${NVSMHEALTH_DUMP_GDS_CUDA_PATH}/gds/tools/gds_stats -p $x -l 3; done
Timeout¶
300 seconds.
ipmitool_bmc_info¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool bmc info
Timeout¶
300 seconds.
ipmitool_chassis_status¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool chassis status
Timeout¶
300 seconds.
ipmitool_lan_print¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool lan print 1
Timeout¶
300 seconds.
ipmitool_power_led_status¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/ipmitool_power_led_status.sh
Timeout¶
300 seconds.
ipmitool_raw¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/ipmitool_raw.sh
Timeout¶
300 seconds.
ipmitool_raw_dgxa100¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/ipmitool_raw_dgxa100.sh
Timeout¶
300 seconds.
ipmitool_sdr_dump¶
Brief¶
None
Description¶
None
Command-line¶
out=$(mktemp); ipmitool sdr dump $out > /dev/null 2>&1; cat $out
Timeout¶
300 seconds.
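The command line above writes the SDR dump to a temporary file and prints it, but leaves the file behind. A possible variant that removes the file afterwards is sketched below. The helper name `dump_via_tempfile` and the stand-in `fake_dump` writer are hypothetical; they only mimic a tool that takes an output path as its last argument.

```bash
#!/usr/bin/env bash
# Sketch: run a tool that writes to a file path, print the file, then delete it.
# dump_via_tempfile is a hypothetical helper, not part of NVSM.
dump_via_tempfile() {
    local out
    out=$(mktemp) || return 1
    # The temp file path is appended as the tool's last argument.
    "$@" "$out" > /dev/null 2>&1
    cat "$out"
    rm -f "$out"
}

# Stand-in for `ipmitool sdr dump`, so the sketch runs without a BMC.
fake_dump() { printf 'example-bytes\n' > "$1"; }

dump_via_tempfile fake_dump
```

On a system with a BMC, the call would be `dump_via_tempfile ipmitool sdr dump`.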
ipmitool_sdr_info¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool sdr info
Timeout¶
300 seconds.
ipmitool_sel_elist¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool sel elist
Timeout¶
300 seconds.
ipmitool_sel_info¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool sel info
Timeout¶
300 seconds.
ipmitool_sel_list¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool sel list
Timeout¶
300 seconds.
ipmitool_sel_time_get¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool sel time get
Timeout¶
300 seconds.
ipmitool_sel_writeraw¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/sel_writeraw.sh
Timeout¶
300 seconds.
ipmitool_user_list_1¶
Brief¶
None
Description¶
None
Command-line¶
ipmitool user list 1
Timeout¶
300 seconds.
java_hello_world¶
Brief¶
None
Description¶
None
Command-line¶
java -classpath ${NVSMHEALTH_DUMP_TOOLS}/tools hello
Timeout¶
300 seconds.
mdadm_detail¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/mdadm-detail.sh
Timeout¶
300 seconds.
mdadm_examine¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/mdadm-examine.sh
Timeout¶
300 seconds.
mlx_fetch_arm_log¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/mlnx_arm_logs.sh
Timeout¶
300 seconds.
mlxcables¶
Brief¶
None
Description¶
None
Command-line¶
mst start && mst cable add && mlxcables
Timeout¶
300 seconds.
modinfo¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/modinfo.sh
Timeout¶
300 seconds.
nvidia_address_text¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/nvidia_address_text.py
Timeout¶
300 seconds.
nvidia_debugdump¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/nvidia-debugdump.sh
Timeout¶
300 seconds.
nvidia_dkms_log¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/nvidia-dkms-log.sh
Timeout¶
300 seconds.
nvidia_driver_ko¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/nvidia_driver_ko.py
Timeout¶
300 seconds.
nvidia_settings¶
Brief¶
None
Description¶
None
Command-line¶
nvidia-settings -q all
Timeout¶
300 seconds.
nvidia_smi_nvlink¶
Brief¶
None
Description¶
None
Command-line¶
nvidia-smi topo -p2p rw >/dev/null && nvidia-smi nvlink -s
Timeout¶
300 seconds.
nvidia_smi_query_unit¶
Brief¶
None
Description¶
None
Command-line¶
nvidia-smi -q -u
Timeout¶
300 seconds.
nvidia_smi_topo¶
Brief¶
None
Description¶
None
Command-line¶
nvidia-smi topo -m
Timeout¶
300 seconds.
nvidia_vm_health_check_show¶
Brief¶
None
Description¶
None
Command-line¶
nvidia-vm health-check show
Timeout¶
300 seconds.
nvidia_vm_image_show¶
Brief¶
None
Description¶
None
Command-line¶
nvidia-vm image show
Timeout¶
300 seconds.
nvidia_vm_resources_show¶
Brief¶
None
Description¶
None
Command-line¶
nvidia-vm resources show
Timeout¶
300 seconds.
nvme_list¶
Brief¶
Collect list of NVMe devices using the nvme-cli tool
Description¶
None
Depends On¶
Command-line¶
nvme list --output-format=json
Timeout¶
300 seconds.
nvme_logs¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/nvme-logs.sh
Timeout¶
300 seconds.
nvsm_health_show_debug¶
Brief¶
None
Description¶
None
Command-line¶
nvsm-health --show --log-level=debug
Timeout¶
300 seconds.
nvsm_show_alerts¶
Brief¶
None
Description¶
None
Command-line¶
nvsm show alerts
Timeout¶
300 seconds.
nvsm_show_debug¶
Brief¶
None
Description¶
None
Command-line¶
nvsm --log-level=debug show -level all
Timeout¶
300 seconds.
perl_hello_world¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/hello.pl
Timeout¶
300 seconds.
ping_compute¶
Brief¶
None
Description¶
None
Command-line¶
ping -w 5 ngc.nvidia.com
Timeout¶
300 seconds.
ps¶
Brief¶
None
Description¶
None
Command-line¶
ps -wwo pid,uid,pcpu,pmem,etime,state,ppid,user,args --pid 2 --ppid 2 \
--deselect
Timeout¶
300 seconds.
psu_info_dgx1¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/psu_info_dgx1.sh
Timeout¶
300 seconds.
python_hello_world¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/hello.py
Timeout¶
300 seconds.
run_bmc_boot_slot_task¶
Brief¶
Run ipmitool raw 0x3C 0x3 0x0
Description¶
Get the BMC boot slot. This command requires superuser privileges.
Depends On¶
Command-line¶
ipmitool raw 0x3C 0x3 0x0
Timeout¶
300 seconds.
run_cec_boot_status¶
Brief¶
Run ipmitool raw 0x3C 0x68 0x00
Description¶
Get boot status. This command requires superuser privileges.
Depends On¶
Command-line¶
ipmitool raw 0x3C 0x68 0x00
Timeout¶
300 seconds.
run_cec_version¶
Brief¶
Run ipmitool raw 0x3C 0xF 0x9
Description¶
Get CEC version. This command requires superuser privileges.
Depends On¶
Used By¶
Command-line¶
ipmitool raw 0x3C 0xF 0x9
Timeout¶
300 seconds.
run_dmidecode¶
Brief¶
Run the dmidecode command
Description¶
Verify system as described by SMBIOS/DMI using the dmidecode tool
Depends On¶
Used By¶
Command-line¶
dmidecode
Timeout¶
300 seconds.
run_dmidecode_memory¶
Brief¶
Run the dmidecode command
Description¶
Run the “dmidecode” command to get memory DMI type information. Some flags are added to output in a machine-readable format. This command does not require superuser privileges.
Depends On¶
Used By¶
Command-line¶
dmidecode --type memory
Timeout¶
300 seconds.
run_dpkg_grep_kvm¶
Brief¶
Run dpkg list and grep for kvm package
Description¶
None
Used By¶
Command-line¶
bash -c "dpkg -l | grep -c dgx-kvm-sw"
Timeout¶
300 seconds.
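One detail of this command is worth knowing when reading its result: `grep -c` prints the number of matching lines, and it exits with a non-zero status when that number is 0. A caller that only checks the exit status could therefore mistake "package not installed" for a failed command. The sketch below counts matches in sample text and normalises the exit status; `count_pkg_lines` and the sample line are hypothetical.

```bash
#!/usr/bin/env bash
# Count lines on stdin that mention a package name.
# count_pkg_lines is a hypothetical helper used only for illustration.
count_pkg_lines() {
    # grep -c prints "0" and exits 1 when nothing matches; `|| true` hides that status.
    grep -c -- "$1" || true
}

# Made-up line in the general shape of `dpkg -l` output.
sample_dpkg='ii  dgx-kvm-sw  1.0  amd64  example description'

printf '%s\n' "$sample_dpkg" | count_pkg_lines dgx-kvm-sw
printf '' | count_pkg_lines dgx-kvm-sw
```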
run_gpu_monitor_status¶
Brief¶
Execute GET on nvsm_core
Description¶
This runs the “nvsm_core --mode=client GET /nvsm/v1/Systems/1/GPUs” command to obtain gpumonitor status information.
Depends On¶
Used By¶
Command-line¶
nvsm_core --mode=client GET /nvsm/v1/Systems/1/GPUs
Timeout¶
300 seconds.
run_ipmi_fru¶
Brief¶
Run ipmitool fru print command
Description¶
This runs the “ipmitool fru” command to obtain FRU (field replaceable unit) information from the BMC (baseboard management controller). FRU information is important for keeping inventory of the components installed on the system and their serial numbers.
Depends On¶
Used By¶
Command-line¶
ipmitool fru print
Timeout¶
300 seconds.
run_ipmi_getenables¶
Brief¶
Run ipmitool mc getenables command
Description¶
Check BMC status with ipmitool. This command requires superuser privileges.
Depends On¶
Used By¶
Command-line¶
ipmitool mc getenables
Timeout¶
300 seconds.
run_ipmi_info¶
Brief¶
Run ipmitool mc info command
Description¶
Check BMC status with ipmitool. This command requires superuser privileges.
Module¶
Depends On¶
Used By¶
Command-line¶
ipmitool mc info
Timeout¶
300 seconds.
run_ipmi_sdr_elist¶
Brief¶
Run ipmitool sdr elist command
Description¶
Check BMC bom devices with ipmitool. This command requires superuser privileges.
Depends On¶
Used By¶
Command-line¶
ipmitool sdr elist
Timeout¶
300 seconds.
run_ipmi_sensor¶
Brief¶
Run ipmitool sensor command
Description¶
Check BMC sensor status with ipmitool. This command requires superuser privileges.
Depends On¶
Used By¶
Command-line¶
ipmitool sensor
Timeout¶
300 seconds.
run_ipmitool¶
Brief¶
Run the ipmitool command
Description¶
This simply runs the “ipmitool” command to make sure that ipmitool is able to access the BMC (baseboard management controller).
Depends On¶
Used By¶
Command-line¶
ipmitool
Timeout¶
300 seconds.
run_lsblk_scsi_device_info¶
Brief¶
Run the lsblk utility
Description¶
Run the “lsblk” utility to get information for SCSI block devices. The -P flag prints the output as key="value" pairs.
Used By¶
Command-line¶
lsblk -S -P -o NAME,HCTL,TYPE,VENDOR,MODEL,REV,TRAN
Timeout¶
300 seconds.
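With `-P`, lsblk prints one device per line as space-separated `KEY="value"` pairs. A minimal way to pull one field out of such a line in bash is sketched below. The function name `get_field` and the sample line are hypothetical, and the sketch assumes values contain no embedded double quotes.

```bash
#!/usr/bin/env bash
# Extract one KEY="value" field from a line of `lsblk -P` style output.
# get_field is a hypothetical helper; it assumes values hold no double quotes.
get_field() {
    local key="$1" line="$2"
    local re="(^| )${key}=\"([^\"]*)\""
    if [[ "$line" =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[2]}"
    fi
}

# Made-up sample line using the columns requested by the documented command.
sample_line='NAME="sda" HCTL="0:0:0:0" TYPE="disk" VENDOR="ATA" MODEL="EXAMPLE MODEL" REV="1.0" TRAN="sata"'

get_field MODEL "$sample_line"
get_field TRAN "$sample_line"
```

Matching on the quotes rather than splitting on spaces keeps values such as model names with spaces intact.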
run_lscpu¶
Brief¶
Run lscpu command
Description¶
Verify hyperthreading and NUMA are enabled
Used By¶
Command-line¶
lscpu
Timeout¶
300 seconds.
run_lspci¶
Brief¶
Run the lspci command
Description¶
Run the “lspci” command to list PCI devices. Some flags are added such that lspci output is printed in a machine-readable format. This command does not require superuser privileges.
Used By¶
Command-line¶
lspci -vmm -nn
Timeout¶
300 seconds.
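The `-vmm` format prints each device as a block of `Tag:<TAB>value` lines, with blocks separated by a blank line. As a rough sketch of how that output might be consumed, the awk filter below prints the slot and vendor of each record. The function name `summarize_lspci` and the sample records are made up for illustration.

```bash
#!/usr/bin/env bash
# Print "slot vendor" for each record of `lspci -vmm` style input on stdin.
# summarize_lspci is a hypothetical helper used only for this sketch.
summarize_lspci() {
    awk -F'\t' '
        /^Slot:/   { slot = $2 }
        /^Vendor:/ { vendor = $2 }
        /^$/       { if (slot != "") print slot " " vendor; slot = ""; vendor = "" }
        END        { if (slot != "") print slot " " vendor }
    '
}

# Made-up records; with -nn the numeric IDs appear in square brackets.
sample_lspci=$(printf 'Slot:\t00:00.0\nClass:\tHost bridge [0600]\nVendor:\tExample Vendor [1234]\nDevice:\tExample Device [5678]\n\nSlot:\t01:00.0\nClass:\tExample class [0300]\nVendor:\tOther Vendor [abcd]\nDevice:\tOther Device [ef01]\n')

printf '%s\n' "$sample_lspci" | summarize_lspci
```

On a real system the input would come from `lspci -vmm -nn`.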
run_lspci_n¶
Brief¶
Run the lspci command
Description¶
Run the “lspci” command to list PCI devices. Some flags are added such that lspci output is printed in a machine-readable format. This command does not require superuser privileges.
Used By¶
Command-line¶
lspci -vmm -n
Timeout¶
300 seconds.
run_lspci_verbose¶
Brief¶
Run the lspci command with verbose flags
Description¶
Run the “lspci” command with verbose flags to show detailed information about PCI devices. This command requires superuser privileges in order to read privileged PCI device registers. Much of the verbose output from lspci is not necessarily in a machine-readable format.
Used By¶
Command-line¶
lspci -vvv -nn -D
Timeout¶
300 seconds.
run_mlxfwmanager¶
Brief¶
Collect firmware version details for Mellanox devices using Mellanox Firmware Manager
Description¶
None
Used By¶
Command-line¶
mlxfwmanager --query-format xml
Timeout¶
300 seconds.
run_net_ifconfig¶
Brief¶
Run ifconfig command to show all network interfaces
Description¶
See all network interfaces
Used By¶
Command-line¶
ifconfig -a
Timeout¶
300 seconds.
run_nvidia_smi_gpu_bus_id¶
Brief¶
Collect GPUs identified with the NVIDIA System Management Interface (nvidia-smi) tool
Description¶
None
Module¶
Used By¶
Command-line¶
nvidia-smi --query-gpu=gpu_bus_id --format=csv,noheader
Timeout¶
300 seconds.
run_nvidia_smi_p2p_topology¶
Brief¶
Collect GPU P2P topology using the nvidia-smi tool
Description¶
None
Module¶
Used By¶
Command-line¶
nvidia-smi topo -p2p rw
Timeout¶
300 seconds.
run_nvidia_smi_topology¶
Brief¶
Collect GPUDirect topology using the nvidia-smi tool
Description¶
None
Module¶
Used By¶
Command-line¶
nvidia-smi topo --matrix
Timeout¶
300 seconds.
run_smartctl_scan¶
Brief¶
Run the smartctl utility
Description¶
Run the “smartctl” utility to scan for devices. Some flags are added to output in a machine-readable format. This command requires superuser privileges.
Depends On¶
Used By¶
Command-line¶
smartctl --scan
Timeout¶
300 seconds.
run_storcli_pall¶
Brief¶
Run the storcli command
Description¶
None
Depends On¶
Used By¶
Command-line¶
storcli64 /c0/pall show all J
Timeout¶
300 seconds.
run_storcli_vall¶
Brief¶
Run the storcli command
Description¶
None
Depends On¶
Used By¶
Command-line¶
storcli64 /c0/vall show all J
Timeout¶
300 seconds.
run_storcli_version¶
Brief¶
Run the storcli command
Description¶
None
Depends On¶
Used By¶
Command-line¶
storcli64 -v -NoLog
Timeout¶
300 seconds.
run_xl_info¶
Brief¶
Run the “xl info” command for XenServer information
Description¶
The “xl info” command prints basic information about the running XenServer hypervisor.
Depends On¶
Used By¶
Command-line¶
xl info
Timeout¶
300 seconds.
service_cachefilesd_status¶
Brief¶
None
Description¶
None
Command-line¶
service cachefilesd status
Timeout¶
300 seconds.
service_status_all¶
Brief¶
None
Description¶
None
Command-line¶
service --status-all
Timeout¶
300 seconds.
smartctl¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/smartctl.sh
Timeout¶
300 seconds.
storcli_cmds¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/storcli_cmds.sh
Timeout¶
300 seconds.
sysfs_dmi_bios_version¶
Brief¶
Determine BIOS version in DMI table via sysfs
Description¶
This command reads the BIOS version stored in the DMI table by reading its value using sysfs. The BIOS version identifies which BIOS release the platform (e.g. DGX-1, DGX-2, or DGX Station) is running. This command does not require superuser privileges.
Command-line¶
cat /sys/devices/virtual/dmi/id/bios_version
Timeout¶
300 seconds.
sysfs_dmi_product_name¶
Brief¶
Determine product name in DMI table via sysfs
Description¶
This command reads the product name stored in the DMI table by reading its value using sysfs. The product name is used to determine which platform NVSysinfo is running on, e.g. DGX-1, DGX-2, or DGX Station. This command does not require superuser privileges.
Used By¶
Command-line¶
cat /sys/devices/virtual/dmi/id/product_name
Timeout¶
300 seconds.
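To illustrate how a product name read from sysfs could be mapped to a platform, the sketch below matches the string against the three platform names mentioned above. The exact strings that real systems report are not given on this page, so the substring patterns, the returned labels, and the function name `classify_platform` are all assumptions.

```bash
#!/usr/bin/env bash
# Map a DMI product name to a platform label.
# classify_platform, its patterns, and its labels are assumptions for illustration.
classify_platform() {
    case "$1" in
        *"DGX Station"*) echo "dgx-station" ;;
        *DGX-1*)         echo "dgx-1" ;;
        *DGX-2*)         echo "dgx-2" ;;
        *)               echo "unknown" ;;
    esac
}

# On a real system the argument would be:
#   "$(cat /sys/devices/virtual/dmi/id/product_name)"
classify_platform "DGX-2"
classify_platform "Some Other Server"
```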
sysfs_dmi_system_vendor¶
Brief¶
Determine system vendor in DMI table via sysfs
Description¶
This command reads the system vendor name (sometimes also “Manufacturer”) stored in the DMI table by reading its value using sysfs. On DGX systems this will be “NVIDIA”, but might be some other string depending on the system. This command does not require superuser privileges.
Used By¶
Command-line¶
cat /sys/devices/virtual/dmi/id/sys_vendor
Timeout¶
300 seconds.
timedatectl_status¶
Brief¶
None
Description¶
None
Command-line¶
timedatectl status
Timeout¶
300 seconds.
uptime¶
Brief¶
Run uptime command
Description¶
Check system uptime with the uptime utility
Used By¶
Command-line¶
uptime -p
Timeout¶
300 seconds.
xenserver_status_report¶
Brief¶
None
Description¶
None
Command-line¶
${NVSMHEALTH_DUMP_TOOLS}/xenserver-status-report.sh
Timeout¶
300 seconds.