NVSM Health

Platform Parameters

aux_firmware_version_type

Brief

No documentation

blacklist_recommendations

Brief

Flag to enable “blacklist recommendation” check for NVIDIA GPUs

Description

This Boolean flag enables a “blacklist recommendation” check, which tests the NVIDIA GPUs on the system and might recommend removing certain GPUs for subsequent triage. In a production environment with many GPUs installed, one or more GPUs might not be in an operational state. To avoid undue impact on production jobs, these GPUs can be temporarily blacklisted (disabled) while processing continues on the operational GPUs. Reasons for blacklisting a GPU include pending page retirements and PCIe connectivity problems.
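
The decision described above can be pictured as a filter over per-GPU health records. The following is a minimal sketch for illustration only, not NVSM's implementation: the `GpuStatus` record, its field names, and the `recommend_blacklist` function are hypothetical, and the two criteria are simply the example reasons named in the description.

```python
from dataclasses import dataclass


@dataclass
class GpuStatus:
    """Hypothetical per-GPU health record (not an NVSM data structure)."""
    index: int
    pending_page_retirement: bool
    pcie_link_ok: bool


def recommend_blacklist(gpus):
    """Return (index, reasons) for each GPU that shows a problem."""
    recommendations = []
    for gpu in gpus:
        reasons = []
        if gpu.pending_page_retirement:
            reasons.append("pending page retirement")
        if not gpu.pcie_link_ok:
            reasons.append("PCIe connectivity problem")
        if reasons:
            recommendations.append((gpu.index, reasons))
    return recommendations


# Made-up sample data: GPU 0 is healthy, GPUs 1 and 2 each show one problem.
gpus = [
    GpuStatus(0, False, True),
    GpuStatus(1, True, True),
    GpuStatus(2, False, False),
]
for index, reasons in recommend_blacklist(gpus):
    print(f"GPU {index}: recommend blacklisting ({', '.join(reasons)})")
```

GPUs that are not flagged stay available, so jobs can continue on them while the flagged GPUs are triaged.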

blacklist_recommendations_instant

Brief

Flag to enable a quick version of the “blacklist recommendation” check for NVIDIA GPUs

Description

This flag enables a quick-running subset of the “blacklist recommendation” check. This quick check is suitable for running periodically or for running as a job preamble.

chassis_serial_number_index

Brief

No documentation

check_smartctl_disk_count

Brief

No documentation

dcgm_instant_watch_type

Brief

No documentation

dcs_camera

Brief

Flag to enable checks related to the NVIDIA DRIVE Constellation Simulator virtual cameras

dcs_can_cfg

Brief

Parameters for logging into the CAN Interface for NVIDIA DRIVE Constellation Simulator

dcs_dcv_bmc

Brief

Flag to enable checks related to the BMC of NVIDIA DRIVE Constellation Vehicle

dcs_dcv_firmware

Brief

Flag to enable checks related to the NVIDIA DRIVE Constellation Vehicle firmware

dcs_dcv_sensor_threshold

Brief

Flag to enable sensor threshold checks for NVIDIA DRIVE Constellation Vehicle

dcs_grid_license

Brief

Flag to enable license check for NVIDIA DRIVE Constellation Simulator GRID

dcs_nv_settings_cfg

Brief

Parameters used to check NVIDIA settings used by NVIDIA DRIVE Constellation Simulator

dcs_psu_attrib_values

Brief

No documentation

dcs_psu_info_commands

Brief

No documentation

dcs_usb_serial_cfg

Brief

Parameters used to check the USB serial connection between DCS and DCV

dcs_usb_tree

Brief

List of USB devices required by NVIDIA DRIVE Constellation Simulator

dcv_fan_bom

Brief

List of fan sensors expected by NVIDIA DRIVE Constellation Simulator

dcv_psu_bom

Brief

List of PSU sensors expected by NVIDIA DRIVE Constellation Simulator

dimm_bom

Brief

List of expected DIMMs (including DIMM slots)

dimm_part_number

Brief

No documentation

disk_controllers_bom

Brief

List of disk controllers expected on PCIe bus

dmidecode_dimm_vendors

Brief

List of recognized DIMM vendors

drive_gpu_sm_count

Brief

Expected streaming multiprocessor count per GPU on NVIDIA DRIVE Constellation Simulator

enable_cgx_sm_checks

Brief

No documentation

enable_dcv_checks

Brief

Enable checks for NVIDIA DRIVE Constellation Vehicle

enable_gpu_mig

Brief

No documentation

enable_psu_consistency_check

Brief

No documentation

enable_vbios_version_check

Brief

No documentation

ethernet_controller_info

Brief

Expected Ethernet controller properties on PCIe bus

ethernet_controllers_bom

Brief

List of expected Ethernet controllers on PCIe bus

fan_bom

Brief

List of expected fan sensors on BMC

fru_devices

Brief

List of BMC FRU devices to check for serial number consistency

gpu_bom

Brief

List of expected GPUs on PCIe bus

gpu_direct_topology

Brief

Expected topology for NVIDIA GPUDirect

gpu_p2p_topology

Brief

No documentation

gpu_sm_count

Brief

Expected streaming multiprocessor count per GPU

gpu_total_retired_page_count

Brief

Maximum number of retired pages per GPU (currently 60)
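
As an illustration of how such a limit might be applied, the sketch below flags GPUs whose retired page count exceeds the maximum. The function name and the input dictionary are hypothetical, and the source does not state whether the real check treats the limit as inclusive; this sketch flags only counts strictly above it.

```python
MAX_RETIRED_PAGES = 60  # value quoted in the parameter brief


def gpus_over_retired_page_limit(retired_pages_by_gpu, limit=MAX_RETIRED_PAGES):
    """Return the GPU indices whose retired page count exceeds the limit.

    retired_pages_by_gpu: hypothetical mapping of GPU index -> retired pages.
    """
    return [
        gpu
        for gpu, count in sorted(retired_pages_by_gpu.items())
        if count > limit  # assumption: the limit itself is still acceptable
    ]


# Made-up counts for three GPUs.
sample_counts = {0: 0, 1: 60, 2: 61}
print(gpus_over_retired_page_limit(sample_counts))
```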

ib_controllers_bom

Brief

Expected InfiniBand controllers on PCIe bus

lscpu_number_of_cores

Brief

Expected number of total logical CPU cores on the system

mdadm_disk_status

Brief

Enable disk status check for Linux software RAID

mdadm_volume_status

Brief

Enable volume status check for Linux software RAID

meminfo_memory_size

Brief

Expected system memory size in kilobytes
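
The parameter name suggests a comparison against the `MemTotal` value that Linux reports in `/proc/meminfo`, which is expressed in kB. That NVSM reads this file is an assumption here, and the helper names below are hypothetical. A minimal sketch of such a comparison:

```python
import re


def mem_total_kb(meminfo_text):
    """Extract the MemTotal value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1))


def memory_matches(meminfo_text, expected_kb):
    """Return True when the reported memory size equals the expected size."""
    return mem_total_kb(meminfo_text) == expected_kb


# Made-up sample in the /proc/meminfo format; on a live Linux system the text
# could instead be read with open("/proc/meminfo").read().
sample = "MemTotal:       528279204 kB\nMemFree:        500000000 kB\n"
print(memory_matches(sample, 528279204))
```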

mlnx_firmware_check

Brief

Flag to enable check of Mellanox firmware version consistency

net_ping

Brief

List of network-layer connectivity checks

Description

Currently, this is used only to check network connectivity of devices attached to NVIDIA DRIVE Constellation Simulator.

nvme

Brief

List of expected NVMe drives (including capacity)

nvme_check_smart_log

Brief

Flag to enable SMART check for NVMe drives

nvswitch_bom

Brief

List of expected NVSwitch devices on PCIe bus

pcie_switches_bom

Brief

List of expected switches on PCIe bus

pegasus_storage_config_list

Brief

No documentation

psu_bom

Brief

List of expected PSU sensors on BMC

psu_model_info

Brief

No documentation

psu_vendor_info

Brief

No documentation

skip_bmc_revision_check

Brief

Flag to disable check for BMC revision

skip_nvme_drive_model

Brief

No documentation

smartctl_check_ssd_brick

Brief

Flag to enable check for SSD drives reporting erroneous SMART information

smartctl_megaraid_disk_count

Brief

Expected MegaRAID disk count (including capacity) reported by smartctl

storcli_disk_stats

Brief

Expected MegaRAID physical disk status reported by StorCLI

storcli_platform_string

Brief

String used to display platform (e.g. DGX) in StorCLI check messages

summary_base_os_string

Brief

String used in display messages to refer to platform OS

summary_serial_num_string

Brief

String used in display messages to refer to platform serial number

sw_rel_file

Brief

Name of file used to store OS release version information

vga_bom

Brief

List of expected VGA controllers on PCI bus (usually for BMC)

xenserver_number_of_cores

Brief

Expected number of virtual CPU cores (when running in XenServer hypervisor)