NVSM Health
Platform Parameters¶
blacklist_recommendations¶
Brief¶
Flag to enable the “blacklist recommendation” check for NVIDIA GPUs
Description¶
This Boolean flag enables a “blacklist recommendation” check, which tests the NVIDIA GPUs on the system and might recommend removing certain GPUs for subsequent triage. In a production environment with many GPUs installed, one or more GPUs on the system might not be in an operational state. To avoid undue impact on production jobs, these GPUs can be temporarily blacklisted (disabled) while processing continues on the operational GPUs. Reasons for blacklisting a GPU might include pending page retirements or PCIe connectivity problems.
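As a rough illustration of the kind of decision described above, the following Python sketch flags GPUs that have pending page retirements or a PCIe connectivity problem. It is not NVSM's implementation: the `GpuStatus` record, its field names, and the `recommend_blacklist` function are hypothetical, and real inputs would have to come from the system's GPU tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuStatus:
    """Hypothetical per-GPU snapshot; field names are assumptions for illustration."""
    index: int
    pending_page_retirements: bool
    pcie_link_ok: bool


def recommend_blacklist(gpus: List[GpuStatus]) -> List[int]:
    """Return indices of GPUs that match either reason named in the description."""
    return [
        gpu.index
        for gpu in gpus
        if gpu.pending_page_retirements or not gpu.pcie_link_ok
    ]


if __name__ == "__main__":
    # Made-up sample data: GPU 1 has pending page retirements,
    # GPU 2 has a PCIe connectivity problem.
    fleet = [
        GpuStatus(0, pending_page_retirements=False, pcie_link_ok=True),
        GpuStatus(1, pending_page_retirements=True, pcie_link_ok=True),
        GpuStatus(2, pending_page_retirements=False, pcie_link_ok=False),
    ]
    print(recommend_blacklist(fleet))
```

The sketch only produces a recommendation list; whether and how a GPU is actually disabled is left to the operator, in keeping with the description above.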
blacklist_recommendations_instant¶
Brief¶
Flag to enable a quick version of the “blacklist recommendation” check for NVIDIA GPUs
Description¶
This flag enables a quick-running subset of the “blacklist recommendation” check. This quick check is suitable for running periodically or for running as a job preamble.
dcs_camera¶
Brief¶
Flag to enable checks related to the NVIDIA DRIVE Constellation Simulator virtual cameras
dcs_can_cfg¶
Brief¶
Parameters for logging into the CAN Interface for NVIDIA DRIVE Constellation Simulator
dcs_dcv_firmware¶
Brief¶
Flag to enable checks related to the NVIDIA DRIVE Constellation Vehicle firmware
dcs_dcv_sensor_threshold¶
Brief¶
Flag to enable sensor threshold checks for NVIDIA DRIVE Constellation Vehicle
dcs_nv_settings_cfg¶
Brief¶
Parameters used to check NVIDIA settings used by NVIDIA DRIVE Constellation Simulator
drive_gpu_sm_count¶
Brief¶
Expected streaming multiprocessor count per GPU on NVIDIA DRIVE Constellation Simulator
ib_controller_link_info¶
Brief¶
Expected link width and speed for InfiniBand controllers on PCIe bus
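A check of this kind amounts to comparing each controller's observed PCIe link against an expected width and speed. The Python sketch below shows that comparison under stated assumptions: the `LinkInfo` type, the `check_links` function, and the sample addresses and values are hypothetical and do not reflect the actual format of this parameter in NVSM.

```python
from typing import Dict, List, NamedTuple


class LinkInfo(NamedTuple):
    """Hypothetical link description: lane count and per-lane speed in GT/s."""
    width: int
    speed_gts: float


def check_links(expected: LinkInfo, observed: Dict[str, LinkInfo]) -> List[str]:
    """Return one message per mismatch between an observed link and the expected one."""
    problems = []
    # Sort by PCIe address so the report order is stable.
    for address, link in sorted(observed.items()):
        if link.width != expected.width:
            problems.append(
                f"{address}: width x{link.width}, expected x{expected.width}"
            )
        if link.speed_gts != expected.speed_gts:
            problems.append(
                f"{address}: speed {link.speed_gts} GT/s, "
                f"expected {expected.speed_gts} GT/s"
            )
    return problems


if __name__ == "__main__":
    # Made-up addresses and values, for illustration only.
    expected = LinkInfo(width=16, speed_gts=8.0)
    observed = {
        "0000:12:00.0": LinkInfo(width=16, speed_gts=8.0),
        "0000:54:00.0": LinkInfo(width=8, speed_gts=8.0),
    }
    for message in check_links(expected, observed):
        print(message)
```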
net_link¶
Brief¶
List of link-layer connectivity checks
Description¶
Currently, this parameter is used only to check the network connectivity of devices attached to the NVIDIA DRIVE Constellation Simulator.
net_ping¶
Brief¶
List of network-layer connectivity checks
Description¶
Currently, this parameter is used only to check the network connectivity of devices attached to the NVIDIA DRIVE Constellation Simulator.
smartctl_check_ssd_brick¶
Brief¶
Flag to enable a check for SSDs reporting erroneous SMART information
smartctl_megaraid_disk_count¶
Brief¶
Expected MegaRAID disk count (including capacity) reported by smartctl
storcli_platform_string¶
Brief¶
String used to display platform (e.g. DGX) in StorCLI check messages