SMARTctlManagerCollector.Describe can end up with a wrong set of descriptors when slow/broken devices are involved #305

@delroth

Description

SMARTctlManagerCollector.Describe currently uses prometheus.DescribeByCollect. Unfortunately, this appears to race with the collection process when slow or broken devices time out. On a system with such a broken drive, I observed the following errors when querying /metrics:

$ curl http://localhost:9633/metrics
An error has occurred while serving metrics:

3 error(s) occurred:
* collected metric smartctl_device_interface_speed label:{name:"device" value:"sda"} label:{name:"speed_type" value:"max"} gauge:{value:6e+09} with unregistered descriptor Desc{fqName: "smartctl_device_interface_speed", help: "Device interface speed, bits per second", constLabels: {}, variableLabels: {device,speed_type}}
* collected metric smartctl_device_interface_speed label:{name:"device" value:"sda"} label:{name:"speed_type" value:"current"} gauge:{value:6e+09} with unregistered descriptor Desc{fqName: "smartctl_device_interface_speed", help: "Device interface speed, bits per second", constLabels: {}, variableLabels: {device,speed_type}}
* collected metric smartctl_device_error_log_count label:{name:"device" value:"sda"} label:{name:"error_log_type" value:"summary"} gauge:{value:0} with unregistered descriptor Desc{fqName: "smartctl_device_error_log_count", help: "Device SMART error log count", constLabels: {}, variableLabels: {device,error_log_type}}

Sometimes the count is much higher, e.g. "107 error(s) occurred".

I suspect Describe should take the collector lock to prevent this race, but I'm not completely sure. In the meantime, I locally patched the collector to use NewRegistry instead of NewPedanticRegistry, which at least makes the race harmless, since a non-pedantic registry does not reject metrics whose descriptors were never registered.
