`SMARTctlManagerCollector.Describe` currently uses `prometheus.DescribeByCollect`. Unfortunately, this can apparently race with the collection process in the case of slow/broken devices that time out. On a system with such a broken drive, I observed the following errors when querying `/metrics`:
```
$ curl http://localhost:9633/metrics
An error has occurred while serving metrics:

3 error(s) occurred:
* collected metric smartctl_device_interface_speed label:{name:"device" value:"sda"} label:{name:"speed_type" value:"max"} gauge:{value:6e+09} with unregistered descriptor Desc{fqName: "smartctl_device_interface_speed", help: "Device interface speed, bits per second", constLabels: {}, variableLabels: {device,speed_type}}
* collected metric smartctl_device_interface_speed label:{name:"device" value:"sda"} label:{name:"speed_type" value:"current"} gauge:{value:6e+09} with unregistered descriptor Desc{fqName: "smartctl_device_interface_speed", help: "Device interface speed, bits per second", constLabels: {}, variableLabels: {device,speed_type}}
* collected metric smartctl_device_error_log_count label:{name:"device" value:"sda"} label:{name:"error_log_type" value:"summary"} gauge:{value:0} with unregistered descriptor Desc{fqName: "smartctl_device_error_log_count", help: "Device SMART error log count", constLabels: {}, variableLabels: {device,error_log_type}}
```

Or sometimes even: `107 error(s) occurred`.
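For context, `prometheus.DescribeByCollect` derives the descriptors by running a full `Collect` pass. The sketch below paraphrases what the helper does in client_golang (the function name and package here are mine, not the exporter's code); it illustrates why a `Collect` pass that loses a timed-out device can leave the pedantic registry with an incomplete descriptor set:

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// describeByCollect paraphrases prometheus.DescribeByCollect: it runs the
// collector's Collect once and forwards each returned metric's descriptor.
// If a slow or timed-out device makes this pass return fewer metrics than a
// later scrape does, the pedantic registry never learns about the missing
// descriptors and reports "unregistered descriptor" for those series.
func describeByCollect(c prometheus.Collector, descs chan<- *prometheus.Desc) {
	metrics := make(chan prometheus.Metric)
	go func() {
		c.Collect(metrics)
		close(metrics)
	}()
	for m := range metrics {
		descs <- m.Desc()
	}
}
```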
I suspect that maybe `Describe` should try to take the collector lock to prevent this race, but I'm not completely sure of this. In the meantime, I locally patched the collector to use `NewRegistry` instead of `NewPedanticRegistry`, which at least makes the race harmless.
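For illustration, the workaround amounts to swapping the registry constructor where the collector is registered. This is a minimal, self-contained sketch; `stubCollector` and the wiring in `main` are placeholders standing in for `SMARTctlManagerCollector` and the exporter's actual setup:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// stubCollector stands in for SMARTctlManagerCollector: like the exporter,
// it derives its descriptors from Collect via DescribeByCollect.
type stubCollector struct{}

func (c stubCollector) Describe(ch chan<- *prometheus.Desc) {
	prometheus.DescribeByCollect(c, ch)
}

func (c stubCollector) Collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(
		prometheus.NewDesc("smartctl_device_interface_speed",
			"Device interface speed, bits per second",
			[]string{"device", "speed_type"}, nil),
		prometheus.GaugeValue, 6e9, "sda", "max",
	)
}

func main() {
	// The local patch: NewRegistry instead of NewPedanticRegistry, so collected
	// metrics are no longer cross-checked against the descriptors seen at
	// registration time, and the Describe/Collect race no longer fails the scrape.
	reg := prometheus.NewRegistry() // was: prometheus.NewPedanticRegistry()
	reg.MustRegister(stubCollector{})

	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9633", nil))
}
```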