Skip to content

Avoid crash on failure to parse int in emerson_temp discovery. #818

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

indigoparadox
Copy link
Contributor

Thank you for your interest in contributing to Checkmk!
Consider looking into Readme regarding process details.

General information

While performing a discovery on certain Emerson/Vertiv devices, they can return an SNMP value to the generic Emerson temperature discovery function that cannot be parsed as an integer, causing this ValueError:
Checkmk_Crash_kale_lab_54b25f28-30d6-11f0-bf2f-005056a12db3_2025-05-14_11-46-04.tar.gz

This results in a perpetural state of "unknown" for the discovery service, and an accumulation of crash dumps:
2025-05-14 11_27_34-Checkmk Local site kale_lab - Services of host PTMONH07_VertivRectifier — Mozill

Proposed changes

There may be a better way to fix this, but it's difficult for me to know if e.g. changing the match string would break the original piece of hardware this check was written for. Python doesn't seem to have a quick or elegant way to comprehensively see if a string can be parsed as an integer other than to try and catch it if it fails, so that's what I opted for, here.

Thank you for your consideration!

@mo-ki
Copy link
Member

mo-ki commented May 19, 2025

Hi!
Thank you for your contribution!
The problem with this approach is that it masks all other problems as well. Is there any change to deal with a value like IB2 Temp 1 properly?
If that can't be done, I suggest to not detect the affected devices at all, by setting

detect=all_of(
    startswith(".1.3.6.1.4.1.6302.2.1.1.1.0", "Emerson Network Power"),
    not_ ...   # some condition to exclude affected devices
)

Any chance you can come up with a good condition on which the affected devices are?

@indigoparadox
Copy link
Contributor Author

indigoparadox commented May 19, 2025

I agree that there are some pitfalls, so here's an alternate approach:

All of our devices are exhibiting the issue outlined above because they seem to be providing OIDs beyond those the check was originally designed for. These OIDs (1.3.6.1.4.1.6302.2.1.2.7.3.*) do not appear in any MIB I presently have access to (e.g. https://mibbrowser.online/mibdb_search.php?mib=EES-POWER-MIB), and looking closely at their values, it's difficult to tell if they're even providing valid data for temperatures above "Temperature 1".

With this being the case, I've rewritten the OID selection to explicitly confine it to 1.3.6.1.4.1.6302.2.1.2.7.1 and 1.3.6.1.4.1.6302.2.1.2.7.2, thereby limiting it to the (known good and valid) values the check was originally designed for and eliminating the crash when attempting to parse invalid data (because these OIDs return valid temperatures in a sane range on our devices).

What do you think?

Thanks!

@indigoparadox indigoparadox force-pushed the indigoparadox-emerson-temp branch from 9367c6d to 3570cf0 Compare May 19, 2025 14:30
@indigoparadox indigoparadox force-pushed the indigoparadox-emerson-temp branch from 3570cf0 to acff302 Compare May 19, 2025 14:31
@@ -40,7 +40,7 @@ def check_emerson_temp(item, params, info):


def parse_emerson_temp(string_table: StringTable) -> StringTable:
return string_table
return [[x] for x in string_table[0]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many ways of doing this, obviously, but this way a test is failing that ensures we can deal with an emtpy input. I suggest to leave line 52 alone, add a comment, and do this:

def parse_...
    # we only use at most the first two lines because....
    return string_table[:2]

@mo-ki
Copy link
Member

mo-ki commented May 23, 2025

I like this approach better.
I just realize that this plugin is not using the system object ID nor the system description in its detect spec. ~35 of our ~1000 plugins do that. It's unfortunate, because it means that one adiitional OID is fetched during every discovery of any of your hosts. Would you share your system description with us? Maybe this is an opportunity to improve this :-)

@indigoparadox
Copy link
Contributor Author

Certainly, the approach you outline in your review does seem to work well on our test device, so let's go with that.

The sysOID is: .1.3.6.1.4.1.6302.2.1

I hesitated to change any of the detection logic in the (probably naive) assumption that the original author was using a weird method of detection for a reason and I didn't want to break anyone's existing hardware that I may not have access to (especially if it's already behaving differently from the hardware I do have access to). If changing that does break anything, though, hopefully someone would submit an appropriate fix with a comment to this effect. 😌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants