Skip to content

Conversation

@sholeksandr
Copy link
Contributor

  1. Optimized memory usage.
  2. Fixed potential memory leaks.
  3. Add memory usage dump
  4. Add memory usage detailed trace

Example memory usage on MSN3800

Before:
VIRT = 252336
RES = 23464

After:
VIRT = 177364
RES = 24492

Bug: 4665041

1. Optimized memory usage.
2. Fixed potential memory leaks.
3. Add memory usage dump
4. Add memory usage detailed trace

Example memory usage on MSN3800

Before:
VIRT = 252336
RES = 23464

After:
VIRT = 177364
RES = 24492

Bug: 4665041

Signed-off-by: Oleksandr Shamray <[email protected]>
@greptile-apps
Copy link

greptile-apps bot commented Nov 14, 2025

Greptile Overview

Greptile Summary

This PR implements comprehensive memory optimization for the ThermalControl system, reducing virtual memory usage by ~75KB (from 252KB to 177KB on MSN3800). The changes address critical memory leaks and add memory profiling capabilities.

Key optimizations:

  • Fixed subprocess zombie process leaks by adding _terminate_proc() helper with proper cleanup in finally blocks for mlxreg and i2cset commands
  • Replaced list reassignments (self.fault_list = []) with .clear() method to reuse list objects and reduce allocations
  • Added bounded error counter dictionaries with automatic cleanup of zero-count entries to prevent unbounded growth
  • Implemented __del__ destructor in system_device class for proper cleanup
  • Added comprehensive memory profiling infrastructure (ObjectSnapshot, compare_snapshots) in hw_management_lib.py
  • Integrated psutil-based memory monitoring with configurable alerts and detailed growth tracking

Issues found:

  • _i2c_set() method has conflicting timeout logic where the outer retry loop uses a timeout parameter but subprocess.run uses a hardcoded 2.0s timeout, potentially causing total wait times to far exceed intended limits (affects both thermal_control.py and thermal_control_2_5.py)

Code quality improvements:

  • Migrated from shell=True to shell=False subprocess calls with explicit argument lists (security & reliability)
  • Removed unused imports (Timer, pdb) and obsolete constants (MLXREG_SET_CMD_STR, MLXREG_GET_CMD_STR)
  • Enhanced thread naming in RepeatedTimer for better debugging
  • Improved error handling with proper exception catching for process termination

Confidence Score: 4/5

  • This PR is mostly safe to merge with one logical issue in timeout handling that should be fixed
  • The PR successfully addresses critical memory leaks (subprocess zombies) and implements solid memory optimization techniques. The subprocess cleanup fix is essential and well-implemented. However, the _i2c_set() timeout logic issue could cause operations to take significantly longer than intended, which in a thermal control system could affect responsiveness. The memory profiling additions are purely observational and safe. Once the timeout issue is resolved, this would be a 5/5.
  • The _i2c_set() method in both usr/usr/bin/hw_management_thermal_control.py and usr/usr/bin/hw_management_thermal_control_2_5.py needs timeout logic correction to prevent unexpectedly long operation times

Important Files Changed

File Analysis

Filename Score Overview
usr/usr/bin/hw_management_lib.py 5/5 Added memory profiling utilities (ObjectSnapshot, compare_snapshots, print_comparison) and enhanced RepeatedTimer with thread naming for better debugging
usr/usr/bin/hw_management_thermal_control.py 4/5 Fixed subprocess zombie process leaks, optimized error counter dictionary with bounds checking, changed list resets from reassignment to .clear(), added memory usage monitoring and tracking
usr/usr/bin/hw_management_thermal_control_2_5.py 4/5 Applied identical memory optimization fixes as thermal_control.py: subprocess leak fixes, error counter bounds, list clearing, and memory monitoring

Sequence Diagram

sequenceDiagram
    participant TM as ThermalManagement
    participant SD as system_device
    participant PSU as psu_fan_sensor
    participant PROC as Subprocess
    participant MEM as Memory Profiler
    
    Note over TM: Memory Optimization Flow
    
    TM->>MEM: Initial snapshot (gmemory_snapshot_profiler)
    activate MEM
    MEM-->>TM: Baseline memory snapshot
    deactivate MEM
    
    loop Thermal Control Cycle
        TM->>SD: Poll sensors & handle errors
        SD->>SD: clear_fault_list() - Use .clear() not reassignment
        Note right of SD: Optimized: Reuses list objects
        
        alt Error counter growth
            SD->>SD: Check dict size >= warn_err_limit_entries
            SD->>SD: Remove zero-count entries
            Note right of SD: Memory leak prevention
        end
        
        PSU->>PSU: set_pwm() via i2cset
        PSU->>PROC: Popen(['i2cset', ...])
        activate PROC
        PROC-->>PSU: Result
        PSU->>PROC: _terminate_proc() in finally block
        deactivate PROC
        Note right of PSU: Fixed: Prevents zombie processes
        
        TM->>TM: write_pwm_mlxreg()
        TM->>PROC: Popen(['yes']) + Popen(['mlxreg'])
        activate PROC
        PROC-->>TM: PWM set
        TM->>PROC: _terminate_proc() for both processes
        deactivate PROC
        Note right of TM: Fixed: Ensures cleanup in finally
        
        alt Memory usage > threshold
            TM->>MEM: collect_snapshot(self)
            activate MEM
            MEM-->>TM: Current snapshot
            deactivate MEM
            TM->>MEM: compare_snapshots(old, new)
            MEM-->>TM: Memory growth report (JSON)
            TM->>TM: Log detailed memory comparison
            TM->>TM: Increase threshold by ALERT_STEP
        end
    end
    
    Note over TM,MEM: Result: ~75KB VIRT reduction, zombie leak fixed
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. usr/usr/bin/hw_management_thermal_control_2_5.py, line 1764-1780 (link)

    logic: Same timeout issue as in thermal_control.py: outer loop uses parameter timeout (2.0s) but subprocess.run has hardcoded timeout=2.0. With retries, total wait time could exceed intended timeout significantly.

3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +1716 to +1732
timestump = current_milli_time()
while current_milli_time() - timestump < timeout:
try:
# 1. check if i2c bus is available
if os.path.isfile("/sys/bus/i2c/devices/i2c-{}/name".format(int(bus))):
cmd = ['i2cset', '-f', '-y', str(bus), str(addr)] + arg_list
# 2. execute i2c command
result = subprocess.run(cmd, shell=False, timeout=2.0,
capture_output=True, check=True)
return result
else:
self.log.warn("I2C bus {} is not available. Try again in 100ms".format(bus))
time.sleep(0.1)
except (ValueError, TypeError, OSError, IOError):
self.log.warn("I2C bus {} is not available".format(bus))
time.sleep(0.1)
return None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The outer while loop uses parameter timeout (default 2.0s) but subprocess.run has hardcoded timeout=2.0. This could cause the total wait time to be much longer than intended if retries occur. If the outer timeout is 2.0s and subprocess timeout is also 2.0s, a single retry could take 4+ seconds total.

Consider using a shorter subprocess timeout or making the relationship between the two timeouts explicit, e.g.:

Suggested change
timestump = current_milli_time()
while current_milli_time() - timestump < timeout:
try:
# 1. check if i2c bus is available
if os.path.isfile("/sys/bus/i2c/devices/i2c-{}/name".format(int(bus))):
cmd = ['i2cset', '-f', '-y', str(bus), str(addr)] + arg_list
# 2. execute i2c command
result = subprocess.run(cmd, shell=False, timeout=2.0,
capture_output=True, check=True)
return result
else:
self.log.warn("I2C bus {} is not available. Try again in 100ms".format(bus))
time.sleep(0.1)
except (ValueError, TypeError, OSError, IOError):
self.log.warn("I2C bus {} is not available".format(bus))
time.sleep(0.1)
return None
timestump = current_milli_time()
subprocess_timeout = min(1.0, timeout / 2) # Use half of total timeout or 1s max
while current_milli_time() - timestump < timeout * 1000: # convert to ms
try:
# 1. check if i2c bus is available
if os.path.isfile("/sys/bus/i2c/devices/i2c-{}/name".format(int(bus))):
cmd = ['i2cset', '-f', '-y', str(bus), str(addr)] + arg_list
# 2. execute i2c command
result = subprocess.run(cmd, shell=False, timeout=subprocess_timeout,
capture_output=True, check=True)
return result
else:
self.log.warn("I2C bus {} is not available. Try again in 100ms".format(bus))
time.sleep(0.1)
except (ValueError, TypeError, OSError, IOError):
self.log.warn("I2C bus {} is not available".format(bus))
time.sleep(0.1)
return None

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants