Fix cpuinfo init on Linux without CPU sysfs lists #28230
tianleiwu wants to merge 1 commit into microsoft:main
Pull request overview
Fixes ONNX Runtime startup failures on Linux ARM64 environments where /sys/devices/system/cpu/{possible,present} are unavailable by (1) making early cpuinfo-init logging safe before a default logger exists, and (2) patching the bundled pytorch/cpuinfo to fall back to sysconf(_SC_NPROCESSORS_ONLN) for both CPU counts and per-CPU present/possible flags.
Changes:
- Guard `LOGS_DEFAULT(...)` usage in `PosixEnv` so cpuinfo init failures won’t crash when logging hasn’t been initialized yet.
- Patch `pytorch/cpuinfo` Linux processor detection to provide robust sysfs-missing fallbacks (counts + flags).
- Add a standalone simulation script to validate the early-logging and sysfs-missing behaviors (incl. `LD_PRELOAD` sysfs hiding).
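The sysfs-missing fallback in the second bullet can be sketched in Python. This is only an illustration of the idea, not the patch itself (the patch is C code in the bundled cpuinfo): the helper names `parse_cpulist` and `detect_cpus` are hypothetical, and the sketch assumes that every CPU index below the online count is treated as present/possible when the sysfs list cannot be read.

```python
import os

SYSFS_CPU_DIR = "/sys/devices/system/cpu"


def parse_cpulist(text):
    """Parse a kernel cpulist such as '0-3,8' into a sorted list of CPU indices."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def online_cpu_count():
    """Return the online CPU count via sysconf where available (never less than 1)."""
    if hasattr(os, "sysconf") and "SC_NPROCESSORS_ONLN" in os.sysconf_names:
        count = os.sysconf("SC_NPROCESSORS_ONLN")
        if count > 0:
            return count
    return os.cpu_count() or 1


def detect_cpus(name, sysfs_dir=SYSFS_CPU_DIR):
    """Read the 'possible' or 'present' cpulist, falling back to 0..nproc-1.

    Assumption for this sketch: when the sysfs file is missing, every index
    below the online count is reported.
    """
    try:
        with open(os.path.join(sysfs_dir, name)) as f:
            return parse_cpulist(f.read())
    except OSError:
        return list(range(online_cpu_count()))
```

Calling `detect_cpus("possible", sysfs_dir="/nonexistent")` exercises the fallback path without needing a restricted sandbox.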
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| onnxruntime/core/platform/posix/env.cc | Avoids crashing during early PosixEnv construction by falling back to std::cerr when no default logger exists. |
| cmake/external/onnxruntime_external_deps.cmake | Wires in the new cpuinfo patch during FetchContent dependency setup (Linux + ARM64/ARM64EC patch flow). |
| cmake/patches/cpuinfo/fix_missing_sysfs_fallback.patch | Adds sysconf(_SC_NPROCESSORS_ONLN)-based fallbacks for max CPU count and present/possible flags when sysfs cpulists are missing. |
| onnxruntime/test/common/test_cpuinfo_sysfs_fallback.py | Adds a manual/simulation validation script (compiles small programs + LD_PRELOAD shim). |
```python
def test_safe_logging_pattern():
    """
    Test 1: Verify the safe logging pattern doesn't crash when no logger exists.

    This simulates the fix in env.cc where we check HasDefaultLogger() before
    calling LOGS_DEFAULT(). We compile a minimal C++ program that:
    - Does NOT register a default logger
    - Calls the safe logging pattern
    - Verifies it writes to stderr instead of crashing
    """
    print("=" * 60)
    print("Test 1: Safe logging pattern (no default logger)")
    print("=" * 60)

    source = textwrap.dedent(r"""
        #include <iostream>
        #include <string_view>

        // Minimal simulation of ORT's logging check pattern
        namespace logging {
        class LoggingManager {
        public:
            // Simulate: no default logger registered
            static bool HasDefaultLogger() { return false; }
        };
        }  // namespace logging

        void LogEarlyWarning(std::string_view message) {
            if (logging::LoggingManager::HasDefaultLogger()) {
                // Would call LOGS_DEFAULT(WARNING) here - but logger doesn't exist
                // This path should NOT be taken
                std::cerr << "BUG: should not reach here\n";
                return;
            }
            // Safe fallback to stderr
            std::cerr << "onnxruntime warning: " << message << "\n";
        }

        int main() {
            // This simulates what PosixEnv() does when cpuinfo_initialize() fails
            bool cpuinfo_available = false;  // Simulating failure
            if (!cpuinfo_available) {
                LogEarlyWarning("cpuinfo_initialize failed. "
                                "May cause CPU EP performance degradation due to undetected CPU features.");
            }
            std::cout << "PASS: Safe logging pattern works without crash\n";
            return 0;
        }
    """)

    with tempfile.NamedTemporaryFile(suffix=".cc", mode="w", delete=False) as f:
        f.write(source)
        src_path = f.name

    try:
        exe_path = src_path.replace(".cc", "")
        result = subprocess.run(
            ["g++", "-std=c++17", "-o", exe_path, src_path], check=False, capture_output=True, text=True
        )
        if result.returncode != 0:
            print(f"FAIL: Compilation failed: {result.stderr}")
            return False

        result = subprocess.run([exe_path], check=False, capture_output=True, text=True, timeout=10)
        if result.returncode != 0:
            print(f"FAIL: Program crashed with exit code {result.returncode}")
            print(f"stderr: {result.stderr}")
            return False

        if "PASS" in result.stdout:
            print(result.stdout.strip())
            print(f"stderr output (expected): {result.stderr.strip()}")
            return True
        print(f"FAIL: Unexpected output: {result.stdout}")
        return False
    finally:
```
This script defines module-level functions named test_* but they return booleans and rely on main()/print output rather than assertions. If this file is ever picked up by a test runner (e.g., pytest discovery), these will not behave as proper tests. Consider converting these to real pytest/unittest tests with assertions + skipping, or renaming/moving the script so it’s clearly an on-demand diagnostic and won’t be auto-collected.
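One way the suggested conversion could look is sketched below with the standard-library `unittest` module. It is not code from the PR: the class name `SafeLoggingPatternTest` and the reduced C++ program are hypothetical, and the sketch assumes that skipping is the desired behavior when the platform is not Linux or `g++` is not on `PATH`.

```python
import shutil
import subprocess
import sys
import tempfile
import textwrap
import unittest
from pathlib import Path

CXX = shutil.which("g++")  # None when no compiler is on PATH

# Reduced stand-in for the C++ program embedded in the script.
SOURCE = textwrap.dedent(r"""
    #include <iostream>
    int main() {
        std::cerr << "onnxruntime warning: cpuinfo_initialize failed.\n";
        std::cout << "PASS\n";
        return 0;
    }
""")


@unittest.skipUnless(sys.platform.startswith("linux"), "Linux-only simulation")
@unittest.skipUnless(CXX, "g++ not found on PATH")
class SafeLoggingPatternTest(unittest.TestCase):
    def test_falls_back_to_stderr(self):
        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp) / "early_log.cc"
            exe = Path(tmp) / "early_log"
            src.write_text(SOURCE)

            build = subprocess.run(
                [CXX, "-std=c++17", "-o", str(exe), str(src)],
                check=False, capture_output=True, text=True,
            )
            # Assertions replace the print-and-return-bool pattern.
            self.assertEqual(build.returncode, 0, build.stderr)

            run = subprocess.run(
                [str(exe)], check=False, capture_output=True, text=True, timeout=10
            )
            self.assertEqual(run.returncode, 0, run.stderr)
            self.assertIn("PASS", run.stdout)
            self.assertIn("onnxruntime warning", run.stderr)
```

Written this way, a runner such as `python -m unittest` reports failures and skips directly, and the temporary directory handles cleanup.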
```python
    try:
        exe_path = src_path.replace(".cc", "")
        result = subprocess.run(
            ["g++", "-std=c++17", "-o", exe_path, src_path], check=False, capture_output=True, text=True
        )
        if result.returncode != 0:
            print(f"FAIL: Compilation failed: {result.stderr}")
            return False

        result = subprocess.run([exe_path], check=False, capture_output=True, text=True, timeout=10)
        if result.returncode != 0:
            print(f"FAIL: Program crashed with exit code {result.returncode}")
```
The script assumes Linux + gcc/g++ + LD_PRELOAD availability, but it doesn’t guard for non-Linux platforms or for missing compilers/linker support. Adding explicit platform checks (e.g., sys.platform == 'linux') and tool detection (e.g., shutil.which('gcc')) with a clear SKIP message would make this more robust for developers running it in other environments.
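A guard along these lines might look like the following sketch. The function names `missing_prerequisites` and `main` and the exact SKIP wording are assumptions for illustration; only `sys.platform` and `shutil.which` come from the comment above.

```python
import shutil
import sys


def missing_prerequisites(tools=("gcc", "g++")):
    """Return human-readable reasons why the simulation cannot run here."""
    reasons = []
    if not sys.platform.startswith("linux"):
        reasons.append(f"requires Linux (LD_PRELOAD), got {sys.platform!r}")
    for tool in tools:
        if shutil.which(tool) is None:
            reasons.append(f"{tool} not found on PATH")
    return reasons


def main():
    reasons = missing_prerequisites()
    if reasons:
        # Report a skip and exit cleanly instead of failing mid-compilation.
        print("SKIP: " + "; ".join(reasons))
        return 0
    print("Prerequisites satisfied; running simulations")
    # The individual simulations would be invoked from here.
    return 0
```

Returning the reasons as a list keeps the check reusable: the same helper could feed a `unittest` skip decorator or a command-line SKIP message.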
```python
def get_ort_root():
    """Get the ORT repository root."""
    return os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
```
get_ort_root() is currently unused. If it’s not needed, please remove it to avoid dead code; if it’s intended for future integration, consider using it (or adding a brief comment explaining why it exists).
Suggested change:
```diff
-def get_ort_root():
-    """Get the ORT repository root."""
-    return os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
```
```python
    """
    Test 3: Verify LD_PRELOAD shim can hide sysfs files.

    This compiles a small shim that intercepts open/fopen to return ENOENT
```
The docstring says the shim “intercepts open/fopen”, but the shim only overrides fopen. Either update the comment to match the implementation, or also intercept open/open64 if that’s needed to simulate the real failure mode.
Suggested change:
```diff
-    This compiles a small shim that intercepts open/fopen to return ENOENT
+    This compiles a small shim that intercepts fopen to return ENOENT
```
```cmake
${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/win_arm_fp16_detection_fallback.patch &&
# https://github.com/microsoft/onnxruntime/issues/10038
${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/fix_missing_sysfs_fallback.patch
```
fix_missing_sysfs_fallback.patch modifies Linux-only sources (src/linux/processors.c), but it’s being applied unconditionally in the “Windows ARM64/ARM64EC” cpuinfo patch chain. That increases the chance of Windows builds breaking in the future if the Linux patch stops applying cleanly (even though the fix is Linux-specific). Consider applying this patch only under the Linux branch (or gating it by CMAKE_SYSTEM_NAME STREQUAL "Linux").
Suggested change:
```diff
-${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/win_arm_fp16_detection_fallback.patch &&
-# https://github.com/microsoft/onnxruntime/issues/10038
-${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/fix_missing_sysfs_fallback.patch
+${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/win_arm_fp16_detection_fallback.patch
```
Description
Fixes ONNX Runtime startup on Linux ARM64 environments where `/sys/devices/system/cpu/possible` and `/sys/devices/system/cpu/present` are unavailable, such as AWS Lambda ARM64/Graviton and restricted build sandboxes.

There are two related failure modes:
- `PosixEnv` may be constructed before ORT's default logger is registered. If `cpuinfo_initialize()` fails during that early construction path, the existing `LOGS_DEFAULT(INFO)` call can terminate with `Attempt to use DefaultLogger but none has been registered.`
- `pytorch/cpuinfo` code treats missing Linux CPU `possible`/`present` sysfs cpulists as fatal on ARM Linux. The max-count helpers return `UINT32_MAX`, which wraps to `0` after `1 + UINT32_MAX` in ARM Linux initialization and prevents cpuinfo from reaching the later `/proc/cpuinfo` and `getauxval()` based detection paths.

Root Cause
The immediate import crash is caused by unsafe early logging in `onnxruntime/core/platform/posix/env.cc`. Python bindings can reference `Env::Default()` during module load before logging is initialized, so a cpuinfo initialization failure must not use `LOGS_DEFAULT()` unless a default logger exists.

The cpuinfo initialization failure is more subtle. A count-only fallback is not enough: after cpuinfo computes max possible/present CPU counts, it calls `cpuinfo_linux_detect_possible_processors()` and `cpuinfo_linux_detect_present_processors()` to set `CPUINFO_LINUX_FLAG_POSSIBLE` and `CPUINFO_LINUX_FLAG_PRESENT` on each processor. ARM Linux initialization later marks processors valid only if those flags are set. If only the count fallback is provided, `valid_processors` can remain zero and cpuinfo can proceed into an invalid partial initialization state.

Fix
- Make `PosixEnv` logging safe when cpuinfo initialization fails before a default logger exists:
  - check `logging::LoggingManager::HasDefaultLogger()` before `LOGS_DEFAULT()`
  - fall back to `std::cerr` when no logger is registered
- Patch the bundled `pytorch/cpuinfo` so that missing sysfs cpulists fall back to `sysconf(_SC_NPROCESSORS_ONLN) - 1` for the max possible/present processor values, and flag processors `0..nproc-1` as present/possible
- Add a simulation script covering the `sysconf(_SC_NPROCESSORS_ONLN)` count and present/possible flag fallback behavior, and the hiding of `/sys/devices/system/cpu/{possible,present}` via `LD_PRELOAD`

Testing
Ran from a clean branch/worktree:
Result:
- (`onnxruntime.capi` not built/importable in this workspace)

Also validated the cpuinfo patch directly:
And syntax-checked the patched `src/linux/processors.c` in a temporary tree with cpuinfo headers.

A full ORT build was not completed in this workspace; a previous `build_cu128.sh` run was interrupted.

Related Issue
Fixes #10038.
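
The two cpuinfo pitfalls described under Description and Root Cause can be illustrated with a small Python simulation. It is a sketch under stated assumptions, not cpuinfo's actual code: the flag bit values are hypothetical placeholders for `CPUINFO_LINUX_FLAG_PRESENT` and `CPUINFO_LINUX_FLAG_POSSIBLE`, the helper names are invented, and a processor is assumed to count as valid only when both flags are set.

```python
UINT32_MAX = 0xFFFFFFFF

# Hypothetical bit values; the real CPUINFO_LINUX_FLAG_* constants may differ.
FLAG_PRESENT = 0x1
FLAG_POSSIBLE = 0x2
REQUIRED = FLAG_PRESENT | FLAG_POSSIBLE


def u32(value):
    """Truncate a Python int to 32 bits, mimicking C uint32_t arithmetic."""
    return value & UINT32_MAX


def processor_array_size(max_index):
    """Size computed as 1 + max index, using uint32_t wraparound."""
    return u32(1 + max_index)


def count_valid(flags):
    """Count processors whose flags include both PRESENT and POSSIBLE."""
    return sum(1 for f in flags if (f & REQUIRED) == REQUIRED)


def count_only_fallback(nproc):
    """Fallback that fixes the count but leaves every per-CPU flag cleared."""
    return [0] * nproc


def count_and_flag_fallback(nproc):
    """Fallback that also marks CPUs 0..nproc-1 as present and possible."""
    return [REQUIRED] * nproc
```

With these definitions, `processor_array_size(UINT32_MAX)` evaluates to `0`, which is the wraparound that stops initialization early. `count_valid(count_only_fallback(4))` is `0` while `count_valid(count_and_flag_fallback(4))` is `4`, which is why the patch supplies flags as well as counts.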