Support device vs gpu http mode #626

Open
YqGe585 wants to merge 39 commits into PFCCLab:main from YqGe585:accuracy

Conversation

@YqGe585
Contributor

@YqGe585 YqGe585 commented Apr 21, 2026

Support device vs gpu http mode

YqGe585 and others added 18 commits April 21, 2026 19:33
1. Fix widespread false positives caused by inconsistent random seeds
   In HTTP mode the local device did not set the random seed before
   _run_paddle, while the server always calls np.random.seed(random_seed)
   unconditionally. The two sides therefore generated different input
   data, and APIs such as clone reported spurious accuracy errors with
   max_abs_diff=33160.
   Fix: set the seed in _test_http_mode before calling _run_paddle.

2. Fix np.nanmax ValueError on empty tensors
   _print_diff called np.nanmax on size-0 arrays, which raises
   "zero-size array to reduction operation fmax which has no identity".
   Fix: return (0.0, 0.0) early from _print_diff for empty arrays.

3. Add the special_compare framework and skip registration for
   non-deterministic APIs
   - New tester/special_compare/ module: custom forward/backward
     comparison functions can be registered per API; submodules are
     auto-discovered, so the main file needs no changes.
   - Register argsort: gather the original values through the indices
     instead of comparing indices directly, removing false positives
     caused by different tie-breaking.
   - Register empty, empty_like and multinomial as skip: their outputs
     are inherently non-deterministic and should not be
     accuracy-compared.
   - log_writer: add support for the skip log type.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
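The empty-tensor guard described in item 2 can be sketched as follows; the function name and signature are illustrative, not the project's actual `_print_diff` API:

```python
import numpy as np

def max_diff(a: np.ndarray, b: np.ndarray) -> tuple[float, float]:
    """Sketch of the guard: return (max_abs_diff, max_rel_diff)."""
    if a.size == 0 or b.size == 0:
        # np.nanmax on a zero-size array raises "zero-size array to
        # reduction operation fmax which has no identity"
        return 0.0, 0.0
    abs_diff = np.abs(a - b)
    rel_diff = abs_diff / (np.abs(b) + 1e-12)
    return float(np.nanmax(abs_diff)), float(np.nanmax(rel_diff))
```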
aggregate_logs(end=True) wrote api_config_skip.txt in "w" mode at the end,
completely overwriting the entries previously aggregated via
write_to_log("skip", ...) (such as multinomial). Switch to "a" append mode
and correct the count to the sum of the existing skip entries and the new
difference set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…PU 0

When the pebble ProcessPool creates workers via spawn, the http_server
module is re-imported. The module-level `import paddle` then runs before
init_server_worker sets CUDA_VISIBLE_DEVICES, so Paddle's CUDA context is
always initialized on GPU 0 instead of the assigned GPU 6/7.

Fix: remove the module-level `import paddle` and have init_server_worker
set CUDA_VISIBLE_DEVICES first and import paddle afterwards, so the CUDA
context is created on the correct GPU.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
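The set-env-then-import pattern the fix relies on can be sketched as below; the helper name and `framework` parameter are illustrative, not the project's actual worker initializer:

```python
import importlib
import os

def init_server_worker(gpu_id: int, framework: str = "paddle"):
    """Sketch: the env var must be set before the framework is first
    imported, or the CUDA context lands on GPU 0."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # Deferred import — with no module-level `import paddle` anywhere in
    # the worker module, the spawn re-import cannot run it early.
    return importlib.import_module(framework)
```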
- base.py need_skip(): guard the float8 check with `not paddle_only`; when
  paddle_only=True, float8 is not skipped (Paddle supports it natively);
  torch_vs_paddle mode is unaffected
- paddle_device_vs_gpu.py: add _fill_float8_paddle_inputs(), which after
  gen_paddle_input() replaces the None float8 tensors left by
  config_analyzer with real float8 tensors (generated as float32, then
  paddle.cast); shared code is not modified
- paddle_device_vs_gpu.py _run_paddle(): call need_skip(paddle_only=True)
  at the entry point, filtering unsupported cases such as sparse while
  keeping float8
- http_server.py: add _SkippedError; skip cases return 422 instead of 500,
  and the client writes a skip log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- base.py gen_paddle_output_and_output_grad(): for float8 dtypes, generate
  float32 numpy first and then paddle.cast (as for bfloat16), avoiding the
  TypeError from numpy not recognizing float8_e4m3fn
- paddle_device_vs_gpu.py: raise _PaddleSkipError when need_skip fires
  instead of returning (None, None), clearly separating skip from
  paddle_error
- http_server.py: run_single_api catches _PaddleSkipError and re-raises it
  as a RuntimeError prefixed with __SKIP__:; the handler uses the prefix
  to distinguish skip (422) from paddle_error (500); remove the incorrect
  _SkippedError class

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… logging

- Override need_skip() in DeviceVsGPU to skip float8 cases on XPU
  (XPU cast_kernel cannot create float8 tensors via float32->cast path)
- Add _has_float8_dtype() helper with slice-safe check to avoid
  unhashable type error on __getitem__/__setitem__ slice args
- Add early skip check in _test_http_mode() before sending HTTP request
- Fix backward exception block to write accuracy_error log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
embedding(sparse=True) backward produces SelectedRows sparse gradients.
paddle.save() cannot serialize sparse Tensors, causing HTTP 500 with no
[paddle gpu error] log (exception occurs outside _run_paddle's try/except).

After paddle.grad(), convert any sparse Tensor in the grad list to dense
via .to_dense() so serialization succeeds. Mathematically equivalent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e=True)

Previous fix used g.is_sparse() and g.to_dense() which both fail for
SelectedRows:
- is_sparse() returns False (SelectedRows is not SparseCoo/SparseCsr)
- to_dense() causes Segfault in Paddle's C++ layer

Correct approach:
- Detect SelectedRows: Tensor where is_dense(), is_sparse(), is_sparse_coo(),
  and is_sparse_csr() are all False
- Convert via numpy() + paddle.to_tensor() which works correctly on SelectedRows

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
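The all-predicates-False heuristic can be sketched as a duck-typed check; only the four predicate names come from the commit, the helper itself is hypothetical:

```python
def is_selected_rows(grad) -> bool:
    """Heuristic from the commit: a SelectedRows gradient answers False
    to every layout predicate a dense or sparse Tensor would satisfy."""
    return not (grad.is_dense() or grad.is_sparse()
                or grad.is_sparse_coo() or grad.is_sparse_csr())
```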
paddle.save does not support named tuple types (CummaxRetType,
CumminRetType, TopKRetType, etc.). Convert them to plain tuple
recursively before serialization so cummax/cummin/topk cases
no longer fail with HTTP 500 paddle_error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When input tensor has a zero dimension, randint upper bound becomes 0,
causing 'high <= 0' crash in numpy. Fall back to zeros tensor instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
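A minimal sketch of the fallback, with an illustrative function name:

```python
import numpy as np

def safe_randint(high: int, shape: tuple) -> np.ndarray:
    """When a zero dimension makes the upper bound 0, np.random.randint
    raises 'high <= 0'; fall back to a zeros tensor instead."""
    if high <= 0:
        return np.zeros(shape, dtype=np.int64)
    return np.random.randint(0, high, size=shape)
```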
numpy.finfo() only accepts inexact (floating-point) types. When the
tensor dtype is an integer (e.g. int64), calling numpy.finfo() raises
ValueError: data type not inexact.

Fix by selecting numpy.iinfo() for integer dtypes and numpy.finfo()
for floating-point dtypes when computing the safe value range for
pow/rpow API cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
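The iinfo/finfo selection can be sketched as below (illustrative helper name):

```python
import numpy as np

def value_range(dtype) -> tuple:
    """Pick np.iinfo for integer dtypes and np.finfo for floating dtypes;
    np.finfo(np.int64) would raise 'data type ... not inexact'."""
    dt = np.dtype(dtype)
    info = np.iinfo(dt) if np.issubdtype(dt, np.integer) else np.finfo(dt)
    return info.min, info.max
```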
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…failures

- http_server.py: return "remote_error" instead of "paddle_error" in HTTP response;
  use tester._last_error to propagate real Paddle exception into detail field
- paddle_device_vs_gpu.py: store exception in self._last_error in _run_paddle;
  update _test_http_mode to handle "remote_error" type from server
- log_writer.py: register "remote_error" -> "api_config_remote_error" and add
  to fail_case summary in print_log_info

Remote GPU failures now go to api_config_remote_error.txt with real error detail
visible in log_inorder.log; local XPU failures remain in api_config_paddle_error.txt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@paddle-bot

paddle-bot Bot commented Apr 21, 2026

Thanks for your contribution!

YqGe585 and others added 11 commits April 22, 2026 11:23
…orted as remote_error

These APIs return None by design (in-place mutation). Return the
modified tensor(s) instead so downstream serialization and accuracy
comparison can proceed normally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously, HTTP network errors (server crash / timeout) silently
dropped affected cases. aggregate_logs() would then misclassify them
as skip, making ~1400 cases invisible in the last full run.

Now write_to_log("network_error", ...) persists them to
api_config_network_error.txt so they are visible and easy to re-run.
checkpoint is intentionally NOT written, preserving the re-runnable
semantics for transient network failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All conditions in base.need_skip() are Torch-specific (sparse has no
Torch counterpart, prod multi-axis / torch_error_skip / float8 dtype are
all Torch-side limitations). Since Device vs GPU compares Paddle on XPU
against Paddle on GPU with no Torch involvement, calling super() was
causing 246 sparse API cases to be silently skipped.

Remove the super() call entirely and keep only the one real hardware
constraint in this mode: XPU cannot create float8 tensors via the
float32→cast path.

Paddle vs Torch and all other modes are unaffected — they do not
inherit APITestPaddleDeviceVSGPU.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
XPU does not have sparse kernels, so sparse API cases should be
skipped on XPU (same as float8) rather than attempting HTTP comparison
with the GPU server.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
These sparse-related Tensor methods don't carry "sparse" in their name
but still require sparse kernel support that XPU lacks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Enable paddle.amp.auto_cast() in APITestPaddleDeviceVSGPU._run_paddle
and propagate test_amp through the HTTP payload so both the local XPU
side and the remote GPU server side run under the same AMP context.

- tester/paddle_device_vs_gpu.py: wrap paddle_api call with auto_cast
  when test_amp is True; add test_amp field to HTTP request payload
- tester/http_server.py: read test_amp from request JSON, pass it to
  run_single_api and the tester instance
- engineV2.py: include test_amp in kwargs for custom_device_vs_gpu mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tester/http_server.py:
  - Add --admin_token CLI arg; if set, enables /admin/* endpoints
  - POST /admin/upload_file: receive a file (path + content) and write
    it into REPO_ROOT, with path-traversal protection
  - POST /admin/restart: send response then os.execv() restart in a
    background thread, preserving original argv
  - _check_admin_token(): common auth guard using secrets.compare_digest
  - Refactor do_POST into _handle_run_api_test() + new admin handlers
- scripts/sync_watch.py (new): local watchdog-based watcher that detects
  .py file changes, uploads them via /admin/upload_file, triggers restart
  via /admin/restart, then polls /health until the server is ready

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- README.md: add scripts/ and tester/http_server.py to project structure tree
- engineV2-README.md: add --admin_token to http_server parameter table;
  add a "Remote code sync (sync_watch)" section explaining sync_watch.py usage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add an "AMP mode" subsection under the HTTP comparison section explaining
that --test_amp=True synchronises paddle.amp.auto_cast() on both the local
device side and the remote GPU server side, with an example command.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude Code's Edit tool uses atomic writes: content is written to a
temp file (e.g. http_server.py.tmp.xxxxxx) then renamed to the target
via rename(). This generates a watchdog "moved" event on the dest path,
not a "modified" event, so changes were silently ignored.

Add on_moved handler that enqueues event.dest_path, fixing sync for
any editor that uses atomic/safe-write mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`print_log_info` computed `skip_case` from only numpy_error/torch_error/
paddle_to_torch_failed/match_error, missing the 'skip' log type that
write_to_log("skip", ...) actually uses. In Device-vs-GPU HTTP mode the
four legacy types are all 0, so the summary showed "Skipped cases: 0"
while Log Type Breakdown correctly showed "skip: 1917".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
YqGe585 and others added 10 commits April 22, 2026 20:31
XPU does not support complex128 in cast_kernel, tensor memory
allocation, or gradient accumulation. Add _has_complex128() to detect
complex128 in all arg forms (TensorConfig, string, Dtype() enum,
complex() literal) and skip the config unconditionally on XPU.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs in tester/special_compare/argsort.py caused all paddle.argsort
cases to be mis-reported as accuracy_error:

1. When the config uses keyword argument form (x=Tensor(...)), the input
   tensor is in tester.paddle_kwargs["x"] rather than tester.paddle_args[0].
   Fix: fall back to paddle_kwargs["x"] when paddle_args is empty.

2. When the input is a 0-dim tensor (Tensor([], dtype)), input_np.ndim==0
   so axis = -1 + 0 = -1 remains negative, causing np.take_along_axis to
   raise an out-of-bounds error.  Fix: early-return with a direct index
   comparison when ndim==0.

Verified: all 100 paddle.argsort cases in all_config.txt now pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
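The tie-tolerant argsort check, including the 0-dim early return, can be sketched as below (illustrative helper, not the project's exact special_compare function):

```python
import numpy as np

def argsort_indices_equivalent(x, idx_a, idx_b, axis=-1):
    """Two index tensors are accepted when gathering the original values
    through them yields identical sorted sequences, so differing but
    valid tie-breaking passes."""
    if x.ndim == 0:
        # 0-dim input: np.take_along_axis cannot be applied; compare
        # the indices directly instead
        return np.array_equal(idx_a, idx_b)
    va = np.take_along_axis(x, idx_a, axis=axis)
    vb = np.take_along_axis(x, idx_b, axis=axis)
    return np.array_equal(va, vb)
```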
…P mode

paddle.autograd.Jacobian/Hessian are lazy evaluation objects. In HTTP mode,
the server-side _normalize() and client-side comparison logic both called
paddle.save on the raw lazy object, which pickle-failed with a cryptic error.

Fix: add isinstance(obj, Jacobian) check in _normalize() (http_server.py) and
mirror it in the new _normalize_output() static method (paddle_device_vs_gpu.py),
calling obj[:] to trigger full evaluation and return a plain Tensor before saving.

Verified: all 8 hessian/jacobian cases from all_config.txt now pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Register custom forward/backward comparison functions for six APIs whose
accuracy errors are caused by valid but different tie-breaking choices between
XPU and GPU, not genuine precision bugs.

- sort.py: paddle.sort / Tensor.sort — compare sorted values only (forward);
  sort both dx arrays along sort axis before comparing (backward)

- topk.py: paddle.topk / Tensor.topk — sort both value outputs before
  comparing (handles sorted=False); sort both dx arrays (backward)

- reduce_max_min.py: paddle.max / Tensor.max / Tensor.min — compare values
  only, skip indices (forward); verify all nonzero grads land on valid tied
  positions rather than comparing values directly, since XPU and GPU implement
  different but valid subgradients for tied elements (backward)

- max_pool.py: nn.functional.max_pool1d/2d — compare pooled values only,
  skip return_mask (forward); sort both dx arrays (backward)

- grid_sample.py: nn.functional.grid_sample — use tester.atol/rtol for
  nearest-mode forward; sort both dx arrays for nearest-mode backward;
  use tester.atol/rtol for bilinear/bicubic backward accumulation differences

- roi_align.py: vision.ops.roi_align — relax backward atol by dtype when
  aligned=True (float64→0.15, float32→1e-3) to accommodate atomic-add
  ordering differences in gradient accumulation

All previously-failing tie-breaking cases now pass. Remaining errors in each
API are confirmed genuine XPU/GPU kernel bugs or missing kernel registrations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
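The sort-both-dx-arrays idea shared by the sort/topk/max_pool rules above can be sketched as (illustrative helper):

```python
import numpy as np

def grads_match_up_to_ties(dx_a, dx_b, axis=-1, atol=0.0, rtol=1e-7):
    """Tie-breaking may route a gradient to any of the tied positions,
    so compare the multisets of values by sorting both arrays along the
    sort axis before the elementwise check."""
    return np.allclose(np.sort(dx_a, axis=axis),
                       np.sort(dx_b, axis=axis), atol=atol, rtol=rtol)
```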
Register 22 random/non-deterministic APIs in special_compare so that
XPU vs GPU comparisons are correctly skipped or conditionally handled:

Unconditional skip (17 APIs):
- paddle.normal, standard_normal, log_normal, poisson, bernoulli,
  standard_gamma, binomial
- paddle.Tensor.normal_, exponential_, cauchy_, geometric_,
  log_normal_, bernoulli_
- paddle.nn.functional.gumbel_softmax
- paddle.geometric.sample_neighbors
- paddle.nn.functional.fractional_max_pool2d/3d

Conditional skip — only when training=True (default):
- paddle.nn.functional.dropout/dropout2d/dropout3d/alpha_dropout
  (training=False cases are still accuracy-checked)

Conditional skip — only when training=True AND p>0:
- paddle.incubate.nn.functional.fused_dropout_add
  (training=False or p=0.0 cases are still accuracy-checked)

Verified end-to-end against GPU server: 44/52 sampled cases correctly
skipped, 8 training=False cases correctly passed accuracy check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
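A miniature of the conditional-skip registry described above; the registry layout, the predicate shape, and the fused_dropout_add default `p` are assumptions, not the project's actual module:

```python
# Hypothetical miniature of the special_compare skip registry.
_SKIP_RULES = {}

def register_skip(api_name, predicate=None):
    # predicate(kwargs) -> True means "skip this config"; None = always skip
    _SKIP_RULES[api_name] = predicate or (lambda kwargs: True)

def should_skip(api_name, kwargs) -> bool:
    rule = _SKIP_RULES.get(api_name)
    return bool(rule and rule(kwargs))

register_skip("paddle.normal")  # unconditionally non-deterministic
register_skip("paddle.nn.functional.dropout",
              lambda kw: kw.get("training", True))  # training defaults True
register_skip("paddle.incubate.nn.functional.fused_dropout_add",
              lambda kw: kw.get("training", True) and kw.get("p", 0.5) > 0)
```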
- paddle.linalg.eigh: compare |eigenvectors| instead of eigenvectors
  directly to handle the valid v vs -v sign freedom; eigenvalues compared
  normally (no ambiguity)
- paddle.linalg.svd: compare |U| and |Vh| for forward; skip backward
  because sign ambiguity propagates into the gradient of x
- paddle.linalg.svd_lowrank: unconditional skip (randomized algorithm —
  singular values themselves differ between XPU/GPU RNG)

Verified over 28 cases from all_config.txt:
  eigh 8/8 pass, svd float64 forward all pass, svd_lowrank all skip

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
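The sign-freedom-tolerant eigenvector check can be sketched as below (illustrative helper; magnitude comparison handles v vs -v but not degenerate-eigenvalue rotations):

```python
import numpy as np

def eigvecs_match(va, vb, atol=1e-8):
    """v and -v are both valid eigenvectors for the same eigenvalue,
    so compare columnwise magnitudes instead of raw values."""
    return np.allclose(np.abs(va), np.abs(vb), atol=atol)
```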
…port

- engineV2.py: add --enable_api_kernel_fallback CLI flag (default False);
  when set with --custom_device_vs_gpu, sets FLAGS_enable_api_kernel_fallback=1
  on the local process only — remote GPU server is unaffected
- tester/paddle_device_vs_gpu.py: override need_check_grad() to skip backward
  for dropout/fused_dropout_add with training=False on XPU, preventing
  (InvalidArgument) GradOp is only callable when is_test is false
- tester/http_server.py: add /admin/delete_file endpoint that removes the
  target .py and its __pycache__ .pyc to prevent stale import residue
- scripts/sync_watch.py: add on_deleted handler and _delete_file() to sync
  local file deletions to the remote server via /admin/delete_file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. arange step tensor: fix NameError caused by using `step_config` instead
   of `step_val` when regenerating the step tensor for int-dtype output
   (paddle.arange with float step Tensor and int dtype argument)

2. pow get_base_max: fix ZeroDivisionError when exponent == 1 (ln(1) == 0).
   When value == 1, x^1 == x so there is no overflow constraint; return
   default_max directly instead of dividing by zero.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
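The commit does not show the exact formula, but the ln(1) == 0 division it describes matches an ln-ratio range helper of roughly this shape (hypothetical name and form):

```python
import math

def exponent_max(base: float, default_max: float = 1e4) -> float:
    """Largest exponent e with base**e <= default_max, i.e.
    ln(default_max) / ln(base) — which divides by zero at base == 1.
    Since 1**e == 1 for all e, there is no overflow constraint there,
    so return default_max directly."""
    if base == 1:
        return default_max
    return math.log(default_max) / math.log(base)
```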
Previously, --gpu_ids only took effect in file mode (via init_worker_gpu
which sets CUDA_VISIBLE_DEVICES before importing paddle). In single-config
mode the value was silently ignored, always falling back to the default
device (xpu:0).

Set CUDA_VISIBLE_DEVICES early in main() so both modes behave consistently.
File mode is unaffected since init_worker_gpu overrides the value per-worker
before importing paddle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously _has_complex128() returned True for ALL Python complex scalars,
causing ~98 test cases to be incorrectly skipped on XPU. The fix narrows
the skip condition to two precise cases:
1. Config contains a tensor with explicit complex128 dtype
2. Config has a Python complex scalar AND a float64 tensor (Paddle promotes
   this combination to complex128, which XPU cannot handle)

complex scalar + float32/bfloat16/int* tensor promotes to complex64,
which XPU supports — those cases now proceed to normal testing.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>