
Sbosisio/fix nsys jax #5036

Triggered via pull request on November 14, 2025 at 17:09
Status: Failure
Total duration: 1d 0h 55m 28s
Artifacts: 45

ci.yaml

on: pull_request
metadata (3s)
bump-manifest (16s)
Matrix: amd64 / test-distribution
Matrix: arm64 / test-distribution
amd64 / build-base / build-base (2m 36s)
arm64 / build-base / build-base (3m 10s)
amd64 / test-nccl / build-mpi-operator-compatible-base (3m 42s)
amd64 / test-nccl / nccl-test-gke / build-nccl-gke (1m 52s)
arm64 / test-nccl / build-mpi-operator-compatible-base
arm64 / test-nccl / nccl-test-gke / build-nccl-gke
Matrix: amd64 / test-jax-cutlass-h100 / jax-cutlass-test-h100
Matrix: amd64 / test-jax / run-unit-test
Matrix: amd64 / test-te-a100 / run-unit-test
Matrix: amd64 / test-te-h100 / te-test-h100
amd64 / build-torchax (6m 48s)
amd64 / test-jax / runner / launch-slurm-runner (28m 56s)
amd64 / test-nsys-jax-eks (30m 8s)
amd64 / test-te-a100 / runner / launch-slurm-runner (2h 28m)
amd64 / build-upstream-t5x (8m 54s)
Matrix: amd64 / test-nsys-jax / run-unit-test
amd64 / test-nsys-jax / runner / launch-slurm-runner (2h 41m)
Matrix: amd64 / test-nccl / nccl-test
Matrix: amd64 / test-nccl / nccl-test-gke / nccl-gke
Matrix: arm64 / test-jax-cutlass-h100 / jax-cutlass-test-h100 (waiting for pending jobs)
Matrix: arm64 / test-jax / run-unit-test (waiting for pending jobs)
Matrix: arm64 / test-te-a100 / run-unit-test (waiting for pending jobs)
Matrix: arm64 / test-te-h100 / te-test-h100 (waiting for pending jobs)
arm64 / build-torchax (7m 52s)
arm64 / test-nsys-jax-eks (0s)
arm64 / test-jax / runner / launch-slurm-runner
arm64 / test-te-a100 / runner / launch-slurm-runner
arm64 / build-upstream-t5x (9m 18s)
Matrix: arm64 / test-nsys-jax / run-unit-test (waiting for pending jobs)
arm64 / test-nsys-jax / runner / launch-slurm-runner
Matrix: arm64 / test-nccl / nccl-test (waiting for pending jobs)
Matrix: arm64 / test-nccl / nccl-test-gke / nccl-gke (waiting for pending jobs)
amd64 / test-maxtext-gke / maxtext-gke-xpk (1d 0h)
Matrix: amd64 / test-maxtext / maxtext-multinode
Matrix: amd64 / test-maxtext / single-process-multi-device
amd64 / build-rosetta-t5x / build-rosetta (15m 30s)
amd64 / test-axlearn-eks (17m 27s)
amd64 / test-axlearn-fuji-models-eks (5m 39s)
Matrix: amd64 / test-nsys-jax-archive
arm64 / test-maxtext-gke / maxtext-gke-xpk
Matrix: arm64 / test-maxtext / maxtext-multinode (waiting for pending jobs)
Matrix: arm64 / test-maxtext / single-process-multi-device (waiting for pending jobs)
arm64 / build-rosetta-t5x / build-rosetta (16m 44s)
arm64 / test-axlearn-eks (0s)
arm64 / test-axlearn-fuji-models-eks (0s)
Matrix: arm64 / test-nsys-jax-archive
amd64 / test-maxtext / test-maxtext-metrics (22s)
amd64 / collect-docker-tags (3s)
Matrix: amd64 / test-rosetta-t5x / vit-multi-gpu-multi-node
arm64 / test-maxtext / test-maxtext-metrics
arm64 / collect-docker-tags (3s)
Matrix: arm64 / test-rosetta-t5x / vit-multi-gpu-multi-node (waiting for pending jobs)
amd64 / test-maxtext / test-maxtext-sitrep / sitrep (8s)
amd64 / test-rosetta-t5x / test-t5x-rosetta-summary (3s)
amd64 / test-rosetta-t5x / test-t5x-rosetta-metrics (17s)
arm64 / test-maxtext / test-maxtext-sitrep / sitrep
arm64 / test-rosetta-t5x / test-t5x-rosetta-summary
arm64 / test-rosetta-t5x / test-t5x-rosetta-metrics
amd64 / test-maxtext / test-maxtext-outcome (2s)
amd64 / test-rosetta-t5x / test-t5x-rosetta-sitrep / sitrep (14s)
arm64 / test-maxtext / test-maxtext-outcome
arm64 / test-rosetta-t5x / test-t5x-rosetta-sitrep / sitrep
amd64 / test-rosetta-t5x / test-t5x-rosetta-outcome (2s)
arm64 / test-rosetta-t5x / test-t5x-rosetta-outcome
make-publish-configs (5s)
merge-new-manifest
Matrix: publish-containers
finalize / workflow-badge (4s)
finalize / report (10s)
finalize / upload-badge (10s)
finalize / publish-badge (4s)

Annotations

11 errors
amd64 / test-te-h100 / te-test-h100 (unittest, 8)
Process completed with exit code 1.
amd64 / test-te-a100 / te-A100-unit-test
The self-hosted runner lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
amd64 / test-nsys-jax / nsys-jax-A100-unit-test
Process completed with exit code 1.
amd64 / test-maxtext / test-maxtext-outcome
Process completed with exit code 1.
amd64 / test-rosetta-t5x / test-t5x-rosetta-metrics
Process completed with exit code 1.
amd64 / test-rosetta-t5x / test-t5x-rosetta-outcome
Process completed with exit code 1.
amd64 / test-nccl / nccl-test-gke / nccl-gke (reduce_scatter_perf_mpi)
The job has exceeded the maximum execution time while awaiting a runner for 24h0m0s
amd64 / test-nccl / nccl-test-gke / nccl-gke (all_reduce_perf_mpi)
The job has exceeded the maximum execution time while awaiting a runner for 24h0m0s
amd64 / test-nccl / nccl-test-gke / nccl-gke (broadcast_perf_mpi)
The job has exceeded the maximum execution time while awaiting a runner for 24h0m0s
amd64 / test-nccl / nccl-test-gke / nccl-gke (all_gather_perf_mpi)
The job has exceeded the maximum execution time while awaiting a runner for 24h0m0s
amd64 / test-maxtext-gke / maxtext-gke-xpk
The job has exceeded the maximum execution time while awaiting a runner for 24h0m0s
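The 11 errors fall into three kinds: five jobs in which a step exited with code 1, five jobs that never obtained a runner within the 24-hour limit, and one job whose self-hosted runner lost contact with the server. A minimal Python sketch that tallies them, so that test failures can be separated from runner and scheduling problems, follows. The bucket names and the substring matching in `classify` are hypothetical triage rules chosen for illustration; GitHub Actions does not provide them.

```python
from collections import Counter

# (job, message) pairs copied from the run's Annotations list.
# The runner-lost message is shortened to its first sentence.
EXIT_1 = "Process completed with exit code 1."
AWAITING = ("The job has exceeded the maximum execution time "
            "while awaiting a runner for 24h0m0s")

ANNOTATIONS = [
    ("amd64 / test-te-h100 / te-test-h100 (unittest, 8)", EXIT_1),
    ("amd64 / test-te-a100 / te-A100-unit-test",
     "The self-hosted runner lost communication with the server."),
    ("amd64 / test-nsys-jax / nsys-jax-A100-unit-test", EXIT_1),
    ("amd64 / test-maxtext / test-maxtext-outcome", EXIT_1),
    ("amd64 / test-rosetta-t5x / test-t5x-rosetta-metrics", EXIT_1),
    ("amd64 / test-rosetta-t5x / test-t5x-rosetta-outcome", EXIT_1),
    ("amd64 / test-nccl / nccl-test-gke / nccl-gke (reduce_scatter_perf_mpi)", AWAITING),
    ("amd64 / test-nccl / nccl-test-gke / nccl-gke (all_reduce_perf_mpi)", AWAITING),
    ("amd64 / test-nccl / nccl-test-gke / nccl-gke (broadcast_perf_mpi)", AWAITING),
    ("amd64 / test-nccl / nccl-test-gke / nccl-gke (all_gather_perf_mpi)", AWAITING),
    ("amd64 / test-maxtext-gke / maxtext-gke-xpk", AWAITING),
]


def classify(message: str) -> str:
    """Map an annotation message to a hypothetical triage bucket."""
    if message.startswith("Process completed with exit code"):
        return "step failed"           # the job ran and a step returned non-zero
    if "awaiting a runner" in message:
        return "no runner within 24h"  # the job was never scheduled
    if "runner lost communication" in message:
        return "runner lost"           # the runner dropped out mid-job
    return "other"


def triage(annotations):
    """Count annotations per triage bucket."""
    return Counter(classify(message) for _, message in annotations)


if __name__ == "__main__":
    for bucket, count in triage(ANNOTATIONS).most_common():
        print(f"{count:2d}  {bucket}")
```

Under this split, the five "step failed" jobs are the likeliest places to look for regressions introduced by the pull request. The other six look like runner or scheduling problems, although the runner-lost message itself notes that a workflow starving the machine of CPU or memory can also cause it.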

Artifacts

Produced during runtime
Name  Size  Digest
artifact-axlearn-build-amd64  565 Bytes  sha256:39b2a30f1680c75cba69c2dfc7bf84eac3314ab26bb073ca1f25c475931180b5
artifact-axlearn-build-arm64  567 Bytes  sha256:a871c514bfb4e1573e3317c816a3fcd87d92f15b0a14f2b5640320a3fb33bff6
artifact-axlearn-test  179 KB  sha256:7d25dd66e379d2d31e1e58157ca3c7ab4e8ed4bf9c51397f8923fcffb1a85e0a
artifact-base-build-amd64  568 Bytes  sha256:5cb9874ec00a0e682ef99141147c005ef766a91d627527f2b14aefb3c1bb6e09
artifact-base-build-arm64  566 Bytes  sha256:e86fd1bd8c32212559ba691a589f9f09c6b0099cbd5df0df27e4f51aba381dca
artifact-equinox-build-amd64  569 Bytes  sha256:82658b1bf9af835732644fb9f2fbd53302fd319afe661fb37fc2720276172885
artifact-equinox-build-arm64  568 Bytes  sha256:49deb1b3c377009acc9a9bc0846ab0eb520f5e760a11c64b1b19ab190c31ae52
artifact-final-report  3.81 KB  sha256:791125572f5489fae148467a7a31b2309efee2fafa4dfb6bd5f53b869003375f
artifact-jax-build-amd64  553 Bytes  sha256:a1fe440251579f3dea8bb955487e1f12a1629ff386210df0ec294a5e3e606b98
artifact-jax-build-arm64  553 Bytes  sha256:c7ccf3cbaf73c244de5004c39a97af42e937d9f1c987c850a2bac0122750400e
artifact-maxtext-build-amd64  568 Bytes  sha256:4d140a630b348296a37399a8f555c9f4fd18b4cd0696b75cfd3788f7531d4d2d
artifact-maxtext-build-arm64  568 Bytes  sha256:b47dec55580d566ef27f47574b75cb33baf7de7959ce37832bfa9e39f621bb0e
artifact-maxtext-test  1.46 KB  sha256:81567cdf31ed21bcc040dc7c123a7ec0d1fbf4e375ff50f8a06aaaf3d05b7540
artifact-mpi-operator-compatible-base-build-amd64  639 Bytes  sha256:8f48acc8c5f88d307257699d0deb4286a0a58470078d5d11b4aff8f051154778
artifact-nccl-gke-build-amd64  571 Bytes  sha256:6533ee977ec0eafbf2c41331100b7f637ad4b7a2ca8297af16a2e5a9ecd566ec
artifact-rosetta-build-t5x-amd64  584 Bytes  sha256:7eba5628e03a827a49309f9db392ea949c25e8eede50d0bfe2fc1fd5c994dd95
artifact-rosetta-build-t5x-arm64  584 Bytes  sha256:3ebd70fe1b7841416a7ec7fe18852d7483b45ed3e1d6a3f359d66fb0b46f7f4a
artifact-rosetta-t5x-mgmn-test  624 Bytes  sha256:17f3899b81b3a873e2b4d8f8430c284809761d128a1871506253fe57f6567807
artifact-t5x-build-amd64  570 Bytes  sha256:1297720be586f6a862c25db604ee56772ee4f3d7e4bd52f4a561383af6aaf98a
artifact-t5x-build-arm64  568 Bytes  sha256:25136b7a95e4990a8945e2f9c2c61e0091fabc89511ae7c585383f6e6d1f4812
artifact-torchax-build-amd64  567 Bytes  sha256:09313dfe050778f35e3432f99fc77db8160ebb719a013091ccb5f074ff922a4d
artifact-torchax-build-arm64  566 Bytes  sha256:09f28fc16c370b1fc502f0eacc7695994ee60a89a508f18f7e0f4118ad3c71f3
artifact-workflow-metadata  277 Bytes  sha256:acf03161ded0c4453e63a5c120ebf4c10e50e0720603baa73b281052aa2e0034
bumped-manifest  51.6 KB  sha256:2ad648433a30390ec30d22b4537368a1f71f424a7f58c763471ac2812b33adf2
final-axlearn  263 Bytes  sha256:5bf03f382fd3c181063e735b30949ae8b70e637116eac8d7fcb0f509b3ed1bdc
final-base  254 Bytes  sha256:785d08dbb099a8073cd0c05c7aaf561d1ffa734f52e7a1d9a1efaba82e177e83
final-equinox  263 Bytes  sha256:6220814edc60996075839f314db3875b490b5fe9d82aa116272a79e4b8b48001
final-jax  251 Bytes  sha256:ab8725abc0d5120db78df1ac743d5520d680cbe48bd51b46da056f4b6741cc65
final-maxtext  263 Bytes  sha256:f0d519af03c7ec7d1286fa6bfdb16970a7b2a45b3073610a0b7e29d48f14c221
final-t5x  251 Bytes  sha256:42ecb5e574abe5f37e55803d73e2285a5f43d23976c8b47dfde9fec6d4e250c2
final-upstream-t5x  277 Bytes  sha256:a48c7226dd4ec913b05f3edced1cd0bd131cabe25a24ec8c66b94b9ae4d90251
jax-cutlass-test-H100  4.69 KB  sha256:889e88e0ddfa698b0deacd86aade962001e640e8b6623351a2908507fbd225e4
jax-unit-test-A100  22.5 KB  sha256:e7e59762c0e1021188cceb18a0043df778eb8d104ca041d10064f95ee9af3522
mealkit-axlearn  272 Bytes  sha256:77d6213bcf8d72ead2ac3795a09b7af24e21a283b0048aef00ec65f0b05cf91a
mealkit-equinox  273 Bytes  sha256:1f5590f756f72579985a1c3d8ac53bc0517a1e8aa9374cb4ee7ff750b96874a0
mealkit-jax  261 Bytes  sha256:55f1f0db13bbdf5a004d786a8108ff0c5780c1dc29947626be39244dbf162386
mealkit-maxtext  272 Bytes  sha256:110392e50d27be7f238d12bcad76f5e3b4b3d00a4673739114a4274de218dc7b
mealkit-t5x  261 Bytes  sha256:04d70be3bbc4804edd699fcb8e32b153eef75ad282142d6ce2c586c16f2a45e8
mealkit-upstream-t5x  286 Bytes  sha256:13c50f1b632b748abd57d1206d52341b297311fe5482eb667ecb421a09c4381d
nsys-jax-unit-test-A100  138 MB  sha256:a3b56a6bc5b790d8657ae2ea470d077ef069f1816e7e0797b20dd5200116fbd7
rosetta-t5x-vit-19372009803-VIT8G1N  15.7 KB  sha256:53d1e9e2c81565037d8e4c92e542602cccb08adcf40b2aced26a358c5cf4ee94
te-unit-test-H100  2.09 MB  sha256:99b4650e56082039ee0ff89a98bd4ac1569af904f8faaf52de1cca2428b0ab4d
upstream-maxtext-19372009803-1DP2FSDP4TP1PP_single_process  23.7 KB  sha256:1ba1bb1e53a358cfcbb427e9e2b4828b04a50588b23d257f15be7e9e45079ca7
upstream-maxtext-19372009803-2DP2FSDP2TP1PP  28 KB  sha256:08c775c843449611a3aa0c85ec09f1eb6a83d7ac5688947b40483842f718208e
upstream-maxtext-metrics-test-log  2.52 KB  sha256:4f53ecc070b05660932dedb413a75c0412c7cad1dd1b1b156fb610ed50e35cd2
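To check a downloaded artifact against the Digest column, one can hash the file locally. The sketch below assumes that each digest is the SHA-256 of the artifact's zip archive exactly as downloaded, and that the file is saved under a name such as `artifact-final-report.zip`. Both are assumptions; this page does not state how the digest is computed. The function names are hypothetical.

```python
import hashlib


def sha256_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the file's SHA-256 in the page's "sha256:<hex>" form.

    The file is read in chunks (1 MiB by default), so large artifacts are
    never loaded into memory all at once.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def matches_digest(path: str, expected: str) -> bool:
    """Compare against a digest copied from the Artifacts table.

    Surrounding whitespace is ignored and the comparison is case-insensitive.
    """
    return sha256_digest(path) == expected.strip().lower()
```

For example, `matches_digest("artifact-final-report.zip", "sha256:791125572f5489fae148467a7a31b2309efee2fafa4dfb6bd5f53b869003375f")` should return `True` if the assumptions above hold. Chunked reading keeps memory flat even for the 138 MB `nsys-jax-unit-test-A100` archive.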