Replies: 17 comments 125 replies
-
I'm happy to support moving the repo into the bytedeco org (or into a dedicated org if that makes more sense). I could also transfer the original repo currently under my personal namespace once it's up to date, as otherwise we would have a forked repo, which GitHub often treats as a second-class citizen compared to a non-forked one. I haven't had the time to really develop Storch further this year, and I doubt this will change much in the near future due to other responsibilities and unforeseen things happening in my life. That said, I'll try to help as much as possible. Storch has come a long way, and you've put an amazing amount of work into it @mullerhai!
-
@saudet Today I talked with Mr. Cheng, an AI leader at Huawei. He is very interested in Storch and javacpp-pytorch, and in whether we could develop support for CANN (a CUDA-like stack) and TensorRT-LLM with javacpp-pytorch. They want to support us.
-
Hi @saudet @sbrunk, I want to transfer storch-numpy, storch-pandas, storch-opencv, and storch-ffmpeg to the bytedeco org so that more people can use them. storch-opencv and storch-ffmpeg are based on javacpp-opencv and javacpp-ffmpeg. What do you think?
-
@sbrunk I have updated the storch-core code to version 0.7.5-1.5.12. Could you check and review the code? In my tests with pytorch_on_scala3_lesson it works correctly.
-
@sbrunk Hi Sbrunk,
-
Hi @saudet @sbrunk, good news: I attended Huawei's chip-driven CANN salon meetup in Beijing today (2025-11-15). The event was vibrant and engaging, and many Chinese companies have already completed adaptations with them. I talked with the event organizer: their next salon will feature presentations on javacpp-pytorch and Storch, and I'll probably get a 10-minute slot to speak.

Most of the participating companies still primarily use C++ and Python, and the venue was quite crowded. Their current research depth has reached the level of transformer attention matrix multiplication, tile blocking, and CUDA kernel optimization: essentially matrix blocking strategies, more efficient model-inference communication across multiple GPUs and machines, and so on. Compared to them, our javacpp-pytorch and Storch are still in the shallow end, not yet diving into the kernel layer. We have a lot of work ahead.

They'll contact me next week to discuss the details of cooperation between CANN and Bytedeco. Do you have any questions? I'll ask Huawei on your behalf. The CANN repo: https://gitcode.com/cann
-
I'm writing to provide a quick update on my communication with an engineering team from Huawei's chip division. The conversation was very positive, and they are very interested in collaborating with us. They are currently working on an internal budget and assessment, with a projected start for our official collaboration in 2026.

As a first step, they want to evaluate the performance of Storch. I've already demonstrated some of its capabilities to them, and they were impressed, noting that it is very close to Python PyTorch. They then asked for a demonstration of Storch's training capabilities with a GPU. To show this, I had to demonstrate Storch integrated with CUDA.

However, I've run into a significant technical issue. Initially, I tried on my Windows 11 machine with an NVIDIA 3060 GPU, but javacpp-pytorch was unable to communicate with CUDA. I then switched to a different laptop running Ubuntu with an NVIDIA 4060 GPU. I had to uninstall CUDA 12.4, as this machine had CUDA 13.0 installed. Even with the Ubuntu setup, javacpp-pytorch (version 2.7.1-1.5.12) still fails to recognize and use CUDA properly. Because of this, I had to reschedule the demonstration with them for either next week or in about two weeks.

This is very time-sensitive for us, as the Huawei partnership is a key opportunity. Therefore, I am hoping a stable release of javacpp-pytorch 2.9-1.5.13 can be made available soon. This would allow me to use javacpp-cuda 1.3.0-1.5.13 to successfully demonstrate Storch's CUDA capabilities to them.

Summary of issues and requests: I am encountering CUDA recognition issues on Ubuntu with the current version (2.7.1-1.5.12), so a prompt release of version 1.5.13 would be crucial for this partnership.
-
(base) muller@muller-Dell-G16-7630:~/Documents/code/javacpp-presets/pytorch$ mvn --version (JDK 21)
-
Hi @saudet, a Huawei AI engineer told me they also want to try javacpp-pytorch with CUDA 13.0. They write Python and don't want to build the whole project like I do; they just want to add the jar as a dependency. I think you really need to release and publish to the Maven Central repo.
-
@saudet Building all of javacpp-pytorch is slow and complex and often runs into errors; it's not suitable for everyone. Please consider making it easier for beginners.
-
The Huawei AI engineers are eager to try javacpp-pytorch with CUDA 13.0. There are two options: publish all the 1.5.13 snapshot artifacts to the Maven snapshot repo so users don't have to build them, or release version 1.5.13 to the Maven Central repo this week.
-
Hi @saudet, I have written a Gloo example with PyTorch. Once Gloo works fine in javacpp-pytorch, I will implement this code in Scala 3. The code is similar to https://github.com/pytorch/examples/blob/main/cpp/distributed/dist-mnist.cpp. We need to open two terminal windows, and then the output follows.
-
I wrote a Gloo CUDA example with PyTorch, and it works. I think in the future we just need to implement it like this in Scala 3. @saudet
-
@saudet I'm scheduled to give another demonstration to Huawei this weekend with the new javacpp-pytorch version 2.9.1-1.5.13, but Gloo verification keeps failing, which is making me extremely anxious. Currently, the ProcessGroupGloo creation fails and crashes the JVM. I'm not sure whether it's a bug in the library itself, my incorrect usage, or wrong parameters. My final attempt also failed.
-
I'm now trying to use TCPStore and FileStore with ProcessGroupGloo. I think that if we can make the Gloo ranks discover each other, Gloo may work. I have tried various settings, but all failed. Maybe the gloo.Device is the cause? I'm not sure, but the ProcessGroupGloo should already be running.
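For reference, the rendezvous pattern behind a TCPStore-style setup can be sketched with plain sockets: rank 0 listens on a well-known host and port, and every other rank connects to it. This is not the javacpp-pytorch API, just a stdlib Python illustration (host, port, and message format are all made up) of why every rank must agree on the same master address before discovery can succeed:

```python
# Minimal sketch of rank rendezvous: rank 0 plays the TCPStore master,
# the other ranks retry connecting until the master socket is up.
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 29555  # all ranks must use the same values
WORLD_SIZE = 3
results = []

def master():
    # Rank 0 binds the agreed-upon port and collects one check-in per worker.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(WORLD_SIZE - 1)
        for _ in range(WORLD_SIZE - 1):
            conn, _ = srv.accept()
            with conn:
                results.append(conn.recv(64).decode())

def worker(rank):
    # Workers retry on "connection refused": it simply means the master
    # has not bound the port yet (or host/port do not match).
    for _ in range(50):
        try:
            with socket.create_connection((HOST, PORT), timeout=1.0) as c:
                c.sendall(f"rank-{rank}".encode())
                return
        except ConnectionRefusedError:
            time.sleep(0.1)

threads = [threading.Thread(target=master)]
threads += [threading.Thread(target=worker, args=(r,)) for r in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # both workers checked in: ['rank-1', 'rank-2']
```

In a real run each rank would be a separate process (one terminal per rank), which is why a mismatched port or address on any single rank shows up as a connection failure rather than an API error.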
-
I agree that Gloo distributed training is a key challenge we need to tackle before our new version release. After resolving the distributed communication and mutual rank discovery issues, Gloo should be able to function normally. The error log shows "connection refused". I suggest running the Java code I wrote, starting three separate terminals and specifying rank=0, rank=1, and rank=2 respectively, to see what results you get. If you also get a connection refused error, it might be a problem with the javacpp-pytorch library; if the connection succeeds, then it's an issue with my computer's network configuration. I tried again today and still couldn't connect on my machine.
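To help tell a library bug apart from a network configuration problem, a quick stdlib probe of the master address can be run before launching any ranks. This is a hypothetical helper (not part of javacpp-pytorch): if the probe itself reports "connection refused", no process has bound the port at all, so the failure is environmental rather than inside the process group code:

```python
# Probe whether anything accepts TCP connections at host:port.
import socket

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a listener accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: bind a throwaway listener so the probe has something to find.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
srv.listen(1)
host, port = srv.getsockname()
print(port_is_open(host, port))  # True: a listener is bound
srv.close()
print(port_is_open(host, port))  # False: connection refused now
```

Running this against the master's address from each machine before starting rank 1 and rank 2 would narrow the "connection refused" error down to either a firewall/address mismatch or the Java side never binding the port.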
-
Hi @saudet, I'm eager to see the Linux version of ProcessGroupNCCL implemented with javacpp. Setting the Gloo issues aside for now, is that something we can implement? The javacpp-pytorch on Scala 3 meetup will be held in Beijing, China on Saturday, December 13th. I will likely be the second speaker and will join via remote video conference. I'm really keen on recommending that the attendees try javacpp. For the issue they care about most, NCCL, I'm truly looking forward to announcing the feasibility of our ProcessGroupNCCL support; this would be a huge draw for them to try javacpp-pytorch. They are currently making the promotional poster for the event, and I will share it with you all once it's done. I predict that if our implementation is robust enough, they will choose ours.
-
Hi @saudet
The latest version of Storch is now basically stable. Since all the lessons in the "PyTorch on Scala" course have been successfully completed, I believe I have verified most of the basic functionalities of PyTorch. It’s safe to say that Storch works properly and follows a coding style close to Python’s PyTorch. I highly recommend that everyone learns the Scala 3 version of PyTorch because it appears simpler and allows users to focus more on neural network construction, training, and inference.
After three years of verification and nearly 110,000 lines of Scala code, I can confirm that javacpp-pytorch is indeed an excellent glue layer. However, programming directly against javacpp-pytorch comes with many risks and drawbacks; it is not easy to use:
First, the code is quite verbose, especially when written directly in Java.
Second, it's error-prone: many parameters use types like Pointer and C++ vector wrappers, with vague names like var1 or var2. Without checking PyTorch's official documentation, it's hard to know the exact data types and actual meanings of these parameters.
Third, javacpp-pytorch still has some unresolved bugs, such as some padding-type layers not working correctly and some hyperparameters failing to take effect when set via the put() method.
Finally, since Python's PyTorch extends libtorch significantly, many features implemented in Python PyTorch are not available in libtorch, making it impossible for javacpp-pytorch to generate corresponding bindings.
Storch addresses most of these risks and issues, greatly improving the efficiency of using PyTorch. Based on the above, I believe that to promote more widespread use of PyTorch on the JVM platform, it’s better to encourage everyone to try Storch, which is built on javacpp-pytorch. What do you think?
Of course, I also anticipate that there might be a Java-based wrapper library for javacpp-pytorch called JTorch in the future. If necessary, I would like to transfer my extended Storch repository (based on @sbrunk's Storch main branch code) to the bytedeco organization. I also plan to ask @sbrunk for his opinion, as I believe hosting it under bytedeco would attract more contributors to optimize Storch and foster ecosystem growth.
On another note, we have high expectations for the 1.5.13 version of pytorch. Looking through the code, I noticed some new additions:
For ProcessGroupGloo, new methods like:
The new class ProcessGroupStatus.
Additions to the global torch class:
I believe these updates show that Gloo support is quite robust. However, based on my understanding of large model training, most users still rely on NCCL-based distributed frameworks with DDP and FSDP modes. Therefore, I would like to request adding the following includes to the presets/torch_include.h generation script:
This should generate the corresponding bindings after compilation. I understand this is challenging and may not produce correct code immediately, requiring further debugging. However, without ProcessGroupNCCL, we can only work on small single-machine models, making it difficult to optimize model quantization and CUDA memory usage. I sincerely hope that the 1.5.13 version of javacpp-pytorch will include a functional ProcessGroupNCCL, enabling us to develop libraries like DeepSpeed, FairScale, or Colossal-AI to accelerate distributed training. I'd like to hear your thoughts and current challenges; I'm willing to assist with the implementation.
Additionally, could EmbeddingImpl include the Embedding::from_pretrained() method? This is crucial for future large-model fine-tuning, especially since we already have the EmbeddingFromPretrainedOptions class. Regarding JIT, I'm unsure whether libtorch includes a method similar to torch::trace(jitmodule) and Python's torch.jit.is_scripting.
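For readers unfamiliar with the request above, the semantics of from_pretrained are simple: the layer is built from an existing weight matrix and, when frozen, is excluded from gradient updates; the forward pass is just row lookup. This plain-Python sketch (the class name and fields are invented for illustration, not libtorch or Storch API) shows the behavior being asked for:

```python
# Toy model of a pretrained embedding layer: fixed weight matrix + row lookup.
from dataclasses import dataclass

@dataclass
class PretrainedEmbedding:
    weight: list         # num_embeddings x embedding_dim weight matrix
    freeze: bool = True  # frozen weights would be excluded from training

    def forward(self, indices):
        # An embedding forward pass selects one weight row per index.
        return [self.weight[i] for i in indices]

# Hypothetical pretrained vectors, e.g. loaded from word2vec or GloVe.
pretrained = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
emb = PretrainedEmbedding(weight=pretrained)
print(emb.forward([2, 0]))  # rows 2 and 0: [[2.0, 2.1], [0.0, 0.1]]
```

Since EmbeddingFromPretrainedOptions already exists in the bindings, exposing a constructor that accepts a ready-made weight tensor plus a freeze flag would cover this use case.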