Run decentralized LLMs. Distribute your inference and training requests across a vast peer-to-peer network without the need for a centralized server. EdgeLLM lets users perform inference on models far larger than their own machine could handle, by borrowing a network of volunteers' machines to run the inference on their behalf.
OpenAI's ChatGPT, Google's Gemini, Anthropic's Claude. With their powerful models backed by their supercomputers, they're a great resource to have... until you start thinking: what if one day they pull the plug? What if they begin charging insane prices? What happens to all the embarrassing questions you have about your codebase (just me?)? All of these LLM providers make it super easy to start chatting with insanely complex models (even up to 1T parameters), but it comes at the cost of an increased reliance on their services and of trusting them to handle your data correctly. I, like many, will never put full trust in something I don't personally own, and I can't see myself relying day-to-day on intelligent LLMs that run on someone else's computer thousands of kilometers away with sketchy privacy policies at best.
As we spammed ChatGPT, Claude, Gemini, and more with our mindless typing, we asked ourselves a question: why leave some of the most powerful technologies of the 21st century in the hands of a few with large investments in thousands of GPUs, when so many individuals already have their own laptop and even a GPU? An individual might not be able to run Gemini Pro themselves, but borrow 100 people's GPUs for a bit and it becomes possible. Large P2P projects like BitTorrent already do this for file sharing, so let's do it for LLMs too. We think LLMs should be as the internet should truly be: decentralized and powered by the people.
- Get a vLLM cluster up and running triggered by EdgeLLM requests (we are here)
- Implement P2P connections
  - Involves defining standards for capabilities, metadata, etc. and how to propagate them around the network
- Add join/leave network capabilities
- KV cache disaggregation/sharing with IPFS
- P/D disaggregation
- Model hotswapping
- More as we go along
- Peer-to-peer decentralized inference
- vLLM support
- Distributed KV cache over a P2P filesystem
- P/D disaggregation
- more to come...
In prototype stage, so subject to change. The goal is a truly decentralized p2p inference server that allows interaction with large LLM models (100B+ parameters) that individuals may not be able to run on their own.
This is a wrapper on top of runners such as vLLM and llama.cpp that already support distributed inference. The key is that these do not support true peer-to-peer networking and instead rely on a head node or other centralized methods of orchestrating the inference.
We take these runners, add bindings for them, and wrap the bindings with our p2p node implementation. Each node has the ability to host a model or perform inference. When a user wishes to start hosting a model, they join the network with metadata about their hosting capabilities as well as the model they want to host. We can use go-libp2p's DHT functionality to achieve this.
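As a rough sketch of how that discovery could work (the package layout and the rendezvous key format are illustrative assumptions, not the final protocol), a hosting node could advertise the model it serves on the Kademlia DHT, and other peers could look up candidate hosts via go-libp2p's routing discovery:

```go
// Sketch only: advertise a hosted model on the DHT and discover other hosts.
// The rendezvous key format ("edgellm/model/<name>") is a placeholder.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	drouting "github.com/libp2p/go-libp2p/p2p/discovery/routing"
	dutil "github.com/libp2p/go-libp2p/p2p/discovery/util"
)

func main() {
	ctx := context.Background()

	// Start a libp2p host and a Kademlia DHT on top of it.
	// (Bootstrap peers are omitted; a real node would connect to known
	// bootstrap nodes before advertising.)
	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	kadDHT, err := dht.New(ctx, h)
	if err != nil {
		log.Fatal(err)
	}
	if err := kadDHT.Bootstrap(ctx); err != nil {
		log.Fatal(err)
	}

	// Hosting nodes advertise a rendezvous key derived from the model name;
	// leechers look up the same key to find candidate workers.
	disc := drouting.NewRoutingDiscovery(kadDHT)
	rendezvous := "edgellm/model/llama-3-70b" // hypothetical key format
	dutil.Advertise(ctx, disc, rendezvous)

	peers, err := disc.FindPeers(ctx, rendezvous)
	if err != nil {
		log.Fatal(err)
	}
	for p := range peers {
		fmt.Println("candidate worker:", p.ID)
	}
}
```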
Once users have joined the network, inference requests can be handled. When a user joins the network as a "leecher", they initiate an inference request, which creates an object tracking which user requested it, the model, the prompt/response, and various metadata. Our code then finds the optimal set of peers whose combined availability meets the request's minimum requirements (details for this still need to be fleshed out). Once the workers are established, we call an inference backend such as vLLM to create a one-time distributed inference network with the requester as the head and the workers drawn from the network. The inference is performed and the vLLM (or other backend) server is then shut down. There should also be an option to open a "session", so the backend is only started and torn down once per session.
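For illustration, the request object described above might look something like the sketch below; the field names and types are placeholders rather than a settled schema:

```go
// Placeholder shapes for an inference request and a candidate worker.
package edgellm

import "time"

// InferenceRequest records who asked, which model, the prompt/response, and
// the metadata needed to select workers and track the request's lifecycle.
type InferenceRequest struct {
	RequesterID string    // libp2p peer ID of the leecher
	Model       string    // model identifier
	Prompt      string    // input prompt
	Response    string    // filled in once inference completes
	CreatedAt   time.Time // for timeouts and bookkeeping
	SessionID   string    // non-empty when the request is part of a longer session
}

// Worker describes a hosting peer considered for a request, based on the
// availability metadata it advertised when joining the network.
type Worker struct {
	PeerID     string
	FreeVRAMGB float64 // advertised free GPU memory
	Throughput float64 // rough tokens/sec estimate for its shard
}
```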
Users hosting a model will need to have the entire model installed, but more testing is needed to determine whether the whole model must be loaded into memory for use. This also depends on the inference server.
Inference will be slow. But to start, we want to achieve decentralized p2p inference first; we will then evaluate how much of a bottleneck inference over the network is. Details for traffic balancing and rate limiting have not been worked out yet and will follow once the above is working.
TODO: Startup script to automate this
- Go 1.21 or later
- Docker and Docker Compose (optional for now)
- Make (optional, for using Makefile commands)
- Python 3.10 or later
- protoc and the Go plugins for protoc (protoc-gen-go, protoc-gen-go-grpc)
- Clone and setup dependencies:
  go mod download
- Run worker node (the only functional part currently):
  make run-worker

Outdated:
- Clone and setup dependencies:
  go mod download
- Setup a local Python environment:
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  pip install -r requirements.txt
  python3 -m pip install -r backend/python/requirements.txt
- Build proto:
  - Install protoc and Go's extension for protoc
  - Build the protobuf spec:
    # For python spec
    python3 -m grpc_tools.protoc -Ibackend --python_out=backend/python/proto --grpc_python_out=backend/python/proto --pyi_out=backend/python/proto backend/backend.proto
    # For go spec
    protoc --go_out=. --go-grpc_out=. backend/backend.proto
- Run main to view commands:
  Make sure you've activated your Python virtual environment in the same terminal session you're running the Go application in.
  go run main.go
- Build and run with Docker Compose:
  docker-compose up --build  # or make dev
- Stop the containers:
  docker-compose down  # or make docker-stop
- `GET /health` - Health check endpoint
- `GET /api/v1/hello` - Example API endpoint
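As a quick usage example (assuming a node is running locally on the default port 8080), the health endpoint can be checked like this:

```go
// Minimal client check against a locally running node.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:8080/health")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status, string(body))
}
```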
make help # Show available commands
make build # Build the Go application
make run # Run the application locally
make test # Run tests
make docker-build # Build Docker image
make docker-run # Run with Docker Compose
make docker-stop # Stop Docker containers
make dev # Start development environment
make clean # Clean build artifacts

Environment variables:

- `PORT` - Server port (default: 8080)
- `GIN_MODE` - Gin mode (debug/release, default: debug)
- Build the Docker image:
  make docker-build
- Push to registry (configure DOCKER_REGISTRY in the Makefile):
  make docker-push
- Deploy using Docker Compose:
  docker-compose up -d
This project takes inspiration and code snippets from several sources:
- LocalAI - Main source of inspiration. This project aims to take the distributed inference capabilities of LocalAI and extend them to truly decentralized networks that work with more inference servers than just llama.cpp
- P/D disaggregation, KV-cache-aware routing, and other distributed systems techniques
- P2P capabilities
- Distributed file systems
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request