Updating serverless documentation as part of rewrite #58
Conversation
Developers want to use serverless. Based on this fact, the following should be assumed true:
All sections should serve and be organized around these statements. Users can set up their serverless environment via:
We should have sections that serve each of these form factors. Setting up their serverless environment includes:
Users can interact with their serverless environment via:
We should organize the sections so that it is clear to developers how they interact with serverless for each form factor. Interacting includes:
Users can maintain their serverless environment via:
We should organize the sections so that it is clear how they maintain their serverless environment. Maintaining includes:
| An endpoint consists of: | ||
| It's important to note that having multiple Worker Groups within a single Endpoint is not always necessary. For most users, a single Worker Group within an Endpoint provides an optimal setup. | ||
| - A named endpoint identifier |
This should just be "A name"
Actually I think the current proposal is correct?
| - Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util` | ||
| You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system. | ||
| Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development). |
Might be useful to explain here that endpoints act as "routing targets", and that all requests sent to an endpoint are load-balanced across that endpoint's workers.
| ### Workergroups | ||
| The system architecture for an application using Vast.ai Serverless includes the following components: | ||
| A **Workergroup** defines how workers are recruited and created. Workergroups are configured with [**workergroup-level parameters**](./workergroup-parameters) and are responsible for selecting which GPU offers are eligible for worker creation. |
More than just GPU selection, workergroups decide what code is actually running on the endpoint via the template. It's probably important to mention this here.
| <Frame caption="Serverless Architecture"> | ||
|  | ||
| </Frame> | ||
| - A serverless-compatible template (referenced by `template_id` or `template_hash`) |
Users would only set these fields in the CLI, in the UI they can select the template from the templates page/modal. Might be more clear to just drop the () here and say "A serverless-compatible template".
Also, we probably want to clearly define what makes a template serverless-compatible or not. This issue comes up frequently in support chats.
Maybe we could reference a different set of docs that define serverless compatibility?
| - A serverless-compatible template (referenced by `template_id` or `template_hash`) | ||
| - Hardware and marketplace filters defined via `search_params` | ||
| - Optional instance configuration overrides via `launch_args` | ||
| - Hardware requirements such as `gpu_ram` |
This is another point of confusion. There are two places where users currently have to specify GPU RAM requirements: in `search_params` and `gpu_ram`.
Their difference is:
- `search_params` filters out offers/instances that don't have at least the required GPU RAM.
- `gpu_ram` informs the autoscaler roughly how large the "model" it needs to download is, so it can estimate the loading timeout.
There should probably be a larger discussion on the necessity of this extra gpu_ram field, since it would be less confusing if we can just remove it and figure out a better solution for loading timeouts.
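To make the distinction concrete in the docs, a sketch like this might help (the field names are the ones listed in this PR; the `search_params` string and the values are only illustrative):

```python
# Illustrative workergroup configuration (not exact API syntax).
workergroup_config = {
    "template_hash": "<serverless-compatible template hash>",
    # Marketplace filter: only offers with at least this much GPU RAM are eligible.
    "search_params": "gpu_ram>=24 num_gpus=1 verified=true",
    # Separate hint to the autoscaler about rough model size (GB assumed),
    # used to estimate loading timeouts rather than to filter offers.
    "gpu_ram": 24,
}
```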
| - Hardware requirements such as `gpu_ram` | ||
| - A set of GPU instances (workers) created from the template | ||
| Multiple Workergroups can exist within a single Endpoint, each with different configurations. This enables advanced use cases such as hardware comparison, gradual model rollout, or mixed-model serving. For many applications, a single Workergroup is sufficient. |
Yes, but important to note that currently routing between workergroups is basically random, not controllable. We could add workergroup-level routing to the autoscaler to enable this.
Eh, probably more accurate to say routing is workergroup-agnostic. The workergroup does not factor in at all; routing considers all workers from all workergroups in the endpoint.
| ### Workers | ||
| **Workers** are individual GPU instances created and managed by the Serverless engine. Each worker runs a [**PyWorker**](./overview), a Python web server that loads the machine learning model and serves requests. |
PyWorker does not load the machine learning model. Both the PyWorker and the "machine learning model" (maybe a better, more generic name for this?) are launched by the Docker entrypoint or on_start.sh script. The PyWorker runs in parallel with the machine learning model and listens to it (via its log output). It reports the readiness of the model to the autoscaler, and acts as a proxy for incoming requests on the worker, keeping track of requests and passing them along to the model server.
Taking Lucas' comment, a more accurate sentence might be:
"Each worker runs a PyWorker, a Python web server that runs the machine learning model's on-start script, orchestrates hardware benchmarking, and serves requests."
But the PyWorker does not run the machine learning model's on-start script. Both the PyWorker and the machine learning model are started completely separately, both from the same on-start script. Two different processes, each started from the same script. They do not invoke one another. They communicate only by logs and passing HTTP requests along.
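For the docs, a rough sketch of that separation might help (everything here is illustrative: the module names, port, and log path are made up, and a real on-start script would be shell rather than Python):

```python
# Illustrative only: two independent processes launched from one start script,
# communicating only via the model's log output and HTTP.
import subprocess

# 1. Start the model server (hypothetical module name and flags).
model_server = subprocess.Popen(
    ["python", "-m", "model_server", "--port", "18000"],
    stdout=open("/var/log/model_server.log", "w"),
    stderr=subprocess.STDOUT,
)

# 2. Start the PyWorker separately (hypothetical module name). It tails the model
#    log to detect readiness, reports status to the autoscaler, and proxies
#    incoming requests to the model server on port 18000.
pyworker = subprocess.Popen(
    ["python", "-m", "pyworker", "--model-log", "/var/log/model_server.log"]
)

# Neither process invokes the other; this script only launches both of them.
model_server.wait()
pyworker.wait()
```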
| - Receiving and processing inference requests | ||
| - Reporting performance metrics (load, utilization, benchmark results) | ||
| - Participating in automated scaling and routing decisions |
"- Participating in automated scaling and routing decisions"
This is vague and only sort-of true. They do "participate" in the sense that they report information on their health, current load, dropped requests, and benchmark speed, but outside of just reporting that information, they don't actually take action for routing or scaling. All decision making is handled by the autoscaler based on the reports of the pyworkers.
"Inform automated scaling and routing decisions" is more correct
| ### Serverless Engine | ||
| The **Serverless Engine** is the decision-making service that manages workers across all endpoints and workergroups. Using configuration parameters and real-time metrics, it determines when to: | ||
| - Recruit new workers | ||
| - Activate inactive workers | ||
| - Release or destroy workers | ||
| The engine continuously evaluates cost-performance tradeoffs using automated performance testing and measured load. | ||
Important to note that the "Serverless Engine" / autoscaler does not just manage workers, but also acts as a load balancer for incoming requests. It spreads requests out across all available workers to minimize overall waiting time.
| 5. The PyWorker sends the model results back to the client. | ||
| 6. Independently and concurrently, each PyWorker in the Endpoint sends its operational metrics to the Serverless system, which the system uses to make scaling decisions. | ||
| - Authentication | ||
| - Routing requests to appropriate workers |
This is handled by the autoscaler, not the SDK. The SDK asks the autoscaler to "route" it to a worker, and then the SDK passes the request to that worker once it's routed.
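Could be worth a small sketch of that two-step flow in the docs (the endpoint name, request/response fields, host, and worker path below are assumptions, not the documented API):

```python
# Illustrative two-step flow: ask the Serverless Engine for a worker, then call it.
import requests

API_KEY = "YOUR_VAST_API_KEY"             # placeholder
ROUTE_URL = "<your /route/ endpoint URL>" # placeholder; see the /route/ docs

# Step 1: the autoscaler/engine picks a ready worker and returns its URL.
route = requests.post(ROUTE_URL, json={
    "endpoint": "my-endpoint",  # assumed field name
    "api_key": API_KEY,
    "cost": 100,                # estimated workload for this request
}).json()

# Step 2: send the actual payload directly to the selected worker.
result = requests.post(
    route["url"] + "/generate",  # response field and path are assumptions
    json={"auth_data": route.get("auth_data"), "payload": {"prompt": "Hello"}},
)
print(result.json())
```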
| ## How Benchmark Testing Works | ||
| When a new Workergroup is created, the serverless engine enters a **learning phase**. During this phase, it recruits a variety of machine types from those specified in `search_params`. For each new worker, the engine runs the user-configured benchmark and evaluates performance. |
The "learning phase" doesn't literally exist, but is a label to basically say "Your endpoint is inefficient right now because we haven't gone through enough scaling cycles to optimize it yet."
Also, the PyWorker, not the Serverless Engine, is responsible for running the benchmark, the results of which are then reported to the Serverless Engine.
| ## Best Practices for Initial Scaling | ||
| The speed at which the serverless engine “settles” into the most cost-effective mix of workers can vary depending on how quickly workers are recruited and released. Because of this, it is recommended to apply a **test load during the first day of operation** to help the system efficiently explore and converge on optimal hardware choices. | ||
Specifically, the act of scaling up and scaling down is where optimization occurs. Just supplying load to an endpoint likely will not be enough. Depending on the size of the endpoint and the workload type, users may need to scale their endpoint up and down a few times (in my experience, usually 3 or more times) before it settles on mostly efficient, working machines.
| For examples of how to simulate load against your endpoint, see the client examples in the Vast SDK repository: | ||
| https://github.com/vast-ai/vast-sdk/tree/main/examples/client |
We can point to a load-specific example:
https://github.com/vast-ai/vast-sdk/blob/main/examples/client/vllm_load_example.py
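If we also want an inline snippet, a toy load loop could look like this (sketch only; the linked vllm_load_example.py is the real reference, and the URL and request shape below are placeholders):

```python
# Toy load generator: sustain concurrent requests against the endpoint for a while.
import concurrent.futures
import time
import requests

ROUTE_URL = "<your /route/ endpoint URL>"  # placeholder

def one_request(i: int) -> int:
    # Placeholder call; in practice this would be the /route/ + worker flow or an SDK call.
    resp = requests.post(ROUTE_URL, json={"endpoint": "my-endpoint"})
    return resp.status_code

end = time.time() + 10 * 60  # keep load up for ~10 minutes
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    while time.time() < end:
        list(pool.map(one_request, range(16)))
```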
| - Output: | ||
| - A `float` representing workload; larger means “more expensive.” | ||
| - Recommendation: | ||
| - For many applications that have a vast majority of similarly complex tasks, utilizing a single constant value per task is sufficient (example cost = 100) |
This sentence could be worded better.
"For applications where requests do not vary much in complexity, returning a constant value (i.e. 100) is often sufficient."
| @@ -0,0 +1,37 @@ | |||
| --- | |||
| title: Managing Scale | |||
| description: Some configuration change strategies to manage different load scenarios. | |||
"configuration change strategies" doesn't make obvious sense to me, but I think I get what you're going for.
Maybe "Learn how to configure your Serverless endpoint for different load scenarios" ?
| { | ||
| "@type": "HowToStep", | ||
| "name": "Manage for Bursty Load", | ||
| "text": "Adjust min_workers to increase managed inactive workers and add capacity for peak demand. Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions. Check max_workers to ensure it's set high enough for the serverless engine to create the required number of workers." |
" Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions"
I don't really understand this explanation of using cold_mult. Cold_mult just sets the cold worker target to hot_worker_target * cold_mult (roughly). It always increases the number of workers, and the rate of worker recruitment is identical, just with a higher target.
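Roughly, per that description (illustrative numbers, not engine code):

```python
# cold_mult sets the inactive ("cold") capacity target as a multiple of the hot target.
hot_worker_target = 4   # example value derived from current load
cold_mult = 2.5         # example setting

cold_worker_target = hot_worker_target * cold_mult  # ~10 workers kept recruitable
# Recruitment proceeds at the same rate either way; only the target is higher.
```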
| ## Managing for Bursty Load | ||
| - **Adjust** `min_workers`: This will change the number of managed inactive workers, and increase capacity for high peak | ||
| - **Increase** `cold_mult`: This will change the rate (and in extreme cases, the number) of worker recruitment. Use this to manage for fast transitions in demand |
Same for comment above.
| ## Managing for Low Demand or Idle Periods | ||
| - **Adjust** `min_load`: Reducing `min_load` will reduce the minimum number of active workers. Set to `1` to reduce the number to its minimum value of 1 worker |
We support scaling to zero as of 2/3/26 (assuming we deploy)
| In the diagram's example, a user's client is attempting to infer from a machine learning model. With Vast's Serverless setup, the client: | ||
| 1. Sends a [`/route/`](/documentation/serverless/route) POST request to the serverless system. This asks the system for a GPU instance to send the inference request. | ||
| 1. Sends a `/route/` POST request to the serverless system. This asks the system for a GPU instance to send the inference request. |
For consistency, we want to change "serverless system" to "Serverless engine"
| 2. The serverless system selects a ready and available worker instance from the user's endpoint and replies with a JSON object containing the URL of the selected instance. | ||
| 3. The client then constructs a new POST request with its payload, authentication data, and the URL of the worker instance. This is sent to the worker. |
This flow is abstracted away by the SDK
| Vast.ai Serverless offers pay-as-you-go pricing for all workloads at the same rates as Vast.ai's non-Serverless GPU instances. Each instance accrues cost on a per second basis. | ||
| This guide explains how pricing works. | ||
| Unlike other providers, Vast Serverless offers pay-per-second pricing for all workloads at the same as Vast.ai’s non-Serverless GPU instances. |
"at the same as" -> "at the same price as"
| | State | Description | GPU compute | Storage | Bandwidth (in/out) | | ||
| |----------|-------------------|-------------|---------|--------------------| | ||
| | Ready | An active worker | Billed | Billed | Billed | | ||
| | Loading | Model is loading | Billed | Billed | Billed | |
"Starting" is included in "Loading", so users will also have to pay full price while their instance is starting up from a cold state.
But users never see "Starting", right? They will see Model Loading, which we should add here
"Model Loading" actually didn't make it into the new UI (maybe we need to talk to Madison about that), but yes users will see starting. Cold workers go from inactive -> starting -> ready, whereas new workers go from created -> loading -> ready
@LucasArmandVast Could you provide all states that a user would see and note which states have GPU compute, Storage and Bandwidth (in/out) billed. This would be helpful for David.
| ## Why Use the SDK | ||
| While there are other ways to interact with Vast Serverless—such as the CLI and the REST API—the SDK is the **most powerful and easiest** method to use. It is the recommended approach for most applications due to its higher-level abstractions, reliability, and ease of integration into Python-based workflows. |
The SDK (in its current state) is not really a replacement for the CLI/UI, since they are still needed for endpoint/workergroup/template creation. Deployments project will change this.
It is true though that using the SDK is the pre-packaged "correct" way to use the API, which still remains available for advanced users.
| <Frame caption="Serverless Architecture"> | ||
|  | ||
| </Frame> | ||
This picture will need updating. The whole back-and-forth between the engine and the worker is now simplified through the SDK, i.e. the SDK does this for you and you never really have to know about it. I guess it could still be important for the Anthony persona since he needs to know how all his requests are being handled.
| The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set. | ||
| ## `gpu_ram` |
Goes to the previously mentioned point, can we remove this and include in search_params somehow?
This is a question for devs
Parsing search_params might be weird, but I don't see why we couldn't do it. There is a real difference between "how many GB VRAM do you need" vs. "how many GB is your model weights", but they are probably close enough.
There is this whole infra built for tracking "additional disk usage" since pyworker start, and in theory "download progress" should be additional_disk_usage / model_size. Then we could provide approximate loading progress bars for each worker. I think this was an intended but unimplemented feature. It would probably look like changing this gpu_ram parameter to something like (optional) model_size.
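i.e. something like this, purely as a sketch of the proposed (unimplemented) idea:

```python
# Hypothetical loading-progress estimate based on the idea above.
def loading_progress(additional_disk_usage_gb: float, model_size_gb: float) -> float:
    """Fraction of the model assumed to be downloaded since PyWorker start."""
    if model_size_gb <= 0:
        return 0.0
    return min(additional_disk_usage_gb / model_size_gb, 1.0)
```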
| The Target Utilization ratio determines how much spare capacity (headroom) the serverless engine maintains. For example, if your predicted load | ||
| is 900 tokens/second and target\_util is 0.9, the serverless engine will plan for 1000 tokens/second of capacity (900 ÷ 0.9 = 1000), leaving 100 | ||
| tokens/second (11%) as buffer for traffic spikes. | ||
| ## Minimum Inactive Workers (`min_workers`) |
We should drop the 'Inactive'. This value is state agnostic, meaning 2 Ready and 3 Inactive still satisfies min_workers=5
| @@ -0,0 +1,69 @@ | |||
| --- | |||
| title: Worker Recruitment | |||
This doc is actually a list of workergroup parameters. Maybe we rename the title to "Workergroup Parameters" in a similar fashion to the existing "Endpoint Parameters"?
| --- | ||
| title: Architecture | ||
| description: Understand the architecture of Vast.ai Serverless, including the Serverless System, GPU Instances, and User (Client Application). Learn how the system works, how to use the routing process, and how to create Worker Groups. | ||
| title: Overview |
We should rename the previous section to Serverless Feature Overview. This section should be named Architecture Overview. Having Serverless Overview followed by Overview is a bit confusing.
| <Frame caption="Serverless Architecture"> | ||
|  | ||
| </Frame> |
I think it would be clearer if each Pyworker <--> Model Inference Group was encapsulated in a box labeled GPU Instance.
| - Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util` | ||
| You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system. | ||
| Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development). |
I prefer the term use case instead of function for this sentence. For a software dev, a function is a specific term.
| An endpoint consists of: | ||
| It's important to note that having multiple Worker Groups within a single Endpoint is not always necessary. For most users, a single Worker Group within an Endpoint provides an optimal setup. | ||
| - A named endpoint identifier |
Are you trying to capture endpoint name and endpoint id with this phrasing? I think "The endpoint's name" would be sufficient here.
| <Frame caption="Serverless Architecture"> | ||
|  | ||
| </Frame> |
The diagram should show all primary components so a user knows where each important component fits into this flow.
| While CLI and API access are available, the SDK is the recommended method for most applications. | ||
| This 2-step routing process is used for security and flexibility. By having the client send payloads directly to the GPU instances, your payload information is never stored on Vast servers. | ||
| ## Example Workflow |
This example workflow describes what happens after a dev client already has their serverless environment set up. This should be made clear so that a new dev client who is trying to set up serverless doesn't think this workflow is what they have to follow to set it up.
| 2. The Serverless system routes the request and returns a suitable worker address based on current load and capacity. | ||
| 3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data. | ||
| 4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference. | ||
| 5. The inference result is returned to the client. |
The inference result is returned to the client's application, which then forwards it to their users.
| 3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data. | ||
| 4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference. | ||
| 5. The inference result is returned to the client. | ||
| 6. Independently and continuously, each PyWorker reports operational and performance metrics back to the Serverless Engine, which uses this data to make ongoing scaling decisions. |
I like this sentence.
| A command-line style string containing additional parameters for instance creation that will be parsed and applied when the serverless engine creates new workers. This allows you to customize instance configuration beyond what’s specified in templates. | ||
| There is no default value for `launch_args`. |
This parameter seems nebulous to me. We should have an example or something that tells developers what kind of format they should expect to use for this parameter.
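Agreed. Even a hedged example would help; something along these lines, where the flags are only placeholders to show the "CLI-style string" format and would need to be checked against the supported instance-creation options:

```python
# Illustrative only: launch_args is documented as a "command-line style string".
# The specific flags below are assumptions and should be verified against the
# supported instance-creation options before going into the docs.
launch_args = "--disk 64 --env '-e HF_TOKEN=<your token>'"
```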
| - `target_util = 0.9` → 11.1% spare capacity | ||
| - `target_util = 0.8` → 25% spare capacity | ||
| - `target_util = 0.5` → 100% spare capacity | ||
| - `target_util = 0.4` → 150% spare capacity |
We should give a general equation so users aren't constrained to only these examples.
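The examples above all follow `spare capacity = 1 / target_util - 1`; a quick sketch:

```python
# General relationship behind the listed examples.
for target_util in (0.9, 0.8, 0.5, 0.4):
    spare = 1 / target_util - 1
    print(f"target_util = {target_util} -> {spare:.1%} spare capacity")
# Prints: 11.1%, 25.0%, 100.0%, 150.0%
```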
| | State | Description | GPU compute | Storage | Bandwidth (in/out) | | ||
| |----------|-------------------|-------------|---------|--------------------| | ||
| | Ready | An active worker | Billed | Billed | Billed | | ||
| | Loading | Model is loading | Billed | Billed | Billed | |
@LucasArmandVast Could you provide all states that a user would see and note which states have GPU compute, Storage and Bandwidth (in/out) billed. This would be helpful for David.
| | Inactive | Not billed | Billed | Billed | Billed | | ||
| GPU compute refers to the per-second GPU rental charges. See the [Billing Help](/documentation/reference/billing#ugwiY) page for rate details. No newline at end of file | ||
| | State | Description | GPU compute | |
Change the header label from GPU compute to Billing Description
| |-----------|--------------------------------------------------------------------------------------------|-------------| | ||
| | Active | - Engine is actively managing worker recruitment and release <br /> - Workers are active | All workers billed at their relevant states | | ||
| | Suspended | - Engine is NOT managing worker recruitment and release <br /> - Workers are active. | Workers are billed based on their state at time of suspension. <br /> Any workers that are currently being created or are loading, will complete to a ready state (and be billed as such). | | ||
| | Stopped | - Engine is NOT managing worker recruitment and release <br /> - Workers are all inactive | All workers are changed to and billed in inactive state | |
All workers are changed to and are billed in the inactive state.
| ## Minimum Load (`min_load`) | ||
| If not specified during endpoint creation, the default value is 3. | ||
| Vast Serverless utilizes a concept of **load** as a metric of work that is performed by a worker, measured in performance (“perf”) per second. This is an internally computed value derived from benchmark tests and is normalized across different work types (tokens for LLMs, images for image generation, etc.). It is used to make scaling and capacity decisions. |
I need some clarification about this from @LucasArmandVast and @Colter-Downing. Measuring load as perf per second doesn't feel accurate even though that's what is in the code base right now.
| ### Best practice for setting `min_load` | ||
| If not specified during endpoint creation, the default value is 5. | ||
| - Start with `min_load = 1` (the default), which guarantees at least one active worker |
We should mention that if a developer wants zero scaling (scaling to 0 hot workers), the min_load should be 0.
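e.g. something like this in the parameter docs (values are illustrative; whether `min_workers` also needs to be 0 is an assumption on my part):

```python
# Illustrative endpoint settings for scale-to-zero (parameter names from this doc set).
scale_to_zero = {
    "min_load": 0,     # 0 allows the endpoint to scale down to zero active workers
    "min_workers": 0,  # presumably also needed so no idle workers are retained (assumption)
    "cold_mult": 1.0,
    "max_workers": 5,
}
```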
| ## Cold Multiplier (`cold_mult`) | ||
| The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set. | ||
| While `min_workers` is fixed regardless of traffic patterns, `cold_mult` defines inactive capacity as a multiplier of the current active workload. |
The serverless engine attempts to plan its scaling for both predicted short term loads (1-30s) and for predicted long term loads (1 hour and above).
cold_mult is a scalar multiplier that allows developers to tune and plan for expected longer term loads.
cold_mult = (target_perf x target_util)/(predicted_load)
Self-explanatory: rewrote and re-ordered the serverless documentation.
https://vastai.atlassian.net/browse/AUTO-1104