
Conversation

@DavidatVast

Self-explanatory: rewrote and re-ordered the serverless documentation.

https://vastai.atlassian.net/browse/AUTO-1104

@robertfernandez-vast

Developers want to use serverless. Based on this, the following should be assumed true:

  1. For a developer to use serverless, they must know how to set up their serverless environment.
  2. For a developer to use serverless, they must know how to interact with their serverless environment.
  3. For a developer to use serverless, they must know how to maintain their serverless environment.

All sections should serve, and be organized around, these statements.

Users can set up their serverless environment via:

  • The vast website
  • The vast-sdk
  • The vast cli

We should have sections which serve each of these form factors.

Setting up their serverless environment includes:

  • Creating and destroying endpoints and what that does to their data and serverless environment
  • Creating and destroying workergroups and what that does to their data and serverless environment
  • Using our predefined templates or defining custom templates
    • This includes instructing developers on how to make their own pyworkers for their own use cases.
  • Addressing frequently occurring questions/pain points with set-up
    • Why does it take so long for workers to load?
    • How do I know when my serverless environment is ready for use?
    • How do I know what serverless environment is best suited for my use case?
  • Definitions for each worker state, and how to use these states to get your serverless environment set up.

Users can interact with their serverless environment via:

  • The vast-sdk
  • The vast-cli

We should organize the sections so that it is clear to developers how they interact with serverless for each form factor.

Interacting includes:

  • How to send requests to your endpoint group once it's set up

Users can maintain their serverless environment via:

  • The vast website
  • The vast-sdk
  • The vast-cli

We should organize the sections so that it is clear how they maintain their serverless environment.

Maintaining includes:

  • Paying for their serverless environment
  • Keeping their pyworkers up-to-date
  • Best Practices for keeping their data safe (ex: backing up data)
  • Describing what our metrics mean on the endpoint page, and what assumptions go into each metric. (Units, limitations, definitions)
  • How to understand how your serverless environment is behaving based on worker states.

An endpoint consists of:

It's important to note that having multiple Worker Groups within a single Endpoint is not always necessary. For most users, a single Worker Group within an Endpoint provides an optimal setup.
- A named endpoint identifier

This should just be "A name"


Actually I think the current proposal is correct?

- Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util`

You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system.
Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development).

Might be useful to explain here that endpoints act as "routing targets", and that all requests sent to an endpoint are load-balanced across that endpoint's workers.
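To make the parameter list above concrete, a hypothetical endpoint configuration could be shown as something like the sketch below (a plain Python dict with made-up values; not actual CLI or SDK syntax):

```python
# Illustrative only: endpoint-level parameters shown as a plain dict.
endpoint_config = {
    "endpoint_name": "llm-prod",  # the endpoint's name (hypothetical)
    "max_workers": 20,            # upper bound on workers the engine may create
    "min_workers": 2,             # lower bound on workers kept around
    "min_load": 1,                # minimum planned load
    "min_cold_load": 0,           # see the endpoint-parameters docs for its exact meaning
    "cold_mult": 2.5,             # multiplier used to plan longer-term (cold) capacity
    "target_util": 0.9,           # planned utilization; lower values keep more headroom
}
```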

### Workergroups

The system architecture for an application using Vast.ai Serverless includes the following components:
A **Workergroup** defines how workers are recruited and created. Workergroups are configured with [**workergroup-level parameters**](./workergroup-parameters) and are responsible for selecting which GPU offers are eligible for worker creation.

More than just GPU selection, workergroups decide what code is actually running on the endpoint via the template. It's probably important to mention this here.

<Frame caption="Serverless Architecture">
![Serverless Architecture](/images/serverless-architecture.webp)
</Frame>
- A serverless-compatible template (referenced by `template_id` or `template_hash`)

Users would only set these fields in the CLI, in the UI they can select the template from the templates page/modal. Might be more clear to just drop the () here and say "A serverless-compatible template".

Also, we probably want to clearly define what makes a template serverless-compatible or not. This issue comes up frequently in support chats.


Maybe we could reference a different set of docs that define serverless compatibility?

- A serverless-compatible template (referenced by `template_id` or `template_hash`)
- Hardware and marketplace filters defined via `search_params`
- Optional instance configuration overrides via `launch_args`
- Hardware requirements such as `gpu_ram`
@LucasArmandVast (Feb 2, 2026)

This is another point of confusion. There are two places where users currently have to specify GPU RAM requirements: in search_params and gpu_ram.
Their difference is:

  1. search_params filters out offers/instances that don't have at least the required GPU RAM
  2. gpu_ram informs the autoscaler roughly how large the "model" it needs to download is, so it can estimate loading timeouts.

There should probably be a larger discussion on the necessity of this extra gpu_ram field, since it would be less confusing if we can just remove it and figure out a better solution for loading timeouts.
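One way the docs could illustrate the difference (field names follow the list above; the query syntax and the values are assumptions):

```python
# Illustrative workergroup configuration (not actual CLI/SDK syntax).
# search_params filters marketplace offers: here, only machines with at least
# 24 GB of GPU RAM are eligible for worker creation (query syntax is assumed).
# The separate gpu_ram field is the model-size hint the autoscaler uses to
# estimate loading timeouts, as described above.
workergroup_config = {
    "template_hash": "<serverless-compatible template hash>",  # placeholder
    "search_params": "gpu_ram>=24 num_gpus=1 verified=True",   # offer filter (assumed syntax)
    "gpu_ram": 24,                                              # model-size hint, in GB
}
```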

- Hardware requirements such as `gpu_ram`
- A set of GPU instances (workers) created from the template

Multiple Workergroups can exist within a single Endpoint, each with different configurations. This enables advanced use cases such as hardware comparison, gradual model rollout, or mixed-model serving. For many applications, a single Workergroup is sufficient.

Yes, but important to note that currently routing between workergroups is basically random, not controllable. We could add workergroup-level routing to the autoscaler to enable this.


Eh, probably more accurate to say routing is workergroup-agnostic; the workergroup does not factor in whatsoever. Routing considers all workers from all workergroups in the endpoint.


### Workers

**Workers** are individual GPU instances created and managed by the Serverless engine. Each worker runs a [**PyWorker**](./overview), a Python web server that loads the machine learning model and serves requests.

PyWorker does not load the machine learning model. Both the pyworker and the "machine learning model" (maybe a better, more generic name for this?) are launched by the Docker entrypoint or on_start.sh script. The pyworker runs in parallel with the machine learning model and listens to it (via its log output). It reports the readiness of the model to the autoscaler, and acts as a proxy for incoming requests on the worker, keeping track of requests and passing them along to the model server.


Taking Lucas' comment, a more accurate sentence might be:
"Each worker runs a PyWorker, a Python web server that runs the machine learning model's on-start script, orchestrates hardware benchmarking, and serves requests."


But the PyWorker does not run the machine learning model's on-start script. Both the PyWorker and the machine learning model are started completely separately, both from the same on-start script. Two different processes, each started from the same script. They do not invoke one another. They communicate only by logs and passing HTTP requests along.
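A tiny sketch of that separation (illustrative only: the real on-start script is shell and its exact commands depend on the template; the launcher names below are hypothetical):

```python
# The point being made above: the model server and the PyWorker are two
# separate processes started side by side from the same script; neither
# launches the other.
import subprocess

model_server = subprocess.Popen(["./start_model_server.sh"])  # hypothetical launcher
pyworker = subprocess.Popen(["./start_pyworker.sh"])          # hypothetical launcher

# The PyWorker then watches the model server's log output to detect readiness,
# reports readiness and load metrics to the autoscaler, and proxies incoming
# HTTP requests to the model server.
```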


- Receiving and processing inference requests
- Reporting performance metrics (load, utilization, benchmark results)
- Participating in automated scaling and routing decisions

"- Participating in automated scaling and routing decisions"
This is vague and only sort-of true. They do "participate" in the sense that they report information on their health, current load, dropped requests, and benchmark speed, but outside of just reporting that information, they don't actually take action for routing or scaling. All decision making is handled by the autoscaler based on the reports of the pyworkers.

"Inform automated scaling and routing decisions" is more correct

### Serverless Engine

The **Serverless Engine** is the decision-making service that manages workers across all endpoints and workergroups. Using configuration parameters and real-time metrics, it determines when to:

- Recruit new workers
- Activate inactive workers
- Release or destroy workers

The engine continuously evaluates cost-performance tradeoffs using automated performance testing and measured load.


Important to note that the "Serverless Engine" / autoscaler does not just manage workers, but also acts as a load balancer for incoming requests. It spreads requests out across all available workers to minimize overall waiting time.

5. The PyWorker sends the model results back to the client.
6. Independently and concurrently, each PyWorker in the Endpoint sends its operational metrics to the Serverless system, which it uses to make scaling decisions.
- Authentication
- Routing requests to appropriate workers

This is handled by the autoscaler, not the SDK. The SDK asks the autoscaler to "route" it to a worker, and then the SDK passes the request to that worker once it's routed.


## How Benchmark Testing Works

When a new Workergroup is created, the serverless engine enters a **learning phase**. During this phase, it recruits a variety of machine types from those specified in `search_params`. For each new worker, the engine runs the user-configured benchmark and evaluates performance.

The "learning phase" doesn't literally exist, but is a label to basically say "Your endpoint is inefficient right now because we haven't gone through enough scaling cycles to optimize it yet."

Also, the PyWorker, not the Serverless Engine, is responsible for running the benchmark, the results of which are then reported to the Serverless Engine.

## Best Practices for Initial Scaling

The speed at which the serverless engine “settles” into the most cost-effective mix of workers can vary depending on how quickly workers are recruited and released. Because of this, it is recommended to apply a **test load during the first day of operation** to help the system efficiently explore and converge on optimal hardware choices.


Specifically, the act of scaling up and scaling down is where optimization occurs. Just supplying load to an endpoint likely will not be enough. Depending on the size of the endpoint and the workload type, users may need to scale their endpoint up and down a few times (in my experience, usually 3 or more times) before it settles on mostly efficient, working machines.


For examples of how to simulate load against your endpoint, see the client examples in the Vast SDK repository:

https://github.com/vast-ai/vast-sdk/tree/main/examples/client
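A minimal sketch of what such a load test can look like; `send_inference_request` is a hypothetical stand-in for one of the real example clients linked above (which perform the route-then-infer flow against your endpoint):

```python
# Minimal load-generation sketch.
import concurrent.futures

def send_inference_request(prompt: str) -> None:
    """Placeholder: replace with a real client call from the examples repo."""
    ...

prompts = ["hello world"] * 200

# Sustained, repeated runs like this (not a single burst) are what drive the
# scale-up/scale-down cycles mentioned above.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(send_inference_request, prompts))
```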

- Output:
- A `float` representing workload; larger means “more expensive.”
- Recommendation:
- For many applications that have a vast majority of similarly complex tasks, utilizing a single constant value per task is sufficient (example cost = 100)

This sentence could be worded better.
"For applications where requests do not vary much in complexity, returning a constant value (i.e. 100) is often sufficient."

---
title: Managing Scale
description: Some configuration change strategies to manage different load scenarios.

"configuration change strategies" doesn't make obvious sense to me, but I think I get what you're going for.

Maybe "Learn how to configure your Serverless endpoint for different load scenarios" ?

{
"@type": "HowToStep",
"name": "Manage for Bursty Load",
"text": "Adjust min_workers to increase managed inactive workers and add capacity for peak demand. Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions. Check max_workers to ensure it's set high enough for the serverless engine to create the required number of workers."

" Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions"

I don't really understand this explanation of using cold_mult. Cold_mult just sets the cold worker target to hot_worker_target * cold_mult (roughly). It always increases the number of workers, and the rate of worker recruitment is identical, just with a higher target.
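A rough numeric illustration of that description (hypothetical values):

```python
hot_worker_target = 4                               # workers needed for the current load
cold_mult = 2.5
cold_worker_target = hot_worker_target * cold_mult  # -> 10 workers planned as cold capacity
```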

## Managing for Bursty Load

- **Adjust** `min_workers`: This will change the number of managed inactive workers, and increase capacity for high peak
- **Increase** `cold_mult`: This will change the rate (and in extreme cases, the number) of worker recruitment. Use this to manage for fast transitions in demand

Same for comment above.


## Managing for Low Demand or Idle Periods

- **Adjust** `min_load`: Reducing `min_load` will reduce the minimum number of active workers. Set to `1` to reduce the number to its minimum value of 1 worker

We support scaling to zero as of 2/3/26 (assuming we deploy)

In the diagram's example, a user's client is attempting to infer from a machine learning model. With Vast's Serverless setup, the client:

1. Sends a [`/route/`](/documentation/serverless/route) POST request to the serverless system. This asks the system for a GPU instance to send the inference request.
1. Sends a `/route/` POST request to the serverless system. This asks the system for a GPU instance to send the inference request.

For consistency, we want to change "serverless system" to "Serverless engine"

2. The serverless system selects a ready and available worker instance from the user's endpoint and replies with a JSON object containing the URL of the selected instance.
3. The client then constructs a new POST request with its payload, authentication data, and the URL of the worker instance. This is sent to the worker.

This flow is abstracted away by the SDK


Vast.ai Serverless offers pay-as-you-go pricing for all workloads at the same rates as Vast.ai's non-Serverless GPU instances. Each instance accrues cost on a per second basis.
This guide explains how pricing works.
Unlike other providers, Vast Serverless offers pay-per-second pricing for all workloads at the same as Vast.ai’s non-Serverless GPU instances.

"at the same as" -> "at the same price as"

| State | Description | GPU compute | Storage | Bandwidth (in/out) |
|----------|-------------------|-------------|---------|--------------------|
| Ready | An active worker | Billed | Billed | Billed |
| Loading | Model is loading | Billed | Billed | Billed |
@LucasArmandVast (Feb 2, 2026)

"Starting" is included in "Loading", so users will also have to pay full price while their instance is starting up from a cold state.


But users never see "Starting", right? They will see Model Loading, which we should add here


"Model Loading" actually didn't make it into the new UI (maybe we need to talk to Madison about that), but yes users will see starting. Cold workers go from inactive -> starting -> ready, whereas new workers go from created -> loading -> ready


@LucasArmandVast Could you provide all states that a user would see and put which states have GPU compute, Storage and Bandwidth (in/out) billed. This would be helpful for David.


## Why Use the SDK

While there are other ways to interact with Vast Serverless—such as the CLI and the REST API—the SDK is the **most powerful and easiest** method to use. It is the recommended approach for most applications due to its higher-level abstractions, reliability, and ease of integration into Python-based workflows.

The SDK (in its current state) is not really a replacement for the CLI/UI, since they are still needed for endpoint/workergroup/template creation. The Deployments project will change this.

It is true, though, that the SDK is the pre-packaged "correct" way to use the API, which still remains available for advanced users.

<Frame caption="Serverless Architecture">
![Serverless Architecture](/images/serverless-architecture.webp)
</Frame>


This picture will need updating. The whole back-and-forth between the engine and the worker is now simplified through the SDK, i.e. the SDK does this for you and you never really have to know about it. I guess it could still be important for the Anthony persona since he needs to know how all his requests are being handled.



The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set.

## `gpu_ram`

Goes to the previously mentioned point, can we remove this and include in search_params somehow?


This is a question for devs


Parsing search_params might be weird, but I don't see why we couldn't do it. There is a real difference between "how many GB VRAM do you need" vs. "how many GB is your model weights", but they are probably close enough.

There is this whole infra built for tracking "additional disk usage" since pyworker start, and in theory "download progress" should be additional_disk_usage / model_size. Then we could provide approximate loading progress bars for each worker. I think this was an intended but unimplemented feature. It would probably look like changing this gpu_ram parameter to something like (optional) model_size.

The Target Utilization ratio determines how much spare capacity (headroom) the serverless engine maintains. For example, if your predicted load
is 900 tokens/second and target\_util is 0.9, the serverless engine will plan for 1000 tokens/second of capacity (900 ÷ 0.9 = 1000), leaving 100
tokens/second (11%) as buffer for traffic spikes.
## Minimum Inactive Workers (`min_workers`)

We should drop the 'Inactive'. This value is state agnostic, meaning 2 Ready and 3 Inactive still satisfies min_workers=5

---
title: Worker Recruitment

This doc is actually a list of workergroup parameters. Maybe we rename the title to "Workergroup Parameters" in a similar fashion to the existing "Endpoint Parameters"?

---
title: Architecture
description: Understand the architecture of Vast.ai Serverless, including the Serverless System, GPU Instances, and User (Client Application). Learn how the system works, how to use the routing process, and how to create Worker Groups.
title: Overview

We should rename the previous section to Serverless Feature Overview. This section should be named Architecture Overview. Having Serverless Overview followed by Overview is a bit confusing.

<Frame caption="Serverless Architecture">
![Serverless Architecture](/images/serverless-architecture.webp)
</Frame>

I think it would be clearer if each Pyworker <--> Model Inference Group was encapsulated in a box labeled GPU Instance.

- Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util`

You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system.
Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development).

I prefer the term "use case" instead of "function" for this sentence. For a software dev, "function" is a specific term.

An endpoint consists of:

It's important to note that having multiple Worker Groups within a single Endpoint is not always necessary. For most users, a single Worker Group within an Endpoint provides an optimal setup.
- A named endpoint identifier

Are you trying to capture both the endpoint name and the endpoint ID with this phrasing? I think "The endpoint's name" would be sufficient here.

<Frame caption="Serverless Architecture">
![Serverless Architecture](/images/serverless-architecture.webp)
</Frame>

The diagram should show all primary components so a user knows where each important component fits into this flow.

While CLI and API access are available, the SDK is the recommended method for most applications.

This 2-step routing process is used for security and flexibility. By having the client send payloads directly to the GPU instances, your payload information is never stored on Vast servers.
## Example Workflow

This example workflow describes what happens after a dev client already has their serverless environment set up. This should be made clear so that a new dev client who is trying to set up serverless doesn't think this workflow is what they have to follow to set it up.

2. The Serverless system routes the request and returns a suitable worker address based on current load and capacity.
3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data.
4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference.
5. The inference result is returned to the client.

The inference result is returned to the client's application and is then forwarded to their users.

3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data.
4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference.
5. The inference result is returned to the client.
6. Independently and continuously, each PyWorker reports operational and performance metrics back to the Serverless Engine, which uses this data to make ongoing scaling decisions.

I like this sentence.
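For readers who do want to see the raw two-step flow the SDK abstracts away, a hedged sketch of steps 1-5 using plain HTTP is below. The route URL, JSON field names, and worker path are assumptions for illustration; consult the /route/ reference and the SDK examples for the exact contract.

```python
# Illustrative only: URL, field names, and payload shape below are assumptions.
import requests

ROUTE_URL = "https://<serverless-engine-host>/route/"  # placeholder
API_KEY = "<your Vast API key>"

# Step 1: ask the Serverless engine for a worker to handle this request.
route = requests.post(ROUTE_URL, json={
    "endpoint": "llm-prod",  # endpoint name (hypothetical)
    "api_key": API_KEY,
    "cost": 100,             # workload estimate for this request (assumed field)
}).json()

worker_url = route["url"]    # assumed field: address of the selected worker

# Steps 2-3: send the payload directly to the selected worker, including the
# authentication data returned by the route call, so the payload never passes
# through Vast servers.
result = requests.post(
    f"{worker_url}/generate",  # the path depends on the PyWorker (assumed here)
    json={
        "auth_data": route.get("auth_data"),    # assumed field
        "payload": {"prompt": "Hello, world"},
    },
).json()

# Steps 4-5: the PyWorker forwards the payload to the model and the inference
# result comes back to the client.
print(result)
```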

A command-line style string containing additional parameters for instance creation that will be parsed and applied when the serverless engine creates new workers. This allows you to customize instance configuration beyond what’s specified in templates.

There is no default value for `launch_args`.

This parameter seems nebulous to me. We should have an example or something that tells developers what kind of format they should expect to use for this parameter.
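Even one illustrative value would go a long way; something along these lines, with the caveat that the flags shown are only an assumption about the expected command-line style, not verified syntax:

```python
# Purely illustrative: launch_args is a command-line style string of extra
# instance-creation options. Check the CLI's "create instance" options for the
# flags that are actually supported here.
launch_args = "--disk 64 --env '-e MODEL_NAME=my-model'"
```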

- `target_util = 0.9` → 11.1% spare capacity
- `target_util = 0.8` → 25% spare capacity
- `target_util = 0.5` → 100% spare capacity
- `target_util = 0.4` → 150% spare capacity

We should give a general equation so users aren't constrained to only these examples.
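For reference, a general form consistent with the examples above (sketched in Python):

```python
# spare capacity (as a fraction of predicted load) = (1 - target_util) / target_util
# planned capacity = predicted_load / target_util
def spare_capacity_fraction(target_util: float) -> float:
    return (1 - target_util) / target_util

spare_capacity_fraction(0.9)  # ~0.111 -> 11.1% headroom
spare_capacity_fraction(0.5)  # 1.0    -> 100% headroom
```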

| State | Description | GPU compute | Storage | Bandwidth (in/out) |
|----------|-------------------|-------------|---------|--------------------|
| Ready | An active worker | Billed | Billed | Billed |
| Loading | Model is loading | Billed | Billed | Billed |

| Inactive | An inactive worker | Not billed | Billed | Billed |

GPU compute refers to the per-second GPU rental charges. See the [Billing Help](/documentation/reference/billing#ugwiY) page for rate details.
| State | Description | GPU compute |

Change the header label from GPU compute to Billing Description

|-----------|--------------------------------------------------------------------------------------------|-------------|
| Active | - Engine is actively managing worker recruitment and release <br /> - Workers are active | All workers billed at their relevant states |
| Suspended | - Engine is NOT managing worker recruitment and release <br /> - Workers are active. | Workers are billed based on their state at time of suspension. <br /> Any workers that are currently being created or are loading, will complete to a ready state (and be billed as such). |
| Stopped | - Engine is NOT managing worker recruitment and release <br /> - Workers are all inactive | All workers are changed to and billed in inactive state |

All workers are changed to and are billed in the inactive state.

## Minimum Load (`min_load`)

If not specified during endpoint creation, the default value is 3.
Vast Serverless utilizes a concept of **load** as a metric of work that is performed by a worker, measured in performance (“perf”) per second. This is an internally computed value derived from benchmark tests and is normalized across different work types (tokens for LLMs, images for image generation, etc.). It is used to make scaling and capacity decisions.

I need some clarification about this from @LucasArmandVast and @Colter-Downing. Measuring load as perf per second doesn't feel accurate even though that's what is in the code base right now.

### Best practice for setting `min_load`

If not specified during endpoint creation, the default value is 5.
- Start with `min_load = 1` (the default), which guarantees at least one active worker

We should mention that if a developer wants zero scaling (scaling to 0 hot workers), the min_load should be 0.

## Cold Multiplier (`cold_mult`)

The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set.
While `min_workers` is fixed regardless of traffic patterns, `cold_mult` defines inactive capacity as a multiplier of the current active workload.

The serverless engine attempts to plan its scaling both for predicted short-term loads (1-30 s) and for predicted long-term loads (1 hour and above).

cold_mult is a scalar multiplier that allows developers to tune and plan for expected longer-term loads:

`cold_mult = (target_perf x target_util) / predicted_load`
