Updating serverless documentation as part of rewrite #58
Conversation
Developers want to use serverless. Based on this fact, the following should be assumed true:
All sections should serve and be organized around these statements. Users can set up their serverless environment via:
We should have sections that serve each of these form factors. Setting up their serverless environment includes:
Users can interact with their serverless environment via:
We should organize the sections so that it is clear to developers how they interact with serverless for each form factor. Interacting includes:
Users can maintain their serverless environment via:
We should organize the sections so that it is clear how they maintain their serverless environment. Maintaining includes:
| An endpoint consists of: | ||
| It's important to note that having multiple Worker Groups within a single Endpoint is not always necessary. For most users, a single Worker Group within an Endpoint provides an optimal setup. | ||
| - A named endpoint identifier |
This should just be "A name"
Actually I think the current proposal is correct?
| - Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util` | ||
| You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system. | ||
| Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development). |
Might be useful to explain here that endpoints act as "routing targets", and that all requests sent to an endpoint are load-balanced across that endpoint's workers.
| ### Workergroups | ||
| The system architecture for an application using Vast.ai Serverless includes the following components: | ||
| A **Workergroup** defines how workers are recruited and created. Workergroups are configured with [**workergroup-level parameters**](./workergroup-parameters) and are responsible for selecting which GPU offers are eligible for worker creation. |
More than just GPU selection, workergroups decide what code is actually running on the endpoint via the template. It's probably important to mention this here.
| <Frame caption="Serverless Architecture"> | ||
|  | ||
| </Frame> | ||
| - A serverless-compatible template (referenced by `template_id` or `template_hash`) |
Users would only set these fields in the CLI, in the UI they can select the template from the templates page/modal. Might be more clear to just drop the () here and say "A serverless-compatible template".
Also, we probably want to clearly define what makes a template serverless-compatible or not. This issue comes up frequently in support chats.
Maybe we could reference a different set of docs that define serverless compatibility?
| - A serverless-compatible template (referenced by `template_id` or `template_hash`) | ||
| - Hardware and marketplace filters defined via `search_params` | ||
| - Optional instance configuration overrides via `launch_args` | ||
| - Hardware requirements such as `gpu_ram` |
This is another point of confusion. There are two places where users currently have to specify GPU RAM requirements: in `search_params` and `gpu_ram`.
Their difference is:
- `search_params` filters out offers/instances that don't have at least the required GPU RAM.
- `gpu_ram` informs the autoscaler roughly how large the "model" it needs to download is, so it can estimate the loading timeout.
There should probably be a larger discussion on the necessity of this extra gpu_ram field, since it would be less confusing if we can just remove it and figure out a better solution for loading timeouts.
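To make the distinction concrete in the docs, a sketch like this might help (the field names are the ones listed in this PR; the `search_params` string and the values are only illustrative):

```python
# Illustrative workergroup configuration (not exact API syntax).
workergroup_config = {
    "template_hash": "<serverless-compatible template hash>",
    # Marketplace filter: only offers with at least this much GPU RAM are eligible.
    "search_params": "gpu_ram>=24 num_gpus=1 verified=true",
    # Separate hint to the autoscaler about rough model size (GB assumed),
    # used to estimate loading timeouts rather than to filter offers.
    "gpu_ram": 24,
}
```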
| - Hardware requirements such as `gpu_ram` | ||
| - A set of GPU instances (workers) created from the template | ||
| Multiple Workergroups can exist within a single Endpoint, each with different configurations. This enables advanced use cases such as hardware comparison, gradual model rollout, or mixed-model serving. For many applications, a single Workergroup is sufficient. |
Yes, but important to note that currently routing between workergroups is basically random, not controllable. We could add workergroup-level routing to the autoscaler to enable this.
Eh, probably more accurate to say routing is workergroup-agnostic. The workergroup does not factor in at all; routing considers all workers from all workergroups in the endpoint.
| ### Workers | ||
| **Workers** are individual GPU instances created and managed by the Serverless engine. Each worker runs a [**PyWorker**](./overview), a Python web server that loads the machine learning model and serves requests. |
PyWorker does not load the machine learning model. Both the PyWorker and the "machine learning model" (maybe a better, more generic name for this?) are launched by the Docker entrypoint or on_start.sh script. The PyWorker runs in parallel with the machine learning model and listens to it (via its log output). It reports the readiness of the model to the autoscaler, and acts as a proxy for incoming requests on the worker, keeping track of requests and passing them along to the model server.
Taking Lucas' comment, a more accurate sentence might be:
"Each worker runs a PyWorker, a Python web server that runs the machine learning model's on-start script, orchestrates hardware benchmarking, and serves requests."
But the PyWorker does not run the machine learning model's on-start script. Both the PyWorker and the machine learning model are started completely separately, both from the same on-start script. Two different processes, each started from the same script. They do not invoke one another. They communicate only by logs and passing HTTP requests along.
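For the docs, a rough sketch of that separation might help (everything here is illustrative: the module names, port, and log path are made up, and a real on-start script would be shell rather than Python):

```python
# Illustrative only: two independent processes launched from one start script,
# communicating only via the model's log output and HTTP.
import subprocess

# 1. Start the model server (hypothetical module name and flags).
model_server = subprocess.Popen(
    ["python", "-m", "model_server", "--port", "18000"],
    stdout=open("/var/log/model_server.log", "w"),
    stderr=subprocess.STDOUT,
)

# 2. Start the PyWorker separately (hypothetical module name). It tails the model
#    log to detect readiness, reports status to the autoscaler, and proxies
#    incoming requests to the model server on port 18000.
pyworker = subprocess.Popen(
    ["python", "-m", "pyworker", "--model-log", "/var/log/model_server.log"]
)

# Neither process invokes the other; this script only launches both of them.
model_server.wait()
pyworker.wait()
```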
| - Receiving and processing inference requests | ||
| - Reporting performance metrics (load, utilization, benchmark results) | ||
| - Participating in automated scaling and routing decisions |
"- Participating in automated scaling and routing decisions"
This is vague and only sort-of true. They do "participate" in the sense that they report information on their health, current load, dropped requests, and benchmark speed, but outside of just reporting that information, they don't actually take action for routing or scaling. All decision making is handled by the autoscaler based on the reports of the pyworkers.
"Inform automated scaling and routing decisions" is more correct
| ### Serverless Engine | ||
| The **Serverless Engine** is the decision-making service that manages workers across all endpoints and workergroups. Using configuration parameters and real-time metrics, it determines when to: | ||
| - Recruit new workers | ||
| - Activate inactive workers | ||
| - Release or destroy workers | ||
| The engine continuously evaluates cost-performance tradeoffs using automated performance testing and measured load. | ||
Important to note that the "Serverless Engine" / autoscaler does not just manage workers, but also acts as a load balancer for incoming requests. It spreads requests out across all available workers to minimize overall waiting time.
| 5. The PyWorker sends the model results back to the client. | ||
| 6. Independently and concurrently, each PyWorker in the Endpoint sends its operational metrics to the Serverless system, which the system uses to make scaling decisions. | ||
| - Authentication | ||
| - Routing requests to appropriate workers |
This is handled by the autoscaler, not the SDK. The SDK asks the autoscaler to "route" it to a worker, and then the SDK passes the request to that worker once it's routed.
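Could be worth a small sketch of that two-step flow in the docs (the endpoint name, request/response fields, host, and worker path below are assumptions, not the documented API):

```python
# Illustrative two-step flow: ask the Serverless Engine for a worker, then call it.
import requests

API_KEY = "YOUR_VAST_API_KEY"             # placeholder
ROUTE_URL = "<your /route/ endpoint URL>" # placeholder; see the /route/ docs

# Step 1: the autoscaler/engine picks a ready worker and returns its URL.
route = requests.post(ROUTE_URL, json={
    "endpoint": "my-endpoint",  # assumed field name
    "api_key": API_KEY,
    "cost": 100,                # estimated workload for this request
}).json()

# Step 2: send the actual payload directly to the selected worker.
result = requests.post(
    route["url"] + "/generate",  # response field and path are assumptions
    json={"auth_data": route.get("auth_data"), "payload": {"prompt": "Hello"}},
)
print(result.json())
```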
| ## How Benchmark Testing Works | ||
| When a new Workergroup is created, the serverless engine enters a **learning phase**. During this phase, it recruits a variety of machine types from those specified in `search_params`. For each new worker, the engine runs the user-configured benchmark and evaluates performance. |
The "learning phase" doesn't literally exist, but is a label to basically say "Your endpoint is inefficient right now because we haven't gone through enough scaling cycles to optimize it yet."
Also, the PyWorker, not the Serverless Engine, is responsible for running the benchmark, the results of which are then reported to the Serverless Engine.
| ## Best Practices for Initial Scaling | ||
| The speed at which the serverless engine “settles” into the most cost-effective mix of workers can vary depending on how quickly workers are recruited and released. Because of this, it is recommended to apply a **test load during the first day of operation** to help the system efficiently explore and converge on optimal hardware choices. | ||
Specifically, the act of scaling up and scaling down is where optimization occurs. Just supplying load to an endpoint likely will not be enough. Depending on the size of the endpoint and the workload type, users may need to scale their endpoint up and down a few times (in my experience, usually 3 or more times) before it settles on mostly efficient, working machines.
| For examples of how to simulate load against your endpoint, see the client examples in the Vast SDK repository: | ||
| https://github.com/vast-ai/vast-sdk/tree/main/examples/client |
We can point to a load-specific example:
https://github.com/vast-ai/vast-sdk/blob/main/examples/client/vllm_load_example.py
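If we also want an inline snippet, a toy load loop could look like this (sketch only; the linked vllm_load_example.py is the real reference, and the URL and request shape below are placeholders):

```python
# Toy load generator: sustain concurrent requests against the endpoint for a while.
import concurrent.futures
import time
import requests

ROUTE_URL = "<your /route/ endpoint URL>"  # placeholder

def one_request(i: int) -> int:
    # Placeholder call; in practice this would be the /route/ + worker flow or an SDK call.
    resp = requests.post(ROUTE_URL, json={"endpoint": "my-endpoint"})
    return resp.status_code

end = time.time() + 10 * 60  # keep load up for ~10 minutes
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    while time.time() < end:
        list(pool.map(one_request, range(16)))
```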
| - Output: | ||
| - A `float` representing workload; larger means “more expensive.” | ||
| - Recommendation: | ||
| - For many applications that have a vast majority of similarly complex tasks, utilizing a single constant value per task is sufficient (example cost = 100) |
This sentence could be worded better.
"For applications where requests do not vary much in complexity, returning a constant value (i.e. 100) is often sufficient."
| @@ -0,0 +1,37 @@ | |||
| --- | |||
| title: Managing Scale | |||
| description: Some configuration change strategies to manage different load scenarios. | |||
"configuration change strategies" doesn't make obvious sense to me, but I think I get what you're going for.
Maybe "Learn how to configure your Serverless endpoint for different load scenarios" ?
| { | ||
| "@type": "HowToStep", | ||
| "name": "Manage for Bursty Load", | ||
| "text": "Adjust min_workers to increase managed inactive workers and add capacity for peak demand. Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions. Check max_workers to ensure it's set high enough for the serverless engine to create the required number of workers." |
" Increase cold_mult to change the rate (and in extreme cases, the number) of worker recruitment to handle fast demand transitions"
I don't really understand this explanation of using cold_mult. Cold_mult just sets the cold worker target to hot_worker_target * cold_mult (roughly). It always increases the number of workers, and the rate of worker recruitment is identical, just with a higher target.
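Roughly, per that description (illustrative numbers, not engine code):

```python
# cold_mult sets the inactive ("cold") capacity target as a multiple of the hot target.
hot_worker_target = 4   # example value derived from current load
cold_mult = 2.5         # example setting

cold_worker_target = hot_worker_target * cold_mult  # ~10 workers kept recruitable
# Recruitment proceeds at the same rate either way; only the target is higher.
```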
| ## Managing for Bursty Load | ||
| - **Adjust** `min_workers`: This will change the number of managed inactive workers, and increase capacity for high peak | ||
| - **Increase** `cold_mult`: This will change the rate (and in extreme cases, the number) of worker recruitment. Use this to manage for fast transitions in demand |
Same for comment above.
| ## Managing for Low Demand or Idle Periods | ||
| - **Adjust** `min_load`: Reducing `min_load` will reduce the minimum number of active workers. Set to `1` to reduce the number to its minimum value of 1 worker |
We support scaling to zero as of 2/3/26 (assuming we deploy)
| In the diagram's example, a user's client is attempting to infer from a machine learning model. With Vast's Serverless setup, the client: | ||
| 1. Sends a [`/route/`](/documentation/serverless/route) POST request to the serverless system. This asks the system for a GPU instance to send the inference request. | ||
| 1. Sends a `/route/` POST request to the serverless system. This asks the system for a GPU instance to send the inference request. |
For consistency, we want to change "serverless system" to "Serverless engine"
| 2. The serverless system selects a ready and available worker instance from the user's endpoint and replies with a JSON object containing the URL of the selected instance. | ||
| 3. The client then constructs a new POST request with its payload, authentication data, and the URL of the worker instance. This is sent to the worker. |
This flow is abstracted away by the SDK
| Vast.ai Serverless offers pay-as-you-go pricing for all workloads at the same rates as Vast.ai's non-Serverless GPU instances. Each instance accrues cost on a per second basis. | ||
| This guide explains how pricing works. | ||
| Unlike other providers, Vast Serverless offers pay-per-second pricing for all workloads at the same as Vast.ai’s non-Serverless GPU instances. |
"at the same as" -> "at the same price as"
| | State | Description | GPU compute | Storage | Bandwidth (in/out) | | ||
| |----------|-------------------|-------------|---------|--------------------| | ||
| | Ready | An active worker | Billed | Billed | Billed | | ||
| | Loading | Model is loading | Billed | Billed | Billed | |
"Starting" is included in "Loading", so users will also have to pay full price while their instance is starting up from a cold state.
But users never see "Starting", right? They will see Model Loading, which we should add here
"Model Loading" actually didn't make it into the new UI (maybe we need to talk to Madison about that), but yes users will see starting. Cold workers go from inactive -> starting -> ready, whereas new workers go from created -> loading -> ready
@LucasArmandVast Could you provide all states that a user would see and note which states have GPU compute, Storage and Bandwidth (in/out) billed. This would be helpful for David.
| ## Why Use the SDK | ||
| While there are other ways to interact with Vast Serverless—such as the CLI and the REST API—the SDK is the **most powerful and easiest** method to use. It is the recommended approach for most applications due to its higher-level abstractions, reliability, and ease of integration into Python-based workflows. |
The SDK (in its current state) is not really a replacement for the CLI/UI, since they are still needed for endpoint/workergroup/template creation. Deployments project will change this.
It is true though that using the SDK is the pre-packaged "correct" way to use the API, which still remains available for advanced users.
| <Frame caption="Serverless Architecture"> | ||
|  | ||
| </Frame> | ||
This picture will need updating. The whole back-and-forth between the engine and the worker is now simplified through the SDK, i.e. the SDK does this for you and you never really have to know about it. I guess it could still be important for the Anthony persona since he needs to know how all his requests are being handled.
| The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set. | ||
| ## `gpu_ram` |
Goes to the previously mentioned point, can we remove this and include in search_params somehow?
This is a question for devs
Parsing search_params might be weird, but I don't see why we couldn't do it. There is a real difference between "how many GB VRAM do you need" vs. "how many GB is your model weights", but they are probably close enough.
There is this whole infra built for tracking "additional disk usage" since pyworker start, and in theory "download progress" should be additional_disk_usage / model_size. Then we could provide approximate loading progress bars for each worker. I think this was an intended but unimplemented feature. It would probably look like changing this gpu_ram parameter to something like (optional) model_size.
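i.e. something like this, purely as a sketch of the proposed (unimplemented) idea:

```python
# Hypothetical loading-progress estimate based on the idea above.
def loading_progress(additional_disk_usage_gb: float, model_size_gb: float) -> float:
    """Fraction of the model assumed to be downloaded since PyWorker start."""
    if model_size_gb <= 0:
        return 0.0
    return min(additional_disk_usage_gb / model_size_gb, 1.0)
```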
| The Target Utilization ratio determines how much spare capacity (headroom) the serverless engine maintains. For example, if your predicted load | ||
| is 900 tokens/second and target\_util is 0.9, the serverless engine will plan for 1000 tokens/second of capacity (900 ÷ 0.9 = 1000), leaving 100 | ||
| tokens/second (11%) as buffer for traffic spikes. | ||
| ## Minimum Inactive Workers (`min_workers`) |
We should drop the 'Inactive'. This value is state agnostic, meaning 2 Ready and 3 Inactive still satisfies min_workers=5
| @@ -0,0 +1,69 @@ | |||
| --- | |||
| title: Worker Recruitment | |||
This doc is actually a list of workergroup parameters. Maybe we rename the title to "Workergroup Parameters" in a similar fashion to the existing "Endpoint Parameters"?
| --- | ||
| title: Architecture | ||
| description: Understand the architecture of Vast.ai Serverless, including the Serverless System, GPU Instances, and User (Client Application). Learn how the system works, how to use the routing process, and how to create Worker Groups. | ||
| title: Overview |
We should rename the previous section to Serverless Feature Overview. This section should be named Architecture Overview. Having Serverless Overview followed by Overview is a bit confusing.
| <Frame caption="Serverless Architecture"> | ||
|  | ||
| </Frame> |
I think it would be clearer if each Pyworker <--> Model Inference Group was encapsulated in a box labeled GPU Instance.
| - Endpoint parameters such as `max_workers`, `min_load`, `min_workers`, `cold_mult`, `min_cold_load`, and `target_util` | ||
| You can create Worker Groups using our [Serverless-Compatible Templates](/documentation/serverless/text-generation-inference-tgi), which are customized versions of popular templates on Vast, designed to be used on the serverless system. | ||
| Users typically create one endpoint per **function** (for example, text generation or image generation) and per **environment** (production, staging, development). |
I prefer the term use case instead of function for this sentence. For a software dev, a function is a specific term.
| An endpoint consists of: | ||
| It's important to note that having multiple Worker Groups within a single Endpoint is not always necessary. For most users, a single Worker Group within an Endpoint provides an optimal setup. | ||
| - A named endpoint identifier |
Are you trying to capture endpoint name and endpoint id with this phrasing? I think "The endpoint's name" would be sufficient here.
| <Frame caption="Serverless Architecture"> | ||
|  | ||
| </Frame> |
The diagram should show all primary components so a user knows where each important component fits into this flow.
| While CLI and API access are available, the SDK is the recommended method for most applications. | ||
| This 2-step routing process is used for security and flexibility. By having the client send payloads directly to the GPU instances, your payload information is never stored on Vast servers. | ||
| ## Example Workflow |
This example workflow describes what happens after a dev client already has their serverless environment set up. This should be made clear so that a new dev client who is trying to set up serverless doesn't think this workflow is what they have to follow to set it up.
| 2. The Serverless system routes the request and returns a suitable worker address based on current load and capacity. | ||
| 3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data. | ||
| 4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference. | ||
| 5. The inference result is returned to the client. |
The inference result is returned to the client's application, which then forwards it to their users.
| 3. The client sends the request directly to the selected worker’s API endpoint, including the required authentication data. | ||
| 4. The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference. | ||
| 5. The inference result is returned to the client. | ||
| 6. Independently and continuously, each PyWorker reports operational and performance metrics back to the Serverless Engine, which uses this data to make ongoing scaling decisions. |
I like this sentence.
| A command-line style string containing additional parameters for instance creation that will be parsed and applied when the serverless engine creates new workers. This allows you to customize instance configuration beyond what’s specified in templates. | ||
| There is no default value for `launch_args`. |
This parameter seems nebulous to me. We should have an example or something that tells developers what kind of format they should expect to use for this parameter.
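Agreed. Even a hedged example would help; something along these lines, where the flags are only placeholders to show the "CLI-style string" format and would need to be checked against the supported instance-creation options:

```python
# Illustrative only: launch_args is documented as a "command-line style string".
# The specific flags below are assumptions and should be verified against the
# supported instance-creation options before going into the docs.
launch_args = "--disk 64 --env '-e HF_TOKEN=<your token>'"
```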
| - `target_util = 0.9` → 11.1% spare capacity | ||
| - `target_util = 0.8` → 25% spare capacity | ||
| - `target_util = 0.5` → 100% spare capacity | ||
| - `target_util = 0.4` → 150% spare capacity |
We should give a general equation so users aren't constrained to only these examples.
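The examples above all follow `spare capacity = 1 / target_util - 1`; a quick sketch:

```python
# General relationship behind the listed examples.
for target_util in (0.9, 0.8, 0.5, 0.4):
    spare = 1 / target_util - 1
    print(f"target_util = {target_util} -> {spare:.1%} spare capacity")
# Prints: 11.1%, 25.0%, 100.0%, 150.0%
```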
| | State | Description | GPU compute | Storage | Bandwidth (in/out) | | ||
| |----------|-------------------|-------------|---------|--------------------| | ||
| | Ready | An active worker | Billed | Billed | Billed | | ||
| | Loading | Model is loading | Billed | Billed | Billed | |
@LucasArmandVast Could you provide all states that a user would see and note which states have GPU compute, Storage and Bandwidth (in/out) billed. This would be helpful for David.
| | Inactive | Not billed | Billed | Billed | Billed | | ||
| GPU compute refers to the per-second GPU rental charges. See the [Billing Help](/documentation/reference/billing#ugwiY) page for rate details. No newline at end of file | ||
| | State | Description | GPU compute | |
Change the header label from GPU compute to Billing Description
| |-----------|--------------------------------------------------------------------------------------------|-------------| | ||
| | Active | - Engine is actively managing worker recruitment and release <br /> - Workers are active | All workers billed at their relevant states | | ||
| | Suspended | - Engine is NOT managing worker recruitment and release <br /> - Workers are active. | Workers are billed based on their state at time of suspension. <br /> Any workers that are currently being created or are loading, will complete to a ready state (and be billed as such). | | ||
| | Stopped | - Engine is NOT managing worker recruitment and release <br /> - Workers are all inactive | All workers are changed to and billed in inactive state | |
All workers are changed to and are billed in the inactive state.
| ## Minimum Load (`min_load`) | ||
| If not specified during endpoint creation, the default value is 3. | ||
| Vast Serverless utilizes a concept of **load** as a metric of work that is performed by a worker, measured in performance (“perf”) per second. This is an internally computed value derived from benchmark tests and is normalized across different work types (tokens for LLMs, images for image generation, etc.). It is used to make scaling and capacity decisions. |
I need some clarification about this from @LucasArmandVast and @Colter-Downing. Measuring load as perf per second doesn't feel accurate even though that's what is in the code base right now.
| ### Best practice for setting `min_load` | ||
| If not specified during endpoint creation, the default value is 5. | ||
| - Start with `min_load = 1` (the default), which guarantees at least one active worker |
We should mention that if a developer wants zero scaling (scaling to 0 hot workers), the min_load should be 0.
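e.g. something like this in the parameter docs (values are illustrative; whether `min_workers` also needs to be 0 is an assumption on my part):

```python
# Illustrative endpoint settings for scale-to-zero (parameter names from this doc set).
scale_to_zero = {
    "min_load": 0,     # 0 allows the endpoint to scale down to zero active workers
    "min_workers": 0,  # presumably also needed so no idle workers are retained (assumption)
    "cold_mult": 1.0,
    "max_workers": 5,
}
```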
| ## Cold Multiplier (`cold_mult`) | ||
| The parameters below are specific to only Workergroups, not Endpoints. Pre-configured serverless templates from Vast will have these values already set. | ||
| While `min_workers` is fixed regardless of traffic patterns, `cold_mult` defines inactive capacity as a multiplier of the current active workload. |
The serverless engine attempts to plan its scaling for both predicted short term loads (1-30s) and for predicted long term loads (1 hour and above).
cold_mult is a scalar multiplier that allows developers to tune and plan for expected longer term loads.
cold_mult = (target_perf x target_util)/(predicted_load)
Self-explanatory: rewrote and re-ordered the serverless documentation.
https://vastai.atlassian.net/browse/AUTO-1104