keyboardAnt commented Nov 10, 2025

… improve page management. Introduce debug mode for better memory tracking and enhance page allocation methods in PagedTokenToKVPoolAllocator and SWATokenToKVPoolAllocator. Update ChunkCache and RadixCache to streamline key handling and memory freeing processes.
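For context, the benchmark runs below opt into the new debug mode through the `SGLANG_DEBUG_MEMORY_POOL` environment variable. A minimal sketch of how such a gate could look (the helper name and logged fields are illustrative, not the exact implementation):

```python
import os

# Debug mode is opt-in via the environment, as in the benchmark commands below.
DEBUG_MEMORY_POOL = os.environ.get("SGLANG_DEBUG_MEMORY_POOL", "0") == "1"

def maybe_log_pool_state(available_size: int, evictable_size: int) -> None:
    """Illustrative helper: emit allocator bookkeeping only when debug mode is on."""
    if DEBUG_MEMORY_POOL:
        print(f"DEBUG avail={available_size} evictable={evictable_size}")
```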

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

On an NVIDIA L40S:
[nadavt@lgn09 sglang]$ nvidia-smi
Tue Nov 11 01:59:30 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:83:00.0 Off |                    0 |
| N/A   25C    P8             33W /  350W |       1MiB /  46068MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[nadavt@lgn09 sglang]$ 
meta-llama/Meta-Llama-3.1-8B-Instruct: total token throughput 2120.57 → 2124.46 tok/s (roughly +0.2%).

before (main at commit 012bfc4fd):

[nadavt@lgn09 sglang]$ SGLANG_DEBUG_MEMORY_POOL=1 \
python -m sglang.bench_offline_throughput \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --random-input-len 128 \
  --random-output-len 256 \
  --random-range-ratio 0.5
WARNING:sglang.srt.server_args:Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-11-11 04:05:21] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.851, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=4096, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=815120231, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, 
speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=32, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, 
disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)
[2025-11-11 04:05:23] Using default HuggingFace chat template with detected content format: string
[2025-11-11 04:05:37] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-11-11 04:05:38] Init torch distributed ends. mem usage=0.00 GB
[2025-11-11 04:05:38] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-11-11 04:05:39] Load weight begin. avail mem=44.00 GB
[2025-11-11 04:05:40] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.96it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.71it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  2.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.08it/s]

[2025-11-11 04:05:43] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=28.91 GB, mem usage=15.09 GB.
[2025-11-11 04:05:43] Using KV cache dtype: torch.bfloat16
[2025-11-11 04:05:43] KV Cache is allocated. #tokens: 183133, K size: 11.18 GB, V size: 11.18 GB
[2025-11-11 04:05:43] Memory pool end. avail mem=5.51 GB
[2025-11-11 04:05:43] Capture cuda graph begin. This can take up to several minutes. avail mem=5.00 GB
[2025-11-11 04:05:43] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32]
Capturing batches (bs=1 avail_mem=4.83 GB): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  7.41it/s]
[2025-11-11 04:05:44] Capture cuda graph end. Time elapsed: 1.64 s. mem usage=0.20 GB. avail mem=4.81 GB.
[2025-11-11 04:05:45] max_total_num_tokens=183133, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2048, context_len=131072, available_gpu_mem=4.81 GB
#Input tokens: 62509
#Output tokens: 41627
#Input tokens: 4096
#Output tokens: 256
[2025-11-11 04:06:08] 
Warmup...
[2025-11-11 04:06:08] Prefill batch, #new-seq: 6, #new-token: 1542, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 04:06:09] Prefill batch, #new-seq: 10, #new-token: 2570, #cached-token: 0, token usage: 0.01, #running-req: 6, #queue-req: 0, 
[2025-11-11 04:06:10] 
Benchmark...
[2025-11-11 04:06:10] Prefill batch, #new-seq: 6, #new-token: 947, #cached-token: 6, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 04:06:10] Prefill batch, #new-seq: 9, #new-token: 4096, #cached-token: 9, token usage: 0.01, #running-req: 6, #queue-req: 185, 
[2025-11-11 04:06:11] Prefill batch, #new-seq: 16, #new-token: 4096, #cached-token: 15, token usage: 0.03, #running-req: 14, #queue-req: 170, 
[2025-11-11 04:06:11] Prefill batch, #new-seq: 18, #new-token: 4096, #cached-token: 19, token usage: 0.05, #running-req: 29, #queue-req: 153, 
[2025-11-11 04:06:11] Prefill batch, #new-seq: 14, #new-token: 4096, #cached-token: 19, token usage: 0.07, #running-req: 46, #queue-req: 140, 
[2025-11-11 04:06:11] Prefill batch, #new-seq: 16, #new-token: 4096, #cached-token: 29, token usage: 0.09, #running-req: 59, #queue-req: 125, 
[2025-11-11 04:06:12] Prefill batch, #new-seq: 6, #new-token: 4096, #cached-token: 8, token usage: 0.12, #running-req: 74, #queue-req: 120, 
[2025-11-11 04:06:12] Prefill batch, #new-seq: 5, #new-token: 4096, #cached-token: 12, token usage: 0.14, #running-req: 79, #queue-req: 116, 
[2025-11-11 04:06:12] Prefill batch, #new-seq: 19, #new-token: 4096, #cached-token: 38, token usage: 0.16, #running-req: 83, #queue-req: 98, 
[2025-11-11 04:06:13] Prefill batch, #new-seq: 5, #new-token: 4096, #cached-token: 5, token usage: 0.18, #running-req: 101, #queue-req: 94, 
[2025-11-11 04:06:13] Prefill batch, #new-seq: 7, #new-token: 4096, #cached-token: 13, token usage: 0.21, #running-req: 105, #queue-req: 88, 
[2025-11-11 04:06:13] Prefill batch, #new-seq: 17, #new-token: 4096, #cached-token: 33, token usage: 0.23, #running-req: 111, #queue-req: 72, 
[2025-11-11 04:06:14] Prefill batch, #new-seq: 19, #new-token: 4096, #cached-token: 44, token usage: 0.25, #running-req: 127, #queue-req: 54, 
[2025-11-11 04:06:14] Prefill batch, #new-seq: 17, #new-token: 4096, #cached-token: 32, token usage: 0.27, #running-req: 145, #queue-req: 38, 
[2025-11-11 04:06:14] Prefill batch, #new-seq: 20, #new-token: 4096, #cached-token: 42, token usage: 0.30, #running-req: 161, #queue-req: 19, 
[2025-11-11 04:06:15] Prefill batch, #new-seq: 20, #new-token: 3853, #cached-token: 41, token usage: 0.32, #running-req: 180, #queue-req: 0, 
[2025-11-11 04:06:16] Decode batch, #running-req: 144, #token: 45073, token usage: 0.25, cuda graph: False, gen throughput (token/s): 143.87, #queue-req: 0, 
[2025-11-11 04:06:18] Decode batch, #running-req: 121, #token: 42505, token usage: 0.23, cuda graph: False, gen throughput (token/s): 3508.26, #queue-req: 0, 
[2025-11-11 04:06:19] Decode batch, #running-req: 107, #token: 41219, token usage: 0.23, cuda graph: False, gen throughput (token/s): 3241.12, #queue-req: 0, 
[2025-11-11 04:06:20] Decode batch, #running-req: 97, #token: 41978, token usage: 0.23, cuda graph: False, gen throughput (token/s): 2899.32, #queue-req: 0, 
[2025-11-11 04:06:22] Decode batch, #running-req: 90, #token: 42597, token usage: 0.23, cuda graph: False, gen throughput (token/s): 2665.99, #queue-req: 0, 
[2025-11-11 04:06:23] Decode batch, #running-req: 76, #token: 41654, token usage: 0.23, cuda graph: False, gen throughput (token/s): 2360.55, #queue-req: 0, 
[2025-11-11 04:06:25] Decode batch, #running-req: 64, #token: 38822, token usage: 0.21, cuda graph: False, gen throughput (token/s): 2025.60, #queue-req: 0, 
[2025-11-11 04:06:26] Decode batch, #running-req: 51, #token: 34076, token usage: 0.19, cuda graph: False, gen throughput (token/s): 1663.07, #queue-req: 0, 
[2025-11-11 04:06:27] Decode batch, #running-req: 46, #token: 34901, token usage: 0.19, cuda graph: False, gen throughput (token/s): 1447.08, #queue-req: 0, 
[2025-11-11 04:06:29] Decode batch, #running-req: 36, #token: 28419, token usage: 0.16, cuda graph: False, gen throughput (token/s): 1262.64, #queue-req: 0, 
[2025-11-11 04:06:30] Decode batch, #running-req: 28, #token: 22920, token usage: 0.13, cuda graph: True, gen throughput (token/s): 1086.59, #queue-req: 0, 
[2025-11-11 04:06:31] Decode batch, #running-req: 22, #token: 20033, token usage: 0.11, cuda graph: True, gen throughput (token/s): 895.36, #queue-req: 0, 
[2025-11-11 04:06:32] Decode batch, #running-req: 20, #token: 19777, token usage: 0.11, cuda graph: True, gen throughput (token/s): 768.87, #queue-req: 0, 
[2025-11-11 04:06:33] Decode batch, #running-req: 15, #token: 17273, token usage: 0.09, cuda graph: True, gen throughput (token/s): 681.65, #queue-req: 0, 
[2025-11-11 04:06:34] Decode batch, #running-req: 12, #token: 13812, token usage: 0.08, cuda graph: True, gen throughput (token/s): 532.14, #queue-req: 0, 
[2025-11-11 04:06:35] Decode batch, #running-req: 11, #token: 13642, token usage: 0.07, cuda graph: True, gen throughput (token/s): 475.19, #queue-req: 0, 
[2025-11-11 04:06:36] Decode batch, #running-req: 11, #token: 14082, token usage: 0.08, cuda graph: True, gen throughput (token/s): 444.93, #queue-req: 0, 
[2025-11-11 04:06:37] Decode batch, #running-req: 10, #token: 13799, token usage: 0.08, cuda graph: True, gen throughput (token/s): 431.19, #queue-req: 0, 
[2025-11-11 04:06:38] Decode batch, #running-req: 8, #token: 10946, token usage: 0.06, cuda graph: True, gen throughput (token/s): 382.64, #queue-req: 0, 
[2025-11-11 04:06:39] Decode batch, #running-req: 2, #token: 6877, token usage: 0.04, cuda graph: True, gen throughput (token/s): 180.90, #queue-req: 0, 
[2025-11-11 04:06:40] Decode batch, #running-req: 2, #token: 6957, token usage: 0.04, cuda graph: True, gen throughput (token/s): 86.13, #queue-req: 0, 
[2025-11-11 04:06:41] Decode batch, #running-req: 2, #token: 7037, token usage: 0.04, cuda graph: True, gen throughput (token/s): 86.08, #queue-req: 0, 
[2025-11-11 04:06:42] Decode batch, #running-req: 2, #token: 7117, token usage: 0.04, cuda graph: True, gen throughput (token/s): 86.05, #queue-req: 0, 
[2025-11-11 04:06:43] Decode batch, #running-req: 2, #token: 7197, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.99, #queue-req: 0, 
[2025-11-11 04:06:44] Decode batch, #running-req: 2, #token: 7277, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.95, #queue-req: 0, 
[2025-11-11 04:06:44] Decode batch, #running-req: 2, #token: 7357, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.90, #queue-req: 0, 
[2025-11-11 04:06:45] Decode batch, #running-req: 2, #token: 7437, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.84, #queue-req: 0, 
[2025-11-11 04:06:46] Decode batch, #running-req: 2, #token: 7517, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.79, #queue-req: 0, 
[2025-11-11 04:06:47] Decode batch, #running-req: 2, #token: 7597, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.74, #queue-req: 0, 
[2025-11-11 04:06:48] Decode batch, #running-req: 2, #token: 7677, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.69, #queue-req: 0, 
[2025-11-11 04:06:49] Decode batch, #running-req: 2, #token: 7757, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.63, #queue-req: 0, 
[2025-11-11 04:06:50] Decode batch, #running-req: 2, #token: 7837, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.58, #queue-req: 0, 
[2025-11-11 04:06:51] Decode batch, #running-req: 2, #token: 7917, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.52, #queue-req: 0, 
[2025-11-11 04:06:52] Decode batch, #running-req: 2, #token: 7997, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.47, #queue-req: 0, 
[2025-11-11 04:06:53] Decode batch, #running-req: 2, #token: 8077, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.41, #queue-req: 0, 
[2025-11-11 04:06:54] Decode batch, #running-req: 2, #token: 8157, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.37, #queue-req: 0, 
[2025-11-11 04:06:55] Decode batch, #running-req: 2, #token: 8237, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.31, #queue-req: 0, 
[2025-11-11 04:06:56] Decode batch, #running-req: 2, #token: 8317, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.26, #queue-req: 0, 
[2025-11-11 04:06:57] Decode batch, #running-req: 2, #token: 8397, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.21, #queue-req: 0, 
[2025-11-11 04:06:58] Decode batch, #running-req: 2, #token: 8477, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.15, #queue-req: 0, 
[2025-11-11 04:06:58] Decode batch, #running-req: 2, #token: 8557, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.10, #queue-req: 0, 
[2025-11-11 04:06:59] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, cuda graph: True, gen throughput (token/s): 61.04, #queue-req: 0, 

====== Offline Throughput Benchmark Result =======
Backend:                                 engine    
Successful requests:                     200       
Benchmark duration (s):                  49.11     
Total input tokens:                      62509     
Total generated tokens:                  41627     
Last generation throughput (tok/s):      61.04     
Request throughput (req/s):              4.07      
Input token throughput (tok/s):          1272.90   
Output token throughput (tok/s):         847.67    
Total token throughput (tok/s):          2120.57   
==================================================
[nadavt@lgn09 sglang]$ 

after (this PR):

[nadavt@lgn09 sglang]$ SGLANG_DEBUG_MEMORY_POOL=1 python -m sglang.bench_offline_throughput   --model-path meta-llama/Meta-Llama-3.1-8B-Instruct   --num-prompts 200   --random-input-len 128   --random-output-len 256   --random-range-ratio 0.5
WARNING:sglang.srt.server_args:Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-11-11 04:27:11] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.851, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=4096, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=637981649, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, 
speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=32, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, 
disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)
[2025-11-11 04:27:12] Using default HuggingFace chat template with detected content format: string
[2025-11-11 04:27:27] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-11-11 04:27:28] Init torch distributed ends. mem usage=0.00 GB
[2025-11-11 04:27:28] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-11-11 04:27:29] Load weight begin. avail mem=44.00 GB
[2025-11-11 04:27:29] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.99it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.73it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  2.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.10it/s]

[2025-11-11 04:27:32] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=28.91 GB, mem usage=15.09 GB.
[2025-11-11 04:27:32] Using KV cache dtype: torch.bfloat16
[2025-11-11 04:27:32] KV Cache is allocated. #tokens: 183133, K size: 11.18 GB, V size: 11.18 GB
[2025-11-11 04:27:32] Memory pool end. avail mem=5.51 GB
[2025-11-11 04:27:33] Capture cuda graph begin. This can take up to several minutes. avail mem=5.00 GB
[2025-11-11 04:27:33] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32]
Capturing batches (bs=1 avail_mem=4.83 GB): 100%|██████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  7.32it/s]
[2025-11-11 04:27:34] Capture cuda graph end. Time elapsed: 1.66 s. mem usage=0.20 GB. avail mem=4.81 GB.
[2025-11-11 04:27:35] max_total_num_tokens=183133, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=2048, context_len=131072, available_gpu_mem=4.81 GB
#Input tokens: 62509
#Output tokens: 41627
#Input tokens: 4096
#Output tokens: 256
[2025-11-11 04:27:58] 
Warmup...
[2025-11-11 04:27:58] Prefill batch, #new-seq: 3, #new-token: 771, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 04:27:59] Prefill batch, #new-seq: 13, #new-token: 3341, #cached-token: 0, token usage: 0.00, #running-req: 3, #queue-req: 0, 
[2025-11-11 04:28:01] 
Benchmark...
[2025-11-11 04:28:01] Prefill batch, #new-seq: 4, #new-token: 895, #cached-token: 4, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 04:28:01] Prefill batch, #new-seq: 10, #new-token: 4096, #cached-token: 10, token usage: 0.00, #running-req: 4, #queue-req: 186, 
[2025-11-11 04:28:01] Prefill batch, #new-seq: 17, #new-token: 4096, #cached-token: 16, token usage: 0.03, #running-req: 13, #queue-req: 170, 
[2025-11-11 04:28:01] Prefill batch, #new-seq: 18, #new-token: 4096, #cached-token: 19, token usage: 0.05, #running-req: 29, #queue-req: 153, 
[2025-11-11 04:28:01] Prefill batch, #new-seq: 13, #new-token: 4096, #cached-token: 18, token usage: 0.07, #running-req: 46, #queue-req: 141, 
[2025-11-11 04:28:02] Prefill batch, #new-seq: 17, #new-token: 4096, #cached-token: 31, token usage: 0.09, #running-req: 58, #queue-req: 125, 
[2025-11-11 04:28:02] Prefill batch, #new-seq: 6, #new-token: 4096, #cached-token: 8, token usage: 0.12, #running-req: 74, #queue-req: 120, 
[2025-11-11 04:28:02] Prefill batch, #new-seq: 5, #new-token: 4096, #cached-token: 12, token usage: 0.14, #running-req: 79, #queue-req: 116, 
[2025-11-11 04:28:03] Prefill batch, #new-seq: 19, #new-token: 4096, #cached-token: 38, token usage: 0.16, #running-req: 83, #queue-req: 98, 
[2025-11-11 04:28:03] Prefill batch, #new-seq: 5, #new-token: 4096, #cached-token: 5, token usage: 0.18, #running-req: 101, #queue-req: 94, 
[2025-11-11 04:28:03] Prefill batch, #new-seq: 7, #new-token: 4096, #cached-token: 13, token usage: 0.21, #running-req: 105, #queue-req: 88, 
[2025-11-11 04:28:03] Prefill batch, #new-seq: 17, #new-token: 4096, #cached-token: 33, token usage: 0.23, #running-req: 111, #queue-req: 72, 
[2025-11-11 04:28:04] Prefill batch, #new-seq: 19, #new-token: 4096, #cached-token: 44, token usage: 0.25, #running-req: 127, #queue-req: 54, 
[2025-11-11 04:28:04] Prefill batch, #new-seq: 17, #new-token: 4096, #cached-token: 32, token usage: 0.27, #running-req: 145, #queue-req: 38, 
[2025-11-11 04:28:04] Prefill batch, #new-seq: 20, #new-token: 4096, #cached-token: 42, token usage: 0.30, #running-req: 161, #queue-req: 19, 
[2025-11-11 04:28:05] Prefill batch, #new-seq: 20, #new-token: 3904, #cached-token: 41, token usage: 0.32, #running-req: 180, #queue-req: 0, 
[2025-11-11 04:28:06] Decode batch, #running-req: 144, #token: 45148, token usage: 0.25, cuda graph: False, gen throughput (token/s): 142.31, #queue-req: 0, 
[2025-11-11 04:28:08] Decode batch, #running-req: 121, #token: 42601, token usage: 0.23, cuda graph: False, gen throughput (token/s): 3486.96, #queue-req: 0, 
[2025-11-11 04:28:09] Decode batch, #running-req: 107, #token: 41328, token usage: 0.23, cuda graph: False, gen throughput (token/s): 3243.98, #queue-req: 0, 
[2025-11-11 04:28:11] Decode batch, #running-req: 97, #token: 42097, token usage: 0.23, cuda graph: False, gen throughput (token/s): 2899.28, #queue-req: 0, 
[2025-11-11 04:28:12] Decode batch, #running-req: 90, #token: 42724, token usage: 0.23, cuda graph: False, gen throughput (token/s): 2668.58, #queue-req: 0, 
[2025-11-11 04:28:13] Decode batch, #running-req: 76, #token: 41794, token usage: 0.23, cuda graph: False, gen throughput (token/s): 2362.79, #queue-req: 0, 
[2025-11-11 04:28:15] Decode batch, #running-req: 64, #token: 38978, token usage: 0.21, cuda graph: False, gen throughput (token/s): 2026.74, #queue-req: 0, 
[2025-11-11 04:28:16] Decode batch, #running-req: 51, #token: 34243, token usage: 0.19, cuda graph: False, gen throughput (token/s): 1663.62, #queue-req: 0, 
[2025-11-11 04:28:17] Decode batch, #running-req: 46, #token: 35071, token usage: 0.19, cuda graph: False, gen throughput (token/s): 1448.91, #queue-req: 0, 
[2025-11-11 04:28:19] Decode batch, #running-req: 36, #token: 28599, token usage: 0.16, cuda graph: False, gen throughput (token/s): 1262.71, #queue-req: 0, 
[2025-11-11 04:28:20] Decode batch, #running-req: 28, #token: 23108, token usage: 0.13, cuda graph: True, gen throughput (token/s): 1086.85, #queue-req: 0, 
[2025-11-11 04:28:21] Decode batch, #running-req: 22, #token: 20227, token usage: 0.11, cuda graph: True, gen throughput (token/s): 895.43, #queue-req: 0, 
[2025-11-11 04:28:22] Decode batch, #running-req: 20, #token: 19973, token usage: 0.11, cuda graph: True, gen throughput (token/s): 769.09, #queue-req: 0, 
[2025-11-11 04:28:23] Decode batch, #running-req: 15, #token: 17474, token usage: 0.10, cuda graph: True, gen throughput (token/s): 681.93, #queue-req: 0, 
[2025-11-11 04:28:24] Decode batch, #running-req: 12, #token: 14016, token usage: 0.08, cuda graph: True, gen throughput (token/s): 532.17, #queue-req: 0, 
[2025-11-11 04:28:25] Decode batch, #running-req: 11, #token: 13847, token usage: 0.08, cuda graph: True, gen throughput (token/s): 475.20, #queue-req: 0, 
[2025-11-11 04:28:26] Decode batch, #running-req: 11, #token: 14287, token usage: 0.08, cuda graph: True, gen throughput (token/s): 444.94, #queue-req: 0, 
[2025-11-11 04:28:27] Decode batch, #running-req: 10, #token: 14005, token usage: 0.08, cuda graph: True, gen throughput (token/s): 431.20, #queue-req: 0, 
[2025-11-11 04:28:28] Decode batch, #running-req: 8, #token: 11155, token usage: 0.06, cuda graph: True, gen throughput (token/s): 382.67, #queue-req: 0, 
[2025-11-11 04:28:29] Decode batch, #running-req: 2, #token: 7091, token usage: 0.04, cuda graph: True, gen throughput (token/s): 180.90, #queue-req: 0, 
[2025-11-11 04:28:30] Decode batch, #running-req: 2, #token: 7171, token usage: 0.04, cuda graph: True, gen throughput (token/s): 86.14, #queue-req: 0, 
[2025-11-11 04:28:31] Decode batch, #running-req: 2, #token: 7251, token usage: 0.04, cuda graph: True, gen throughput (token/s): 86.10, #queue-req: 0, 
[2025-11-11 04:28:32] Decode batch, #running-req: 2, #token: 7331, token usage: 0.04, cuda graph: True, gen throughput (token/s): 86.05, #queue-req: 0, 
[2025-11-11 04:28:33] Decode batch, #running-req: 2, #token: 7411, token usage: 0.04, cuda graph: True, gen throughput (token/s): 86.00, #queue-req: 0, 
[2025-11-11 04:28:34] Decode batch, #running-req: 2, #token: 7491, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.96, #queue-req: 0, 
[2025-11-11 04:28:35] Decode batch, #running-req: 2, #token: 7571, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.90, #queue-req: 0, 
[2025-11-11 04:28:35] Decode batch, #running-req: 2, #token: 7651, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.85, #queue-req: 0, 
[2025-11-11 04:28:36] Decode batch, #running-req: 2, #token: 7731, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.79, #queue-req: 0, 
[2025-11-11 04:28:37] Decode batch, #running-req: 2, #token: 7811, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.75, #queue-req: 0, 
[2025-11-11 04:28:38] Decode batch, #running-req: 2, #token: 7891, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.69, #queue-req: 0, 
[2025-11-11 04:28:39] Decode batch, #running-req: 2, #token: 7971, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.64, #queue-req: 0, 
[2025-11-11 04:28:40] Decode batch, #running-req: 2, #token: 8051, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.59, #queue-req: 0, 
[2025-11-11 04:28:41] Decode batch, #running-req: 2, #token: 8131, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.53, #queue-req: 0, 
[2025-11-11 04:28:42] Decode batch, #running-req: 2, #token: 8211, token usage: 0.04, cuda graph: True, gen throughput (token/s): 85.48, #queue-req: 0, 
[2025-11-11 04:28:43] Decode batch, #running-req: 2, #token: 8291, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.42, #queue-req: 0, 
[2025-11-11 04:28:44] Decode batch, #running-req: 2, #token: 8371, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.37, #queue-req: 0, 
[2025-11-11 04:28:45] Decode batch, #running-req: 2, #token: 8451, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.32, #queue-req: 0, 
[2025-11-11 04:28:46] Decode batch, #running-req: 2, #token: 8531, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.27, #queue-req: 0, 
[2025-11-11 04:28:47] Decode batch, #running-req: 2, #token: 8611, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.22, #queue-req: 0, 
[2025-11-11 04:28:48] Decode batch, #running-req: 2, #token: 8691, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.16, #queue-req: 0, 
[2025-11-11 04:28:49] Decode batch, #running-req: 2, #token: 8771, token usage: 0.05, cuda graph: True, gen throughput (token/s): 85.10, #queue-req: 0, 
[2025-11-11 04:28:50] Decode batch, #running-req: 1, #token: 216, token usage: 0.00, cuda graph: True, gen throughput (token/s): 61.35, #queue-req: 0, 
DEBUG self.max_total_num_tokens=183133 free=75062 release=0 evictable=107855 protected=0 avail=75062 tree_total=107855
[2025-11-11 04:28:50] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/projects/dharel/nadavt/repos/sglang/python/sglang/srt/managers/scheduler.py", line 2709, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/home/projects/dharel/nadavt/repos/sglang/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/projects/dharel/nadavt/repos/sglang/python/sglang/srt/managers/scheduler.py", line 1009, in event_loop_overlap
    self.self_check_during_idle()
  File "/home/projects/dharel/nadavt/repos/sglang/python/sglang/srt/managers/scheduler_runtime_checker_mixin.py", line 278, in self_check_during_idle
    self.check_memory()
  File "/home/projects/dharel/nadavt/repos/sglang/python/sglang/srt/managers/scheduler_runtime_checker_mixin.py", line 192, in check_memory
    raise ValueError(msg)
ValueError: token_to_kv_pool_allocator memory leak detected! self.max_total_num_tokens=183133, available_size=75062, evictable_size=107855, protected_size=0, diff=216, tolerance=32



====== Offline Throughput Benchmark Result =======
Backend:                                 engine    
Successful requests:                     200       
Benchmark duration (s):                  49.02     
Total input tokens:                      62509     
Total generated tokens:                  41627     
Last generation throughput (tok/s):      61.35     
Request throughput (req/s):              4.08      
Input token throughput (tok/s):          1275.23   
Output token throughput (tok/s):         849.22    
Total token throughput (tok/s):          2124.46   
==================================================
[nadavt@lgn09 sglang]$ 
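The ValueError above encodes the pool's bookkeeping invariant: every KV-cache token slot should be free (available), cached but reclaimable (evictable), or pinned by a running request (protected). A minimal sketch of that check, using the field names from the log (the actual code in scheduler_runtime_checker_mixin.py may differ):

```python
def check_memory(max_total_num_tokens: int,
                 available_size: int,
                 evictable_size: int,
                 protected_size: int,
                 tolerance: int = 32) -> None:
    # Any token slot not accounted for as available/evictable/protected
    # has leaked out of the allocator's bookkeeping.
    accounted = available_size + evictable_size + protected_size
    diff = max_total_num_tokens - accounted
    if abs(diff) > tolerance:
        raise ValueError(
            f"token_to_kv_pool_allocator memory leak detected! "
            f"{max_total_num_tokens=}, {available_size=}, {evictable_size=}, "
            f"{protected_size=}, {diff=}, {tolerance=}"
        )

# The failing run above: 183133 - (75062 + 107855 + 0) = 216 > 32, so it raises.
```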

Checklist

keyboardAnt and others added the following commits, starting November 10, 2025 18:09
… improve page management. Introduce debug mode for better memory tracking and enhance page allocation methods in PagedTokenToKVPoolAllocator and SWATokenToKVPoolAllocator. Update ChunkCache and RadixCache to streamline key handling and memory freeing processes.
Introduce the RadixKey class to manage token IDs and provide enhanced iteration and indexing capabilities. This change aims to streamline key handling within the radix_cache, ensuring compatibility with various input types while maintaining performance. Additionally, a helper function, get_child_key, is added for simplified access to child keys.
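A minimal sketch of the idea, assuming token IDs are plain lists of ints (the real RadixKey and get_child_key carry more logic, e.g. page-size-aware child keys):

```python
from typing import Iterator, List

class RadixKey:
    """Illustrative sketch: wrap a token-ID sequence so the radix cache can
    iterate, slice, and index keys uniformly regardless of input container."""

    def __init__(self, token_ids: List[int]):
        self.token_ids = list(token_ids)

    def __len__(self) -> int:
        return len(self.token_ids)

    def __iter__(self) -> Iterator[int]:
        return iter(self.token_ids)

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            return RadixKey(self.token_ids[idx])
        return self.token_ids[idx]

def get_child_key(key: RadixKey, page_size: int = 1) -> tuple:
    # Helper from the commit message: the key's first page selects the child edge.
    return tuple(key.token_ids[:page_size])
```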
…and related classes to simplify memory allocation logic. This change enhances code clarity and maintains functionality across various components.
…ionality

Introduce 'enable_metrics', 'eviction_policy', and 'is_eagle' parameters to the constructors of RadixCache and SWARadixCache. This update aims to improve cache management and provide additional configuration options for users, enhancing the overall flexibility of the memory cache system.
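A sketch of that constructor surface (defaults here are assumptions; 'lru' matches radix_eviction_policy='lru' in the server_args dump above):

```python
class RadixCache:
    """Illustrative only: shows the new constructor options named in this commit."""

    def __init__(
        self,
        enable_metrics: bool = False,   # emit cache hit/miss metrics
        eviction_policy: str = "lru",   # mirrors radix_eviction_policy='lru'
        is_eagle: bool = False,         # EAGLE speculative-decoding mode
    ):
        self.enable_metrics = enable_metrics
        self.eviction_policy = eviction_policy
        self.is_eagle = is_eagle
```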
Update the memory cache allocators to replace 'free_page_ids' with 'free_pages' for consistency across the codebase. This change enhances the clarity of the API and ensures that page management is handled uniformly. Additionally, introduce a new test suite for the 'free_pages' method to validate its functionality and integration within the allocator system.
…ages

This update adds a back-compatibility alias 'free_page_ids' to the BaseTokenToKVPoolAllocator class, allowing older call sites to function without modification. Additionally, the ChunkCache and common memory management functions are updated to utilize this new alias, improving consistency in page freeing methods across the codebase.
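A hedged sketch of such an alias (real signatures and class body differ):

```python
class BaseTokenToKVPoolAllocator:
    """Sketch of the back-compat shim described above, not the real class."""

    def free_pages(self, page_ids):
        raise NotImplementedError  # subclasses return pages to the free pool

    def free_page_ids(self, page_ids):
        # Back-compat alias: older call sites keep working unchanged.
        return self.free_pages(page_ids)
```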
…classes to streamline memory allocation logic. This change enhances code clarity and maintains functionality across various components.
…locators

This update modifies the page ID calculations in various classes, including PagedTokenToKVPoolAllocator, RadixCache, and SWARadixCache, to ensure correct indexing by adding 1 to the division result. Additionally, it introduces idle checks in the SchedulerRuntimeCheckerMixin to skip checks when a batch is in-flight, improving performance during active processing.
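One plausible reading of the indexing fix, sketched with 1-indexed page IDs (an assumption; the real calculation may differ):

```python
def last_page_id(seq_len: int, page_size: int) -> int:
    # With 1-indexed page IDs, the page holding token index seq_len - 1
    # is floor((seq_len - 1) / page_size) + 1.
    return (seq_len - 1) // page_size + 1

assert last_page_id(1, 4) == 1   # first token lives on page 1
assert last_page_id(4, 4) == 1   # a full page is still page 1
assert last_page_id(5, 4) == 2   # one token past the boundary starts page 2
```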
…ibute shadowing

This update modifies the calls to the free_pages method within the BaseTokenToKVPoolAllocator class and its subclasses to prevent instance attribute shadowing. The changes ensure that the method is called correctly, maintaining functionality while improving code clarity.
…ning cache memory checks and adding detailed debug diagnostics. The update improves accuracy in identifying memory leaks by accounting for protected sizes and introducing tolerance for transient conditions, while also providing additional logging for better troubleshooting.
…requests by adding a new parameter. This update improves flexibility in handling unfinished requests while maintaining existing functionality.
… parameters

This update simplifies the `alloc_extend` and `alloc_decode` methods in the memory cache allocators by removing unnecessary CPU tensor parameters. The changes enhance code clarity and maintain functionality by ensuring that only device tensors are passed, streamlining the allocation process.
…ge_ids

This update modifies the memory management methods across various classes, including RadixCache, SWARadixCache, and common.py, to use the newly introduced free_page_ids method instead of free_pages. This change enhances code clarity and maintains functionality while ensuring consistency in page freeing operations throughout the codebase.
This update introduces deduplication of page IDs to prevent double-free attempts and adds checks to ensure only currently allocated pages are freed. These changes enhance the robustness of memory management within the allocator, maintaining functionality while improving error handling in debug mode.
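A minimal sketch of that dedup-and-ownership guard, assuming a Python set tracks allocated pages (illustrative; the real allocator works on tensors):

```python
def free_page_ids_safely(allocated: set, page_ids: list,
                         debug: bool = False) -> list:
    """Illustrative guard: dedup requested frees and skip pages we do not own."""
    unique_ids = set(page_ids)        # drop duplicates within a single call
    owned = unique_ids & allocated    # only free currently allocated pages
    if debug and owned != unique_ids:
        # In debug mode a double free (or freeing a foreign page) is a hard error.
        raise AssertionError(f"double free detected: {sorted(unique_ids - owned)}")
    allocated.difference_update(owned)
    return sorted(owned)              # pages actually returned to the pool
```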
…dixCache by adding detailed logging for memory management operations. This update introduces additional debug information for staged frees and decode-boundary slack estimates in SchedulerRuntimeCheckerMixin, as well as validation flags and reporting in RadixCache, improving observability and troubleshooting for memory-related issues.