
Commit 0b28475

Merge pull request #258 from dheerajoruganty/main
ARM CPU benchmarking support
2 parents d1939d8 + 6cb39d9 commit 0b28475

File tree: 7 files changed (+334 -5 lines)


README.md

Lines changed: 4 additions & 5 deletions
```diff
@@ -88,6 +88,10 @@ Llama3 is now available on SageMaker (read [blog post](https://aws.amazon.com/bl
 
 ## New in this release
 
+## 2.0.24
+
+1. ARM benchmarking support (AWS Graviton 3 Chips)
+
 ## 2.0.23
 
 1. Bug fixes for Amazon SageMaker BYOE.
@@ -97,11 +101,6 @@ Llama3 is now available on SageMaker (read [blog post](https://aws.amazon.com/bl
 1. Benchmarks for the [Amazon Nova](https://docs.aws.amazon.com/nova/latest/userguide/what-is-nova.html) family of models.
 1. Benchmarks for multi-modal models: LLama3.2-11B, Claude 3 Sonnet and Claude 3.5 Sonnet using the [ScienceQA](https://huggingface.co/datasets/derek-thomas/ScienceQA) dataset.
 
-## 2.0.21
-
-1. Dynamically get EC2 pricing from Boto3 API.
-1. Update pricing information and model id for Amazon Bedrock models.
-
 
 [Release history](./release_history.md)
```

docs/benchmarking_on_ec2.md

Lines changed: 97 additions & 0 deletions
```diff
@@ -19,6 +19,7 @@ The steps for benchmarking on different types of EC2 instances (GPU/CPU/Neuron)
 - [Benchmarking on an instance type with AWS Chips and the Triton inference server](#benchmarking-on-an-instance-type-with-aws-chips-and-the-triton-inference-server)
 - [Benchmarking on an CPU instance type with AMD processors](#benchmarking-on-an-cpu-instance-type-with-amd-processors)
 - [Benchmarking on an CPU instance type with Intel processors](#benchmarking-on-an-cpu-instance-type-with-intel-processors)
+- [Benchmarking on an CPU instance type with ARM processors (Graviton 3)](#benchmarking-on-an-cpu-instance-type-with-arm-processors)
 
 - [Benchmarking the Triton inference server](#benchmarking-the-triton-inference-server)
 - [Benchmarking models on Ollama](#benchmarking-models-on-ollama)
```
The following new section is appended at the end of the file:

## Benchmarking on an CPU instance type with ARM processors

**_As of 12/24/2024, this has been tested on `c8g.24xlarge` with `llama 3 8b Instruct` on Ubuntu Server 24.04 LTS (HVM), SSD Volume Type_**

1. Connect to your instance using any of the options in EC2 (SSH/EC2 Instance Connect) and run the following commands in the EC2 terminal. These commands install `Docker` and `Miniconda` on the instance; `Miniconda` is then used to create a new `conda` environment for `FMBench`. See instructions for downloading Anaconda [here](https://www.anaconda.com/download).

    ```{.bash}
    sudo apt-get update -y
    sudo apt-get install -y docker.io git
    sudo systemctl start docker
    sudo systemctl enable docker

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh -O /home/ubuntu/Miniconda3-latest-Linux-aarch64.sh
    bash /home/ubuntu/Miniconda3-latest-Linux-aarch64.sh -b -p /home/ubuntu/miniconda3
    rm /home/ubuntu/Miniconda3-latest-Linux-aarch64.sh

    # Initialize conda for bash shell
    /home/ubuntu/miniconda3/bin/conda init
    ```

1. Set up the `fmbench_python311` conda environment.

    ```{.bash}
    # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel
    conda create --name fmbench_python311 -y python=3.11 ipykernel

    # Activate the newly created conda environment
    source activate fmbench_python311

    # Upgrade pip and install the fmbench package
    pip install -U fmbench
    ```

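    Optionally, confirm that the package landed in the new environment before moving on (a quick sanity check, not part of the original steps):

    ```{.bash}
    # Show the installed fmbench package in the active conda environment
    pip show fmbench
    ```
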
1. Build the `vllm` container for serving the model.

    1. 👉 The `vllm` container we are building locally is going to be referenced in the `FMBench` config file.

    1. The container being built is for ARM CPUs only.

    ```{.bash}
    # Clone the vLLM project repository from GitHub
    git clone https://github.com/vllm-project/vllm.git

    # Change the directory to the cloned vLLM project
    cd vllm

    # Build a Docker image using the provided Dockerfile for CPU, with a shared memory size of 12GB
    sudo docker build -f Dockerfile.arm -t vllm-cpu-env --shm-size=12g .
    ```

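    To confirm the ARM image was built and tagged as expected (an optional check; the `vllm-cpu-env` tag must match the `image_uri` referenced later in the `FMBench` config file):

    ```{.bash}
    # List the locally built vLLM image (sudo is needed until the user is added to the docker group in a later step)
    sudo docker images vllm-cpu-env
    ```
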
1. Create the local directory structure needed for `FMBench` and copy all publicly available dependencies from the AWS S3 bucket for `FMBench`. This is done by running the `copy_s3_content.sh` script available as part of the `FMBench` repo. **Replace `/tmp` in the command below with a different path if you want to store the config files and the `FMBench` generated data in a different directory**.

    ```{.bash}
    # Replace "/tmp" with "/path/to/your/custom/tmp" if you want to use a custom tmp directory
    TMP_DIR="/tmp"
    curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- "$TMP_DIR"
    ```

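    If you want to verify the copy before proceeding (optional sketch), the Graviton config file used in the run command further below should now be present locally:

    ```{.bash}
    # The c8g.24xlarge config used later in this walkthrough should show up here
    ls $TMP_DIR/fmbench-read/configs/llama3/8b/ | grep c8g
    ```
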
1. To download the model files from HuggingFace, create a `hf_token.txt` file in the `/tmp/fmbench-read/scripts/` directory containing the Hugging Face token you would like to use. In the command below replace `hf_yourtokenstring` with your Hugging Face token. **Replace `/tmp` in the command below if you are using `/path/to/your/custom/tmp` to store the config files and the `FMBench` generated data**.

    ```{.bash}
    echo hf_yourtokenstring > $TMP_DIR/fmbench-read/scripts/hf_token.txt
    ```

1. Before running `FMBench`, add the current user to the docker group. The following commands let you use Docker without `sudo` each time.

    ```{.bash}
    sudo usermod -a -G docker $USER
    newgrp docker
    ```

1. Install `docker-compose`.

    ```{.bash}
    DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
    mkdir -p $DOCKER_CONFIG/cli-plugins
    sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o $DOCKER_CONFIG/cli-plugins/docker-compose
    sudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
    docker compose version
    ```

1. Run `FMBench` with a packaged or a custom config file. **_This step will also deploy the model on the EC2 instance_**. The `--write-bucket` parameter value is just a placeholder and an actual S3 bucket is not required. You could set the `--tmp-dir` flag to an EFS path instead of `/tmp` if you are using a shared path for storing config files and reports.

    ```{.bash}
    fmbench --config-file $TMP_DIR/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-c8g-24xlarge.yml --local-mode yes --write-bucket placeholder --tmp-dir $TMP_DIR > fmbench.log 2>&1
    ```

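    Because this step also deploys the model, once the log shows the deployment has finished you can optionally check that the container built earlier is running (an illustrative check, not part of the original steps):

    ```{.bash}
    # List running containers started from the locally built vLLM image
    docker ps --filter ancestor=vllm-cpu-env
    ```
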
1. Open a new terminal and do a `tail` on `fmbench.log` to see a live log of the run.

    ```{.bash}
    tail -f fmbench.log
    ```

1. All metrics are stored in the `/tmp/fmbench-write` directory created automatically by the `fmbench` package. Once the run completes, all files are copied locally into a `results-*` folder as usual.

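    When the run finishes, you could list the generated report folder (optional sketch; run it from the same directory where `fmbench` was started, and note that the exact folder name depends on the run):

    ```{.bash}
    # The benchmarking report and metrics land in a results-* folder
    ls -d results-*
    ```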

docs/manifest.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -67,6 +67,7 @@ Here is a listing of the various configuration files available out-of-the-box wi
 **└── llama3/8b**
 [├── llama3/8b/config-bedrock.yml](configs/llama3/8b/config-bedrock.yml)
 [├── llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml](configs/llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml)
+[├── llama3/8b/config-ec2-llama3-8b-c8g-24xlarge.yml](configs/llama3/8b/config-ec2-llama3-8b-c8g-24xlarge.yml)
 [├── llama3/8b/config-ec2-llama3-8b-g6e-2xlarge.yml](configs/llama3/8b/config-ec2-llama3-8b-g6e-2xlarge.yml)
 [├── llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml](configs/llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml)
 [├── llama3/8b/config-ec2-llama3-8b-m5-16xlarge.yml](configs/llama3/8b/config-ec2-llama3-8b-m5-16xlarge.yml)
```

manifest.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -116,6 +116,7 @@ configs/llama3/70b/config-llama3-70b-instruct-g5-p4d.yml
 configs/llama3/70b/config-llama3-70b-instruct-p4d.yml
 configs/llama3/8b/config-bedrock.yml
 configs/llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml
+configs/llama3/8b/config-ec2-llama3-8b-c8g-24xlarge.yml
 configs/llama3/8b/config-ec2-llama3-8b-g6e-2xlarge.yml
 configs/llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml
 configs/llama3/8b/config-ec2-llama3-8b-m5-16xlarge.yml
```

release_history.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -1,3 +1,7 @@
+## 2.0.21
+1. Dynamically get EC2 pricing from Boto3 API.
+1. Update pricing information and model id for Amazon Bedrock models.
+
 ## 2.0.20
 1. Add `hf_tokenizer_model_id` parameter to automatically download tokenizers from Hugging Face.
```

src/fmbench/configs/llama3/8b/config-ec2-llama3-8b-c8g-24xlarge.yml

Lines changed: 226 additions & 0 deletions
New file:

```{.yaml}
# config file for a rest endpoint supported on fmbench -
# this file uses a llama-3-8b-chat-hf deployed on ec2
general:
  name: "llama3-8b-c8g.24xl-ec2"
  model_name: "llama3-8b-instruct"

# AWS and SageMaker settings
aws:
  # AWS region, this parameter is templatized, no need to change
  region: {region}
  # SageMaker execution role used to run FMBench, this parameter is templatized, no need to change
  sagemaker_execution_role: {role_arn}
  # S3 bucket to which metrics, plots and reports would be written to
  bucket: {write_bucket} ## add the name of your desired bucket

# directory paths in the write bucket, no need to change these
dir_paths:
  data_prefix: data
  prompts_prefix: prompts
  all_prompts_file: all_prompts.csv
  metrics_dir: metrics
  models_dir: models
  metadata_dir: metadata

# S3 information for reading datasets, scripts and tokenizer
s3_read_data:
  # read bucket name, templatized, if left unchanged will default to sagemaker-fmbench-read-region-account_id
  read_bucket: {read_bucket}
  scripts_prefix: scripts ## add your own scripts in case you are using anything that is not on jumpstart

  # S3 prefix in the read bucket where deployment and inference scripts should be placed
  scripts_prefix: scripts

  # deployment and inference script files to be downloaded are placed in this list
  # only needed if you are creating a new deployment script or inference script
  # your HuggingFace token does need to be in this list and should be called "hf_token.txt"
  script_files:
    - hf_token.txt

  # configuration files (like this one) are placed in this prefix
  configs_prefix: configs

  # list of configuration files to download, for now only pricing.yml needs to be downloaded
  config_files:
    - pricing.yml

  # S3 prefix for the dataset files
  source_data_prefix: source_data
  # list of dataset files, the list below is from the LongBench dataset https://huggingface.co/datasets/THUDM/LongBench
  source_data_files:
    - 2wikimqa_e.jsonl
    - 2wikimqa.jsonl
    - hotpotqa_e.jsonl
    - hotpotqa.jsonl
    - narrativeqa.jsonl
    - triviaqa_e.jsonl
    - triviaqa.jsonl

  # S3 prefix for the tokenizer to be used with the models
  # NOTE 1: the same tokenizer is used with all the models being tested through a config file
  # NOTE 2: place your model specific tokenizers in a prefix named as <model_name>_tokenizer
  #         so the mistral tokenizer goes in mistral_tokenizer, Llama2 tokenizer goes in llama2_tokenizer
  tokenizer_prefix: llama3_tokenizer

  # S3 prefix for prompt templates
  prompt_template_dir: prompt_template

  # prompt template to use, NOTE: same prompt template gets used for all models being tested through a config file
  # the FMBench repo already contains a bunch of prompt templates so review those first before creating a new one
  prompt_template_file: prompt_template_llama3.txt

# steps to run, usually all of these would be
# set to yes so nothing needs to change here
# you could, however, bypass some steps for example
# set the 2_deploy_model.ipynb to no if you are re-running
# the same config file and the model is already deployed
run_steps:
  0_setup.ipynb: yes
  1_generate_data.ipynb: yes
  2_deploy_model.ipynb: yes
  3_run_inference.ipynb: yes
  4_model_metric_analysis.ipynb: yes
  5_cleanup.ipynb: yes

datasets:
  # dataset related configuration
  prompt_template_keys:
    - input
    - context

  # if your dataset has multiple languages and it has a language
  # field then you could filter it for a language. Similarly,
  # you can filter your dataset to only keep prompts between
  # a certain token length limit (the token length is determined
  # using the tokenizer you provide in the tokenizer_prefix prefix in the
  # read S3 bucket). Each of the array entries below create a payload file
  # containing prompts matching the language and token length criteria.
  filters:
    - language: en
      min_length_in_tokens: 1
      max_length_in_tokens: 500
      payload_file: payload_en_1-500.jsonl
    - language: en
      min_length_in_tokens: 500
      max_length_in_tokens: 1000
      payload_file: payload_en_500-1000.jsonl
    - language: en
      min_length_in_tokens: 1000
      max_length_in_tokens: 2000
      payload_file: payload_en_1000-2000.jsonl
    - language: en
      min_length_in_tokens: 2000
      max_length_in_tokens: 3000
      payload_file: payload_en_2000-3000.jsonl
    - language: en
      min_length_in_tokens: 3000
      max_length_in_tokens: 3840
      payload_file: payload_en_3000-3840.jsonl

# While the tests would run on all the datasets
# configured in the experiment entries below but
# the price:performance analysis is only done for 1
# dataset which is listed below as the dataset_of_interest
metrics:
  dataset_of_interest: en_3000-3840

# all pricing information is in the pricing.yml file
# this file is provided in the repo. You can add entries
# to this file for new instance types and new Bedrock models
pricing: pricing.yml

# inference parameters, these are added to the payload
# for each inference request. The list here is not static
# any parameter supported by the inference container can be
# added to the list. Put the sagemaker parameters in the sagemaker
# section, bedrock parameters in the bedrock section (not shown here).
# Use the section name (sagemaker in this example) in the inference_spec.parameter_set
# section under experiments.
inference_parameters:
  ec2_vllm:
    model: meta-llama/Meta-Llama-3-8B-Instruct
    temperature: 0.1
    top_p: 0.92
    top_k: 120
    max_tokens: 100

# Configuration for experiments to be run. The experiments section is an array
# so more than one experiments can be added, these could belong to the same model
# but different instance types, or different models, or even different hosting
# options.
experiments:
  - name: "llama3-8b-instruct"
    # AWS region, this parameter is templatized, no need to change
    region: {region}
    # model_id is interpreted in conjunction with the deployment_script, so if you
    # use a JumpStart model id then set the deployment_script to jumpstart.py.
    # if deploying directly from HuggingFace this would be a HuggingFace model id
    # see the DJL serving deployment script in the code repo for reference.
    # from huggingface to grab
    model_id: meta-llama/Meta-Llama-3-8B-Instruct # model id, version and image uri not needed for byo endpoint
    model_version:
    model_name: "llama3-8b-instruct"
    # this can be changed to the IP address of your specific EC2 instance where the model is hosted
    ep_name: 'http://localhost:8000/v1/completions'
    instance_type: "c8g.24xlarge"
    image_uri: vllm-cpu-env
    deploy: yes # setting to yes to run deployment script for ec2
    instance_count:
    deployment_script: ec2_deploy.py
    # FMBench comes packaged with multiple inference scripts, such as scripts for SageMaker
    # and Bedrock. You can also add your own. This is an example for a rest DJL predictor
    # for a llama3-8b-instruct deployed on ec2
    inference_script: ec2_predictor.py
    # This section defines the settings for Amazon EC2 instances
    ec2:
      # Privileged Mode makes the docker container run with root.
      # This basically means that if you are root in a container you have the privileges of root on the host system
      # Only need this if you need to set VLLM_CPU_OMP_THREADS_BIND env variable.
      privileged_mode: yes
      # The following line specifies the runtime and GPU settings for the instance
      # '--runtime=nvidia' tells the container runtime to use the NVIDIA runtime
      # '--gpus all' makes all GPUs available to the container
      # '--shm-size 12g' sets the size of the shared memory to 12 gigabytes
      gpu_or_neuron_setting:
      # This setting specifies the timeout (in seconds) for loading the model. In this case, the timeout is set to 2400 seconds, which is 40 minutes.
      # If the model takes longer than 40 minutes to load, the process will time out and fail.
      model_loading_timeout: 2400
    inference_spec:
      # this should match one of the sections in the inference_parameters section above
      parameter_set: ec2_vllm
      # if not set assume djl
      container_type: vllm
    # modify the serving properties to match your model and requirements
    serving.properties:
    # runs are done for each combination of payload file and concurrency level
    payload_files:
      - payload_en_1-500.jsonl
      - payload_en_500-1000.jsonl
      - payload_en_1000-2000.jsonl
      - payload_en_2000-3000.jsonl
      - payload_en_3000-3840.jsonl
    # concurrency level refers to number of requests sent in parallel to an endpoint
    # the next set of requests is sent once responses for all concurrent requests have
    # been received.
    concurrency_levels:
      - 1
      # - 2
      # - 4
    # Environment variables to be passed to the container
    # this is not a fixed list, you can add more parameters as applicable.
    env:
      MODEL_LOADING_TIMEOUT: 2400
      # This instance is equipped with 96 CPUs, and we are allocating 93 of them to run this container.
      # For additional details, refer to the following URL:
      # https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html#related-runtime-environment-variables
      VLLM_CPU_OMP_THREADS_BIND: 0-92

report:
  latency_budget: 35
  cost_per_10k_txn_budget: 60
  error_rate_budget: 0
  per_inference_request_file: per_inference_request_results.csv
  all_metrics_file: all_metrics.csv
  txn_count_for_showing_cost: 10000
  v_shift_w_single_instance: 0.025
  v_shift_w_gt_one_instance: 0.025
```

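Two optional sketches that may help when adapting this config; they are illustrations, not part of the committed file. First, the `ec2_vllm` parameters above are added to each inference request sent to `ep_name`; assuming the locally built vLLM container exposes the OpenAI-compatible completions API at that address, a roughly equivalent hand-rolled request would look like this (FMBench additionally wraps prompts with the configured prompt template, which this sketch omits; the prompt text is made up):

```{.bash}
# Send one test completion request to the endpoint configured as ep_name,
# using the same sampling parameters as the ec2_vllm section above.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "prompt": "What is AWS Graviton?",
        "temperature": 0.1,
        "top_p": 0.92,
        "top_k": 120,
        "max_tokens": 100
      }'
```

Second, `VLLM_CPU_OMP_THREADS_BIND: 0-92` binds vLLM worker threads to 93 of the instance's 96 vCPUs. If you move this config to a different ARM instance size, one way to derive an equivalent range (a sketch, assuming you want to leave the same three vCPUs free) is:

```{.bash}
# Mirror the 0-92 setting used on the 96-vCPU c8g.24xlarge:
# bind all vCPUs except the last three, leaving them for the OS and client.
NPROC=$(nproc)
echo "VLLM_CPU_OMP_THREADS_BIND=0-$((NPROC - 4))"
```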

src/fmbench/configs/pricing_fallback.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -74,6 +74,7 @@ pricing:
     g6e.16xlarge: 7.577
     g6e.24xlarge: 15.066
     g6e.48xlarge: 30.131
+    c8g.24xlarge: 3.828
 
   token_based:
     amazon.nova-micro-v1:0:
```
