# config file for a REST endpoint supported on FMBench
# this file uses Meta-Llama-3-8B-Instruct deployed on EC2
general:
  name: "llama3-8b-c8g.24xl-ec2"
  model_name: "llama3-8b-instruct"

# AWS and SageMaker settings
aws:
  # AWS region, this parameter is templatized, no need to change
  region: {region}
  # SageMaker execution role used to run FMBench, this parameter is templatized, no need to change
  sagemaker_execution_role: {role_arn}
  # S3 bucket to which metrics, plots and reports are written
  bucket: {write_bucket} ## add the name of your desired bucket

# directory paths in the write bucket, no need to change these
dir_paths:
  data_prefix: data
  prompts_prefix: prompts
  all_prompts_file: all_prompts.csv
  metrics_dir: metrics
  models_dir: models
  metadata_dir: metadata

# S3 information for reading datasets, scripts and tokenizer
s3_read_data:
  # read bucket name, templatized, if left unchanged will default to sagemaker-fmbench-read-region-account_id
  read_bucket: {read_bucket}
  # S3 prefix in the read bucket where deployment and inference scripts should be placed
  scripts_prefix: scripts ## add your own scripts in case you are using anything that is not on JumpStart

  # deployment and inference script files to be downloaded are placed in this list
  # only needed if you are creating a new deployment script or inference script
  # your HuggingFace token does need to be in this list and should be called "hf_token.txt"
  script_files:
    - hf_token.txt
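  # (illustrative) if you bring your own deployment or inference script, it would also be
  # listed here so it gets downloaded from the scripts prefix; the file name below is a
  # hypothetical example, not part of this config:
  # - my_custom_predictor.py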

  # configuration files (like this one) are placed in this prefix
  configs_prefix: configs

  # list of configuration files to download, for now only pricing.yml needs to be downloaded
  config_files:
    - pricing.yml

  # S3 prefix for the dataset files
  source_data_prefix: source_data
  # list of dataset files, the list below is from the LongBench dataset https://huggingface.co/datasets/THUDM/LongBench
  source_data_files:
    - 2wikimqa_e.jsonl
    - 2wikimqa.jsonl
    - hotpotqa_e.jsonl
    - hotpotqa.jsonl
    - narrativeqa.jsonl
    - triviaqa_e.jsonl
    - triviaqa.jsonl

  # S3 prefix for the tokenizer to be used with the models
  # NOTE 1: the same tokenizer is used with all the models being tested through a config file
  # NOTE 2: place your model specific tokenizers in a prefix named <model_name>_tokenizer,
  # so the Mistral tokenizer goes in mistral_tokenizer and the Llama2 tokenizer goes in llama2_tokenizer
  tokenizer_prefix: llama3_tokenizer
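  # (illustrative) following the NOTE 2 naming convention above, a Mistral run would instead use:
  # tokenizer_prefix: mistral_tokenizer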

  # S3 prefix for prompt templates
  prompt_template_dir: prompt_template

  # prompt template to use, NOTE: the same prompt template gets used for all models being tested through a config file
  # the FMBench repo already contains a bunch of prompt templates so review those first before creating a new one
  prompt_template_file: prompt_template_llama3.txt

# steps to run, usually all of these would be
# set to yes so nothing needs to change here.
# you could, however, bypass some steps; for example,
# set 2_deploy_model.ipynb to no if you are re-running
# the same config file and the model is already deployed
# (see the commented example at the end of this section)
run_steps:
  0_setup.ipynb: yes
  1_generate_data.ipynb: yes
  2_deploy_model.ipynb: yes
  3_run_inference.ipynb: yes
  4_model_metric_analysis.ipynb: yes
  5_cleanup.ipynb: yes
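  # (illustrative) to re-use an already deployed endpoint, flip just the deployment step:
  # 2_deploy_model.ipynb: no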

datasets:
  # dataset related configuration
  prompt_template_keys:
    - input
    - context

  # if your dataset has multiple languages and it has a language
  # field then you could filter it for a language. Similarly,
  # you can filter your dataset to only keep prompts within
  # a certain token length range (the token length is determined
  # using the tokenizer you provide in the tokenizer_prefix prefix in the
  # read S3 bucket). Each of the array entries below creates a payload file
  # containing prompts matching the language and token length criteria.
  filters:
    - language: en
      min_length_in_tokens: 1
      max_length_in_tokens: 500
      payload_file: payload_en_1-500.jsonl
    - language: en
      min_length_in_tokens: 500
      max_length_in_tokens: 1000
      payload_file: payload_en_500-1000.jsonl
    - language: en
      min_length_in_tokens: 1000
      max_length_in_tokens: 2000
      payload_file: payload_en_1000-2000.jsonl
    - language: en
      min_length_in_tokens: 2000
      max_length_in_tokens: 3000
      payload_file: payload_en_2000-3000.jsonl
    - language: en
      min_length_in_tokens: 3000
      max_length_in_tokens: 3840
      payload_file: payload_en_3000-3840.jsonl
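    # (illustrative) additional entries follow the same pattern; for example, a longer-context
    # bucket could be added as (not part of this config):
    # - language: en
    #   min_length_in_tokens: 3840
    #   max_length_in_tokens: 4096
    #   payload_file: payload_en_3840-4096.jsonl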

# While the tests run on all the datasets
# configured in the experiment entries below,
# the price:performance analysis is only done for the one
# dataset listed below as the dataset_of_interest
metrics:
  dataset_of_interest: en_3000-3840
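  # (note) the value above matches the language and token-range portion of a payload file name,
  # so en_3000-3840 corresponds to payload_en_3000-3840.jsonl configured in the filters section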

# all pricing information is in the pricing.yml file
# this file is provided in the repo. You can add entries
# to this file for new instance types and new Bedrock models
pricing: pricing.yml

# inference parameters, these are added to the payload
# for each inference request. The list here is not static;
# any parameter supported by the inference container can be
# added to the list. Put the SageMaker parameters in the sagemaker
# section and Bedrock parameters in the bedrock section (neither shown here).
# Use the section name (ec2_vllm in this example) in the inference_spec.parameter_set
# setting under experiments.
inference_parameters:
  ec2_vllm:
    model: meta-llama/Meta-Llama-3-8B-Instruct
    temperature: 0.1
    top_p: 0.92
    top_k: 120
    max_tokens: 100
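    # (illustrative) with the vLLM OpenAI-compatible completions endpoint configured as ep_name
    # below, these parameters end up in each request body, which would look roughly like
    # (the prompt value here is a placeholder, not part of this config):
    # {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "prompt": "<prompt text>",
    #  "temperature": 0.1, "top_p": 0.92, "top_k": 120, "max_tokens": 100}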

# Configuration for the experiments to be run. The experiments section is an array
# so more than one experiment can be added; these could belong to the same model
# but different instance types, or different models, or even different hosting
# options.
experiments:
  - name: "llama3-8b-instruct"
    # AWS region, this parameter is templatized, no need to change
    region: {region}
    # model_id is interpreted in conjunction with the deployment_script, so if you
    # use a JumpStart model id then set the deployment_script to jumpstart.py.
    # if deploying directly from HuggingFace (as done here) this would be the HuggingFace
    # model id to grab; see the DJL serving deployment script in the code repo for reference.
    model_id: meta-llama/Meta-Llama-3-8B-Instruct # model id, version and image uri not needed for a BYO endpoint
    model_version:
    model_name: "llama3-8b-instruct"
    # this can be changed to the IP address of your specific EC2 instance where the model is hosted
    ep_name: 'http://localhost:8000/v1/completions'
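    # (illustrative) assuming the vLLM OpenAI-compatible server is already up, the endpoint can be
    # sanity checked from the instance with a request such as:
    # curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
    #   -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'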
    instance_type: "c8g.24xlarge"
    image_uri: vllm-cpu-env
    deploy: yes # set to yes to run the deployment script for EC2
    instance_count:
    deployment_script: ec2_deploy.py
    # FMBench comes packaged with multiple inference scripts, such as scripts for SageMaker
    # and Bedrock. You can also add your own. This is an example of a REST predictor for
    # llama3-8b-instruct deployed on EC2
    inference_script: ec2_predictor.py
    # This section defines the settings for Amazon EC2 instances
    ec2:
      # Privileged mode makes the Docker container run as root.
      # This basically means that if you are root in the container you have the privileges of root on the host system.
      # It is only needed here to set the VLLM_CPU_OMP_THREADS_BIND env variable.
      privileged_mode: yes
      # This setting holds the runtime and GPU flags for the container, for example:
      # '--runtime=nvidia' tells the container runtime to use the NVIDIA runtime,
      # '--gpus all' makes all GPUs available to the container, and
      # '--shm-size 12g' sets the size of the shared memory to 12 gigabytes.
      # It is left empty here because this config targets a CPU-only (Graviton) instance.
      gpu_or_neuron_setting:
      # This setting specifies the timeout (in seconds) for loading the model; here it is set to 2400 seconds (40 minutes).
      # If the model takes longer than 40 minutes to load, the process will time out and fail.
      model_loading_timeout: 2400
    inference_spec:
      # this should match one of the sections in the inference_parameters section above
      parameter_set: ec2_vllm
      # if not set, djl is assumed
      container_type: vllm
    # modify the serving properties to match your model and requirements
    serving.properties:
    # runs are done for each combination of payload file and concurrency level
    payload_files:
      - payload_en_1-500.jsonl
      - payload_en_500-1000.jsonl
      - payload_en_1000-2000.jsonl
      - payload_en_2000-3000.jsonl
      - payload_en_3000-3840.jsonl
    # concurrency level refers to the number of requests sent in parallel to an endpoint;
    # the next set of requests is sent once responses for all concurrent requests have
    # been received.
    concurrency_levels:
      - 1
      # - 2
      # - 4
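    # (illustrative) at a concurrency level of 4, FMBench would send 4 requests at once and wait
    # for all 4 responses before sending the next batch of 4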
    # Environment variables to be passed to the container;
    # this is not a fixed list, you can add more parameters as applicable.
    env:
      MODEL_LOADING_TIMEOUT: 2400
      # This instance is equipped with 96 vCPUs, and we are allocating 93 of them (cores 0-92) to run this container.
      # For additional details, refer to the following URL:
      # https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html#related-runtime-environment-variables
      VLLM_CPU_OMP_THREADS_BIND: 0-92
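      # (illustrative) on a smaller instance this binding would shrink accordingly, e.g. a 32-vCPU
      # instance might use VLLM_CPU_OMP_THREADS_BIND: 0-29 to leave a couple of cores free for the
      # serving frontend; see the vLLM CPU installation docs linked above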

report:
  latency_budget: 35
  cost_per_10k_txn_budget: 60
  error_rate_budget: 0
  per_inference_request_file: per_inference_request_results.csv
  all_metrics_file: all_metrics.csv
  txn_count_for_showing_cost: 10000
  v_shift_w_single_instance: 0.025
  v_shift_w_gt_one_instance: 0.025