The lemonade CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.
Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a Tool, and a single call to lemonade can invoke any number of Tools. Each Tool will perform its functionality, then pass its state to the next Tool in the command.
You can read each command out loud to understand what it is doing. For example, a command like this:
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 llm-prompt -p "Hello, my thoughts are"
Can be read like this:
Run lemonade on the input (-i) checkpoint amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid (which is meta-llama/Llama-3.2-1B-Instruct optimized for OGA and hybrid). First, load it in the OnnxRuntime GenAI framework (oga-load), onto hybrid NPU/GPU acceleration (--device hybrid) in the int4 data type (--dtype int4). Then, pass the OGA model to the prompting tool (llm-prompt) with the prompt (-p) "Hello, my thoughts are" and print the response.
The lemonade -h command will show you which options and Tools are available, and lemonade TOOL -h will tell you more about that specific Tool.
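The Tool-chaining syntax described above can be sketched in Python as a small helper that assembles a lemonade command from an ordered list of Tools. This is only an illustration of how the pieces line up: `build_command` is a hypothetical helper name and not part of Lemonade, and the Tool names and flags are copied from the example command above.

```python
import shlex


def build_command(checkpoint, tools, global_flags=()):
    """Assemble a lemonade argv list (hypothetical helper, not a Lemonade API).

    `tools` is an ordered list of (tool_name, tool_args) pairs; each Tool
    hands its state to the next one, so the order matters.
    """
    argv = ["lemonade", *global_flags, "-i", checkpoint]
    for name, args in tools:
        argv.append(name)
        argv.extend(args)
    return argv


cmd = build_command(
    "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
    [
        # First Tool: load the checkpoint with OGA on the hybrid device.
        ("oga-load", ["--device", "hybrid", "--dtype", "int4"]),
        # Second Tool: prompt the loaded model.
        ("llm-prompt", ["-p", "Hello, my thoughts are"]),
    ],
)

# Print a shell-quoted form of the command for copy and paste.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where lemonade is installed.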
To prompt your LLM, try one of the following:
OGA Hybrid:
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 llm-prompt -p "Hello, my thoughts are" -t
Hugging Face:
lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are" -t
The LLM will run with your provided prompt, and its response will be printed to the screen. You can replace "Hello, my thoughts are" with any prompt you like.
You can also replace facebook/opt-125m with any Hugging Face checkpoint you like, including LLaMA, Phi, Qwen, Mamba, etc.
You can also set the --device argument in oga-load and huggingface-load to load your LLM on a different device.
The -t (or --template) flag instructs Lemonade to insert the prompt string into the model's chat template.
This typically results in the model returning a higher quality response.
Run lemonade huggingface-load -h and lemonade llm-prompt -h to learn more about these tools.
To measure the accuracy of an LLM using MMLU (Measuring Massive Multitask Language Understanding), try the following:
OGA Hybrid:
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 accuracy-mmlu --tests management
Hugging Face:
lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management
This command will run just the management test from MMLU on your LLM and save the score to the Lemonade cache at ~/.cache/lemonade. You can also run other subject tests by replacing management with the new test subject name. For the full list of supported subjects, see the MMLU Accuracy Read Me.
You can run the full suite of MMLU subjects by omitting the --tests argument. You can learn more about this with lemonade accuracy-mmlu -h.
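To run several MMLU subjects one after another, the commands can be generated in a loop. The following is a minimal sketch, not a Lemonade feature: `mmlu_commands` is a hypothetical helper, and `marketing` is used here on the assumption that it appears in the supported subject list (check the MMLU Accuracy Read Me before relying on it).

```python
import shlex


def mmlu_commands(checkpoint, load_tool, subjects, load_args=()):
    """Build one lemonade accuracy-mmlu command per subject (hypothetical helper)."""
    return [
        ["lemonade", "-i", checkpoint, load_tool, *load_args,
         "accuracy-mmlu", "--tests", subject]
        for subject in subjects
    ]


# "management" comes from the example above; "marketing" is an assumed subject name.
commands = mmlu_commands(
    "facebook/opt-125m",
    "huggingface-load",
    ["management", "marketing"],
)

for argv in commands:
    # Print each command; pass argv to subprocess.run to execute it instead.
    print(shlex.join(argv))
```

Generating one command per subject mirrors the instruction above to replace management with another subject name, and avoids assuming anything about how --tests handles multiple values.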
To measure the time-to-first-token and tokens/second of an LLM, try the following:
OGA Hybrid:
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 oga-bench
Hugging Face:
lemonade -i facebook/opt-125m huggingface-load huggingface-bench
This command will run a few warm-up iterations, then a few generation iterations where performance data is collected.
The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running lemonade oga-bench -h or lemonade huggingface-bench -h.
To set up your own fine-tuned model:
- Quantize the model using Quark. This step reduces the model size and improves inference efficiency.
- Export the quantized model using Lemonade. This prepares the model for deployment.
Once exported, you can run inference using OGA. Make sure the quantized model is available either locally or hosted on Hugging Face before running inference.
OGA Hybrid:
lemonade -i amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead oga-load --device hybrid --dtype int4
OGA NPU:
lemonade -i amd/Llama-3.2-1B-Instruct-awq-uint4-asym-g128-bf16-lmhead oga-load --device npu --dtype int4
Refer to the Finetuned Model Export Guide for detailed instructions on quantizing using Quark.
To see a report that contains all the benchmarking results and all the accuracy results, use the report tool with the --perf flag:
lemonade report --perf
The results can be filtered by model name, device type, and data type. See how by running lemonade report -h.
The peak memory used by the Lemonade execution sequence is captured in the build output. To capture more granular
memory usage information, use the --memory flag. For example:
OGA Hybrid:
lemonade --memory -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 oga-bench
Hugging Face:
lemonade --memory -i facebook/opt-125m huggingface-load huggingface-bench
This generates a PNG file that is stored in the current folder and the build folder. This file
contains a figure plotting the memory usage over the Lemonade tool sequence. Learn more by running lemonade -h.
To view system information and available devices, use the system-info tool:
lemonade system-info
By default, this shows essential information including OS version, processor, physical memory, and device details.
For detailed system information including BIOS version, CPU max clock, Windows power setting, and Python packages, use the --verbose flag:
lemonade system-info --verbose
For JSON output format, use the --format flag:
lemonade system-info --format json
Both default and verbose modes work with JSON format:
lemonade system-info --verbose --format json
The system information includes:
- Default: OS version, processor, physical memory, and device details
- Verbose: All default information plus BIOS version, CPU max clock, Windows power setting, and Python packages
- Devices: CPU details (name, cores, threads, architecture, clock speed), AMD integrated GPU, AMD discrete GPUs, and NPU information
Learn more by running lemonade system-info -h.
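The JSON output is convenient for scripts. Below is a minimal sketch of reading it from Python, assuming that lemonade is on the PATH and that the command prints only the JSON document to standard output (if extra log lines are printed, the parsing step would need adjusting). `get_system_info` and its `runner` parameter are hypothetical names introduced for this example; the key names inside the JSON are not assumed.

```python
import json
import subprocess


def get_system_info(verbose=False, runner=subprocess.run):
    """Run `lemonade system-info --format json` and parse the result.

    `runner` defaults to subprocess.run and exists only so the call can be
    swapped out (for example, in a test without lemonade installed).
    """
    cmd = ["lemonade", "system-info", "--format", "json"]
    if verbose:
        # Matches the documented form: lemonade system-info --verbose --format json
        cmd.insert(2, "--verbose")
    result = runner(cmd, capture_output=True, text=True, check=True)
    # Assumption: stdout holds exactly one JSON document.
    return json.loads(result.stdout)
```

Calling `get_system_info()` and iterating over the returned dictionary is a simple way to discover which fields are reported on a given machine.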
Lemonade's low-level API is useful for designing custom experiments, such as sweeping over specific checkpoints, devices, and/or tools.
Here's a quick example of how to prompt a Hugging Face LLM using the low-level API, which calls the load and prompt tools one by one:
import lemonade.tools.torch_llm as tl
import lemonade.tools.prompt as pt
from lemonade.state import State

# Create the state object that is passed from tool to tool
state = State(cache_dir="cache", build_name="test")
# Load the Hugging Face checkpoint
state = tl.HuggingfaceLoad().run(state, input="facebook/opt-125m")
# Prompt the loaded model and generate up to 15 new tokens
state = pt.Prompt().run(state, prompt="hi", max_new_tokens=15)
print("Response:", state.response)
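As an illustration of the sweeping use case mentioned above, the same calls can be wrapped in a function and applied to a list of checkpoints. This is a sketch under stated assumptions rather than a recommended pattern: `prompt_checkpoint` and `sweep` are hypothetical helper names, and deriving `build_name` from the checkpoint name is an assumption about what State accepts. The lemonade imports are placed inside the function so that the module can be loaded even where lemonade is not installed.

```python
def prompt_checkpoint(checkpoint, prompt, max_new_tokens=15):
    """Load one Hugging Face checkpoint and return its response to `prompt`."""
    # Lazy imports: only needed when this function is actually called.
    import lemonade.tools.torch_llm as tl
    import lemonade.tools.prompt as pt
    from lemonade.state import State

    # Assumption: a separate build name per checkpoint keeps results apart.
    build_name = checkpoint.replace("/", "_")
    state = State(cache_dir="cache", build_name=build_name)
    state = tl.HuggingfaceLoad().run(state, input=checkpoint)
    state = pt.Prompt().run(state, prompt=prompt, max_new_tokens=max_new_tokens)
    return state.response


def sweep(checkpoints, prompt, run_one=prompt_checkpoint):
    """Run `run_one` on each checkpoint and collect the responses by name."""
    return {checkpoint: run_one(checkpoint, prompt) for checkpoint in checkpoints}
```

For example, `sweep(["facebook/opt-125m"], "hi")` returns a dictionary mapping the checkpoint name to its response; more checkpoints can be added to the list as needed.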