A solution for benchmarking many LLMs under many different configurations in parallel on Modal.
Install the package:

```bash
pip install -e .
```

To run multiple benchmarks at once, first deploy the Datasette UI, which will let you easily view the results later:
```bash
(cd src && modal deploy -m big_benchmark)
```
Then, start a benchmark suite from a configuration file:
```bash
bb configs/llama3.yaml
```

Once the suite has finished, you will be given a URL to a UI where you can view your results, along with a command to download a JSONL file of your results.
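The configuration schema isn't documented in this section, so the following is only a hypothetical sketch of what a suite file like `configs/llama3.yaml` might contain; every key name here is an assumption for illustration, not the actual schema:

```yaml
# Hypothetical sketch only -- the real schema may differ.
# All field names below are assumptions, not documented keys.
model: meta-llama/Meta-Llama-3-8B-Instruct
configs:
  - max_tokens: 256
    temperature: 0.0
  - max_tokens: 1024
    temperature: 0.7
```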
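JSONL stores one JSON object per line, so the downloaded results can be inspected with nothing but the Python standard library. A minimal sketch, assuming the file was saved as `results.jsonl` (the filename is illustrative; use whatever path the download command gives you):

```python
import json

# Read one JSON object per line from the downloaded results file.
# "results.jsonl" is an assumed filename for illustration.
with open("results.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} benchmark records")
```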
We welcome contributions, including those that add tuned benchmarks to our collection. See the CONTRIBUTING file and the Getting Started document for more details on contributing to Big Benchmark.
Big Benchmark is available under the MIT license. See the LICENSE file for more details.