Feature request
This is a sampling method, already present in other LLM inference backends, that aims to simplify the truncation process and help compensate for the flaws/failings of Top P & Top K.
Min P.
What Min P does is simple: it sets a minimum probability that a token must reach to be considered during sampling. However, this is not a hard limit; the minimum scales with the top token's probability. So a Min P value of 0.1 means the base requirement is 10% of the top token's probability: if the top token is at 25%, only tokens with at least 2.5% probability are considered.
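To make the scaling concrete, here is a minimal sketch of the filtering step in PyTorch. The function name and signature are illustrative only, not an existing transformers API:

```python
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.05) -> torch.Tensor:
    """Mask out tokens whose probability is below min_p * p(top token).

    `logits` is assumed to be a 1-D tensor of next-token logits.
    """
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max()   # scaled cutoff, e.g. 0.1 * 0.25 = 0.025
    keep = probs >= threshold
    return logits.masked_fill(~keep, float("-inf"))
```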
This method subjectively seems to improve results across the board with no noticeable downside, and has been merged into the following FOSS LLM backends:
- llama.cpp
- vllm
- text-generation-webui (through both the HF loaders and llama-cpp-python)
I would suggest a default of 0.05.
Motivation
I noticed certain 'flaws' in the popular Top P sampling method:
- When the model does not have sufficient confidence/concentration on the next token candidate(s), it's possible for the sampler to consider many tokens that are highly unlikely compared to the few choices it has confidence in.
- Top K, as a supplement to Top P, caps the number of 'low confidence' tokens outright, but this often comes at the cost of token choice diversity (and the cutoff is often arbitrary).
- In addition to this, Top P can sometimes cut reasonable tokens. What if there is a 90.1% probability token followed by a 9% probability token? A Top P value of 0.90 would completely gloss over the 9% token in this instance (see the sketch after this list).
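A small worked example of that case (the numbers are illustrative only), showing how Top P at 0.90 keeps just the single top candidate while a Min P of 0.05 keeps both strong ones:

```python
import torch

# Illustrative distribution: one dominant token, one reasonable token,
# and the remaining mass spread thin across noise tokens.
probs = torch.tensor([0.901, 0.090] + [0.0009] * 10)

# Top P (0.90): the cumulative sum already exceeds 0.90 at the first token,
# so the 9% token is cut.
sorted_probs, _ = probs.sort(descending=True)
cutoff = (sorted_probs.cumsum(dim=-1) >= 0.90).nonzero()[0].item() + 1
print(sorted_probs[:cutoff])   # tensor([0.9010]) -- only one candidate survives

# Min P (0.05): threshold = 0.05 * 0.901 ≈ 0.045, so both strong tokens
# survive while the 0.09% noise tokens are dropped.
keep = probs >= 0.05 * probs.max()
print(probs[keep])             # tensor([0.9010, 0.0900])
```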
For this reason I made Min P, which seems to have been positively received across the board.
Your contribution
I may consider making a PR for this.