Increasing capacity and reliability #12311
Replies: 16 comments 18 replies
-
This sounds really great @ryanjsalva. Will those error improvements address the countTokens issues?
-
Will this improve the actual request quota that users on Google Authentication (Free) receive for the 2.5 Pro model?
-
How is the level of complexity defined, and by whom? In my experience, Gemini itself cannot distinguish between high and low complexity; everything seems simple to it. Due to its overconfidence, it tends to underestimate the complexity of my code logic, resulting in low-quality responses.
-
Can't you guys just add Qwen or DeepSeek as additional models? Those open-source models are WAY ahead of Gemini in intelligence. Qwen CLI (a fork of Gemini CLI) is better in every way because of the Qwen model. Both models are available in Vertex, so I would think there are no major technical barriers to adding them. For coding, Gemini Flash and even Pro are both dumb compared to the open-source ones and don't come close to GPT/Claude.
-
Thank you @ryanjsalva for the transparency and the commitment to improving reliability. While the outlined measures are encouraging, I'd like to respectfully raise some concerns regarding the proposed intelligent routing system and suggest an alternative approach.

Technical Concerns: The automatic routing to Gemini 2.5 Flash based on "complexity" assessment raises significant technical questions. As @wangbiye astutely pointed out, defining and detecting complexity is inherently challenging. In practice, LLMs often exhibit overconfidence and may underestimate task complexity. This could result in degraded output quality precisely when users need the most capable model. Additionally, @rcleveng's question about countTokens issues highlights how error-handling improvements need to address the full spectrum of API reliability concerns, not just routing.

Ethical and Philosophical Dimensions: @mlik-sudo's comment about user sovereignty touches on a fundamental principle: users should retain agency over the tools they use. Automatic model switching, even with good intentions, raises its own concerns. The 429 errors mentioned by @Neinndall (hitting limits at just 50-75 requests) suggest that capacity constraints are pushing toward rationing rather than true scaling. While understandable, this creates a conflict between system optimization and user needs.

Proposed Solution: User-Controlled Model Selection. Rather than intelligent routing, I propose implementing explicit user control with smart defaults. This approach balances your capacity goals with user sovereignty. Users who trust the system can benefit from automatic optimization, while those with specialized needs retain control.

Conclusion: I deeply appreciate the engineering effort going into capacity expansion and error-handling improvements; these are absolutely critical. However, I encourage the team to reconsider automatic model switching in favor of user-empowered tools. This builds trust, respects expertise, and ultimately creates a more sustainable relationship between platform and community. Thank you for considering this feedback, and for continuing to engage with the community so openly. 🙏
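As a minimal sketch of the "explicit control with smart defaults" idea proposed above: every name here (`ModelPreference`, `pickModel`, `looksComplex`) is hypothetical and not part of Gemini CLI, and the complexity heuristic is a deliberately crude stand-in for whatever the router actually does.

```typescript
// Hypothetical sketch of user-controlled model selection with a smart default.
// None of these names are real Gemini CLI APIs.

type Model = "gemini-2.5-pro" | "gemini-2.5-flash";
type ModelPreference = "auto" | "pro" | "flash";

// Crude stand-in for any complexity heuristic the router might use.
function looksComplex(prompt: string): boolean {
  return prompt.length > 500 || /refactor|architecture|debug/i.test(prompt);
}

// An explicit user choice always wins; only "auto" consults the heuristic.
function pickModel(pref: ModelPreference, prompt: string): Model {
  if (pref === "pro") return "gemini-2.5-pro";
  if (pref === "flash") return "gemini-2.5-flash";
  return looksComplex(prompt) ? "gemini-2.5-pro" : "gemini-2.5-flash";
}
```

However sophisticated the real heuristic is, the key property of this design holds regardless: an explicit "pro" or "flash" preference bypasses the router entirely.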
-
Gemini CLI should offer a multi-phase workflow where, at every step, users choose the AI mode: full automation (Auto), speed (Gemini Flash), or depth (Gemini Pro). Customization at every stage: user choice becomes the heart of the process.
-
I think a straightforward, transparent approach to usage and capacity, such as Anthropic provides in their Claude tools, would be the ideal solution. Rather than swapping models or showing errors, Gemini should provide a robust usage meter with clear information about daily and monthly quotas, together with when these reset. This would go a long way toward solving the issues I currently experience with Gemini.
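As a rough illustration of the kind of meter being requested here; the field names, quota numbers, and reset policy below are all assumptions for the sketch, not Gemini's actual limits or API.

```typescript
// Hypothetical usage-meter sketch: show consumed vs. remaining daily quota
// and when the counter resets. Field names are illustrative only.

interface Quota {
  dailyLimit: number;   // requests allowed per day (assumed figure)
  used: number;         // requests consumed so far today
  resetsAtUtc: string;  // when the daily counter resets
}

function formatMeter(q: Quota): string {
  const remaining = Math.max(0, q.dailyLimit - q.used);
  const pct = Math.round((q.used / q.dailyLimit) * 100);
  return `${q.used}/${q.dailyLimit} requests used (${pct}%), ` +
         `${remaining} remaining, resets at ${q.resetsAtUtc}`;
}
```

For example, `formatMeter({ dailyLimit: 1500, used: 375, resetsAtUtc: "00:00 UTC" })` renders as `"375/1500 requests used (25%), 1125 remaining, resets at 00:00 UTC"`.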
-
Hey, y'all. 👋 Thanks again for your patience and constructive feedback. The debate over model choice and intelligent routing is an important one. In fact, it's important enough to merit a dedicated thread, which I'll open next week. If you'll forgive me, I want to focus this afternoon's update on the progress made since yesterday so everyone can stay informed.

🐿️ Shipped

⏭️ Next

Our plan is to monitor traffic on Monday (when congestion is at its peak). Watch this discussion for another update on Tuesday.
-
Discussion about the Codebase Investigator mentioned that the prompt/context could influence when it was invoked. Is model routing influenced by prompt phrases such as "investigate this with deep thinking"?
-
Experience over the last few hours has been good: smooth, no 429 errors. Since someone appears to be listening to messages posted here... what is the practical Pro quota? Documentation says 1,500 requests a day, but actual user experience is nothing like this. It seems that this is 1,500 requests divided in some unknown way between different models. Obviously not every request needs Pro, and auto routing is great (hopefully), but when Pro runs out, gemini-cli is not very useful. For me, the allowance of Pro queries is hugely more impactful than the number of Flash queries.

The descriptions of the various paid options are consistent in terminology. We read "free", "higher", and "highest", but what these mean in terms of Pro requests is a mystery. Also, the only plan I can access with "highest", whatever that is, is AI Ultra from a personal account. Why can't I access "highest" without paying for a big bundle of things I don't want, like animation generation and YouTube without ads? Why is there only one level of Developer subscription?

There are hints that Google folk believe the quota has been set so high that it is practically inexhaustible. That is clearly not true. I would pay for a higher plan, but I think Ultra at around ten times the cost (the trial discounts the first three months to about five times the cost of Pro) is too much.
-
I'm hoping this is related to a frustrating experience I've been having where Gemini CLI just spins indefinitely when I ask it to review a diff on a project: #11765. I just re-tested on 0.11.3 and I'm getting the same experience. My prompt:
It does ask me to approve running that shell command, and I see the diff in the terminal. The diff is under 500 lines. But then I wait indefinitely at:
I'm authenticated with Google and I am a Premium subscriber.
-
npm install -g @google/gemini-cli
-
I need a fixed configuration method that always uses the Pro model. The auto mode is utterly foolish: I asked it to analyze the existing code and make modifications according to the new plan, but it did nothing and simply told me "It's already done."
-
For the past two days, I've been experiencing constant API errors: Error 400. And the maximum number of requests we get with the 2.5 Pro model has dropped again.
-
Hello, friends. 👋
First, thanks y’all for continuing to shower Gemini CLI with your ideas, issues, code contributions, and new extensions. The sheer volume of developers building with Gemini CLI has exceeded our wildest ambitions. Sincerely… thank you.
Unfortunately, as demand has increased, many of you have also reported intermittent capacity-related errors. I know toolchain reliability is critical to achieving flow state. High error rates are frustrating, and they do not reflect the experience we aim to deliver. I’m truly sorry. 😞 There is, however, light on the horizon. I’m writing to outline our plan to both (a) quickly resolve the congestion, and (b) build a more resilient platform for the years ahead.
Focusing on reliability and long-term stability
To ensure a stable, snappy experience for everyone, the Gemini CLI maintainers will immediately:
Through these changes, we aim to deliver significantly lower error rates. And, should you ever suffer an error, we hope the improved error messages will light the way to a quick resolution. Thank you for your patience and for continuing to build incredible things on our platform. Keep watching this discussion post for updates. 👀