Description
Google DeepMind released the Gemma 4 model family in April 2026. The upstream llama.cpp has already merged full support for the gemma4 architecture (including the gemma4-iswa inference backend, gemma4 tokenizer, and multimodal GEMMA4V/GEMMA4A vision/audio projectors).
However, the llama.cpp version bundled in LocalLLMClient currently only supports up to gemma3n. Attempting to load a Gemma 4 GGUF model results in:
Failed to load model: Failed to load model from file
Environment
- Device: iPhone 17 Pro (iOS 19)
- LocalLLMClient: main branch @ d420bc8
- Model: google/gemma-4-E4B-it → converted to GGUF Q4_K_M via upstream llama.cpp build 8818
- Upstream llama.cpp status: Gemma 4 fully supported (text + vision + audio)
Current supported Gemma architectures in bundled llama.cpp
// llama-arch.cpp (current)
{ LLM_ARCH_GEMMA, "gemma" },
{ LLM_ARCH_GEMMA2, "gemma2" },
{ LLM_ARCH_GEMMA3, "gemma3" },
{ LLM_ARCH_GEMMA3N, "gemma3n" },
{ LLM_ARCH_GEMMA_EMBEDDING, "gemma-embedding" },
// ❌ LLM_ARCH_GEMMA4 is missing
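For reference, a minimal sketch of what the missing registration would presumably look like after the sync. The identifier and string are assumed from the existing GEMMA/GEMMA2/GEMMA3 naming pattern and should be verified against the actual upstream commit:

// llama-arch.h (hypothetical addition)
enum llm_arch {
    // ... existing entries ...
    LLM_ARCH_GEMMA3N,
    LLM_ARCH_GEMMA4,            // assumed name, following the gemma3/gemma3n pattern
    LLM_ARCH_GEMMA_EMBEDDING,
    // ...
};

// llama-arch.cpp (hypothetical addition to the arch-name table)
{ LLM_ARCH_GEMMA4, "gemma4" },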
Expected behavior
LocalLLMClient should be able to load and run Gemma 4 GGUF models (both text-only and multimodal with mmproj) on iOS/macOS, just like it currently supports Gemma 3 and Gemma 3n.
Key upstream commits/files to sync
The following components need to be synced from upstream llama.cpp to add gemma4 support:
- src/llama-arch.h / src/llama-arch.cpp — LLM_ARCH_GEMMA4 registration
- src/models/gemma4-iswa.cpp — Gemma 4 inference implementation (ISWA hybrid attention)
- src/llama-model.cpp — model loading + graph building for gemma4 (see the sketch after this list)
- src/llama-vocab.cpp — gemma4 tokenizer (BPE with SPM-style byte fallback)
- gguf-py/gguf/constants.py — MODEL_ARCH.GEMMA4, PLE tensors, vision/audio projector types
- tools/mtmd/clip.cpp / clip.h — GEMMA4V / GEMMA4A projector support
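As a rough illustration of the llama-model.cpp item above, the graph-builder dispatch would presumably gain one more case next to the existing Gemma entries. The builder class names below are assumed to mirror the gemma3n ISWA builder and are not copied from upstream:

// llama-model.cpp (hypothetical sketch, not upstream code)
switch (arch) {
    // ... existing architectures ...
    case LLM_ARCH_GEMMA3N:
        llm = std::make_unique<llm_build_gemma3n_iswa>(*this, params);
        break;
    case LLM_ARCH_GEMMA4:
        // assumed builder, implemented in src/models/gemma4-iswa.cpp
        llm = std::make_unique<llm_build_gemma4_iswa>(*this, params);
        break;
    // ... remaining cases ...
}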
Why this matters
Gemma 4 E4B is specifically designed for on-device deployment (the "E" stands for "Effective parameters": only 4B active parameters despite a larger total parameter count, thanks to Per-Layer Embeddings). It's an ideal candidate for mobile inference via LocalLLMClient.
Workaround
Currently using upstream llama.cpp via llama-server + HTTP as a temporary workaround, but native on-device inference via LocalLLMClient would be strongly preferred.
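For completeness, a minimal sketch of that temporary setup, assuming llama-server is running locally on the default port with the converted GGUF (the model path is illustrative). The client uses cpp-httplib, which llama.cpp already vendors for its server; llama-server exposes an OpenAI-compatible chat completions endpoint:

// Temporary workaround: call an upstream llama-server over HTTP instead of
// loading the GGUF in-process. Assumes the server was started with something like:
//   llama-server -m gemma-4-E4B-it-Q4_K_M.gguf --port 8080
#include "httplib.h"   // cpp-httplib
#include <iostream>

int main() {
    httplib::Client cli("http://127.0.0.1:8080");

    // Minimal OpenAI-style chat request; the loaded model is used by default.
    const char * body = R"({
        "messages": [{"role": "user", "content": "Hello from the workaround"}],
        "max_tokens": 64
    })";

    auto res = cli.Post("/v1/chat/completions", body, "application/json");
    if (res && res->status == 200) {
        std::cout << res->body << std::endl;   // raw JSON response
    } else {
        std::cerr << "request failed" << std::endl;
    }
    return 0;
}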
Thank you for this excellent package! 🙏