
Feature Request: Add Gemma 4 (gemma4) architecture support #90

@Kyle17888

Description

Google DeepMind released the Gemma 4 model family in April 2026. Upstream llama.cpp has already merged full support for the gemma4 architecture (including the gemma4-iswa inference backend, the gemma4 tokenizer, and the multimodal GEMMA4V/GEMMA4A vision/audio projectors).

However, the llama.cpp version bundled in LocalLLMClient currently only supports up to gemma3n. Attempting to load a Gemma 4 GGUF model results in:

Failed to load model: Failed to load model from file
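
For context, the failure seems to originate in the bundled llama.cpp's architecture lookup: when the GGUF's general.architecture string is not in the name table, the lookup returns LLM_ARCH_UNKNOWN and loading aborts. A minimal sketch of that path, paraphrased from upstream llama-arch.cpp (the exact code may differ in the bundled revision):

// llama-arch.cpp (sketch of the existing lookup path)
llm_arch llm_arch_from_string(const std::string & name) {
    for (const auto & kv : LLM_ARCH_NAMES) {
        if (kv.second == name) {   // compares against the table shown below
            return kv.first;
        }
    }
    return LLM_ARCH_UNKNOWN;       // "gemma4" falls through to here today
}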

Environment

  • Device: iPhone 17 Pro (iOS 19)
  • LocalLLMClient: main branch @ d420bc8
  • Model: google/gemma-4-E4B-it → converted to GGUF Q4_K_M via upstream llama.cpp build 8818
  • Upstream llama.cpp status: Gemma 4 fully supported (text + vision + audio)

Currently supported Gemma architectures in the bundled llama.cpp

// llama-arch.cpp (current)
{ LLM_ARCH_GEMMA,            "gemma"            },
{ LLM_ARCH_GEMMA2,           "gemma2"           },
{ LLM_ARCH_GEMMA3,           "gemma3"           },
{ LLM_ARCH_GEMMA3N,          "gemma3n"          },
{ LLM_ARCH_GEMMA_EMBEDDING,  "gemma-embedding"  },
// ❌ LLM_ARCH_GEMMA4 is missing
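
For reference, the addition would presumably look like the following; the enum value and architecture string are assumptions extrapolated from the gemma3n naming pattern, pending verification against the upstream commit:

// llama-arch.h (hypothetical addition)
    LLM_ARCH_GEMMA3N,
    LLM_ARCH_GEMMA4,   // assumed name, following the existing pattern

// llama-arch.cpp (hypothetical, after sync)
{ LLM_ARCH_GEMMA3N,          "gemma3n"          },
{ LLM_ARCH_GEMMA4,           "gemma4"           },  // new entry
{ LLM_ARCH_GEMMA_EMBEDDING,  "gemma-embedding"  },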

Expected behavior

LocalLLMClient should be able to load and run Gemma 4 GGUF models (both text-only and multimodal with mmproj) on iOS/macOS, just like it currently supports Gemma 3 and Gemma 3n.

Key upstream commits/files to sync

The following components need to be synced from upstream llama.cpp to add gemma4 support:

  • src/llama-arch.h / src/llama-arch.cpp — LLM_ARCH_GEMMA4 registration
  • src/models/gemma4-iswa.cpp — Gemma 4 inference implementation (ISWA hybrid attention)
  • src/llama-model.cpp — model loading + graph building for gemma4 (see the sketch after this list)
  • src/llama-vocab.cpp — gemma4 tokenizer (BPE with SPM-style byte fallback)
  • gguf-py/gguf/constants.py — MODEL_ARCH.GEMMA4, PLE tensors, vision/audio projector types
  • tools/mtmd/clip.cpp / clip.h — GEMMA4V / GEMMA4A projector support
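
As a rough illustration of the src/llama-model.cpp piece, the graph-building dispatch would gain a gemma4 case next to the existing Gemma variants. Everything below (the case label, the llm_build_gemma4_iswa builder, the constructor signature) is a guess modeled on the gemma3/gemma3n code, not copied from the upstream commit:

// llama-model.cpp (hypothetical sketch; names and signature assumed from gemma3n)
switch (arch) {
    case LLM_ARCH_GEMMA3N:
        llm = std::make_unique<llm_build_gemma3n_iswa>(*this, params);
        break;
    case LLM_ARCH_GEMMA4:  // route to the newly synced gemma4-iswa implementation
        llm = std::make_unique<llm_build_gemma4_iswa>(*this, params);
        break;
    default:
        GGML_ABORT("unsupported architecture");
}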

Why this matters

Gemma 4 E4B is designed specifically for on-device deployment: the "E" stands for "Effective parameters", meaning only 4B parameters are active per token despite a larger total parameter count, thanks to Per-Layer Embeddings. That makes it an ideal candidate for mobile inference via LocalLLMClient.

Workaround

As a temporary workaround I'm running upstream llama.cpp via llama-server and calling it over HTTP (see the sketch below), but native on-device inference through LocalLLMClient would be strongly preferred.
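
For anyone needing the same stopgap, here is a minimal sketch of the HTTP workaround, assuming a llama-server instance is already running with the Gemma 4 GGUF on the default port. The host, port, and prompt are placeholders; the /completion endpoint and its prompt/n_predict fields are from the upstream server docs:

// workaround_client.cpp: call an upstream llama-server over HTTP (libcurl)
// Build: g++ workaround_client.cpp -lcurl
#include <curl/curl.h>
#include <iostream>
#include <string>

// Collect the response body into a std::string.
static size_t on_body(char * data, size_t size, size_t nmemb, void * userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    CURL * curl = curl_easy_init();
    if (!curl) return 1;

    // llama-server's completion endpoint; host/port are placeholders.
    const std::string url  = "http://127.0.0.1:8080/completion";
    const std::string body = R"({"prompt": "Hello, Gemma 4!", "n_predict": 64})";

    struct curl_slist * headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: application/json");

    std::string response;
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, on_body);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode rc = curl_easy_perform(curl);
    if (rc == CURLE_OK) {
        std::cout << response << std::endl;  // JSON with a "content" field
    } else {
        std::cerr << curl_easy_strerror(rc) << std::endl;
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}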

Thank you for this excellent package! 🙏
