Add on-device voice search #3568

Draft
MahmoodMahmood wants to merge 5 commits into quran:main from MahmoodMahmood:add-voice-search

Conversation

@MahmoodMahmood (Contributor) commented Mar 1, 2026

Summary

Add on-device voice search that lets users recite a Quran verse and navigate to its location. Uses sherpa-onnx for offline speech recognition with a Whisper model fine-tuned for Quran recitation.

Key features

  • Full-screen voice search — accessible from QuranActivity toolbar, with recording animation, transcription, and verse matching
  • Inline voice search — mic icon in SearchActivity/QuranActivity toolbar that records, periodically transcribes, and populates the search field
  • Fully offline — ASR model (~155 MB) is downloaded once over Wi-Fi and stored on device; no network needed after that
  • Arabic text normalization — handles diacritics, alif/hamza variants, and taa marbuta for fuzzy verse matching
  • Three-tier verse matching — exact substring, normalized fuzzy match, and word overlap scoring

APK size impact

The sherpa-onnx native libraries add ~5 MB to the debug APK (all ABIs). The ~155 MB ASR model is not bundled in the APK — it's downloaded on-demand when the user first enables voice search in Settings.

Model download UX

When the user enables voice search in Settings, the ~155 MB Whisper model is downloaded in the background. A progress indicator shows download status. The download only starts on user action (never automatic), and the user can cancel at any time. If the model isn't downloaded yet and the user taps the mic, they see a message directing them to Settings.

Note on sherpa-onnx AAR

sherpa-onnx provides the on-device speech recognition runtime (native C++ libraries + Kotlin API). Since k2-fsa does not publish an official Android AAR to Maven Central, the pre-built AAR (~38 MB) is hosted on this fork's GitHub releases and downloaded automatically at build time by a Gradle task with SHA-256 checksum verification. No binary is checked into the repository.
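For illustration, the checksum-verified fetch can be sketched in Python (the actual implementation is a Gradle task in feature/voicesearch/build.gradle.kts; the function name and structure here are hypothetical):

```python
import hashlib
import urllib.request
from pathlib import Path

def fetch_verified(url: str, dest: Path, expected_sha256: str) -> None:
    """Download a file once and refuse to keep it if the SHA-256 doesn't match."""
    if dest.exists():
        return  # analogous to the Gradle task's onlyIf { !aarFile.exists() }
    dest.parent.mkdir(parents=True, exist_ok=True)
    data = urllib.request.urlopen(url).read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"Checksum mismatch: expected {expected_sha256}, got {digest}")
    dest.write_bytes(data)
```

The early-return on an existing file also makes repeated builds cheap: once the AAR is present and verified, the task is a no-op.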

Note on Voice Activity Detection (VAD) — why recording uses manual stop

We initially implemented VAD using Silero VAD to auto-detect when the user stops reciting. However, tajweed rules produce sustained quiet sounds (madd, ghunnah) that the VAD model misclassified as silence, causing premature auto-stop mid-verse. Natural pauses between verses (waqf) of 2–5 seconds further complicated detection. Despite extensive tuning, the false-stop rate was too high for a good UX, so recording now uses manual stop only.

How to regenerate the ASR model files

The ASR model is converted from tarteel-ai/whisper-base-ar-quran (HuggingFace) to sherpa-onnx ONNX format with INT8 quantization. This produces three files: base-ar-quran-encoder.int8.onnx, base-ar-quran-decoder.int8.onnx, and base-ar-quran-tokens.txt.

Prerequisites

pip install torch==2.6.0 transformers openai-whisper onnxruntime onnx

Note: PyTorch 2.6.x is required. Newer versions (2.10+) default to a dynamo-based ONNX exporter that is incompatible with the sherpa-onnx export script.

Phase 1: Convert HuggingFace model to OpenAI Whisper format

The HuggingFace Transformers Whisper format uses different layer naming than the original OpenAI Whisper format. Create convert_hf_to_openai.py:

import re
import torch
from transformers import WhisperForConditionalGeneration
import whisper

def hf_to_whisper_states(text):
    text = re.sub('.layers.', '.blocks.', text)
    text = re.sub('.self_attn.', '.attn.', text)
    text = re.sub('.q_proj.', '.query.', text)
    text = re.sub('.k_proj.', '.key.', text)
    text = re.sub('.v_proj.', '.value.', text)
    text = re.sub('.out_proj.', '.out.', text)
    text = re.sub('.fc1.', '.mlp.0.', text)
    text = re.sub('.fc2.', '.mlp.2.', text)
    text = re.sub('.encoder_attn.', '.cross_attn.', text)
    text = re.sub('.embed_positions.weight', '.positional_embedding', text)
    text = re.sub('.embed_tokens.', '.token_embedding.', text)
    text = re.sub('model.', '', text)
    text = re.sub('attn.layer_norm.', 'attn_ln.', text)
    text = re.sub('.final_layer_norm.', '.mlp_ln.', text)
    text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text)
    text = re.sub('decoder.layer_norm.', 'decoder.ln.', text)
    text = re.sub('proj_out.weight', 'decoder.token_embedding.weight', text)
    return text

hf_model = WhisperForConditionalGeneration.from_pretrained("tarteel-ai/whisper-base-ar-quran")
hf_state_dict = hf_model.state_dict()

whisper_state_dict = {}
for key in list(hf_state_dict.keys()):
    new_key = hf_to_whisper_states(key)
    whisper_state_dict[new_key] = hf_state_dict[key]

base_model = whisper.load_model("base")
base_model.load_state_dict(whisper_state_dict)

torch.save(
    {"dims": base_model.dims.__dict__, "model_state_dict": whisper_state_dict},
    "base-ar-quran.pt"
)
print("Saved base-ar-quran.pt")

Run the script:

python convert_hf_to_openai.py

Phase 2: Export to ONNX with INT8 quantization

git clone --depth 1 --branch v1.12.25 https://github.com/k2-fsa/sherpa-onnx.git
cd sherpa-onnx/scripts/whisper

Edit export-onnx.py to add the custom model:

  1. Add "base-ar-quran" to the choices list in get_args():

    "medium-aishell",
    "base-ar-quran",  # <-- add this
  2. Add a case to the load_model() function (before the final else):

    elif name == "base-ar-quran":
        filename = "./base-ar-quran.pt"
        if not Path(filename).is_file():
            raise ValueError("Place base-ar-quran.pt in the current directory.")
        return whisper.load_model(filename)

Then run:

cp /path/to/base-ar-quran.pt .
python export-onnx.py --model base-ar-quran

This produces:

  • base-ar-quran-encoder.int8.onnx (~28 MB)
  • base-ar-quran-decoder.int8.onnx (~125 MB)
  • base-ar-quran-tokens.txt (~847 KB)

Note: The INT8 quantized encoder/decoder checksums are nondeterministic across runs (ONNX quantization is not bit-for-bit reproducible), but the resulting models are functionally equivalent. The tokens file is deterministic.
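To confirm which artifacts are reproducible, digests from two export runs can be compared with a small helper (the run directories here are hypothetical):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 in 1 MB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Example comparison across two exports — only the tokens file is expected to match:
# for name in ("base-ar-quran-encoder.int8.onnx",
#              "base-ar-quran-decoder.int8.onnx",
#              "base-ar-quran-tokens.txt"):
#     print(name, sha256_of(f"run1/{name}") == sha256_of(f"run2/{name}"))
```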

Screenshots

  • Main screen with mic button
  • Search toolbar with inline mic
  • Search results after voice input

Pulsing Mic Animation

Mic icon pulses with concentric ripple rings during recording. Partial transcription updates the search box in real-time.

How to review

This PR is split into 5 commits, best reviewed in order:

  1. Common utilities: SearchTextUtil Arabic normalization logic and InlineVoiceSearchController interface design
  2. ASR engine: AsrEngine/AsrModelManager/AudioRecorder, the sherpa-onnx integration and model download logic
  3. Verse matching: QuranVerseMatcher three-tier algorithm and QuranVerseProvider interface
  4. Voice search UI: Compose screen, Molecule presenter, state management
  5. App integration: how voice search is wired into existing activities, menus, and DI

Architecture

flowchart LR
    subgraph app["App Layer"]
        direction TB
        QA["QuranActivity"]
        SA["SearchActivity"]
        PMV["PulsingMicView"]
    end

    subgraph feature["feature/voicesearch"]
        direction TB

        subgraph ui["UI"]
            VSA["VoiceSearchActivity"]
            VSS["VoiceSearchScreen"]
            VSP["VoiceSearchPresenter"]
        end

        subgraph asr["ASR Pipeline"]
            AR["AudioRecorder"]
            ASR["AsrEngine\n(sherpa-onnx)"]
            AMM["AsrModelManager"]
        end

        subgraph matching["Verse Matching"]
            QVM["QuranVerseMatcher"]
            QVP["QuranVerseProvider"]
        end

        IVSCI["InlineVoiceSearchControllerImpl"]
    end

    subgraph common["Common Modules"]
        direction TB
        IC["InlineVoiceSearchController\n(common/di)"]
        STU["SearchTextUtil\n(common/search)"]
    end

    QA -- launches --> VSA
    QA -- inline mic --> IC
    SA -- inline mic --> IC
    QA -.- PMV
    SA -.- PMV

    VSA --> VSP
    VSA --> VSS
    VSP --> AR
    VSP --> ASR
    VSP --> QVM

    AR -- "16kHz PCM" --> ASR
    AMM -- model --> ASR

    QVM --> QVP
    STU -- normalization --> QVM

    IVSCI -.-> IC
    IVSCI --> AR
    IVSCI --> ASR

New modules

  • common/search: Arabic text normalization (diacritics removal, alif/hamza unification)
  • common/di: InlineVoiceSearchController interface for cross-module voice search contracts
  • feature/voicesearch: the full feature (UI, ASR engine, audio recording, model management, verse matching, DI)

How to test manually

  1. Build and install: ./gradlew installMadaniDebug
  2. Open Settings → enable Voice Search → wait for the ~155 MB model to download (progress shown)
  3. Full-screen voice search: From the main Quran screen, tap the mic icon in the toolbar → recite a verse (e.g. Surah Al-Fatiha) → tap stop → verify the correct verse appears in results → tap a result to navigate
  4. Inline voice search: Open search → tap the mic icon in the search bar → recite → verify transcription appears in the search field in real-time → verify search results update
  5. Edge cases: Try reciting only part of a verse, try a verse from the middle of a surah, try with background noise
  6. Model not ready: Uninstall and reinstall, tap mic before enabling voice search in Settings → verify the "model not ready" message appears

Automated tests

  • SearchTextUtilArabicTest — Arabic normalization and tokenization
  • QuranVerseMatcherTest — Verse matching with exact, fuzzy, and partial scoring
  • AsrModelManagerTest — Model file management, migration, temp file cleanup
  • AudioRecorderBufferTest — Buffer initialization, bounds, sample rate
  • VoiceSearchStateTest — State defaults, copy semantics, enum completeness
  • VoiceSearchPresenterTest — State mapping, event handling, navigation events
  • InlineVoiceSearchControllerImplTest — State transitions, model readiness, recording guards

🤖 Generated with Claude Code

@MahmoodMahmood MahmoodMahmood changed the title WIP: Add on-device voice search Add on-device voice search Mar 8, 2026
@MahmoodMahmood force-pushed the add-voice-search branch 6 times, most recently from 0b4564d to 8a490ff on March 9, 2026 at 14:17
MahmoodMahmood and others added 5 commits March 9, 2026 10:29
- SearchTextUtil: Arabic text normalization (diacritics removal, alif/hamza
  unification, taa marbuta handling) used for fuzzy verse matching
- InlineVoiceSearchController: interface in common/di for cross-module voice
  search contracts, allowing the app module to control inline voice search
  without depending on the feature module directly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On-device Arabic speech recognition using sherpa-onnx (Whisper model):
- AsrEngine: wraps sherpa-onnx OfflineRecognizer for Arabic transcription
- AsrModelManager: downloads and manages the ~155MB Whisper model files
- AudioRecorder: captures 16kHz mono PCM from the microphone

The sherpa-onnx AAR is downloaded automatically at build time from a GitHub
release with SHA-256 verification (see feature/voicesearch/build.gradle.kts).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three-tier matching algorithm:
1. Exact substring match against normalized verse text
2. Normalized fuzzy match with diacritics/alif-hamza unification
3. Word overlap scoring for partial recitations

QuranVerseProvider interface + QuranVerseProviderImpl that loads verses
from the existing translation database for matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Full-screen voice search with Compose UI:
- VoiceSearchScreen: recording animation, transcription display, verse
  match results with sura/ayah navigation
- VoiceSearchPresenter: Molecule-based state management connecting
  AudioRecorder -> AsrEngine -> QuranVerseMatcher pipeline
- VoiceSearchActivity: entry point launched from QuranActivity toolbar
- DI wiring via VoiceSearchComponent (Metro)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- QuranActivity: mic button in toolbar launches full-screen voice search
- SearchActivity: inline mic icon that records, transcribes, and populates
  the search field in real-time with a pulsing animation
- InlineVoiceSearchControllerImpl: bridges ASR engine with the inline mic
  UI, managing recording lifecycle and periodic transcription
- PulsingMicView: animated mic button with concentric ripple rings
- ApplicationModule: DI bindings for voice search components

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MahmoodMahmood (author) left a comment:
Self-review notes to help with context on key design decisions.

outputs.file(aarFile)
onlyIf { !aarFile.exists() }
doLast {
    aarFile.parentFile.mkdirs()

Why sherpa-onnx over Google's on-device Speech API?

  1. No Google Play Services dependency — runs purely on-device with no GMS requirement, important for F-Droid and Huawei users
  2. Custom model support — we use a Whisper model fine-tuned for Quran recitation (tarteel-ai/whisper-base-ar-quran), which is far more accurate for tajweed-style Arabic than Google's general-purpose recognizer
  3. Full offline guarantee — Google's "offline" speech API still phones home; sherpa-onnx is truly air-gapped after initial model download
  4. Privacy — no audio data leaves the device

The tradeoff is the ~38MB native library + ~155MB model download, but this is a one-time cost.


// Remove diacritics (harakat)
result = DIACRITICS_REGEX.replace(result, "")


Arabic normalization strategy:

This handles the core challenge of matching ASR output (which has no diacritics and inconsistent alif/hamza forms) against Quran text (which has full tashkeel). The approach:

  1. Strip all Unicode combining marks (diacritics/tashkeel)
  2. Unify alif variants (أ إ آ ٱ → ا) since ASR output doesn't distinguish them
  3. Normalize taa marbuta (ة → ه) for end-of-word matching

This is intentionally simple — a more sophisticated approach (e.g., morphological analysis) would be overkill since we're matching against known Quran text, not arbitrary Arabic.
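A minimal Python sketch of these three steps (the shipped SearchTextUtil is Kotlin; this is an illustration of the approach, not the actual code):

```python
import re
import unicodedata

def normalize_arabic(text: str) -> str:
    """Normalize Arabic text for fuzzy matching: strip diacritics,
    unify alif variants, and map taa marbuta to haa."""
    # 1. Strip Unicode combining marks (harakat/tashkeel all have a
    #    nonzero canonical combining class)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # 2. Unify alif variants (hamza above/below, madda, wasla)
    text = re.sub("[أإآٱ]", "ا", text)
    # 3. Normalize taa marbuta for end-of-word matching
    return text.replace("ة", "ه")
```

For example, a fully vocalized إِيَّاكَ and the bare ASR output اياك normalize to the same string, so substring matching works across the two.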


if (normalizedQuery.isBlank() || queryWords.isEmpty()) {
    return@withContext emptyList()
}

Three-tier matching algorithm:

  1. Exact substring — if the normalized transcription appears verbatim in a verse, it's a strong match
  2. Normalized fuzzy — after applying Arabic normalization to both sides, check for substring containment
  3. Word overlap scoring — tokenize both strings and compute overlap ratio, useful for partial recitations where the user only recites a few words

The tiers are tried in order and the best match wins. This handles the reality that ASR output is imperfect — users might recite part of a verse, the model might miss a word, etc.
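A toy Python sketch of the tier ordering (the scores and the inline normalizer are illustrative placeholders, not the QuranVerseMatcher implementation):

```python
import re
import unicodedata

def _normalize(text: str) -> str:
    # Minimal normalizer: strip diacritics, unify alif variants, map taa marbuta.
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub("[أإآٱ]", "ا", text).replace("ة", "ه")

def match_verses(query: str, verses: list[tuple[int, str]]) -> list[tuple[int, float]]:
    """Score each verse: 1.0 for an exact substring, 0.9 for a normalized
    substring, else the word-overlap ratio; best scores first."""
    nq = _normalize(query)
    q_words = set(nq.split())
    scored = []
    for verse_id, text in verses:
        nt = _normalize(text)
        if query in text:
            score = 1.0                      # tier 1: exact substring
        elif nq and nq in nt:
            score = 0.9                      # tier 2: normalized fuzzy match
        else:                                # tier 3: word overlap
            score = len(q_words & set(nt.split())) / len(q_words) if q_words else 0.0
        if score > 0:
            scored.append((verse_id, score))
    return sorted(scored, key=lambda p: -p[1])
```

Tier 3 is what makes partial recitations work: a query sharing only two of three words with a verse still surfaces it with a 2/3 score.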


import com.quran.data.di.AppScope
import com.quran.mobile.di.InlineVoiceSearchController
import com.quran.mobile.di.InlineVoiceSearchState

Why a separate InlineVoiceSearchController interface?

This follows the existing pattern in the codebase where common/di holds interfaces and the feature module provides the implementation. This way:

  • SearchActivity and QuranActivity (in app) depend only on the interface
  • The voice search feature module can be swapped out or disabled without touching the app module
  • The implementation bridges ASR → transcription → search field updates with periodic transcription (every ~2 seconds of audio)
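That periodic-transcription loop can be sketched in Python (the recorder/engine interfaces here are hypothetical stand-ins for AudioRecorder and AsrEngine; the real controller is Kotlin):

```python
import time

def inline_transcribe_loop(recorder, engine, on_text, interval_s: float = 2.0):
    """Sketch: accumulate PCM samples and re-transcribe the whole buffer
    roughly every `interval_s` seconds, then once more on stop."""
    buffer = []
    last = time.monotonic()
    while recorder.is_recording():
        buffer.extend(recorder.read())          # 16 kHz mono PCM samples
        now = time.monotonic()
        if now - last >= interval_s:
            on_text(engine.transcribe(buffer))  # partial result -> search field
            last = now
    on_text(engine.transcribe(buffer))          # final pass after manual stop
```

Re-transcribing the full buffer (rather than only the newest chunk) keeps partial results coherent, at the cost of growing inference time for long recordings.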
