Add on-device voice search #3568

Draft
MahmoodMahmood wants to merge 5 commits into quran:main from MahmoodMahmood:add-voice-search

Conversation

@MahmoodMahmood (Contributor) commented Mar 1, 2026

Summary

Add on-device voice search that lets users recite a Quran verse and navigate to its location. Uses sherpa-onnx for offline speech recognition with a Whisper model fine-tuned for Quran recitation.

Key features

  • Full-screen voice search — accessible from QuranActivity toolbar, with recording animation, transcription, and verse matching
  • Inline voice search — mic icon in SearchActivity/QuranActivity toolbar that records, periodically transcribes, and populates the search field
  • Fully offline — ASR model (~155 MB) is downloaded once over Wi-Fi and stored on device; no network needed after that
  • Arabic text normalization — handles diacritics, alif/hamza variants, and taa marbuta for fuzzy verse matching
  • Three-tier verse matching — exact substring, normalized fuzzy match, and word overlap scoring

APK size impact

The sherpa-onnx native libraries add ~5 MB to the debug APK (all ABIs). The ~155 MB ASR model is not bundled in the APK — it's downloaded on-demand when the user first enables voice search in Settings.

Model download UX

When the user enables voice search in Settings, the ~155 MB Whisper model is downloaded in the background. A progress indicator shows download status. The download only starts on user action (never automatic), and the user can cancel at any time. If the model isn't downloaded yet and the user taps the mic, they see a message directing them to Settings.

Note on sherpa-onnx AAR

sherpa-onnx provides the on-device speech recognition runtime (native C++ libraries + Kotlin API). Since k2-fsa does not publish an official Android AAR to Maven Central, the pre-built AAR (~38 MB) is hosted on this fork's GitHub releases and downloaded automatically at build time by a Gradle task with SHA-256 checksum verification. No binary is checked into the repository.
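For illustration, the checksum-verified fetch can be sketched in Python (the actual implementation is a Gradle task in feature/voicesearch/build.gradle.kts; the function name and structure here are hypothetical):

```python
import hashlib
import urllib.request
from pathlib import Path

def fetch_verified(url: str, dest: Path, expected_sha256: str) -> None:
    """Download a file once and refuse to keep it if the SHA-256 doesn't match."""
    if dest.exists():
        return  # analogous to the Gradle task's onlyIf { !aarFile.exists() }
    dest.parent.mkdir(parents=True, exist_ok=True)
    data = urllib.request.urlopen(url).read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"Checksum mismatch: expected {expected_sha256}, got {digest}")
    dest.write_bytes(data)
```

The early-return on an existing file also makes repeated builds cheap: once the AAR is present and verified, the task is a no-op.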

Note on Voice Activity Detection (VAD) — why recording uses manual stop

We initially implemented VAD using Silero VAD to auto-detect when the user stops reciting. However, tajweed rules produce sustained quiet sounds (madd, ghunnah) that the VAD model misclassified as silence, causing premature auto-stop mid-verse. Natural pauses between verses (waqf) of 2–5 seconds further complicated detection. Despite extensive tuning, the false-stop rate was too high for a good UX, so recording now uses manual stop only.

How to regenerate the ASR model files

The ASR model is converted from tarteel-ai/whisper-base-ar-quran (HuggingFace) to sherpa-onnx ONNX format with INT8 quantization. This produces three files: base-ar-quran-encoder.int8.onnx, base-ar-quran-decoder.int8.onnx, and base-ar-quran-tokens.txt.

Prerequisites

pip install torch==2.6.0 transformers openai-whisper onnxruntime onnx

Note: PyTorch 2.6.x is required. Newer versions (2.10+) default to a dynamo-based ONNX exporter that is incompatible with the sherpa-onnx export script.

Phase 1: Convert HuggingFace model to OpenAI Whisper format

The HuggingFace Transformers Whisper format uses different layer naming than the original OpenAI Whisper format. Create convert_hf_to_openai.py:

import re
import torch
from transformers import WhisperForConditionalGeneration
import whisper

def hf_to_whisper_states(text):
    text = re.sub('.layers.', '.blocks.', text)
    text = re.sub('.self_attn.', '.attn.', text)
    text = re.sub('.q_proj.', '.query.', text)
    text = re.sub('.k_proj.', '.key.', text)
    text = re.sub('.v_proj.', '.value.', text)
    text = re.sub('.out_proj.', '.out.', text)
    text = re.sub('.fc1.', '.mlp.0.', text)
    text = re.sub('.fc2.', '.mlp.2.', text)
    text = re.sub('.encoder_attn.', '.cross_attn.', text)
    text = re.sub('.embed_positions.weight', '.positional_embedding', text)
    text = re.sub('.embed_tokens.', '.token_embedding.', text)
    text = re.sub('model.', '', text)
    text = re.sub('attn.layer_norm.', 'attn_ln.', text)
    text = re.sub('.final_layer_norm.', '.mlp_ln.', text)
    text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text)
    text = re.sub('decoder.layer_norm.', 'decoder.ln.', text)
    text = re.sub('proj_out.weight', 'decoder.token_embedding.weight', text)
    return text

hf_model = WhisperForConditionalGeneration.from_pretrained("tarteel-ai/whisper-base-ar-quran")
hf_state_dict = hf_model.state_dict()

whisper_state_dict = {}
for key in list(hf_state_dict.keys()):
    new_key = hf_to_whisper_states(key)
    whisper_state_dict[new_key] = hf_state_dict[key]

base_model = whisper.load_model("base")
base_model.load_state_dict(whisper_state_dict)

torch.save(
    {"dims": base_model.dims.__dict__, "model_state_dict": whisper_state_dict},
    "base-ar-quran.pt"
)
print("Saved base-ar-quran.pt")

Run the script:

python convert_hf_to_openai.py

Phase 2: Export to ONNX with INT8 quantization

git clone --depth 1 --branch v1.12.25 https://github.com/k2-fsa/sherpa-onnx.git
cd sherpa-onnx/scripts/whisper

Edit export-onnx.py to add the custom model:

  1. Add "base-ar-quran" to the choices list in get_args():

    "medium-aishell",
    "base-ar-quran",  # <-- add this
  2. Add a case to the load_model() function (before the final else):

    elif name == "base-ar-quran":
        filename = "./base-ar-quran.pt"
        if not Path(filename).is_file():
            raise ValueError("Place base-ar-quran.pt in the current directory.")
        return whisper.load_model(filename)

Then run:

cp /path/to/base-ar-quran.pt .
python export-onnx.py --model base-ar-quran

This produces:

  • base-ar-quran-encoder.int8.onnx (~28 MB)
  • base-ar-quran-decoder.int8.onnx (~125 MB)
  • base-ar-quran-tokens.txt (~847 KB)

Note: The INT8 quantized encoder/decoder checksums are nondeterministic across runs (ONNX quantization is not bit-for-bit reproducible), but the resulting models are functionally equivalent. The tokens file is deterministic.
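To confirm which artifacts are reproducible, digests from two export runs can be compared with a small helper (the run directories here are hypothetical):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 in 1 MB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Example comparison across two exports — only the tokens file is expected to match:
# for name in ("base-ar-quran-encoder.int8.onnx",
#              "base-ar-quran-decoder.int8.onnx",
#              "base-ar-quran-tokens.txt"):
#     print(name, sha256_of(f"run1/{name}") == sha256_of(f"run2/{name}"))
```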

Screenshots

  • Main screen with mic button
  • Search toolbar with inline mic
  • Search results after voice input

Pulsing Mic Animation

Mic icon pulses with concentric ripple rings during recording. Partial transcription updates the search box in real-time.

How to review

This PR is split into 5 commits, best reviewed in order:

  1. Common utilities: SearchTextUtil Arabic normalization logic and InlineVoiceSearchController interface design
  2. ASR engine: AsrEngine/AsrModelManager/AudioRecorder, the sherpa-onnx integration and model download logic
  3. Verse matching: QuranVerseMatcher three-tier algorithm and QuranVerseProvider interface
  4. Voice search UI: Compose screen, Molecule presenter, state management
  5. App integration: how voice search is wired into existing activities, menus, and DI

Architecture

flowchart LR
    subgraph app["App Layer"]
        direction TB
        QA["QuranActivity"]
        SA["SearchActivity"]
        PMV["PulsingMicView"]
    end

    subgraph feature["feature/voicesearch"]
        direction TB

        subgraph ui["UI"]
            VSA["VoiceSearchActivity"]
            VSS["VoiceSearchScreen"]
            VSP["VoiceSearchPresenter"]
        end

        subgraph asr["ASR Pipeline"]
            AR["AudioRecorder"]
            ASR["AsrEngine\n(sherpa-onnx)"]
            AMM["AsrModelManager"]
        end

        subgraph matching["Verse Matching"]
            QVM["QuranVerseMatcher"]
            QVP["QuranVerseProvider"]
        end

        IVSCI["InlineVoiceSearchControllerImpl"]
    end

    subgraph common["Common Modules"]
        direction TB
        IC["InlineVoiceSearchController\n(common/di)"]
        STU["SearchTextUtil\n(common/search)"]
    end

    QA -- launches --> VSA
    QA -- inline mic --> IC
    SA -- inline mic --> IC
    QA -.- PMV
    SA -.- PMV

    VSA --> VSP
    VSA --> VSS
    VSP --> AR
    VSP --> ASR
    VSP --> QVM

    AR -- "16kHz PCM" --> ASR
    AMM -- model --> ASR

    QVM --> QVP
    STU -- normalization --> QVM

    IVSCI -.-> IC
    IVSCI --> AR
    IVSCI --> ASR

New modules

  • common/search: Arabic text normalization (diacritics removal, alif/hamza unification)
  • common/di: InlineVoiceSearchController interface for cross-module voice search contracts
  • feature/voicesearch: the full feature (UI, ASR engine, audio recording, model management, verse matching, DI)

How to test manually

  1. Build and install: ./gradlew installMadaniDebug
  2. Open Settings → enable Voice Search → wait for the ~155 MB model to download (progress shown)
  3. Full-screen voice search: From the main Quran screen, tap the mic icon in the toolbar → recite a verse (e.g. Surah Al-Fatiha) → tap stop → verify the correct verse appears in results → tap a result to navigate
  4. Inline voice search: Open search → tap the mic icon in the search bar → recite → verify transcription appears in the search field in real-time → verify search results update
  5. Edge cases: Try reciting only part of a verse, try a verse from the middle of a surah, try with background noise
  6. Model not ready: Uninstall and reinstall, tap mic before enabling voice search in Settings → verify the "model not ready" message appears

Automated tests

  • SearchTextUtilArabicTest — Arabic normalization and tokenization
  • QuranVerseMatcherTest — Verse matching with exact, fuzzy, and partial scoring
  • AsrModelManagerTest — Model file management, migration, temp file cleanup
  • AudioRecorderBufferTest — Buffer initialization, bounds, sample rate
  • VoiceSearchStateTest — State defaults, copy semantics, enum completeness
  • VoiceSearchPresenterTest — State mapping, event handling, navigation events
  • InlineVoiceSearchControllerImplTest — State transitions, model readiness, recording guards

🤖 Generated with Claude Code

@MahmoodMahmood MahmoodMahmood changed the title WIP: Add on-device voice search Add on-device voice search Mar 8, 2026
@MahmoodMahmood force-pushed the add-voice-search branch 6 times, most recently from 0b4564d to 8a490ff on March 9, 2026 at 14:17
MahmoodMahmood and others added 5 commits March 9, 2026 10:29
- SearchTextUtil: Arabic text normalization (diacritics removal, alif/hamza
  unification, taa marbuta handling) used for fuzzy verse matching
- InlineVoiceSearchController: interface in common/di for cross-module voice
  search contracts, allowing the app module to control inline voice search
  without depending on the feature module directly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On-device Arabic speech recognition using sherpa-onnx (Whisper model):
- AsrEngine: wraps sherpa-onnx OfflineRecognizer for Arabic transcription
- AsrModelManager: downloads and manages the ~155MB Whisper model files
- AudioRecorder: captures 16kHz mono PCM from the microphone

The sherpa-onnx AAR is downloaded automatically at build time from a GitHub
release with SHA-256 verification (see feature/voicesearch/build.gradle.kts).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three-tier matching algorithm:
1. Exact substring match against normalized verse text
2. Normalized fuzzy match with diacritics/alif-hamza unification
3. Word overlap scoring for partial recitations

QuranVerseProvider interface + QuranVerseProviderImpl that loads verses
from the existing translation database for matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Full-screen voice search with Compose UI:
- VoiceSearchScreen: recording animation, transcription display, verse
  match results with sura/ayah navigation
- VoiceSearchPresenter: Molecule-based state management connecting
  AudioRecorder -> AsrEngine -> QuranVerseMatcher pipeline
- VoiceSearchActivity: entry point launched from QuranActivity toolbar
- DI wiring via VoiceSearchComponent (Metro)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- QuranActivity: mic button in toolbar launches full-screen voice search
- SearchActivity: inline mic icon that records, transcribes, and populates
  the search field in real-time with a pulsing animation
- InlineVoiceSearchControllerImpl: bridges ASR engine with the inline mic
  UI, managing recording lifecycle and periodic transcription
- PulsingMicView: animated mic button with concentric ripple rings
- ApplicationModule: DI bindings for voice search components

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MahmoodMahmood (author) left a comment:
Self-review notes to help with context on key design decisions.

outputs.file(aarFile)
onlyIf { !aarFile.exists() }
doLast {
    aarFile.parentFile.mkdirs()

Why sherpa-onnx over Google's on-device Speech API?

  1. No Google Play Services dependency — runs purely on-device with no GMS requirement, important for F-Droid and Huawei users
  2. Custom model support — we use a Whisper model fine-tuned for Quran recitation (tarteel-ai/whisper-base-ar-quran), which is far more accurate for tajweed-style Arabic than Google's general-purpose recognizer
  3. Full offline guarantee — Google's "offline" speech API still phones home; sherpa-onnx is truly air-gapped after initial model download
  4. Privacy — no audio data leaves the device

The tradeoff is the ~38MB native library + ~155MB model download, but this is a one-time cost.


// Remove diacritics (harakat)
result = DIACRITICS_REGEX.replace(result, "")


Arabic normalization strategy:

This handles the core challenge of matching ASR output (which has no diacritics and inconsistent alif/hamza forms) against Quran text (which has full tashkeel). The approach:

  1. Strip all Unicode combining marks (diacritics/tashkeel)
  2. Unify alif variants (أ إ آ ٱ → ا) since ASR output doesn't distinguish them
  3. Normalize taa marbuta (ة → ه) for end-of-word matching

This is intentionally simple — a more sophisticated approach (e.g., morphological analysis) would be overkill since we're matching against known Quran text, not arbitrary Arabic.
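A minimal Python sketch of these three steps (the shipped SearchTextUtil is Kotlin; this is an illustration of the approach, not the actual code):

```python
import re
import unicodedata

def normalize_arabic(text: str) -> str:
    """Normalize Arabic text for fuzzy matching: strip diacritics,
    unify alif variants, and map taa marbuta to haa."""
    # 1. Strip Unicode combining marks (harakat/tashkeel all have a
    #    nonzero canonical combining class)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # 2. Unify alif variants (hamza above/below, madda, wasla)
    text = re.sub("[أإآٱ]", "ا", text)
    # 3. Normalize taa marbuta for end-of-word matching
    return text.replace("ة", "ه")
```

For example, a fully vocalized إِيَّاكَ and the bare ASR output اياك normalize to the same string, so substring matching works across the two.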


if (normalizedQuery.isBlank() || queryWords.isEmpty()) {
    return@withContext emptyList()
}

Three-tier matching algorithm:

  1. Exact substring — if the normalized transcription appears verbatim in a verse, it's a strong match
  2. Normalized fuzzy — after applying Arabic normalization to both sides, check for substring containment
  3. Word overlap scoring — tokenize both strings and compute overlap ratio, useful for partial recitations where the user only recites a few words

The tiers are tried in order and the best match wins. This handles the reality that ASR output is imperfect — users might recite part of a verse, the model might miss a word, etc.
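A toy Python sketch of the tier ordering (the scores and the inline normalizer are illustrative placeholders, not the QuranVerseMatcher implementation):

```python
import re
import unicodedata

def _normalize(text: str) -> str:
    # Minimal normalizer: strip diacritics, unify alif variants, map taa marbuta.
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub("[أإآٱ]", "ا", text).replace("ة", "ه")

def match_verses(query: str, verses: list[tuple[int, str]]) -> list[tuple[int, float]]:
    """Score each verse: 1.0 for an exact substring, 0.9 for a normalized
    substring, else the word-overlap ratio; best scores first."""
    nq = _normalize(query)
    q_words = set(nq.split())
    scored = []
    for verse_id, text in verses:
        nt = _normalize(text)
        if query in text:
            score = 1.0                      # tier 1: exact substring
        elif nq and nq in nt:
            score = 0.9                      # tier 2: normalized fuzzy match
        else:                                # tier 3: word overlap
            score = len(q_words & set(nt.split())) / len(q_words) if q_words else 0.0
        if score > 0:
            scored.append((verse_id, score))
    return sorted(scored, key=lambda p: -p[1])
```

Tier 3 is what makes partial recitations work: a query sharing only two of three words with a verse still surfaces it with a 2/3 score.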


import com.quran.data.di.AppScope
import com.quran.mobile.di.InlineVoiceSearchController
import com.quran.mobile.di.InlineVoiceSearchState

Why a separate InlineVoiceSearchController interface?

This follows the existing pattern in the codebase where common/di holds interfaces and the feature module provides the implementation. This way:

  • SearchActivity and QuranActivity (in app) depend only on the interface
  • The voice search feature module can be swapped out or disabled without touching the app module
  • The implementation bridges ASR → transcription → search field updates with periodic transcription (every ~2 seconds of audio)
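That periodic-transcription loop can be sketched in Python (the recorder/engine interfaces here are hypothetical stand-ins for AudioRecorder and AsrEngine; the real controller is Kotlin):

```python
import time

def inline_transcribe_loop(recorder, engine, on_text, interval_s: float = 2.0):
    """Sketch: accumulate PCM samples and re-transcribe the whole buffer
    roughly every `interval_s` seconds, then once more on stop."""
    buffer = []
    last = time.monotonic()
    while recorder.is_recording():
        buffer.extend(recorder.read())          # 16 kHz mono PCM samples
        now = time.monotonic()
        if now - last >= interval_s:
            on_text(engine.transcribe(buffer))  # partial result -> search field
            last = now
    on_text(engine.transcribe(buffer))          # final pass after manual stop
```

Re-transcribing the full buffer (rather than only the newest chunk) keeps partial results coherent, at the cost of growing inference time for long recordings.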
