- SearchTextUtil: Arabic text normalization (diacritics removal, alif/hamza unification, taa marbuta handling) used for fuzzy verse matching
- InlineVoiceSearchController: interface in common/di for cross-module voice search contracts, allowing the app module to control inline voice search without depending on the feature module directly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

On-device Arabic speech recognition using sherpa-onnx (Whisper model):
- AsrEngine: wraps sherpa-onnx OfflineRecognizer for Arabic transcription
- AsrModelManager: downloads and manages the ~155MB Whisper model files
- AudioRecorder: captures 16kHz mono PCM from the microphone

The sherpa-onnx AAR is downloaded automatically at build time from a GitHub release with SHA-256 verification (see feature/voicesearch/build.gradle.kts).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three-tier matching algorithm:
1. Exact substring match against normalized verse text
2. Normalized fuzzy match with diacritics/alif-hamza unification
3. Word overlap scoring for partial recitations

QuranVerseProvider interface + QuranVerseProviderImpl that loads verses from the existing translation database for matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Full-screen voice search with Compose UI:
- VoiceSearchScreen: recording animation, transcription display, verse match results with sura/ayah navigation
- VoiceSearchPresenter: Molecule-based state management connecting AudioRecorder -> AsrEngine -> QuranVerseMatcher pipeline
- VoiceSearchActivity: entry point launched from QuranActivity toolbar
- DI wiring via VoiceSearchComponent (Metro)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- QuranActivity: mic button in toolbar launches full-screen voice search
- SearchActivity: inline mic icon that records, transcribes, and populates the search field in real-time with a pulsing animation
- InlineVoiceSearchControllerImpl: bridges ASR engine with the inline mic UI, managing recording lifecycle and periodic transcription
- PulsingMicView: animated mic button with concentric ripple rings
- ApplicationModule: DI bindings for voice search components

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MahmoodMahmood
left a comment
Self-review notes to help with context on key design decisions.
outputs.file(aarFile)
onlyIf { !aarFile.exists() }
doLast {
    aarFile.parentFile.mkdirs()
Why sherpa-onnx over Google's on-device Speech API?
- No Google Play Services dependency — runs purely on-device with no GMS requirement, important for F-Droid and Huawei users
- Custom model support — we use a Whisper model fine-tuned for Quran recitation (tarteel-ai/whisper-base-ar-quran), which is far more accurate for tajweed-style Arabic than Google's general-purpose recognizer
- Full offline guarantee — Google's "offline" speech API still phones home; sherpa-onnx is truly air-gapped after initial model download
- Privacy — no audio data leaves the device
The tradeoff is the ~38MB native library + ~155MB model download, but this is a one-time cost.
// Remove diacritics (harakat)
result = DIACRITICS_REGEX.replace(result, "")
Arabic normalization strategy:
This handles the core challenge of matching ASR output (which has no diacritics and inconsistent alif/hamza forms) against Quran text (which has full tashkeel). The approach:
- Strip all Unicode combining marks (diacritics/tashkeel)
- Unify alif variants (أ إ آ ٱ → ا) since ASR output doesn't distinguish them
- Normalize taa marbuta (ة → ه) for end-of-word matching
This is intentionally simple — a more sophisticated approach (e.g., morphological analysis) would be overkill since we're matching against known Quran text, not arbitrary Arabic.
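The three normalization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual SearchTextUtil code: the function name `normalizeArabic` and both regex constants are hypothetical, and the real implementation may cover more characters.

```kotlin
// Hypothetical sketch of the normalization steps; names are illustrative only.

// Every Unicode combining mark (general category M), which includes Arabic harakat.
val COMBINING_MARKS = Regex("\\p{M}+")

// Alif variants named in the comment: أ (U+0623), إ (U+0625), آ (U+0622), ٱ (U+0671).
val ALIF_VARIANTS = Regex("[\u0623\u0625\u0622\u0671]")

fun normalizeArabic(text: String): String {
    // Step 1: strip diacritics/tashkeel.
    var result = COMBINING_MARKS.replace(text, "")
    // Step 2: unify alif variants to bare alif ا (U+0627).
    result = ALIF_VARIANTS.replace(result, "\u0627")
    // Step 3: taa marbuta ة (U+0629) -> haa ه (U+0647); it only occurs word-finally.
    result = result.replace('\u0629', '\u0647')
    return result.trim()
}
```

With this sketch, `normalizeArabic("بِسْمِ ٱللَّهِ")` yields `بسم الله`, which is the form an ASR transcription would typically take.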
if (normalizedQuery.isBlank() || queryWords.isEmpty()) {
    return@withContext emptyList()
}
Three-tier matching algorithm:
- Exact substring — if the normalized transcription appears verbatim in a verse, it's a strong match
- Normalized fuzzy — after applying Arabic normalization to both sides, check for substring containment
- Word overlap scoring — tokenize both strings and compute overlap ratio, useful for partial recitations where the user only recites a few words
The tiers are tried in order and the best match wins. This handles the reality that ASR output is imperfect — users might recite part of a verse, the model might miss a word, etc.
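One way to picture the three tiers is the sketch below. It is an assumption-laden illustration rather than the QuranVerseMatcher source: the `Verse` and `VerseMatch` types, the `matchVerses` function, the tier scores (1.0, 0.9, and a scaled overlap ratio), and the `minOverlap` threshold are all hypothetical. The normalizer is passed in as a parameter so the block stays self-contained.

```kotlin
// Hypothetical types and scores; only the three-tier idea comes from the PR.
data class Verse(val sura: Int, val ayah: Int, val text: String)
data class VerseMatch(val verse: Verse, val score: Double)

fun matchVerses(
    query: String,
    verses: List<Verse>,
    normalize: (String) -> String,
    minOverlap: Double = 0.5
): List<VerseMatch> {
    val whitespace = Regex("\\s+")
    val normalizedQuery = normalize(query).trim()
    val queryWords = normalizedQuery.split(whitespace).filter { it.isNotEmpty() }
    if (normalizedQuery.isBlank() || queryWords.isEmpty()) return emptyList()

    return verses.mapNotNull { verse ->
        val normalizedVerse = normalize(verse.text)
        val score = when {
            // Tier 1: the transcription appears verbatim in the verse text.
            verse.text.contains(query.trim()) -> 1.0
            // Tier 2: substring containment after normalizing both sides.
            normalizedVerse.contains(normalizedQuery) -> 0.9
            // Tier 3: fraction of query words that also occur in the verse.
            else -> {
                val verseWords = normalizedVerse.split(whitespace).toSet()
                val overlap = queryWords.count { it in verseWords }.toDouble() / queryWords.size
                if (overlap >= minOverlap) overlap * 0.8 else 0.0
            }
        }
        if (score > 0.0) VerseMatch(verse, score) else null
    }.sortedByDescending { it.score }
}
```

Scaling the overlap ratio below the substring scores keeps a partial recitation from outranking a verse that contains the whole transcription, which is one reading of "the best match wins".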
import com.quran.data.di.AppScope
import com.quran.mobile.di.InlineVoiceSearchController
import com.quran.mobile.di.InlineVoiceSearchState
Why a separate InlineVoiceSearchController interface?
This follows the existing pattern in the codebase where common/di holds interfaces and the feature module provides the implementation. This way:
- SearchActivity and QuranActivity (in app) depend only on the interface
- The voice search feature module can be swapped out or disabled without touching the app module
- The implementation bridges ASR → transcription → search field updates with periodic transcription (every ~2 seconds of audio)
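The periodic transcription mentioned in the last point could work along these lines. This is a sketch under assumptions: the class `PeriodicTranscriptionBuffer` and its `transcribe` callback are hypothetical, and re-transcribing the whole recording on each tick is a guess at the behavior, not something the PR states. The 16 kHz sample rate and ~2 second interval come from the PR text, giving 32,000 samples per tick.

```kotlin
// Hypothetical sketch; not the InlineVoiceSearchControllerImpl source.
const val SAMPLE_RATE_HZ = 16_000          // AudioRecorder captures 16kHz mono PCM
const val TRANSCRIBE_INTERVAL_SECONDS = 2  // "every ~2 seconds of audio"

class PeriodicTranscriptionBuffer(
    private val samplesPerTick: Int = SAMPLE_RATE_HZ * TRANSCRIBE_INTERVAL_SECONDS,
    private val transcribe: (ShortArray) -> String
) {
    private var samples = ShortArray(0)
    private var samplesAtLastTick = 0

    /** Appends a PCM chunk; returns a new partial transcription when one is due, else null. */
    fun append(chunk: ShortArray): String? {
        samples += chunk
        if (samples.size - samplesAtLastTick < samplesPerTick) return null
        samplesAtLastTick = samples.size
        // Assumption: transcribe everything recorded so far, so earlier words can be revised.
        return transcribe(samples)
    }
}
```

A caller would feed each microphone chunk to `append` and push any non-null result into the search field.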
Summary
Add on-device voice search that lets users recite a Quran verse and navigate to its location. Uses sherpa-onnx for offline speech recognition with a Whisper model fine-tuned for Quran recitation.
Key features
APK size impact
The sherpa-onnx native libraries add ~5 MB to the debug APK (all ABIs). The ~155 MB ASR model is not bundled in the APK — it's downloaded on-demand when the user first enables voice search in Settings.
Model download UX
When the user enables voice search in Settings, the ~155 MB Whisper model is downloaded in the background. A progress indicator shows download status. The download only starts on user action (never automatic), and the user can cancel at any time. If the model isn't downloaded yet and the user taps the mic, they see a message directing them to Settings.
Note on sherpa-onnx AAR
sherpa-onnx provides the on-device speech recognition runtime (native C++ libraries + Kotlin API). Since k2-fsa does not publish an official Android AAR to Maven Central, the pre-built AAR (~38 MB) is hosted on this fork's GitHub releases and downloaded automatically at build time by a Gradle task with SHA-256 checksum verification. No binary is checked into the repository.
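The checksum step of that download task might look roughly like the following. This is a generic sketch using the JDK's `MessageDigest`, not the code in feature/voicesearch/build.gradle.kts; the helper names `sha256Hex` and `verifyOrDelete` are hypothetical.

```kotlin
import java.io.File
import java.security.MessageDigest

// Streams the file through SHA-256 and returns the digest as lowercase hex.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(8192)
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Returns true when the checksum matches; otherwise deletes the file so that
// a later build re-downloads it instead of reusing a corrupt artifact.
fun verifyOrDelete(file: File, expectedSha256: String): Boolean {
    val ok = sha256Hex(file).equals(expectedSha256, ignoreCase = true)
    if (!ok) file.delete()
    return ok
}
```

Deleting on mismatch matters here because the task in the PR is skipped when the AAR already exists (`onlyIf { !aarFile.exists() }`), so a bad file left on disk would never be replaced.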
Note on Voice Activity Detection (VAD) — why recording uses manual stop
We initially implemented VAD using Silero VAD to auto-detect when the user stops reciting. However, tajweed rules produce sustained quiet sounds (madd, ghunnah) that the VAD model misclassified as silence, causing premature auto-stop mid-verse. Natural pauses between verses (waqf) of 2–5 seconds further complicated detection. Despite extensive tuning, the false-stop rate was too high for a good UX, so recording now uses manual stop only.
How to regenerate the ASR model files
The ASR model is converted from tarteel-ai/whisper-base-ar-quran (HuggingFace) to sherpa-onnx ONNX format with INT8 quantization. This produces three files: base-ar-quran-encoder.int8.onnx, base-ar-quran-decoder.int8.onnx, and base-ar-quran-tokens.txt.
Prerequisites
Phase 1: Convert HuggingFace model to OpenAI Whisper format
The HuggingFace Transformers Whisper format uses different layer naming than the original OpenAI Whisper format. Create convert_hf_to_openai.py:
Phase 2: Export to ONNX with INT8 quantization
git clone --depth 1 --branch v1.12.25 https://github.com/k2-fsa/sherpa-onnx.git
cd sherpa-onnx/scripts/whisper
Edit export-onnx.py to add the custom model:
Add "base-ar-quran" to the choices list in get_args():
Add a case to the load_model() function (before the final else):
Then run:
cp /path/to/base-ar-quran.pt .
python export-onnx.py --model base-ar-quran
This produces:
- base-ar-quran-encoder.int8.onnx (~28 MB)
- base-ar-quran-decoder.int8.onnx (~125 MB)
- base-ar-quran-tokens.txt (~847 KB)
Screenshots
Pulsing Mic Animation
Mic icon pulses with concentric ripple rings during recording. Partial transcription updates the search box in real-time.
How to review
This PR is split into 5 commits, best reviewed in order:
- SearchTextUtil Arabic normalization logic and InlineVoiceSearchController interface design
- AsrEngine / AsrModelManager / AudioRecorder: the sherpa-onnx integration and model download logic
- QuranVerseMatcher three-tier algorithm and QuranVerseProvider interface
Architecture
flowchart LR
    subgraph app["App Layer"]
        direction TB
        QA["QuranActivity"]
        SA["SearchActivity"]
        PMV["PulsingMicView"]
    end
    subgraph feature["feature/voicesearch"]
        direction TB
        subgraph ui["UI"]
            VSA["VoiceSearchActivity"]
            VSS["VoiceSearchScreen"]
            VSP["VoiceSearchPresenter"]
        end
        subgraph asr["ASR Pipeline"]
            AR["AudioRecorder"]
            ASR["AsrEngine\n(sherpa-onnx)"]
            AMM["AsrModelManager"]
        end
        subgraph matching["Verse Matching"]
            QVM["QuranVerseMatcher"]
            QVP["QuranVerseProvider"]
        end
        IVSCI["InlineVoiceSearchControllerImpl"]
    end
    subgraph common["Common Modules"]
        direction TB
        IC["InlineVoiceSearchController\n(common/di)"]
        STU["SearchTextUtil\n(common/search)"]
    end
    QA -- launches --> VSA
    QA -- inline mic --> IC
    SA -- inline mic --> IC
    QA -.- PMV
    SA -.- PMV
    VSA --> VSP
    VSA --> VSS
    VSP --> AR
    VSP --> ASR
    VSP --> QVM
    AR -- "16kHz PCM" --> ASR
    AMM -- model --> ASR
    QVM --> QVP
    STU -- normalization --> QVM
    IVSCI -.-> IC
    IVSCI --> AR
    IVSCI --> ASR
New modules
- common/search
- common/di: InlineVoiceSearchController interface for cross-module voice search contracts
- feature/voicesearch
How to test manually
./gradlew installMadaniDebug
Automated tests
- SearchTextUtilArabicTest: Arabic normalization and tokenization
- QuranVerseMatcherTest: Verse matching with exact, fuzzy, and partial scoring
- AsrModelManagerTest: Model file management, migration, temp file cleanup
- AudioRecorderBufferTest: Buffer initialization, bounds, sample rate
- VoiceSearchStateTest: State defaults, copy semantics, enum completeness
- VoiceSearchPresenterTest: State mapping, event handling, navigation events
- InlineVoiceSearchControllerImplTest: State transitions, model readiness, recording guards

🤖 Generated with Claude Code