Merged
106 changes: 106 additions & 0 deletions .claude/skills/release/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
---
name: release
description: Release flutter_gemma — rebuild JAR, update all version numbers, checksums, CHANGELOG, upload to GitHub release
user_invocable: true
---

# Flutter Gemma Release

Complete release checklist for flutter_gemma plugin. Run as `/release <version>` (e.g. `/release 0.14.0`).

## Pre-flight

Before starting, verify you're on the correct branch and all changes are committed:
```bash
git status
git log --oneline -5
```

## Step 1: Update version numbers

All files that contain the version:

| File | Variable/Field | Example |
|------|---------------|---------|
| `pubspec.yaml` | `version:` | `version: <VERSION>` |
| `ios/flutter_gemma.podspec` | `s.version` | `s.version = '<VERSION>'` |
| `litertlm-server/build.gradle.kts` | `version =` | `version = "<VERSION>"` |
| `CLAUDE.md` | `Current Version:` | `- **Current Version**: <VERSION>` |
| `macos/scripts/setup_desktop.sh:61` | `JAR_VERSION=` | `JAR_VERSION="<VERSION>"` |
| `macos/scripts/prepare_resources.sh:42` | `JAR_VERSION=` | `JAR_VERSION="<VERSION>"` |
| `linux/scripts/setup_desktop.sh:62` | `JAR_VERSION=` | `JAR_VERSION="<VERSION>"` |
| `windows/scripts/setup_desktop.ps1:90` | `$JarVersion =` | `$JarVersion = "<VERSION>"` |

> JAR_URL is derived automatically from JAR_VERSION in all scripts; no separate update is needed.
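The table above can be scripted with `sed`. A minimal sketch of the replacement pattern, demonstrated on a scratch file (the exact expression for each file must match the syntax shown in the table):

```shell
#!/usr/bin/env bash
set -euo pipefail
VERSION="0.14.0"   # placeholder release version

# Demonstrate on a scratch file; in the real release, run one sed per file
tmp=$(mktemp)
printf 'version: 0.13.1\n' > "$tmp"
# -i.bak writes a backup and works with both GNU and BSD sed
sed -i.bak "s/^version: .*/version: ${VERSION}/" "$tmp"
cat "$tmp"          # version: 0.14.0
rm -f "$tmp" "$tmp.bak"
```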

## Step 2: Update CHANGELOG.md

Add a new section at the top covering all changes. Categories: features, fixes, breaking changes.
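Prepending the section can be sketched as follows (the demo uses a scratch directory and placeholder entries; run the same pipeline against the real CHANGELOG.md):

```shell
#!/usr/bin/env bash
set -euo pipefail
cd "$(mktemp -d)"
printf '## 0.13.1\n- old entry\n' > CHANGELOG.md   # stand-in for the real file

VERSION="0.14.0"
tmp=$(mktemp)
# New section goes above the existing history
{ printf '## %s\n- **New**: placeholder entry\n\n' "$VERSION"; cat CHANGELOG.md; } > "$tmp"
mv "$tmp" CHANGELOG.md
head -1 CHANGELOG.md   # ## 0.14.0
```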

## Step 3: Build JAR

```bash
cd litertlm-server && ./gradlew fatJar
```

Verify build success. JAR output: `litertlm-server/build/libs/litertlm-server-<VERSION>-all.jar`

## Step 4: Compute new SHA256

```bash
shasum -a 256 litertlm-server/build/libs/litertlm-server-*-all.jar
```

## Step 5: Update JAR checksums in all 4 scripts

| File | Variable |
|------|----------|
| `macos/scripts/setup_desktop.sh:63` | `JAR_CHECKSUM="<sha256>"` |
| `macos/scripts/prepare_resources.sh:44` | `JAR_CHECKSUM="<sha256>"` |
| `linux/scripts/setup_desktop.sh:64` | `JAR_CHECKSUM="<sha256>"` |
| `windows/scripts/setup_desktop.ps1:92` | `$JarChecksum = "<sha256>"` |

The JAR is cross-platform (JVM bytecode), so the checksum is identical on every platform.
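A quick way to confirm the four scripts agree after editing: extract every 64-hex-digit value and count the distinct ones (demonstrated with stand-in files; point the paths at the real scripts):

```shell
#!/usr/bin/env bash
set -euo pipefail
d=$(mktemp -d)
sha="0000000000000000000000000000000000000000000000000000000000000000"
printf 'JAR_CHECKSUM="%s"\n' "$sha" > "$d/setup_desktop.sh"
printf '$JarChecksum = "%s"\n' "$sha" > "$d/setup_desktop.ps1"

# Expect exactly one distinct checksum across all scripts
n=$(grep -hoE '[0-9a-f]{64}' "$d"/* | sort -u | wc -l | tr -d ' ')
echo "$n"   # 1
```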

## Step 6: Verify

```bash
flutter analyze # 0 errors
flutter test # all pass
dart pub publish --dry-run # 0 warnings
```

**NEVER publish without dry-run first.** Publishing is IRREVERSIBLE.

## Step 7: Create/update GitHub release

```bash
# Create new release
gh release create v<VERSION> \
litertlm-server/build/libs/litertlm-server-<VERSION>-all.jar \
--title "v<VERSION>" \
--notes-file CHANGELOG_EXCERPT.md

# OR update existing release (delete the old JAR first; the asset name must match the download URL verified below)
gh release delete-asset v<VERSION> litertlm-server.jar --yes 2>/dev/null
gh release upload v<VERSION> litertlm-server/build/libs/litertlm-server-<VERSION>-all.jar
```

Verify the download URL (GitHub serves release assets through a redirect, so follow it; the final status should be 200):
```bash
curl -s -o /dev/null -w "%{http_code}\n" -L "https://github.com/DenisovAV/flutter_gemma/releases/download/v<VERSION>/litertlm-server.jar"
```

## Step 8: Commit & PR

- Author: `--author="Sasha Denisov <denisov.shureg@gmail.com>"`
- No AI attribution in commit messages
- No "Co-Authored-By" or "Generated with Claude" footers
- Create PR via `gh pr create`
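The commit rules above can be sketched as follows (demonstrated in a throwaway repo; branch and message are placeholders, and `gh pr create` must be run from the real repo afterwards):

```shell
#!/usr/bin/env bash
set -euo pipefail
cd "$(mktemp -d)" && git init -q .

# Plain release message: no AI attribution, no Co-Authored-By footers
git -c user.name="CI" -c user.email="ci@example.com" commit -q --allow-empty \
  --author="Sasha Denisov <denisov.shureg@gmail.com>" \
  -m "Release 0.14.0"

git log -1 --format='%an <%ae>'   # Sasha Denisov <denisov.shureg@gmail.com>
# Then: gh pr create --title "Release 0.14.0" --fill
```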

## Step 9: After merge — publish

```bash
dart pub publish --dry-run # verify one more time
dart pub publish # only after user approval!
```
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,9 @@
## 0.13.1
- **LiteRT-LM 0.10.0**: Updated Android and JVM SDK from 0.9.0 to 0.10.0
- **Gemma 4 Thinking Mode**: `isThinking: true` now works with Gemma 4 E2B/E4B models (Android, iOS, Desktop; not Web)
- **Fix cancel download**: Cancel download now works correctly (#196)
- **Fix `large_file_handler` platform support**: Conditional imports for pub.dev platform analysis compatibility

## 0.13.0
- **Gemma 4 E2B/E4B**: Added support for next-gen multimodal models (text + image + audio)
- **systemInstruction**: New parameter in `createChat()` and `createSession()` for setting system-level context
16 changes: 10 additions & 6 deletions CLAUDE.md
@@ -69,7 +69,7 @@ final token = const String.fromEnvironment('HF_TOKEN');
- 🔥 **Local AI Inference** - Run Gemma models directly on device
- 🖼️ **Multimodal Support** - Text + Image input with Gemma 3 Nano
- 🛠️ **Function Calling** - Enable models to call external functions
- 🧠 **Thinking Mode** - View reasoning process of DeepSeek models
- 🧠 **Thinking Mode** - View reasoning process of DeepSeek and Gemma 4 models
- 📱 **Cross-Platform** - Android, iOS, Web, macOS, Windows, Linux
- ⚡ **GPU Acceleration** - Hardware-accelerated inference
- 🔧 **LoRA Support** - Efficient fine-tuning weights
@@ -401,6 +401,8 @@ Future<void> close() async {

| Model Family | Function Calling | Thinking Mode | Multimodal | Platform Support |
|--------------|------------------|---------------|------------|------------------|
| Gemma 4 E2B | ✅ | ✅ ¹ | ✅ | Android, iOS, Web, Desktop |
| Gemma 4 E4B | ✅ | ✅ ¹ | ✅ | Android, iOS, Web, Desktop |
| Gemma 3 Nano | ✅ | ❌ | ✅ | Android, iOS, Web |
| Gemma 3 270M | ❌ | ❌ | ❌ | Android, iOS, Web |
| Gemma-3 1B | ✅ | ❌ | ❌ | Android, iOS, Web |
@@ -411,6 +413,8 @@ Future<void> close() async {
| Qwen2.5 | ✅ | ❌ | ❌ | Android, iOS, Web |
| Phi-4 | ❌ | ❌ | ❌ | Android, iOS, Web |

> ¹ Thinking Mode for Gemma 4: Android, iOS, Desktop only. Web (MediaPipe) does not support `extraContext`.

### Platform Limitations

| Platform | Vision/Multimodal | Audio | Embeddings | Notes |
@@ -457,10 +461,10 @@ dev_dependencies:

### MediaPipe GenAI Integration

- **Current Version Web**: v0.10.26
- **Current Version Web**: v0.10.27
- **Current Version Android**: v0.10.33
- **Current Version iOS**: v0.10.33
- **Web CDN**: `https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.26`
- **Web CDN**: `https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.27`
- **iOS/Android**: Integrated via CocoaPods/Gradle

## Development Best Practices
@@ -623,7 +627,7 @@ Log.w(TAG, "sizeInTokens: LiteRT-LM does not support token counting. " +

**Dependency (build.gradle):**
```gradle
implementation 'com.google.ai.edge.litertlm:litertlm-android:0.9.0-beta'
implementation 'com.google.ai.edge.litertlm:litertlm-android:0.10.0'
```

**Usage (Dart - no changes required):**
@@ -642,7 +646,7 @@ await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
```html
<!-- index.html -->
<script type="module">
import { FilesetResolver, LlmInference } from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.26';
import { FilesetResolver, LlmInference } from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.27';
window.FilesetResolver = FilesetResolver;
window.LlmInference = LlmInference;
</script>
@@ -1243,7 +1247,7 @@ flutter_gemma/

- **GitHub**: https://github.com/DenisovAV/flutter_gemma
- **Pub.dev**: https://pub.dev/packages/flutter_gemma
- **Current Version**: 0.13.0
- **Current Version**: 0.13.1
- **License**: Check repository for license details
- **Issues**: Report bugs via GitHub Issues
- **Changelog**: See `CHANGELOG.md` for version history
37 changes: 25 additions & 12 deletions README.md
@@ -8,7 +8,7 @@

**The plugin supports not only Gemma, but also other models. Here's the full list of supported models:** [Gemma 4 E2B/E4B](https://huggingface.co/google/gemma-4-E2B-it-litert-lm), [Gemma3n E2B/E4B](https://huggingface.co/google/gemma-3n-E2B-it-litert-preview), [FastVLM 0.5B](https://huggingface.co/litert-community/FastVLM-0.5B), [Gemma-3 1B](https://huggingface.co/litert-community/Gemma3-1B-IT), [Gemma 3 270M](https://huggingface.co/litert-community/gemma-3-270m-it), [FunctionGemma 270M](https://huggingface.co/sasha-denisov/function-gemma-270M-it), [Qwen3 0.6B](https://huggingface.co/litert-community/Qwen3-0.6B), [Qwen 2.5](https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct), [Phi-4 Mini](https://huggingface.co/litert-community/Phi-4-mini-instruct), [DeepSeek R1](https://huggingface.co/litert-community/DeepSeek-R1-Distill-Qwen-1.5B), [SmolLM 135M](https://huggingface.co/litert-community/SmolLM-135M-Instruct).

*Note: The flutter_gemma plugin supports Gemma3n (with **multimodal vision and audio support**), FastVLM (vision), Gemma-3, FunctionGemma, Qwen3, Qwen 2.5, Phi-4, DeepSeek R1 and SmolLM. Desktop platforms (macOS, Windows, Linux) require `.litertlm` model format.
*Note: The flutter_gemma plugin supports Gemma 4 and Gemma3n (with **multimodal vision and audio support**), FastVLM (vision), Gemma-3, FunctionGemma, Qwen3, Qwen 2.5, Phi-4, DeepSeek R1 and SmolLM. Desktop platforms (macOS, Windows, Linux) require `.litertlm` model format.

[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the art open models built from the same research and technology used to create the Gemini models

@@ -32,7 +32,7 @@ There is an example of using:
- **🖼️ Multimodal Support:** Text + Image input with Gemma3n vision models
- **🎙️ Audio Input:** Record and send audio messages with Gemma3n E2B/E4B models (Android, Desktop - LiteRT-LM engine)
- **🛠️ Function Calling:** Enable your models to call external functions and integrate with other services (supported by select models)
- **🧠 Thinking Mode:** View the reasoning process of DeepSeek models with <think> blocks
- **🧠 Thinking Mode:** View the reasoning process of DeepSeek and Gemma 4 models with thinking blocks
- **🛑 Stop Generation:** Cancel text generation mid-process on Android, Web, and Desktop
- **⚙️ Backend Switching:** Choose between CPU and GPU backends for each model individually in the example app
- **🔍 Advanced Model Filtering:** Filter models by features (Multimodal, Function Calls, Thinking) with expandable UI
@@ -72,8 +72,8 @@ The example app offers a curated list of models, each suited for different tasks

| Model Family | Best For | Function Calling | Thinking Mode | Vision | Languages | Size |
|---|---|:---:|:---:|:---:|---|---|
| **Gemma 4 E2B** | Next-gen multimodal chat — text, image, audio | ✅ | ❌ | ✅ | Multilingual | 2.4GB |
| **Gemma 4 E4B** | Next-gen multimodal chat — text, image, audio | ✅ | ❌ | ✅ | Multilingual | 4.3GB |
| **Gemma 4 E2B** | Next-gen multimodal chat — text, image, audio | ✅ | ✅ | ✅ | Multilingual | 2.4GB |
| **Gemma 4 E4B** | Next-gen multimodal chat — text, image, audio | ✅ | ✅ | ✅ | Multilingual | 4.3GB |
| **Gemma3n** | On-device multimodal chat and image analysis | ✅ | ❌ | ✅ | Multilingual | 3-6GB |
| **FastVLM 0.5B** | Fast vision-language inference | ❌ | ❌ | ✅ | Multilingual | 0.5GB |
| **Phi-4 Mini** | Advanced reasoning and instruction following | ✅ | ❌ | ❌ | Multilingual | 3.9GB |
@@ -1544,11 +1544,11 @@ FunctionGemma uses a special format (different from JSON-based function calling)

The `flutter_gemma` plugin handles this format automatically via `FunctionCallParser`.

9. **🧠 Thinking Mode (DeepSeek Models)**
9. **🧠 Thinking Mode (DeepSeek & Gemma 4 Models)**

DeepSeek models support "thinking mode" where you can see the model's reasoning process before it generates the final response. This provides transparency into how the model approaches problems.
DeepSeek and Gemma 4 (E2B/E4B) models support "thinking mode" where you can see the model's reasoning process before it generates the final response. This provides transparency into how the model approaches problems.

**Enable Thinking Mode:**
**Enable Thinking Mode (DeepSeek):**

```dart
final chat = await inferenceModel.createChat(
@@ -1559,7 +1559,6 @@ final chat = await inferenceModel.createChat(
modelType: ModelType.deepSeek, // Required for DeepSeek models
supportsFunctionCalls: true, // DeepSeek also supports function calls
tools: _tools, // Optional: add tools for function calling
// tokenBuffer: 256, // Token buffer for context management
);
```

@@ -1586,12 +1585,25 @@ chat.generateChatResponseAsync().listen((response) {
});
```

**Enable Thinking Mode (Gemma 4):**

```dart
final chat = await inferenceModel.createChat(
temperature: 1.0,
topK: 64,
topP: 0.95,
isThinking: true, // Enable thinking mode
modelType: ModelType.gemmaIt, // Gemma 4 E2B/E4B
);
// <|think|> is auto-injected into systemInstruction — no manual prompt needed.
```

**Thinking Mode Features:**
- ✅ **Transparent Reasoning**: See how the model thinks through problems
- ✅ **Interactive UI**: Show/hide thinking bubbles with expandable content
- ✅ **Streaming Support**: Thinking content streams in real-time
- ✅ **Function Integration**: Models can think before calling functions
- ✅ **DeepSeek Optimized**: Designed specifically for DeepSeek model architecture
- ✅ **Supported Models**: DeepSeek R1 and Gemma 4 E2B/E4B

**Example Thinking Flow:**
1. User asks: "Change the background to blue and explain why blue is calming"
@@ -2096,7 +2108,7 @@ Function calling is currently supported by the following models:
| **Image Input (Multimodal)** | ✅ Full | ✅ Full | ✅ Full | ⚠️ Broken (#684) | macOS: model hallucinates |
| **Audio Input** | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | Gemma3n E2B/E4B |
| **Function Calling** | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | LiteRT-LM limitation |
| **Thinking Mode** | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | DeepSeek models |
| **Thinking Mode** | ✅ Full | ✅ Full | ✅ Full | ✅ Full | DeepSeek & Gemma 4 |
| **Stop Generation** | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Cancel mid-process |
| **GPU Acceleration** | ✅ Full | ✅ Full | ✅ Full | ⚠️ Partial | macOS GPU broken |
| **NPU Acceleration** | ✅ Full | ❌ Not supported | ❌ Not supported | ❌ Not supported | Android only (.litertlm) |
@@ -2264,13 +2276,14 @@ import 'package:flutter_gemma/core/extensions.dart';

// Clean response based on model type
String cleanedResponse = ModelThinkingFilter.cleanResponse(
  rawResponse,
ModelType.deepSeek
);

// The filter automatically removes model-specific tokens like:
// - <end_of_turn> tags (Gemma models)
// - Special DeepSeek tokens
// - <think>...</think> blocks (DeepSeek)
// - <|channel>thought\n...<channel|> blocks (Gemma 4 E2B/E4B)
// - Extra whitespace and formatting
```

2 changes: 1 addition & 1 deletion android/build.gradle
@@ -76,7 +76,7 @@ dependencies {
implementation 'org.jetbrains.kotlinx:kotlinx-coroutines-guava:1.9.0'

// LiteRT-LM Engine for .litertlm model files
implementation 'com.google.ai.edge.litertlm:litertlm-android:0.9.0'
implementation 'com.google.ai.edge.litertlm:litertlm-android:0.10.0'

implementation 'androidx.core:core-ktx:1.12.0'
implementation 'androidx.lifecycle:lifecycle-runtime-ktx:2.7.0'
@@ -82,8 +82,8 @@ private class PlatformServiceImpl(
private val engineLock = Any() // Lock for thread-safe engine access

// NEW: Use InferenceEngine abstraction instead of InferenceModel
private var engine: InferenceEngine? = null
private var session: InferenceSession? = null
@Volatile private var engine: InferenceEngine? = null
@Volatile private var session: InferenceSession? = null

// RAG components
private var embeddingModel: EmbeddingModel? = null
@@ -130,6 +130,9 @@

// Only now clear old state and swap in new engine (thread-safe)
synchronized(engineLock) {
// Cancel stale stream collector before replacing engine
streamJob?.cancel()
streamJob = null
session?.cancelGeneration()
try {
session?.close()
@@ -176,6 +179,7 @@
enableVisionModality: Boolean?,
enableAudioModality: Boolean?,
systemInstruction: String?,
enableThinking: Boolean?,
callback: (Result<Unit>) -> Unit
) {
scope.launch {
Expand All @@ -193,6 +197,7 @@ private class PlatformServiceImpl(
enableVisionModality = enableVisionModality,
enableAudioModality = enableAudioModality,
systemInstruction = systemInstruction,
enableThinking = enableThinking ?: false,
)

session?.close()
@@ -203,7 +203,7 @@ private open class PigeonInterfacePigeonCodec : StandardMessageCodec() {
interface PlatformService {
fun createModel(maxTokens: Long, modelPath: String, loraRanks: List<Long>?, preferredBackend: PreferredBackend?, maxNumImages: Long?, supportAudio: Boolean?, callback: (Result<Unit>) -> Unit)
fun closeModel(callback: (Result<Unit>) -> Unit)
fun createSession(temperature: Double, randomSeed: Long, topK: Long, topP: Double?, loraPath: String?, enableVisionModality: Boolean?, enableAudioModality: Boolean?, systemInstruction: String?, callback: (Result<Unit>) -> Unit)
fun createSession(temperature: Double, randomSeed: Long, topK: Long, topP: Double?, loraPath: String?, enableVisionModality: Boolean?, enableAudioModality: Boolean?, systemInstruction: String?, enableThinking: Boolean?, callback: (Result<Unit>) -> Unit)
fun closeSession(callback: (Result<Unit>) -> Unit)
fun sizeInTokens(prompt: String, callback: (Result<Long>) -> Unit)
fun addQueryChunk(prompt: String, callback: (Result<Unit>) -> Unit)
@@ -315,7 +315,8 @@ interface PlatformService {
val enableVisionModalityArg = args[5] as Boolean?
val enableAudioModalityArg = args[6] as Boolean?
val systemInstructionArg = args[7] as String?
api.createSession(temperatureArg, randomSeedArg, topKArg, topPArg, loraPathArg, enableVisionModalityArg, enableAudioModalityArg, systemInstructionArg) { result: Result<Unit> ->
val enableThinkingArg = args[8] as Boolean?
api.createSession(temperatureArg, randomSeedArg, topKArg, topPArg, loraPathArg, enableVisionModalityArg, enableAudioModalityArg, systemInstructionArg, enableThinkingArg) { result: Result<Unit> ->
val error = result.exceptionOrNull()
if (error != null) {
reply.reply(wrapError(error))
@@ -28,6 +28,7 @@ data class SessionConfig(
val enableVisionModality: Boolean? = null,
val enableAudioModality: Boolean? = null,
val systemInstruction: String? = null,
val enableThinking: Boolean = false,
)

/**