diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/01.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/01.png
new file mode 100644
index 0000000000..98a272f84b
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/01.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/02.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/02.png
new file mode 100644
index 0000000000..d0b8df7cb0
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/02.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/03.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/03.png
new file mode 100644
index 0000000000..80e41973f2
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/03.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/04.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/04.png
new file mode 100644
index 0000000000..d098da4e1a
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/04.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/05.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/05.png
new file mode 100644
index 0000000000..8fa7609f69
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/05.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/06.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/06.png
new file mode 100644
index 0000000000..a78e5ee6f7
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/06.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/07.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/07.png
new file mode 100644
index 0000000000..5993f29b22
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/07.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/08.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/08.png
new file mode 100644
index 0000000000..a01e883efc
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/08.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/09.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/09.png
new file mode 100644
index 0000000000..64d714c262
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/09.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/10.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/10.png
new file mode 100644
index 0000000000..571783c51e
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/10.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
new file mode 100644
index 0000000000..71c32dd2e8
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
@@ -0,0 +1,52 @@
+---
+title: "Halide Essentials: From Basics to Android Integration"
+minutes_to_complete: 180
+
+who_is_this_for: This is an introductory topic for software developers interested in learning how to use Halide for image processing.
+
+learning_objectives:
+ - Understand foundational concepts of Halide and set up your development environment.
+ - Create a basic real-time image processing pipeline using Halide.
+ - Optimize image processing workflows by applying operation fusion in Halide.
+ - Integrate Halide pipelines into Android applications developed with Kotlin.
+
+prerequisites:
+ - Basic C++ knowledge
+ - Basic programming knowledge
+ - Android Studio with Android Emulator
+
+author: Dawid Borycki
+
+### Tags
+skilllevels: Introductory
+subjects: Performance and Architecture
+armips:
+ - Cortex-A
+ - Cortex-X
+operatingsystems:
+ - Android
+tools_software_languages:
+ - Android Studio
+ - Coding
+
+further_reading:
+ - resource:
+ title: Halide 19.0.0
+ link: https://halide-lang.org/docs/index.html
+ type: website
+ - resource:
+ title: Halide GitHub
+ link: https://github.com/halide/Halide
+ type: repository
+ - resource:
+ title: Halide Tutorials
+ link: https://halide-lang.org/tutorials/
+ type: website
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1 # _index.md always has weight of 1 to order correctly
+layout: "learningpathall" # All files under learning paths have this same wrapper
+learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_next-steps.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_next-steps.md
new file mode 100644
index 0000000000..c3db0de5a2
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+# FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
+title: "Next Steps" # Always the same, html page title.
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md
new file mode 100644
index 0000000000..3bb359a6fa
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md
@@ -0,0 +1,419 @@
+---
+# User change
+title: "Integrating Halide into an Android (Kotlin) Project"
+
+weight: 6
+
+layout: "learningpathall"
+---
+
+## Objective
+In this lesson, we’ll learn how to integrate a high-performance Halide image-processing pipeline into an Android application using Kotlin.
+
+## Overview of mobile integration with Halide
+Android is the world’s most widely-used mobile operating system, powering billions of devices across diverse markets. This vast user base makes Android an ideal target platform for developers aiming to reach a broad audience, particularly in applications requiring sophisticated image and signal processing, such as augmented reality, photography, video editing, and real-time analytics.
+
+Kotlin, now the preferred programming language for Android development, combines concise syntax with robust language features, enabling developers to write maintainable, expressive, and safe code. It offers seamless interoperability with existing Java codebases and straightforward integration with native code via JNI, simplifying the development of performant mobile applications.
+
+## Benefits of using Halide on mobile
+Integrating Halide into Android applications brings several key advantages:
+1. Performance. Halide enables significant acceleration of complex image processing algorithms, often surpassing the speed of traditional Java or Kotlin implementations by leveraging optimized code generation. By generating highly optimized native code tailored for ARM CPUs or GPUs, Halide can dramatically increase frame rates and responsiveness, essential for real-time or interactive applications.
+2. Efficiency. On mobile devices, resource efficiency translates directly to improved battery life and reduced thermal output. Halide’s scheduling strategies (such as operation fusion, tiling, parallelization, and vectorization) minimize unnecessary memory transfers, CPU usage, and GPU overhead. This optimization substantially reduces overall power consumption, extending battery life and enhancing the user experience by preventing overheating.
+3. Portability. Halide abstracts hardware-specific details, allowing developers to write a single high-level pipeline that easily targets different processor architectures and hardware configurations. Pipelines can seamlessly run on various ARM-based CPUs and GPUs commonly found in Android smartphones and tablets, enabling developers to support a wide range of devices with minimal platform-specific modifications.
+4. Custom Algorithm Integration. Halide allows developers to easily integrate their bespoke image-processing algorithms that may not be readily available or optimized in common libraries, providing full flexibility and control over application-specific performance and functionality.
+
+In short, Halide delivers high-performance image processing without sacrificing portability or efficiency, a balance particularly valuable on resource-constrained mobile devices.
+
+### Android development ecosystem and challenges
+While Android presents abundant opportunities for developers, the mobile development ecosystem brings its own set of challenges, especially for performance-intensive applications:
+1. Limited Hardware Resources. Unlike desktop or server environments, mobile devices have significant constraints on processing power, memory capacity, and battery life. Developers must optimize software meticulously to deliver smooth performance while carefully managing hardware resource consumption. Leveraging tools like Halide allows developers to overcome these constraints by optimizing computational workloads, making resource-intensive tasks feasible on constrained hardware.
+2. Cross-Compilation Complexities. Developing native code for Android requires handling multiple hardware architectures (such as armeabi-v7a, arm64-v8a, and sometimes x86/x86_64). Cross-compilation introduces complexities due to different instruction sets, CPU features, and performance characteristics. Managing this complexity involves careful use of the Android NDK, understanding toolchains, and correctly configuring build systems (e.g., Gradle, CMake). Halide helps mitigate these issues by abstracting away many platform-specific optimizations, automatically generating code optimized for target architectures.
+3. Image-Format Conversions (Bitmap ↔ Halide Buffer). Android typically handles images through the Bitmap class or similar platform-specific constructs, whereas Halide expects image data to be in raw, contiguous buffer formats. Developers must bridge the gap between Android-specific image representations (Bitmaps, YUV images from camera APIs, etc.) and Halide’s native buffer format. Proper management of these conversions—including considerations for pixel formats, stride alignment, and memory copying overhead—can significantly impact performance and correctness, necessitating careful design and efficient implementation of buffer-handling routines.
+
+## Project requirements
+Before integrating Halide into your Android application, ensure you have the necessary tools and libraries.
+
+### Tools and prerequisites
+1. Android Studio. [Download link](https://developer.android.com/studio).
+2. Android NDK (Native Development Kit). Can be easily installed from Android Studio (Tools → SDK Manager → SDK Tools → Android NDK).
+
+## Setting up the Android project
+### Creating the project
+1. Open Android Studio.
+2. Select New Project > Native C++.
+
+
+### Configure the project
+1. Set the project Name to Arm.Halide.AndroidDemo.
+2. Choose Kotlin as the language.
+3. Set Minimum SDK to API 24.
+4. Click Next.
+
+5. Select C++17 from the C++ Standard dropdown list.
+
+6. Click Finish.
+
+## Configuring the Android project
+Next, configure your Android project to use the files generated in the previous step. First, copy blur_threshold_android.a and blur_threshold_android.h into Arm.Halide.AndroidDemo/app/src/main/cpp. Ensure your cpp directory contains the following files:
+* native-lib.cpp
+* blur_threshold_android.a
+* blur_threshold_android.h
+* CMakeLists.txt
+
+Open CMakeLists.txt and modify it as follows (replace /path/to/halide with your Halide installation directory):
+```cmake
+cmake_minimum_required(VERSION 3.22.1)
+
+project("armhalideandroiddemo")
+include_directories(
+ /path/to/halide/include
+)
+
+add_library(blur_threshold_android STATIC IMPORTED)
+set_target_properties(blur_threshold_android PROPERTIES IMPORTED_LOCATION
+ ${CMAKE_CURRENT_SOURCE_DIR}/blur_threshold_android.a
+)
+
+add_library(${CMAKE_PROJECT_NAME} SHARED native-lib.cpp)
+
+target_link_libraries(${CMAKE_PROJECT_NAME}
+ blur_threshold_android
+ android
+ log)
+```
+
+Open build.gradle.kts and modify it as follows:
+
+```kotlin
+plugins {
+ alias(libs.plugins.android.application)
+ alias(libs.plugins.kotlin.android)
+}
+
+android {
+ namespace = "com.arm.armhalideandroiddemo"
+ compileSdk = 35
+
+ defaultConfig {
+ applicationId = "com.arm.armhalideandroiddemo"
+ minSdk = 24
+ targetSdk = 34
+ versionCode = 1
+ versionName = "1.0"
+ ndk {
+ abiFilters += "arm64-v8a"
+ }
+ testInstrumentationRunner = "androidx.test.runner.AndroidJUnitRunner"
+ externalNativeBuild {
+ cmake {
+ cppFlags += "-std=c++17"
+ }
+ }
+ }
+
+ buildTypes {
+ release {
+ isMinifyEnabled = false
+ proguardFiles(
+ getDefaultProguardFile("proguard-android-optimize.txt"),
+ "proguard-rules.pro"
+ )
+ }
+ }
+ compileOptions {
+ sourceCompatibility = JavaVersion.VERSION_11
+ targetCompatibility = JavaVersion.VERSION_11
+ }
+ kotlinOptions {
+ jvmTarget = "11"
+ }
+ externalNativeBuild {
+ cmake {
+ path = file("src/main/cpp/CMakeLists.txt")
+ version = "3.22.1"
+ }
+ }
+ buildFeatures {
+ viewBinding = true
+ }
+}
+
+dependencies {
+
+ implementation(libs.androidx.core.ktx)
+ implementation(libs.androidx.appcompat)
+ implementation(libs.material)
+ implementation(libs.androidx.constraintlayout)
+ testImplementation(libs.junit)
+ androidTestImplementation(libs.androidx.junit)
+ androidTestImplementation(libs.androidx.espresso.core)
+}
+```
+
+Click the Sync Now button at the top. To verify that everything is configured correctly, click Build > Make Project in Android Studio.
+
+## UI
+Now, you'll define the application's User Interface, consisting of two buttons and an ImageView. One button loads the image, the other processes it, and the ImageView displays both the original and processed images.
+1. Open the res/layout/activity_main.xml file, and modify it as follows:
+```xml
+<?xml version="1.0" encoding="utf-8"?>
+<!-- A minimal layout: two buttons and an ImageView, using the IDs that MainActivity.kt expects. -->
+<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
+    android:layout_width="match_parent"
+    android:layout_height="match_parent"
+    android:orientation="vertical"
+    android:padding="16dp">
+
+    <Button
+        android:id="@+id/btnLoadImage"
+        android:layout_width="match_parent"
+        android:layout_height="wrap_content"
+        android:text="Load Image" />
+
+    <Button
+        android:id="@+id/btnProcessImage"
+        android:layout_width="match_parent"
+        android:layout_height="wrap_content"
+        android:enabled="false"
+        android:text="Process Image" />
+
+    <ImageView
+        android:id="@+id/imageView"
+        android:layout_width="match_parent"
+        android:layout_height="0dp"
+        android:layout_weight="1"
+        android:contentDescription="Image preview"
+        android:scaleType="fitCenter" />
+
+</LinearLayout>
+```
+
+2. In MainActivity.kt, comment out the following line:
+
+```kotlin
+//binding.sampleText.text = stringFromJNI()
+```
+
+Now you can run the app to view the UI:
+
+
+
+## Processing
+You will now implement the image processing code. First, pick an image to process; here we use the classic cameraman test image. Then create an assets folder under Arm.Halide.AndroidDemo/app/src/main, and save the image in that folder as img.png.
+
+Now, open MainActivity.kt and modify it as follows:
+```kotlin
+package com.arm.armhalideandroiddemo
+
+import android.graphics.Bitmap
+import android.graphics.BitmapFactory
+import androidx.appcompat.app.AppCompatActivity
+import android.os.Bundle
+import android.widget.Button
+import android.widget.ImageView
+import com.arm.armhalideandroiddemo.databinding.ActivityMainBinding
+import kotlinx.coroutines.CoroutineScope
+import kotlinx.coroutines.Dispatchers
+import kotlinx.coroutines.launch
+import kotlinx.coroutines.withContext
+import java.io.InputStream
+
+class MainActivity : AppCompatActivity() {
+
+ private lateinit var binding: ActivityMainBinding
+
+ private var originalBitmap: Bitmap? = null
+ private lateinit var btnLoadImage: Button
+ private lateinit var btnProcessImage: Button
+ private lateinit var imageView: ImageView
+
+ override fun onCreate(savedInstanceState: Bundle?) {
+ super.onCreate(savedInstanceState)
+
+ binding = ActivityMainBinding.inflate(layoutInflater)
+ setContentView(binding.root)
+
+ btnLoadImage = findViewById(R.id.btnLoadImage)
+ btnProcessImage = findViewById(R.id.btnProcessImage)
+ imageView = findViewById(R.id.imageView)
+
+ // Load the image from assets when the user clicks "Load Image"
+ btnLoadImage.setOnClickListener {
+ originalBitmap = loadImageFromAssets("img.png")
+ originalBitmap?.let {
+ imageView.setImageBitmap(it)
+ // Enable the process button only if the image is loaded.
+ btnProcessImage.isEnabled = true
+ }
+ }
+
+ // Process the image using Halide when the user clicks "Process Image"
+ btnProcessImage.setOnClickListener {
+ originalBitmap?.let { bmp ->
+ // Run the processing on a background thread using coroutines.
+ CoroutineScope(Dispatchers.IO).launch {
+ // Convert Bitmap to grayscale byte array.
+ val grayBytes = extractGrayScaleBytes(bmp)
+
+ // Call your native function via JNI.
+ val processedBytes = blurThresholdImage(grayBytes, bmp.width, bmp.height)
+
+ // Convert processed bytes back to a Bitmap.
+ val processedBitmap = createBitmapFromGrayBytes(processedBytes, bmp.width, bmp.height)
+
+ // Update UI on the main thread.
+ withContext(Dispatchers.Main) {
+ imageView.setImageBitmap(processedBitmap)
+ }
+ }
+ }
+ }
+ }
+
+ // Utility to load an image from the assets folder.
+ private fun loadImageFromAssets(fileName: String): Bitmap? {
+ return try {
+ val assetManager = assets
+ val istr: InputStream = assetManager.open(fileName)
+ BitmapFactory.decodeStream(istr)
+ } catch (e: Exception) {
+ e.printStackTrace()
+ null
+ }
+ }
+
+ // Convert Bitmap to a grayscale ByteArray.
+ private fun extractGrayScaleBytes(bitmap: Bitmap): ByteArray {
+ val width = bitmap.width
+ val height = bitmap.height
+ val pixels = IntArray(width * height)
+ bitmap.getPixels(pixels, 0, width, 0, 0, width, height)
+ val grayBytes = ByteArray(width * height)
+ var index = 0
+ for (pixel in pixels) {
+ val r = (pixel shr 16 and 0xFF)
+ val g = (pixel shr 8 and 0xFF)
+ val b = (pixel and 0xFF)
+ val gray = ((r + g + b) / 3).toByte()
+ grayBytes[index++] = gray
+ }
+ return grayBytes
+ }
+
+ // Convert a grayscale byte array back to a Bitmap.
+ private fun createBitmapFromGrayBytes(grayBytes: ByteArray, width: Int, height: Int): Bitmap {
+ val bitmap = Bitmap.createBitmap(width, height, Bitmap.Config.ARGB_8888)
+ val pixels = IntArray(width * height)
+ var idx = 0
+ for (i in 0 until width * height) {
+ val gray = grayBytes[idx++].toInt() and 0xFF
+ pixels[i] = (0xFF shl 24) or (gray shl 16) or (gray shl 8) or gray
+ }
+ bitmap.setPixels(pixels, 0, width, 0, 0, width, height)
+ return bitmap
+ }
+
+ external fun blurThresholdImage(inputBytes: ByteArray, width: Int, height: Int): ByteArray
+
+ companion object {
+ // Used to load the 'armhalideandroiddemo' library on application startup.
+ init {
+ System.loadLibrary("armhalideandroiddemo")
+ }
+ }
+}
+```
+
+This Kotlin Android application demonstrates integrating a Halide-generated image-processing pipeline within an Android app. The main activity (MainActivity) manages loading and processing an image stored in the application’s asset folder.
+
+When the app launches, the Process Image button is disabled. When a user taps Load Image, the app retrieves img.png from its assets directory and displays it within the ImageView, simultaneously enabling the Process Image button for further interaction.
+
+Upon pressing the Process Image button, the following sequence occurs:
+1. Background Processing. A Kotlin coroutine initiates processing on a background thread, ensuring the application’s UI remains responsive.
+2. Conversion to Grayscale. The loaded bitmap image is converted into a grayscale byte array using a simple RGB-average method, preparing it for processing by the native (JNI) layer.
+3. Native Function Invocation. This grayscale byte array, along with image dimensions, is passed to a native function (blurThresholdImage) defined via JNI. This native function is implemented using the Halide pipeline, performing operations such as blurring and thresholding directly on the image data.
+4. Post-processing. After the native function completes, the resulting processed grayscale byte array is converted back into a Bitmap image.
+5. UI Update. The coroutine then updates the displayed image (on the main UI thread) with this newly processed bitmap, providing the user immediate visual feedback.
+
+The code defines three utility methods:
+1. loadImageFromAssets, which retrieves an image from the assets folder and decodes it into a Bitmap.
+2. extractGrayScaleBytes - converts a Bitmap into a grayscale byte array suitable for native processing.
+3. createBitmapFromGrayBytes - converts a grayscale byte array back into a Bitmap for display purposes.
+
+Note that performing the grayscale conversion in Halide (rather than in Kotlin) would let the pipeline exploit operator fusion, further improving performance by avoiding intermediate memory accesses. This could be done in the same way as in the earlier processing-workflow examples.
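+
+A minimal sketch of what this could look like, assuming an interleaved 3-channel input and the same blur and threshold stages as in the AOT pipeline (this fragment is illustrative and not part of the project files):
+
+```cpp
+// Hypothetical variant of the pipeline definition in blur-android.cpp: accept
+// interleaved BGR input and convert to gray inside the pipeline. Because 'gray'
+// gets no schedule, Halide inlines (fuses) it into 'blur', so no intermediate
+// grayscale image is ever stored.
+using namespace Halide;
+
+Var x("x"), y("y");
+ImageParam input(UInt(8), 3, "input");   // x, y, channel
+input.dim(0).set_stride(3);              // interleaved layout: x-stride = channels
+input.dim(2).set_stride(1);
+input.dim(2).set_bounds(0, 3);
+
+Func in = BoundaryConditions::repeat_edge(input);
+
+Func gray("gray");
+gray(x, y) = cast<uint8_t>(0.114f * in(x, y, 0)
+                         + 0.587f * in(x, y, 1)
+                         + 0.299f * in(x, y, 2));
+// 'blur' and 'thresholded' are then defined on gray(x, y) exactly as before.
+```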
+
+The JNI integration occurs through an external method declaration, blurThresholdImage, loaded via the companion object at app startup. The native library (armhalideandroiddemo) containing this function is compiled separately and integrated into the application (native-lib.cpp).
+
+You will now need to create the blurThresholdImage JNI function. In Android Studio, place the cursor on the blurThresholdImage declaration, and then click Create JNI function for blurThresholdImage:
+
+
+This will generate a new function in the native-lib.cpp:
+```cpp
+extern "C"
+JNIEXPORT jbyteArray JNICALL
+Java_com_arm_armhalideandroiddemo_MainActivity_blurThresholdImage(JNIEnv *env, jobject thiz,
+ jbyteArray input_bytes,
+ jint width, jint height) {
+ // TODO: implement blurThresholdImage()
+}
+```
+
+Implement this function as follows:
+```cpp
+extern "C"
+JNIEXPORT jbyteArray JNICALL
+Java_com_arm_armhalideandroiddemo_MainActivity_blurThresholdImage(JNIEnv *env, jobject thiz,
+ jbyteArray input_bytes,
+ jint width, jint height) {
+ // Get the input byte array
+ jbyte* inBytes = env->GetByteArrayElements(input_bytes, nullptr);
+ if (inBytes == nullptr) return nullptr;
+
+ // Wrap the grayscale input in a Halide::Runtime::Buffer (8-bit, width x height).
+ Halide::Runtime::Buffer<uint8_t> inputBuffer(reinterpret_cast<uint8_t*>(inBytes), width, height);
+
+ // Prepare an output buffer of the same size.
+ Halide::Runtime::Buffer<uint8_t> outputBuffer(width, height);
+
+ // Call the Halide AOT pipeline generated in the previous section.
+ blur_threshold(inputBuffer, outputBuffer);
+
+ // Allocate a jbyteArray for the output.
+ jbyteArray outputArray = env->NewByteArray(width * height);
+ // Copy the data from Halide's output buffer to the jbyteArray.
+ env->SetByteArrayRegion(outputArray, 0, width * height, reinterpret_cast<const jbyte*>(outputBuffer.data()));
+
+ env->ReleaseByteArrayElements(input_bytes, inBytes, JNI_ABORT);
+ return outputArray;
+}
+```
+Then add the following includes at the top of native-lib.cpp (HalideBuffer.h provides Halide::Runtime::Buffer; the full Halide.h header is only needed when compiling pipelines, not when calling an AOT-compiled one):
+```cpp
+#include "HalideBuffer.h"
+#include "blur_threshold_android.h"
+```
+
+This C++ function acts as a bridge between Java (Kotlin) and native code. Specifically, the function blurThresholdImage is implemented using JNI, allowing it to be directly called from Kotlin. When invoked from Kotlin (through the external fun blurThresholdImage declaration), the function receives a grayscale image represented as a Java byte array (jbyteArray) along with its width and height.
+
+The input Java byte array (input_bytes) is accessed and pinned into native memory via GetByteArrayElements. This provides a direct pointer (inBytes) to the grayscale data sent from Kotlin. The raw grayscale byte data is wrapped into a Halide::Runtime::Buffer object (inputBuffer). This buffer structure is required by the Halide pipeline. An output buffer (outputBuffer) is created with the same dimensions as the input image. This buffer will store the result produced by the Halide pipeline. The native function invokes the Halide-generated AOT function blur_threshold, passing in both the input and output buffers. After processing, a new Java byte array (outputArray) is allocated to hold the processed grayscale data. The processed data from the Halide output buffer is copied into this Java array using SetByteArrayRegion. The native input buffer (inBytes) is explicitly released using ReleaseByteArrayElements, specifying JNI_ABORT as no changes were made to the input array. Finally, the processed byte array (outputArray) is returned to Kotlin.
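+
+One refinement worth considering: the AOT-generated pipeline returns an int error code (0 on success), so the call can be made more defensive. A sketch reusing the variables from the function above:
+
+```cpp
+// Check the error code returned by the AOT-compiled pipeline before copying the result.
+int err = blur_threshold(inputBuffer, outputBuffer);
+if (err != 0) {
+    env->ReleaseByteArrayElements(input_bytes, inBytes, JNI_ABORT);
+    return nullptr; // or raise a Java exception describing the failure
+}
+```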
+
+Through this JNI bridge, Kotlin can invoke high-performance native code. You can now re-run the application. Click the Load Image button, and then Process Image. You will see the following results:
+
+
+
+
+In the code above we allocated a new jbyteArray and copied the data into it explicitly, which adds overhead. To avoid the extra copy, you can wrap Halide's output data directly in a Java-accessible ByteBuffer. Note that NewDirectByteBuffer does not copy: the underlying memory must remain valid for as long as the Java side uses the ByteBuffer, so the output buffer can no longer be a local variable of the JNI function.
+```cpp
+// Instead of allocating a new jbyteArray, expose Halide's output data as a direct ByteBuffer.
+jobject byteBuffer = env->NewDirectByteBuffer(outputBuffer.data(), width * height);
+```
+
+## Summary
+In this lesson, we’ve successfully integrated a Halide image-processing pipeline into an Android application using Kotlin. We started by setting up an Android project configured for native development with the Android NDK, employing Kotlin as the primary language. We then integrated Halide-generated static libraries and demonstrated their usage through Java Native Interface (JNI), bridging Kotlin and native code. This equips developers with the skills needed to harness Halide’s capabilities for building sophisticated, performant mobile applications on Android.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md
new file mode 100644
index 0000000000..4c8ebe0796
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md
@@ -0,0 +1,162 @@
+---
+# User change
+title: "Ahead-of-time and cross-compilation"
+
+weight: 5
+
+layout: "learningpathall"
+---
+
+## Ahead-of-time and cross-compilation
+One of Halide’s standout features is the ability to compile image processing pipelines ahead-of-time (AOT), enabling developers to generate optimized binary code on their host machines rather than compiling directly on target devices. This AOT compilation process allows developers to create highly efficient libraries that run effectively across diverse hardware without incurring the runtime overhead associated with just-in-time (JIT) compilation.
+
+Halide also supports robust cross-compilation. Cross-compilation means using the host version of Halide, typically running on a desktop Linux or macOS system, to target different architectures, such as Arm for Android devices. Developers can thus optimize Halide pipelines on their host machine, produce libraries specifically optimized for Android, and integrate them seamlessly into Android applications. The generated pipeline code includes essential optimizations and can embed minimal runtime support, further reducing workload on the target device and ensuring responsiveness and efficiency.
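+
+For example, the Android arm64 configuration used later in this section can also be written as a Halide target string (a sketch; the code below builds the same target field by field):
+
+```cpp
+// Parse a target triple of the form arch-bits-os (plus optional feature flags).
+Halide::Target target("arm-64-android");
+```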
+
+## Objective
+In this section, we leverage the host version of Halide to perform AOT compilation of an image processing pipeline via cross-compilation. The resulting pipeline library is specifically tailored to Android devices (targeting, for instance, arm64-v8a ABI), while the compilation itself occurs entirely on the host system. This approach significantly accelerates development by eliminating the need to build Halide or perform JIT compilation on Android devices. It also guarantees that the resulting binaries are optimized for the intended hardware, streamlining the deployment of high-performance image processing applications on mobile platforms.
+
+## Prepare Pipeline for Android
+The procedure implemented in the following code demonstrates how Halide’s AOT compilation and cross-compilation features can be utilized to create an optimized image processing pipeline for Android. We will run Halide on our host machine (in this example, macOS) to generate a static library containing the pipeline function, which will later be invoked from an Android device. Below is a step-by-step explanation of this process.
+
+Create a new file named blur-android.cpp with the following contents:
+
+```cpp
+#include "Halide.h"
+#include <iostream>
+#include <string>   // for std::string
+#include <cstdint>  // for fixed-width integer types (e.g., uint8_t)
+using namespace Halide;
+
+int main(int argc, char** argv) {
+ if (argc < 2) {
+ std::cerr << "Usage: " << argv[0] << " <output_basename>\n";
+ return 1;
+ }
+
+ std::string output_basename = argv[1];
+
+ // Configure Halide Target for Android
+ Halide::Target target;
+ target.os = Halide::Target::OS::Android;
+ target.arch = Halide::Target::Arch::ARM;
+ target.bits = 64;
+ target.set_feature(Target::NoRuntime, false);
+
+ // --- Define the pipeline ---
+ // Define variables
+ Var x("x"), y("y");
+
+ // Define input parameter
+ ImageParam input(UInt(8), 2, "input");
+
+ // Create a clamped function that limits the access to within the image bounds
+ Func clamped = Halide::BoundaryConditions::repeat_edge(input);
+
+ // Now use the clamped function in processing
+ RDom r(0, 3, 0, 3);
+ Func blur("blur");
+
+ // Initialize blur accumulation
+ blur(x, y) = cast<uint16_t>(0);
+ blur(x, y) += cast<uint16_t>(clamped(x + r.x - 1, y + r.y - 1));
+
+ // Then continue with pipeline
+ Func blur_div("blur_div");
+ blur_div(x, y) = cast<uint8_t>(blur(x, y) / 9);
+
+ // Thresholding
+ Func thresholded("thresholded");
+ Expr t = cast<uint8_t>(128);
+ thresholded(x, y) = select(blur_div(x, y) > t, cast<uint8_t>(255), cast<uint8_t>(0));
+
+ // Simple scheduling
+ blur_div.compute_root();
+ thresholded.compute_root();
+
+ // --- AOT compile to a file ---
+ thresholded.compile_to_static_library(
+ output_basename, // base filename
+ { input }, // list of inputs
+ "blur_threshold", // name of the generated function
+ target
+ );
+
+ return 0;
+}
+```
+
+In the original implementation, the constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and the output values (255, 0) are explicitly cast to uint8_t. This removes ambiguity and clearly specifies the types used, ensuring compatibility and clarity. Both approaches produce identical results, but explicit casting emphasizes type correctness and can avoid subtle issues during cross-compilation or in certain environments.
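+
+For comparison, a sketch of the two formulations, assuming the same blur_div and t definitions as above (the exact code of the earlier JIT example is in the previous section):
+
+```cpp
+// Implicitly typed: the select branches (255, 0) are 32-bit integer Exprs,
+// so a narrowing cast is still needed before the result is stored as uint8.
+thresholded(x, y) = cast<uint8_t>(select(blur_div(x, y) > 128, 255, 0));
+
+// Explicitly typed (the form used in this section): every operand is uint8_t.
+// thresholded(x, y) = select(blur_div(x, y) > t, cast<uint8_t>(255), cast<uint8_t>(0));
+```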
+
+The program takes at least one command-line argument, the output base name used to generate the files (e.g., “blur_threshold_android”). Here, the target architecture is explicitly set within the code to Android ARM64:
+
+```cpp
+// Configure Halide Target for Android
+Halide::Target target;
+target.os = Halide::Target::OS::Android;
+target.arch = Halide::Target::Arch::ARM;
+target.bits = 64;
+
+// Enable Halide runtime inclusion in the generated library (needed if not linking Halide runtime separately).
+target.set_feature(Target::NoRuntime, false);
+
+// Optionally, enable hardware-specific optimizations to improve performance on ARM devices:
+// - DotProd: Optimizes matrix multiplication and convolution-like operations on ARM.
+// - ARMFp16 (half-precision floating-point operations).
+```
+
+Notes:
+* NoRuntime — When set to true, Halide excludes its runtime from the generated code, and you must link the runtime manually during the linking step. When set to false, the Halide runtime is included in the generated library, which simplifies deployment.
+* ARMFp16 — Enables the use of ARM hardware support for half-precision (16-bit) floating-point operations, which can provide faster execution when reduced precision is acceptable.
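+
+If you want to opt into these optional features, they can be enabled on the target before compiling; a sketch (feature names as defined in Halide 19's Target API, with actual benefit depending on the device):
+
+```cpp
+// Optional Arm-specific features; enable only if the target devices support them.
+target.set_feature(Halide::Target::ARMDotProd); // Armv8.2 dot-product instructions
+target.set_feature(Halide::Target::ARMFp16);    // half-precision floating-point support
+```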
+
+We declare spatial variables (x, y) and an ImageParam named “input” representing the input image data. We use boundary clamping (BoundaryConditions::repeat_edge) to safely handle edge pixels. Then, we apply a 3x3 blur with a reduction domain (RDom). The accumulated sum is divided by 9 (the number of pixels in the neighborhood), producing an average blurred image. Lastly, thresholding is applied, producing a binary output: pixels above a certain brightness threshold (128) become white (255), while others become black (0).
+
+This section intentionally reinforces previous concepts, focusing now primarily on explicitly clarifying integration details, such as type correctness and the handling of runtime features within Halide.
+
+Simple scheduling directives (compute_root) instruct Halide to compute intermediate functions at the pipeline’s root, simplifying debugging and potentially enhancing runtime efficiency.
+
+This strategy can simplify debugging by clearly isolating computational steps and may enhance runtime efficiency by explicitly controlling intermediate storage locations.
+
+By clearly separating algorithm logic from scheduling, developers can easily test and compare different scheduling strategies, such as compute_inline, compute_root, compute_at, and more, without modifying their fundamental algorithmic code. This separation significantly accelerates iterative optimization and debugging, ultimately yielding better-performing code with minimal overhead.
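+
+As a small illustration using the blur_div and thresholded stages defined above, the schedule can be swapped with one-line changes while the algorithm stays untouched (a sketch, not an exhaustive list):
+
+```cpp
+// Pick exactly one of these for blur_div; the definitions above do not change.
+blur_div.compute_inline();               // fuse into thresholded (Halide's default)
+// blur_div.compute_root();              // materialize once for the whole image
+// blur_div.compute_at(thresholded, y);  // materialize one row at a time
+```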
+
+We invoke Halide’s AOT compilation function compile_to_static_library, which generates a static library (.a) containing the optimized pipeline and a corresponding header file (.h).
+
+```cpp
+thresholded.compile_to_static_library(
+ output_basename, // base filename for output files (e.g., "blur_threshold_android")
+ { input }, // list of input parameters to the pipeline
+ "blur_threshold", // the generated function name
+ target // our target configuration for Android
+);
+```
+
+This will produce:
+* A static library (blur_threshold_android.a) containing the compiled pipeline. This static library also includes Halide’s runtime functions tailored specifically for the targeted architecture (arm-64-android). Thus, no separate Halide runtime needs to be provided on the Android device when linking against this library.
+* A header file (blur_threshold_android.h) declaring the pipeline function for use in other C++/JNI code.
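+
+For reference, the generated header declares the pipeline as a plain C function operating on halide_buffer_t pointers. It looks roughly like this (a sketch; the exact parameter names and attributes are generated and may differ):
+
+```cpp
+// blur_threshold_android.h (abridged sketch)
+#include "HalideRuntime.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+int blur_threshold(struct halide_buffer_t *input, struct halide_buffer_t *output);
+
+#ifdef __cplusplus
+}
+#endif
+```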
+
+These generated files are then ready to integrate directly into an Android project via JNI, allowing efficient execution of the optimized pipeline on Android devices. The integration process is covered in the next section.
+
+Note: JNI (Java Native Interface) is a framework that allows Java (or Kotlin) code running in a Java Virtual Machine (JVM), such as on Android, to interact with native applications and libraries written in languages like C or C++. JNI bridges the managed Java/Kotlin environment and the native, platform-specific implementations.
+
+## Compilation instructions
+To compile the pipeline-generation program on your host system, use the following commands (replace /path/to/halide with your Halide installation directory):
+```console
+export DYLD_LIBRARY_PATH=/path/to/halide/lib
+g++ -std=c++17 blur-android.cpp -o blur-android \
+ -I/path/to/halide/include -L/path/to/halide/lib -lHalide \
+ -lpthread -ldl \
+ -Wl,-rpath,/path/to/halide/lib
+```
+
+Then execute the binary:
+```console
+./blur-android blur_threshold_android
+```
+
+This will produce two files:
+* blur_threshold_android.a: The static library containing your Halide pipeline.
+* blur_threshold_android.h: The header file needed to invoke the generated pipeline.
+
+We will integrate these files into our Android project in the following section.
+
+## Summary
+In this section, we’ve explored Halide’s powerful ahead-of-time (AOT) and cross-compilation capabilities, preparing an optimized image processing pipeline tailored specifically for Android devices. By using the host-based Halide compiler, we’ve generated a static library optimized for ARM64 Android architecture, incorporating safe boundary conditions, neighborhood-based blurring, and thresholding operations. This streamlined process allows seamless integration of highly optimized native code into Android applications, ensuring both development efficiency and runtime performance on mobile platforms.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
new file mode 100644
index 0000000000..20394259d4
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
@@ -0,0 +1,531 @@
+---
+# User change
+title: "Demonstrating Operation Fusion"
+
+weight: 4
+
+layout: "learningpathall"
+---
+
+## Objective
+In the previous section, we explored parallelization and tiling. Here, we focus on operator fusion (inlining) in Halide—i.e., letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You’ll learn when fusion reduces memory traffic and when materializing saves recomputation (e.g., for large stencils or multi-use intermediates). We’ll inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s).
+
+Note: this section does not cover loop fusion (the fuse directive). We concentrate on operator fusion, which is Halide’s default behavior.
+
+## Code
+To demonstrate how fusion works in Halide, create a new file named camera-capture-fusion.cpp with the contents below. The code runs a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion vs. materialization, and prints ms / FPS / MPix/s so you can see the impact immediately.
+
+```cpp
+#include "Halide.h"
+#include <opencv2/opencv.hpp>
+#include <chrono>
+#include <cstdlib>
+#include <iomanip>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using namespace Halide;
+using namespace cv;
+using namespace std;
+
+enum class Schedule : int {
+ Simple = 0, // materialize gray + blur
+ FuseBlurAndThreshold = 1,// materialize gray; fuse blur+threshold
+ FuseAll = 2, // fuse everything (default)
+ Tile = 3, // tile output; materialize gray per tile; blur fused
+};
+
+static const char* schedule_name(Schedule s) {
+ switch (s) {
+ case Schedule::Simple: return "Simple";
+ case Schedule::FuseBlurAndThreshold: return "FuseBlurAndThreshold";
+ case Schedule::FuseAll: return "FuseAll";
+ case Schedule::Tile: return "Tile";
+ default: return "Unknown";
+ }
+}
+
+// Build the BGR->Gray -> 3x3 binomial blur -> threshold pipeline.
+// We clamp the *ImageParam* at the borders (Func clamp of ImageParam works in Halide 19).
+Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
+ Var x("x"), y("y");
+
+ // Assume 3-channel BGR interleaved frames (we convert if needed).
+ input.dim(0).set_stride(3); // x-stride = channels
+ input.dim(2).set_stride(1); // c-stride = 1
+ input.dim(2).set_bounds(0, 3); // three channels
+
+ Func inputClamped = BoundaryConditions::repeat_edge(input);
+
+ // Gray (Rec.601)
+ Func gray("gray");
+ gray(x, y) = cast<uint8_t>(0.114f * inputClamped(x, y, 0)
+ + 0.587f * inputClamped(x, y, 1)
+ + 0.299f * inputClamped(x, y, 2));
+
+ // 3x3 binomial blur (sum/16)
+ Func blur("blur");
+ const uint16_t k[3][3] = {{1,2,1},{2,4,2},{1,2,1}};
+ Expr blurSum = cast<uint16_t>(0);
+ for (int j = 0; j < 3; ++j)
+ for (int i = 0; i < 3; ++i)
+ blurSum = blurSum + cast<uint16_t>(gray(x + i - 1, y + j - 1)) * k[j][i];
+ blur(x, y) = cast<uint8_t>(blurSum / 16);
+
+ // Threshold (binary)
+ Func thresholded("thresholded");
+ Expr T = cast<uint8_t>(128);
+ thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
+
+ // Final output
+ Func output("output");
+ output(x, y) = thresholded(x, y);
+ output.compute_root(); // we always realize 'output'
+
+ // Scheduling to demonstrate OPERATOR FUSION vs MATERIALIZATION
+ // Default in Halide = fusion/inlining (no schedule on producers).
+ Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
+
+ switch (schedule) {
+ case Schedule::Simple:
+ // Materialize gray and blur (two loop nests); thresholded fuses into output
+ gray.compute_root();
+ blur.compute_root();
+ break;
+
+ case Schedule::FuseBlurAndThreshold:
+ // Materialize gray; blur and thresholded remain fused into output
+ gray.compute_root();
+ break;
+
+ case Schedule::FuseAll:
+ // No schedule on producers: gray, blur, thresholded all fuse into output
+ break;
+
+ case Schedule::Tile:
+ // Tile the output; compute gray per tile; blur stays fused within tile
+ output.tile(x, y, xo, yo, xi, yi, 64, 64);
+ gray.compute_at(output, xo);
+ break;
+ }
+
+ // (Optional) Print loop nest once to “x-ray” the schedule
+ std::cout << "\n---- Loop structure (" << schedule_name(schedule) << ") ----\n";
+ output.print_loop_nest();
+ std::cout << "-----------------------------------------------\n";
+
+ return Pipeline(output);
+}
+
+int main(int argc, char** argv) {
+ // Optional CLI: start with a given schedule number 0..3
+ Schedule current = Schedule::FuseAll;
+ if (argc >= 2) {
+ int s = std::atoi(argv[1]);
+ if (s >= 0 && s <= 3) current = static_cast<Schedule>(s);
+ }
+ std::cout << "Starting with schedule: " << schedule_name(current)
+ << " (press 0..3 to switch; q/Esc to quit)\n";
+
+ // Open camera
+ VideoCapture cap(0);
+ if (!cap.isOpened()) {
+ std::cerr << "Error: Unable to open camera.\n";
+ return 1;
+ }
+ cap.set(CAP_PROP_CONVERT_RGB, true); // ask OpenCV for BGR frames
+
+ // Grab one frame to get size/channels
+ Mat frame;
+ cap >> frame;
+ if (frame.empty()) {
+ std::cerr << "Error: empty first frame.\n";
+ return 1;
+ }
+ if (frame.channels() == 4) {
+ cvtColor(frame, frame, COLOR_BGRA2BGR);
+ } else if (frame.channels() == 1) {
+ cvtColor(frame, frame, COLOR_GRAY2BGR);
+ }
+ if (!frame.isContinuous()) frame = frame.clone();
+
+ const int width = frame.cols;
+ const int height = frame.rows;
+
+ // Halide inputs/outputs
+ ImageParam input(UInt(8), 3, "input");
+ Buffer<uint8_t> outBuf(width, height, "out");
+
+ // Build pipeline for the starting schedule
+ Pipeline pipe = make_pipeline(input, current);
+
+ bool warmed_up = false;
+ namedWindow("Fusion Demo (live)", WINDOW_NORMAL);
+
+ for (;;) {
+ cap >> frame;
+ if (frame.empty()) break;
+ if (frame.channels() == 4) {
+ cvtColor(frame, frame, COLOR_BGRA2BGR);
+ } else if (frame.channels() == 1) {
+ cvtColor(frame, frame, COLOR_GRAY2BGR);
+ }
+ if (!frame.isContinuous()) frame = frame.clone();
+
+ // Wrap interleaved frame
+ auto in_rt = Runtime::Buffer<uint8_t>::make_interleaved(
+ frame.data, frame.cols, frame.rows, /*channels*/3);
+ Buffer<> in_fe(*in_rt.raw_buffer());
+ input.set(in_fe);
+
+ // Time the Halide realize() only
+ auto t0 = std::chrono::high_resolution_clock::now();
+ try {
+ pipe.realize(outBuf);
+ } catch (const Halide::RuntimeError& e) {
+ std::cerr << "Halide runtime error: " << e.what() << "\n";
+ break;
+ } catch (const std::exception& e) {
+ std::cerr << "std::exception: " << e.what() << "\n";
+ break;
+ }
+ auto t1 = std::chrono::high_resolution_clock::now();
+
+ double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
+ double fps = ms > 0.0 ? 1000.0 / ms : 0.0;
+ double mpixps = ms > 0.0 ? (double(width) * double(height)) / (ms * 1000.0) : 0.0;
+
+ std::cout << std::fixed << std::setprecision(2)
+ << (warmed_up ? "" : "[warm-up] ")
+ << schedule_name(current) << " | "
+ << ms << " ms | "
+ << fps << " FPS | "
+ << mpixps << " MPix/s\r" << std::flush;
+ warmed_up = true;
+
+ // Show result
+ Mat view(height, width, CV_8UC1, outBuf.data());
+ imshow("Fusion Demo (live)", view);
+ int key = waitKey(1);
+ if (key == 27 || key == 'q' || key == 'Q') break;
+
+ // Hotkeys 0..3 to switch schedules live
+ if (key >= '0' && key <= '3') {
+ Schedule next = static_cast<Schedule>(key - '0');
+ if (next != current) {
+ std::cout << "\nSwitching to schedule: " << schedule_name(next) << "\n";
+ current = next;
+ try {
+ pipe = make_pipeline(input, current); // rebuild JIT with new schedule
+ } catch (const Halide::CompileError& e) {
+ std::cerr << "Halide compile error: " << e.what() << "\n";
+ break;
+ }
+ warmed_up = false; // next frame includes JIT, label as warm-up
+ }
+ }
+ }
+
+ std::cout << "\n";
+ destroyAllWindows();
+ return 0;
+}
+```
+We begin by pulling in the right set of headers. Right after the includes we define an enumeration, Schedule, which lists the four different scheduling strategies we want to experiment with. These represent the “modes” we’ll toggle between while the program is running: a simple materialized version, a fused blur-plus-threshold, a fully fused pipeline, and a tiled variant.
+
+Finally, to make the output more readable, we add a small helper function, schedule_name. It converts each enum value into a human-friendly label so that when the program prints logs or overlays statistics, you can immediately see which schedule is active.
+```cpp
+#include "Halide.h"
+#include <opencv2/opencv.hpp>
+#include <chrono>
+#include <cstdlib>
+#include <iomanip>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using namespace Halide;
+using namespace cv;
+using namespace std;
+
+enum class Schedule : int {
+ Simple = 0,
+ FuseBlurAndThreshold = 1,
+ FuseAll = 2,
+ Tile = 3,
+};
+
+static const char* schedule_name(Schedule s) { ... }
+```
+
+The heart of this demo is the make_pipeline function. It defines our camera processing pipeline in Halide and applies different scheduling choices depending on which mode we select.
+
+We start by declaring Var x, y as our pixel coordinates. As before, the camera frames come in as 3-channel interleaved BGR, so we tell Halide how the data is laid out: the stride along x is 3 (one step moves across all three channels), the stride along c (channels) is 1, and the bounds on the channel dimension are 0–2.
+
+Because we don’t want to worry about array bounds when applying filters, we clamp the input at the borders. In Halide 19, BoundaryConditions::repeat_edge works cleanly when applied to an ImageParam, since it has .dim() information. This way, all downstream stages can assume safe access even at the edges of the image.
+
+```cpp
+Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
+ Var x("x"), y("y");
+
+ // (a) Interleaved constraints for BGR frames
+ input.dim(0).set_stride(3); // x stride = channels
+ input.dim(2).set_stride(1); // channel stride = 1
+ input.dim(2).set_bounds(0, 3); // channels = 0..2
+
+ // (b) Border handling: clamp the *ImageParam* (works cleanly in Halide 19)
+ Func inputClamped = BoundaryConditions::repeat_edge(input);
+```
+
+Next comes the gray conversion. As in the previous section, we use Rec.601 weights. This is followed by a 3×3 binomial blur. Instead of using a reduction domain (RDom), we unroll the sum in C++ host code with a pair of loops over the kernel. The kernel values {1, 2, 1; 2, 4, 2; 1, 2, 1} approximate a Gaussian filter. Each pixel of blur is simply the weighted sum of its 3×3 neighborhood, divided by 16.
+
+We then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, we define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when we run the pipeline.
+
+```cpp
+ // (c) BGR → gray (Rec.601, float weights)
+ Func gray("gray");
+ gray(x, y) = cast<uint8_t>(0.114f * inputClamped(x, y, 0)
+ + 0.587f * inputClamped(x, y, 1)
+ + 0.299f * inputClamped(x, y, 2));
+
+ // (d) 3×3 binomial blur, unrolled in host code (no RDom needed)
+ Func blur("blur");
+ const uint16_t k[3][3] = {{1,2,1},{2,4,2},{1,2,1}};
+ Expr blurSum = cast<uint16_t>(0);
+ for (int j = 0; j < 3; ++j)
+ for (int i = 0; i < 3; ++i)
+ blurSum = blurSum + cast<uint16_t>(gray(x + i - 1, y + j - 1)) * k[j][i];
+ blur(x, y) = cast<uint8_t>(blurSum / 16);
+
+ // (e) Threshold to binary
+ Func thresholded("thresholded");
+ Expr T = cast<uint8_t>(128);
+ thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
+
+ // (f) Final output and default root
+ Func output("output");
+ output(x, y) = thresholded(x, y);
+ output.compute_root();
+```
+
+Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, we instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
+* Simple. We explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
+* FuseBlurAndThreshold. We compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray.
+* FuseAll. We apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil.
+* Tile. We split the output into 64×64 tiles. Within each tile, we materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile.
+
+To help us “x-ray” what’s happening, we print the loop nest Halide generates for each schedule using print_loop_nest(). This gives us a clear view of how fusion or materialization changes the structure of the computation.
+
+```cpp
+Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
+
+switch (schedule) {
+ case Schedule::Simple:
+ // Materialize gray and blur as whole-frame buffers.
+ gray.compute_root();
+ blur.compute_root();
+ break;
+
+ case Schedule::FuseBlurAndThreshold:
+ // Materialize only gray; leave blur+threshold fused into output.
+ gray.compute_root();
+ break;
+
+ case Schedule::FuseAll:
+ // No schedules on producers → gray, blur, thresholded all inline into output.
+ break;
+
+ case Schedule::Tile:
+ // Tile the output; compute gray per tile; blur stays fused inside each tile.
+ output.tile(x, y, xo, yo, xi, yi, 64, 64);
+ gray.compute_at(output, xo);
+ break;
+}
+
+// Optional: print loop nest to “x-ray” the shape of the generated loops
+std::cout << "\n---- Loop structure (" << schedule_name(schedule) << ") ----\n";
+output.print_loop_nest();
+std::cout << "-----------------------------------------------\n";
+
+return Pipeline(output);
+}
+```
+
+All the camera handling is just like before: we open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. We still time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up].
+
+The new piece is that you can toggle scheduling modes from the keyboard while the app is running, using these keys:
+* 0 – Simple (materialize gray and blur)
+* 1 – FuseBlurAndThreshold (materialize gray; fuse blur+threshold)
+* 2 – FuseAll (default fusion: fuse gray+blur+threshold)
+* 3 – Tile (tile output; materialize gray per tile; blur fused inside tile)
+* q / Esc – quit
+
+Under the hood, pressing 0–3 triggers a rebuild of the Halide pipeline with the chosen schedule:
+1. We map the key to a Schedule enum value.
+2. We call make_pipeline(input, next) to construct the new scheduled pipeline.
+3. We reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT).
+4. The main loop keeps grabbing frames; only the Halide schedule changes.
+
+This live switching makes fusion tangible: you can watch the loop nest printout change, see the visualization update, and compare throughput numbers in real time as you move between Simple, FuseBlurAndThreshold, FuseAll, and Tile.
+
+Now, build and run the sample:
+```console
+g++ -std=c++17 camera-capture-fusion.cpp -o camera-capture-fusion \
+ -I/path/to/halide/include -L/path/to/halide/lib -lHalide \
+ $(pkg-config --cflags --libs opencv4) -lpthread -ldl \
+ -Wl,-rpath,/path/to/halide/lib
+./camera-capture-fusion
+```
+
+You will see the following output:
+```output
+% ./camera-capture-fusion
+Starting with schedule: FuseAll (press 0..3 to switch; q/Esc to quit)
+
+---- Loop structure (FuseAll) ----
+produce output:
+ for y:
+ for x:
+ output(...) = ...
+-----------------------------------------------
+FuseAll | 18.90 ms | 52.92 FPS | 109.74 MPix/s
+Switching to schedule: FuseBlurAndThreshold
+
+---- Loop structure (FuseBlurAndThreshold) ----
+produce gray:
+ for y:
+ for x:
+ gray(...) = ...
+consume gray:
+ produce output:
+ for y:
+ for x:
+ output(...) = ...
+-----------------------------------------------
+FuseBlurAndThreshold | 4.85 ms | 206.19 FPS | 427.55 MPix/s
+Switching to schedule: FuseAll
+
+---- Loop structure (FuseAll) ----
+produce output:
+ for y:
+ for x:
+ output(...) = ...
+-----------------------------------------------
+FuseAll | 18.14 ms | 55.12 FPS | 114.30 MPix/s
+Switching to schedule: Tile
+
+---- Loop structure (Tile) ----
+produce output:
+ for y.yo:
+ for x.xo:
+ produce gray:
+ for y:
+ for x:
+ gray(...) = ...
+ consume gray:
+ for y.yi in [0, 63]:
+ for x.xi in [0, 63]:
+ output(...) = ...
+-----------------------------------------------
+Tile | 4.98 ms | 200.73 FPS | 416.23 MPix/s
+Switching to schedule: Simple
+
+---- Loop structure (Simple) ----
+produce gray:
+ for y:
+ for x:
+ gray(...) = ...
+consume gray:
+ produce blur:
+ for y:
+ for x:
+ blur(...) = ...
+ consume blur:
+ produce output:
+ for y:
+ for x:
+ output(...) = ...
+-----------------------------------------------
+Simple | 6.01 ms | 166.44 FPS | 345.12 MPix/s
+```
+
+The console output combines two kinds of information:
+1. Loop nests – printed by print_loop_nest(). These show how Halide actually arranges the computation for the chosen schedule. They are a great “x-ray” view of fusion and materialization:
+* In FuseAll, the loop nest contains only output. That’s because gray, blur, and thresholded are all inlined (fused) into it. Each pixel of output recomputes its 3×3 neighborhood of gray.
+* In FuseBlurAndThreshold, there is an extra loop for gray, because we explicitly called gray.compute_root(). The blur and thresholded stages are still fused into output. This reduces recomputation of gray and makes downstream loops simpler to vectorize.
+* In Simple, both gray and blur have their own loop nests, and thresholded fuses into output. This introduces two extra buffers, but each stage is computed once and can be parallelized independently.
+* In Tile, you see the outer tile loops (y.yo and x.xo) and the inner per-tile loops (y.yi, x.xi). Inside each tile, gray is produced once and then consumed by the fused blur and threshold. This keeps the working set small and cache-friendly.
+2. Performance metrics – printed after each realize(). They report:
+* ms – the average time to process one frame.
+* FPS – frames per second (1000 / ms).
+* MPix/s – millions of pixels per second processed.
+
+Comparing the numbers:
+* FuseAll runs at ~53 FPS. It has minimal memory traffic but pays for recomputation of gray under the blur.
+* FuseBlurAndThreshold jumps to over 200 FPS. By materializing gray, we avoid redundant recomputation and allow blur+threshold to stay fused. This is often the sweet spot for interleaved camera input.
+* Simple reaches ~166 FPS. Both gray and blur are materialized, so no recomputation occurs, but memory traffic is higher than in FuseBlurAndThreshold.
+* Tile achieves similar speed (~200 FPS). Producing gray per tile balances recomputation and memory traffic by keeping intermediates local to cache.
+
+By toggling schedules live, you can see and measure how operator fusion and materialization change both the loop structure and the throughput:
+* Fusion is the default in Halide and eliminates temporary storage, but may cause recomputation for spatial filters.
+* Materializing selected stages with compute_root() or compute_at() can reduce recomputation, enable vectorization and parallelization, and sometimes yield much higher throughput.
+* Tile-level materialization (compute_at) provides a hybrid - fusing within tiles while keeping intermediates small and cache-resident.
+
+This demo makes these trade-offs concrete: the loop nest diagrams explain the structure, and the live FPS/MPix/s stats show the real performance impact.
+
+## What “fusion” means in Halide
+One of Halide’s defining features is that, by default, it performs operator fusion, also called inlining. This means that if a stage produces some intermediate values, those values aren’t stored in a separate buffer and then re-read later—instead, the stage is computed directly inside the consumer’s loop. In other words, unless you tell Halide otherwise, every producer Func is fused into the next stage that uses it.
+
+Why is this important? Fusion reduces memory traffic, because Halide doesn’t need to write intermediates out to RAM and read them back again. On CPUs, where memory bandwidth is often the bottleneck, this can be a major performance win. Fusion also improves cache locality, since values are computed exactly where they are needed and the working set stays small. The trade-off, however, is that fusion can cause recomputation: if a consumer uses a neighborhood (like a blur that reads 3×3 or 9×9 pixels), the fused producer may be recalculated multiple times for overlapping regions. Whether fusion is faster depends on the balance between compute cost and memory traffic.
+
+Consider the difference in pseudocode:
+```cpp
+for y:
+ for x:
+ out(x,y) = threshold( sum_{i,j in 3x3} kernel(i,j) * gray(x+i,y+j) )
+ // gray(...) is computed on the fly for each (i,j)
+```
+
+Materialized with compute_root():
+
+```cpp
+for y: for x: gray(x,y) = ... // write one planar gray image
+for y: for x: out(x,y) = threshold( sum kernel * gray(x+i,y+j) )
+```
+
+The fused version eliminates buffer writes but recomputes gray under the blur stencil. The materialized version performs more memory operations but avoids recomputation, and also gives us a clean point to parallelize or vectorize the gray stage.
+
+It’s worth noting that Halide also supports a loop fusion directive (fuse) that merges two loop variables together. That’s a different concept and not our focus here. In this tutorial, we’re talking specifically about operator fusion—the decision of whether to inline or materialize stages.
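+
+To make the distinction concrete, here is a minimal, hypothetical sketch of the loop-fusion directive. The Func and Var names are placeholders and are not part of the demo code:
+
+```cpp
+#include "Halide.h"
+
+int main() {
+    Halide::Var x("x"), y("y"), xy("xy");
+    Halide::Func f("f");
+    f(x, y) = x + y;        // placeholder definition
+
+    // fuse() merges the x and y loops into a single loop variable xy,
+    // which can then be scheduled (for example, parallelized) as one unit.
+    f.fuse(x, y, xy)
+     .parallel(xy);
+
+    f.realize({256, 256});
+    return 0;
+}
+```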
+
+## How this looks in the live camera demo
+Our pipeline is: BGR input → gray → 3×3 blur → thresholded → output. Depending on the schedule, we see different kinds of fusion:
+* FuseAll. No schedules on producers. gray, blur, and thresholded are all inlined into output. This minimizes memory traffic but recomputes gray repeatedly inside the 3×3 blur.
+* FuseBlurAndThreshold: We add gray.compute_root(), materializing gray once as a planar buffer. This avoids recomputation of gray and makes downstream blur and thresholded vectorize better. blur and thresholded remain fused.
+* Simple. Both gray and blur are materialized across the frame. This avoids recomputation entirely but increases memory traffic.
+* Tile. We split the output into 64×64 tiles and compute gray per tile (compute_at(output, xo)). This keeps intermediate results local to cache while still fusing blur inside each tile.
+
+By toggling between these modes in the live demo, you can see how the loop nests and throughput numbers change, which makes the abstract idea of fusion much more concrete.
+
+## When to use operator fusion
+Fusion is Halide’s default and usually the right place to start. It’s especially effective for:
+* Element-wise chains, where each pixel is transformed independently:
+examples include intensity scaling or offset, gamma correction, channel mixing, color-space conversions, and logical masking.
+* Cheap post-ops after spatial filters:
+for instance, there’s no reason to materialize a blurred image just to threshold it. Fuse the threshold directly into the blur’s consumer.
+
+In our code, FuseAll inlines gray, blur, and thresholded into output. FuseBlurAndThreshold materializes only gray, then keeps blur and thresholded fused—a common middle ground that balances memory use and compute reuse.
+
+## When to materialize instead of fuse
+Fusion isn’t always best. You’ll want to materialize an intermediate (compute_root() or compute_at()) if:
+* The producer would be recomputed many times under a large stencil.
+* The producer is read from an interleaved source and it’s easier to vectorize a planar buffer.
+* The intermediate is reused by multiple consumers.
+* You need a natural stage to apply parallelization or tiling.
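+
+As a rough sketch using the stage names from this demo (the tile size and loop variables mirror the Tile schedule described earlier; treat this as illustrative rather than the exact demo code):
+
+```cpp
+// Materialize gray once for the whole frame; every consumer then reads the
+// stored buffer instead of recomputing RGB-to-gray under the blur stencil.
+gray.compute_root();
+
+// Or: materialize gray per 64x64 output tile, keeping the intermediate
+// cache-resident while blur and thresholded stay fused inside each tile.
+Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
+output.tile(x, y, xo, yo, xi, yi, 64, 64);
+gray.compute_at(output, xo);
+```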
+
+### Profiling
+The fastest way to check whether fusion helps is to measure it. Our demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling).
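+
+For JIT-compiled pipelines like the ones in this Learning Path, one way to enable the profiler is to add the Profile feature to the JIT target; setting the environment variable HL_JIT_TARGET=host-profile has the same effect without code changes. The snippet below is only a sketch: it assumes the pipeline’s final stage is a Func named output and that width and height are already defined:
+
+```cpp
+// Build a JIT target with profiling enabled; Halide then prints a per-stage
+// timing report (typically when the compiled pipeline is destroyed or the program exits).
+Halide::Target target = Halide::get_jit_target_from_environment()
+                            .with_feature(Halide::Target::Profile);
+output.compile_jit(target);
+output.realize({width, height}, target);
+```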
+
+## Summary
+In this lesson, we learned about operator fusion in Halide—a powerful technique for reducing memory bandwidth and improving computational efficiency. We explored why fusion matters, looked at scenarios where it is most effective, and saw how Halide’s scheduling constructs such as compute_root() and compute_at() let us control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, we observed how fusion can significantly improve the performance of a real-time image processing pipeline.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md
new file mode 100644
index 0000000000..7462531a9d
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md
@@ -0,0 +1,240 @@
+---
+# User change
+title: "Background and Installation"
+
+weight: 2
+
+layout: "learningpathall"
+---
+
+## Introduction
+Halide is a powerful, open-source programming language specifically designed to simplify and optimize high-performance image and signal processing pipelines. Initially developed by researchers at MIT and Adobe in 2012, Halide addresses a critical challenge in computational imaging: efficiently mapping image-processing algorithms onto diverse hardware architectures without extensive manual tuning. It accomplishes this by clearly separating the description of an algorithm (specifying the mathematical or logical transformations applied to images or signals) from its schedule (detailing how and where those computations execute). This design enables rapid experimentation and effective optimization for various processing platforms, including CPUs, GPUs, and mobile hardware.
+
+A key advantage of Halide lies in its innovative programming model. By clearly distinguishing between algorithmic logic and scheduling decisions—such as parallelism, vectorization, memory management, and hardware-specific optimizations—developers can first focus on ensuring the correctness of their algorithms. Performance tuning can then be handled independently, significantly accelerating development cycles. This approach often yields performance that matches or even surpasses manually optimized code. As a result, Halide has seen widespread adoption across industry and academia, powering image processing systems at technology giants such as Google, Adobe, and Facebook, and enabling advanced computational photography features used by millions daily.
+
+In this learning path, you will explore Halide’s foundational concepts, set up your development environment, and create your first functional Halide application. By the end, you will understand what makes Halide uniquely suited to efficient image processing, particularly on mobile and Arm-based hardware, and be ready to build your own optimized pipelines.
+
+For broader or more general use cases, please refer to the official Halide documentation and tutorials available at halide-lang.org.
+
+The example code for this Learning Path is available in two repositories: [Arm.Halide.Hello-World](https://github.com/dawidborycki/Arm.Halide.Hello-World.git) and [Arm.Halide.AndroidDemo](https://github.com/dawidborycki/Arm.Halide.AndroidDemo.git).
+
+## Key concepts in Halide
+### Separation of algorithm and schedule
+At the core of Halide’s design philosophy is the principle of clearly separating algorithms from schedules. Traditional image-processing programming tightly couples algorithmic logic with execution strategy, complicating optimization and portability. In contrast, Halide explicitly distinguishes these two components:
+* Algorithm. Defines what computations are performed—for example, image filters, pixel transformations, or other mathematical operations on image data.
+* Schedule. Specifies how and where these computations are executed, addressing critical details such as parallel execution, memory usage, caching strategies, and hardware-specific optimizations.
+
+This separation allows developers to rapidly experiment and optimize their code for different hardware architectures or performance requirements without altering the core algorithmic logic.
+
+Halide provides three key building blocks to simplify and structure image processing algorithms: Functions, Vars, and Pipelines. Consider the following illustrative example:
+
+```cpp
+Halide::Var x("x"), y("y"), c("c");
+Halide::Func brighter("brighter");
+
+// Define a function to increase image brightness by 50
+brighter(x, y, c) = Halide::cast<uint8_t>(Halide::min(input(x, y, c) + 50, 255));
+```
+
+Functions (Func) represent individual computational steps or image operations. Each Func encapsulates an expression applied to pixels, allowing concise definition of complex image processing tasks. Vars symbolically represent spatial coordinates or dimensions (e.g., horizontal x, vertical y, color channel c); they specify where computations are applied in the image data. Pipelines are formed by interconnecting multiple Func objects, structuring a clear workflow where the output of one stage feeds into subsequent stages, enabling modular and structured image processing.
+
+Halide is a domain-specific language (DSL) tailored explicitly for image and signal processing tasks. It provides a concise set of predefined operations and building blocks optimized for expressing complex image processing pipelines. By abstracting common computational patterns into simple yet powerful operators, Halide allows developers to succinctly define their processing logic, facilitating readability, maintainability, and easy optimization for various hardware targets.
+
+### Scheduling strategies (parallelism, vectorization, tiling)
+Halide offers several powerful scheduling strategies designed for maximum performance:
+* Parallelism. Executes computations concurrently across multiple CPU cores, significantly reducing execution time for large datasets.
+* Vectorization. Enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions available on CPUs and GPUs, greatly enhancing performance.
+* Tiling. Divides computations into smaller blocks (tiles) optimized for cache efficiency, thus improving memory locality and reducing overhead due to memory transfers.
+
+By combining these scheduling techniques, developers can achieve optimal performance tailored specifically to their target hardware architecture.
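+
+As an illustration of how these directives compose (the Func here is a trivial placeholder, and the tile and vector sizes are arbitrary values that would normally be tuned per target):
+
+```cpp
+#include "Halide.h"
+
+int main() {
+    Halide::Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
+
+    Halide::Func f("f");
+    f(x, y) = x + y;                       // trivial stand-in algorithm
+
+    // Compose the three techniques on a single stage:
+    f.tile(x, y, xo, yo, xi, yi, 64, 64)   // tiling: 64x64 cache-friendly blocks
+     .vectorize(xi, 8)                     // vectorization: SIMD along the inner x loop
+     .parallel(yo);                        // parallelism: rows of tiles across CPU cores
+
+    f.realize({512, 512});
+    return 0;
+}
+```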
+
+Beyond manual scheduling strategies, Halide also provides an Autoscheduler, a powerful tool that automatically generates optimized schedules tailored to specific hardware architectures, further simplifying performance optimization.
+
+## System requirements and environment setup
+To start developing with Halide, your system must meet several requirements and dependencies.
+
+### Installation options
+Halide can be set up using one of two main approaches:
+* Installing pre-built binaries - pre-built binaries are convenient, quick to install, and suitable for most beginners or standard platforms (Windows, Linux, macOS). This approach is recommended for typical use cases.
+* Building Halide from source is required when pre-built binaries are unavailable for your specific environment, or if you wish to experiment with the latest Halide features or LLVM versions still under active development. This method typically requires greater familiarity with build systems and may be more suitable for advanced users.
+
+Here, we’ll use pre-built binaries:
+1. Visit the official Halide releases [page](https://github.com/halide/Halide/releases). As of this writing, the latest Halide version is v19.0.0.
+2. Download and unzip the binaries to a convenient location (e.g., /usr/local/halide on Linux/macOS or C:\halide on Windows).
+3. Optionally set environment variables to simplify further usage:
+```console
+export HALIDE_DIR=/path/to/halide
+export PATH=$HALIDE_DIR/bin:$PATH
+```
+
+To proceed further, let's make sure to install the following components:
+1. LLVM (Halide requires LLVM to compile and execute pipelines):
+* Linux (Ubuntu):
+```console
+sudo apt-get install llvm-19-dev libclang-19-dev clang-19
+```
+* macOS (Homebrew):
+```console
+brew install llvm
+```
+2. OpenCV (for image handling in later lessons):
+* Linux (Ubuntu):
+```console
+sudo apt-get install libopencv-dev pkg-config
+```
+* macOS (Homebrew):
+```console
+brew install opencv pkg-config
+```
+
+The Halide examples in this Learning Path were tested with OpenCV 4.11.0.
+
+## Your first Halide program
+Now you’re ready to build your first Halide-based application. Save the following as hello-world.cpp:
+```cpp
+#include "Halide.h"
+#include <opencv2/opencv.hpp>
+#include <iostream>
+#include <string>
+#include <cstdint>
+
+using namespace Halide;
+using namespace cv;
+
+int main() {
+ // Static path for the input image.
+ std::string imagePath = "img.png";
+
+ // Load the input image using OpenCV (BGR by default).
+ Mat input = imread(imagePath, IMREAD_COLOR);
+ // Alternative: Halide has a built-in IO function to directly load images as Halide::Buffer.
+ // Example: Halide::Buffer<uint8_t> inputBuffer = Halide::Tools::load_image(imagePath);
+ if (input.empty()) {
+ std::cerr << "Error: Unable to load image from " << imagePath << std::endl;
+ return -1;
+ }
+
+ // Convert from OpenCV's default BGR ordering to RGB so the Halide pipeline works with RGB data.
+ cvtColor(input, input, COLOR_BGR2RGB);
+
+ // Wrap the OpenCV Mat data in a Halide::Buffer.
+ Buffer<uint8_t> inputBuffer(input.data, input.cols, input.rows, input.channels());
+
+ // Example Halide pipeline definition directly using inputBuffer
+ // Define Halide pipeline variables:
+ // x, y - spatial coordinates (width, height)
+ // c - channel coordinate (R, G, B)
+ Var x("x"), y("y"), c("c");
+ Func invert("inverted");
+ invert(x, y, c) = 255 - inputBuffer(x, y, c);
+
+ // Schedule the pipeline so that the channel dimension is the innermost loop,
+ // ensuring that the output is interleaved.
+ invert.reorder(c, x, y);
+
+ // Realize the output buffer with the same dimensions as the input.
+ Buffer<uint8_t> outputBuffer = invert.realize({input.cols, input.rows, input.channels()});
+
+ // Wrap the Halide output buffer directly into an OpenCV Mat header.
+ // CV_8UC3 indicates an 8-bit unsigned integer image (CV_8U) with 3 color channels (C3), typically representing RGB or BGR images.
+ // This does not copy data; it creates a header that refers to the same memory.
+ Mat output(input.rows, input.cols, CV_8UC3, outputBuffer.data());
+
+ // Convert from RGB back to BGR so OpenCV displays the colors correctly.
+ cvtColor(output, output, COLOR_RGB2BGR);
+
+ // Display the input and processed image.
+ imshow("Original Image", input);
+ imshow("Inverted Image", output);
+
+ // Wait indefinitely until a key is pressed.
+ waitKey(0); // Wait for a key press before closing the window.
+
+ return 0;
+}
+```
+
+This program demonstrates how to combine Halide’s image processing capabilities with OpenCV’s image I/O and display functionality. It begins by loading an image from disk using OpenCV, specifically reading from a static file named img.png (here we use a Cameraman image). Since OpenCV loads images in BGR format by default, the code immediately converts the image to RGB format so that it is compatible with Halide’s expectations.
+
+Once the image is loaded and converted, the program wraps the raw image data into a Halide buffer, capturing the image’s dimensions (width, height, and color channels). Next, the Halide pipeline is defined through a function named invert, which specifies the computations to perform on each pixel—in this case, subtracting the original pixel value from 255 to invert the colors. The pipeline definition alone does not perform any actual computation; it only describes what computations should occur and how to schedule them.
+
+The actual computation occurs when the pipeline is executed with the call to invert.realize(...). This is the step that processes the input image according to the defined pipeline and produces an output Halide buffer. The scheduling directive (invert.reorder(c, x, y)) ensures that pixel data is computed in an interleaved manner (channel-by-channel per pixel), aligning the resulting data with OpenCV’s expected memory layout for images.
+
+Finally, the processed Halide output buffer is efficiently wrapped in an OpenCV Mat header without copying pixel data. For proper display in OpenCV, which uses BGR channel ordering by default, the code converts the processed image back from RGB to BGR. The program then displays the original and inverted images in separate windows, waiting for a key press before exiting. This approach demonstrates a streamlined integration between Halide for high-performance image processing and OpenCV for convenient input and output operations.
+
+By default, Halide orders loops based on the order of variable declaration. In this example, the original ordering (x, y, c) implies processing the image pixel-by-pixel across all horizontal positions (x), then vertical positions (y), and finally channels (c). This ordering naturally produces a planar memory layout (e.g., processing all red pixels first, then green, then blue).
+
+However, the optimal loop order depends on your intended memory layout and compatibility with external libraries:
+1. Interleaved Layout (RGBRGBRGB…):
+* Commonly used by libraries such as OpenCV.
+* To achieve this, the color channel (c) should be the innermost loop, followed by horizontal (x) and then vertical (y) loops.
+
+Specifically, calling:
+```cpp
+invert.reorder(c, x, y);
+```
+
+changes the loop nesting to process each pixel’s channels together (R, G, B for the first pixel, then R, G, B for the second pixel, and so on), resulting in:
+* Better memory locality and cache performance when interfacing with interleaved libraries like OpenCV.
+* Reduced overhead for subsequent image-handling operations (display, saving, or further processing).
+
+By default, OpenCV stores images in interleaved memory layout, using the HWC (Height, Width, Channel) ordering. To correctly represent this data layout in a Halide buffer, you can also explicitly use the Buffer::make_interleaved() method, which ensures the data layout is properly specified. The code snippet would look like this:
+
+```cpp
+// Wrap the OpenCV Mat data in a Halide buffer with interleaved HWC layout.
+Buffer<uint8_t> inputBuffer = Buffer<uint8_t>::make_interleaved(
+ input.data, input.cols, input.rows, input.channels()
+);
+```
+
+2. Planar Layout (RRR...GGG...BBB...):
+* Preferred by certain image-processing routines or hardware accelerators (e.g., some GPU kernels or certain ML frameworks).
+* Achieved naturally by Halide’s default loop ordering (x, y, c).
+
+Thus, it is essential to select loop ordering based on your specific data format requirements and integration scenario. Halide provides full flexibility, allowing you to explicitly reorder loops to match the desired memory layout efficiently.
+
+In Halide, two distinct concepts must be distinguished clearly:
+1. Loop execution order (controlled by reorder). Defines the nesting order of loops during computation. For example, to make the channel dimension (c) innermost during computation:
+
+```cpp
+invert.reorder(c, x, y);
+```
+2. Memory storage layout (controlled by reorder_storage). Defines the actual order in which data is stored in memory, such as interleaved or planar:
+
+```cpp
+invert.reorder_storage(c, x, y);
+```
+
+Using only reorder(c, x, y) affects the computational loop order but not necessarily the memory layout. The computed data could still be stored in planar order by default. Using reorder_storage(c, x, y) explicitly defines the memory layout as interleaved.
+
+## Compilation instructions
+Compile the program as follows (replace /path/to/halide accordingly):
+```console
+export DYLD_LIBRARY_PATH=/path/to/halide/lib
+g++ -std=c++17 hello-world.cpp -o hello-world \
+ -I/path/to/halide/include -L/path/to/halide/lib -lHalide \
+ $(pkg-config --cflags --libs opencv4) -lpthread -ldl \
+ -Wl,-rpath,/path/to/halide/lib
+```
+
+Note that, on Linux, you would set LD_LIBRARY_PATH instead:
+```console
+export LD_LIBRARY_PATH=/path/to/halide/lib/
+```
+
+Run the executable:
+```console
+./hello-world
+```
+
+You will see two windows displaying the original and inverted images:
+
+
+
+## Summary
+In this lesson, you’ve learned Halide’s foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV.
+
+While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it does not yet showcase the substantial benefits of explicitly separating algorithm definition from scheduling strategies.
+
+In subsequent lessons, you’ll explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which will clearly demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to specific hardware without modifying algorithmic correctness.
+
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
new file mode 100644
index 0000000000..157704cce2
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
@@ -0,0 +1,585 @@
+---
+# User change
+title: "Building a Simple Camera Image Processing Workflow"
+
+weight: 3
+
+layout: "learningpathall"
+---
+
+## Objective
+In this section, we will build a real-time camera processing pipeline using Halide. First, we capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. After establishing this pipeline, we will measure performance and then explore Halide’s scheduling options—parallelization and tiling—to understand when they help and when they don’t.
+
+## Gaussian blur and thresholding
+Create a new camera-capture.cpp file and modify it as follows:
+```cpp
+#include "Halide.h"
+#include "HalideRuntime.h" // for Runtime::Buffer make_interleaved
+#include <opencv2/opencv.hpp>
+#include <iostream>
+#include <string>
+#include <vector>
+#include <cstdint>
+
+using namespace cv;
+using namespace std;
+
+// Clamp coordinate within [0, maxCoord - 1].
+static inline Halide::Expr clampCoord(Halide::Expr coord, int maxCoord) {
+ return Halide::clamp(coord, 0, maxCoord - 1);
+}
+
+int main() {
+ // Open the default camera.
+ VideoCapture cap(0);
+ if (!cap.isOpened()) {
+ cerr << "Error: Unable to open camera." << endl;
+ return -1;
+ }
+
+ while (true) {
+ // Capture frame (typically interleaved BGR).
+ Mat frame;
+ cap >> frame;
+ if (frame.empty()) {
+ cerr << "Error: Received empty frame." << endl;
+ break;
+ }
+ if (!frame.isContinuous()) frame = frame.clone();
+
+ const int width = frame.cols;
+ const int height = frame.rows;
+ const int channels = frame.channels(); // 3 (BGR) or 4 (BGRA)
+
+ // Wrap the interleaved OpenCV frame for Halide.
+ auto in_rt = Halide::Runtime::Buffer<uint8_t>::make_interleaved(
+ frame.data, width, height, channels);
+ Halide::Buffer<> inputBuffer(*in_rt.raw_buffer()); // front-end view
+
+ // Define ImageParam (x, y, c) and declare interleaved layout.
+ Halide::ImageParam input(Halide::UInt(8), 3, "input");
+ input.set(inputBuffer);
+ input.dim(0).set_stride(channels); // x-stride = C (interleaved)
+ input.dim(2).set_stride(1); // c-stride = 1 (adjacent bytes)
+ input.dim(2).set_bounds(0, channels);
+
+ // Spatial vars.
+ Halide::Var x("x"), y("y");
+
+ // Grayscale in Halide
+ Halide::Func gray("gray");
+ Halide::Expr r16 = Halide::cast<uint16_t>(input(x, y, 2));
+ Halide::Expr g16 = Halide::cast<uint16_t>(input(x, y, 1));
+ Halide::Expr b16 = Halide::cast<uint16_t>(input(x, y, 0));
+
+ // Integer approx: Y ≈ (77*R + 150*G + 29*B) >> 8
+ gray(x, y) = Halide::cast<uint8_t>((77 * r16 + 150 * g16 + 29 * b16) >> 8);
+
+ // 3×3 binomial kernel (sum = 16).
+ int kernel_vals[3][3] = {
+ {1, 2, 1},
+ {2, 4, 2},
+ {1, 2, 1}
+ };
+ Halide::Buffer<int> kernelBuf(&kernel_vals[0][0], 3, 3);
+
+ // Blur via reduction over a 3×3 neighborhood.
+ Halide::RDom r(0, 3, 0, 3);
+ Halide::Func blur("blur");
+
+ // Use int16_t for safe multiply-and-accumulate with 8-bit input.
+ Halide::Expr val =
+ Halide::cast<int16_t>(
+ gray(clampCoord(x + r.x - 1, width),
+ clampCoord(y + r.y - 1, height))
+ ) * Halide::cast<int16_t>(kernelBuf(r.x, r.y));
+
+ blur(x, y) = Halide::cast<uint8_t>(Halide::sum(val) / 16);
+
+ // Thresholding.
+ Halide::Func thresholded("thresholded");
+ thresholded(x, y) = Halide::cast<uint8_t>(
+ Halide::select(blur(x, y) > 128, 255, 0)
+ );
+
+ // Realize and display.
+ Halide::Buffer<uint8_t> outputBuffer;
+ try {
+ outputBuffer = thresholded.realize({ width, height });
+ } catch (const std::exception &e) {
+ cerr << "Halide pipeline error: " << e.what() << endl;
+ break;
+ }
+
+ Mat blurredThresholded(height, width, CV_8UC1, outputBuffer.data());
+ imshow("Processed Image", blurredThresholded);
+
+ // ~33 FPS; exit on any key.
+ if (waitKey(30) >= 0) break;
+ }
+
+ cap.release();
+ destroyAllWindows();
+ return 0;
+}
+```
+
+This code demonstrates a real-time image processing pipeline using Halide and OpenCV. The default camera is accessed, continuously capturing color video frames in an interleaved BGR format. The images are then converted to grayscale directly inside the Halide pipeline. A Halide function gray(x, y) computes the luminance from the red, green, and blue channels using an integer approximation of the Rec.601 formula:
+
+```cpp
+Halide::Expr r16 = Halide::cast<uint16_t>(input(x, y, 2));
+Halide::Expr g16 = Halide::cast<uint16_t>(input(x, y, 1));
+Halide::Expr b16 = Halide::cast<uint16_t>(input(x, y, 0));
+gray(x, y) = Halide::cast<uint8_t>((77 * r16 + 150 * g16 + 29 * b16) >> 8);
+```
+
+The pipeline then applies a Gaussian blur using a 3×3 kernel explicitly defined in a Halide buffer:
+```cpp
+int kernel_vals[3][3] = {
+ {1, 2, 1},
+ {2, 4, 2},
+ {1, 2, 1}
+};
+Halide::Buffer<int> kernelBuf(&kernel_vals[0][0], 3, 3);
+```
+
+Why this kernel?
+* It provides effective smoothing while remaining computationally lightweight.
+* The weights approximate a Gaussian distribution, which reduces noise but preserves edges better than a box filter.
+* This is mathematically a binomial filter, a standard and efficient approximation of Gaussian blurring.
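+
+The binomial structure is easy to verify: the 3×3 kernel is the outer product of the 1-D binomial filter [1 2 1] with itself, which is also why the blur can later be made separable. A quick standalone check:
+
+```cpp
+#include <cstdio>
+
+int main() {
+    // kernel[j][i] = w[j] * w[i]; the weights sum to 4 * 4 = 16,
+    // which matches the normalization used in the pipeline above.
+    const int w[3] = {1, 2, 1};
+    for (int j = 0; j < 3; ++j) {
+        for (int i = 0; i < 3; ++i)
+            std::printf("%d ", w[j] * w[i]);  // prints 1 2 1 / 2 4 2 / 1 2 1
+        std::printf("\n");
+    }
+    return 0;
+}
+```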
+
+The Gaussian blur is computed using a Halide reduction domain (RDom), which iterates over the 3×3 neighborhood around each pixel. To handle boundaries, pixel coordinates are manually clamped to valid ranges. Intermediate products use 16-bit arithmetic to safely accumulate pixel values before normalization:
+```cpp
+Halide::Expr val =
+ Halide::cast<int16_t>(
+ gray(clampCoord(x + r.x - 1, width),
+ clampCoord(y + r.y - 1, height))
+ ) * Halide::cast<int16_t>(kernelBuf(r.x, r.y));
+
+blur(x, y) = Halide::cast<uint8_t>(Halide::sum(val) / 16);
+```
+
+After the blur stage, the pipeline applies a thresholding operation to highlight prominent features. Thresholding converts the blurred grayscale image into a binary image: pixels with intensity greater than 128 become white (255), while all others become black (0). This is expressed in Halide as:
+```cpp
+Halide::Func thresholded("thresholded");
+thresholded(x, y) = Halide::cast<uint8_t>(
+ Halide::select(blur(x, y) > 128, 255, 0)
+);
+```
+
+This simple but effective step emphasizes strong edges and regions of high contrast, and is often used as a building block in segmentation and feature extraction pipelines.
+
+Finally, the result is realized by Halide into a buffer and directly wrapped into an OpenCV matrix (cv::Mat) without extra copying:
+```cpp
+Halide::Buffer<uint8_t> outputBuffer = thresholded.realize({width, height});
+Mat blurredThresholded(height, width, CV_8UC1, outputBuffer.data());
+imshow("Processed Image", blurredThresholded);
+```
+
+The main loop continues capturing frames, running the Halide pipeline, and displaying the processed output in real-time until a key is pressed. This demonstrates how Halide integrates with OpenCV to build efficient, interactive image processing applications.
+
+In the examples above, pixel coordinates were manually clamped with a helper function:
+
+```cpp
+gray(clampCoord(x + r.x - 1, width),
+ clampCoord(y + r.y - 1, height))
+```
+
+This ensures that when the reduction domain r extends beyond the image borders (for example, at the left or top edge), the coordinates are clipped into the valid range [0, width-1] and [0, height-1]. Manual clamping is explicit and easy to understand, but it scatters boundary-handling logic across the pipeline.
+
+Halide provides an alternative through boundary condition functions, which wrap an existing Func and define its behavior outside the valid region. For the Gaussian blur, we can clamp the grayscale function instead of the raw input, producing a new function that automatically handles out-of-bounds coordinates:
+```cpp
+// Clamp the grayscale function instead of raw input
+Halide::Func grayClamped = Halide::BoundaryConditions::repeat_edge(
+    gray, {{0, width}, {0, height}}); // a Func (unlike a Buffer) needs explicit bounds
+
+// Use grayClamped inside the blur definition
+Halide::Expr val =
+ Halide::cast<int16_t>(grayClamped(x + (r.x - 1), y + (r.y - 1))) *
+ Halide::cast<int16_t>(kernelBuf(r.x, r.y));
+```
+
+In practice, both manual clamping and BoundaryConditions produce the same visual results. But for maintainability and performance tuning, using BoundaryConditions::repeat_edge (or another suitable policy) can be the preferred approach in production Halide pipelines.
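+
+Other boundary policies can be swapped in the same way; for example (a sketch reusing the gray, width, and height names from this pipeline):
+
+```cpp
+// Treat everything outside the frame as black (useful for masks).
+Halide::Func grayZero = Halide::BoundaryConditions::constant_exterior(
+    gray, Halide::cast<uint8_t>(0), {{0, width}, {0, height}});
+
+// Mirror pixels across the border instead of repeating the edge value.
+Halide::Func grayMirror = Halide::BoundaryConditions::mirror_image(
+    gray, {{0, width}, {0, height}});
+```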
+
+## Compilation instructions
+Compile the program as follows (replace /path/to/halide accordingly):
+```console
+g++ -std=c++17 camera-capture.cpp -o camera-capture \
+ -I/path/to/halide/include -L/path/to/halide/lib -lHalide \
+ $(pkg-config --cflags --libs opencv4) -lpthread -ldl \
+ -Wl,-rpath,/path/to/halide/lib
+```
+
+Run the executable:
+```console
+./camera-capture
+```
+
+The output should look as in the figure below:
+
+
+## Parallelization and Tiling
+In this section, we will explore two complementary scheduling optimizations provided by Halide: Parallelization and Tiling. Both techniques help enhance performance but achieve it through different mechanisms—parallelization leverages multiple CPU cores, whereas tiling improves cache efficiency by optimizing data locality.
+
+Below, we’ll demonstrate each technique separately for clarity and to emphasize their distinct benefits.
+
+Let’s first lock in a measurable baseline before we start changing the schedule. We’ll make a second file, camera-capture-perf-measurement.cpp, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide realize() call. This lets us quantify each optimization we add next (parallelization, tiling, caching).
+
+Create camera-capture-perf-measurement.cpp with the following code:
+```cpp
+#include "Halide.h"
+#include "HalideRuntime.h"
+#include <opencv2/opencv.hpp>
+#include <iostream>
+#include <string>
+#include <vector>
+#include <chrono>
+#include <iomanip>
+#include <cstdint>
+
+using namespace cv;
+using namespace std;
+
+// Clamp coordinate within [0, maxCoord - 1].
+static inline Halide::Expr clampCoord(Halide::Expr coord, int maxCoord) {
+ return Halide::clamp(coord, 0, maxCoord - 1);
+}
+
+int main() {
+ // Open the default camera.
+ VideoCapture cap(0);
+ if (!cap.isOpened()) {
+ cerr << "Error: Unable to open camera." << endl;
+ return -1;
+ }
+
+ bool warmed_up = false; // skip/report first-frame JIT separately
+
+ while (true) {
+ // Capture frame.
+ Mat frame;
+ cap >> frame;
+ if (frame.empty()) {
+ cerr << "Error: Received empty frame." << endl;
+ break;
+ }
+ if (!frame.isContinuous()) {
+ frame = frame.clone();
+ }
+
+ int width = frame.cols;
+ int height = frame.rows;
+ int channels = frame.channels(); // typically 3 (BGR) or 4 (BGRA)
+
+ // Wrap the interleaved BGR[BGR...] frame for Halide
+ auto in_rt = Halide::Runtime::Buffer<uint8_t>::make_interleaved(
+ frame.data, width, height, channels);
+ Halide::Buffer<> inputBuffer(*in_rt.raw_buffer()); // front-end Buffer view
+
+ // Define ImageParam for color input (x, y, c).
+ Halide::ImageParam input(Halide::UInt(8), 3, "input");
+ input.set(inputBuffer);
+
+ const int C = frame.channels(); // 3 (BGR) or 4 (BGRA)
+ input.dim(0).set_stride(C); // x stride = channels (interleaved)
+ input.dim(2).set_stride(1); // c stride = 1 (adjacent bytes)
+ input.dim(2).set_bounds(0, C); // c in [0, C)
+
+ // Define variables representing image coordinates.
+ Halide::Var x("x"), y("y");
+
+ // Grayscale in Halide (BGR order; ignore alpha if present)
+ Halide::Func gray("gray");
+ Halide::Expr r16 = Halide::cast<uint16_t>(input(x, y, 2));
+ Halide::Expr g16 = Halide::cast<uint16_t>(input(x, y, 1));
+ Halide::Expr b16 = Halide::cast<uint16_t>(input(x, y, 0));
+
+ // Integer approx: Y ≈ (77*R + 150*G + 29*B) >> 8
+ gray(x, y) = Halide::cast<uint8_t>((77 * r16 + 150 * g16 + 29 * b16) >> 8);
+
+ // Kernel layout: [1 2 1; 2 4 2; 1 2 1], sum = 16.
+ int kernel_vals[3][3] = {
+ {1, 2, 1},
+ {2, 4, 2},
+ {1, 2, 1}
+ };
+ Halide::Buffer<int> kernelBuf(&kernel_vals[0][0], 3, 3);
+
+ Halide::RDom r(0, 3, 0, 3);
+ Halide::Func blur("blur");
+
+ Halide::Expr val =
+ Halide::cast<int16_t>( gray(clampCoord(x + r.x - 1, width),
+ clampCoord(y + r.y - 1, height)) ) *
+ Halide::cast<int16_t>( kernelBuf(r.x, r.y) );
+
+ blur(x, y) = Halide::cast<uint8_t>(Halide::sum(val) / 16);
+
+ // Thresholding stage
+ Halide::Func thresholded("thresholded");
+ thresholded(x, y) = Halide::cast<uint8_t>(
+ Halide::select(blur(x, y) > 128, 255, 0)
+ );
+
+ // Performance timing around realize() only
+ Halide::Buffer<uint8_t> outputBuffer;
+ auto t0 = std::chrono::high_resolution_clock::now();
+
+ try {
+ outputBuffer = thresholded.realize({ width, height });
+ } catch (const std::exception &e) {
+ cerr << "Halide pipeline error: " << e.what() << endl;
+ break;
+ }
+
+ auto t1 = std::chrono::high_resolution_clock::now();
+ double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
+
+ // First frame includes JIT; mark it so you know why it's slower
+ double fps = (ms > 0.0) ? 1000.0 / ms : 0.0;
+ double mpixps = (ms > 0.0) ? (double(width) * double(height)) / (ms * 1000.0) : 0.0;
+
+ std::cout << std::fixed << std::setprecision(2)
+ << (warmed_up ? "" : "[warm-up] ")
+ << "Halide realize: " << ms << " ms | "
+ << fps << " FPS | "
+ << mpixps << " MPix/s" << endl;
+
+ warmed_up = true;
+
+ // Wrap output in OpenCV Mat and display.
+ Mat blurredThresholded(height, width, CV_8UC1, outputBuffer.data());
+ imshow("Processed Image", blurredThresholded);
+
+ // Wait for 30 ms (~33 FPS). Exit if any key is pressed.
+ if (waitKey(30) >= 0) {
+ break;
+ }
+ }
+
+ std::cout << std::endl;
+ cap.release();
+ destroyAllWindows();
+ return 0;
+}
+```
+
+What this gives us:
+* The console prints ms, FPS, and MPix/s per frame, measured strictly around realize() (camera capture and UI are excluded).
+* The very first line is labeled [warm-up] because it includes Halide’s JIT compilation. We can ignore it when comparing schedules.
+* MPix/s = (width*height) / (seconds * 1,000,000) is a good resolution-agnostic metric for comparing schedule variants.
+
+Build and run the application. Here is the sample output:
+```console
+% ./camera-capture-perf-measurement
+[warm-up] Halide realize: 327.13 ms | 3.06 FPS | 6.34 MPix/s
+Halide realize: 77.32 ms | 12.93 FPS | 26.82 MPix/s
+Halide realize: 82.86 ms | 12.07 FPS | 25.03 MPix/s
+Halide realize: 83.59 ms | 11.96 FPS | 24.81 MPix/s
+Halide realize: 79.20 ms | 12.63 FPS | 26.18 MPix/s
+Halide realize: 78.97 ms | 12.66 FPS | 26.26 MPix/s
+Halide realize: 80.37 ms | 12.44 FPS | 25.80 MPix/s
+Halide realize: 79.60 ms | 12.56 FPS | 26.05 MPix/s
+Halide realize: 80.52 ms | 12.42 FPS | 25.75 MPix/s
+Halide realize: 80.22 ms | 12.47 FPS | 25.85 MPix/s
+Halide realize: 80.91 ms | 12.36 FPS | 25.63 MPix/s
+Halide realize: 79.90 ms | 12.51 FPS | 25.95 MPix/s
+Halide realize: 79.49 ms | 12.58 FPS | 26.09 MPix/s
+Halide realize: 79.78 ms | 12.53 FPS | 25.99 MPix/s
+Halide realize: 80.74 ms | 12.38 FPS | 25.68 MPix/s
+Halide realize: 80.88 ms | 12.36 FPS | 25.64 MPix/s
+Halide realize: 81.07 ms | 12.34 FPS | 25.58 MPix/s
+Halide realize: 79.98 ms | 12.50 FPS | 25.93 MPix/s
+Halide realize: 79.73 ms | 12.54 FPS | 26.01 MPix/s
+Halide realize: 80.24 ms | 12.46 FPS | 25.84 MPix/s
+Halide realize: 80.99 ms | 12.35 FPS | 25.60 MPix/s
+Halide realize: 80.70 ms | 12.39 FPS | 25.69 MPix/s
+Halide realize: 81.24 ms | 12.31 FPS | 25.52 MPix/s
+Halide realize: 79.77 ms | 12.54 FPS | 26.00 MPix/s
+Halide realize: 79.81 ms | 12.53 FPS | 25.98 MPix/s
+Halide realize: 80.13 ms | 12.48 FPS | 25.88 MPix/s
+Halide realize: 80.12 ms | 12.48 FPS | 25.88 MPix/s
+Halide realize: 80.45 ms | 12.43 FPS | 25.78 MPix/s
+Halide realize: 77.72 ms | 12.87 FPS | 26.68 MPix/s
+Halide realize: 80.54 ms | 12.42 FPS | 25.74 MPix/s
+Halide realize: 80.44 ms | 12.43 FPS | 25.78 MPix/s
+Halide realize: 79.47 ms | 12.58 FPS | 26.09 MPix/s
+Halide realize: 79.68 ms | 12.55 FPS | 26.02 MPix/s
+Halide realize: 79.79 ms | 12.53 FPS | 25.99 MPix/s
+Halide realize: 79.86 ms | 12.52 FPS | 25.97 MPix/s
+Halide realize: 80.52 ms | 12.42 FPS | 25.75 MPix/s
+Halide realize: 79.47 ms | 12.58 FPS | 26.09 MPix/s
+Halide realize: 82.55 ms | 12.11 FPS | 25.12 MPix/s
+Halide realize: 78.59 ms | 12.72 FPS | 26.38 MPix/s
+Halide realize: 79.98 ms | 12.50 FPS | 25.93 MPix/s
+Halide realize: 79.06 ms | 12.65 FPS | 26.23 MPix/s
+Halide realize: 80.54 ms | 12.42 FPS | 25.75 MPix/s
+Halide realize: 79.19 ms | 12.63 FPS | 26.19 MPix/s
+Halide realize: 80.70 ms | 12.39 FPS | 25.70 MPix/s
+```
+
+This gives an average of 12.48 FPS and an average throughput of 25.88 MPix/s. Now let’s start measuring potential improvements from scheduling.
+
+### Parallelization
+Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. For image pipelines, rows (or tiles of rows) are naturally parallel: each can be processed independently once producer data is available. By distributing work across cores, we reduce wall-clock time—crucial for real-time video.
+
+With the baseline measured, we apply a minimal schedule that parallelizes the blur reduction across rows while keeping the threshold stage at root. This avoids tricky interactions between a parallel consumer and an unscheduled reduction (a common source of internal errors).
+
+Add these lines right after the threshold definition (and before any realize()):
+```cpp
+blur.compute_root().parallel(y); // parallelize reduction across scanlines
+thresholded.compute_root(); // cheap pixel-wise stage at root
+```
+
+This does two important things:
+* compute_root() on blur moves the reduction to the top level, so it isn’t nested under a parallel loop that might complicate reduction ordering.
+* parallel(y) parallelizes over the pure loop variable y (rows), not the reduction domain r, which is the safe/idiomatic way to parallelize reductions in Halide.
+
+Let’s re-build and re-run the app. The results should look like this:
+```output
+% ./camera-capture-perf-measurement
+[warm-up] Halide realize: 300.76 ms | 3.32 FPS | 6.89 MPix/s
+Halide realize: 64.23 ms | 15.57 FPS | 32.29 MPix/s
+Halide realize: 64.68 ms | 15.46 FPS | 32.06 MPix/s
+Halide realize: 71.92 ms | 13.90 FPS | 28.83 MPix/s
+Halide realize: 63.78 ms | 15.68 FPS | 32.51 MPix/s
+Halide realize: 67.95 ms | 14.72 FPS | 30.52 MPix/s
+Halide realize: 67.31 ms | 14.86 FPS | 30.81 MPix/s
+Halide realize: 67.90 ms | 14.73 FPS | 30.54 MPix/s
+Halide realize: 68.81 ms | 14.53 FPS | 30.14 MPix/s
+Halide realize: 68.57 ms | 14.58 FPS | 30.24 MPix/s
+Halide realize: 66.83 ms | 14.96 FPS | 31.03 MPix/s
+Halide realize: 68.04 ms | 14.70 FPS | 30.47 MPix/s
+Halide realize: 67.72 ms | 14.77 FPS | 30.62 MPix/s
+Halide realize: 68.79 ms | 14.54 FPS | 30.14 MPix/s
+Halide realize: 67.56 ms | 14.80 FPS | 30.69 MPix/s
+Halide realize: 67.65 ms | 14.78 FPS | 30.65 MPix/s
+Halide realize: 67.81 ms | 14.75 FPS | 30.58 MPix/s
+Halide realize: 67.81 ms | 14.75 FPS | 30.58 MPix/s
+Halide realize: 68.03 ms | 14.70 FPS | 30.48 MPix/s
+Halide realize: 67.44 ms | 14.83 FPS | 30.75 MPix/s
+Halide realize: 70.11 ms | 14.26 FPS | 29.58 MPix/s
+Halide realize: 66.23 ms | 15.10 FPS | 31.31 MPix/s
+Halide realize: 67.96 ms | 14.72 FPS | 30.51 MPix/s
+Halide realize: 68.00 ms | 14.71 FPS | 30.49 MPix/s
+Halide realize: 67.98 ms | 14.71 FPS | 30.50 MPix/s
+Halide realize: 67.56 ms | 14.80 FPS | 30.69 MPix/s
+Halide realize: 68.53 ms | 14.59 FPS | 30.26 MPix/s
+Halide realize: 67.06 ms | 14.91 FPS | 30.92 MPix/s
+```
+
+This gives an average of 14.79 FPS and a throughput of 30.67 MPix/s, roughly an 18.5% improvement over the baseline.
+
+### Tiling
+Tiling is a scheduling technique that divides computations into smaller, cache-friendly blocks or tiles. This approach significantly enhances data locality, reduces memory bandwidth usage, and leverages CPU caches more efficiently. While tiling can also use parallel execution, its primary advantage comes from optimizing intermediate data storage.
+
+Tiling splits the image into cache-friendly blocks (tiles). Two wins:
+* Partitioning: tiles are easy to parallelize across cores.
+* Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit L1/L2 more often.
+
+Below we show both flavors.
+
+### Tiling with explicit intermediate storage (best for cache efficiency)
+Here we cache gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB -> gray up to 9× per output pixel.
+
+Before using this, remove any earlier compute_root().parallel(y) schedule for blur.
+
+```cpp
+// After defining: input, gray, blur, thresholded
+Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
+
+// Tile & parallelize the consumer; vectorize inner x on planar output.
+thresholded
+ .tile(x, y, xo, yo, xi, yi, 128, 64)
+ .vectorize(xi, 16)
+ .parallel(yo);
+
+// Compute blur inside each tile and vectorize its inner x.
+blur
+ .compute_at(thresholded, xo)
+ .vectorize(x, 16);
+
+// Cache RGB→gray per tile (reads interleaved input → keep unvectorized).
+gray
+ .compute_at(thresholded, xo)
+ .store_at(thresholded, xo);
+```
+
+In this scheduling:
+* tile(...) splits the image into cache-friendly blocks and makes it easy to parallelize across tiles.
+* blur.compute_at(thresholded, xo) localizes the blur computation to each tile (it doesn’t force storing blur; it just computes it where it’s needed, keeping the working set small).
+* gray.compute_at(...).store_at(...) materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile.
+* Vectorization is applied only to planar stages (blur, thresholded), gray stays unvectorized because it reads interleaved input (x-stride = channels).
+
+Recompile your application as before, then run. On our machine, this version ran at ~7.6 FPS (~15.76 MPix/s, ~139 ms/frame), slower than baseline (~12.48 FPS) and the parallelization-only schedule (~14.79 FPS). The 3×3 blur is very small (low arithmetic intensity), the extra writes/reads of a tile-local buffer add overhead, and the interleaved source still limits how efficiently the gray producer can be read/vectorized.
+
+This pattern shines when the cached intermediate is expensive and reused a lot (bigger kernels, multi-use intermediates, or separable/multi-stage pipelines). For a tiny 3×3 on CPU, the benefit often doesn’t amortize.
+
+### Tiling for parallelization (without explicit intermediate storage)
+Tiling can also be used just to partition work across cores, without caching intermediates. This keeps the schedule simple: we split the output into tiles, parallelize across tiles, and vectorize along unit-stride x. Producers are computed inside each tile to keep the working set small, but we don’t materialize extra tile-local buffers:
+```cpp
+// Tiling (partitioning only)
+Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
+
+thresholded
+ .tile(x, y, xo, yo, xi, yi, 128, 64) // try 128x64; tune per CPU
+ .vectorize(xi, 16) // safe: planar, unit-stride along x
+ .parallel(yo); // run tiles across cores
+
+blur
+ .compute_at(thresholded, xo) // keep work tile-local
+ .vectorize(x, 16); // vectorize planar blur
+```
+
+What this does:
+* tile(...) splits the image into cache-friendly blocks and makes parallelization straightforward.
+* parallel(yo) distributes tiles across CPU cores.
+* compute_at(thresholded, xo) evaluates blur per tile (better locality) without forcing extra storage.
+* Vectorization is applied to planar stages (blur, thresholded).
+
+Recompile your application as before, then run. On our test machine, we got 9.35 FPS (19.40 MPix/s, ~106.93 ms/frame). This is slower than both the baseline and the parallelization-only schedule. The main reasons:
+* Recomputation of gray: with a 3×3 blur, each output reuses up to 9 neighbors; leaving gray inlined means RGB→gray is recomputed for each tap.
+* Interleaved input: gray reads BGR interleaved data (x-stride = channels), limiting unit-stride vectorization efficiency upstream.
+* Overhead vs. work: a 3×3 blur has low arithmetic intensity; extra tile/task overhead isn’t amortized.
+
+Tiling without caching intermediates mainly helps partition work, but for tiny kernels on CPU (and interleaved sources) it often underperforms. The earlier “quick win” (blur.compute_root().parallel(y)) remains the better choice here.
+
+### Tiling vs. parallelization
+* Parallelization spreads independent work across CPU cores. For this pipeline, the safest/most effective quick win was:
+```cpp
+blur.compute_root().parallel(y);
+thresholded.compute_root();
+```
+* Tiling for cache efficiency helps when an expensive intermediate is reused many times per output (e.g., larger kernels, separable/multi-stage pipelines, multiple consumers) and when producers read planar data. Caching gray per tile with a tiny 3×3 kernel over an interleaved source added overhead and ran slower (~8.2 FPS / 17.0 MPix/s).
+* Tiling for parallelization (partitioning only) simplifies work distribution and enables vectorization of planar stages, but with low arithmetic intensity (3×3) and an interleaved source it underperformed here (~9.35 FPS / 19.40 MPix/s).
+
+When to choose what:
+* Start with parallelizing the main reduction at root.
+* Add tiling + caching only if: kernel ≥ 5×5, separable/multi-pass blur, or the intermediate is reused by multiple consumers—and preferably after converting sources to planar (or precomputing a planar gray).
+* Keep stages that read interleaved inputs unvectorized; vectorize only planar consumers.
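+
+If you do go down that route, a hedged sketch of the separable variant could look like the following. It reuses gray, x, y, width, and height from the pipeline above, and thresholded would then read blur_y instead of blur; the vector widths and scheduling choices are starting points to tune, not measured results:
+
+```cpp
+// Clamp gray so the 1-D passes can read across the frame border.
+Halide::Func grayClamped = Halide::BoundaryConditions::repeat_edge(
+    gray, {{0, width}, {0, height}});
+
+Halide::Func blur_x("blur_x"), blur_y("blur_y");
+// Horizontal pass: weights 1-2-1, kept unnormalized (max 4 * 255 fits in uint16).
+blur_x(x, y) = Halide::cast<uint16_t>(grayClamped(x - 1, y)) +
+               2 * Halide::cast<uint16_t>(grayClamped(x, y)) +
+               Halide::cast<uint16_t>(grayClamped(x + 1, y));
+// Vertical pass: weights 1-2-1 again; divide by the total weight 16 at the end.
+blur_y(x, y) = Halide::cast<uint8_t>(
+    (blur_x(x, y - 1) + 2 * blur_x(x, y) + blur_x(x, y + 1)) / 16);
+
+gray.compute_root().parallel(y);                  // planar grayscale, computed once per frame
+blur_x.compute_at(blur_y, y).vectorize(x, 16);    // horizontal pass stays close to its consumer
+blur_y.compute_root().parallel(y).vectorize(x, 16);
+```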
+
+## Summary
+In this section, we built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. The baseline settled around 12.48 FPS (25.88 MPix/s). A small, safe schedule tweak that parallelizes the blur reduction across rows lifted performance to about 14.79 FPS (30.67 MPix/s). In contrast, tiling used only for partitioning landed near 9.35 FPS (19.40 MPix/s), and tiling with a cached per-tile grayscale buffer was slower still at roughly 8.2 FPS (17.0 MPix/s).
+
+The pattern is clear. On CPU, with a small kernel and an interleaved camera source, parallelizing the reduction is the most effective first step. Tiling starts to pay off only when an expensive intermediate is reused enough to amortize the overhead, e.g., after making the blur separable (horizontal+vertical), producing a planar grayscale once per frame with gray.compute_root(), and applying boundary conditions to unlock interior fast paths. From there, tune tile sizes and thread count to squeeze out the remaining headroom.
+