Yes, adding a small amount of inner dilation (i.e., letting the mask slightly overlap the actual object boundary) is almost always beneficial when training inpainting models, especially for video/object-aware inpainting, and your intuition is spot on. Here is why it helps, plus some practical considerations:

**Why a small inner overlap is good**
**How much inner dilation is ideal?**
From both academic papers (e.g., Stable Diffusion inpainting training, FlowCam, E2FGVI, etc.) and practical experience in the community:
Since your masks are already parametric (you control divergence/convergence + dilation), you can easily create a mixed strategy:

**Recommended training strategy**
**Kubric-specific tip**
Kubric gives you perfect instance segmentation, so you can very precisely control this overlap. I would actually generate multiple mask variants per video (a code sketch follows the summary below):
**Summary**
Yes: definitely include a small amount of inner dilation (≈4 pixels is perfect at 512×512). It is one of the highest-impact tricks for getting clean, robust inpainting boundaries, especially in video, where mask jitter is inevitable. Your setup with Kubric + parametric edge masks + real-world data is excellent. With this small inner-overlap trick (plus the randomization ideas above), you should see noticeably sharper and more stable results than most published video inpainting models. Good luck; this sounds like a very promising training run!
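To make the trick concrete, here is a minimal sketch of the inner-overlap idea with boolean NumPy masks. The function name, parameter names, and defaults are my own illustration for this thread, not code from iw3:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def add_inner_overlap(hole, object_mask, inner_px=4, outer_px=2):
    # hole:        HxW bool, disoccluded pixels that must be inpainted
    # object_mask: HxW bool, the occluding object's instance mask
    #              (e.g. Kubric's perfect segmentation)
    out = hole.copy()
    if inner_px > 0:  # grow the hole a few pixels *into* the object
        out |= binary_dilation(hole, iterations=inner_px) & object_mask
    if outer_px > 0:  # and optionally outward into the background
        out |= binary_dilation(hole, iterations=outer_px) & ~object_mask
    return out

# Toy example: a 64x64 object with a thin disocclusion strip at its right edge.
obj = np.zeros((64, 64), dtype=bool); obj[16:48, 16:32] = True
hole = np.zeros_like(obj); hole[16:48, 32:36] = True
# Per-video mask variants (0/2/4/8 px of inner overlap); roughly, 4 px at
# 512x512 scales as round(4 * width / 512) at other resolutions.
variants = {k: add_inner_overlap(hole, obj, inner_px=k) for k in (0, 2, 4, 8)}
```

The `if inner_px > 0` guards matter: scipy's `binary_dilation` treats `iterations < 1` as "repeat until the result stops changing", which would flood the whole object.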
---
I found this interesting:

Random jitter augmentation for masks means randomly perturbing the training masks; the goal is to make the inpainting model robust to imperfect masks at inference time (which is basically always the case in the real world).

**What kinds of jitter people usually apply**
At every training iteration (or per-batch), randomly apply one or more of the following to the mask:
**Very common and effective recipe** (used in almost every strong video inpainting paper):

```python
import random
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion, shift, gaussian_filter

def jitter_mask(mask):
    # mask: 2D boolean array (True = region to inpaint)
    # 1. Random dilation/erosion (erode up to 8 px, dilate up to 12 px)
    r = random.randint(-8, 12)
    if r > 0:
        mask = binary_dilation(mask, iterations=r)
    elif r < 0:
        mask = binary_erosion(mask, iterations=-r)
    # 2. Random shift (±6 pixels); order=0 keeps the mask binary
    dx = random.randint(-6, 6)
    dy = random.randint(-6, 6)
    mask = shift(mask.astype(np.uint8), (dy, dx), order=0) > 0  # or np.roll
    # 3. A tiny bit of Gaussian blur on the mask edge (optional but nice)
    if random.random() < 0.5:
        mask = gaussian_filter(mask.astype(float), sigma=random.uniform(0.5, 2.0)) > 0.5
    return mask
```

**Why this helps so much**
**Practical numbers that work very well**
**Result you'll see**

So yes: add random jitter augmentation. It's one of the highest-return tricks in the entire training recipe.
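A usage sketch (my own illustration, building on `jitter_mask` above): jitter each frame's mask independently, so the model also sees temporally inconsistent masks, which is what sloppy real-world mattes look like:

```python
import numpy as np

# Toy stand-in for one clip: 8 frames of a 64x64 mask with a square hole.
clip_masks = [np.zeros((64, 64), dtype=bool) for _ in range(8)]
for m in clip_masks:
    m[20:44, 20:44] = True

# Independent per-frame jitter simulates mask jitter across frames.
jittered = [jitter_mask(m) for m in clip_masks]
```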
---
We already use random inner and outer dilation during training. See nunif/iw3/training/inpaint/dataset_video.py, lines 21 to 42 and lines 184 to 187 (at a73146d). There is no option for it, but if needed, options to disable it or control its strength can be added to trainer.py.
---
Here are some features I am testing, with brief explanations:
---
Dear Experts,
---
I posted the same issue a while ago. I tried a lot with different depth-map models, resolutions, etc., all with the same results (bigger or smaller depending on the depth map). Edge dilation makes it bigger. It always happens in the first lines under/above black bars, but also on a larger "black bar" within a scene.




Uh oh!
There was an error while loading. Please reload this page.
---
First of all, thank you for making iw3! My question is about inpaint model training. I created a custom script for generating masks so I could use DepthCrafter maps. The script allows me to adjust the divergence, convergence, and both inner and outer dilation for the created masks. Do you think it would benefit the model training to have a small amount of inner dilation that overlaps the object's outer edge?
I created thousands of short 512×512 videos using the Kubric MOVi-E dataset generator. Hopefully using both synthetic and real-world video samples will improve the model!
Here is a visualization of the mask slightly overlapping the objects.
