Releases: john-rocky/CoreML-Models

MoGe-2 ViT-B Normal (504×504, FP16)

08 Apr 02:49
46808c8

MoGe-2 ViT-B + normal (CVPR 2025 Oral) — monocular 3D geometry estimation.

Single CoreML mlpackage (~200 MB FP16, 504×504 fixed input). Predicts metric depth, surface normals, and confidence mask from a DINOv2 ViT-B/14 backbone.

Variant: Ruicheng/moge-2-vitb-normal (104M params)

Output        Shape              Description
points        (1, 504, 504, 3)   3D point map (affine, exp-remapped)
depth         (1, 504, 504)      Raw depth (multiply by metric_scale for meters)
normal        (1, 504, 504, 3)   Surface normals, L2-normalized
mask          (1, 504, 504)      Confidence mask (sigmoid, threshold at 0.5)
metric_scale  (1,)               Scalar to convert raw depth → metric meters
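The depth, metric_scale, and mask outputs combine as described above; here is a minimal numpy sketch of that post-processing (the function name and the toy array values are illustrative, not part of the release):

```python
import numpy as np

def metric_depth(depth, metric_scale, mask, threshold=0.5):
    """Convert raw depth to meters and invalidate low-confidence pixels."""
    meters = depth * metric_scale        # raw depth -> metric meters
    valid = mask >= threshold            # mask already holds sigmoid probabilities
    return np.where(valid, meters, np.nan)

# Toy 2x2 "image" standing in for the (504, 504) model outputs
depth = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[0.9, 0.2], [0.6, 0.8]])
out = metric_depth(depth, 2.0, mask)
```

Pixels below the 0.5 confidence threshold come back as NaN so downstream code can skip them.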

See sample_apps/MoGe2Demo/ for the iOS demo app.

Hyper-SD v1 — Single-Step SD1.5 Text-to-Image

06 Apr 05:34
f5cc1c9

4 CoreML models for Hyper-SD 1-step text-to-image (SD1.5 base + ByteDance Hyper-SD LoRA fused, 512×512 output, ~947 MB total).

  • HyperSDTextEncoder.mlpackage.zip (216 MB, FP16) — CLIP ViT-L text encoder
  • HyperSDUnetChunk1.mlpackage.zip (310 MB, 6-bit palettized) — Hyper-SD UNet first half (Split-Einsum attention)
  • HyperSDUnetChunk2.mlpackage.zip (290 MB, 6-bit palettized) — Hyper-SD UNet second half
  • HyperSDVAEDecoder.mlpackage.zip (87 MB, FP16) — Stable Diffusion VAE decoder

Single-step generation via TCD scheduler. Runs on iPhone 15+ with Neural Engine.
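As a rough sketch of the single-step idea (not the repo's actual scheduler code): with an epsilon-predicting UNet, one denoising step reduces to the standard epsilon-to-x0 estimate used by DDIM-family schedulers such as TCD. The alpha_bar constant and shapes below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar = 0.9                       # placeholder cumulative alpha at the single step
x0_true = rng.standard_normal((4,))   # pretend clean latent
eps = rng.standard_normal((4,))       # pretend noise (what the UNet would predict)
x_t = np.sqrt(alpha_bar) * x0_true + np.sqrt(1 - alpha_bar) * eps  # forward noising

def one_step_x0(x_t, eps_pred, alpha_bar_t):
    # Standard epsilon -> x0 estimate: x0 = (x_t - sqrt(1 - a) * eps) / sqrt(a)
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

x0_hat = one_step_x0(x_t, eps, alpha_bar)
```

With a perfect noise prediction this recovers the clean latent exactly in one step; the quality of 1-step generation rests entirely on the Hyper-SD-distilled UNet's prediction.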

See conversion_scripts/convert_hypersd.py and sample_apps/HyperSDDemo.

Kokoro-82M CoreML v1

07 Apr 03:30

First CoreML port of hexgrad/Kokoro-82M with on-device bilingual (English + Japanese) free-text input. Contains a Predictor (flexible 1-256 phonemes) plus Decoder buckets (128/256/512 frames), all FP32. Includes the mod-by-1 fix in the iSTFTNet SineGen. See README.md → Text-to-Speech → Kokoro-82M for details.
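A plausible way to select among the 128/256/512-frame decoder buckets: pick the smallest bucket that fits the Predictor's frame count, then pad to that length. The bucket sizes come from the notes above; the selection and padding convention is an assumption, not the repo's documented API:

```python
# Decoder bucket sizes from the release notes; selection logic is a sketch.
BUCKETS = (128, 256, 512)

def pick_bucket(n_frames: int) -> int:
    """Return the smallest decoder bucket that can hold n_frames."""
    for b in BUCKETS:
        if n_frames <= b:
            return b
    raise ValueError(f"{n_frames} frames exceeds the largest bucket {BUCKETS[-1]}")
```

Fixed-size buckets avoid recompiling or re-specializing the decoder for every utterance length at the cost of some padded compute.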

Stable Audio Open Small — CoreML

04 Apr 05:09
b45b00a

CoreML conversion of stabilityai/stable-audio-open-small — text-to-music generation (497M params).

Generates up to 11.9 seconds of stereo 44.1kHz audio from text prompts.

Models

File                       Size    Description
StableAudioT5Encoder       94 MB   T5-base text encoder (INT8)
StableAudioNumberEmbedder  367 KB  Seconds conditioning (FP16)
StableAudioDiT             292 MB  Diffusion transformer (INT8, use cpuAndGPU)
StableAudioDiT_FP32        1.2 GB  Diffusion transformer (FP32 compute, use cpuOnly, best quality)
StableAudioVAEDecoder      138 MB  Oobleck stereo decoder (FP16)

Usage

See StableAudioDemo sample app and convert_stable_audio.py.

Conversion notes

  • DiT FP16 weights cause NaN in attention on iOS GPU → use INT8 or FP32 compute
  • T5 INT8 may produce occasional NaN → sanitize before DiT input
  • DiT FP32 requires cpuOnly (GPU background permission restriction on iOS)
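The "sanitize before DiT input" step from the notes above can be as simple as replacing non-finite values in the T5 embeddings; a sketch (the exact placement in the repo's pipeline is an assumption):

```python
import numpy as np

def sanitize(emb: np.ndarray) -> np.ndarray:
    """Replace NaN/Inf from the INT8 T5 encoder with zeros before the DiT."""
    return np.nan_to_num(emb, nan=0.0, posinf=0.0, neginf=0.0)

clean = sanitize(np.array([1.0, np.nan, -2.0, np.inf]))
```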

SinSR v1 — Single-Step Diffusion Super-Resolution

05 Apr 18:11
2eb6c26

3 CoreML models for SinSR 4x super-resolution (256×256 → 1024×1024).

  • SinSR_Encoder.mlpackage.zip (39 MB, FP16) — VQ-VAE encoder
  • SinSR_Denoiser.mlpackage.zip (420 MB, FP32) — Swin-UNet denoiser
  • SinSR_Decoder.mlpackage.zip (58 MB, FP16) — VQ-VAE decoder with vector quantization

See conversion_scripts/convert_sinsr.py and sample_apps/SinSRDemo.

EfficientAD Anomaly Detection (FP16)

04 Apr 10:05

EfficientAD (PDN-Small) anomaly detection model trained on the MVTec AD bottle category. 256x256 RGB input → anomaly heatmap + image-level score. FP16, 15 MB.

SigLIP ViT-B/16 Zero-Shot Classification (FP16)

03 Apr 16:55
7f6e123

Google SigLIP ViT-B/16 converted to 2 CoreML models (FP16).

v2: FP16 replaces INT8. Contrastive models require FP16 for reliable similarity scoring.

Model                Size    Input                    Output
SigLIP_ImageEncoder  162 MB  224x224 RGB image        L2-normalized 768-dim embedding
SigLIP_TextEncoder   195 MB  SentencePiece token IDs  L2-normalized 768-dim embedding

Scoring: softmax(image_emb · text_emb * 117.33) across labels.
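The scoring rule above, sketched with toy embeddings (label embeddings and values are made up; 117.33 is the logit scale quoted in these notes):

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=117.33):
    logits = text_embs @ image_emb * logit_scale  # cosine sims: inputs are L2-normalized
    e = np.exp(logits - logits.max())             # numerically stable softmax
    return e / e.sum()                            # probabilities across labels

img = np.array([1.0, 0.0])                  # toy L2-normalized image embedding
texts = np.array([[1.0, 0.0],               # toy label embedding: matches the image
                  [0.0, 1.0]])              # toy label embedding: orthogonal
probs = zero_shot_probs(img, texts)
```

The large logit scale makes the softmax nearly one-hot for clearly matching labels.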

Total: 357 MB (zipped). iOS 17+.

License: Apache-2.0

SigLIP ViT-B/16 Zero-Shot Classification (INT8)

03 Apr 14:59
7f6e123

Google SigLIP ViT-B/16 converted to 2 CoreML models (INT8 quantized).

Zero-shot image classification: type any labels and get per-label probabilities.

Model                Size   Input                    Output
SigLIP_ImageEncoder  79 MB  224x224 RGB image        L2-normalized 768-dim embedding
SigLIP_TextEncoder   96 MB  SentencePiece token IDs  L2-normalized 768-dim embedding

Similarity: sigmoid(image_emb · text_emb * 117.33 + (-12.93))
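The pairwise similarity above as code (117.33 and -12.93 are the logit scale and bias quoted in these notes; the toy embeddings are illustrative):

```python
import numpy as np

def siglip_prob(image_emb, text_emb, scale=117.33, bias=-12.93):
    # Per-pair probability: sigmoid(scale * cosine_sim + bias)
    logit = float(image_emb @ text_emb) * scale + bias
    return 1.0 / (1.0 + np.exp(-logit))

match = siglip_prob(np.array([1.0, 0.0]), np.array([1.0, 0.0]))     # identical embeddings
mismatch = siglip_prob(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # orthogonal embeddings
```

Unlike the softmax scoring in the FP16 release, this sigmoid form scores each label independently, so multiple labels can all score high or all score low.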

Total: 175 MB (zipped). iOS 17+.

See SigLIPDemo for a complete SwiftUI sample app.

License: Apache-2.0

RMBG-1.4 Background Removal (INT8)

03 Apr 15:29
7f6e123

BRIA RMBG-1.4 background removal model converted to CoreML (INT8 quantized).

Model     Size            Input                Output
RMBG_1_4  37 MB (zipped)  1024x1024 RGB image  Alpha mask [1, 1, 1024, 1024]

iOS 17+. Use .cpuOnly compute units.

See RMBGDemo for a SwiftUI sample app with photo cutout.
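A minimal sketch of the photo-cutout step: multiply the RGB image by the predicted alpha mask, whose [1, 1, 1024, 1024] layout matches the table above (the toy 2x2 arrays stand in for real model output):

```python
import numpy as np

def cutout(rgb, alpha):
    """rgb: (H, W, 3) in [0, 1]; alpha: (1, 1, H, W) model output in [0, 1]."""
    a = alpha[0, 0][..., None]   # (1, 1, H, W) -> (H, W, 1), broadcast over channels
    return rgb * a

rgb = np.ones((2, 2, 3))                 # toy all-white image
alpha = np.zeros((1, 1, 2, 2))
alpha[0, 0, 0, 0] = 1.0                  # keep only the top-left pixel
out = cutout(rgb, alpha)
```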

License: Creative Commons (same as original)

OpenVoice V2 Voice Conversion Models

03 Apr 07:07
e2e1e21

OpenVoice V2 CoreML models for zero-shot voice conversion: record a source and a target voice, then convert. SpeakerEncoder (1.7 MB) extracts a 256-dim speaker embedding; VoiceConverter (64 MB) converts the voice using the source/target embeddings. License: MIT.