Releases: john-rocky/CoreML-Models
MoGe-2 ViT-B Normal (504×504, FP16)
MoGe-2 ViT-B + normal (CVPR 2025 Oral) — monocular 3D geometry estimation.
Single CoreML mlpackage (~200 MB FP16, 504×504 fixed input). Predicts metric depth, surface normals, and confidence mask from a DINOv2 ViT-B/14 backbone.
Variant: Ruicheng/moge-2-vitb-normal (104M params)
| Output | Shape | Description |
|---|---|---|
| points | (1, 504, 504, 3) | 3D point map (affine, exp-remapped) |
| depth | (1, 504, 504) | Raw depth (multiply by metric_scale for meters) |
| normal | (1, 504, 504, 3) | Surface normals, L2-normalized |
| mask | (1, 504, 504) | Confidence mask (sigmoid, threshold at 0.5) |
| metric_scale | (1,) | Scalar to convert raw depth → metric meters |
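To get meters, scale the raw depth map by the metric_scale scalar. A minimal Swift sketch, assuming the compiled model is bundled and the input feature is named "image" (check the actual feature names in Xcode's model inspector):

```swift
import CoreML

// Hypothetical output-reading helper for MoGe-2. The "image" input name is
// an assumption; "depth" and "metric_scale" match the output table above.
func metricDepth(model: MLModel, image: CVPixelBuffer) throws -> [Float] {
    let input = try MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(pixelBuffer: image)])
    let out = try model.prediction(from: input)
    let depth = out.featureValue(for: "depth")!.multiArrayValue!              // (1, 504, 504)
    let scale = out.featureValue(for: "metric_scale")!.multiArrayValue![0].floatValue
    // Raw depth × metric_scale → meters, per the table above.
    return (0..<depth.count).map { depth[$0].floatValue * scale }
}
```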
See sample_apps/MoGe2Demo/ for the iOS demo app.
Hyper-SD v1 — Single-Step SD1.5 Text-to-Image
4 CoreML models for Hyper-SD 1-step text-to-image (SD1.5 base + ByteDance Hyper-SD LoRA fused, 512×512 output, ~947 MB total).
- HyperSDTextEncoder.mlpackage.zip (216 MB, FP16) — CLIP ViT-L text encoder
- HyperSDUnetChunk1.mlpackage.zip (310 MB, 6-bit palettized) — Hyper-SD UNet first half (Split-Einsum attention)
- HyperSDUnetChunk2.mlpackage.zip (290 MB, 6-bit palettized) — Hyper-SD UNet second half
- HyperSDVAEDecoder.mlpackage.zip (87 MB, FP16) — Stable Diffusion VAE decoder
Single-step generation via TCD scheduler. Runs on iPhone 15+ with Neural Engine.
See conversion_scripts/convert_hypersd.py and sample_apps/HyperSDDemo.
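The UNet ships in two chunks (a common split to fit the Neural Engine memory budget); at inference, chunk 1's outputs feed chunk 2. A loading sketch, assuming the unzipped packages are compiled into the app bundle:

```swift
import CoreML

// Load both Hyper-SD UNet chunks with compute units that can reach the
// Neural Engine, which the Split-Einsum attention layout targets.
func loadHyperSDUNet() throws -> (chunk1: MLModel, chunk2: MLModel) {
    let config = MLModelConfiguration()
    config.computeUnits = .all  // CPU + GPU + ANE
    let url1 = Bundle.main.url(forResource: "HyperSDUnetChunk1", withExtension: "mlmodelc")!
    let url2 = Bundle.main.url(forResource: "HyperSDUnetChunk2", withExtension: "mlmodelc")!
    return (try MLModel(contentsOf: url1, configuration: config),
            try MLModel(contentsOf: url2, configuration: config))
}
```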
Kokoro-82M CoreML v1
First CoreML port of hexgrad/Kokoro-82M with on-device bilingual (English + Japanese) free-text input. Contains a Predictor (flexible 1-256 phonemes) plus Decoder buckets at 128/256/512 frames. FP32. Includes the mod-by-1 fix in the iSTFTNet SineGen. See README.md → Text-to-Speech → Kokoro-82M for details.
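The decoder buckets exist because each decoder has a fixed frame length; at runtime you pick the smallest bucket that fits the predicted frame count and pad up to it. A one-liner sketch of that selection (the pad-to-bucket policy is an assumption; see the README section for the actual pipeline):

```swift
// Pick the smallest decoder bucket (128/256/512 frames, per this release)
// that holds the predicted frame count; nil means the clip is too long.
func decoderBucket(forFrames n: Int) -> Int? {
    [128, 256, 512].first { $0 >= n }
}
```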
Stable Audio Open Small — CoreML
CoreML conversion of stabilityai/stable-audio-open-small — text-to-music generation (497M params).
Generates up to 11.9 seconds of stereo 44.1 kHz audio from text prompts.
Models
| File | Size | Description |
|---|---|---|
| StableAudioT5Encoder | 94 MB | T5-base text encoder (INT8) |
| StableAudioNumberEmbedder | 367 KB | Seconds conditioning (FP16) |
| StableAudioDiT | 292 MB | Diffusion transformer (INT8, use cpuAndGPU) |
| StableAudioDiT_FP32 | 1.2 GB | Diffusion transformer (FP32 compute, use cpuOnly, best quality) |
| StableAudioVAEDecoder | 138 MB | Oobleck stereo decoder (FP16) |
Usage
See StableAudioDemo sample app and convert_stable_audio.py.
Conversion notes
- DiT FP16 weights cause NaN in attention on iOS GPU → use INT8 or FP32 compute
- T5 INT8 may produce occasional NaN → sanitize before DiT input (see the sketch below)
- DiT FP32 requires cpuOnly (GPU background permission restriction on iOS)
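A minimal sketch of that sanitization step, assuming the T5 embedding comes back as a float32 MLMultiArray:

```swift
import CoreML

// Scrub NaN (the failure mode noted above) from the T5 encoder output
// in place before handing it to the DiT. Assumes .float32 storage.
func sanitize(_ array: MLMultiArray) {
    let ptr = array.dataPointer.bindMemory(to: Float32.self, capacity: array.count)
    for i in 0..<array.count where ptr[i].isNaN {
        ptr[i] = 0
    }
}
```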
SinSR v1 — Single-Step Diffusion Super-Resolution
3 CoreML models for SinSR 4x super-resolution (256×256 → 1024×1024).
- SinSR_Encoder.mlpackage.zip (39 MB, FP16) — VQ-VAE encoder
- SinSR_Denoiser.mlpackage.zip (420 MB, FP32) — Swin-UNet denoiser
- SinSR_Decoder.mlpackage.zip (58 MB, FP16) — VQ-VAE decoder with vector quantization
See conversion_scripts/convert_sinsr.py and sample_apps/SinSRDemo.
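The three models run as a fixed chain: encode to latents, one denoising pass, decode back to pixels. A heavily hedged sketch; every feature name below is an assumption, and the real denoiser may take extra inputs (e.g., a timestep), so treat convert_sinsr.py as the source of truth:

```swift
import CoreML

// Hypothetical 3-stage SinSR chain: encoder → denoiser → decoder.
// All feature names ("image", "latent", "denoised") are assumptions.
func superResolve(encoder: MLModel, denoiser: MLModel, decoder: MLModel,
                  image: CVPixelBuffer) throws -> MLFeatureProvider {
    let enc = try encoder.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(pixelBuffer: image)]))
    let den = try denoiser.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["latent": enc.featureValue(for: "latent")!]))
    return try decoder.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["latent": den.featureValue(for: "denoised")!]))
}
```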
EfficientAD Anomaly Detection (FP16)
EfficientAD (PDN-Small) anomaly detection model trained on the MVTec AD bottle category. 256x256 RGB → anomaly heatmap + score. FP16, 15 MB.
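A minimal inference sketch; the output names ("heatmap", "score") and the 0.5 cutoff are assumptions to illustrate the flow, so calibrate the threshold on your own data:

```swift
import CoreML

// Run EfficientAD and flag the image as anomalous. Output names and the
// threshold are hypothetical; check the mlpackage and tune per category.
func isAnomalous(model: MLModel, image: CVPixelBuffer) throws -> Bool {
    let out = try model.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(pixelBuffer: image)]))
    let score = out.featureValue(for: "score")!.multiArrayValue![0].floatValue
    return score > 0.5
}
```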
SigLIP ViT-B/16 Zero-Shot Classification (FP16)
Google SigLIP ViT-B/16 converted to 2 CoreML models (FP16).
v2: FP16 replaces INT8. Contrastive models require FP16 for reliable similarity scoring.
| Model | Size | Input | Output |
|---|---|---|---|
| SigLIP_ImageEncoder | 162 MB | 224x224 RGB image | L2-normalized 768-dim embedding |
| SigLIP_TextEncoder | 195 MB | SentencePiece token IDs | L2-normalized 768-dim embedding |
Scoring: softmax(image_emb · text_emb * 117.33) across labels.
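That scoring rule in plain Swift, over embeddings already extracted as [Float] (a sketch; the max-subtraction is just the standard numerically stable softmax):

```swift
import Foundation

// Zero-shot scores: dot each text embedding with the image embedding,
// scale by 117.33 (SigLIP's logit scale), softmax across labels.
func labelProbabilities(image: [Float], texts: [[Float]]) -> [Float] {
    let logits = texts.map { t in
        zip(image, t).reduce(Float(0)) { $0 + $1.0 * $1.1 } * 117.33
    }
    let m = logits.max() ?? 0
    let exps = logits.map { expf($0 - m) }  // subtract max for stability
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}
```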
Total: 357 MB (zipped). iOS 17+.
License: Apache-2.0
SigLIP ViT-B/16 Zero-Shot Classification (INT8)
Google SigLIP ViT-B/16 converted to 2 CoreML models (INT8 quantized).
Zero-shot image classification: type any labels and get per-label probabilities.
| Model | Size | Input | Output |
|---|---|---|---|
| SigLIP_ImageEncoder | 79 MB | 224x224 RGB image | L2-normalized 768-dim embedding |
| SigLIP_TextEncoder | 96 MB | SentencePiece token IDs | L2-normalized 768-dim embedding |
Similarity: sigmoid(image_emb · text_emb * 117.33 - 12.93)
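The same formula as a small helper (a sketch over already-extracted [Float] embeddings):

```swift
import Foundation

// SigLIP pairwise score: sigmoid of the scaled dot product, using the
// released scale (117.33) and bias (-12.93) from the line above.
func similarity(image: [Float], text: [Float]) -> Float {
    let dot = zip(image, text).reduce(Float(0)) { $0 + $1.0 * $1.1 }
    return 1 / (1 + expf(-(dot * 117.33 - 12.93)))
}
```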
Total: 175 MB (zipped). iOS 17+.
See SigLIPDemo for a complete SwiftUI sample app.
License: Apache-2.0
RMBG-1.4 Background Removal (INT8)
BRIA RMBG-1.4 background removal model converted to CoreML (INT8 quantized).
| Model | Size | Input | Output |
|---|---|---|---|
| RMBG_1_4 | 37 MB (zipped) | 1024x1024 RGB image | Alpha mask [1, 1, 1024, 1024] |
iOS 17+. Use .cpuOnly compute units.
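Loading sketch for that compute-unit requirement (the bundled file name is an assumption):

```swift
import CoreML

// RMBG-1.4 must run with .cpuOnly compute units, per the note above.
func loadRMBG() throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuOnly
    let url = Bundle.main.url(forResource: "RMBG_1_4", withExtension: "mlmodelc")!
    return try MLModel(contentsOf: url, configuration: config)
}
```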
See RMBGDemo for a SwiftUI sample app with photo cutout.
License: Creative Commons (same as original)
OpenVoice V2 Voice Conversion Models
OpenVoice V2 CoreML models for zero-shot voice conversion: record a source clip and a target voice, then convert. SpeakerEncoder (1.7 MB) extracts a 256-dim speaker embedding; VoiceConverter (64 MB) converts the source voice using the source/target embeddings. License: MIT.
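A sketch of the two-stage flow, with every feature name a placeholder (check the actual models):

```swift
import CoreML

// Hypothetical OpenVoice V2 flow: embed source and target voices with
// SpeakerEncoder (256-dim each), then convert the source audio with
// VoiceConverter. All feature names here are assumptions.
func convertVoice(encoder: MLModel, converter: MLModel,
                  source: MLFeatureValue, target: MLFeatureValue) throws -> MLFeatureProvider {
    let srcEmb = try encoder.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["audio": source])).featureValue(for: "embedding")!
    let tgtEmb = try encoder.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["audio": target])).featureValue(for: "embedding")!
    return try converter.prediction(from: MLDictionaryFeatureProvider(dictionary: [
        "audio": source,
        "source_embedding": srcEmb,
        "target_embedding": tgtEmb
    ]))
}
```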