Releases: john-rocky/CoreML-Models
MoGe-2 ViT-B Normal (504×504, FP16)
MoGe-2 ViT-B + normal (CVPR 2025 Oral) — monocular 3D geometry estimation.
Single CoreML mlpackage (~200 MB FP16, 504×504 fixed input). Predicts metric depth, surface normals, and confidence mask from a DINOv2 ViT-B/14 backbone.
Variant: Ruicheng/moge-2-vitb-normal (104M params)
| Output | Shape | Description |
|---|---|---|
| points | (1, 504, 504, 3) | 3D point map (affine, exp-remapped) |
| depth | (1, 504, 504) | Raw depth (multiply by metric_scale for meters) |
| normal | (1, 504, 504, 3) | Surface normals, L2-normalized |
| mask | (1, 504, 504) | Confidence mask (sigmoid, threshold at 0.5) |
| metric_scale | (1,) | Scalar to convert raw depth → metric meters |
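To get meters, scale the raw depth map by the metric_scale scalar. A minimal Swift sketch, assuming the compiled model is bundled and the input feature is named "image" (check the actual feature names in Xcode's model inspector):

```swift
import CoreML

// Hypothetical output-reading helper for MoGe-2. The "image" input name is
// an assumption; "depth" and "metric_scale" match the output table above.
func metricDepth(model: MLModel, image: CVPixelBuffer) throws -> [Float] {
    let input = try MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(pixelBuffer: image)])
    let out = try model.prediction(from: input)
    let depth = out.featureValue(for: "depth")!.multiArrayValue!              // (1, 504, 504)
    let scale = out.featureValue(for: "metric_scale")!.multiArrayValue![0].floatValue
    // Raw depth × metric_scale → meters, per the table above.
    return (0..<depth.count).map { depth[$0].floatValue * scale }
}
```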
See sample_apps/MoGe2Demo/ for the iOS demo app.
Hyper-SD v1 — Single-Step SD1.5 Text-to-Image
4 CoreML models for Hyper-SD 1-step text-to-image (SD1.5 base + ByteDance Hyper-SD LoRA fused, 512×512 output, ~947 MB total).
- HyperSDTextEncoder.mlpackage.zip (216 MB, FP16) — CLIP ViT-L text encoder
- HyperSDUnetChunk1.mlpackage.zip (310 MB, 6-bit palettized) — Hyper-SD UNet first half (Split-Einsum attention)
- HyperSDUnetChunk2.mlpackage.zip (290 MB, 6-bit palettized) — Hyper-SD UNet second half
- HyperSDVAEDecoder.mlpackage.zip (87 MB, FP16) — Stable Diffusion VAE decoder
Single-step generation via TCD scheduler. Runs on iPhone 15+ with Neural Engine.
See conversion_scripts/convert_hypersd.py and sample_apps/HyperSDDemo.
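The UNet ships in two chunks (a common split to fit the Neural Engine memory budget); at inference, chunk 1's outputs feed chunk 2. A loading sketch, assuming the unzipped packages are compiled into the app bundle:

```swift
import CoreML

// Load both Hyper-SD UNet chunks with compute units that can reach the
// Neural Engine, which the Split-Einsum attention layout targets.
func loadHyperSDUNet() throws -> (chunk1: MLModel, chunk2: MLModel) {
    let config = MLModelConfiguration()
    config.computeUnits = .all  // CPU + GPU + ANE
    let url1 = Bundle.main.url(forResource: "HyperSDUnetChunk1", withExtension: "mlmodelc")!
    let url2 = Bundle.main.url(forResource: "HyperSDUnetChunk2", withExtension: "mlmodelc")!
    return (try MLModel(contentsOf: url1, configuration: config),
            try MLModel(contentsOf: url2, configuration: config))
}
```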
Kokoro-82M CoreML v1
First CoreML port of hexgrad/Kokoro-82M with on-device bilingual (English + Japanese) free-text input. Contains a Predictor (flexible 1-256 phonemes) plus Decoder buckets at 128/256/512 frames. FP32. Includes the mod-by-1 fix in the iSTFTNet SineGen. See README.md → Text-to-Speech → Kokoro-82M for details.
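The decoder buckets exist because each decoder has a fixed frame length; at runtime you pick the smallest bucket that fits the predicted frame count and pad up to it. A one-liner sketch of that selection (the pad-to-bucket policy is an assumption; see the README section for the actual pipeline):

```swift
// Pick the smallest decoder bucket (128/256/512 frames, per this release)
// that holds the predicted frame count; nil means the clip is too long.
func decoderBucket(forFrames n: Int) -> Int? {
    [128, 256, 512].first { $0 >= n }
}
```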
Stable Audio Open Small — CoreML
CoreML conversion of stabilityai/stable-audio-open-small — text-to-music generation (497M params).
Generates up to 11.9 seconds of stereo 44.1 kHz audio from text prompts.
Models
| File | Size | Description |
|---|---|---|
| StableAudioT5Encoder | 94 MB | T5-base text encoder (INT8) |
| StableAudioNumberEmbedder | 367 KB | Seconds conditioning (FP16) |
| StableAudioDiT | 292 MB | Diffusion transformer (INT8, use cpuAndGPU) |
| StableAudioDiT_FP32 | 1.2 GB | Diffusion transformer (FP32 compute, use cpuOnly, best quality) |
| StableAudioVAEDecoder | 138 MB | Oobleck stereo decoder (FP16) |
Usage
See StableAudioDemo sample app and convert_stable_audio.py.
Conversion notes
- DiT FP16 weights cause NaN in attention on iOS GPU → use INT8 or FP32 compute
- T5 INT8 may produce occasional NaN → sanitize before DiT input (see the sketch below)
- DiT FP32 requires cpuOnly (GPU background permission restriction on iOS)
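A minimal sketch of that sanitization step, assuming the T5 embedding comes back as a float32 MLMultiArray:

```swift
import CoreML

// Scrub NaN (the failure mode noted above) from the T5 encoder output
// in place before handing it to the DiT. Assumes .float32 storage.
func sanitize(_ array: MLMultiArray) {
    let ptr = array.dataPointer.bindMemory(to: Float32.self, capacity: array.count)
    for i in 0..<array.count where ptr[i].isNaN {
        ptr[i] = 0
    }
}
```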
SinSR v1 — Single-Step Diffusion Super-Resolution
3 CoreML models for SinSR 4x super-resolution (256×256 → 1024×1024).
- SinSR_Encoder.mlpackage.zip (39 MB, FP16) — VQ-VAE encoder
- SinSR_Denoiser.mlpackage.zip (420 MB, FP32) — Swin-UNet denoiser
- SinSR_Decoder.mlpackage.zip (58 MB, FP16) — VQ-VAE decoder with vector quantization
See conversion_scripts/convert_sinsr.py and sample_apps/SinSRDemo.
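The three models run as a fixed chain: encode to latents, one denoising pass, decode back to pixels. A heavily hedged sketch; every feature name below is an assumption, and the real denoiser may take extra inputs (e.g., a timestep), so treat convert_sinsr.py as the source of truth:

```swift
import CoreML

// Hypothetical 3-stage SinSR chain: encoder → denoiser → decoder.
// All feature names ("image", "latent", "denoised") are assumptions.
func superResolve(encoder: MLModel, denoiser: MLModel, decoder: MLModel,
                  image: CVPixelBuffer) throws -> MLFeatureProvider {
    let enc = try encoder.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(pixelBuffer: image)]))
    let den = try denoiser.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["latent": enc.featureValue(for: "latent")!]))
    return try decoder.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["latent": den.featureValue(for: "denoised")!]))
}
```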
EfficientAD Anomaly Detection (FP16)
EfficientAD (PDN-Small) anomaly detection model trained on the MVTec AD bottle category. 256x256 RGB → anomaly heatmap + score. FP16, 15 MB.
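A minimal inference sketch; the output names ("heatmap", "score") and the 0.5 cutoff are assumptions to illustrate the flow, so calibrate the threshold on your own data:

```swift
import CoreML

// Run EfficientAD and flag the image as anomalous. Output names and the
// threshold are hypothetical; check the mlpackage and tune per category.
func isAnomalous(model: MLModel, image: CVPixelBuffer) throws -> Bool {
    let out = try model.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(pixelBuffer: image)]))
    let score = out.featureValue(for: "score")!.multiArrayValue![0].floatValue
    return score > 0.5
}
```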
SigLIP ViT-B/16 Zero-Shot Classification (FP16)
Google SigLIP ViT-B/16 converted to 2 CoreML models (FP16).
v2: FP16 replaces INT8. Contrastive models require FP16 for reliable similarity scoring.
| Model | Size | Input | Output |
|---|---|---|---|
| SigLIP_ImageEncoder | 162 MB | 224x224 RGB image | L2-normalized 768-dim embedding |
| SigLIP_TextEncoder | 195 MB | SentencePiece token IDs | L2-normalized 768-dim embedding |
Scoring: softmax(image_emb · text_emb * 117.33) across labels.
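That scoring rule in plain Swift, over embeddings already extracted as [Float] (a sketch; the max-subtraction is just the standard numerically stable softmax):

```swift
import Foundation

// Zero-shot scores: dot each text embedding with the image embedding,
// scale by 117.33 (SigLIP's logit scale), softmax across labels.
func labelProbabilities(image: [Float], texts: [[Float]]) -> [Float] {
    let logits = texts.map { t in
        zip(image, t).reduce(Float(0)) { $0 + $1.0 * $1.1 } * 117.33
    }
    let m = logits.max() ?? 0
    let exps = logits.map { expf($0 - m) }  // subtract max for stability
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}
```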
Total: 357 MB (zipped). iOS 17+.
License: Apache-2.0
SigLIP ViT-B/16 Zero-Shot Classification (INT8)
Google SigLIP ViT-B/16 converted to 2 CoreML models (INT8 quantized).
Zero-shot image classification: type any labels and get per-label probabilities.
| Model | Size | Input | Output |
|---|---|---|---|
| SigLIP_ImageEncoder | 79 MB | 224x224 RGB image | L2-normalized 768-dim embedding |
| SigLIP_TextEncoder | 96 MB | SentencePiece token IDs | L2-normalized 768-dim embedding |
Similarity: sigmoid(image_emb · text_emb * 117.33 - 12.93)
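The same formula as a small helper (a sketch over already-extracted [Float] embeddings):

```swift
import Foundation

// SigLIP pairwise score: sigmoid of the scaled dot product, using the
// released scale (117.33) and bias (-12.93) from the line above.
func similarity(image: [Float], text: [Float]) -> Float {
    let dot = zip(image, text).reduce(Float(0)) { $0 + $1.0 * $1.1 }
    return 1 / (1 + expf(-(dot * 117.33 - 12.93)))
}
```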
Total: 175 MB (zipped). iOS 17+.
See SigLIPDemo for a complete SwiftUI sample app.
License: Apache-2.0
RMBG-1.4 Background Removal (INT8)
BRIA RMBG-1.4 background removal model converted to CoreML (INT8 quantized).
| Model | Size | Input | Output |
|---|---|---|---|
| RMBG_1_4 | 37 MB (zipped) | 1024x1024 RGB image | Alpha mask [1, 1, 1024, 1024] |
iOS 17+. Use .cpuOnly compute units.
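Loading sketch for that compute-unit requirement (the bundled file name is an assumption):

```swift
import CoreML

// RMBG-1.4 must run with .cpuOnly compute units, per the note above.
func loadRMBG() throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuOnly
    let url = Bundle.main.url(forResource: "RMBG_1_4", withExtension: "mlmodelc")!
    return try MLModel(contentsOf: url, configuration: config)
}
```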
See RMBGDemo for a SwiftUI sample app with photo cutout.
License: Creative Commons (same as original)
OpenVoice V2 Voice Conversion Models
OpenVoice V2 CoreML models for zero-shot voice conversion: record a source clip and a target voice, then convert. SpeakerEncoder (1.7 MB) extracts a 256-dim speaker embedding; VoiceConverter (64 MB) converts the source voice using the source/target embeddings. License: MIT.
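A sketch of the two-stage flow, with every feature name a placeholder (check the actual models):

```swift
import CoreML

// Hypothetical OpenVoice V2 flow: embed source and target voices with
// SpeakerEncoder (256-dim each), then convert the source audio with
// VoiceConverter. All feature names here are assumptions.
func convertVoice(encoder: MLModel, converter: MLModel,
                  source: MLFeatureValue, target: MLFeatureValue) throws -> MLFeatureProvider {
    let srcEmb = try encoder.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["audio": source])).featureValue(for: "embedding")!
    let tgtEmb = try encoder.prediction(from: MLDictionaryFeatureProvider(
        dictionary: ["audio": target])).featureValue(for: "embedding")!
    return try converter.prediction(from: MLDictionaryFeatureProvider(dictionary: [
        "audio": source,
        "source_embedding": srcEmb,
        "target_embedding": tgtEmb
    ]))
}
```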