A tiered GPU memory architecture for consumer AI workloads
Conceived by James Yarber — December 2025. Published as defensive prior art.
Treat VRAM as a cache, not a container.
Large, slower memory holds the full workload while VRAM functions as a high-speed execution cache — staging only what the GPU needs at the moment of computation.
Slow memory holsters it. VRAM draws the gun and pulls the trigger.
Concept / architecture proposal. Not a finished implementation.
This repository documents the Holster Memory concept, its intended behavior, and practical prototype paths. It is meant to describe the architecture clearly enough for discussion, experimentation, and defensive publication.
Consumer GPUs hit a hard VRAM wall. Holster Memory proposes a tiered architecture where the full model lives in larger, slower memory while VRAM acts as a high-speed active set, continuously fed by a scheduler that stages the next needed layers before the GPU asks for them.
Consumer GPUs have a hard VRAM limit.
If a model needs 28 GB on a 24 GB card, it does not degrade gracefully — it simply fails to load. Existing workarounds often force a binary choice:
- buy more expensive hardware
- or not run the workload at all
This is not purely a hardware problem. It is also a software architecture problem.
The current assumption is that the full working set must fit inside VRAM at once. Holster Memory challenges that assumption.
GPU memory management should work like a mechanic and a tool runner.
The mechanic — the GPU — does the actual work.
The tool runner — Holster Memory — holds the full set of tools, anticipates what is needed next, and hands it over before it is asked for. The mechanic never stops to walk back to the toolbox.
In this model, VRAM stops being treated as the entire container for the workload and starts being treated as the fast execution layer.
| Tier | Role | Speed |
|---|---|---|
| Active VRAM | Executing right now — layers in flight | Fast |
| Passive / Holster | Full model loaded, staged, and warm | Medium |
| Cold Storage | Disk — only used on initial load or full swap | Slow |
A scheduler watches which layers are active, prefetches upcoming layers into VRAM before the GPU requests them, and evicts cold data back to the holster.
The goal is simple:
Keep the GPU busy.
Idle time — not raw speed — is the real enemy.
HOLSTER MEMORY: THREE-TIER MODEL

```text
┌────────────────────────────────────────────────────┐
│ COLD STORAGE (Disk)                                │
│ Full model archive / initial load / full swap      │
└────────────────────────────────────────────────────┘
                         │
                         │ load / swap
                         ▼
┌────────────────────────────────────────────────────┐
│ PASSIVE / HOLSTER (RAM or PCIe/CXL)                │
│ Full model loaded, warm, staged, ready to hand off │
└────────────────────────────────────────────────────┘
                         │
                         │ prefetch upcoming layers
                         ▼
┌────────────────────────────────────────────────────┐
│ ACTIVE VRAM (GPU)                                  │
│ Current layers in flight / executing now           │
└────────────────────────────────────────────────────┘
                         │
                         ▼
                GPU compute continues
```

Scheduler goal: overlap transfer with compute so the GPU stays busy instead of waiting on memory.
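The payoff of that overlap can be sketched with simple arithmetic. The per-layer timings below are invented for illustration; the point is that pipelining hides every copy except the first one:

```python
# Illustrative (made-up) per-layer timings in milliseconds.
compute_ms = [10, 10, 10, 10]   # GPU compute time per layer
transfer_ms = [6, 6, 6, 6]      # holster -> VRAM copy time per layer

# Serial: the GPU waits for every copy before computing.
serial = sum(compute_ms) + sum(transfer_ms)

# Overlapped: while layer i computes, layer i+1 is copied in.
# Each step costs max(compute, next transfer); only the first
# copy is fully exposed.
overlapped = transfer_ms[0]
for i in range(len(compute_ms)):
    next_copy = transfer_ms[i + 1] if i + 1 < len(transfer_ms) else 0
    overlapped += max(compute_ms[i], next_copy)

print(serial)      # 64 ms
print(overlapped)  # 46 ms
```

With these numbers the overlapped schedule runs in 46 ms instead of 64 ms; when compute time per layer exceeds transfer time, the copies vanish from the critical path entirely.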
Most GPU memory stalls happen because the system is surprised. The GPU needs data that is not resident, so it stops and waits.
AI inference is unusually well-suited to predictive staging because many of its access patterns are highly structured:
- layers execute sequentially and predictably
- weights are reused across tokens
- the next several steps are usually known in advance
- prefetching can be effective without perfect guessing
That makes AI a better fit for tiered memory scheduling than many other real-time workloads.
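Because layers execute in order, the prediction step can be almost trivial. A minimal sketch (the function name and signature are invented for illustration):

```python
def predict_next_layers(current, depth, num_layers):
    """Sequential-execution heuristic: the next `depth` layer ids, in order.

    Enough for transformer-style inference, where layers run 0, 1, 2, ...
    and the upcoming access pattern is known in advance.
    """
    return list(range(current + 1, min(current + 1 + depth, num_layers)))

print(predict_next_layers(current=5, depth=3, num_layers=32))   # [6, 7, 8]
print(predict_next_layers(current=30, depth=3, num_layers=32))  # [31]
```

A real scheduler would fold in branching workloads and reuse statistics, but for plain sequential inference this heuristic is already close to a 100% prefetch hit rate.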
A model running at 80% speed is still far more useful than a model that will not load at all.
This is the same class of problem that mobile operating systems already solved.
| Mobile OS | Holster Memory |
|---|---|
| Foreground app | Active VRAM |
| Background app | Passive / Holster |
| Killed app | Cold storage |
The key mobile insight is:
Resume instead of restart.
The system keeps everything alive somewhere. It only promotes to the fastest tier what needs to be fast right now.
Holster Memory applies that same logic to consumer AI workloads.
Use system RAM as the holster.
The full model lives in CPU RAM. A scheduler manages the active set in VRAM, prefetches upcoming layers using pinned memory and asynchronous transfers, and overlaps copy with compute so the GPU is not sitting idle waiting on the bus.
This requires no new hardware and could be tested on existing consumer systems.
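A pure-CPU sketch of that loop, using a background thread as a stand-in for pinned-memory asynchronous copies. In a real prototype the copy would run on a CUDA stream from a pinned host buffer; every name below is invented for illustration:

```python
import queue
import threading

NUM_LAYERS = 6

def load_layer(i):
    # Stand-in for a holster -> VRAM copy (pinned memory + DMA in a real system).
    return f"weights_{i}"

def prefetcher(staged):
    # Background thread: stages each layer while the "GPU" works on the previous one.
    for i in range(NUM_LAYERS):
        staged.put((i, load_layer(i)))

def run_model():
    staged = queue.Queue(maxsize=2)   # small staging window: a double buffer
    t = threading.Thread(target=prefetcher, args=(staged,))
    t.start()
    outputs = []
    for _ in range(NUM_LAYERS):
        i, weights = staged.get()     # usually already staged: no stall
        outputs.append((i, weights))  # stand-in for computing layer i
    t.join()
    return outputs

order = [i for i, _ in run_model()]
print(order)  # layers execute in sequence: [0, 1, 2, 3, 4, 5]
```

The bounded queue is the active-set budget in miniature: the prefetcher can only run ahead by the staging window, which is exactly the back-pressure a VRAM-resident active set imposes.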
Use the same hardware, but schedule it intentionally.
Existing frameworks already do some CPU offload. Holster Memory proposes doing that more deliberately, with explicit control over:
- prefetch depth
- pinned buffer management
- active set sizing
- eviction rules
- overlap of transfer and compute
The difference is not the existence of offload. The difference is making it predictable, scheduler-driven, and tuned for AI execution patterns.
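One way to make those controls concrete is a small configuration object. Every field name and default below is illustrative, not an existing framework's API:

```python
from dataclasses import dataclass

@dataclass
class HolsterConfig:
    # All names and defaults are hypothetical, chosen only to show the knobs.
    prefetch_depth: int = 2       # how many upcoming layers to stage ahead
    active_set_layers: int = 8    # max layers resident in VRAM at once
    pinned_buffer_mb: int = 512   # pinned host memory reserved for async copies
    eviction: str = "lru"         # drop least-recently-used layers first
    overlap_streams: int = 1      # copy streams overlapped with compute

cfg = HolsterConfig(prefetch_depth=3)
print(cfg.prefetch_depth, cfg.eviction)  # 3 lru
```

The point of naming these explicitly is that each one becomes tunable and measurable, rather than an implicit side effect of a framework's generic offload path.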
Use PCIe or CXL memory as a true intermediate tier.
A dedicated memory expansion card could provide a holster tier that the GPU reaches faster than it reaches system RAM, yet remains slower than VRAM, by sitting physically closer to the GPU than standard DIMM memory.
This does not need to exist to validate the concept. Options 1 and 2 are enough to prove the architecture. Hardware expansion would simply make the model stronger.
A practical prototype does not need custom hardware. A software-first proof of concept is enough to test the core claim.
Build a scheduler layer around an existing local inference stack that:
- keeps the full model in system RAM
- loads only an active set of layers into VRAM
- prefetches the next likely layers during compute
- evicts cold layers after use
- measures stall time, throughput, and memory pressure
The goal is not perfect speed. The goal is proving graceful degradation instead of hard failure.
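A toy version of the scheduler's core data structure shows the graceful-degradation claim directly: a 28 GB model, modeled here as 28 one-GB layers (sizes invented), completes a pass within a 24 GB budget instead of failing to load:

```python
from collections import OrderedDict

class ActiveSet:
    """Toy VRAM active set with LRU eviction (all sizes in GB, invented)."""
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.resident = OrderedDict()   # layer_id -> size_gb, oldest first
        self.evictions = 0

    def fetch(self, layer_id, size_gb):
        if layer_id in self.resident:
            self.resident.move_to_end(layer_id)   # already resident: just warm it
            return
        while sum(self.resident.values()) + size_gb > self.capacity:
            self.resident.popitem(last=False)     # evict coldest back to the holster
            self.evictions += 1
        self.resident[layer_id] = size_gb

# 28 GB model on a 24 GB card: it runs anyway.
vram = ActiveSet(capacity_gb=24)
for layer in range(28):
    vram.fetch(layer, size_gb=1)   # stage, compute, move on
print(vram.evictions)  # 4 layers cycled back to the holster
```

Four extra transfers is the entire cost of the overflow in this toy run; the hard failure mode never appears.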
Add visibility into:
- transfer timing
- VRAM residency
- prefetch hit rate
- eviction behavior
- GPU idle time
- effective throughput versus baseline offload
This is where the idea becomes testable rather than purely conceptual.
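A minimal sketch of the counters such a harness might keep; all fields and the numbers in the demo run are illustrative:

```python
class SchedulerStats:
    """Toy instrumentation for a Holster-style prototype (fields hypothetical)."""
    def __init__(self):
        self.prefetch_hits = 0     # layer already resident when requested
        self.prefetch_misses = 0   # GPU had to stall for a copy
        self.gpu_idle_ms = 0.0     # accumulated stall time
        self.busy_ms = 0.0         # accumulated compute time

    def record(self, hit, compute_ms, stall_ms=0.0):
        if hit:
            self.prefetch_hits += 1
        else:
            self.prefetch_misses += 1
            self.gpu_idle_ms += stall_ms
        self.busy_ms += compute_ms

    @property
    def hit_rate(self):
        total = self.prefetch_hits + self.prefetch_misses
        return self.prefetch_hits / total if total else 0.0

stats = SchedulerStats()
for i in range(10):
    stats.record(hit=(i != 0), compute_ms=10, stall_ms=6)  # only layer 0 misses

print(stats.hit_rate)     # 0.9
print(stats.gpu_idle_ms)  # 6.0
```

Hit rate and idle time together answer the benchmark question: a high hit rate with near-zero idle means the scheduler is hiding the slower tier, not just tolerating it.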
Compare Holster-style scheduling against:
- full-fit VRAM execution
- standard CPU offload
- reduced context / reduced model fallback
- smaller quantized alternatives
This helps answer the real question: not “Is it free?” but “Is it better than failing to load?”
If the software prototype validates the idea, the next step is experimenting with:
- faster host memory paths
- PCIe-based memory expansion
- CXL-style intermediate tiers
- specialized schedulers for inference versus training
If Holster Memory works well, several things change for consumer AI:
- 24 GB VRAM may behave more like 48–64 GB of effective capacity for many workloads
- “Out of memory” becomes graceful degradation instead of hard failure
- mid-tier consumer cards remain useful longer
- local AI becomes viable for more people without requiring flagship hardware
- the VRAM arms race becomes less absolute for end users
The goal is not magic. The goal is better utilization of memory tiers that already exist.
Holster Memory is not:
- a claim that all workloads will run at full speed
- a claim that gaming is the ideal target
- a requirement for brand-new hardware
- a claim that cache hierarchies, paging, or prefetching are new inventions
The novelty is not in the individual components.
The novelty is in the application:
bringing intentional, predictive, consumer-facing memory tiering to GPU workloads for AI inference and training, packaged as a scheduler layer that existing frameworks can adopt.
Right now, local AI users often face a cliff:
- either the model fits
- or it does not
Holster Memory suggests a third path:
- the full model exists in a slower tier
- only the active working set lives in VRAM
- scheduling bridges the gap
That reframes VRAM from “how much model can I fit?” to “how much active execution can I sustain efficiently?”
That is a different way to think about local AI hardware.
This concept was developed and named in December 2025 and is being published as a defensive publication.
The purpose of this document is to establish a public, timestamped record of the idea so that it remains available for open implementation and cannot be cleanly locked up later by a patent filing that would block others from building it.
No claim is made to ownership of the underlying general techniques such as:
- cache hierarchies
- memory paging
- prefetching
- tiered memory systems
- unified memory concepts
Those are long-established ideas.
The claim here is to this specific framing and application:
- the active / passive VRAM model
- the Holster Memory metaphor and naming
- the application of predictive tiered scheduling to consumer AI workloads
James Yarber
GitHub: @junkyard22
Mount Sterling, KY