A tiered GPU memory architecture for consumer AI workloads
Conceived by James Yarber — December 2025. Published as defensive prior art.
Treat VRAM as a cache, not a container.
Large, slower memory holds the full workload while VRAM functions as a high-speed execution cache — staging only what the GPU needs at the moment of computation.
Slow memory holsters it. VRAM draws the gun and pulls the trigger.
Concept / architecture proposal. Not a finished implementation.
This repository documents the Holster Memory concept, its intended behavior, and practical prototype paths. It is meant to describe the architecture clearly enough for discussion, experimentation, and defensive publication.
Consumer GPUs hit a hard VRAM wall. Holster Memory proposes a tiered architecture where the full model lives in larger, slower memory while VRAM acts as a high-speed active set, continuously fed by a scheduler that stages the next needed layers before the GPU asks for them.
Consumer GPUs have a hard VRAM limit.
If a model needs 28 GB on a 24 GB card, it does not degrade gracefully — it simply fails to load. Existing workarounds often force a binary choice:
- buy more expensive hardware
- or not run the workload at all
This is not purely a hardware problem. It is also a software architecture problem.
The current assumption is that the full working set must fit inside VRAM at once. Holster Memory challenges that assumption.
GPU memory management should work like a mechanic and a tool runner.
The mechanic — the GPU — does the actual work.
The tool runner — Holster Memory — holds the full set of tools, anticipates what is needed next, and hands it over before it is asked for. The mechanic never stops to walk back to the toolbox.
In this model, VRAM stops being treated as the entire container for the workload and starts being treated as the fast execution layer.
| Tier | Role | Speed |
|---|---|---|
| Active VRAM | Executing right now — layers in flight | Fast |
| Passive / Holster | Full model loaded, staged, and warm | Medium |
| Cold Storage | Disk — only used on initial load or full swap | Slow |
A scheduler watches which layers are active, prefetches upcoming layers into VRAM before the GPU requests them, and evicts cold data back to the holster.
The goal is simple:
Keep the GPU busy.
Idle time — not raw speed — is the real enemy.
HOLSTER MEMORY: THREE-TIER MODEL

```text
┌────────────────────────────────────────────────────┐
│ COLD STORAGE (Disk)                                │
│ Full model archive / initial load / full swap      │
└────────────────────────────────────────────────────┘
                         │
                         │ load / swap
                         ▼
┌────────────────────────────────────────────────────┐
│ PASSIVE / HOLSTER (RAM or PCIe/CXL)                │
│ Full model loaded, warm, staged, ready to hand off │
└────────────────────────────────────────────────────┘
                         │
                         │ prefetch upcoming layers
                         ▼
┌────────────────────────────────────────────────────┐
│ ACTIVE VRAM (GPU)                                  │
│ Current layers in flight / executing now           │
└────────────────────────────────────────────────────┘
                         │
                         ▼
                GPU compute continues
```

Scheduler goal: overlap transfer with compute so the GPU stays busy instead of waiting on memory.
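The payoff of that overlap can be sketched with simple arithmetic. The per-layer timings below are invented for illustration; the point is that pipelining hides every copy except the first one:

```python
# Illustrative (made-up) per-layer timings in milliseconds.
compute_ms = [10, 10, 10, 10]   # GPU compute time per layer
transfer_ms = [6, 6, 6, 6]      # holster -> VRAM copy time per layer

# Serial: the GPU waits for every copy before computing.
serial = sum(compute_ms) + sum(transfer_ms)

# Overlapped: while layer i computes, layer i+1 is copied in.
# Each step costs max(compute, next transfer); only the first
# copy is fully exposed.
overlapped = transfer_ms[0]
for i in range(len(compute_ms)):
    next_copy = transfer_ms[i + 1] if i + 1 < len(transfer_ms) else 0
    overlapped += max(compute_ms[i], next_copy)

print(serial)      # 64 ms
print(overlapped)  # 46 ms
```

With these numbers the overlapped schedule runs in 46 ms instead of 64 ms; when compute time per layer exceeds transfer time, the copies vanish from the critical path entirely.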
Most GPU memory stalls happen because the system is surprised. The GPU needs data that is not resident, so it stops and waits.
AI inference is unusually well-suited to predictive staging because many of its access patterns are highly structured:
- layers execute sequentially and predictably
- weights are reused across tokens
- the next several steps are usually known in advance
- prefetching can be effective without perfect guessing
That makes AI a better fit for tiered memory scheduling than many other real-time workloads.
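Because layers execute in order, the prediction step can be almost trivial. A minimal sketch (the function name and signature are invented for illustration):

```python
def predict_next_layers(current, depth, num_layers):
    """Sequential-execution heuristic: the next `depth` layer ids, in order.

    Enough for transformer-style inference, where layers run 0, 1, 2, ...
    and the upcoming access pattern is known in advance.
    """
    return list(range(current + 1, min(current + 1 + depth, num_layers)))

print(predict_next_layers(current=5, depth=3, num_layers=32))   # [6, 7, 8]
print(predict_next_layers(current=30, depth=3, num_layers=32))  # [31]
```

A real scheduler would fold in branching workloads and reuse statistics, but for plain sequential inference this heuristic is already close to a 100% prefetch hit rate.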
A model running at 80% speed is still far more useful than a model that will not load at all.
This is the same class of problem that mobile operating systems already solved.
| Mobile OS | Holster Memory |
|---|---|
| Foreground app | Active VRAM |
| Background app | Passive / Holster |
| Killed app | Cold storage |
The key mobile insight is:
Resume instead of restart.
The system keeps everything alive somewhere. It only promotes to the fastest tier what needs to be fast right now.
Holster Memory applies that same logic to consumer AI workloads.
Use system RAM as the holster.
The full model lives in CPU RAM. A scheduler manages the active set in VRAM, prefetches upcoming layers using pinned memory and asynchronous transfers, and overlaps copy with compute so the GPU is not sitting idle waiting on the bus.
This requires no new hardware and could be tested on existing consumer systems.
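A pure-CPU sketch of that loop, using a background thread as a stand-in for pinned-memory asynchronous copies. In a real prototype the copy would run on a CUDA stream from a pinned host buffer; every name below is invented for illustration:

```python
import queue
import threading

NUM_LAYERS = 6

def load_layer(i):
    # Stand-in for a holster -> VRAM copy (pinned memory + DMA in a real system).
    return f"weights_{i}"

def prefetcher(staged):
    # Background thread: stages each layer while the "GPU" works on the previous one.
    for i in range(NUM_LAYERS):
        staged.put((i, load_layer(i)))

def run_model():
    staged = queue.Queue(maxsize=2)   # small staging window: a double buffer
    t = threading.Thread(target=prefetcher, args=(staged,))
    t.start()
    outputs = []
    for _ in range(NUM_LAYERS):
        i, weights = staged.get()     # usually already staged: no stall
        outputs.append((i, weights))  # stand-in for computing layer i
    t.join()
    return outputs

order = [i for i, _ in run_model()]
print(order)  # layers execute in sequence: [0, 1, 2, 3, 4, 5]
```

The bounded queue is the active-set budget in miniature: the prefetcher can only run ahead by the staging window, which is exactly the back-pressure a VRAM-resident active set imposes.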
Use the same hardware, but schedule it intentionally.
Existing frameworks already do some CPU offload. Holster Memory proposes doing that more deliberately, with explicit control over:
- prefetch depth
- pinned buffer management
- active set sizing
- eviction rules
- overlap of transfer and compute
The difference is not the existence of offload. The difference is making it predictable, scheduler-driven, and tuned for AI execution patterns.
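One way to make those controls concrete is a small configuration object. Every field name and default below is illustrative, not an existing framework's API:

```python
from dataclasses import dataclass

@dataclass
class HolsterConfig:
    # All names and defaults are hypothetical, chosen only to show the knobs.
    prefetch_depth: int = 2       # how many upcoming layers to stage ahead
    active_set_layers: int = 8    # max layers resident in VRAM at once
    pinned_buffer_mb: int = 512   # pinned host memory reserved for async copies
    eviction: str = "lru"         # drop least-recently-used layers first
    overlap_streams: int = 1      # copy streams overlapped with compute

cfg = HolsterConfig(prefetch_depth=3)
print(cfg.prefetch_depth, cfg.eviction)  # 3 lru
```

The point of naming these explicitly is that each one becomes tunable and measurable, rather than an implicit side effect of a framework's generic offload path.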
Use PCIe or CXL memory as a true intermediate tier.
A dedicated memory expansion card could provide a holster tier that the GPU reaches faster than it reaches system RAM, yet remains slower than VRAM, by sitting physically closer to the GPU than standard DIMM memory.
This does not need to exist to validate the concept. Options 1 and 2 are enough to prove the architecture. Hardware expansion would simply make the model stronger.
A practical prototype does not need custom hardware. A software-first proof of concept is enough to test the core claim.
Build a scheduler layer around an existing local inference stack that:
- keeps the full model in system RAM
- loads only an active set of layers into VRAM
- prefetches the next likely layers during compute
- evicts cold layers after use
- measures stall time, throughput, and memory pressure
The goal is not perfect speed. The goal is proving graceful degradation instead of hard failure.
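A toy version of the scheduler's core data structure shows the graceful-degradation claim directly: a 28 GB model, modeled here as 28 one-GB layers (sizes invented), completes a pass within a 24 GB budget instead of failing to load:

```python
from collections import OrderedDict

class ActiveSet:
    """Toy VRAM active set with LRU eviction (all sizes in GB, invented)."""
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.resident = OrderedDict()   # layer_id -> size_gb, oldest first
        self.evictions = 0

    def fetch(self, layer_id, size_gb):
        if layer_id in self.resident:
            self.resident.move_to_end(layer_id)   # already resident: just warm it
            return
        while sum(self.resident.values()) + size_gb > self.capacity:
            self.resident.popitem(last=False)     # evict coldest back to the holster
            self.evictions += 1
        self.resident[layer_id] = size_gb

# 28 GB model on a 24 GB card: it runs anyway.
vram = ActiveSet(capacity_gb=24)
for layer in range(28):
    vram.fetch(layer, size_gb=1)   # stage, compute, move on
print(vram.evictions)  # 4 layers cycled back to the holster
```

Four extra transfers is the entire cost of the overflow in this toy run; the hard failure mode never appears.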
Add visibility into:
- transfer timing
- VRAM residency
- prefetch hit rate
- eviction behavior
- GPU idle time
- effective throughput versus baseline offload
This is where the idea becomes testable rather than purely conceptual.
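A minimal sketch of the counters such a harness might keep; all fields and the numbers in the demo run are illustrative:

```python
class SchedulerStats:
    """Toy instrumentation for a Holster-style prototype (fields hypothetical)."""
    def __init__(self):
        self.prefetch_hits = 0     # layer already resident when requested
        self.prefetch_misses = 0   # GPU had to stall for a copy
        self.gpu_idle_ms = 0.0     # accumulated stall time
        self.busy_ms = 0.0         # accumulated compute time

    def record(self, hit, compute_ms, stall_ms=0.0):
        if hit:
            self.prefetch_hits += 1
        else:
            self.prefetch_misses += 1
            self.gpu_idle_ms += stall_ms
        self.busy_ms += compute_ms

    @property
    def hit_rate(self):
        total = self.prefetch_hits + self.prefetch_misses
        return self.prefetch_hits / total if total else 0.0

stats = SchedulerStats()
for i in range(10):
    stats.record(hit=(i != 0), compute_ms=10, stall_ms=6)  # only layer 0 misses

print(stats.hit_rate)     # 0.9
print(stats.gpu_idle_ms)  # 6.0
```

Hit rate and idle time together answer the benchmark question: a high hit rate with near-zero idle means the scheduler is hiding the slower tier, not just tolerating it.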
Compare Holster-style scheduling against:
- full-fit VRAM execution
- standard CPU offload
- reduced context / reduced model fallback
- smaller quantized alternatives
This helps answer the real question: not “Is it free?” but “Is it better than failing to load?”
If the software prototype validates the idea, the next step is experimenting with:
- faster host memory paths
- PCIe-based memory expansion
- CXL-style intermediate tiers
- specialized schedulers for inference versus training
If Holster Memory works well, several things change for consumer AI:
- 24 GB VRAM may behave more like 48–64 GB of effective capacity for many workloads
- “Out of memory” becomes graceful degradation instead of hard failure
- mid-tier consumer cards remain useful longer
- local AI becomes viable for more people without requiring flagship hardware
- the VRAM arms race becomes less absolute for end users
The goal is not magic. The goal is better utilization of memory tiers that already exist.
Holster Memory is not:
- a claim that all workloads will run at full speed
- a claim that gaming is the ideal target
- a requirement for brand-new hardware
- a claim that cache hierarchies, paging, or prefetching are new inventions
The novelty is not in the individual components.
The novelty is in the application:
bringing intentional, predictive, consumer-facing memory tiering to GPU workloads for AI inference and training, packaged as a scheduler layer that existing frameworks can adopt.
Right now, local AI users often face a cliff:
- either the model fits
- or it does not
Holster Memory suggests a third path:
- the full model exists in a slower tier
- only the active working set lives in VRAM
- scheduling bridges the gap
That reframes VRAM from “how much model can I fit?” to “how much active execution can I sustain efficiently?”
That is a different way to think about local AI hardware.
This concept was developed and named in December 2025 and is being published as a defensive publication.
The purpose of this document is to establish a public, timestamped record of the idea so that it remains available for open implementation and cannot be cleanly locked up later by a patent filing that would block others from building it.
No claim is made to ownership of the underlying general techniques such as:
- cache hierarchies
- memory paging
- prefetching
- tiered memory systems
- unified memory concepts
Those are long-established ideas.
The claim here is to this specific framing and application:
- the active / passive VRAM model
- the Holster Memory metaphor and naming
- the application of predictive tiered scheduling to consumer AI workloads
James Yarber
GitHub: @junkyard22
Mount Sterling, KY