- Introduction
- Architecture Overview
- Core Components
- Utility Components
- Algorithms & Policies
- Implementation Details
- Testing Strategy
- Future Extensions
The Cache Simulator is a comprehensive simulation tool for studying memory hierarchy behavior in computer systems. It models a two-level cache hierarchy with configurable parameters and advanced features like prefetching, cache coherence, and detailed statistics tracking. The simulator is designed to be highly customizable, allowing researchers and students to experiment with different cache configurations and analyze their performance.
The simulator takes as input a memory access trace that describes a sequence of read and write operations with their corresponding memory addresses. It then simulates how these accesses would flow through the cache hierarchy, tracking hits, misses, replacements, and other events. Various statistics are collected during the simulation, which can be used to evaluate the effectiveness of different cache designs and policies.
This document describes the design of the Cache Simulator, including its architecture, components, algorithms, and implementation details.
The Cache Simulator employs a modular, layered architecture that separates core cache functionality from utilities and user interface components:
┌──────────────────────────────────────────────────┐
│ Applications │
│ (Main Program, Tests, Trace Generator, Scripts) │
└───────────────────────┬──────────────────────────┘
│
┌───────────────────────▼──────────────────────────┐
│ Memory Hierarchy API │
│ (MemoryHierarchy, Configuration, Stats) │
└───────────────────────┬──────────────────────────┘
│
┌───────────────────────▼──────────────────────────┐
│ Core Components │
│ (Cache, StreamBuffer, Prefetchers, MESI, etc.) │
└───────────────────────┬──────────────────────────┘
│
┌───────────────────────▼──────────────────────────┐
│ Utility Components │
│ (Trace Parsing, Logging, Visualization, etc.) │
└──────────────────────────────────────────────────┘
The design follows several key architectural principles:
- Separation of Concerns: Each component has a focused responsibility.
- Modularity: Components can be developed, tested, and replaced independently.
- Extensibility: New policies, algorithms, and features can be easily added.
- Configurability: Extensive runtime customization options are available.
- Modern C++ Design: Leveraging C++20 features for safer, more expressive code.
The foundational data structures in the Cache Simulator are CacheBlock and CacheSet:
-
CacheBlockrepresents a single block in the cache, containing:- Valid and dirty flags
- Tag bits for address matching
- MESI state for cache coherence
- Optional data storage (for functional simulation)
-
CacheSetrepresents a set of blocks in a set-associative cache:- Vector of CacheBlocks
- Replacement policy managed via pluggable
ReplacementPolicyBaseinstances
These structures are designed to be memory-efficient while still providing all necessary state information for accurate simulation.
The Cache class is a central component representing a single cache level (L1 or L2). It manages:
- Cache organization (size, associativity, block size)
- Block lookup and placement
- Replacement policy implementation
- Prefetching integration
- Hit/miss statistics collection
- MESI protocol state transitions
The Cache class provides a clean access() method that handles all the complex operations of a cache access while maintaining detailed statistics.
The MemoryHierarchy class orchestrates the entire cache hierarchy, managing:
- Multiple cache levels (L1, L2, L3) - L3 support added in v1.4.0
- Memory access flow through the hierarchy
- Prefetching configuration and coordination
- Statistics aggregation
- Trace file processing
- Performance metrics calculation
This class serves as the main API for applications interacting with the simulator, providing methods to access memory, process traces, and retrieve statistics.
The MemoryHierarchy now supports an optional L3 cache with:
- Configurable size, associativity, and block size
- Inclusive or non-inclusive policy
- Statistics tracking (hit rate, miss rate)
- Proper access flow: L1 miss → L2 access → L2 miss → L3 access
Abstract CoherenceProtocolBase class with implementations:
- MSIProtocol: 3-state (Modified, Shared, Invalid)
- MESIProtocol: 4-state with Exclusive
- MOESIProtocol: 5-state with Owned for dirty sharing
The InterconnectFactory supports:
- Bus, Crossbar, Mesh (existing)
- RingInterconnect: Bidirectional ring with wrap-around
- TorusInterconnect: 2D torus with minimal path routing
The StreamBuffer class implements a simple hardware prefetching mechanism:
- Prefetches sequential blocks following a cache miss
- Maintains a buffer of prefetched addresses
- Tracks hits in the buffer
- Shifts out consumed entries
- Provides statistics on buffer effectiveness
Stream buffers are a simple but effective prefetching technique that works well for sequential access patterns.
The StridePredictor class implements a more sophisticated prefetching mechanism:
- Detects constant stride patterns in memory accesses
- Maintains a table of recent addresses and strides
- Predicts future accesses based on detected strides
- Provides confidence levels for predictions
- Tracks prediction accuracy statistics
Stride prediction is effective for regular access patterns with non-unit strides, such as array traversals in multidimensional arrays.
The AdaptivePrefetcher class implements an advanced prefetching mechanism:
- Dynamically selects between prefetching strategies
- Adjusts prefetch distance based on effectiveness
- Tracks different metrics to guide adaptation
- Supports multiple concurrently active strategies
- Maintains detailed effectiveness statistics
Adaptive prefetching can deliver good performance across a wider range of access patterns compared to fixed-strategy prefetchers.
The MESIProtocol class implements the Modified-Exclusive-Shared-Invalid cache coherence protocol:
- Manages state transitions for cache blocks
- Handles local reads and writes
- Processes remote cache operations
- Coordinates writebacks for modified blocks
- Tracks protocol state transition statistics
The MESI protocol ensures coherent memory views in systems with multiple caches accessing shared memory.
The PowerModel class provides CACTI-inspired power and energy analysis:
- Dynamic energy calculation (read/write per access)
- Static leakage power with temperature scaling
- Technology node support (7nm, 14nm, 22nm, 32nm, 45nm)
- Energy-Delay Product (EDP) computation
- Detailed power breakdown reporting
The power model enables analysis of energy efficiency across different cache configurations.
The AreaModel class provides silicon area estimation:
- Component-level area breakdown (data array, tag array, decoders, sense amps)
- Technology-specific scaling factors
- Cell efficiency metrics
- Aspect ratio estimation
Area modeling helps evaluate the physical implementation costs of cache designs.
The TraceParser class handles trace file input:
- Parses different trace file formats
- Validates memory access operations
- Provides detailed error information
- Offers both single-operation and bulk parsing
- Uses efficient, streaming file reading
This component decouples trace file handling from the core simulation logic.
The Statistics class manages simulation metrics:
- Collects and organizes various numerical metrics
- Calculates derived statistics
- Provides data export capabilities
- Supports metric comparison between runs
- Offers customizable reporting formats
This utility enables detailed analysis of simulation results.
The Visualization class provides text-based visualization:
- Generates ASCII visualizations of cache state
- Renders histograms and heatmaps of memory access patterns
- Supports colorized output where available
- Exports data for external visualization tools
- Provides progress indicators for long simulations
These visualizations aid in understanding simulator behavior.
The ConfigManager class manages configuration parameters:
- Loads configurations from files (JSON, INI)
- Parses command-line arguments
- Validates configuration parameters
- Generates parameter sweeps for benchmarking
- Supports configuration comparison
This utility simplifies experimentation with different simulator settings.
The Logger class provides diagnostic information:
- Supports multiple log levels
- Offers both file and console output
- Provides thread-safe logging
- Allows component-specific named loggers
- Enables selectively verbose output
Logging helps diagnose issues and understand simulator behavior.
The Benchmark class supports performance measurement:
- Times the execution of simulation functions
- Compares multiple configurations
- Calculates statistical measures of performance
- Provides formatted benchmark reports
- Supports repeated runs for variance analysis
This utility is useful for simulator development and optimization.
The MemoryProfiler class analyzes memory access patterns:
- Tracks memory access frequencies
- Detects common access patterns
- Profiles locality characteristics
- Identifies hot regions in memory
- Provides insights for optimization
This tool helps understand workload behavior and aids in prefetcher design.
The simulator implements several cache block replacement policies:
-
LRU (Least Recently Used): Replaces the block that has not been accessed for the longest time. This is the default policy and is implemented using an ordering vector in each cache set.
-
Pseudo-LRU: A tree-based approximation of LRU that requires less state information but still provides good performance.
-
Random: Randomly selects a block for replacement, which can be effective for certain workloads and eliminates pathological cases that affect deterministic policies.
-
FIFO (First-In-First-Out): Replaces the oldest block regardless of access frequency.
-
NRU (Not Recently Used): A simple approximation of LRU using a single bit per block to indicate recent usage.
The policy can be configured on a per-cache basis.
The simulator supports several prefetching algorithms:
-
Sequential Prefetching: Prefetches a configurable number of sequential blocks after a miss. Implemented by the
StreamBufferclass. -
Stride Prefetching: Detects regular strides in access patterns and prefetches accordingly. Implemented by the
StridePredictorclass. -
Adaptive Prefetching: Dynamically adjusts prefetching strategy and aggressiveness based on effectiveness. Implemented by the
AdaptivePrefetcherclass. -
Next-N-Line Prefetching: A simple variation that prefetches the next N blocks on any miss.
-
PC-based Prefetching: Uses program counter information (if available in the trace) to predict access patterns specific to different code regions.
These algorithms can be combined in various ways and customized with different parameters.
The MESI protocol implements cache coherence:
-
Modified (M): The block is present only in the current cache and has been modified. It must be written back to memory when evicted.
-
Exclusive (E): The block is present only in the current cache and matches memory.
-
Shared (S): The block may be present in other caches and is clean.
-
Invalid (I): The block is not present or valid in the cache.
The protocol handles transitions between these states based on local and remote operations, ensuring coherent memory views across caches.
The simulator leverages modern C++20 features:
-
std::optional: Used for representing potentially missing values, such as nullable parameters and results.
-
std::variant: Used for type-safe unions, particularly in return values that can represent different outcomes.
-
std::string_view: Used for non-owning references to strings, improving performance in text processing.
-
std::filesystem: Used for portable file operations in trace handling and result saving.
-
Structured Bindings: Used to simplify code that returns multiple values, like tag and set index extraction.
-
if constexpr: Used for compile-time conditional code, improving performance in template functions.
-
Concepts and Constraints: Used for clearer template interfaces and better error messages.
-
[[nodiscard]]: Used to ensure return values are properly handled for functions that shouldn't be called for side effects.
-
Ranges Library: Used for cleaner iteration and transformation of collections.
-
Three-way Comparison (spaceship operator): Simplifies comparison operations.
These features improve code safety, clarity, and performance.
Several optimizations are employed:
-
Memory Pool Allocation: Cache blocks are allocated from a memory pool to reduce dynamic allocation overhead.
-
Bit Manipulation: Fast bit operations are used for address decomposition and tag comparison.
-
Vectorized Processing: Common operations are vectorized when processing multiple blocks.
-
Cache-Friendly Data Structures: Data is organized to maximize spatial locality.
-
Custom Hash Functions: Optimized hash functions for address lookup.
-
String Interning: Reuse of common strings to reduce memory usage.
-
Streaming I/O: Efficient trace file reading without loading entire files.
-
Compiler Hints: Strategic use of likely/unlikely hints for branch prediction.
These optimizations significantly improve simulation speed, especially for large traces.
The code is organized following modern C++ practices:
-
Namespace Usage: All components are within the
cachesimnamespace to avoid pollution. -
Header-Only When Appropriate: Small utilities are implemented as header-only for simplicity.
-
Separation of Interface and Implementation: Larger components separate declaration and definition.
-
PIMPL Pattern: Used for components with complex internal details to improve compilation speed.
-
Consistent Naming: Camel case for class names, snake case for methods and variables.
-
Const Correctness: Rigorous use of const for parameters and methods.
-
Forward Declarations: Used to minimize include dependencies.
-
Explicit Memory Management: Clear ownership semantics with smart pointers.
-
Modular Design: Each component has a single responsibility.
-
Documentation: Comprehensive comments for all public interfaces.
This organization enhances maintainability, readability, and compile times.
The simulator employs a multi-level testing approach:
-
Unit Tests: Verify individual components in isolation.
- Test each cache operation separately
- Verify replacement policy behavior
- Validate prefetcher predictions
- Check MESI state transitions
-
Integration Tests: Verify interactions between components.
- Test L1-L2 interactions
- Verify prefetcher-cache coordination
- Check coherence protocol across caches
-
Validation Tests: Verify the simulator against known outcomes.
- Test with hand-crafted traces with predictable results
- Compare against simplified analytical models
- Validate against academic benchmark results
-
Property-Based Tests: Verify invariants and properties.
- Ensure cache coherence is maintained
- Verify inclusion properties when applicable
- Check conservation of blocks
-
Performance Tests: Measure and optimize performance.
- Benchmark with large traces
- Profile memory usage
- Analyze scalability with increasing parameters
This comprehensive testing strategy ensures correctness and robustness.
Planned extensions include:
-
Non-Inclusive Policies: Support for non-inclusive and exclusive cache hierarchies.
-
Advanced Prefetching Algorithms: Implementation of more sophisticated prefetchers like GHB and Markov predictors.
-
GPU Cache Modeling: Extensions for modeling GPU-specific cache architectures.
-
NUMA Topology: Modeling non-uniform memory access patterns.
-
Graphical Interface: Development of a GUI for visualization and configuration.
-
Dynamic Trace Generation: Integration with instruction-level simulators.
-
Machine Learning Integration: Using ML techniques for prefetching and replacement.
-
Network-on-Chip Modeling: Detailed NoC simulation for many-core systems.
These extensions will enhance the simulator's capabilities and keep it relevant for future research.