| 1 | +--- |
| 2 | +globs: **/*.cpp,**/*.h,**/*.cc,**/*.c |
| 3 | +alwaysApply: false |
| 4 | +--- |
| 5 | +# libxlio Architecture Guide |
| 6 | + |
| 7 | +Accelerated IO SW library (XLIO) - High-performance network acceleration library |
| 8 | + |
| 9 | +## Project Overview |
| 10 | + |
| 11 | +libxlio is a high-performance network acceleration library that provides socket API acceleration through InfiniBand/RDMA hardware. The library intercepts standard socket calls and redirects them to accelerated network hardware for improved performance. |
| 12 | + |
| 13 | +**Key Features:** |
| 14 | +- Kernel-bypass architecture for high bandwidth and low CPU usage |
| 15 | +- Hardware-based direct copy between application memory and network interface |
| 16 | +- Support for TLS encryption/decryption with crypto-enabled NVIDIA ConnectX adapters |
| 17 | +- Hardware features: LRO/TSO, Striding-RQ for increased TCP performance |
| 18 | +- Compatible with both POSIX socket and XLIO Ultra APIs |
| 19 | +- No application code changes required |
| 20 | + |
| 21 | +**Technology Stack:** |
| 22 | +- **Language**: C++11 with C compatibility layer |
| 23 | +- **Build System**: Autotools (autoconf/automake) with libtool |
| 24 | +- **Key Dependencies**: OFED, InfiniBand verbs, DPCP, JSON-C, Netlink libraries |
| 25 | +- **Testing**: Google Test framework, unit tests, performance tests |
| 26 | +- **Supported Platforms**: x86_64, ARM |
| 27 | +- **Supported Transports**: IPv4/IPv6, TCP, UDP |
| 28 | + |
| 29 | +## Core Architecture |
| 30 | + |
| 31 | +### Socket Interception Layer |
| 32 | +- **LD_PRELOAD mechanism**: Library intercepts standard socket calls transparently |
| 33 | +- **Socket abstraction hierarchy**: `sockinfo` → `sockinfo_tcp`/`sockinfo_udp`/`sockinfo_ulp` |
| 34 | +- **File descriptor management**: `fd_collection` manages socket lifecycle |
| 35 | +- **Event-driven I/O**: All operations are event-driven through polling groups |
| 36 | + |
| 37 | +### Memory Management Patterns |
| 38 | +- **Buffer pools**: Pre-allocated memory pools for zero-copy operations |
| 39 | +- **Ring buffers**: Circular buffers for data transmission (simple, bonded, slave types) |
| 40 | +- **Huge page optimization**: Automatic huge page allocation for better TLB performance |
| 41 | +- **Zero-copy operations**: Direct memory access between application and network interface |
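To make the buffer pool idea concrete, here is a minimal sketch of a pre-allocated, fixed-size pool. The class `simple_buffer_pool` and its methods are hypothetical names chosen for illustration; they are not the interface of the library's `buffer_pool`, and the sketch uses ordinary heap memory rather than registered or huge-page memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size buffer pool (illustrative only, not XLIO's buffer_pool).
// All memory is allocated once in the constructor, so get()/put() never allocate.
class simple_buffer_pool {
public:
    simple_buffer_pool(std::size_t buf_size, std::size_t count)
        : m_buf_size(buf_size), m_storage(buf_size * count)
    {
        assert(buf_size > 0 && count > 0);
        m_free.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            m_free.push_back(m_storage.data() + i * buf_size);
        }
    }

    // Hands out one buffer, or nullptr when the pool is exhausted.
    std::uint8_t *get()
    {
        if (m_free.empty()) {
            return nullptr;
        }
        std::uint8_t *buf = m_free.back();
        m_free.pop_back();
        return buf;
    }

    // Returns a buffer previously obtained from get().
    void put(std::uint8_t *buf)
    {
        assert(owns(buf));
        m_free.push_back(buf);
    }

    std::size_t available() const { return m_free.size(); }

private:
    // True when buf points at the start of one of this pool's buffers.
    bool owns(const std::uint8_t *buf) const
    {
        const std::uint8_t *begin = m_storage.data();
        const std::uint8_t *end = begin + m_storage.size();
        return buf >= begin && buf < end &&
            static_cast<std::size_t>(buf - begin) % m_buf_size == 0;
    }

    std::size_t m_buf_size;
    std::vector<std::uint8_t> m_storage;
    std::vector<std::uint8_t *> m_free;
};
```

Returning `nullptr` on exhaustion, rather than growing the pool, keeps allocation off the data path. A real pool would also need a policy for refilling or back-pressure.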
| 42 | + |
| 43 | +### Hardware Abstraction Layer |
| 44 | +- **InfiniBand verbs**: Low-level RDMA operations through `ib_ctx_handler` |
| 45 | +- **Device management**: `net_device_entry` and `net_device_table_mgr` for hardware abstraction |
| 46 | +- **Queue management**: CQ (Completion Queue) and QP (Queue Pair) management |
| 47 | +- **Hardware offloading**: TSO, LRO, GRO, RFS for performance optimization |
| 48 | + |
| 49 | +### Data Flow |
| 50 | +1. Application makes standard socket call (socket, bind, connect, send, recv, etc.) |
| 51 | +2. XLIO intercepts via LD_PRELOAD and redirects to accelerated path |
| 52 | +3. Hardware acceleration through InfiniBand/RDMA with zero-copy operations |
| 53 | +4. Direct memory transfer between application buffers and network interface |
| 54 | +5. Event-driven completion handling through polling groups |
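Step 1 refers to ordinary POSIX calls. The sketch below is a plain UDP round trip with nothing XLIO-specific in it, which is the point: the same binary could be started with `LD_PRELOAD=libxlio.so` without source changes. The function name `udp_loopback_roundtrip` is made up for this example, and loopback is used only to keep it self-contained. Whether a given flow is actually offloaded depends on the interface and configuration.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstring>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

// Sends payload to a UDP socket bound on 127.0.0.1 and reads it back.
// Returns the received bytes, or an empty string on any failure.
inline std::string udp_loopback_roundtrip(const std::string &payload)
{
    assert(!payload.empty() && payload.size() <= 1024);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return std::string();
    }

    // Bound the wait in recv() so the example cannot hang.
    timeval tv;
    tv.tv_sec = 1;
    tv.tv_usec = 0;
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; // let the kernel pick a free port
    socklen_t len = sizeof(addr);

    std::string received;
    if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0 &&
        getsockname(fd, reinterpret_cast<sockaddr *>(&addr), &len) == 0 &&
        sendto(fd, payload.data(), payload.size(), 0,
               reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) ==
            static_cast<ssize_t>(payload.size())) {
        char buf[1024];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) {
            received.assign(buf, static_cast<std::size_t>(n));
        }
    }

    close(fd);
    return received;
}
```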
| 55 | + |
| 56 | +### Execution Modes |
| 57 | +- **R2C (Run-to-Completion)**: Single-threaded processing for maximum performance |
| 58 | +- **Worker Threads**: Multi-threaded processing with dedicated worker threads |
| 59 | +- **Polling Groups**: Event management and callback registration for concurrent operations |
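The run-to-completion idea can be sketched as a single thread that repeatedly polls every registered source and handles whatever is ready before moving on. `demo_poll_group` and `run_to_completion` below are hypothetical names for illustration and do not reflect the actual `poll_group` interface.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a polling group: a list of pollers, each of which
// processes its ready events and returns how many it handled.
class demo_poll_group {
public:
    using poller = std::function<int()>;

    void add(poller p)
    {
        assert(p);
        m_pollers.push_back(std::move(p));
    }

    // One iteration: poll every source on the calling thread.
    int poll_once()
    {
        int handled = 0;
        for (auto &p : m_pollers) {
            handled += p();
        }
        return handled;
    }

private:
    std::vector<poller> m_pollers;
};

// Polls on the calling thread until max_idle_iterations consecutive
// iterations find no work. Returns the total number of events handled.
inline int run_to_completion(demo_poll_group &group, int max_idle_iterations)
{
    assert(max_idle_iterations > 0);
    int total = 0;
    int idle = 0;
    while (idle < max_idle_iterations) {
        int handled = group.poll_once();
        if (handled > 0) {
            total += handled;
            idle = 0;
        } else {
            ++idle;
        }
    }
    return total;
}
```

A long-running application would normally loop forever; the idle cutoff is there only so the sketch terminates.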
| 60 | + |
| 61 | +## File Organization |
| 62 | + |
| 63 | +``` |
| 64 | +src/ |
| 65 | +├── core/ # Core library functionality |
| 66 | +│ ├── config/ # Configuration management (JSON schema, registry) |
| 67 | +│ ├── dev/ # Device and hardware abstraction (rings, CQs, buffers) |
| 68 | +│ ├── event/ # Event handling and threading (poll groups, workers) |
| 69 | +│ ├── ib/ # InfiniBand specific code (verbs, MLX5) |
| 70 | +│ ├── iomux/ # I/O multiplexing (select, poll, epoll) |
| 71 | +│ ├── lwip/ # Lightweight IP stack (TCP implementation) |
| 72 | +│ ├── netlink/ # Netlink communication (routing, neighbors) |
| 73 | +│ ├── proto/ # Protocol handling (TCP, UDP, routing, ARP) |
| 74 | +│ ├── sock/ # Socket abstraction layer (sockinfo classes) |
| 75 | +│ └── util/ # Utility functions (memory, sys vars, instrumentation) |
| 76 | +├── stats/ # Statistics and monitoring |
| 77 | +├── state_machine/ # State machine implementations |
| 78 | +├── utils/ # Common utilities |
| 79 | +└── vlogger/ # Logging system |
| 80 | + |
| 81 | +tests/ |
| 82 | +├── unit_tests/ # Unit tests using Google Test |
| 83 | +├── gtest/ # Integration tests |
| 84 | +└── extra_api/ # Extra API tests |
| 85 | + |
| 86 | +docs/ |
| 87 | +├── configuration.md # Build configuration options |
| 88 | +├── coding-style.md # Code style guidelines |
| 89 | +└── contributing.md # Contribution guidelines |
| 90 | +``` |
| 91 | + |
| 92 | +## Key Terminology |
| 93 | + |
| 94 | +### Core Components |
| 95 | +- **Ring**: Circular buffer for data transmission, can be simple, bonded, or slave |
| 96 | +- **CQ (Completion Queue)**: Hardware queue for completion notifications |
| 97 | +- **QP (Queue Pair)**: InfiniBand communication endpoint |
| 98 | +- **RFS (Receive Flow Steering)**: Hardware-based packet steering |
| 99 | +- **GRO (Generic Receive Offload)**: Aggregation of received packets into larger segments (typically done in software, unlike LRO) |
| 100 | +- **TSO (TCP Segmentation Offload)**: Hardware TCP segmentation |
| 101 | +- **LRO (Large Receive Offload)**: Hardware packet reassembly |
| 102 | + |
| 103 | +### Execution Modes |
| 104 | +- **R2C (Run-to-Completion)**: Single-threaded processing |
| 105 | +- **Worker Threads**: Multi-threaded processing with dedicated worker threads |
| 106 | + |
| 107 | +### Memory Management |
| 108 | +- **Buffer Pools**: Pre-allocated memory pools for network buffers |
| 109 | +- **Huge Pages**: Large memory pages for better TLB performance |
| 110 | +- **Zero-Copy**: Direct memory access without copying |
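As a rough illustration of huge page allocation with a fallback, the sketch below asks Linux for an anonymous mapping backed by huge pages via `mmap` with `MAP_HUGETLB` and falls back to regular pages if that fails. The names `mapped_region`, `map_region`, and `unmap_region` are hypothetical, and the 2 MiB page size is an assumption; this is not how the library itself allocates memory.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Result of a mapping attempt. huge is true only if MAP_HUGETLB succeeded.
struct mapped_region {
    void *addr;
    std::size_t length;
    bool huge;
};

// Maps at least `length` bytes, preferring huge pages.
// Assumes a 2 MiB huge page size; the length is rounded up to that multiple.
inline mapped_region map_region(std::size_t length)
{
    assert(length > 0);
    const std::size_t huge_page = 2UL * 1024 * 1024;
    const std::size_t rounded = (length + huge_page - 1) / huge_page * huge_page;
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;

#ifdef MAP_HUGETLB
    void *huge = mmap(nullptr, rounded, prot, flags | MAP_HUGETLB, -1, 0);
    if (huge != MAP_FAILED) {
        return mapped_region{huge, rounded, true};
    }
#endif

    // Fallback: regular pages (no huge pages reserved, or not supported).
    void *regular = mmap(nullptr, rounded, prot, flags, -1, 0);
    if (regular == MAP_FAILED) {
        return mapped_region{nullptr, 0, false};
    }
    return mapped_region{regular, rounded, false};
}

// Releases a region obtained from map_region().
inline void unmap_region(mapped_region &region)
{
    if (region.addr != nullptr) {
        munmap(region.addr, region.length);
    }
    region = mapped_region{nullptr, 0, false};
}
```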
| 111 | + |
| 112 | +### Network Stack |
| 113 | +- **lwIP**: Lightweight IP stack for TCP implementation |
| 114 | +- **Netlink**: Kernel-user space communication for routing/neighbors |
| 115 | +- **ARP**: Address Resolution Protocol implementation |
| 116 | +- **Routing**: Route table management and packet forwarding |
| 117 | + |
| 118 | +### Hardware Abstraction |
| 119 | +- **InfiniBand Verbs**: Low-level InfiniBand operations |
| 120 | +- **MLX5**: Mellanox ConnectX-5/6 specific optimizations |
| 121 | +- **DPCP**: Direct Packet Control Plane for advanced features |
| 122 | +- **OFED**: OpenFabrics Enterprise Distribution for InfiniBand/RDMA support |
| 123 | + |
| 124 | +## Common Code Patterns |
| 125 | + |
| 126 | +### Socket Implementation Pattern |
| 127 | +```cpp |
| 128 | +class sockinfo_new_protocol : public sockinfo { |
| 129 | +public: |
| 130 | + sockinfo_new_protocol(int fd, int domain); |
| 131 | + |
| 132 | + // Override required virtual methods |
| 133 | + virtual int bind(const struct sockaddr *addr, socklen_t addrlen) override; |
| 134 | + virtual int connect(const struct sockaddr *addr, socklen_t addrlen) override; |
| 135 | + virtual ssize_t send(const void *buf, size_t len, int flags) override; |
| 136 | + virtual ssize_t recv(void *buf, size_t len, int flags) override; |
| 137 | + |
| 138 | + // Protocol-specific methods |
| 139 | + void handle_protocol_specific_event(); |
| 140 | +}; |
| 141 | +``` |
| 142 | + |
| 143 | +### Event Handler Pattern |
| 144 | +```cpp |
| 145 | +class new_event_handler : public event_handler { |
| 146 | +public: |
| 147 | + virtual void handle_event() override; |
| 148 | + virtual void handle_timer_expired(void *user_data) override; |
| 149 | + |
| 150 | + // Register with poll group |
| 151 | + void register_with_poll_group(poll_group *pg); |
| 152 | +}; |
| 153 | +``` |
| 154 | + |
| 155 | +### Configuration Access Pattern |
| 156 | +```cpp |
| 157 | +// Access configuration through registry |
| 158 | +config_registry &registry = config_registry::get_instance(); |
| 159 | + |
| 160 | +// Type-safe parameter access with validation |
| 161 | +auto memory_limit = registry.get_parameter<int>("core.resources.memory_limit"); |
| 162 | +auto tcp_nodelay = registry.get_parameter<bool>("network.protocols.tcp.nodelay.enable"); |
| 163 | +``` |
| 164 | + |
| 165 | +## Key Data Structures |
| 166 | + |
| 167 | +```cpp |
| 168 | +class sockinfo { |
| 169 | + // Base socket information class |
| 170 | + // Inherited by sockinfo_tcp, sockinfo_udp, sockinfo_ulp |
| 171 | +}; |
| 172 | + |
| 173 | +class net_device_entry { |
| 174 | + // Network device abstraction |
| 175 | + // Handles InfiniBand context and queue pairs |
| 176 | + // Inherits from event_handler_ibverbs and timer_handler |
| 177 | +}; |
| 178 | + |
| 179 | +class ring { |
| 180 | + // Ring buffer for data transmission |
| 181 | + // Types: ring_simple, ring_bond, ring_slave |
| 182 | +}; |
| 183 | + |
| 184 | +class buffer_pool { |
| 185 | + // Memory pool for network buffers |
| 186 | + // Manages allocation and deallocation |
| 187 | +}; |
| 188 | + |
| 189 | +class poll_group { |
| 190 | + // Manages I/O multiplexing and event handling |
| 191 | + // Handles epoll, select, poll operations |
| 192 | +}; |
| 193 | + |
| 194 | +class worker_thread { |
| 195 | + // Worker thread for processing events |
| 196 | + // Part of worker_thread_manager |
| 197 | +}; |
| 198 | +``` |
| 199 | + |
| 200 | +## Configuration System |
| 201 | + |
| 202 | +### Architecture |
| 203 | +- **Registry**: `config_registry` class manages all configuration parameters |
| 204 | +- **Schema**: JSON Schema validation in `xlio_config_schema.json` |
| 205 | +- **Loaders**: JSON file, inline config, environment variables (deprecated) |
| 206 | +- **Priority**: JSON file → Inline config → Environment variables → Defaults |
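The priority chain can be pictured as a lookup that walks the sources in order and returns the first match. The sketch below assumes the chain is listed from highest to lowest priority. `layered_config` and `config_layer` are hypothetical names, the values are made up, and this does not show how `config_registry` is actually implemented.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One configuration source (e.g. JSON file, inline config, environment, defaults),
// flattened to key/value strings for the purpose of this sketch.
using config_layer = std::map<std::string, std::string>;

// Hypothetical layered lookup: layers are ordered highest priority first,
// and the first layer that defines a key wins.
class layered_config {
public:
    explicit layered_config(std::vector<config_layer> layers)
        : m_layers(std::move(layers))
    {
        assert(!m_layers.empty());
    }

    // Writes the winning value into `value` and returns true,
    // or returns false if no layer defines the key.
    bool lookup(const std::string &key, std::string &value) const
    {
        for (const auto &layer : m_layers) {
            auto it = layer.find(key);
            if (it != layer.end()) {
                value = it->second;
                return true;
            }
        }
        return false;
    }

private:
    std::vector<config_layer> m_layers;
};
```

Keeping defaults as the last layer means every known parameter resolves to some value, while higher layers only need to list what they override.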
| 207 | + |
| 208 | +### Key Configuration Sections |
| 209 | +- **core**: Essential configuration (memory, resources, initialization) |
| 210 | +- **network**: Protocol-specific settings (TCP, UDP, routing) |
| 211 | +- **hardware**: Hardware-specific configurations (InfiniBand, rings, offloads) |
| 212 | +- **performance**: Performance tuning parameters (polling, batching, allocation) |
| 213 | +- **monitor**: Logging, statistics, and monitoring settings |
| 214 | + |
| 215 | +## Quick Reference |
| 216 | + |
| 217 | +### Essential Commands |
| 218 | +```bash |
| 219 | +# Build and install |
| 220 | +./autogen.sh && ./configure --with-dpcp --enable-utls && make -j && make install |
| 221 | + |
| 222 | +# Run with XLIO |
| 223 | +LD_PRELOAD=libxlio.so ./your_application |
| 224 | + |
| 225 | +# Monitor performance |
| 226 | +xlio_stats --all |
| 227 | + |
| 228 | +# Check hardware |
| 229 | +ibstat && ibv_devinfo -l |
| 230 | +``` |
| 231 | + |
| 232 | +### Key Files |
| 233 | +- `src/core/sock/sockinfo.h` - Socket abstraction base class |
| 234 | +- `src/core/util/sys_vars.h` - Configuration parameters |
| 235 | +- `src/core/config/` - Configuration system |
| 236 | +- `src/core/dev/ring.h` - Ring buffer abstraction |
| 237 | +- `src/core/event/poll_group.h` - Event management |
| 238 | +- `tests/unit_tests/` - Unit tests |
| 239 | +- `contrib/jenkins_tests/style.conf` - Code formatting rules |
| 240 | + |
| 241 | +### Key Configuration Parameters |
| 242 | +- **Memory**: `core.resources.memory_limit`, `core.resources.hugepages.enable` |
| 243 | +- **Performance**: `performance.polling.*`, `performance.ring_allocation_logic` |
| 244 | +- **Logging**: `monitor.log.level`, `monitor.log.file` |
| 245 | +- **Network**: `network.protocols.tcp.nodelay`, `network.protocols.udp.*` |
| 246 | +- **Hardware**: `hardware.ib.*`, `hardware.ring.*` |
| 247 | + |
| 248 | +### Development Tools |
| 249 | +- **xlio_stats**: Runtime statistics and performance monitoring |
| 250 | +- **ethtool**: Hardware capabilities verification |
| 251 | +- **ibstat/ibv_devinfo**: InfiniBand device information |
| 252 | +- **valgrind**: Memory debugging (configure with `--with-valgrind`) |
| 253 | +- **gdb**: Debugging (enable with `--enable-debug`) |
| 254 | + |
| 255 | +## Performance Considerations |
| 256 | + |
| 257 | +### Key Optimization Areas |
| 258 | +- Minimize memory allocations in data path |
| 259 | +- Use zero-copy techniques where possible |
| 260 | +- Leverage hardware offloading (TSO, LRO, GRO) |
| 261 | +- Consider NUMA topology for multi-socket systems |
| 262 | +- Use huge pages for better TLB performance |
| 263 | +- Choose appropriate ring allocation strategy (per-core, per-interface, per-IP) |
| 264 | +- Tune polling intervals and batch sizes |
| 265 | + |
| 266 | +### Common Bottlenecks |
| 267 | +- Insufficient buffer pool sizes |
| 268 | +- Suboptimal ring allocation strategy |
| 269 | +- Disabled hardware offloading features |
| 270 | +- Incorrect polling parameters |
| 271 | +- Memory allocation in hot paths |