OdyssNet is a PyTorch-based library that implements Zero-Hidden Layer neural networks using Temporal Depth. By treating the neural network as a dynamic system that evolves over time, OdyssNet achieves deep learning capabilities without stacking spatial layers.
The library is organized into three primary modules:
- odyssnet.core.network: The recurrent core architecture and update dynamics.
- odyssnet.training.trainer: Optimization engine with AdamW and bio-inspired regularization.
- odyssnet.utils: Data utilities, model persistence (odyssstore), and dynamic expansion (neurogenesis).
The OdyssNet class defines the structure and dynamics of the network. It is a single layer where every neuron is connected to every other neuron (including itself).
from odyssnet import OdyssNet
model = OdyssNet(
num_neurons=10,
input_ids=[0, 1],
output_ids=[9],
pulse_mode=True,
dropout_rate=0.0,
device='cuda',
weight_init=['quiet', 'resonant', 'quiet', 'zero'],
activation=['none', 'tanh', 'tanh', 'none'],
gate=None, # Default resolves to ['none', 'none', 'identity']
vocab_size=None, # Optional: Decouples input/output size from neurons
vocab_mode='hybrid', # 'hybrid', 'discrete', or 'continuous'
hebb_type=None, # Toggle: None, 'temporal', 'spatial', or 'both'
hebb_res='neuron', # Plasticity resolution: 'global', 'neuron', or 'synapse'
debug=False, # NaN/Inf diagnosis — raises RuntimeError at the first offending operation
)
Parameters:
- num_neurons (int): Total number of neurons in the single layer (no hidden layers).
- input_ids (list[int]): Indices of neurons that receive external input.
- output_ids (list[int]): Indices of neurons whose state is read as output.
- pulse_mode (bool):
  - True: Input is applied only at $t=0$ (Impulse).
  - False: Input is applied continuously at every step (Stream).
- dropout_rate (float): Probability of synaptic failure during training (Biological simulation).
- device (str): 'cpu' or 'cuda'.
- weight_init (str or list[str]): Weight initialization strategy. Default is ['quiet', 'resonant', 'quiet', 'zero'] for [Encoder/Decoder, Core, Memory, Gates]. Single string values are expanded intelligently.
  - 'resonant' (default for Core): Edge-of-Chaos initialization with spectral radius ρ(W) = 1.0. Uses a bipolar Rademacher (±1) skeleton + small Gaussian noise (std=0.02) + spectral normalization. Ensures signals neither explode nor vanish while maintaining excitatory/inhibitory balance.
  - 'orthogonal': Orthogonal matrix initialization. Excellent stability for large networks.
  - 'xavier_uniform' / 'xavier_normal': Xavier-scaled initialization. Good for small logic networks.
  - 'kaiming_uniform' / 'kaiming_normal': Kaiming-scaled initialization. ReLU-oriented.
  - 'quiet': Normal(0, 0.02). Small random initialization.
  - 'micro_quiet': Normal(0, 1e-6). Near-zero initialization.
  - 'sparse': 90% sparse with std=0.02.
  - 'zero', 'one', 'classic': Special initialization cases.
- activation (str or list[str]): Activation function. Default is ['none', 'tanh', 'tanh', 'none'] for [encoder_decoder, core, memory, gate_hint]. The 4th entry is reserved for config symmetry and does not affect gate behavior. Supported activations: 'tanh', 'relu', 'leaky_relu', 'sigmoid', 'gelu', 'gelu_tanh', 'silu', 'none', 'identity'. A single string applies to the core path; the list format allows per-component control with 1-4 entries (missing entries are filled from the defaults).
- vocab_size (int or list/tuple, optional): Size of the input/output vocabulary.
  - Symmetric: vocab_size=50257 (GPT-2 style).
  - Asymmetric: vocab_size=[v_in, v_out] (e.g., [784, 10] for MNIST to map 784 pixels to 10 classes).
  - Disable: Use -1 to disable one side (e.g., [-1, 1000] for direct neuron input but decoded output).
- vocab_mode (str): Controls which input encoding layers are initialized (default: 'hybrid').
  - 'hybrid': Initializes both the Embedding (for integer/token inputs) and the Linear Projection (for float inputs). Use when the input type varies.
  - 'discrete': Initializes only the Embedding layer. Use for token-only inputs (e.g., NLP tasks). Saves VRAM.
  - 'continuous': Initializes only the Linear Projection. Use for float-only inputs (e.g., vision, audio). Saves VRAM.
- tie_embeddings (bool): If True, ties the input embedding weights to the output decoder weights, saving significant VRAM and parameter count (symmetric vocab_size only). Default is False.
- hebb_type (str or None): Controls the active mechanism for Heterogeneous Synaptic Plasticity. Default is None (plasticity disabled).
  - 'temporal': STDP-style learning; correlates the current state $h_t$ with the previous state $h_{t-1}$.
  - 'spatial': Co-activation learning (classic Hebbian); correlates the current state $h_t$ with itself (neurons firing simultaneously).
  - 'both': Combines both temporal and spatial mechanisms.
- hebb_res (str): Controls the structural resolution of plasticity. Default is 'neuron'.

| hebb_res | Parameter Shape | Extra Params per Path | Mechanics |
|---|---|---|---|
| "global" | scalar () | +2 | Uniform plasticity: the whole network is equally plastic. |
| "neuron" | vector (N,) | +2N | Per-neuron plasticity: each neuron learns its own adaptation rate. |
| "synapse" | matrix (N, N) | +2N² | Per-synapse plasticity: each connection has its own factor and decay. |

  - For each active path (t_ for temporal, s_ for spatial), two learnable logit parameters are created according to the resolution:
    - t_hebb_factor / s_hebb_factor (raw logit → sigmoid → learning rate ≈ 0.047 initially)
    - t_hebb_decay / s_hebb_decay (raw logit → sigmoid → retention ≈ 0.90 initially)
  - During each forward pass the model accumulates correlations (temporal $h_t \otimes h_{t-1}$ and/or spatial $h_t \otimes h_t$) and applies them to the effective weights.
  - The Hebbian states are persisted across forward calls via registered buffers (t_hebb_state_W, s_hebb_state_W, etc.) and are cleared by reset_state().
  - Both factors and decays are fully differentiable: gradients flow into them via the recurrent computation, so the network learns how to learn online.
- gate (None, str, or list[str]): Optional parametric gating mechanism. Default is None, which resolves to ['none', 'none', 'identity'].
  - None: Default configuration with the memory identity gate enabled and the others disabled.
  - str (e.g., 'sigmoid'): Applies the same gate activation to all three branches [encoder_decoder, core, memory].
  - list[str]: Specifies individual gate activations for up to 3 branches in [encoder_decoder, core, memory] order. Missing entries use defaults.
  - 'none': Completely disables the gate branch (no learnable parameters).
  - 'identity': Enables identity gating with learnable parameters (starts as the identity function but can adapt).
  - Gate parameters are initialized using the 4th entry in weight_init (default: 'zero').
- debug (bool): Enables NaN/Inf diagnosis mode. Default is False. When True, every critical operation in the forward pass (linear recurrence, memory feedback, activation, StepNorm, Hebbian correlation and accumulation) is checked after execution; the first non-finite value raises RuntimeError with the operation name and step index. It also calls torch.autograd.set_detect_anomaly(True) automatically, so a NaN in the backward pass is caught with a full stack trace. Disable it once the root cause is found; the checks add no overhead when debug is False.
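As a rough illustration of the plasticity parameters described under hebb_type and hebb_res, the sketch below assumes raw logits of -3.0 and 2.2, chosen here only because they reproduce the stated initial rates of about 0.047 and 0.90. The update rule (decay the old trace, then add the scaled outer product) and the helper names `sigmoid` and `hebbian_step` are assumptions for illustration, not OdyssNet's actual implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed raw logits (hypothetical values that match the documented initial rates).
FACTOR_LOGIT = -3.0   # sigmoid(-3.0) is about 0.047 (learning rate)
DECAY_LOGIT = 2.2     # sigmoid(2.2) is about 0.90 (retention)

def hebbian_step(state, h_t, h_prev, factor_logit=FACTOR_LOGIT, decay_logit=DECAY_LOGIT):
    """One temporal Hebbian update: decay the old trace, add the outer product of h_t and h_prev."""
    factor = sigmoid(factor_logit)
    decay = sigmoid(decay_logit)
    n = len(h_t)
    return [
        [decay * state[i][j] + factor * h_t[i] * h_prev[j] for j in range(n)]
        for i in range(n)
    ]

# Two neurons at 'global' resolution: one scalar factor and one scalar decay.
state = [[0.0, 0.0], [0.0, 0.0]]
state = hebbian_step(state, h_t=[1.0, -1.0], h_prev=[0.5, 0.5])
```

Under the same assumptions, the spatial path would pass `h_prev=h_t`, and the 'neuron' and 'synapse' resolutions would replace the two scalars with a vector or a matrix of logits.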
Because vocab_size is typically much larger than num_neurons (e.g., a 50k vocabulary vs. 1024 neurons), OdyssNet uses decoupled encoder and decoder layers. These can be configured as symmetric (same size for input and output) or asymmetric.
- Encoder (Input): Maps v_in -> len(input_ids) (Neurons).
  - Integers (Tokens) use nn.Embedding.
  - Floats (Vectors) use nn.Linear (Projection).
  - Disabled if v_in == -1.
- Decoder (Output): Maps len(output_ids) (Neurons) -> v_out.
  - Uses nn.Linear (Decoding).
  - Disabled if v_out == -1.
Benefit: This allows the "Thinking Core" (Neurons) to remain small and efficient while handling complex input formats or large output spaces without manual slicing.
# Asymmetric Example: MNIST (784 pixels -> 10 classes)
model = OdyssNet(
num_neurons=10,
input_ids=range(10),
output_ids=range(10),
vocab_size=[784, 10], # Input 784, Output 10
vocab_mode='continuous'
)
# No need for slice_output: model(x) returns (Batch, Steps, 10)
OdyssNet processes data through three distinct modalities. Choosing the right one is critical for performance and VRAM efficiency.
Use case: Static data like images (MNIST) or single-shot logic (XOR).
- Behavior: Set pulse_mode=True. Input is injected at $t=0$ only.
- Thinking: The model continues computation for the specified number of steps without further input.
- VRAM Efficiency: Optimal. Only (Batch, Neurons) is stored.
# Image Classification (784 pixels -> 100 steps thinking)
model = OdyssNet(..., pulse_mode=True)
output = model(image_tensor, steps=100)
Use case: Control systems, VCO (Sine Wave), or real-time sensor monitoring.
- Behavior: Set pulse_mode=False. The same input is injected at every time step $t$.
- Thinking: The model state is constantly influenced by the static input.
- VRAM Efficiency: High. Only (Batch, Neurons) is stored.
# Frequency Control for Oscillator
model = OdyssNet(..., pulse_mode=False)
output = model(freq_input, steps=30)
Use case: Large Language Models (LLM), Time-Series, and reasoning agents.
- Behavior: Provide a sequence (Batch, Tokens). If steps > tokens, OdyssNet automatically scales the temporal resolution.
- Mechanism: If 100 tokens are provided with 500 steps, the model intersperses 4 "silent" thinking steps between consecutive tokens.
- VRAM Efficiency: High. Eliminates the need for manually dilated/padded input tensors.
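The arithmetic in the Mechanism bullet can be sketched as follows. The helper name `token_schedule` and the choice to place each token on the first step of its slot are assumptions made for illustration; the text only states that silent thinking steps are interspersed between tokens.

```python
def token_schedule(num_tokens, steps):
    """Return a list of length `steps`: the token index fed at each step, or None for a silent step."""
    if steps < num_tokens or steps % num_tokens != 0:
        raise ValueError("this sketch assumes steps is a multiple of num_tokens")
    stride = steps // num_tokens          # steps allotted to each token
    schedule = [None] * steps
    for token_idx in range(num_tokens):
        # One token step, followed by stride - 1 silent steps.
        schedule[token_idx * stride] = token_idx
    return schedule

schedule = token_schedule(num_tokens=100, steps=500)
silent_after_first_token = schedule[1:5].count(None)
```

With 100 tokens and 500 steps the stride is 5, so each token is followed by 4 silent steps, matching the figure quoted above.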
# LLM: 128 tokens with 5 thinking steps per token (Total 640 steps)
import torch
batch = 4  # example batch size
tokens = torch.randint(0, 50257, (batch, 128))
output = model(tokens, steps=640)

| Input Type | Format | Modality | Recommended Use Case |
|---|---|---|---|
| Index (ID) | (Batch, Steps) (Long) | Sequential | LLMs, Tokenized text. |
| Dense | (Batch, Steps, Dim) (Float) | Sequential | Audio, Video, Vector Streams. |
| Pulse | (Batch, Dim) (Float) | Instant | Static Images, Logic Gates. |
| Continuous | (Batch, Dim) (Float) | Periodic | Oscillators, Constant Signals. |
Returns the effective parameter count of the network. It accounts for the memory_feedback separation by properly discounting the inactive diagonal of the W matrix to give you a true representation of learning capacity.
Optimizes the model using torch.compile (PyTorch 2.0+) for faster execution. Returns the compiled model.
Runs the dynamic system.
- x_input: Input tensor. Can be a single pulse or a sequence (index-based or dense).
- steps: Thinking Time. How many times the signal reverberates in the echo chamber.
- current_state: Optional. Pass a previous state to continue from.
- return_sequence (bool, default True): Controls the output allocation strategy.
  - True: Collects the full output sequence and returns all_states of shape (Batch, Steps, Neurons). Required when the loss is computed over all time steps (full_sequence=True in the trainer).
  - False: Skips building the (Batch, Steps, Neurons) tensor entirely and returns only the final step as (Batch, 1, Neurons). Saves VRAM proportional to thinking_steps; use this whenever only the last output matters (e.g., classification, pulse-mode inference).
- Returns: (all_states, final_state)
  - all_states: Shape (Batch, Steps, Neurons) when return_sequence=True; shape (Batch, 1, Neurons) when False.
  - final_state: Tensor of shape (Batch, Neurons), the last hidden state, regardless of return_sequence.
Trainer transparency: OdyssNetTrainer automatically passes return_sequence=full_sequence in train_batch() and predict(). You only need to set this manually when calling model.forward() directly.
The OdyssNetTrainer handles the training loop, gradient accumulation, mixed precision (AMP), and experimental features like Ghost Gradients. Prodigy is the default optimizer (auto-calibrating, no LR tuning required). Pass an explicit lr to use AdamW instead.
from odyssnet import OdyssNetTrainer
# Quick prototyping: Prodigy — auto-calibrates LR, no tuning needed
trainer = OdyssNetTrainer(model, device='cuda')
# Reproducible experiments and production: pin an explicit lr to use AdamW
trainer = OdyssNetTrainer(model, lr=1e-4, device='cuda')
# With optional features
trainer = OdyssNetTrainer(
model,
device='cuda',
gradient_persistence=0.0,
synaptic_noise=0.0,
anomaly_hook=my_hook
)
# Custom optimizer (bypasses both Prodigy and AdamW)
import torch
trainer = OdyssNetTrainer(model, optimizer=torch.optim.AdamW(model.parameters(), lr=1e-4))
Parameters:
- lr (float or None): Learning rate. Default: None.
  - None: The Prodigy optimizer is used. It auto-calibrates the learning rate continuously, so no manual tuning is required. Requires pip install prodigyopt. Best for quick prototyping; produces non-deterministic loss curves across runs even with a fixed seed.
  - float (e.g. 1e-4): The AdamW optimizer is used with weight_decay=0.01. Recommended for reproducible experiments, benchmarking, and production runs.
- gradient_persistence (float): Ghost Gradients / Persistence.
  - 0.0: Standard behavior (zero_grad() after every step).
  - Greater than 0.0 (e.g., 0.1): Keeps a fraction of the previous step's gradient. This creates a "momentum" over time, effectively simulating a larger batch size or longer temporal context. Useful for difficult convergence landscapes.
- synaptic_noise (float): Thermal Noise.
  - Adds Gaussian noise (std dev = synaptic_noise) to all weights before every training step.
  - Simulates biological thermal noise and helps prevent overfitting (Stochastic Resonance).
  - Default: 0.0 (enable for regularization, e.g. 1e-6, on large or overfitting-prone networks).
- anomaly_hook (Callable, optional): A user-defined function hook(anomaly_type, loss_val) triggered automatically when training encounters anomalies. Supported anomaly_type values:
  - "spike": A sudden, violent surge in loss (e.g., exploded gradient).
  - "increase": Triggered every time the current step's loss is strictly greater than the previous step's loss (even by 0.0001). Useful for custom patience counters or algorithmic early stopping.
  - "plateau": The loss has stagnated and is barely moving over a window.
  - Usage: Allows for smart interventions (like custom logging or early stopping when stuck).
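A patience counter built on the "increase" event might look like the sketch below. Only the hook(anomaly_type, loss_val) call signature comes from the description above; the `PatienceHook` class, its attributes, and the decision to stop immediately on a spike are hypothetical choices for illustration.

```python
class PatienceHook:
    """Counts "increase" events and raises a stop flag once patience is exhausted.

    Hypothetical helper: only the (anomaly_type, loss_val) signature is documented.
    """

    def __init__(self, patience=3):
        self.patience = patience
        self.bad_steps = 0
        self.should_stop = False
        self.events = []

    def __call__(self, anomaly_type, loss_val):
        self.events.append((anomaly_type, float(loss_val)))
        if anomaly_type == "increase":
            self.bad_steps += 1
        elif anomaly_type == "spike":
            # Design choice of this sketch: treat a spike as an immediate stop signal.
            self.should_stop = True
        if self.bad_steps >= self.patience:
            self.should_stop = True

hook = PatienceHook(patience=2)
# Simulated calls, standing in for what the trainer would issue during training.
hook("increase", 0.51)
hook("plateau", 0.51)
hook("increase", 0.52)
```

Such an object would be passed as `anomaly_hook=hook` when constructing the trainer, and a custom loop could check `hook.should_stop` after each step.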
Runs a full training loop.
history = trainer.fit(
input_features=X,
target_values=Y,
epochs=100,
batch_size=32,
thinking_steps=10 # Temporal Depth
)
Runs a single custom training step. Useful for custom loops (RL, Generative, etc.).
- thinking_steps: How long the model "thinks" before the loss is calculated.
- gradient_accumulation_steps: Simulates larger batch sizes.
- full_sequence (bool): If True, calculates the loss on the entire sequence output (Batch, Steps, Out) instead of just the last step. Essential for Seq2Seq tasks.
- mask (Tensor, optional): A binary or weighted mask (Batch, Steps, Out) to ignore specific steps or outputs during loss calculation. Useful for tasks with "thinking delays" or variable-length sequences.
- output_transform (Callable, optional): A function to transform the predicted outputs before loss calculation. Useful for reshaping logits (e.g., flatten for CrossEntropy) or applying custom activations.
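To make the role of mask concrete, here is a minimal pure-Python sketch of a masked mean squared error over (Batch, Steps, Out) data. The function name `masked_mse` and the normalization by the sum of the mask are assumptions; the text does not say how the trainer normalizes the masked loss.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over (Batch, Steps, Out) nested lists, weighted by `mask`."""
    weighted_error = 0.0
    weight_sum = 0.0
    for p_b, t_b, m_b in zip(pred, target, mask):
        for p_s, t_s, m_s in zip(p_b, t_b, m_b):
            for p, t, m in zip(p_s, t_s, m_s):
                weighted_error += m * (p - t) ** 2
                weight_sum += m
    return weighted_error / weight_sum if weight_sum > 0 else 0.0

# Batch=1, Steps=3, Out=1: ignore two "thinking" steps and score only the last one.
pred = [[[0.9], [0.4], [1.0]]]
target = [[[0.0], [0.0], [0.5]]]
mask = [[[0.0], [0.0], [1.0]]]
loss = masked_mse(pred, target, mask)
```

Because the first two steps carry zero weight, only the final step's error contributes, which is the "thinking delay" pattern mentioned above.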
Runs inference in evaluation mode.
- full_sequence (bool): If True, returns outputs for all time steps (Batch, Steps, Out).
Triggers Darwinian Regeneration. Instead of pruning weak weights, this method re-initializes them.
- Logic: If $|W| < \text{threshold}$, the synapse is considered "dead/useless". It is wiped and assigned a new random value using the model's original initialization strategy (e.g., Xavier/Orthogonal).
- Purpose: Allows the network to escape local minima and constantly explore new pathways. Transforms "dead" capacity into "fresh" capacity.
- Returns: (revived_count, total_synapses)
Returns comprehensive training diagnostics.
Parameters:
- debug (bool): If True, includes computationally intensive diagnostics such as gradient statistics, persistent gradient info, and detailed optimizer metrics. Default: False.
Returns: A dictionary containing:
- step_count: Number of optimization steps taken
- last_loss: Most recent loss value
- current_lr: Current learning rate
- gradient_persistence: Gradient persistence coefficient
- persistent_grads_active: Number of active persistent gradients (debug mode only)
- anomaly_tracking: Anomaly detection state (debug mode only)
- loss_tracking: Loss buffer statistics (debug mode only)
- scaler_state: AMP scaler information (debug mode only)
- gradient_stats: Gradient norms and means across parameters (debug mode only)
OdyssNet replaces spatial layers with temporal steps.
- Vertical vs Horizontal: A standard 10-layer network has fixed depth. OdyssNet can be run for 10 or 100 steps on-the-fly.
- Dynamic Complexity: Higher steps allow the network more time to reverberate signals through its recurrent core, enabling deeper reasoning without increasing parameter count.
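The idea of trading layers for steps can be illustrated with a toy recurrence. This is a pure-Python sketch under simplifying assumptions (a plain tanh update with no gates, memory path, or normalization); the function `run` and the weight values are invented for illustration and are not OdyssNet's actual update rule.

```python
import math

def run(weights, x, steps, pulse_mode=True):
    """Iterate h <- tanh(W h + input) for `steps` steps and return the final state."""
    n = len(weights)
    h = [0.0] * n
    for t in range(steps):
        # Pulse mode injects the input only at t = 0; stream mode injects it at every step.
        inject = x if (t == 0 or not pulse_mode) else [0.0] * n
        h = [
            math.tanh(sum(weights[i][j] * h[j] for j in range(n)) + inject[i])
            for i in range(n)
        ]
    return h

# A fully connected 3-neuron layer (every neuron connects to every neuron, itself included).
W = [[0.0, 0.9, -0.4],
     [0.5, 0.0, 0.8],
     [-0.7, 0.6, 0.1]]
x = [1.0, 0.0, 0.0]

shallow = run(W, x, steps=10)                    # same nine parameters...
deep = run(W, x, steps=100)                      # ...ten times the temporal depth
stream = run(W, x, steps=10, pulse_mode=False)   # input re-applied at every step
```

The parameter count is fixed by `W`; only the `steps` argument changes how many times the signal passes through the same weights.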
OdyssNet allows you to simulate massive batch sizes on limited hardware (e.g., consumer GPUs).
- How it works: Instead of updating weights after every batch, it accumulates gradients for N steps and then performs a single update.
- Usage:
  # Simulates a batch size of 32 * 4 = 128
  trainer.train_batch(x, y, thinking_steps=10, gradient_accumulation_steps=4)
- Benefit: Allows training large models, or gaining the stability of large batches, without running out of VRAM.
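Why accumulation can stand in for a larger batch is easy to check on a one-parameter model. The sketch below assumes that gradients from equally sized micro-batches are averaged before the update (whether the trainer averages or sums is not stated here); `mse_grad` and `accumulated_grad` are illustrative helpers.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of equally sized micro-batches, as one accumulated update would."""
    starts = range(0, len(xs), micro_batch_size)
    grads = [
        mse_grad(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        for i in starts
    ]
    return sum(grads) / len(grads)

xs = [float(i) for i in range(8)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)                                # one batch of 8 samples
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)   # 4 micro-batches of 2 samples
```

Under that averaging assumption the two gradients agree up to floating-point error, while only one micro-batch has to be held in memory at a time.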
By setting gradient_persistence > 0, the network retains a fraction of the previous batch's gradient.
- Mechanism: Uses a decaying echo (linear scaling) of previous gradients.
- Use Case: Smoothing optimization in non-convex landscapes or simulated long-context training.
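One way to read the "decaying echo" is that the stored gradient is scaled down instead of being zeroed after each step. The sketch below follows that reading on scalar gradients; the function `effective_grads` and the exact order of operations are assumptions, not the trainer's confirmed behavior.

```python
def effective_grads(raw_grads, persistence):
    """Return the gradient actually applied at each step when a fraction of the previous one is kept."""
    carried = 0.0
    applied = []
    for g in raw_grads:
        total = carried + g             # the new gradient accumulates onto what was kept
        applied.append(total)
        carried = persistence * total   # scale down and keep, instead of zeroing
    return applied

standard = effective_grads([1.0, 1.0, 1.0], persistence=0.0)
ghost = effective_grads([1.0, 1.0, 1.0], persistence=0.1)
```

With persistence 0.0 each step sees only its own gradient; with 0.1 each step also sees a geometrically fading trace of earlier ones.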
OdyssNet can re-initialize synapses that are no longer contributing to the loss signal (stagnant weights).
- Concept: Instead of pruning, near-zero weights are re-initialized using the original weight strategy.
- Benefit: Maximizes network plasticity and parameter efficiency by converting dead capacity into fresh exploration.
- Usage:
  - Threshold Mode: trainer.regenerate_synapses(threshold=0.01)
  - Percent Mode: trainer.regenerate_synapses(percentage=0.05)
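Threshold mode can be sketched on a plain nested list. This sketch assumes a Normal(0, 0.02) re-draw, borrowed from the 'quiet' strategy purely for illustration; the `regenerate` helper is hypothetical, and the real method would use whatever strategy the model was built with.

```python
import random

def regenerate(weights, threshold, init_std=0.02, seed=0):
    """Re-draw every weight with |w| < threshold from Normal(0, init_std), in place.

    Returns (revived_count, total_synapses), mirroring the documented return value.
    """
    rng = random.Random(seed)
    revived = 0
    total = 0
    for row in weights:
        for j, w in enumerate(row):
            total += 1
            if abs(w) < threshold:
                row[j] = rng.gauss(0.0, init_std)
                revived += 1
    return revived, total

W = [[0.5, 0.001, -0.3],
     [0.0, -0.8, 0.004],
     [0.2, -0.002, 0.9]]
revived, total = regenerate(W, threshold=0.01)
```

Weights at or above the threshold are left untouched, so learned structure survives while near-zero entries get a fresh start.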
The odyssstore module provides checkpoint management utilities, including a unique Weight Transplantation feature for transferring learned knowledge between models of different sizes.
Saves a training checkpoint to disk. Pass trainer_state=trainer.state_dict() to also persist the trainer's runtime state (step counter, scaler, persistent gradients).
Loads a checkpoint. Set strict=False to ignore size mismatches (will partially load what fits). Pass lr to overwrite the saved learning rate after loading. Pass trainer (an OdyssNetTrainer instance) to restore runtime trainer state (step counter, scaler, persistent gradients).
🧬 Weight Transplantation: Transfers learned weights from a checkpoint to a model, even if the number of neurons is different.
- Scaling Up: Start a 512-neuron model with knowledge from a 256-neuron model. The overlapping 256×256 region is copied, the rest stays initialized.
- Scaling Down: Compress a 1024-neuron model into a 256-neuron model. The most "central" 256×256 weights are preserved.
- Warm Starts: Any learned weights are better than random. Gradients will find their way faster.
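The scaling-up case can be sketched as copying the overlapping block of one square matrix into another. The top-left alignment and the helper name `transplant` are assumptions; the "central" selection mentioned for scaling down is not modeled here.

```python
def transplant(src, dst):
    """Copy the overlapping top-left block of `src` into `dst` (both square nested lists)."""
    k = min(len(src), len(dst))
    for i in range(k):
        for j in range(k):
            dst[i][j] = src[i][j]
    return dst

small = [[1.0, 2.0],
         [3.0, 4.0]]
big = [[0.0] * 3 for _ in range(3)]   # stands in for a freshly initialized larger model
transplant(small, big)
```

Entries outside the overlapping block keep whatever initialization the larger model already had.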
from odyssnet import OdyssNet, transplant_weights
# Create a NEW, larger model
big_model = OdyssNet(num_neurons=512, ...)
# Transplant weights from a smaller, trained checkpoint
transplant_weights(big_model, 'small_model_checkpoint.pth')
# big_model now has a "warm start" - training will converge faster!
Reads checkpoint metadata (epoch, loss, num_neurons) without loading into a model.
OdyssNet supports dynamic growth, allowing you to add neurons to a live network during training. This mimics biological neurogenesis.
Dynamically adds the requested number (amount) of empty neurons to the model.
- Continuity: Optimizers are migrated, so momentum and history are preserved.
- State: The training state is preserved.
- Initialization:
- Incoming Weights: 0 (Maintains forward pass stability, new neuron starts inactive).
- Outgoing Weights: Small random noise (Enables backpropagation / gradient flow).
# Add 1 neuron if loss stagnates
if loss > prev_loss:
    trainer.expand(amount=1)
Initialization: New connections are initialized with micro_quiet_warm (Normal(0, 1e-3)) noise so they remain dormant relative to trained weights and do not destabilize the existing dynamics. Optimizer momentum is migrated from the old parameters to the expanded ones.
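The stability argument can be checked on a small matrix. The text mentions both zero incoming weights and Normal(0, 1e-3) noise for new connections; this sketch uses zeros for incoming weights and the noise for outgoing ones. It also assumes that `weights[i][j]` is the connection from neuron j to neuron i. The helpers `expand` and `step` are illustrative only.

```python
import random

def expand(weights, amount=1, noise_std=1e-3, seed=0):
    """Return a copy of the N x N matrix grown to (N + amount) x (N + amount)."""
    rng = random.Random(seed)
    n = len(weights)
    m = n + amount
    grown = [[0.0] * m for _ in range(m)]
    for i in range(n):
        for j in range(n):
            grown[i][j] = weights[i][j]               # trained weights are kept as-is
        for j in range(n, m):
            grown[i][j] = rng.gauss(0.0, noise_std)   # outgoing weights of new neurons
    # Rows n..m-1 (incoming weights of the new neurons) stay at zero.
    return grown

def step(weights, h):
    """Linear part of one update: W h."""
    return [sum(w * v for w, v in zip(row, h)) for row in weights]

W = [[0.2, -0.5],
     [0.7, 0.1]]
h = [1.0, -1.0]
W_big = expand(W, amount=1)
before = step(W, h)
after = step(W_big, h + [0.0])   # the new neuron starts inactive
```

Since the new neuron receives no input and starts at zero, the existing neurons compute the same values as before the expansion.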
Maps raw input features (numpy or tensor) to the full network state tensor.
- Pulse Mode: Plugs data into t=0, leaves rest as 0.
- Stream Mode: Maps sequence data (Batch, Steps, Features) to the correct neurons.
- Auto-Device: Automatically moves data to the model's device.
from odyssnet.utils.data import prepare_input
x_in, batch_size = prepare_input(X_train, model.input_ids, model.num_neurons, 'cuda')
Safely converts any list/array/int/float into a PyTorch tensor on the target device.
from odyssnet.utils.data import to_tensor
data_tensor = to_tensor(data, 'cuda')
Sets a fixed seed for reproducible results across all random sources (Python, NumPy, PyTorch, CUDA).
- Purpose: Ensures consistent behavior across runs for reliable experimentation and debugging.
- Seed Value: The provided seed is applied to all randomization sources simultaneously.
- CUDA Support: Automatically configures CUDA random state if GPU is available.
from odyssnet import OdyssNet, OdyssNetTrainer, set_seed
# At the start of your script for full reproducibility
set_seed(42)
# Train or run experiments. An explicit lr selects AdamW; the default Prodigy optimizer is non-deterministic even with a fixed seed.
model = OdyssNet(...)
trainer = OdyssNetTrainer(model, lr=1e-4)
trainer.fit(x, y, epochs=100)
Best Practice:
- Call set_seed() at the start of your script, before any random operations.
- Use consistent seed values (e.g., 42) for reproducible examples and experiment validation.
- Different seeds can be used for ensemble training or robustness testing.
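For readers curious what seeding "all random sources" involves, the sketch below shows the usual pattern. It is a stand-in written for illustration, not the library's set_seed; the name `set_seed_sketch` is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random

def set_seed_sketch(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

set_seed_sketch(42)
first = [random.random() for _ in range(3)]
set_seed_sketch(42)
second = [random.random() for _ in range(3)]
```

Re-seeding with the same value replays the same random stream, which is why the call belongs before any model construction or data shuffling.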
See Neurogenesis section above.
This module manages model serialization and the transdimensional weight transplantation feature described in the Advanced Capabilities section.
Lightweight metric accumulator with built-in multi-panel plotting. All example scripts use this to visualize training dynamics.
from odyssnet import TrainingHistory
history = TrainingHistory()
for epoch in range(epochs):
    loss = trainer.train_batch(x, y, thinking_steps=10)
    history.record(loss=loss, lr=current_lr, accuracy=acc)
# Interactive display
history.plot(title="My Experiment")
# Save to file
history.plot(save_path="results/training.png", title="My Experiment")
Methods:
- record(**kwargs): Record one or more named metrics for the current step. Values are converted to float.
- get(key): Return the list of recorded values for a metric name.
- metrics: Property returning the names of all recorded metrics.
- plot(save_path=None, title="Training History"): Generate a multi-subplot figure with one panel per metric. If save_path is given, saves to disk; otherwise shows interactively. If the environment variable ODYSSNET_DISABLE_PLOT=1 is set, plotting is skipped entirely (useful for automated testing).
# 2 Inputs, 1 Output. 0 Hidden Layers.
model = OdyssNet(num_neurons=3, input_ids=[0, 1], output_ids=[2], device='cuda')
trainer = OdyssNetTrainer(model, lr=5e-3, gradient_persistence=0.1)
# Training logic...
trainer.fit(X, Y, epochs=100, thinking_steps=5)
# 784 pixels -> 10 neurons -> 10 logits
model = OdyssNet(num_neurons=10, input_ids=range(10), output_ids=range(10), vocab_size=[784, 10])
# Model handles projection and decoding automatically.