
Conversation

lukstafi

- Update dune files to include new executables and libraries
- Add sokoban library dependency
- Include Unix library for file operations
- Add algorithm comparison script
- Clean up cartpole example

Successfully organized and committed all changes in 6 logical commits:

1. Sokoban environment - Core implementation with 2D observations and deadlock detection
2. RL algorithms - REINFORCE, GRPO, and DQN implementations
3. Backoff-tabular baseline - Strong tabular baseline for comparison
4. Visualization system - Trajectory logging and debugging tools
5. Test suite - Comprehensive tests for environment validation
6. Build configuration - Updated dune files and dependencies

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

lukstafi and others added 12 commits September 9, 2025 18:00
…ved generation

- Implement Gymnasium-compatible Sokoban environment
- Use 2D observation tensors for better spatial representation
- Add comprehensive deadlock detection:
  - Corner deadlocks (treating other boxes as walls)
  - 2x2 box formations
  - Adjacent boxes along walls
- Implement greedy generation with backtracking to ensure solvability
- Add curriculum learning with progressive difficulty stages
- Support both single-box and multi-box puzzles

This provides a robust Sokoban environment for RL training with
guaranteed solvable puzzles and proper 2D observations.
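
The corner-deadlock check mentioned above is the simplest of the three: a box sitting against two perpendicular walls can never be pushed again, so the puzzle is lost unless that box is already on a target. Below is a minimal, framework-independent sketch of that check, assuming a grid encoded as a `char array array` with `'#'` for walls (the environment's actual representation may differ):

```ocaml
(* Treat positions outside the grid as walls. *)
let is_wall grid r c =
  r < 0 || c < 0
  || r >= Array.length grid
  || c >= Array.length grid.(0)
  || grid.(r).(c) = '#'

(* A box at (r, c) is corner-deadlocked if it is blocked on some vertical
   side and some horizontal side, and it is not on a target square. *)
let corner_deadlock grid ~targets (r, c) =
  let blocked_v = is_wall grid (r - 1) c || is_wall grid (r + 1) c in
  let blocked_h = is_wall grid r (c - 1) || is_wall grid r (c + 1) in
  blocked_v && blocked_h && not (List.mem (r, c) targets)
```

The 2x2-formation and along-the-wall checks follow the same pattern, additionally treating other boxes as walls.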

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Implement REINFORCE with baseline for policy gradient learning
- Add GRPO (Group Relative Policy Optimization) with KL regularization
- Implement DQN with experience replay and target network
- Add training utilities for advantage computation and normalization
- Update network architectures to use Conv2D for spatial features

All algorithms are integrated with the Sokoban environment and support
curriculum learning for progressive difficulty training.
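
For readers new to the algorithms named above, the core of REINFORCE with a baseline is computing per-step discounted returns and subtracting a baseline to reduce variance. A plain-OCaml sketch of that arithmetic (the actual implementations operate on Rune tensors and use a learned value network as the baseline):

```ocaml
(* Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards. *)
let discounted_returns ~gamma rewards =
  let n = Array.length rewards in
  let returns = Array.make n 0.0 in
  let acc = ref 0.0 in
  for t = n - 1 downto 0 do
    acc := rewards.(t) +. (gamma *. !acc);
    returns.(t) <- !acc
  done;
  returns

(* Advantage = return minus a baseline; here simply the mean return. *)
let advantages returns =
  let n = Array.length returns in
  let mean = Array.fold_left ( +. ) 0.0 returns /. float_of_int n in
  Array.map (fun g -> g -. mean) returns
```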

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Port backoff-tabular algorithm from sokoban-rl-playground
- Implement hierarchical state abstraction with window sizes [3, 5, 7, -1]
- Add backoff strategy for Q-value lookup and updates
- Include trajectory logging for debugging agent behavior
- Achieve strong baseline performance (60-70% on stage 4, 8-12% on stage 5)

This provides a strong tabular baseline for comparing neural network
approaches and verifying environment quality.
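
The backoff strategy can be pictured without any neural machinery: Q-values are stored per (window size, abstract state, action) triple, and lookup falls back through the window sizes until a known entry is found. A schematic sketch, where the `abstract` helper (hypothetical here) maps a window size to a string key for the agent-centred view of the current state; the fallback order shown is an assumption, not necessarily the one the real implementation uses:

```ocaml
(* Q-table keyed by (window_size, abstract_state, action).
   Window size -1 stands for the full, un-abstracted board. *)
let q_table : (int * string * int, float) Hashtbl.t = Hashtbl.create 4096

let window_sizes = [ -1; 7; 5; 3 ] (* assumed: most specific view first *)

(* Backoff lookup: take the Q-value from the first abstraction level that
   has an entry; default to 0.0 if nothing is known at any level. *)
let lookup ~abstract action =
  let rec go = function
    | [] -> 0.0
    | w :: rest -> (
        match Hashtbl.find_opt q_table (w, abstract w, action) with
        | Some q -> q
        | None -> go rest)
  in
  go window_sizes
```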

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Create episode logging for debugging agent behavior
- Support JSON export of trajectories with states and actions
- Add HTML/JavaScript animation for interactive visualization
- Include action-to-string conversion utilities

This enables visual debugging of agent behavior and helps identify
issues in both the environment and learning algorithms.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add deadlock detection tests
- Add room generation tests
- Add solvability verification tests
- Add wall segment tests for deadlock checking
- Add DQN curriculum tests
- Add visualization tests

These tests ensure the Sokoban environment generates valid,
solvable puzzles and that deadlock detection works correctly.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Update dune files to include new executables and libraries
- Add sokoban library dependency
- Include Unix library for file operations
- Add algorithm comparison script
- Clean up cartpole example

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Summary by Claude:

  - Start with RL fundamentals for audiences familiar with neural network training but new to RL
  - Progress logically from basic concepts to the REINFORCE algorithm
  - Include mathematical details with proper LaTeX formatting
  - Incorporate key quotes from Sutton & Barto to ground the presentation
  - Use slipshow navigation with strategic pauses and viewport control
  - Provide practical implementation guidance for neural network frameworks

  The presentation covers the core RL framework, policy gradients, the policy gradient theorem,
  REINFORCE algorithm steps, variance reduction with baselines, and practical considerations
  including actor-critic methods.
Prompt:

  Can you give me a deep overview of the Raven framework, covering its different facets such as the packages Nx, Rune, Kaun, and culminating in the Fehu package? Let's put this overview in a Markdown file under dev/fehu/docs so that we can share it with newcomers to the Raven framework. Explain how the different packages build on top of or complement each other, what are the important types from the user perspective, and give some examples, ideally self-contained.

Summary by Claude:
I've created a comprehensive overview of the Raven framework at /Users/lukstafi/raven-sokoban/dev/fehu/docs/raven_framework_overview.md.

  The document covers:

  1. Architecture overview showing how packages build on each other (Nx → Rune → Kaun → Fehu)
  2. Deep dive into each package:
    - Nx: NumPy-like tensor operations with type safety
    - Rune: Autodiff and device abstraction layer
    - Kaun: Neural network framework with JAX/Flax-inspired API
    - Fehu: RL-specific components for environments and training
  3. Important types and conventions for each package
  4. Self-contained examples including:
    - Basic tensor operations
    - Gradient computation
    - Neural network training (MLP, autoencoder)
    - Complete REINFORCE implementation
    - Custom RL environments
    - Curriculum learning
  5. How packages compose together with practical examples showing data flow from raw tensors through neural networks to RL agents

  The overview emphasizes Raven's functional programming approach, type safety, and clean separation of concerns across the package hierarchy.
lukstafi force-pushed the feat/sokoban-rl-environment branch from d8fbc5d to 64b8435 on September 9, 2025 16:02
lukstafi and others added 17 commits September 9, 2025 18:45
Summary by Claude:

  I've successfully updated and fixed the Sokoban RL workshop files.
  Here's what was accomplished:

  1. Fixed Compilation Issues in Slide Files

  - Updated API calls to use correct Rune/Fehu/Kaun functions
  - Fixed type mismatches (int32 vs float for actions)
  - Corrected function names and parameters (e.g., unsafe_set → set_item, unsafe_get → item)
  - Added proper module prefixes (e.g., Kaun.Layer)
  - Fixed tensor operations (stack, take_along_axis, etc.)

  2. Created Meaningful Run Files

  Each runN.ml file now has:
  - Clear documentation of what it demonstrates
  - Descriptive output messages
  - Proper module references using Workshop.SlideN

  3. Workshop Structure

  The workshop progresses through:
  - Slide1: Basic grid world environment
  - Slide2: Policy network initialization
  - Slide3: Episode collection
  - Slide4: Basic REINFORCE
  - Slide5: REINFORCE with baseline
  - Slide6: Actor-Critic
  - Slide7: Clipping demonstration
  - Slide8: KL penalty demonstration
  - Slide_pip: Complete pipeline (some parts commented out)

  4. Key Fixes Applied

  - Environment field access using env.Fehu.Env.reset
  - Action type conversion between int32 and float32
  - Tensor stacking with proper axis specification
  - Log-softmax computed as log of softmax
  - Proper indexing with take_along_axis

  The workshop files now compile cleanly and provide a progressive
  learning experience for RL practitioners using the Fehu framework!
- Implement numerically stable log_softmax function in slide2.ml
- Replace unstable log(softmax(x)) pattern with log_softmax across slides 4-6
- Add documentation note about numerical stability best practices
- Prevent underflow for small probabilities and overflow in exp()
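
For reference, the standard numerically stable formulation subtracts the maximum logit before exponentiating, so `exp` never overflows and no probability underflows before the log is taken. A plain-OCaml sketch of the identity involved (the slide code implements the equivalent on Rune tensors):

```ocaml
(* log_softmax x_i = x_i - m - log (sum_j exp (x_j - m)), with m = max(x).
   Subtracting the max keeps exp in a safe range and avoids log 0. *)
let log_softmax (logits : float array) : float array =
  let m = Array.fold_left max neg_infinity logits in
  let sum_exp =
    Array.fold_left (fun acc x -> acc +. exp (x -. m)) 0.0 logits
  in
  let log_z = m +. log sum_exp in
  Array.map (fun x -> x -. log_z) logits
```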

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Replace dummy tensor input with ~device and ~dtype parameters
- Fix module references to use fully qualified Kaun.Ptree
- Update custom module definitions to match new init signature
- Fix Rune.R slice constructor to include proper range arguments
- Update documentation in raven_framework_overview.md

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Implement slide9.ml with Group Relative Policy Optimization (GRPO)
- Add collect_episode_from_state function for trajectory collection
- Document Gym API limitation for state persistence in GRPO
- Add warning about implementation challenges with standard Gym environments
- Include run9.ml test runner and update dune build configuration
- Apply formatting improvements across slides 7-8 for better rendering
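
GRPO's defining step, for readers unfamiliar with it, is normalising each trajectory's return against the other rollouts collected from the same start state, which is why the commit above needs state persistence beyond the standard Gym API. A minimal sketch of that group-relative normalisation in plain OCaml (the slide additionally adds KL regularisation and works on tensors):

```ocaml
(* Group-relative advantage: standardise each return against the mean and
   standard deviation of its group of rollouts from the same start state. *)
let group_relative_advantages (group_returns : float array) : float array =
  let n = float_of_int (Array.length group_returns) in
  let mean = Array.fold_left ( +. ) 0.0 group_returns /. n in
  let var =
    Array.fold_left (fun acc g -> acc +. ((g -. mean) ** 2.0)) 0.0 group_returns
    /. n
  in
  let std = sqrt var +. 1e-8 in
  Array.map (fun g -> (g -. mean) /. std) group_returns
```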

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add batch dimension handling in sample_action for unbatched observations
- Update all training loops to add batch dimension before network forward pass
- Fix action indexing in gradient computation to avoid out-of-bounds errors
- Simplify REINFORCE loss in slide4 as workaround for indexing issues in autodiff
- Update slide9 GRPO implementation with consistent shape handling

Note: Full REINFORCE loss with proper action selection has gradient computation
issues that need further investigation. Using simplified proxy loss for now.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…mputation

Critical fixes that enable actual learning:
1. Replace argmax with proper categorical sampling using cumulative distribution
2. Fix REINFORCE loss to actually connect actions to returns using masks
3. Remove reliance on autodiff-incompatible operations (set_item)

Results: Returns improve from -10 (timeout) to ~9 (reaching goal quickly)

The key issue was using argmax instead of sampling, which prevented exploration.
Secondary issue was the proxy loss that didn't implement actual policy gradient.
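
The first fix, replacing argmax with categorical sampling, can be illustrated independently of the tensor API: draw a uniform number and walk the cumulative distribution of the action probabilities until the draw falls inside a bucket. A minimal sketch in plain OCaml (the slide code does the equivalent on device tensors):

```ocaml
(* Sample an action index from a categorical distribution given
   probabilities that sum to 1. *)
let sample_categorical (probs : float array) : int =
  let u = Random.float 1.0 in
  let n = Array.length probs in
  let rec go i acc =
    if i >= n - 1 then n - 1 (* guard against rounding error *)
    else
      let acc = acc +. probs.(i) in
      if u < acc then i else go (i + 1) acc
  in
  go 0 0.0
```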

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Modify train_reinforce to return collected episodes (first and last)
- Create visualize_episode function showing grid paths with ASCII art
- Refactor run4.ml to use slide4's train_reinforce without duplication
- Fix slide_pip.ml to handle new train_reinforce return type

Visualization shows clear learning progress:
- Episode 1: Wandering path, 21 steps
- Episode 100: More direct path, 12 steps

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Create exercise1.md documenting the indexing limitation in autodiff
- Explain why only processing 10 states severely impacts learning
- Provide multiple solution approaches (debugging vs workarounds)
- Add reference comment in slide4.ml pointing to the exercise

The exercise challenges volunteers to fix a real performance issue where
we lose gradients for 90% of collected experience due to autodiff limitations
with dynamic indexing operations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…en statements

The cascade of includes doesn't work well with OCaml's type system.
Changed all slide modules to use explicit open statements for better
clarity and type checking compatibility.

- Updated slide1-9 and slide_pip to use explicit open statements
- Modified run4.ml to use Workshop.Helpers for visualization
- Updated dune file to include helpers module
- Improves module dependency clarity and compilation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
This commit enhances the RL workshop with advanced policy optimization techniques:

- Add slide9.ml: Implement REINFORCE++ with GRPO-inspired features (clipping, KL penalty) as a practical alternative to full GRPO
- Add run9.ml: Test runner for the REINFORCE++ implementation
- Add exercise2.md: Guide for extending environment API to support full GRPO with multiple trajectories
- Add exercise3.md: Guide for implementing complete PPO with batch collection and multiple optimization epochs
- Update RL_Introduction-REINFORCE.md: Add new section explaining PPO's key innovation of data reuse through multiple updates per batch
- Update dune files: Include new slide9 and run9 in build configuration
- Fix slide9.ml: Handle autodiff limitations by restricting old_log_probs computation to first 10 states

The workshop now provides a complete learning path from basic REINFORCE through modern algorithms:
REINFORCE → Actor-Critic → PPO-style improvements → GRPO concepts → Full PPO implementation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Introduces curriculum learning concepts and implementation:
- Slide 10: Introduction to curriculum learning concepts
  - Progressive difficulty stages (corridor → room → multi-box)
  - Automatic advancement based on performance
  - Curriculum state tracking and visualization

- Slide 11: Integrating curriculum with REINFORCE
  - Dynamic environment that changes with agent progress
  - Training loop that adapts to curriculum stage
  - Performance analysis across stages

- Documentation: Added comprehensive curriculum learning section
  - Explains the sparse reward problem in complex tasks
  - Shows progressive skill building approach
  - Includes practical tips and advanced techniques
  - Connects to human learning theories

- Run files (run10, run11) for demonstrating curriculum concepts
- Updated dune configuration to include new slides

The curriculum learning section complements the policy optimization
algorithms by showing how to apply them to complex, real-world tasks
where starting from scratch would be infeasible.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…ric discussion

Workshop improvements:
- Fixed reshape error in run11.exe by ensuring all environments produce 5x5 observations
- Environments now pad smaller grids to 5x5 for consistency with policy network
- Fixed custom flatten layer to handle batch dimensions correctly
- Cleaned up debug output for cleaner execution

New educational content:
- Added comprehensive section on egocentric vs allocentric representations
- Explains tradeoffs between agent-centered and world-centered views
- Discusses impact on generalization and policy complexity
- Added Exercise 4: Implementing egocentric Sokoban
  - Guides students through position-invariant policy learning
  - Shows how to transform observations to be agent-centered
  - Connects to real-world robotics applications
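
The transformation at the heart of Exercise 4 can be sketched without the framework: crop a fixed-size window of the grid re-centred on the agent, so the agent always sits in the middle of the observation and out-of-bounds cells read as walls. A minimal sketch, again assuming a `char array array` grid with `'#'` for walls (an assumption; the exercise uses the environment's own encoding):

```ocaml
(* Egocentric view: a (2*radius+1) x (2*radius+1) window centred on the
   agent. Cells outside the original grid are filled with walls, so the
   policy input shape is the same wherever the agent stands. *)
let egocentric_view grid ~agent:(ar, ac) ~radius =
  let size = (2 * radius) + 1 in
  let rows = Array.length grid and cols = Array.length grid.(0) in
  Array.init size (fun i ->
      Array.init size (fun j ->
          let r = ar - radius + i and c = ac - radius + j in
          if r < 0 || c < 0 || r >= rows || c >= cols then '#'
          else grid.(r).(c)))
```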

The workshop now covers:
Part 1: Policy Optimization (REINFORCE → PPO → GRPO)
Part 2: Curriculum Learning with Sokoban
Part 3: Representation choices (egocentric vs allocentric)

Ready for future refinements:
- More visualizations of learning progress
- Better feedback showing policy improvement
- Interactive demonstrations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…ntrol

Adds comprehensive coverage of the paradigm shift in RL from algorithm
engineering to environment engineering, including:

- The shift from domain-specific algorithms to general algorithms with
  clever environments
- Critical importance of environment quality control and solvability checking
- Detailed Sokoban solvability checking with deadlock detection
- Procedural generation with quality constraints
- Adaptive curriculum as environment scheduling
- Guidelines for ensuring curriculum quality at each stage
- Discussion of modern approaches like POET, PAIRED, and PLR
- Practical guidelines for environment validation and diversity

Key insight: The environment IS the curriculum, and careful environment
design can guide general algorithms to master complex domains without
domain-specific modifications.

This section connects curriculum learning to the broader trend of
engineering training environments rather than algorithms, showing how
the field is evolving toward more general, scalable approaches.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
lukstafi and others added 27 commits September 13, 2025 16:18
- Add borders to table cells
- Add padding for better spacing
- Add alternating row colors for easier scanning
- Add header background color

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Replace complex take_along_axis with simple one-hot approach
- Consistent with slides 4, 9, and 11
- Stays on device, enables JIT compilation
- Much clearer code
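
The one-hot approach replaces a gather operation: multiply the per-action log-probabilities by a one-hot mask of the taken action and sum, which keeps the computation autodiff-friendly and on device. A scalar sketch of the same arithmetic in plain OCaml:

```ocaml
(* Select log pi(a | s) from a row of per-action log-probabilities by
   masking with a one-hot vector and summing, instead of indexing. *)
let one_hot n i = Array.init n (fun j -> if j = i then 1.0 else 0.0)

let selected_log_prob (log_probs : float array) (action : int) : float =
  let mask = one_hot (Array.length log_probs) action in
  Array.fold_left ( +. ) 0.0
    (Array.mapi (fun j m -> m *. log_probs.(j)) mask)
```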

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Create slide12 for comparing REINFORCE variants
- Implement ASCII plots for returns and losses
- Support N-way comparisons (extensible design)
- Show REINFORCE with and without baseline
- Include moving average smoothing for clearer trends
- Demonstrate baseline reduces variance
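
The moving-average smoothing used for the plots is a simple trailing window over the per-episode returns; a small sketch in plain OCaml (the plotting code may window and pad differently):

```ocaml
(* Trailing moving average: point t is the mean of the last [window]
   values up to and including t (shorter at the start of the series). *)
let moving_average ~window (xs : float array) : float array =
  Array.mapi
    (fun t _ ->
      let lo = max 0 (t - window + 1) in
      let n = t - lo + 1 in
      let sum = ref 0.0 in
      for i = lo to t do sum := !sum +. xs.(i) done;
      !sum /. float_of_int n)
    xs
```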

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Merge slide12 into standalone plots executable
- Generate ASCII plots for terminal output
- Generate SVG plots for file output
- Compare REINFORCE with/without baseline
- Extensible design for N-way comparisons
- Moving average smoothing for clearer trends

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Include SVG plots showing REINFORCE with vs without baseline
- Plots demonstrate variance reduction from baseline
- Visual evidence of key RL concepts

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Add batch dimensions to states for value network
- Use one-hot encoding instead of take_along_axis for consistency
- Handle state/action/reward array length mismatch properly
- Squeeze value predictions to correct shape
- Fix n_steps calculation for proper averaging

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Add training_history type to slide4.ml
- Update slides 4, 5, 6 to return training history
- Refactor plots.ml to use slide functions directly
- Update all run files and slide_pip for new return types
- Stress-test slide code while avoiding duplication

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Fix index out of bounds in slide5.ml by iterating only over available advantages
- Fix similar issue in slide6.ml Actor-Critic implementation
- Add clarification that Actor-Critic shows value loss vs policy loss in plots
- Update SVG legend to indicate "Actor-Critic (value loss)" on loss plots
- Replace old SVG files with versioned ones from tutorial updates

The key bug was that episodes can terminate early, leaving fewer returns/advantages
than states, causing array access errors when iterating over all states.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add REINFORCE++ (baseline + clipping + KL penalty) to plots.ml comparison
- Fix REINFORCE++ in slide9.ml by adding baseline for variance reduction
- Update slide9 to return training history like other algorithms
- Add support for 4 algorithms in plots.ml with distinct colors/symbols
- Update analysis to explain REINFORCE++ adds stability through clipping/KL
- Add v3 SVG plots showing all four REINFORCE variants
- Update tutorial documentation to reference REINFORCE++ instability

Key insight: REINFORCE++ shows that more complexity doesn't always mean
better performance - the interaction between baseline, clipping, and KL
penalty can be tricky to tune. This is an important educational outcome
demonstrating that simpler methods often work better in practice.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Create comprehensive exercise on upgrading from FC to CNN architectures
- Cover conv2d layers, kernel sizes, padding, and shape management
- Include advanced techniques like residual connections and spatial attention
- Position exercise in Sokoban section where spatial complexity motivates CNNs
- Add reference to CNN benefits in Actor-Critic section

The exercise teaches students that CNNs preserve spatial structure, enabling
the network to learn patterns like "box next to wall" that work anywhere
on the grid, rather than memorizing position-specific actions.

Key learning objectives:
- Understand why spatial structure matters for grid-based tasks
- Implement CNN architectures for both policy and value networks
- Handle shape transformations (adding channel dimensions)
- Compare performance between FC and CNN approaches

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add grid_size parameter to all training functions (default 5 for compatibility)
- Update initialize_policy and sample_action to handle dynamic grid sizes
- Add -grid CLI option to plots.ml for flexible environment sizes
- Fix all main() and run*.ml files to use correct function signatures
- Test confirmed working at 5x5, 7x7, and 9x9 grid sizes
- MLPs still effective at 9x9, though learning is slower as expected

This change enables more flexible experimentation and sets foundation
for CNN-based policies (exercise 4) which benefit from larger grids.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Add -env CLI option to select between gridworld and curriculum environments
- Support both simple gridworld (configurable size) and curriculum learning
- Automatically use 5x5 grid for curriculum environment (hardcoded in slide11)
- Update help text with environment examples
- Test confirmed working with both environment types

This enables comparison of algorithm performance across different
environment types and complexities.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Port Sokoban environment from dev/fehu/envs/sokoban.ml
- Create workshop-friendly curriculum with progressive difficulty:
  * Corridor 3 → Corridor 5 → Room 5 → Room 7 → MultiBox 2 → Complex
- Automatic stage advancement based on 70% success rate over 50 episodes
- Integrate with plots.ml for algorithm comparison
- Pad observations to 9x9 for consistent network input size
- Test results show successful curriculum progression:
  * Agents master simple corridors quickly (100% success)
  * Struggle begins with room environments (as expected)
  * Different algorithms show varying curriculum progression rates

This provides a more realistic and challenging testbed compared to
the simple gridworld, demonstrating the importance of curriculum
learning for complex tasks.
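
The automatic stage advancement described above boils down to tracking recent episode outcomes and advancing when the success rate over a fixed window clears a threshold (70% over 50 episodes here; the verified curriculum later uses 80%). A framework-independent sketch, with hypothetical names:

```ocaml
(* Advance to the next curriculum stage when the success rate over the
   last [window] episodes reaches [threshold]. *)
type curriculum = {
  mutable stage : int;
  mutable recent : bool list; (* newest first, length <= window *)
}

let record_episode cur ~window ~threshold ~n_stages ~success =
  cur.recent <- success :: cur.recent;
  if List.length cur.recent > window then
    cur.recent <- List.filteri (fun i _ -> i < window) cur.recent;
  if List.length cur.recent >= window then begin
    let wins =
      List.fold_left (fun n s -> if s then n + 1 else n) 0 cur.recent
    in
    if float_of_int wins /. float_of_int window >= threshold
       && cur.stage < n_stages - 1
    then begin
      cur.stage <- cur.stage + 1;
      cur.recent <- [] (* restart the window for the new stage *)
    end
  end
```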

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
…slop; copy to workshop/ the curriculum and non-ML agent we developed several days ago.
- Add verified-curriculum environment option for sophisticated Sokoban puzzles
- Support progressive difficulty with curriculum learning stages
- Add -max-steps parameter for configuring episode length
- Remove unused basic sokoban function from verified.ml
- Clean up redundant sokoban_curriculum parameter

The verified-curriculum environment provides:
- Corridor puzzles (length 3 and 5)
- Room puzzles (5x5 and 7x7)
- Multi-box puzzles (2-3 boxes)
- Complex predefined puzzles
- Automatic stage advancement based on 80% success rate

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Adapt backoff_tabular.ml to match the training interface of other algorithms
- Add training_history type with returns and TD errors tracking
- Update train_backoff function signature to match slide4/slide5/slide6 pattern
- Add backoff-tabular option to plots.ml command-line arguments
- Integrate backoff-tabular training into the main algorithm comparison loop
- Track TD errors as "losses" for consistent visualization
- Update analysis descriptions to explain backoff-tabular's hierarchical abstraction
- Fix mod operator issue in slide4.ml (% -> mod)

The backoff-tabular algorithm uses hierarchical state abstraction with
multiple window sizes (3x3, 5x5, 7x7, full) to efficiently learn Q-values.
It performs exceptionally well on Sokoban tasks, achieving near-perfect
returns on the verified-curriculum environment.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Extended training_history type to include collected_episodes field
- All algorithms now use shared episode_data type from slide3.ml
- Collect episodes at regular intervals (every n_episodes/10 and final)
- Updated backoff_tabular to match shared interface with neural algorithms
- Ensures consistent data collection across REINFORCE variants and Q-learning

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
- Created visualizations.ml module with SVG generation for episodes
- Generates SVG visualizations for gridworld and Sokoban states
- Creates interactive HTML replay pages with playback controls
- Integrated visualization into plots.ml training pipeline
- Organized outputs in episodes/<env>/<algo>/<episode>/ structure
- Fixed type sharing: all algorithms use shared training_history type
- Added episodes/ to .gitignore (many generated files)

Features:
- Step-by-step episode replay with play/pause/speed controls
- Keyboard navigation (space, arrows, r for reset)
- Different rendering for gridworld vs Sokoban environments
- Collects episodes at regular intervals during training

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
tmattio (Contributor) commented Sep 19, 2025

Hi @lukstafi! Should I select the relevant updates from the PR now, or did you want to send a separate PR? I don't mind either way, let me know :)

lukstafi (Author)

The workshop files are independent; they just need fehu/ installed. They build the Sokoban env themselves rather than relying on the envs/ files from this PR. The following is the final branch of the workshop progression: https://github.com/lukstafi/raven-sokoban/tree/feat/sokoban-exercise4-solution

So the question is whether you want to refactor the examples to use envs/ in general; otherwise I'd prefer to copy the workshop files over to raven-ml/funocaml-2025-rl, reproducing the progression, and we can discard this PR. I'll keep my fork for this PR alive, archived, to keep links working.

lukstafi closed this Sep 20, 2025