This repository contains the codebase developed for the course project in Fundamentals of Artificial Intelligence, Machine and Deep Learning (FAIML) at Politecnico di Torino.
The project explores reinforcement learning algorithms for continuous robotic control and the challenges of sim-to-real transfer within a controlled sim-to-sim framework. It is divided into two main parts:
- From-scratch implementations of classical policy-gradient and Actor-Critic methods on the Gymnasium MuJoCo Hopper-v4 environment.
- Transfer learning and domain randomization experiments using Stable-Baselines3 on the PandaPush-v3 environment, focusing on policy robustness under dynamics mismatch (cube mass variation).
.
├── p1-policy-gradient-methods/
│ ├── agent.py # REINFORCE and Actor-Critic agent implementations
│ ├── train.py # Training script for Hopper experiments
│ ├── evaluate.py # Evaluation script for trained Hopper policies
│ ├── inspect_hopper.py # Hopper environment inspection utilities
│ ├── test_random_policy.py # Random policy test on Hopper
│ ├── render_policy.py # Rendering script for trained Hopper policies
│ ├── plot_results.py # Plot generation from Part 1 results
│ ├── plot_results.ipynb # Notebook version of the plotting workflow
│ ├── results/ # Training logs, evaluation outputs, and saved results
│ ├── plots/ # Generated plots for Part 1
│ ├── frames/ # Rendered frames
│ └── screenshots/ # Environment screenshots
│
├── p2-advanced-rl-and-transfer/
│ ├── train_ppo_sb3.py # PPO training pipeline with Stable-Baselines3
│ ├── train_sac_sb3.py # SAC training pipeline with Stable-Baselines3
│ ├── eval_ppo_sb3.py # PPO evaluation script
│ ├── eval_sac_sb3.py # SAC evaluation script
│ ├── rand_wrapper.py # Source/target, UDR, and ADR randomization wrapper
│ ├── test_random_policy_p2.py # Random policy test on PandaPush
│ ├── render_policy.py # Rendering script for trained PandaPush policies
│ ├── plot_eval_curves.py # Plot generation from evaluation curves
│ ├── plot_tensorboard_curves.py # Plot generation from TensorBoard logs
│ ├── logs/ # Evaluation and experiment logs
│ ├── models/ # Saved PPO/SAC models
│ ├── tensorboard_logs/ # TensorBoard training logs
│ ├── render_outputs/ # Rendered PandaPush outputs
│ ├── report_figures/ # Figures used in the final report
│ └── panda-gym/ # Local PandaGym-related files
│
├── assets/ # README assets and media
├── logs/ # Additional project-level logs
├── Project report.pdf # Final project report
├── requirements.txt # Python dependencies
├── LICENSE
└── README.md
- REINFORCE: Vanilla policy gradient method using full Monte Carlo trajectory rollouts.
- REINFORCE with Constant Baseline: Variance reduction via an action-independent baseline.
- Actor-Critic (One-Step): Bootstrapped temporal-difference updates to reduce variance.
- n-Step Actor-Critic: A multi-step variant balancing the bias-variance trade-off in credit assignment.
- Proximal Policy Optimization (PPO): On-policy policy-gradient method using a clipped surrogate objective.
- Soft Actor-Critic (SAC): Off-policy maximum-entropy actor-critic algorithm.
- Uniform Domain Randomization (UDR): Domain randomization sampling simulator parameters from a fixed uniform range.
- Adaptive/Automatic Domain Randomization (ADR): Dynamic range adjustment based on agent performance thresholds.
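The quantities behind the REINFORCE and n-step Actor-Critic entries above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `agent.py`: the function names, the default discount factor, and the use of plain lists instead of tensors are all assumptions made for the example.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def reinforce_weights(rewards, gamma=0.99, baseline=0.0):
    """Per-step weights for the log-probability gradient in REINFORCE.

    Subtracting a constant, action-independent baseline leaves the gradient
    unbiased while reducing its variance. baseline=0.0 gives vanilla REINFORCE.
    """
    return [g - baseline for g in discounted_returns(rewards, gamma)]


def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """n-step TD target: n observed rewards, then a bootstrapped critic value.

    `rewards` holds the n rewards collected from the current state and
    `bootstrap_value` is the (hypothetical) critic estimate V(s_{t+n}).
    With a single reward this reduces to the one-step Actor-Critic target.
    """
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target


# Example: a three-step episode with a constant baseline of 1.0.
weights = reinforce_weights([1.0, 1.0, 1.0], gamma=0.5, baseline=1.0)
```

A larger `n` in `n_step_target` relies more on observed rewards (lower bias, higher variance), while a smaller `n` relies more on the critic estimate (higher bias, lower variance).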
The Hopper-v4 environment is used to evaluate the custom-built policy-gradient and Actor-Critic architectures. The agent must learn to coordinate a single-legged robot to hop forward while maintaining balance.
The PandaPush-v3 environment is used to study advanced policy optimization (PPO vs. SAC) and sim-to-sim transfer robustness. The agent controls a Franka Panda robotic arm tasked with pushing a cube to a target position.
- Source Environment (Simulation): Cube mass is set to $1.0\text{ kg}$.
- Target Environment ("Real" test): Cube mass is increased to $5.0\text{ kg}$, introducing a controlled physical discrepancy to evaluate transfer degradation and the stabilizing effects of UDR and ADR.
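To make the difference between UDR and ADR concrete, the sketch below samples a cube mass per episode from a range that can either stay fixed (UDR) or be widened and narrowed from measured performance (ADR). It is a hypothetical illustration and not the logic in `rand_wrapper.py`: the class name, the `expand_above` and `shrink_below` success-rate thresholds, and the exact update rule are assumptions. The constructor arguments only mirror the kind of values a mass-range configuration would need (initial range, hard limits, step size, boundary-sampling probability).

```python
import random


class MassRangeSampler:
    """Hypothetical sketch of UDR/ADR sampling for the cube mass (in kg)."""

    def __init__(self, low, high, limit_low, limit_high, step, boundary_prob, rng=None):
        self.low, self.high = low, high
        self.limit_low, self.limit_high = limit_low, limit_high
        self.step = step
        self.boundary_prob = boundary_prob
        self.rng = rng or random.Random(0)

    def sample(self):
        """Return (mass, boundary), where boundary is None, "low", or "high"."""
        if self.rng.random() < self.boundary_prob:
            # ADR-style boundary episode: pin the mass to one edge of the range
            # so that performance at that edge can be measured.
            side = self.rng.choice(["low", "high"])
            return (self.low if side == "low" else self.high), side
        # UDR-style episode: draw uniformly inside the current range.
        return self.rng.uniform(self.low, self.high), None

    def update(self, boundary, success_rate, expand_above=0.8, shrink_below=0.4):
        """Widen or narrow one edge based on the success rate measured at it."""
        if boundary is None:
            return
        if success_rate >= expand_above:
            delta = self.step
        elif success_rate <= shrink_below:
            delta = -self.step
        else:
            return
        if boundary == "high":
            self.high = min(self.limit_high, max(self.low, self.high + delta))
        else:
            self.low = max(self.limit_low, min(self.high, self.low - delta))


# UDR: fixed range, never updated (boundary_prob=0.0 disables boundary episodes).
udr = MassRangeSampler(1.0, 6.0, 1.0, 6.0, step=0.0, boundary_prob=0.0)
udr_mass, _ = udr.sample()

# ADR: start narrow and let the range grow toward the hard limits.
adr = MassRangeSampler(1.0, 1.5, 1.0, 10.0, step=0.25, boundary_prob=0.5)
adr.update("high", success_rate=0.9)  # upper bound moves from 1.5 to 1.75
```

Under these assumptions, a UDR policy sees the full mass range from the first episode, while an ADR policy only encounters heavier cubes once it succeeds at the current upper bound.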
To set up the repository and install the necessary dependencies, execute the following commands:
# Clone the repository
git clone https://github.com/matteosgobba/rl-sim2real-hopper.git
cd rl-sim2real-hopper
# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Note: MuJoCo installation is required for the Hopper environment. Follow the instructions on the Gymnasium website if you encounter issues.
To train and evaluate a policy-gradient or Actor-Critic agent on the Hopper environment, use the provided script with appropriate arguments:
python p1-policy-gradient-methods/train.py --algo reinforce --baseline 20 --episodes 1000 --seed 0
python p1-policy-gradient-methods/train.py --algo actor_critic --ac-variant n_step --n-steps 10 --episodes 1000 --seed 0

To run baseline training, source-to-target transfer, or domain randomization configurations on PandaPush, run the scripts from the p2-advanced-rl-and-transfer/ directory:
# Train baseline PPO on source environment
python train_ppo_sb3.py \
--env-type source \
--reward-type dense \
--sampling-strategy none \
--timesteps 1000000 \
--seed 0 \
--learning-rate 0.004 \
--ent-coef 0.01
# Train SAC with Uniform Domain Randomization
python train_sac_sb3.py \
--env-type source \
--reward-type dense \
--sampling-strategy udr \
--timesteps 1000000 \
--seed 0 \
--learning-rate 0.0004 \
--ent-coef 0.01 \
--mass-min 1.0 \
--mass-max 6.0
# Train SAC with Adaptive Domain Randomization
python train_sac_sb3.py \
--env-type source \
--reward-type dense \
--sampling-strategy adr \
--timesteps 1000000 \
--seed 0 \
--learning-rate 0.0004 \
--ent-coef 0.01 \
--initial-mass-min 1.0 \
--initial-mass-max 1.5 \
--mass-limit-min 1.0 \
--mass-limit-max 10.0 \
--adr-step 0.25 \
--boundary-prob 0.5

The experimental configurations, hyperparameter tuning logs, training curves, and an in-depth comparative analysis of the algorithms are detailed in the project's final report. Evaluation metrics such as success rates, mean returns, and standard deviations across multiple seeds can be visualized using TensorBoard:

tensorboard --logdir runs/

This project was developed by:
- Matteo Sgobba (matteosgobba)
- Federico Di Leo (FedericoDiLeo02)
- Martina Mancini (marti030)
- Paolo Campanini (CampaniniP)
This project is licensed under the MIT License.

