Policy Gradient Reinforcement Learning and Sim-to-Real Transfer for Robotic Control

This repository contains the codebase developed for the course project in Fundamentals of Artificial Intelligence, Machine and Deep Learning (FAIML) at Politecnico di Torino.

The project explores reinforcement learning algorithms for continuous robotic control and the challenges of sim-to-real transfer within a controlled sim-to-sim framework. It is divided into two main parts:

1. From-scratch implementations of classical policy-gradient and Actor-Critic methods on the Gymnasium MuJoCo Hopper-v4 environment.
2. Transfer learning and domain randomization experiments using Stable-Baselines3 on the PandaPush-v3 environment, focusing on policy robustness under dynamics mismatch (cube mass variation).

[Figures: Hopper Environment | PandaPush Environment]

Repository Structure

.
├── p1-policy-gradient-methods/
│   ├── agent.py                         # REINFORCE and Actor-Critic agent implementations
│   ├── train.py                         # Training script for Hopper experiments
│   ├── evaluate.py                      # Evaluation script for trained Hopper policies
│   ├── inspect_hopper.py                # Hopper environment inspection utilities
│   ├── test_random_policy.py            # Random policy test on Hopper
│   ├── render_policy.py                 # Rendering script for trained Hopper policies
│   ├── plot_results.py                  # Plot generation from Part 1 results
│   ├── plot_results.ipynb               # Notebook version of the plotting workflow
│   ├── results/                         # Training logs, evaluation outputs, and saved results
│   ├── plots/                           # Generated plots for Part 1
│   ├── frames/                          # Rendered frames
│   └── screenshots/                     # Environment screenshots
│
├── p2-advanced-rl-and-transfer/
│   ├── train_ppo_sb3.py                 # PPO training pipeline with Stable-Baselines3
│   ├── train_sac_sb3.py                 # SAC training pipeline with Stable-Baselines3
│   ├── eval_ppo_sb3.py                  # PPO evaluation script
│   ├── eval_sac_sb3.py                  # SAC evaluation script
│   ├── rand_wrapper.py                  # Source/target, UDR, and ADR randomization wrapper
│   ├── test_random_policy_p2.py         # Random policy test on PandaPush
│   ├── render_policy.py                 # Rendering script for trained PandaPush policies
│   ├── plot_eval_curves.py              # Plot generation from evaluation curves
│   ├── plot_tensorboard_curves.py       # Plot generation from TensorBoard logs
│   ├── logs/                            # Evaluation and experiment logs
│   ├── models/                          # Saved PPO/SAC models
│   ├── tensorboard_logs/                # TensorBoard training logs
│   ├── render_outputs/                  # Rendered PandaPush outputs
│   ├── report_figures/                  # Figures used in the final report
│   └── panda-gym/                       # Local PandaGym-related files
│
├── assets/                              # README assets and media
├── logs/                                # Additional project-level logs
├── Project report.pdf                   # Final project report
├── requirements.txt                     # Python dependencies
├── LICENSE
└── README.md

Implemented Methods

REINFORCE: Vanilla policy gradient method using full Monte Carlo trajectory rollouts.
REINFORCE with Constant Baseline: Variance reduction via an action-independent baseline.
Actor-Critic (One-Step): Bootstrapped temporal-difference updates to reduce variance.
n-Step Actor-Critic: A multi-step variant balancing the bias-variance trade-off in credit assignment.
Proximal Policy Optimization (PPO): On-policy policy-gradient method using a clipped surrogate objective.
Soft Actor-Critic (SAC): Off-policy maximum-entropy actor-critic algorithm.
Uniform Domain Randomization (UDR): Domain randomization sampling simulator parameters from a fixed uniform range.
Adaptive/Automatic Domain Randomization (ADR): Dynamic range adjustment based on agent performance thresholds.
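
As a rough illustration of the first two entries in this list, the sketch below computes discounted Monte Carlo returns and subtracts a constant baseline to obtain the per-step weights that multiply the log-probability gradients in REINFORCE. The function names and the default discount factor are assumptions made for this example; they are not taken from `agent.py`, which may organize the computation differently.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_weights(
    rewards: Sequence[float], gamma: float = 0.99, baseline: float = 0.0
) -> List[float]:
    """Per-step weights (G_t - b) applied to grad log pi(a_t | s_t).

    The baseline is a constant, so it does not depend on the action
    and leaves the gradient estimate unbiased.
    """
    return [g - baseline for g in discounted_returns(rewards, gamma)]


if __name__ == "__main__":
    print(reinforce_weights([1.0, 1.0, 1.0], gamma=0.5, baseline=1.0))
```

With `baseline=0.0` this reduces to vanilla REINFORCE; a non-zero constant only shifts the weights, which is where the variance reduction comes from.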

Tasks and Environments

1. Hopper Locomotion (`Hopper-v4`)

Used to evaluate custom-built policy-gradient and Actor-Critic architectures. The agent must learn to coordinate a single-legged robot to hop forward while maintaining balance.

2. Panda Push (`PandaPush-v3`)

Used to study advanced policy optimization (PPO vs. SAC) and sim-to-sim transfer robustness. The agent controls a Franka Panda robotic arm tasked with pushing a cube to a target position.

Source Environment (Simulation): Cube mass is set to $1.0\text{ kg}$.
Target Environment ("Real" test): Cube mass is increased to $5.0\text{ kg}$, introducing a controlled physical discrepancy to evaluate transfer degradation and the stabilizing effects of UDR and ADR.
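
To make the source/target setup concrete, the following minimal sketch shows what Uniform Domain Randomization over the cube mass could look like: at the start of each episode a mass is drawn uniformly from a fixed range that covers the target value. The constant names and `sample_udr_mass` are hypothetical, and the step that writes the sampled mass into the simulator is deliberately left out because it depends on how `rand_wrapper.py` is implemented. The range 1.0 to 6.0 kg mirrors the `--mass-min` and `--mass-max` values used in the Usage section.

```python
import random

SOURCE_MASS_KG = 1.0  # cube mass in the source environment
TARGET_MASS_KG = 5.0  # cube mass in the target ("real") environment


def sample_udr_mass(mass_min: float, mass_max: float, rng: random.Random) -> float:
    """Draw one cube mass uniformly from [mass_min, mass_max]."""
    if mass_min > mass_max:
        raise ValueError("mass_min must not exceed mass_max")
    return rng.uniform(mass_min, mass_max)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    # One mass per episode; the range includes both source and target values.
    masses = [sample_udr_mass(1.0, 6.0, rng) for _ in range(5)]
    print(masses)
```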

Installation

To set up the repository and install the necessary dependencies, execute the following commands:

# Clone the repository
git clone https://github.com/matteosgobba/rl-sim2real-hopper.git
cd rl-sim2real-hopper

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Note: MuJoCo installation is required for the Hopper environment. Follow the instructions on the Gymnasium website if you encounter issues.

Usage

Part 1: Hopper (Custom Implementations)

To train and evaluate a policy-gradient or Actor-Critic agent on the Hopper environment, use the provided script with appropriate arguments:

python p1-policy-gradient-methods/train.py --algo reinforce --baseline 20 --episodes 1000 --seed 0
python p1-policy-gradient-methods/train.py --algo actor_critic --ac-variant n_step --n-steps 10 --episodes 1000 --seed 0
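
For readers unfamiliar with what an `--n-steps` setting typically controls, here is a minimal sketch of an n-step bootstrapped target of the kind used by n-step Actor-Critic. It is an illustration under assumptions, not a copy of the repository's code: `n_step_target` and its arguments are hypothetical names, and the critic's value estimate is passed in as a plain number.

```python
from typing import Sequence


def n_step_target(
    rewards: Sequence[float], bootstrap_value: float, gamma: float = 0.99
) -> float:
    """Return r_t + gamma * r_{t+1} + ... + gamma^n * V(s_{t+n}).

    `rewards` holds the n rewards collected from step t onward and
    `bootstrap_value` is the critic's estimate V(s_{t+n}). Pass 0.0 as
    the bootstrap value if the episode terminated inside the window.
    """
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


if __name__ == "__main__":
    # With a single reward this is the one-step TD target.
    print(n_step_target([1.0], bootstrap_value=4.0, gamma=0.5))
    print(n_step_target([1.0, 1.0], bootstrap_value=4.0, gamma=0.5))
```

The advantage estimate is then the target minus the critic's value at the starting state. Larger n leans more on observed rewards (less bias, more variance), while n = 1 recovers the one-step update.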

Part 2: PandaPush (Transfer and Domain Randomization)

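To run baseline training, source-to-target transfer, or domain randomization configurations, use the scripts in `p2-advanced-rl-and-transfer/` (the commands below assume that directory is the working directory):

# Train baseline PPO on source environment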
python train_ppo_sb3.py \
--env-type source \
--reward-type dense \
--sampling-strategy none \
--timesteps 1000000 \
--seed 0 \
--learning-rate 0.004 \
--ent-coef 0.01

# Train SAC with Uniform Domain Randomization
python train_sac_sb3.py \
--env-type source \
--reward-type dense \
--sampling-strategy udr \
--timesteps 1000000 \
--seed 0 \
--learning-rate 0.0004 \
--ent-coef 0.01 \
--mass-min 1.0 \
--mass-max 6.0

# Train SAC with Adaptive Domain Randomization
python train_sac_sb3.py \
--env-type source \
--reward-type dense \
--sampling-strategy adr \
--timesteps 1000000 \
--seed 0 \
--learning-rate 0.0004 \
--ent-coef 0.01 \
--initial-mass-min 1.0 \
--initial-mass-max 1.5 \
--mass-limit-min 1.0 \
--mass-limit-max 10.0 \
--adr-step 0.25 \
--boundary-prob 0.5
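
The ADR flags above suggest a range that starts narrow and grows as the agent succeeds. The sketch below is one plausible reading of that scheme, not the behaviour of `rand_wrapper.py`: the class name, the success thresholds, the buffer size, and the choice to adapt only the upper bound are all assumptions made for illustration. The default values of `low`, `high`, `limit_high`, `step`, and `boundary_prob` mirror the command-line values shown above.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AdrMassRange:
    low: float = 1.0             # --initial-mass-min
    high: float = 1.5            # --initial-mass-max
    limit_high: float = 10.0     # --mass-limit-max
    step: float = 0.25           # --adr-step
    boundary_prob: float = 0.5   # --boundary-prob
    expand_threshold: float = 0.8  # assumed: success rate that widens the range
    shrink_threshold: float = 0.4  # assumed: success rate that narrows the range
    buffer_size: int = 10          # assumed: boundary episodes per decision
    upper_results: List[float] = field(default_factory=list)

    def sample(self, rng: random.Random) -> Tuple[float, bool]:
        """Return (mass, at_boundary) for the next episode."""
        if rng.random() < self.boundary_prob:
            return self.high, True  # probe the current upper boundary
        return rng.uniform(self.low, self.high), False

    def report(self, success: bool, at_boundary: bool) -> None:
        """Record an episode outcome and adapt the upper bound if due."""
        if not at_boundary:
            return  # only boundary episodes drive the range update
        self.upper_results.append(1.0 if success else 0.0)
        if len(self.upper_results) < self.buffer_size:
            return
        rate = sum(self.upper_results) / len(self.upper_results)
        self.upper_results.clear()
        if rate >= self.expand_threshold:
            self.high = min(self.high + self.step, self.limit_high)
        elif rate <= self.shrink_threshold:
            self.high = max(self.high - self.step, self.low)
```

A training loop would call `sample` at every reset, set the cube mass accordingly, and call `report` with the episode's success flag once the episode ends.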

Results and Report

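The experimental configurations, hyperparameter tuning logs, training curves, and an in-depth comparative analysis of the algorithms are detailed in the project's final report (`Project report.pdf`). Training metrics such as success rates and mean returns, along with their standard deviations across multiple seeds, can be visualized using TensorBoard by pointing it at the directory that holds the training logs (`tensorboard_logs/` under `p2-advanced-rl-and-transfer/` in the repository structure above):

tensorboard --logdir p2-advanced-rl-and-transfer/tensorboard_logs/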
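
When comparing runs outside TensorBoard, per-seed metrics are usually reduced to a mean and a standard deviation. The sketch below shows that reduction using only the standard library. The `summarize` helper is a hypothetical name, and the numbers in the example are placeholders chosen for illustration, not results from this project.

```python
from statistics import mean, stdev
from typing import Iterable, Tuple


def summarize(values: Iterable[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed values.

    The standard deviation is reported as 0.0 when only one seed is given.
    """
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


if __name__ == "__main__":
    # Placeholder success rates keyed by seed (illustrative only).
    success_by_seed = {0: 0.5, 1: 0.75, 2: 1.0}
    avg, spread = summarize(success_by_seed.values())
    print(f"success rate: {avg:.2f} +/- {spread:.2f}")
```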

Credits

This project was developed by:

Matteo Sgobba (matteosgobba)
Federico Di Leo (FedericoDiLeo02)
Martina Mancini (marti030)
Paolo Campanini (CampaniniP)

License

This project is licensed under the MIT License.
