Air Hockey RL Agent

Self-play reinforcement learning with puck-focused observations

Training Methodology

This agent was trained with Proximal Policy Optimization (PPO) and self-play for 10M timesteps. The key innovation is removing the opponent from the agent's observations, which forces the policy to engage with the puck rather than settle into defensive positioning relative to the opponent.
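To make the idea of removing opponent observations concrete, here is a minimal sketch of an observation builder that drops every opponent feature before the policy sees the state. The feature names (own_*, puck_*, opp_*) and their ordering are hypothetical assumptions for illustration only; they are not the agent's actual feature list. Under self-play both sides query the same policy, so each player's observation is assumed to be built the same way from that player's own perspective.

```python
# Hypothetical full per-player state; names and order are illustrative
# assumptions, not the project's real observation layout.
FULL_FEATURES = (
    "own_x", "own_y", "own_vx", "own_vy",      # this player's paddle
    "puck_x", "puck_y", "puck_vx", "puck_vy",  # puck position and velocity
    "opp_x", "opp_y", "opp_vx", "opp_vy",      # opponent paddle (to be removed)
)

# Keep everything except opponent features, preserving a fixed order.
PUCK_FOCUSED = tuple(k for k in FULL_FEATURES if not k.startswith("opp_"))


def make_observation(state: dict, keys: tuple = PUCK_FOCUSED) -> tuple:
    """Return the selected features as floats, in the order given by `keys`."""
    return tuple(float(state[k]) for k in keys)


def observations_for_both_players(left: dict, right: dict) -> tuple:
    """Self-play: both sides share one policy, so both observations
    are built by the same puck-focused function."""
    return make_observation(left), make_observation(right)


if __name__ == "__main__":
    state = {k: float(i) for i, k in enumerate(FULL_FEATURES)}
    print(make_observation(state))  # no opp_* values appear in the output
```

Because the opponent features never reach the policy, it cannot condition on where the opponent is. In a real training setup the tuple would typically be converted to an array matching the environment's observation space.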

Observation Space (8 features)

Results

Code

Training Script | Environment | Web Inference