Air Hockey RL Agent

Training Methodology

This agent was trained with Proximal Policy Optimization (PPO) and self-play over 10M timesteps. The key innovation is removing opponent observations from the agent's input, which forces puck engagement rather than defensive positioning.
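For readers unfamiliar with PPO, the clipped surrogate term at its core can be sketched in a few lines. This is a generic illustration of the standard objective, not code from the training script; the function names and the clip range of 0.2 are assumptions, since this README does not list hyperparameters.

```python
from typing import Sequence


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate term for a single sample (to be maximized).

    ratio     -- new_policy_prob / old_policy_prob for the action taken
    advantage -- advantage estimate for that action
    clip_eps  -- clip range; 0.2 is an assumed value, not taken from this project
    """
    # Clamp the probability ratio into [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def mean_clipped_objective(
    ratios: Sequence[float], advantages: Sequence[float], clip_eps: float = 0.2
) -> float:
    """Average the clipped surrogate over a batch of samples."""
    terms = [clipped_surrogate(r, a, clip_eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

With a positive advantage, gains stop growing once the ratio exceeds 1 + clip_eps; with a negative advantage, the unclipped (more pessimistic) value is kept when the ratio grows.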

Observation Space (8 features)

Own paddle: position (x, y), velocity (dx, dy)
Puck: position (x, y), velocity (dx, dy)
No opponent tracking: the opponent's paddle state is excluded from the observation to prevent a defensive Nash equilibrium
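The 8-feature layout above could be assembled roughly as follows. This is a minimal sketch: the `Body` dataclass, the `build_observation` name, and the feature ordering are hypothetical and may differ from the actual environment code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Body:
    """Position and velocity of a paddle or the puck (hypothetical container)."""
    x: float
    y: float
    dx: float
    dy: float


def build_observation(own_paddle: Body, puck: Body, opponent: Body) -> List[float]:
    """Return the 8-feature observation: own paddle (4) followed by puck (4).

    The opponent is accepted as an argument but deliberately left out of the
    result, mirroring the "no opponent tracking" design described above.
    """
    return [
        own_paddle.x, own_paddle.y, own_paddle.dx, own_paddle.dy,
        puck.x, puck.y, puck.dx, puck.dy,
    ]
```

Because the opponent never enters the vector, two game states that differ only in the opponent's paddle produce identical observations.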

Results

54% win rate vs. a random opponent
0.54 goals/game (a 2.25x improvement over the baseline)
Trained with 10 parallel environments and a 20-opponent fictitious self-play pool
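A fixed-size opponent pool like the one mentioned above might be managed as in the sketch below. The `OpponentPool` class, uniform sampling, and oldest-first eviction are all assumptions made for illustration; the training script may select and retire opponents differently.

```python
import copy
import random
from collections import deque
from typing import Any, Optional


class OpponentPool:
    """Fixed-size pool of past policy snapshots (hypothetical helper)."""

    def __init__(self, max_size: int = 20, seed: Optional[int] = None) -> None:
        # A deque with maxlen drops the oldest snapshot once the pool is full.
        self._snapshots: deque = deque(maxlen=max_size)
        self._rng = random.Random(seed)

    def add(self, policy_params: Any) -> None:
        """Store a frozen copy so later training updates do not alter the snapshot."""
        self._snapshots.append(copy.deepcopy(policy_params))

    def sample(self) -> Any:
        """Pick one past snapshot uniformly at random (assumed sampling rule)."""
        if not self._snapshots:
            raise ValueError("opponent pool is empty")
        return self._rng.choice(list(self._snapshots))

    def __len__(self) -> int:
        return len(self._snapshots)
```

In a training loop, the current policy would be added to the pool at some checkpoint interval and each environment would draw its opponent with `sample()` at episode start.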

Code

Training Script | Environment | Web Inference