Air Hockey RL Agent
Self-play reinforcement learning with puck-focused observations
Training Methodology
This agent was trained with Proximal Policy Optimization (PPO) and self-play over 10M timesteps. The key innovation is removing opponent observations from the agent's input, which forces puck engagement rather than passive defensive positioning.
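The overall training loop might look roughly like the sketch below. It assumes a hypothetical `AirHockeySelfPlayEnv` wrapper (with an `add_opponent_snapshot` helper) and uses stable-baselines3 PPO; the actual training code is not reproduced in this README.

```python
# Hedged sketch of PPO self-play training, assuming a hypothetical
# AirHockeySelfPlayEnv that plays the learner against frozen snapshots of itself.
from stable_baselines3 import PPO

from air_hockey_env import AirHockeySelfPlayEnv  # hypothetical custom env

env = AirHockeySelfPlayEnv()  # opponent actions come from frozen policy snapshots
model = PPO("MlpPolicy", env, verbose=1)

# Periodically freeze the current policy into the opponent pool, then keep
# learning against the refreshed pool (fictitious self-play).
for round_idx in range(100):
    model.learn(total_timesteps=100_000, reset_num_timesteps=False)  # 100 x 100k = 10M steps
    env.add_opponent_snapshot(model)  # hypothetical helper on the custom env
```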
Observation Space (8 features)
- Own paddle: position (x, y), velocity (dx, dy)
- Puck: position (x, y), velocity (dx, dy)
- No opponent tracking, which prevents self-play from converging to a purely defensive Nash equilibrium (observation construction sketched below)
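A minimal sketch of how the 8-feature observation could be assembled, assuming a hypothetical `state` object exposing paddle and puck kinematics; the point is that opponent features are never part of the vector.

```python
# Hedged sketch of the 8-feature observation; `state` is a hypothetical
# container for paddle and puck kinematics. Opponent state is deliberately omitted.
import numpy as np
from gymnasium import spaces

observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(8,), dtype=np.float32)

def build_observation(state) -> np.ndarray:
    return np.array([
        state.paddle_x, state.paddle_y,    # own paddle position
        state.paddle_dx, state.paddle_dy,  # own paddle velocity
        state.puck_x, state.puck_y,        # puck position
        state.puck_dx, state.puck_dy,      # puck velocity
        # opponent position/velocity intentionally excluded
    ], dtype=np.float32)
```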
Results
- 54% win rate against a random opponent
- 0.54 goals per game (a 2.25x improvement over baseline)
- Trained with 10 parallel environments and a 20-opponent fictitious self-play pool (pool sketch below)
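One way the 20-opponent pool could be maintained is sketched below. `OpponentPool` is hypothetical; it assumes policies can be snapshotted with `copy.deepcopy` and queried for actions, as is the case for stable-baselines3 policy networks.

```python
# Hedged sketch of a fixed-size fictitious self-play opponent pool.
import copy
import random

class OpponentPool:
    def __init__(self, max_size: int = 20):
        self.max_size = max_size
        self.policies = []

    def add(self, model) -> None:
        # Freeze a snapshot of the current policy; drop the oldest when full.
        self.policies.append(copy.deepcopy(model.policy))
        if len(self.policies) > self.max_size:
            self.policies.pop(0)

    def sample(self):
        # Uniform sampling over past snapshots approximates fictitious play.
        return random.choice(self.policies)
```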