Trained with Proximal Policy Optimization
This agent learned to play Snake through 37.1 million timesteps of reinforcement learning on a 20×20 grid. The model achieves a best score of 71 cells (17.8% of the 400-cell grid) and averages 32 cells across continuous play.
Training was conducted directly on the target grid size without curriculum learning, using 8 parallel environments and PPO with entropy regularization to encourage exploration.
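A minimal sketch of that setup using stable-baselines3, assuming a Gymnasium-compatible `SnakeEnv` class and an entropy coefficient of 0.01 (both assumptions; the repository's actual environment name and hyperparameters may differ):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# from snake_env import SnakeEnv  # hypothetical 20x20 Snake environment

# Run 8 environment copies in parallel, matching the setup described above.
vec_env = make_vec_env(SnakeEnv, n_envs=8)

model = PPO(
    "MlpPolicy",
    vec_env,
    ent_coef=0.01,   # entropy regularization to encourage exploration (value assumed)
    verbose=1,
)

# Train directly on the target grid size, with no curriculum stages.
model.learn(total_timesteps=37_100_000)
model.save("ppo_snake_20x20")
```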
The agent demonstrates learned strategies including food-seeking behavior, obstacle avoidance, and basic space management. Performance continues to improve with extended training, approaching the theoretical ceiling for feature-based representations.
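The exact observation layout is not specified here; a common feature-based representation for Snake looks roughly like the hypothetical sketch below. Because such a vector summarizes the board (local danger flags plus food direction) rather than encoding the full body layout, it bounds how much long-range planning the policy can learn, which is the ceiling referred to above.

```python
import numpy as np

def build_observation(snake, food, grid_size=20):
    """Hypothetical compact feature vector: local danger flags and food direction."""
    head_x, head_y = snake[0]
    food_x, food_y = food
    body = set(snake[1:])

    def is_danger(x, y):
        # Collision with a wall or with the snake's own body.
        return not (0 <= x < grid_size and 0 <= y < grid_size) or (x, y) in body

    return np.array([
        is_danger(head_x + 1, head_y),   # danger to the right
        is_danger(head_x - 1, head_y),   # danger to the left
        is_danger(head_x, head_y + 1),   # danger below
        is_danger(head_x, head_y - 1),   # danger above
        food_x > head_x,                 # food to the right
        food_x < head_x,                 # food to the left
        food_y > head_y,                 # food below
        food_y < head_y,                 # food above
    ], dtype=np.float32)
```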