This AI uses Proximal Policy Optimization (PPO), a deep reinforcement learning algorithm that learns through trial and error. PPO is more stable than traditional policy gradient methods because it limits how much the AI's policy can change in each training step, so a single oversized update cannot wipe out behavior the policy has already learned.
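To make the "limited change per step" idea concrete, here is a minimal sketch of PPO's clipped surrogate objective in plain Python. It is an illustration of the general technique, not this project's actual training code: the function name `ppo_clipped_objective` is hypothetical, and the clip range of 0.2 is a commonly used default assumed here, not a value taken from this AI.

```python
import math

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity PPO tries to maximize).

    new_log_probs / old_log_probs: log-probability of each taken action under
    the updated policy and under the policy that collected the data.
    advantages: how much better each action was than expected.
    clip_eps: assumed clip range; 0.2 is a common default, not project-specific.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the new and old policy for this action.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the two estimates, which removes any
        # incentive to move the policy far from the old one in a single step.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

For example, if the new policy makes a good action twice as likely (ratio 2, advantage 1), the objective credits it only as if the ratio were 1.2, so pushing the probability further yields no extra gain in that update.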
The AI learns by playing against itself. Both the top and bottom paddles are controlled by the same neural network, but each paddle sees the game from its own side and learns from its own outcomes. This self-play approach lets the AI discover and adapt to increasingly sophisticated strategies, because its opponent improves at the same pace it does.
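One common way to let a single network control both paddles is to mirror the game state for one side, so that each paddle always sees itself in the same position. The sketch below shows that idea under stated assumptions: the observation fields, the normalized field height of 1.0, and the `observe` function are all hypothetical and chosen for illustration; the actual inputs this AI uses may differ.

```python
from typing import NamedTuple

FIELD_HEIGHT = 1.0  # assumption: positions are normalized to the range [0, 1]

class Observation(NamedTuple):
    """Hypothetical network input, always from the acting paddle's viewpoint."""
    ball_x: float
    ball_y: float
    ball_vx: float
    ball_vy: float
    own_paddle_x: float
    opponent_paddle_x: float

def observe(ball_x, ball_y, ball_vx, ball_vy, top_x, bottom_x, side):
    """Build the observation for one paddle ("top" or "bottom").

    The bottom paddle sees the raw state. For the top paddle the vertical
    axis is flipped, so both paddles can share the same network weights.
    """
    if side == "bottom":
        return Observation(ball_x, ball_y, ball_vx, ball_vy, bottom_x, top_x)
    if side == "top":
        return Observation(
            ball_x,
            FIELD_HEIGHT - ball_y,  # flip vertical position
            ball_vx,
            -ball_vy,               # flip vertical velocity
            top_x,
            bottom_x,
        )
    raise ValueError(f"unknown side: {side!r}")
```

With this kind of mirroring, every point played produces two sets of experience for the same network, one from each side of the court.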
The AI learns in three stages:
The AI receives rewards for:
It receives penalties for:
The toggle button switches between:
The AI typically needs about 20,000 training steps to develop basic gameplay skills. During this time, it progresses from random movements to purposeful hits, and eventually to strategic gameplay.