A neural network learns to swing up and balance two chaotic pendulums
Both pendulums start hanging downward. The agent controls only the cart, pushing it left and right to pump energy into the system. Within about 1.5 seconds it swings both links to vertical, recenters the cart, and holds them there for the remaining 8.5 seconds of each 10-second episode. You can tap the left and right arrow keys to shove the cart and watch the policy recover.
The base policy was trained with SAC (Soft Actor-Critic) using curriculum learning. The shipped weights were then refined with a closed-loop distillation pass that teaches the same 256×256 actor (two hidden layers of 256 units each) to recenter the cart after swing-up, so no runtime stabilization hack is needed.
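For readers unfamiliar with the "256×256" shorthand, here is a minimal sketch of what such an actor's forward pass might look like. It is not the project's code. The observation layout (`OBS_DIM = 6`), the ReLU activations, the random placeholder weights, and all function names are assumptions made for illustration. The tanh squashing of the mean output is the usual way a SAC actor is run deterministically at evaluation time.

```python
import math
import random

OBS_DIM = 6    # assumed: cart position and velocity, two link angles, two angular velocities
HIDDEN = 256   # from the text: a 256x256 actor
ACT_DIM = 1    # a single horizontal force on the cart


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases; a stand-in for trained weights."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Affine map: one dot product plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_actor(seed=0):
    """Three layers: observation -> 256 -> 256 -> action mean."""
    rng = random.Random(seed)
    return [
        init_layer(OBS_DIM, HIDDEN, rng),
        init_layer(HIDDEN, HIDDEN, rng),
        init_layer(HIDDEN, ACT_DIM, rng),
    ]


def act(actor, obs):
    """Deterministic action: tanh of the mean head, so each value lies in [-1, 1]."""
    h = relu(linear(obs, actor[0]))
    h = relu(linear(h, actor[1]))
    mean = linear(h, actor[2])
    return [math.tanh(m) for m in mean]


actor = make_actor()
action = act(actor, [0.0, 0.0, math.pi, 0.0, math.pi, 0.0])  # both links hanging down (assumed convention)
```

In a real controller the normalized action would be scaled to the cart's force limit and applied once per simulation step. Because recentering is learned by the network itself, the control loop would need nothing beyond this forward pass.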