Stick Balancing Reinforcement Learning

Learning Algorithm

The reinforcement learning agent uses a Policy Gradient method to learn the policy for balancing the stick directly. Unlike value-based methods such as DQN, which learn action values and derive a policy from them, policy gradient methods adjust the probabilities of taking each action in each state. The network outputs action probabilities and samples actions from this distribution during training, which provides natural exploration without an explicit exploration parameter such as the epsilon in epsilon-greedy.
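
The code below is a minimal sketch of this idea, not the exact network used in this project: it assumes a PyTorch implementation, a 4-dimensional cart/pole state, two discrete actions, and a single hidden layer of 32 units, all of which are illustrative choices.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Map a cart/pole state to a probability for each action (assumed sizes)."""
    def __init__(self, state_dim=4, hidden_dim=32, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
            nn.Softmax(dim=-1),  # turn raw scores into action probabilities
        )

    def forward(self, state):
        return self.net(state)

def select_action(policy, state):
    """Sample from the predicted distribution; sampling itself provides exploration."""
    probs = policy(torch.as_tensor(state, dtype=torch.float32))
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)
```

Because actions are sampled rather than chosen greedily, the agent naturally tries lower-probability actions early in training and settles down as the distribution sharpens.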

Metrics Explanation

Episode Reward: Represents the cumulative reward obtained in a single episode. A higher reward indicates better performance in balancing the stick. The reward structure encourages keeping the pole upright (+1 for each step, scaled by pole angle) while heavily penalizing failures (-10).
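
As a rough illustration of that structure, the function below computes a per-step reward; the angle threshold and the exact form of the angle scaling are assumptions, since only the +1 per step (scaled by pole angle) and the -10 failure penalty are described above.

```python
import math

def step_reward(pole_angle, failed, max_angle=math.radians(12)):
    """Illustrative reward: +1 per surviving step, reduced as the pole tilts,
    and -10 when the episode ends in failure (the threshold is an assumed value)."""
    if failed:
        return -10.0
    # Scale the +1 step reward by how upright the pole is (1.0 when perfectly vertical).
    return 1.0 * (1.0 - abs(pole_angle) / max_angle)
```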

Learning Process

The agent learns by the following steps, sketched in code after the list:
1. Sampling actions based on their predicted probabilities
2. Collecting entire episodes of experience
3. Computing returns (discounted sum of rewards) for each action
4. Adjusting action probabilities: increasing probabilities of actions that led to good returns, and decreasing probabilities of actions that led to poor returns
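
A minimal sketch of steps 2 to 4, assuming the hypothetical PyTorch policy sketched earlier, a REINFORCE-style loss, and a discount factor of 0.99 (the actual value used here is not stated):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Step 3: discounted sum of future rewards for every time step of an episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return torch.tensor(returns)

def update_policy(optimizer, log_probs, returns):
    """Step 4: raise log-probabilities of actions with high returns, lower the rest."""
    loss = -(torch.stack(log_probs) * returns).sum()  # policy gradient surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```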

This process is similar to learning to ride a bike: try actions with certain probabilities, repeat the successful ones more often and the unsuccessful ones less often, and keep adjusting until the agent converges on good actions.

Policy Gradient Intuition

Policy Gradient methods, like the one used here, focus on optimizing the policy directly. The key idea is to adjust the action probabilities in a way that actions leading to higher returns become more likely. This is achieved by calculating the gradient of the expected return with respect to the policy parameters and updating the policy in the direction that increases the expected return. The use of a discount factor ensures that the agent considers both immediate and future rewards, balancing short-term gains with long-term success.
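
In the usual REINFORCE notation, which matches this description, the gradient being estimated is

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
\qquad
G_t \;=\; \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1},
$$

where $\pi_\theta$ is the policy with parameters $\theta$, $G_t$ is the discounted return from step $t$, and $\gamma$ is the discount factor.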

Advantage Normalization

Advantage Normalization is used to stabilize training by reducing the variance of the policy gradient updates. Instead of using raw returns, the agent computes advantages by subtracting a baseline (the mean return) from each return, so it focuses on actions that did better than average rather than on the absolute size of the reward. Normalizing the advantages also keeps the update magnitudes on a consistent scale from episode to episode, which helps the agent learn which actions are truly advantageous and makes it less likely to collapse into a suboptimal policy.
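
A sketch of this step, assuming that "normalizing" means the common convention of dividing the mean-subtracted returns by their standard deviation (the exact formula used in this project is not spelled out above):

```python
import torch

def normalize_advantages(returns, eps=1e-8):
    """Subtract the mean return as a baseline, then rescale to roughly unit variance."""
    advantages = returns - returns.mean()
    return advantages / (advantages.std() + eps)
```

The small epsilon guards against division by zero when every return in an episode happens to be identical.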