Neural Tic-Tac-Toe Agent
Expert-level PPO with action masking, deployed via TensorFlow.js
System Architecture
This implementation deploys a Proximal Policy Optimization (PPO) neural network for expert-level Tic-Tac-Toe play directly in the browser using TensorFlow.js. The agent was trained offline using PyTorch and Stable-Baselines3, then converted for browser deployment.
Board State Representation
The game state is represented as a 9-element array where each position contains:
- 1: Agent's piece (X)
- -1: Opponent's piece (O)
- 0: Empty cell
This compact representation is fed directly to the neural network, which has learned to extract meaningful patterns from the raw board values during training.
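As a concrete illustration, a board of 'X'/'O'/empty cells maps to this array as follows (the encodeBoard helper below is hypothetical, not taken from the repository):

```js
// Hypothetical helper (not from the repository): encode a board of
// 'X'/'O'/null cells into the 9-element numeric array, from the
// perspective of the agent, which plays X.
function encodeBoard(cells, agentMark = 'X') {
  return cells.map(cell => {
    if (cell === agentMark) return 1;  // agent's piece
    if (cell === null) return 0;       // empty cell
    return -1;                         // opponent's piece
  });
}

// Example: O in the top-left corner, X in the centre
console.log(encodeBoard(['O', null, null, null, 'X', null, null, null, null]));
// -> [-1, 0, 0, 0, 1, 0, 0, 0, 0]
```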
Action Masking Implementation
Invalid moves are prevented through action masking—a critical technique for board games. The network outputs logits for all 9 positions, but before action selection:
- Valid moves receive a mask value of 0
- Invalid moves receive -Infinity
- The mask is added to the logits before argmax selection
This ensures the model never selects occupied squares, so no training capacity is wasted on learning the game's legality rules.
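A minimal sketch of this masking step in plain JavaScript, assuming a 9-element board array and a 9-element logits array (function names are illustrative, not the actual agent.js code):

```js
// Minimal sketch of the masking step (names are illustrative, not agent.js).
// 'board' is the 9-element state, 'logits' the raw network outputs.
function maskedArgmax(logits, board) {
  const mask = board.map(v => (v === 0 ? 0 : -Infinity)); // 0 = legal move
  const masked = logits.map((logit, i) => logit + mask[i]);
  return masked.indexOf(Math.max(...masked)); // argmax never picks occupied squares
}
```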
Training Curriculum
During offline training, the agent learned from a diverse opponent pool.
Neural Network Architecture
The PPO policy network is a compact 3-layer MLP with dimensions 9 → 128 → 128 → 9:
- Input layer: 9D board state (raw values)
- Hidden layers: Two layers with ReLU activation, 128 units each
- Output layer: 9D action logits (one per board position)
The network processes the raw board state and outputs a probability distribution over possible moves, with action masking ensuring only legal moves are selected.
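For reference, an equivalent shape could be declared directly with the TensorFlow.js layers API; this is only a sketch of the architecture, since the deployed weights come from the converted PyTorch model:

```js
import * as tf from '@tensorflow/tfjs';

// Illustrative only: a TF.js definition with the same 9 -> 128 -> 128 -> 9 shape.
// The deployed model is trained in PyTorch/Stable-Baselines3 and then converted,
// so this sequential model sketches the architecture, not the real artifact.
const policy = tf.sequential({
  layers: [
    tf.layers.dense({ inputShape: [9], units: 128, activation: 'relu' }),
    tf.layers.dense({ units: 128, activation: 'relu' }),
    tf.layers.dense({ units: 9 })  // raw action logits, one per board position
  ]
});
```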
Reward Engineering
The reward structure shapes optimal behavior against perfect opponents:
- +1.0 for wins: maximum reward for victory
- +0.5 for draws: positive signal, since draws are optimal against perfect play
- -1.0 for losses: penalty for defeat
- -10.0 for invalid moves: strong penalty to discourage rule violations
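The shaping can be summarised as a simple mapping from episode outcome to scalar reward; the sketch below is illustrative, since the actual reward logic lives in the offline training environment:

```js
// Illustrative reward function mirroring the shaping described above.
// 'outcome' is an assumed label for the transition result; the real training
// environment lives on the PyTorch/Stable-Baselines3 side.
function reward(outcome) {
  switch (outcome) {
    case 'win':     return 1.0;   // maximum reward for victory
    case 'draw':    return 0.5;   // draws are optimal against perfect play
    case 'loss':    return -1.0;  // penalty for defeat
    case 'invalid': return -10.0; // strong penalty for rule violations
    default:        return 0.0;   // assumed: no shaping for non-terminal moves
  }
}
```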
Web Deployment
The trained PyTorch model is exported to TensorFlow.js format for browser inference:
- PyTorch → ONNX conversion preserves architecture
- ONNX → TensorFlow.js enables browser deployment
- Model loads asynchronously at page initialization
- Inference runs client-side with ~10ms latency
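A hedged sketch of the asynchronous load step, assuming the converted artifacts are served as a TF.js graph model at a hypothetical model/model.json path:

```js
import * as tf from '@tensorflow/tfjs';

let model = null;

// Load the converted model once at page initialization.
// 'model/model.json' is a hypothetical path to the TF.js artifacts.
async function initAgent() {
  model = await tf.loadGraphModel('model/model.json');
  // Optional warm-up so the first real move does not pay backend start-up cost.
  tf.tidy(() => model.predict(tf.zeros([1, 9])));
}
```

Depending on how the conversion was run, tf.loadLayersModel may be the appropriate loader rather than tf.loadGraphModel.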
Key Implementation Details
Action Selection (agent.js:26-44): The PPO model's inference pipeline:
- Network outputs raw logits for all 9 positions
- Invalid moves masked by adding -Infinity to their logits
- Argmax over the masked logits guarantees legal moves
- Fully deterministic (greedy) for deployed model
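Putting the steps together, a sketch of what this pipeline can look like with TensorFlow.js (the function name and exact tensor calls are assumptions, not the literal agent.js code):

```js
import * as tf from '@tensorflow/tfjs';

// Hypothetical end-to-end move selection combining the steps above.
// 'model' is the loaded TF.js model, 'board' the 9-element state array.
function selectMove(model, board) {
  return tf.tidy(() => {
    const input = tf.tensor2d([board], [1, 9]);       // batch of one board
    const logits = model.predict(input).dataSync();   // 9 raw logits
    const masked = Array.from(logits).map((v, i) =>
      board[i] === 0 ? v : -Infinity                  // mask occupied squares
    );
    return masked.indexOf(Math.max(...masked));       // greedy, always legal
  });
}
```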
Game State Management (game.js): Clean board representation and game logic:
- 9-element array tracks board state
- Efficient win checking across rows, columns, and diagonals
- Valid move detection for action masking
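A sketch of how such a representation supports win checking and valid-move detection with a fixed table of winning lines (illustrative, not the actual game.js source):

```js
// All 8 winning lines: 3 rows, 3 columns, 2 diagonals.
const LINES = [
  [0, 1, 2], [3, 4, 5], [6, 7, 8],  // rows
  [0, 3, 6], [1, 4, 7], [2, 5, 8],  // columns
  [0, 4, 8], [2, 4, 6]              // diagonals
];

// Returns 1 or -1 if that player has three in a row, otherwise 0.
function winner(board) {
  for (const [a, b, c] of LINES) {
    if (board[a] !== 0 && board[a] === board[b] && board[b] === board[c]) {
      return board[a];
    }
  }
  return 0;
}

// Indices of empty cells; this is what the action mask is built from.
function validMoves(board) {
  return board.map((v, i) => (v === 0 ? i : -1)).filter(i => i !== -1);
}
```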
Performance Metrics: