Neural Tic-Tac-Toe Agent

Expert-level PPO with action masking, deployed via TensorFlow.js

System Architecture

This implementation deploys a Proximal Policy Optimization (PPO) neural network for expert-level Tic-Tac-Toe play directly in the browser using TensorFlow.js. The agent was trained offline using PyTorch and Stable-Baselines3, then converted for browser deployment.

Board State Representation

The game state is represented as a 9-element array, one entry per board square, encoding whether each cell is empty, held by the agent, or held by the opponent.

This compact representation is fed directly to the neural network, which has learned to extract meaningful patterns from the raw board values during training.
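
The exact numeric values are an implementation detail; a minimal sketch of one common encoding (assumed here, not taken from the original source) looks like this:

    // Hypothetical encoding: 0 = empty, 1 = agent's mark, -1 = opponent's mark.
    const EMPTY = 0;
    const AGENT = 1;
    const OPPONENT = -1;

    // Example: the agent holds the centre, the opponent holds the top-left corner.
    const board = [
      OPPONENT, EMPTY, EMPTY,
      EMPTY,    AGENT, EMPTY,
      EMPTY,    EMPTY, EMPTY,
    ];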

Action Masking Implementation

Invalid moves are prevented through action masking, a critical technique for board games. The network outputs logits for all 9 positions, but before action selection the logits of already-occupied squares are replaced with a large negative value so that, after softmax, those moves receive effectively zero probability.

This ensures the model never selects occupied squares, so no training capacity is wasted on learning the game's rules.
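
A minimal sketch of the masking step in plain JavaScript, assuming the hypothetical 0 = empty encoding from above:

    // Push logits of occupied squares to a very large negative value so that,
    // after softmax, their probability is effectively zero.
    function maskLogits(logits, board) {
      return logits.map((logit, i) => (board[i] === 0 ? logit : -1e9));
    }

    // Softmax over the masked logits yields a distribution over legal moves only.
    function softmax(logits) {
      const max = Math.max(...logits);
      const exps = logits.map((x) => Math.exp(x - max));
      const sum = exps.reduce((a, b) => a + b, 0);
      return exps.map((x) => x / sum);
    }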

Training Curriculum

During offline training, the agent learned from a diverse opponent pool:

  - 70% perfect play (minimax): learned optimal defensive patterns
  - 30% random opponents: learned to exploit weak play
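
As an illustration only (the actual Stable-Baselines3 training loop is not shown in this document), the per-episode opponent choice can be as simple as:

    // Hypothetical per-episode opponent sampler matching the 70/30 curriculum mix.
    function sampleOpponent() {
      return Math.random() < 0.7 ? 'minimax' : 'random';
    }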

Neural Network Architecture

The PPO policy network is a compact 3-layer MLP with dimensions 9 → 128 → 128 → 9.

The network processes the raw board state and outputs a probability distribution over possible moves, with action masking ensuring only legal moves are selected.
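
A minimal sketch of an equivalently shaped network in TensorFlow.js layers (this is not the converted model itself, and the hidden activation is an assumption; tanh is the Stable-Baselines3 default for MLP policies):

    // Sketch of the 9 -> 128 -> 128 -> 9 policy MLP; assumes TensorFlow.js is
    // available as the global `tf` object (e.g. loaded via a script tag).
    const policy = tf.sequential({
      layers: [
        tf.layers.dense({ inputShape: [9], units: 128, activation: 'tanh' }),
        tf.layers.dense({ units: 128, activation: 'tanh' }),
        tf.layers.dense({ units: 9 }), // raw logits, one per board square
      ],
    });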

Reward Engineering

The reward structure shapes optimal behavior against perfect opponents, where a draw is the best result the agent can force.
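
The trained values are not reproduced here; purely as an illustration, a common terminal-reward scheme for this setting looks like the following (all numbers are assumptions):

    // Illustrative terminal rewards (assumed values, not the trained configuration):
    // winning is best, drawing is acceptable (the best possible against perfect play),
    // and losing is heavily penalised.
    function terminalReward(outcome) {
      switch (outcome) {
        case 'win':  return 1.0;
        case 'draw': return 0.5;
        case 'loss': return -1.0;
        default:     return 0.0; // non-terminal step
      }
    }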

Web Deployment

The trained PyTorch model is exported to TensorFlow.js format for browser inference:

  1. PyTorch → ONNX conversion preserves architecture
  2. ONNX → TensorFlow.js enables browser deployment
  3. Model loads asynchronously at page initialization
  4. Inference runs client-side with ~10ms latency
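
A sketch of step 3, assuming the converter produced a graph model at a hypothetical path such as model/model.json and that TensorFlow.js is available as the global tf object:

    // Load the ONNX -> TensorFlow.js converted graph model once at page initialisation.
    // The path is hypothetical; substitute the real converted-model location.
    let model = null;

    async function initModel() {
      model = await tf.loadGraphModel('model/model.json');
    }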

Key Implementation Details

Action Selection (agent.js:26-44): the PPO model's inference pipeline converts the board into a tensor, runs the network to obtain logits, masks occupied squares, and picks a move from the remaining legal options.
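
A minimal sketch of such a pipeline, assuming the loaded graph model from above and greedy (argmax) selection; the actual agent.js may differ in its details:

    // Board in -> logits out -> mask occupied squares -> argmax over legal moves.
    async function selectAction(model, board) {
      const logitsTensor = tf.tidy(() => model.predict(tf.tensor2d([board], [1, 9])));
      const logits = Array.from(await logitsTensor.data());
      logitsTensor.dispose();

      // Same masking idea as above: occupied squares get a huge negative logit.
      const masked = logits.map((logit, i) => (board[i] === 0 ? logit : -1e9));

      // Greedy selection; sampling from the masked softmax is an alternative.
      return masked.indexOf(Math.max(...masked));
    }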

Game State Management (game.js): clean board representation and core game logic, such as applying moves and detecting wins and draws.
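
A compact sketch of that kind of logic (the actual game.js is not reproduced here), using the hypothetical 0/1/-1 encoding from earlier:

    // All eight winning lines of the 3x3 board, as index triples.
    const WIN_LINES = [
      [0, 1, 2], [3, 4, 5], [6, 7, 8], // rows
      [0, 3, 6], [1, 4, 7], [2, 5, 8], // columns
      [0, 4, 8], [2, 4, 6],            // diagonals
    ];

    // Return a new board with the player's mark placed at the given index.
    function applyMove(board, index, player) {
      if (board[index] !== 0) throw new Error('Square already occupied');
      const next = board.slice();
      next[index] = player; // e.g. 1 for the agent, -1 for the human
      return next;
    }

    // Return 1 or -1 for a winner, 'draw' for a full board, or null if the game continues.
    function winner(board) {
      for (const [a, b, c] of WIN_LINES) {
        if (board[a] !== 0 && board[a] === board[b] && board[b] === board[c]) {
          return board[a];
        }
      }
      return board.every((v) => v !== 0) ? 'draw' : null;
    }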

Performance Metrics:

  - 100%: never loses to perfect play (achieved during training evaluation)
  - ~10ms: browser inference time (real-time response on CPU)

Source Code