Neural Tic-Tac-Toe Agent
Expert-level PPO with action masking, deployed via TensorFlow.js
System Architecture
This implementation deploys a Proximal Policy Optimization (PPO) neural network for expert-level Tic-Tac-Toe play directly in the browser using TensorFlow.js. The agent was trained offline using PyTorch and Stable-Baselines3, then converted for browser deployment.
Board State Representation
The game state is represented as a 9-element array where each position contains:
- 1: Agent's piece (X)
- -1: Opponent's piece (O)
- 0: Empty cell
This compact representation is fed directly to the neural network, which has learned to extract meaningful patterns from the raw board values during training.
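As a concrete illustration, a board of 'X'/'O'/empty cells maps to this array as follows (the encodeBoard helper below is hypothetical, not taken from the repository):

```js
// Hypothetical helper (not from the repository): encode a board of
// 'X'/'O'/null cells into the 9-element numeric array, from the
// perspective of the agent, which plays X.
function encodeBoard(cells, agentMark = 'X') {
  return cells.map(cell => {
    if (cell === agentMark) return 1;  // agent's piece
    if (cell === null) return 0;       // empty cell
    return -1;                         // opponent's piece
  });
}

// Example: O in the top-left corner, X in the centre
console.log(encodeBoard(['O', null, null, null, 'X', null, null, null, null]));
// -> [-1, 0, 0, 0, 1, 0, 0, 0, 0]
```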
Action Masking Implementation
Invalid moves are prevented through action masking—a critical technique for board games. The network outputs logits for all 9 positions, but before action selection:
- Valid moves receive a mask value of 0
- Invalid moves receive -Infinity
- The mask is added to the logits before argmax selection
This ensures the model never selects occupied squares, so no training capacity is wasted on learning the game's legality rules.
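A minimal sketch of this masking step in plain JavaScript, assuming a 9-element board array and a 9-element logits array (function names are illustrative, not the actual agent.js code):

```js
// Minimal sketch of the masking step (names are illustrative, not agent.js).
// 'board' is the 9-element state, 'logits' the raw network outputs.
function maskedArgmax(logits, board) {
  const mask = board.map(v => (v === 0 ? 0 : -Infinity)); // 0 = legal move
  const masked = logits.map((logit, i) => logit + mask[i]);
  return masked.indexOf(Math.max(...masked)); // argmax never picks occupied squares
}
```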
Training Curriculum
During offline training, the agent learned from a diverse opponent pool.
Neural Network Architecture
The PPO policy network is a compact 3-layer MLP with dimensions 9 → 128 → 128 → 9:
- Input layer: 9D board state (raw values)
- Hidden layers: Two layers with ReLU activation, 128 units each
- Output layer: 9D action logits (one per board position)
The network processes the raw board state and outputs a probability distribution over possible moves, with action masking ensuring only legal moves are selected.
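For reference, an equivalent shape could be declared directly with the TensorFlow.js layers API; this is only a sketch of the architecture, since the deployed weights come from the converted PyTorch model:

```js
import * as tf from '@tensorflow/tfjs';

// Illustrative only: a TF.js definition with the same 9 -> 128 -> 128 -> 9 shape.
// The deployed model is trained in PyTorch/Stable-Baselines3 and then converted,
// so this sequential model sketches the architecture, not the real artifact.
const policy = tf.sequential({
  layers: [
    tf.layers.dense({ inputShape: [9], units: 128, activation: 'relu' }),
    tf.layers.dense({ units: 128, activation: 'relu' }),
    tf.layers.dense({ units: 9 })  // raw action logits, one per board position
  ]
});
```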
Reward Engineering
The reward structure shapes optimal behavior against perfect opponents:
- +1.0 for wins: maximum reward for victory
- +0.5 for draws: positive signal, since draws are optimal against perfect play
- -1.0 for losses: penalty for defeat
- -10.0 for invalid moves: strong penalty to discourage rule violations
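The shaping can be summarised as a simple mapping from episode outcome to scalar reward; the sketch below is illustrative, since the actual reward logic lives in the offline training environment:

```js
// Illustrative reward function mirroring the shaping described above.
// 'outcome' is an assumed label for the transition result; the real training
// environment lives on the PyTorch/Stable-Baselines3 side.
function reward(outcome) {
  switch (outcome) {
    case 'win':     return 1.0;   // maximum reward for victory
    case 'draw':    return 0.5;   // draws are optimal against perfect play
    case 'loss':    return -1.0;  // penalty for defeat
    case 'invalid': return -10.0; // strong penalty for rule violations
    default:        return 0.0;   // assumed: no shaping for non-terminal moves
  }
}
```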
Web Deployment
The trained PyTorch model is exported to TensorFlow.js format for browser inference:
- PyTorch → ONNX conversion preserves architecture
- ONNX → TensorFlow.js enables browser deployment
- Model loads asynchronously at page initialization
- Inference runs client-side with ~10ms latency
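A hedged sketch of the asynchronous load step, assuming the converted artifacts are served as a TF.js graph model at a hypothetical model/model.json path:

```js
import * as tf from '@tensorflow/tfjs';

let model = null;

// Load the converted model once at page initialization.
// 'model/model.json' is a hypothetical path to the TF.js artifacts.
async function initAgent() {
  model = await tf.loadGraphModel('model/model.json');
  // Optional warm-up so the first real move does not pay backend start-up cost.
  tf.tidy(() => model.predict(tf.zeros([1, 9])));
}
```

Depending on how the conversion was run, tf.loadLayersModel may be the appropriate loader rather than tf.loadGraphModel.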
Key Implementation Details
Action Selection (agent.js:26-44): The PPO model's inference pipeline:
- Network outputs raw logits for all 9 positions
- Invalid moves masked by adding -Infinity to their logits
- Argmax over the masked logits guarantees legal moves
- Fully deterministic (greedy) for deployed model
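Putting the steps together, a sketch of what this pipeline can look like with TensorFlow.js (the function name and exact tensor calls are assumptions, not the literal agent.js code):

```js
import * as tf from '@tensorflow/tfjs';

// Hypothetical end-to-end move selection combining the steps above.
// 'model' is the loaded TF.js model, 'board' the 9-element state array.
function selectMove(model, board) {
  return tf.tidy(() => {
    const input = tf.tensor2d([board], [1, 9]);       // batch of one board
    const logits = model.predict(input).dataSync();   // 9 raw logits
    const masked = Array.from(logits).map((v, i) =>
      board[i] === 0 ? v : -Infinity                  // mask occupied squares
    );
    return masked.indexOf(Math.max(...masked));       // greedy, always legal
  });
}
```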
Game State Management (game.js): Clean board representation and game logic:
- 9-element array tracks board state
- Efficient win checking across rows, columns, and diagonals
- Valid move detection for action masking
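A sketch of how such a representation supports win checking and valid-move detection with a fixed table of winning lines (illustrative, not the actual game.js source):

```js
// All 8 winning lines: 3 rows, 3 columns, 2 diagonals.
const LINES = [
  [0, 1, 2], [3, 4, 5], [6, 7, 8],  // rows
  [0, 3, 6], [1, 4, 7], [2, 5, 8],  // columns
  [0, 4, 8], [2, 4, 6]              // diagonals
];

// Returns 1 or -1 if that player has three in a row, otherwise 0.
function winner(board) {
  for (const [a, b, c] of LINES) {
    if (board[a] !== 0 && board[a] === board[b] && board[b] === board[c]) {
      return board[a];
    }
  }
  return 0;
}

// Indices of empty cells; this is what the action mask is built from.
function validMoves(board) {
  return board.map((v, i) => (v === 0 ? i : -1)).filter(i => i !== -1);
}
```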
Performance Metrics: