Technical Details
This is a character-level GPT (Generative Pre-trained Transformer) implementation in TensorFlow.js, trained on Shakespeare's works.
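Because the model works at the character level, the "tokenizer" is just a pair of lookup tables between characters and integer ids. Below is a minimal sketch of that idea; the variable names are illustrative, not the project's actual code:

```js
// Minimal sketch of character-level tokenization; identifiers are illustrative,
// not the project's actual variable names.
const text = 'To be, or not to be';             // stand-in for the ~1M-character corpus
const chars = Array.from(new Set(text)).sort(); // unique characters form the vocabulary
const charToIndex = Object.fromEntries(chars.map((c, i) => [c, i]));
const indexToChar = Object.fromEntries(chars.map((c, i) => [i, c]));

const encode = (s) => Array.from(s).map((c) => charToIndex[c]);
const decode = (ids) => ids.map((i) => indexToChar[i]).join('');

console.log(decode(encode('to be'))); // 'to be'
```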
Model Architecture
- Type: Decoder-only Transformer (like GPT)
- Parameters: ~421K (kept small for fast in-browser inference)
- Embedding dimension: 128
- Attention heads: 4
- Transformer layers: 3
- Sequence length: 64 characters
- Vocabulary size: 63 (unique characters in Shakespeare)
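One way to picture the settings above is as a single configuration object. The following is only a sketch with assumed field names, not the project's actual code:

```js
// Hyperparameters from the list above, gathered into one (illustrative) config object.
const config = {
  vocabSize: 63,     // unique characters in the training text
  blockSize: 64,     // sequence length / context window, in characters
  embedDim: 128,     // embedding dimension
  numHeads: 4,       // attention heads per transformer layer
  numLayers: 3,      // transformer (decoder) blocks
  ffnExpansion: 4,   // feed-forward hidden size = 4 x 128 = 512
  dropoutRate: 0.1,  // 10% dropout
};
```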
Key Components
- Character Embeddings: Map each character to a 128-dimensional vector
- Positional Embeddings: Learned position encodings for sequence order
- Multi-Head Self-Attention: Custom implementation with causal masking so a position cannot attend to later positions (see the sketch after this list)
- Feed-Forward Network: 2-layer MLP with 4x expansion and ReLU activation
- Layer Normalization: Applied before each sub-layer (pre-norm architecture)
- Residual Connections: Around both attention and FFN blocks
- Dropout: 10% dropout rate for regularization
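The attention component is the one piece TensorFlow.js does not provide as a built-in layer. The sketch below shows one way to compute causal multi-head self-attention with plain tensor ops; the weight matrices `wq`, `wk`, `wv`, `wo` (assumed to be `[embedDim, embedDim]`) and the overall function shape are illustrative assumptions, not the project's exact custom layer:

```js
import * as tf from '@tensorflow/tfjs';

// Causal multi-head self-attention as a plain function over tensors.
// wq, wk, wv, wo are assumed to be [embedDim, embedDim] weight matrices.
function causalSelfAttention(x, wq, wk, wv, wo, numHeads) {
  return tf.tidy(() => {
    const [batch, seqLen, embedDim] = x.shape;
    const headDim = embedDim / numHeads;

    // Project to queries/keys/values and split into heads:
    // [batch, seqLen, embedDim] -> [batch, numHeads, seqLen, headDim]
    const project = (w) =>
      tf.matMul(x.reshape([batch * seqLen, embedDim]), w)
        .reshape([batch, seqLen, numHeads, headDim])
        .transpose([0, 2, 1, 3]);
    const q = project(wq);
    const k = project(wk);
    const v = project(wv);

    // Scaled dot-product scores: [batch, numHeads, seqLen, seqLen]
    const scores = tf.matMul(q, k, false, true).div(Math.sqrt(headDim));

    // Causal mask: each position may attend only to itself and earlier positions.
    const mask = tf.linalg.bandPart(tf.ones([seqLen, seqLen]), -1, 0);
    const masked = scores.add(mask.sub(1).mul(1e9)); // adds -1e9 where the mask is 0

    const weights = tf.softmax(masked);              // softmax over the last axis
    const context = tf.matMul(weights, v)            // [batch, numHeads, seqLen, headDim]
      .transpose([0, 2, 1, 3])
      .reshape([batch * seqLen, embedDim]);

    // Final output projection back to [batch, seqLen, embedDim]
    return tf.matMul(context, wo).reshape([batch, seqLen, embedDim]);
  });
}
```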
Training Details
- Optimizer: Adam with an adaptive learning rate and transformer-typical beta values (see the training-step sketch after this list)
- Loss: Cross-entropy loss on next-character prediction
- Batch size: 16-32 sequences
- Training data: ~1M characters of Shakespeare text
- Pre-trained epochs: 150+ with progressive curriculum learning
- Best achieved loss: ~1.35 (significant improvement from initial 3.4+)
- Incremental training: 100 epochs per button click
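A single training step along these lines might look as follows. The learning rate and betas shown are typical transformer choices rather than this project's exact values, and `model` is a placeholder for the compiled TF.js model:

```js
import * as tf from '@tensorflow/tfjs';

// One training step: next-character cross-entropy minimized with Adam.
// Learning rate and betas are typical choices, not necessarily this project's.
const optimizer = tf.train.adam(3e-4, 0.9, 0.95);

function trainStep(model, xs, ys, vocabSize) {
  // xs: [batch, seqLen] int32 character ids; ys: the same sequence shifted left by one.
  const lossScalar = optimizer.minimize(() => {
    const logits = model.apply(xs);                          // [batch, seqLen, vocabSize]
    const flatLogits = logits.reshape([-1, vocabSize]);
    const flatLabels = tf.oneHot(ys.flatten().cast('int32'), vocabSize);
    return tf.losses.softmaxCrossEntropy(flatLabels, flatLogits);
  }, /* returnCost */ true);
  const loss = lossScalar.dataSync()[0];
  lossScalar.dispose();
  return loss;
}
```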
Implementation Notes
This implementation demonstrates several key concepts:
- Custom TensorFlow.js layers for multi-head attention (since TF.js lacks built-in transformer layers)
- Causal masking to ensure autoregressive generation
- Efficient tensor operations wrapped in tf.tidy() for memory management (sketched below)
- Browser-based GPU acceleration via WebGL backend
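The memory-management and backend points deserve a concrete sketch, since WebGL tensors live in GPU textures and are not garbage-collected. Everything below uses standard TensorFlow.js calls; the shapes are illustrative:

```js
import * as tf from '@tensorflow/tfjs';

// tf.tidy() disposes every tensor created inside its callback except the one
// it returns, which prevents WebGL texture leaks during training and inference.
await tf.setBackend('webgl');          // requests GPU acceleration; resolves to false if WebGL is unavailable
console.log('Backend:', tf.getBackend());

const result = tf.tidy(() => {
  const a = tf.randomNormal([64, 128]);
  const b = tf.randomNormal([128, 128]);
  return tf.matMul(a, b);              // only this tensor survives the tidy scope
});

console.log('Live tensors:', tf.memory().numTensors); // a and b are already disposed
result.dispose();                                      // free the result when done
```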
Current Performance
The improved model has achieved:
- Training loss: ~1.35 (down from initial ~3.4)
- Text quality: Generates recognizable words and basic grammatical structure
- Vocabulary usage: Shows proper word boundaries and common English patterns
- Sample outputs demonstrate emerging coherence with words like "the", "to", "be", "in", "what"
- The model continues to improve with each additional training session
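For context on how sample outputs like those above are produced, here is one common way to sample characters autoregressively (temperature sampling with tf.multinomial). `model`, `encode`, `decode`, `blockSize`, and the default temperature are placeholders and may differ from the project's generation code:

```js
import * as tf from '@tensorflow/tfjs';

// Autoregressive character sampling with a temperature; `model`, `encode`,
// `decode`, and `blockSize` are placeholders for the project's own pieces.
async function generate(model, prompt, numChars, blockSize, temperature = 0.8) {
  const ids = encode(prompt);
  for (let i = 0; i < numChars; i++) {
    const sampled = tf.tidy(() => {
      const context = ids.slice(-blockSize);                    // keep at most blockSize ids
      const input = tf.tensor2d([context], [1, context.length], 'int32');
      const logits = model.apply(input);                        // [1, seqLen, vocabSize]
      const last = logits
        .slice([0, context.length - 1, 0], [1, 1, -1])          // logits for the final position
        .reshape([1, -1])
        .div(temperature);                                      // <1 sharpens, >1 flattens the distribution
      return tf.multinomial(last, 1);                           // sample one character id
    });
    ids.push((await sampled.data())[0]);
    sampled.dispose();
  }
  return decode(ids);
}
```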
Note: This is a minimal educational implementation. Production models like GPT-3 have billions of parameters and train on much larger datasets.