Technical Details
This is a character-level GPT (Generative Pre-trained Transformer) implementation in TensorFlow.js, trained on Shakespeare's works.
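Because the model works at the character level, the "tokenizer" is just a pair of lookup tables between characters and integer ids. Below is a minimal sketch of that idea; the variable names are illustrative, not the project's actual code:

```js
// Minimal sketch of character-level tokenization; identifiers are illustrative,
// not the project's actual variable names.
const text = 'To be, or not to be';             // stand-in for the ~1M-character corpus
const chars = Array.from(new Set(text)).sort(); // unique characters form the vocabulary
const charToIndex = Object.fromEntries(chars.map((c, i) => [c, i]));
const indexToChar = Object.fromEntries(chars.map((c, i) => [i, c]));

const encode = (s) => Array.from(s).map((c) => charToIndex[c]);
const decode = (ids) => ids.map((i) => indexToChar[i]).join('');

console.log(decode(encode('to be'))); // 'to be'
```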
Model Architecture
- Type: Decoder-only Transformer (like GPT)
- Parameters: ~421K (kept small for fast in-browser inference)
- Embedding dimension: 128
- Attention heads: 4
- Transformer layers: 3
- Sequence length: 64 characters
- Vocabulary size: 63 (unique characters in Shakespeare)
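One way to picture the settings above is as a single configuration object. The following is only a sketch with assumed field names, not the project's actual code:

```js
// Hyperparameters from the list above, gathered into one (illustrative) config object.
const config = {
  vocabSize: 63,     // unique characters in the training text
  blockSize: 64,     // sequence length / context window, in characters
  embedDim: 128,     // embedding dimension
  numHeads: 4,       // attention heads per transformer layer
  numLayers: 3,      // transformer (decoder) blocks
  ffnExpansion: 4,   // feed-forward hidden size = 4 x 128 = 512
  dropoutRate: 0.1,  // 10% dropout
};
```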
Key Components
- Character Embeddings: Map each character to a 128-dimensional vector
- Positional Embeddings: Learned position encodings for sequence order
- Multi-Head Self-Attention: Custom implementation with causal masking so a position cannot attend to later positions (see the sketch after this list)
- Feed-Forward Network: 2-layer MLP with 4x expansion and ReLU activation
- Layer Normalization: Applied before each sub-layer (pre-norm architecture)
- Residual Connections: Around both attention and FFN blocks
- Dropout: 10% dropout rate for regularization
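The attention component is the one piece TensorFlow.js does not provide as a built-in layer. The sketch below shows one way to compute causal multi-head self-attention with plain tensor ops; the weight matrices `wq`, `wk`, `wv`, `wo` (assumed to be `[embedDim, embedDim]`) and the overall function shape are illustrative assumptions, not the project's exact custom layer:

```js
import * as tf from '@tensorflow/tfjs';

// Causal multi-head self-attention as a plain function over tensors.
// wq, wk, wv, wo are assumed to be [embedDim, embedDim] weight matrices.
function causalSelfAttention(x, wq, wk, wv, wo, numHeads) {
  return tf.tidy(() => {
    const [batch, seqLen, embedDim] = x.shape;
    const headDim = embedDim / numHeads;

    // Project to queries/keys/values and split into heads:
    // [batch, seqLen, embedDim] -> [batch, numHeads, seqLen, headDim]
    const project = (w) =>
      tf.matMul(x.reshape([batch * seqLen, embedDim]), w)
        .reshape([batch, seqLen, numHeads, headDim])
        .transpose([0, 2, 1, 3]);
    const q = project(wq);
    const k = project(wk);
    const v = project(wv);

    // Scaled dot-product scores: [batch, numHeads, seqLen, seqLen]
    const scores = tf.matMul(q, k, false, true).div(Math.sqrt(headDim));

    // Causal mask: each position may attend only to itself and earlier positions.
    const mask = tf.linalg.bandPart(tf.ones([seqLen, seqLen]), -1, 0);
    const masked = scores.add(mask.sub(1).mul(1e9)); // adds -1e9 where the mask is 0

    const weights = tf.softmax(masked);              // softmax over the last axis
    const context = tf.matMul(weights, v)            // [batch, numHeads, seqLen, headDim]
      .transpose([0, 2, 1, 3])
      .reshape([batch * seqLen, embedDim]);

    // Final output projection back to [batch, seqLen, embedDim]
    return tf.matMul(context, wo).reshape([batch, seqLen, embedDim]);
  });
}
```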
Training Details
- Optimizer: Adam with an adaptive learning rate and transformer-typical beta values (see the training-step sketch after this list)
- Loss: Cross-entropy loss on next-character prediction
- Batch size: 16-32 sequences
- Training data: ~1M characters of Shakespeare text
- Pre-trained epochs: 150+ with progressive curriculum learning
- Best achieved loss: ~1.35 (significant improvement from initial 3.4+)
- Incremental training: 100 epochs per button click
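A single training step along these lines might look as follows. The learning rate and betas shown are typical transformer choices rather than this project's exact values, and `model` is a placeholder for the compiled TF.js model:

```js
import * as tf from '@tensorflow/tfjs';

// One training step: next-character cross-entropy minimized with Adam.
// Learning rate and betas are typical choices, not necessarily this project's.
const optimizer = tf.train.adam(3e-4, 0.9, 0.95);

function trainStep(model, xs, ys, vocabSize) {
  // xs: [batch, seqLen] int32 character ids; ys: the same sequence shifted left by one.
  const lossScalar = optimizer.minimize(() => {
    const logits = model.apply(xs);                          // [batch, seqLen, vocabSize]
    const flatLogits = logits.reshape([-1, vocabSize]);
    const flatLabels = tf.oneHot(ys.flatten().cast('int32'), vocabSize);
    return tf.losses.softmaxCrossEntropy(flatLabels, flatLogits);
  }, /* returnCost */ true);
  const loss = lossScalar.dataSync()[0];
  lossScalar.dispose();
  return loss;
}
```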
Implementation Notes
This implementation demonstrates several key concepts:
- Custom TensorFlow.js layers for multi-head attention (since TF.js lacks built-in transformer layers)
- Causal masking to ensure autoregressive generation
- Efficient tensor operations wrapped in tf.tidy() for memory management (sketched below)
- Browser-based GPU acceleration via WebGL backend
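The memory-management and backend points deserve a concrete sketch, since WebGL tensors live in GPU textures and are not garbage-collected. Everything below uses standard TensorFlow.js calls; the shapes are illustrative:

```js
import * as tf from '@tensorflow/tfjs';

// tf.tidy() disposes every tensor created inside its callback except the one
// it returns, which prevents WebGL texture leaks during training and inference.
await tf.setBackend('webgl');          // requests GPU acceleration; resolves to false if WebGL is unavailable
console.log('Backend:', tf.getBackend());

const result = tf.tidy(() => {
  const a = tf.randomNormal([64, 128]);
  const b = tf.randomNormal([128, 128]);
  return tf.matMul(a, b);              // only this tensor survives the tidy scope
});

console.log('Live tensors:', tf.memory().numTensors); // a and b are already disposed
result.dispose();                                      // free the result when done
```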
Current Performance
The improved model has achieved:
- Training loss: ~1.35 (down from initial ~3.4)
- Text quality: Generates recognizable words and basic grammatical structure
- Vocabulary usage: Shows proper word boundaries and common English patterns
- Sample outputs demonstrate emerging coherence with words like "the", "to", "be", "in", "what"
- The model continues to improve with each additional training session
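For context on how sample outputs like those above are produced, here is one common way to sample characters autoregressively (temperature sampling with tf.multinomial). `model`, `encode`, `decode`, `blockSize`, and the default temperature are placeholders and may differ from the project's generation code:

```js
import * as tf from '@tensorflow/tfjs';

// Autoregressive character sampling with a temperature; `model`, `encode`,
// `decode`, and `blockSize` are placeholders for the project's own pieces.
async function generate(model, prompt, numChars, blockSize, temperature = 0.8) {
  const ids = encode(prompt);
  for (let i = 0; i < numChars; i++) {
    const sampled = tf.tidy(() => {
      const context = ids.slice(-blockSize);                    // keep at most blockSize ids
      const input = tf.tensor2d([context], [1, context.length], 'int32');
      const logits = model.apply(input);                        // [1, seqLen, vocabSize]
      const last = logits
        .slice([0, context.length - 1, 0], [1, 1, -1])          // logits for the final position
        .reshape([1, -1])
        .div(temperature);                                      // <1 sharpens, >1 flattens the distribution
      return tf.multinomial(last, 1);                           // sample one character id
    });
    ids.push((await sampled.data())[0]);
    sampled.dispose();
  }
  return decode(ids);
}
```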
Note: This is a minimal educational implementation. Production models like GPT-3 have billions of parameters and train on much larger datasets.