Training¶
How to train MicroGPT on any text dataset.
Basic Training¶
```python
from strands_microgpt import MicroGPT

model, tokenizer, docs = MicroGPT.from_dataset()
losses = model.train_on_docs(docs, tokenizer, num_steps=1000)
print(f"Final loss: {losses[-1]:.4f}")
```
What Happens During Training¶
Each step:
- Pick a document — cycle through dataset
- Tokenize — BOS + chars + BOS (the same BOS token delimits both the start and end of a document)
- Forward pass — compute logits for each position
- Loss — cross-entropy between predicted and actual next character
- Backward — autograd computes all gradients
- Adam update — update weights with adaptive learning rate
```mermaid
graph LR
    D["📄 Document"] --> T["🔤 Tokenize"]
    T --> F["🔄 Forward"]
    F --> L["📉 Loss"]
    L --> B["⬅️ Backward"]
    B --> U["🔧 Adam Update"]
    U --> F
```
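The loss step above can be sketched in a few lines of pure Python. This is a toy stand-alone function for illustration, not the library's API: softmax the logits for one position, then take the negative log-probability of the actual next character.

```python
import math

def cross_entropy(logits, target):
    # Softmax over the vocabulary, then negative log-likelihood
    # of the target index (the actual next character).
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return -math.log(exps[target] / total)

# Toy example: 3-character vocabulary, model scores each candidate
logits = [2.0, 0.5, -1.0]   # raw scores for each vocab entry
target = 0                  # index of the actual next character
loss = cross_entropy(logits, target)
```

A uniform model would score `ln(vocab_size)` at every position; anything lower means the model has learned something about which character comes next.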
Training Parameters¶
```python
losses = model.train_on_docs(
    docs,
    tokenizer,
    num_steps=1000,       # more steps → lower loss
    learning_rate=0.01,   # decays linearly to 0 over training
    log_every=100,        # print loss every 100 steps
    callback=my_fn,       # optional: called as callback(step, loss)
)
```
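A minimal callback sketch, using the `callback(step, loss)` signature above. The `history` list and the logging interval are illustrative choices, not part of the library:

```python
# Record the loss curve so it can be plotted or inspected later.
history = []

def my_fn(step, loss):
    # Called once per training step with the step index and scalar loss.
    history.append((step, loss))
    if step % 250 == 0:
        print(f"step {step}: loss {loss:.4f}")
```

Pass `callback=my_fn` to `train_on_docs` and `history` will hold one `(step, loss)` pair per step.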
Optimizer¶
Adam with:
- β₁ = 0.85, β₂ = 0.99
- ε = 1e-8
- Linear LR decay: `lr * (1 - step/total_steps)`
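The update can be sketched for a single scalar weight using the hyperparameters above; the real optimizer applies the same rule to every parameter, carrying per-parameter `m` and `v` state between steps:

```python
import math

def adam_step(w, g, m, v, step, total_steps,
              lr=0.01, beta1=0.85, beta2=0.99, eps=1e-8):
    t = step + 1                           # 1-indexed for bias correction
    lr_t = lr * (1 - step / total_steps)   # linear decay to 0
    m = beta1 * m + (1 - beta1) * g        # first-moment (mean) EMA of grads
    v = beta2 * v + (1 - beta2) * g * g    # second-moment (variance) EMA
    m_hat = m / (1 - beta1 ** t)           # correct EMA startup bias
    v_hat = v / (1 - beta2 ** t)
    w = w - lr_t * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

Dividing by `sqrt(v_hat)` is what makes the learning rate adaptive: parameters with consistently large gradients take proportionally smaller steps.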
Scaling Up¶
| Config | ~Params | Speed | Quality |
|---|---|---|---|
| `n_layer=1, n_embd=16` | 5K | Fast | Basic patterns |
| `n_layer=2, n_embd=32` | 30K | Moderate | Good names |
| `n_layer=4, n_embd=64` | 200K+ | Slow | Complex patterns |
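The ~Params column can be approximated with a standard transformer rule of thumb: each block contributes roughly `12 * n_embd²` weights (attention QKV + output projection plus a 4× MLP), and the embedding tables add a little more. The `vocab_size` and `block_size` defaults below are illustrative assumptions, not values taken from the library:

```python
def approx_params(n_layer, n_embd, vocab_size=27, block_size=16):
    # Rough parameter count for a small GPT:
    # ~12 * n_embd^2 weights per transformer block,
    # plus token and position embedding tables.
    blocks = 12 * n_layer * n_embd ** 2
    embeddings = (vocab_size + block_size) * n_embd
    return blocks + embeddings

print(approx_params(2, 32))  # roughly matches the 30K row above
```

Because the blocks term is quadratic in `n_embd`, doubling the embedding width roughly quadruples the parameter count.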
Pure Python
This is pure Python with no vectorization. Training with large configs takes time. That's the point — you can read every line and understand exactly what's happening.
→ Next: Custom Datasets | Tool Usage