Custom Datasets

Train on any text: names, poems, molecules, DNA, code, anything.


From a List

from strands_microgpt import MicroGPT, Tokenizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird sat on the word",
] * 100  # Repeat for more training data

tokenizer = Tokenizer.from_docs(docs)
model = MicroGPT(
    vocab_size=tokenizer.vocab_size,
    n_embd=32,
    block_size=32,
)

model.train_on_docs(docs, tokenizer, num_steps=2000)
samples = model.generate(tokenizer, num_samples=10, temperature=0.7)
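
To make the tokenizer step less of a black box, here is an illustration of what a character-level tokenizer built from a document list looks like. This is a sketch, not the `strands_microgpt` implementation; the `CharTokenizer` class and its `encode`/`decode` names are hypothetical.

```python
# Illustration only: a minimal character-level tokenizer, similar in
# spirit to what Tokenizer.from_docs builds from your documents.
class CharTokenizer:
    def __init__(self, docs):
        # Vocabulary = every distinct character across all documents.
        chars = sorted(set("".join(docs)))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self.vocab_size = len(chars)

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer(["the cat sat on the mat"])
print(tok.vocab_size)        # 10 distinct characters, including the space
print(tok.decode(tok.encode("cat")))  # round-trips back to "cat"
```

The key point: the vocabulary comes entirely from your data, which is why `vocab_size` is read off the tokenizer when constructing the model above.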

From a File

model, tokenizer, docs = MicroGPT.from_dataset(
    dataset_path="my_data.txt"  # One document per line
)
model.train_on_docs(docs, tokenizer, num_steps=1000)
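
Preparing the file itself only needs the standard library. A sketch of writing and re-reading a dataset in the expected one-document-per-line format (the file name mirrors the example above; skipping blank lines is an assumption about how `from_dataset` treats them):

```python
from pathlib import Path

# Write one document per line.
docs = ["alice", "bob", "charlie"]
Path("my_data.txt").write_text("\n".join(docs) + "\n", encoding="utf-8")

# Read it back, skipping any blank lines.
loaded = [
    line for line in Path("my_data.txt").read_text(encoding="utf-8").splitlines()
    if line.strip()
]
print(loaded)  # ['alice', 'bob', 'charlie']
```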

From a URL

model, tokenizer, docs = MicroGPT.from_dataset(
    dataset_url="https://example.com/poems.txt"
)
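
The URL path amounts to fetching the text and splitting it into one document per line. A sketch using only the standard library; the `split_docs` helper is hypothetical, not part of `strands_microgpt`:

```python
from urllib.request import urlopen

def split_docs(text):
    """One document per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Fetch and split (uncomment to run against a real URL):
# docs = split_docs(urlopen("https://example.com/poems.txt").read().decode("utf-8"))
print(split_docs("roses are red\n\nviolets are blue\n"))
```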

Dataset Tips

Quality > Quantity

  • One item per line (names, words, short phrases)
  • Keep items shorter than block_size
  • More repetition means faster learning; duplicating a small dataset, as in the first example, is fine
  • Character-level tokenization works best on data with consistent, repeated patterns
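
The block_size tip above can be checked before training with a quick filter. This is a sketch, assuming character-level tokenization (so length in characters equals length in tokens); whether the library also reserves slots for special tokens is an assumption, so leaving slack is prudent.

```python
block_size = 32

# Drop items that would not fit in the model's context window.
docs = ["the cat sat on the mat", "x" * 100]  # second item is too long
kept = [d for d in docs if len(d) < block_size]
print(f"kept {len(kept)} of {len(docs)} docs")
```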

Examples of Custom Data

Domain      Example Data               block_size
Names       alice, bob, charlie        16
Words       python, javascript, rust   16
Molecules   CCO, c1ccccc1, CC(=O)O     32
DNA         ATCGATCG, GCTAGCTA         32
Phrases     the quick brown fox        64
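
One way to pick a block_size for a new dataset is to measure the longest document and round up. The power-of-two rounding here is a common convenience, not a requirement, and the table above leaves extra headroom beyond it; character-level tokenization is assumed.

```python
docs = ["alice", "bob", "charlie"]
longest = max(len(d) for d in docs)  # character-level: chars == tokens

# Round up to the next power of two above the longest document.
block_size = 1
while block_size <= longest:
    block_size *= 2
print(block_size)  # "charlie" is 7 chars -> 8
```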

Next: Tool Usage | Architecture