Build a Small Language Model (SLM) From Scratch

Shravan
Aug 15, 2025
At the current phase of AI evolution, any model with fewer than 1 billion parameters can be called a small language model. For comparison, GPT-3 has ~175 billion parameters, and GPT-4 is rumored to contain around 1 trillion.

Scaling laws tell us that model performance improves as the number of parameters increases. From this we might conclude that a larger architecture always gives better results, but what if a small model could produce better results than a much larger one?

Can we construct a language model with just 10–15 M parameters that produces coherent text?



The complete training pipeline is as follows:

Part 1: Our dataset

We start with a tiny dataset. TinyStories is a synthetic dataset of short stories, generated by GPT-3.5 and GPT-4, that only contain words a typical 3- to 4-year-old understands. We can get it from Hugging Face.


!pip install datasets                        # Hugging Face datasets library
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")  # download the TinyStories dataset

Our dataset looks like this, with every row corresponding to a story. There are 2 million such rows for training and 20,000 such rows for validation.
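
A quick way to inspect it (a minimal sketch):

print(ds)                        # DatasetDict with 'train' and 'validation' splits
print(ds['train'][0]['text'])    # the first story as plain text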

Part 2: Data pre-processing (Tokenization, Input-Output Pairs)


Let us take a small example from our dataset, and our goal is to do 2 things: 1) Tokenize the dataset — we will use the GPT-2 sub-word tokenizer. 2) Store all token IDs in a single .bin file.

Take input data and convert it into a numerical format.

We will take every story, map each of its tokens to a token ID, and then merge all these token IDs into one very large corpus of token IDs. This will be our training data.

For an example story, the tokens are stored in a dictionary holding the list of token IDs and the story's length in tokens.

Similarly, every story gets its own dictionary of token IDs and length; we will merge all of these together and save the result in .bin format on local disk storage.

Why do we use disk storage? Because the merged corpus of token IDs can be far larger than RAM; keeping it in a file on disk and memory-mapping it lets training read only the slices it needs (more on this in the summary below).

1. Installing and Importing Libraries

https://github.com/openai/tiktoken

!pip install tiktoken        # Installs the tiktoken library for tokenization

import tiktoken              # GPT(-2/3/4) compatible token encoding/decoding
import os                    # For filesystem operations
import numpy as np           # Numerical operations / memmap arrays
from tqdm.auto import tqdm   # Progress bars for loops

2. Tokenizer Initialization

enc = tiktoken.get_encoding("gpt2")  # Load the GPT-2 encoding scheme (Byte-Pair Encoding)
  • This is the GPT-2 Byte-Pair Encoding tokenizer; it converts raw text into a sequence of integer token IDs (later GPT models use their own, larger BPE vocabularies).
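
For instance, a quick round-trip through the tokenizer (a small illustration; the printed IDs depend on the exact text):

text = "Once upon a time"
ids = enc.encode_ordinary(text)   # a short list of integer token IDs
print(ids)
print(enc.decode(ids))            # decodes back to the original text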

3. Tokenization Function

def process(example):
    ids = enc.encode_ordinary(example['text'])  # Encode to token IDs (no special tokens)
    out = {'ids': ids, 'len': len(ids)}
    return out
  • Input: dictionary with a 'text' entry (one document/sample).

  • Output: dictionary with:

  • 'ids': List of token ids (integers)

  • 'len': Number of tokens
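
As a quick illustration of what process returns (a minimal sketch; the sample story here is made up):

sample = {'text': 'Tom and Lily found a big red ball in the garden.'}
out = process(sample)
print(out['len'])        # number of tokens in this story
print(out['ids'][:5])    # the first few token IDs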

4. Tokenizing the Dataset

if not os.path.exists("train.bin"):
    tokenized = ds.map(
        process,
        remove_columns=['text'],
        desc="tokenizing the splits",
        num_proc=8,
    )
  • Checks for existing (“train.bin”) file — avoids redundant work.

ds.map:

  • Applies process to every text sample in the dataset (ds)

  • Removes the original ‘text’ column

  • Uses multiple processors for speed (num_proc=8)

  • Adds progress label
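
After this step, tokenized is a DatasetDict whose rows carry 'ids' and 'len' columns instead of 'text' (a sketch of the expected structure; exact row counts come from the dataset itself):

print(tokenized)
# DatasetDict({
#     train:      Dataset({features: ['ids', 'len'], num_rows: ~2,000,000})
#     validation: Dataset({features: ['ids', 'len'], num_rows: ~20,000})
# })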

5. Saving Tokenized Data to Binary Files

    for split, dset in tokenized.items():
        arr_len = np.sum(dset['len'], dtype=np.uint64)
        filename = f'{split}.bin'
        dtype = np.uint16  # Safe, as GPT-2 token values < 65536
        arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))
        total_batches = 1024

        idx = 0
        for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
            batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
            arr_batch = np.concatenate(batch['ids'])
            arr[idx : idx + len(arr_batch)] = arr_batch
            idx += len(arr_batch)
        arr.flush()

This code is the part of the data pre-processing pipeline that lets us train transformer models efficiently on large text datasets.

For each split (train, val, etc.):

  • Computes total token count for that split.

  • Prepares a large binary file (*.bin) using numpy.memmap for efficiency.

  • Batches writing for speed and memory (1,024 shards).

  • Loops over each batch; fetches the batch’s tokens as numpy arrays.

  • Writes the sequential batch token IDs to the memory-mapped array.

  • Flushes the memmap file to disk at the end.

Summary & What This Is For

  • Purpose: Converts a text dataset into GPT-style token IDs and stores them sequentially in fast binary files, optimized for training neural LMs like GPT.

  • Why Batches: Efficient I/O and memory usage on large datasets.

  • Why memmap: Handles datasets larger than RAM by mapping files directly in memory.

  • Why np.uint16: Token IDs (max 50256 for GPT-2) fit into 16 bits, so it saves space.
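
As a quick sanity check (a minimal sketch that simply reads back the file written above with the same tokenizer):

train_ids = np.memmap('train.bin', dtype=np.uint16, mode='r')   # read-only view of the token stream
print(f"train.bin holds {len(train_ids):,} tokens")
print(enc.decode(train_ids[:100].tolist()))                     # decode the first ~100 tokens back to text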

We need to split the dataset into batches and add everything to “train.bin”.

To quickly summarize:

  • Every story is broken into a bunch of token IDs

  • Every story has a length (its number of tokens) associated with it

  • All of these token IDs need to be stored in one place

  • This place is called train.bin, and the file is saved on disk instead of in RAM

  • train.bin will contain all token IDs together.

(1) Tokenize the dataset into token IDs.

(2) Create files called “train.bin” and “validation.bin” where we store the token IDs from the entire dataset.

(3) Make sure the token IDs are stored on disk, rather than in RAM, for efficient computation.


Part 3: Create Input-Output Batches for the Dataset

Now, let us create the input-output pairs from the dataset.

Let's assume the context size is 4 tokens and the batch size is also 4. For an input sequence (x1, x2, x3, x4), the corresponding output (y1, y2, y3, y4) is the same text shifted to the right by 1 token, so each target yi is the token that immediately follows xi.
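
A minimal sketch of how such batches can be sampled from the memory-mapped .bin files (this assumes a PyTorch training loop later in the pipeline; the get_batch name and its details are illustrative, in the style of nanoGPT, rather than the author's exact code):

import numpy as np
import torch

block_size = 4   # context size from the example above
batch_size = 4

def get_batch(split):
    # Re-open the memmap on every call so the full corpus never has to sit in RAM.
    data = np.memmap(f'{split}.bin', dtype=np.uint16, mode='r')
    # Random starting offsets, one per sequence in the batch.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    # Targets are the same positions shifted one token to the right.
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y

xb, yb = get_batch('train')
print(xb.shape, yb.shape)   # torch.Size([4, 4]) torch.Size([4, 4])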

A guest post by Shravan, Associate Director - DSAI @Novartis, Research Scholar, with vast experience in Analytics, Data Science, and AI.