Qwen 3 Mathematical Reasoning Fine Tuning with GRPO Technique #1
Hands-on Tutorial on Reasoning Fine Tuning Qwen 3 with GRPO
Enhancing the reasoning abilities of Large Language Models (LLMs) is important for their application to complex tasks. This technical guide begins a practical walkthrough for converting the Qwen3 4B-Base model into a reasoning model via Group Relative Policy Optimization (GRPO), using OpenR1's Math dataset.
As the first part in a series, this article focuses on the foundational steps required before commencing the fine-tuning loop. It provides an introduction to the GRPO algorithm, details the setup of the necessary computational environment, outlines the procedures for loading the Qwen 3 base model and tokenizer, and describes the essential steps for acquiring and preparing the target dataset. Completing these stages prepares the user for the reward modeling and fine-tuning processes detailed in Part 2.
Table of Contents:
Introduction to GRPO [Part 1]
Setting Up the Working Environment [Part 1]
Loading the Model & Tokenizer [Part 1]
Loading & Preprocessing the Dataset [Part 1]
Define Reward Function [Part 2]
Model Reasoning Fine Tuning [Part 2]
Testing the Fine-Tuned Model [Part 2]
Saving the Model Locally & Hugging Face Hub [Part 2]
1. Introduction to GRPO
GRPO (Group Relative Policy Optimization) is an advanced reinforcement learning technique designed to enhance the efficiency of fine-tuning large language models.
It applies reinforcement learning on top of a pretrained model, refining the model's behaviour with reward signals rather than direct supervision. Instead of training a separate value (critic) model as PPO does, GRPO samples a group of completions for each prompt and scores each one relative to the others in its group, optimizing the model's parameters iteratively with a policy-based approach.
In a typical fine-tuning scenario, the model is trained on a supervised dataset, where it directly learns from ground truth labels. In contrast, GRPO introduces a reinforcement learning (RL) paradigm where the model is trained to maximize a reward signal that guides its behaviour.
This process allows the model to adapt more flexibly to task-specific nuances, improving both accuracy and generalization.
The key formula for policy optimization in GRPO can be expressed as:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \mathrm{clip}\left(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \right) \right]$$

Where:

$q$ is a prompt sampled from the training data, and $o_1, \dots, o_G$ are the $G$ completions sampled for it from the old policy $\pi_{\theta_{\mathrm{old}}}$.
$r_{i,t}(\theta) = \dfrac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the token-level probability ratio between the current policy $\pi_\theta$ and the old policy.
$\hat{A}_{i,t} = \dfrac{R_i - \mathrm{mean}(R_1, \dots, R_G)}{\mathrm{std}(R_1, \dots, R_G)}$ is the group-relative advantage, computed from the rewards $R_i$ assigned to the completions in the group.
$\epsilon$ is the clipping range, and $\beta$ weights the KL penalty that keeps $\pi_\theta$ close to a reference policy $\pi_{\mathrm{ref}}$.
This policy-based approach ensures that the model continuously adapts to the feedback provided during training, focusing on improving the reward signal that corresponds to task-specific goals.
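To make the group-relative part of this objective concrete, here is a minimal sketch (not from the article's notebook) of how the advantage for each completion can be computed by normalizing its reward against the rest of its group. The reward values are hypothetical, and real implementations such as TRL's GRPOTrainer handle this internally.

import statistics

def group_relative_advantages(rewards):
    # Normalize each completion's reward against the mean and std of its group.
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    if std_r == 0.0:
        # All completions scored identically: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean_r) / std_r for r in rewards]

# Hypothetical rewards for G = 4 completions sampled for the same prompt
rewards = [1.0, 0.0, 0.5, 0.0]
print(group_relative_advantages(rewards))
# Completions scoring above the group mean get positive advantages,
# those below the mean get negative ones.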
In GRPO, the reward function can be defined according to specific task requirements, guiding the model to focus on the desired behaviour. The reward can be a function of multiple factors, such as accuracy, formatting, or logical consistency. For instance, a correctness reward function $R_{\mathrm{correct}}$ could be defined as:

$$R_{\mathrm{correct}}(o, a^{*}) = \begin{cases} 1 & \text{if the final answer extracted from completion } o \text{ matches the ground-truth answer } a^{*} \\ 0 & \text{otherwise} \end{cases}$$
This feedback mechanism allows GRPO to progressively refine the model, emphasizing areas that matter most for the given task.
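As an illustration only (the reward functions actually used for fine-tuning are defined in Part 2), a correctness reward of this kind might look like the following sketch. Here extract_final_answer is a hypothetical helper, and its extraction logic is deliberately naive.

def extract_final_answer(completion: str) -> str:
    # Hypothetical, deliberately naive extraction: take the last token of the completion.
    tokens = completion.strip().split()
    return tokens[-1] if tokens else ""

def correctness_reward(completion: str, ground_truth: str) -> float:
    # 1.0 if the extracted answer matches the ground truth, 0.0 otherwise.
    return 1.0 if extract_final_answer(completion) == ground_truth.strip() else 0.0

print(correctness_reward("The final answer is 42", "42"))  # 1.0
print(correctness_reward("The final answer is 41", "42"))  # 0.0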
2. Setting Up the Working Environment
Before we dive into the fine-tuning of Qwen 3 with GRPO, we need to ensure our environment is correctly configured with all the necessary libraries. The following code block handles the installation of essential packages:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt
Let’s break down what’s happening here:
1. %%capture: This is an IPython magic command used in environments like Jupyter Notebooks or Google Colab. It simply suppresses the output of the cell, keeping our notebook cleaner by hiding the potentially lengthy installation logs.
2. Base Installation:
!pip install --no-deps ...: We install the core libraries using pip. The --no-deps flag is important here; it tells pip not to automatically install dependencies for these packages. This gives us finer control over the exact versions of dependencies, preventing potential conflicts, especially in environments like Colab, where some packages are pre-installed.
unsloth: This is a fantastic library designed to significantly speed up LLM fine-tuning and reduce memory usage, often enabling training larger models on consumer GPUs. It’s a key component for efficient fine-tuning.
vllm: A high-throughput engine for LLM inference and serving. While our focus is fine-tuning, vLLM might be useful for efficient evaluation or deployment later.
3. Colab-Specific Handling:
"COLAB_" in "".join(os.environ.keys()): This checks whether the code is running within a Google Colab environment by looking for Colab-specific environment variables. Colab has its own pre-installed packages and behaviours, often requiring tailored setup steps.
Module Cleanup: The sys.modules.pop(…) section is a common workaround in Colab. Sometimes, pre-loaded versions of libraries (like PIL/Pillow or Google Cloud libraries) can clash with versions we intend to install, leading to errors or forcing a runtime restart. This code proactively removes potentially conflicting modules from Python’s cache before installing our specific versions, aiming for a smoother setup.
4. Additional Colab Dependencies:
bitsandbytes: Essential for enabling techniques like 4-bit quantization (QLoRA), drastically reducing the model’s memory footprint.
accelerate: A Hugging Face library that simplifies running PyTorch training scripts across different hardware configurations (CPU, single/multi-GPU, TPU) and handles mixed-precision training.
xformers: Provides memory-efficient attention mechanisms and other optimized building blocks for Transformers, often yielding speedups and memory savings. Note the specific version 0.0.29.post3 is pinned for compatibility.
peft: The Hugging Face Parameter-Efficient Fine-Tuning library. This provides methods like LoRA (Low-Rank Adaptation), which Unsloth heavily utilizes, allowing us to fine-tune massive models by only updating a small subset of parameters.
trl: The Hugging Face Transformer Reinforcement Learning library. This is crucial for our task, as it contains implementations of preference- and reward-based tuning algorithms such as DPO (Direct Preference Optimization) and, most importantly for this tutorial, the GRPO (Group Relative Policy Optimization) method we're focusing on, via its GRPOTrainer. We pinned version 0.15.2.
triton, cut_cross_entropy, unsloth_zoo: These are likely lower-level dependencies or utilities used by Unsloth or other optimization libraries for custom GPU kernels and optimized operations.
sentencepiece, protobuf, datasets, huggingface_hub, hf_transfer: Standard components of the Hugging Face ecosystem for handling tokenization, data loading, and interacting with the Hugging Face Hub (downloading models/datasets, uploading results), with hf_transfer providing accelerated transfers.
5. Careful vLLM Dependency Installation: Because vLLM has its own dependencies, and we already installed specific versions of transformers and xformers, we need to be careful. This code fetches vLLM’s requirements list, uses a regular expression (re.sub) to remove the lines specifying transformers, numpy, and xformers (to avoid overwriting our specific versions or causing conflicts), saves the modified list, and then installs the remaining vLLM dependencies from that file. This ensures compatibility between all the libraries.
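To see exactly what that re.sub call removes, here is a small self-contained example on a made-up requirements snippet (the package pins below are hypothetical):

import re

# Hypothetical excerpt of a downloaded requirements file (bytes, as in the install cell).
sample = b"transformers==4.49.0\nnumpy>=1.26\nxformers==0.0.29\nsentencepiece\nray>=2.9\n"

# Same pattern as above: drop any line that pins transformers, numpy, or xformers,
# so the versions we installed earlier are left untouched.
filtered = re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", sample)
print(filtered.decode())
# sentencepiece
# ray>=2.9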
3. Loading the Model & Tokenizer
With our environment set up, the next crucial step is to load the Qwen 3 model itself. We'll leverage the unsloth library here, specifically its FastLanguageModel class, which is engineered for significantly faster loading and reduced memory usage compared to standard Hugging Face methods.
Here’s the code to load the model and its corresponding tokenizer:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Can increase for longer reasoning traces
lora_rank = 32         # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False,        # False for LoRA 16bit
    fast_inference = True,       # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7,  # Reduce if out of memory
)
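As a quick, optional sanity check (not part of the original notebook), you can confirm the tokenizer works and that a typical math prompt uses only a small fraction of max_seq_length:

# Illustrative sanity check: tokenize a sample math prompt and inspect its length.
sample_prompt = "Solve for x: 3x + 7 = 22. Show your reasoning step by step."
token_ids = tokenizer(sample_prompt, return_tensors="pt").input_ids
print(f"Prompt uses {token_ids.shape[1]} of {max_seq_length} tokens")
print(tokenizer.decode(token_ids[0]))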