Enhancing the reasoning abilities of Large Language Models (LLMs) is important for their application in complex tasks. This technical guide provides a practical walkthrough for fine-tuning the Gemma 3 model specifically for reasoning, using the GRPO (Group Relative Policy Optimization) method.
The first part covered the foundational steps required before starting the fine-tuning loop. It introduced the GRPO algorithm, detailed the setup of the necessary computational environment, outlined the procedure for loading the Gemma 3 base model and tokenizer, and described the essential steps for acquiring and preparing the target dataset.
In this second part, we complete the remaining stages: first we define the reward functions used to train the model, then we fine-tune the model and test it after fine-tuning, and finally we save it locally and push it to the Hugging Face Hub.
Table of Contents:
Define Reward Function [Part 2]
Model Reasoning Fine Tuning [Part 2]
Testing the Fine-Tuned Model [Part 2]
Saving the Model Locally & Hugging Face Hub [Part 2]
You can find the code used in this article in this GitHub Repo.
My New E-Book: LLM Roadmap from Beginner to Advanced Level
I am pleased to announce that I have published my new ebook LLM Roadmap from Beginner to Advanced Level. This ebook will provide all the resources you need to start your journey towards mastering LLMs.
5. Define Reward Function
The reward function is a crucial component that guides the model’s learning process. Unlike traditional reinforcement learning approaches that rely on a learned value function (critic), GRPO computes the advantage of each generated response by comparing its reward to the average reward of a group of responses generated for the same prompt.
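To make the group-relative idea concrete, here is a minimal illustrative sketch of how per-response advantages can be computed from a group of rewards. This is a simplification for intuition only, not the exact implementation inside the TRL GRPOTrainer.
# Illustrative sketch: GRPO-style advantages from a group of rewards (not the exact TRL implementation)
import numpy as np

def group_relative_advantages(rewards, eps=1e-4):
    # Each response's reward is compared to the mean (and std) of its own group,
    # so no learned value function (critic) is needed.
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four responses sampled for the same prompt, scored by the reward functions defined below:
print(group_relative_advantages([3.0, 0.0, 1.5, 0.0]))  # higher-reward responses get positive advantages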
We will create a regular expression pattern match_format to detect the reasoning response structure in the model's outputs. This structure includes a reasoning section marked by reasoning_start and reasoning_end, followed by a solution section enclosed by solution_start and solution_end.
The regex uses the re.MULTILINE and re.DOTALL flags so that ^ and $ can match at line boundaries and the . wildcard can match newline characters inside the reasoning or solution sections.
This is crucial because GRPO relies on reward functions to guide the model toward not just correct answers, but also well-structured, explainable reasoning — and this regex enforces that expected format.
# Create a regex pattern to match the reasoning sections and answers.
# reasoning_start, reasoning_end, solution_start, and solution_end are the marker
# strings defined in Part 1 (e.g. "<start_working_out>", "<end_working_out>",
# "<SOLUTION>", "</SOLUTION>").
import re

match_format = re.compile(
    rf"^[\s]{{0,}}"
    rf"{reasoning_start}.+?{reasoning_end}.*?"
    rf"{solution_start}(.+?){solution_end}"
    rf"[\s]{{0,}}$",
    flags = re.MULTILINE | re.DOTALL,
)
Let's test whether the regex pattern can correctly identify a simple string containing both reasoning and solution sections. This acts as a unit test for the pattern and ensures the format-enforcement logic works as intended before integrating it into a reward function.
match_format.search(
    "<start_working_out>Let me think!<end_working_out>"
    "<SOLUTION>2</SOLUTION>",
)
<re.Match object; span=(0, 71), match='<start_working_out>Let me think!<end_working_out>>
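As a quick negative test, a completion that skips the working-out section should not match at all; the pattern returns None, which is exactly what the strict reward function below relies on.
# A completion without the reasoning markers should NOT match:
print(match_format.search("<SOLUTION>2</SOLUTION>"))  # None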
Let's define the first reward function, match_format_exactly, which rewards completions that match the full expected format perfectly. For each model-generated completion, it extracts the text content and applies the defined regex.
If the pattern matches, it assigns a reward of 3 points; otherwise, 0. This strict reward encourages the model to consistently produce reasoning and solution sections in the exact structure we want, which is critical when fine-tuning for applications where structured explanations matter as much as answers.
# Create a reward function to match the format exactly - we reward it with 3 points if it succeeds:
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if the format is seen exactly!
        if match_format.search(response) is not None:
            score += 3.0
        scores.append(score)
    return scores
Since not every generation will be perfect, we will define another reward function, match_format_approximately, which offers a softer reward mechanism.
Instead of looking for a complete match, it checks for the presence of each expected marker (reasoning_start, reasoning_end, solution_start, solution_end) and assigns or deducts 0.5 points based on whether each appears exactly once.
This way, even if the model partially follows the expected structure, it still earns some points, nudging it in the right direction while penalizing overuse or omission of key markers. This incremental feedback helps guide the model progressively closer to our target format.
# If the exact match fails, the model is still rewarded if it at least follows the format partially, by counting each symbol:
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if there are too many!
        # If a marker appears exactly once, add some points; otherwise subtract some.
        score += 0.5 if response.count(reasoning_start) == 1 else -0.5
        score += 0.5 if response.count(reasoning_end)   == 1 else -0.5
        score += 0.5 if response.count(solution_start)  == 1 else -0.5
        score += 0.5 if response.count(solution_end)    == 1 else -0.5
        scores.append(score)
    return scores
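Before moving on, it is worth sanity-checking both reward functions on a couple of hand-written completions. The nested structure below mirrors the chat-style completions the trainer passes in (a list of messages per completion); the examples themselves are hypothetical.
# Quick sanity check of both reward functions on two hand-written completions:
sample_completions = [
    [{"role": "assistant", "content":
      "<start_working_out>1 + 1 = 2<end_working_out><SOLUTION>2</SOLUTION>"}],
    [{"role": "assistant", "content": "The answer is 2."}],  # no markers at all
]
print(match_format_exactly(sample_completions))        # expected: [3.0, 0]
print(match_format_approximately(sample_completions))  # expected: [2.0, -2.0]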
Finally, we will define the check_answer function that ties everything together by evaluating the actual answer content. It first extracts the proposed solution using the match_format regex.
If no solution is found, it scores 0. If the answer is exactly correct, it scores 3 points. A slightly reduced reward (1.5 points) is given if the answer matches after trimming whitespace, and a smaller, proportional reward is assigned if the numerical answer falls close to the ground truth within certain thresholds (e.g., within 10% or 20%). Incorrect or malformed answers are penalized.
This careful reward shaping not only encourages accurate answers but also tolerates small numerical discrepancies, important for reasoning problems where minor arithmetic errors might still indicate good reasoning patterns.
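A minimal sketch of this function, following the scoring scheme just described, could look like the following. The ground-truth column name (answer) and the exact partial-credit values for the near-miss bands are assumptions here, not the post's exact code.
# Sketch of the answer-checking reward described above (partial-credit values are illustrative):
def check_answer(completions, answer, **kwargs):
    scores = []
    for completion, true_answer in zip(completions, answer):
        score = 0
        response = completion[0]["content"]
        match = match_format.search(response)
        guess = match.group(1) if match is not None else None
        if guess is None:
            scores.append(0)          # no solution section found
            continue
        if guess == true_answer:
            score += 3.0              # exactly correct
        elif guess.strip() == true_answer.strip():
            score += 1.5              # correct after trimming whitespace
        else:
            # Proportional reward if the numeric answer is close to the ground truth
            try:
                ratio = float(guess) / float(true_answer)
                if 0.9 <= ratio <= 1.1:
                    score += 0.5      # within roughly 10%
                elif 0.8 <= ratio <= 1.2:
                    score += 0.25     # within roughly 20%
                else:
                    score -= 1.0      # clearly wrong answer
            except (ValueError, ZeroDivisionError):
                score -= 1.5          # malformed / non-numeric answer
        scores.append(score)
    return scores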