Math SFT Training Examples

This directory contains examples for supervised fine-tuning (SFT) of math reasoning models using the RLLM framework. The SFT training pipeline generates high-quality trajectories from a teacher model and fine-tunes a student model on the successful trajectories.

Our examples use the following:

Qwen/Qwen2.5-Math-7B-Instruct as the base model
Qwen/Qwen3-4B as the teacher model for trajectory generation
DeepScaleR math dataset for training data

The Math SFT examples demonstrate:

How to generate high-quality training data from teacher model trajectories
How to perform supervised fine-tuning on successful math reasoning trajectories
How to fine-tune math reasoning models using the DeepScaleR dataset

Quick Start

Model Hosting

Start a vLLM server with OpenAI-compatible API for the teacher model:

CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-4B \
    --host 0.0.0.0 \
    --port 30000 \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --max-model-len 32768

The server should be accessible at http://localhost:30000/v1

Dataset Preparation

First prepare the dataset:

cd examples/sft
python prepare_math_data.py

Generate SFT training data from teacher model trajectories:

cd examples/sft
python generate_sft_data.py --num_samples 1000 --trajectories_per_problem 4 --reward_threshold 1.0 --output large_sft_data.parquet

This will: - Load problems from the DeepScaleR math dataset - Generate trajectories using the teacher model (Qwen3-4B) - Filter trajectories by reward threshold - Save successful trajectories in SFT format

Configuration Options: - --num_samples: Number of problems to generate trajectories for (default: 500) - --trajectories_per_problem: Number of trajectories per problem (default: 4) - --reward_threshold: Minimum reward score to include trajectory (default: 1.0) - --output: Output file name (default: "sft_data.parquet")

Training

Run SFT with the generated data:

bash train_math_sft.sh

Configuration Options: You can modify the training script parameters: - model.partial_pretrain: Base model to fine-tune - trainer.total_epochs: Number of training epochs - data.train_batch_size: Total batch size across all GPUs - data.micro_batch_size_per_gpu: Batch size per GPU - data.max_length: Maximum sequence length - data.train_files: Training data file - data.val_files: Validation data file

The training script will: - Load the base model (Qwen2.5-Math-7B-Instruct) - Fine-tune on the generated SFT data - Save checkpoints to outputs/qwen2.5_math_sft/

Evaluation

You can launch a server for evaluation with:

CUDA_VISIBLE_DEVICES=0 python -m sglang_router.launch_server \
    --model-path Qwen/Qwen2.5-Math-7B-Instruct \
    --lora-path examples/sft/outputs/qwen2.5_math_sft/global_step_2784 \
    --dp-size 1 \
    --dtype bfloat16 \
    --disable-radix-cache \
    --context-length 16384 \
    --port 30000

The server should be accessible at http://localhost:30000/v1

Evaluate the trained model using the saved checkpoint:

cd examples/sft
python run_sft_model.py --model_path outputs/qwen2.5_math_sft/global_step_2784/

Replace outputs/qwen2.5_math_sft/global_step_2784/ with the actual path to your trained model checkpoint.

Code Reference

SFT Data Generator

Main script for generating SFT training data:

examples/sft/generate_sft_data.py

import argparse
import asyncio
import os

import pandas as pd

from rllm.agents.agent import Trajectory
from rllm.data.dataset import DatasetRegistry
from rllm.trainer.agent_sft_trainer import AgentSFTTrainer


def load_problems(num_samples, trajectories_per_problem=4):
    dataset = DatasetRegistry.load_dataset("deepscaler_math", "train")
    if dataset is None:
        raise RuntimeError("Dataset 'deepscaler_math' not found. Run prepare_math_data.py first to register the dataset.")

    data = dataset.get_data()
    if num_samples < len(data):
        df = pd.DataFrame(data)
        df = df.sample(n=num_samples, random_state=42)
        data = df.to_dict("records")

    # Generate multiple trajectories per problem
    tasks = []
    for i in range(trajectories_per_problem):
        for j, row in enumerate(data):
            tasks.append({"question": row["question"], "ground_truth": row["ground_truth"], "uid": f"{j}_{i}"})

    return tasks


async def generate_trajectories(tasks) -> list[Trajectory]:
    """Generate trajectories using Qwen3-4B with tool usage."""
    from transformers import AutoTokenizer

    from rllm.agents import ToolAgent
    from rllm.engine import AsyncAgentExecutionEngine
    from rllm.environments.tools.tool_env import ToolEnvironment
    from rllm.rewards.reward_fn import math_reward_fn

    os.environ["TOKENIZERS_PARALLELISM"] = "true"
    model_name = "Qwen/Qwen3-4B"

    agent_args = {"tools": ["python"], "parser_name": "qwen", "system_prompt": "You are a math assistant that can write python to solve math problems."}
    env_args = {"tools": ["python"], "reward_fn": math_reward_fn}
    engine = AsyncAgentExecutionEngine(
        agent_class=ToolAgent,
        env_class=ToolEnvironment,
        agent_args=agent_args,
        env_args=env_args,
        engine_name="openai",
        tokenizer=AutoTokenizer.from_pretrained(model_name),
        sampling_params={"temperature": 0.6, "top_p": 0.95, "model": model_name},
        rollout_engine_args={"base_url": "http://localhost:30000/v1", "api_key": "None"},
        max_response_length=15000,
        max_prompt_length=2048,
        n_parallel_agents=64,
    )

    return await engine.execute_tasks(tasks)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_samples", type=int, default=500)
    parser.add_argument("--trajectories_per_problem", type=int, default=4)
    parser.add_argument("--reward_threshold", type=float, default=1.0)
    parser.add_argument("--output", type=str, default="sft_data.parquet")
    args = parser.parse_args()

    # Load problems with replication
    tasks = load_problems(args.num_samples, args.trajectories_per_problem)

    print(f"Generating {len(tasks)} trajectories from {args.num_samples} problems ({args.trajectories_per_problem} per problem)...")

    # Generate trajectories
    results = asyncio.run(generate_trajectories(tasks))

    # Process trajectories with tool call filtering
    sft_data = AgentSFTTrainer.process_trajectories(results, args.reward_threshold, filter_tool_calls=True)

    # Save results
    if sft_data:
        pd.DataFrame(sft_data).to_parquet(args.output, index=False)
        lengths = [len(" ".join([m["content"] for m in ex["messages"] if m["role"] == "assistant"])) for ex in sft_data]
        print(f"Saved {len(sft_data)} examples with tool usage. Response lengths: min={min(lengths)}, max={max(lengths)}, avg={sum(lengths) // len(lengths)}")
    else:
        print("No valid data generated!")


if __name__ == "__main__":
    main()

Math SFT Model Evaluator

Script for evaluating SFT model performance:

examples/sft/run_sft_model.py

import argparse
import asyncio

from transformers import AutoTokenizer

from rllm.agents import ToolAgent
from rllm.data.dataset import DatasetRegistry
from rllm.engine.agent_execution_engine import AgentExecutionEngine
from rllm.environments.tools.tool_env import ToolEnvironment
from rllm.rewards.reward_fn import math_reward_fn
from rllm.utils import compute_pass_at_k

if __name__ == "__main__":
    import os

    os.environ["TOKENIZERS_PARALLELISM"] = "true"

    parser = argparse.ArgumentParser(description="Evaluate trained SFT math model")
    parser.add_argument("--model_path", type=str, required=True, help="Path to the trained model checkpoint (e.g., outputs/qwen2.5_math_sft/global_step_2784/)")
    args = parser.parse_args()

    n_parallel_agents = 64

    model_name = args.model_path

    # model_name = "Qwen/Qwen2.5-Math-7B-Instruct" # base model eval

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    agent_args = {
        "tools": ["python"],
        "parser_name": "qwen",
        "system_prompt": ('You are an expert mathematician and programmer. Your goal is to solve challenging math problems, like those from the AIME competition, by breaking them down into logical steps and using Python code for calculations. Strive for clarity and efficiency.\n\nFollow this process for every problem:\n1.  **Analyze the Problem**: Read the question carefully. Identify the key information, constraints, and what is being asked.\n2.  **Think Step-by-Step**: In the `<think>` block, outline your plan. Decompose the problem into the smallest, most logical steps. **You must not write code or perform calculations in this block.** Your goal is to create a plan that will be executed by the Python tool.\n3.  **Write Python Code**: In the `<tool_call>` block, write an efficient Python script to execute your plan. The tool expects a JSON object with `name` and `arguments` keys. The `arguments` should be a dictionary with a single `code` key. Ensure the code is self-contained, runs quickly, and prints the final result.\n4.  **State the Final Answer**: After receiving the `<tool_result>`, verify it. Then, state the final answer clearly and concisely in the format \\boxed{answer}.\n\nHere is an example:\nQuestion: What is the largest prime factor of 25! ?\n<think>The problem asks for the largest prime factor of 25 factorial. The largest prime factor of n! is the largest prime number less than or equal to n. In this case, n=25. I will write a Python script to find the largest prime number less than or equal to 25.</think>\n<tool_call>\n{"name": "python", "arguments": {"code": "import math\\ndef is_prime(n):\\n    if n <= 1:\\n        return False\\n    for i in range(2, int(math.sqrt(n)) + 1):\\n        if n % i == 0:\\n            return False\\n    return True\\n\\ndef largest_prime_up_to(n):\\n    for i in range(n, 1, -1):\\n        if is_prime(i):\\n            return i\\n    return None\\n\\nprint(largest_prime_up_to(25))"}}\n</tool_call>\n<tool_result>\n23\n</tool_result>\nThe largest prime factor of 25! is the largest prime number less than or equal to 25. The answer is \\boxed{23}.'),
    }
    env_args = {
        "tools": ["python"],
        "reward_fn": math_reward_fn,
    }

    sampling_params = {"temperature": 0.6, "top_p": 0.95, "model": model_name}

    engine = AgentExecutionEngine(
        agent_class=ToolAgent,
        agent_args=agent_args,
        env_class=ToolEnvironment,
        env_args=env_args,
        engine_name="openai",
        rollout_engine_args={"base_url": "http://localhost:30000/v1", "api_key": "None"},
        tokenizer=tokenizer,
        sampling_params=sampling_params,
        max_response_length=14000,
        max_prompt_length=2000,
        n_parallel_agents=n_parallel_agents,
        max_steps=20,  # Allow more steps for complex problems
    )

    test_dataset = DatasetRegistry.load_dataset("aime2024", "test")
    if test_dataset is None:
        print("Dataset not found, preparing dataset...")
        from prepare_math_data import prepare_math_data

        _, test_dataset = prepare_math_data()

    tasks = test_dataset.repeat(n=8)  # repeat to evaluate pass@k

    results = asyncio.run(engine.execute_tasks(tasks))
    compute_pass_at_k(results)

For detailed setup instructions, see the README in the sft example directory.