Math SFT Training Examples
This directory contains examples for supervised fine-tuning (SFT) of math reasoning models using the RLLM framework. The SFT training pipeline generates high-quality trajectories from a teacher model and fine-tunes a student model on the successful trajectories.
Our examples use the following:
- Qwen/Qwen2.5-Math-7B-Instruct as the base model
- Qwen/Qwen3-4B as the teacher model for trajectory generation
- DeepScaleR math dataset for training data
The Math SFT examples demonstrate:
- How to generate high-quality training data from teacher model trajectories
- How to perform supervised fine-tuning on successful math reasoning trajectories
- How to fine-tune math reasoning models using the DeepScaleR dataset
Quick Start
Model Hosting
Start a vLLM server with OpenAI-compatible API for the teacher model:
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-4B \
--host 0.0.0.0 \
--port 30000 \
--tensor-parallel-size 2 \
--dtype bfloat16 \
--max-model-len 32768
The server should be accessible at http://localhost:30000/v1
Dataset Preparation
First prepare the dataset:
Generate SFT training data from teacher model trajectories:
cd examples/sft
python generate_sft_data.py --num_samples 1000 --trajectories_per_problem 4 --reward_threshold 1.0 --output large_sft_data.parquet
This will: - Load problems from the DeepScaleR math dataset - Generate trajectories using the teacher model (Qwen3-4B) - Filter trajectories by reward threshold - Save successful trajectories in SFT format
Configuration Options:
- --num_samples: Number of problems to generate trajectories for (default: 500)
- --trajectories_per_problem: Number of trajectories per problem (default: 4)
- --reward_threshold: Minimum reward score to include trajectory (default: 1.0)
- --output: Output file name (default: "sft_data.parquet")
Training
Run SFT with the generated data:
Configuration Options:
You can modify the training script parameters:
- model.partial_pretrain: Base model to fine-tune
- trainer.total_epochs: Number of training epochs
- data.train_batch_size: Total batch size across all GPUs
- data.micro_batch_size_per_gpu: Batch size per GPU
- data.max_length: Maximum sequence length
- data.train_files: Training data file
- data.val_files: Validation data file
The training script will:
- Load the base model (Qwen2.5-Math-7B-Instruct)
- Fine-tune on the generated SFT data
- Save checkpoints to outputs/qwen2.5_math_sft/
Evaluation
You can launch a server for evaluation with:
CUDA_VISIBLE_DEVICES=0 python -m sglang_router.launch_server \
--model-path Qwen/Qwen2.5-Math-7B-Instruct \
--lora-path examples/sft/outputs/qwen2.5_math_sft/global_step_2784 \
--dp-size 1 \
--dtype bfloat16 \
--disable-radix-cache \
--context-length 16384 \
--port 30000
The server should be accessible at http://localhost:30000/v1
Evaluate the trained model using the saved checkpoint:
Replace outputs/qwen2.5_math_sft/global_step_2784/ with the actual path to your trained model checkpoint.
Code Reference
SFT Data Generator
Main script for generating SFT training data:
import argparse
import asyncio
import os
import pandas as pd
from rllm.agents.agent import Trajectory
from rllm.data.dataset import DatasetRegistry
from rllm.trainer.agent_sft_trainer import AgentSFTTrainer
def load_problems(num_samples, trajectories_per_problem=4):
dataset = DatasetRegistry.load_dataset("deepscaler_math", "train")
if dataset is None:
raise RuntimeError("Dataset 'deepscaler_math' not found. Run prepare_math_data.py first to register the dataset.")
data = dataset.get_data()
if num_samples < len(data):
df = pd.DataFrame(data)
df = df.sample(n=num_samples, random_state=42)
data = df.to_dict("records")
# Generate multiple trajectories per problem
tasks = []
for i in range(trajectories_per_problem):
for j, row in enumerate(data):
tasks.append({"question": row["question"], "ground_truth": row["ground_truth"], "uid": f"{j}_{i}"})
return tasks
async def generate_trajectories(tasks) -> list[Trajectory]:
"""Generate trajectories using Qwen3-4B with tool usage."""
from transformers import AutoTokenizer
from rllm.agents import ToolAgent
from rllm.engine import AsyncAgentExecutionEngine
from rllm.environments.tools.tool_env import ToolEnvironment
from rllm.rewards.reward_fn import math_reward_fn
os.environ["TOKENIZERS_PARALLELISM"] = "true"
model_name = "Qwen/Qwen3-4B"
agent_args = {"tools": ["python"], "parser_name": "qwen", "system_prompt": "You are a math assistant that can write python to solve math problems."}
env_args = {"tools": ["python"], "reward_fn": math_reward_fn}
engine = AsyncAgentExecutionEngine(
agent_class=ToolAgent,
env_class=ToolEnvironment,
agent_args=agent_args,
env_args=env_args,
engine_name="openai",
tokenizer=AutoTokenizer.from_pretrained(model_name),
sampling_params={"temperature": 0.6, "top_p": 0.95, "model": model_name},
rollout_engine_args={"base_url": "http://localhost:30000/v1", "api_key": "None"},
max_response_length=15000,
max_prompt_length=2048,
n_parallel_agents=64,
)
return await engine.execute_tasks(tasks)
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--num_samples", type=int, default=500)
parser.add_argument("--trajectories_per_problem", type=int, default=4)
parser.add_argument("--reward_threshold", type=float, default=1.0)
parser.add_argument("--output", type=str, default="sft_data.parquet")
args = parser.parse_args()
# Load problems with replication
tasks = load_problems(args.num_samples, args.trajectories_per_problem)
print(f"Generating {len(tasks)} trajectories from {args.num_samples} problems ({args.trajectories_per_problem} per problem)...")
# Generate trajectories
results = asyncio.run(generate_trajectories(tasks))
# Process trajectories with tool call filtering
sft_data = AgentSFTTrainer.process_trajectories(results, args.reward_threshold, filter_tool_calls=True)
# Save results
if sft_data:
pd.DataFrame(sft_data).to_parquet(args.output, index=False)
lengths = [len(" ".join([m["content"] for m in ex["messages"] if m["role"] == "assistant"])) for ex in sft_data]
print(f"Saved {len(sft_data)} examples with tool usage. Response lengths: min={min(lengths)}, max={max(lengths)}, avg={sum(lengths) // len(lengths)}")
else:
print("No valid data generated!")
if __name__ == "__main__":
main()
Math SFT Model Evaluator
Script for evaluating SFT model performance:
import argparse
import asyncio
from transformers import AutoTokenizer
from rllm.agents import ToolAgent
from rllm.data.dataset import DatasetRegistry
from rllm.engine.agent_execution_engine import AgentExecutionEngine
from rllm.environments.tools.tool_env import ToolEnvironment
from rllm.rewards.reward_fn import math_reward_fn
from rllm.utils import compute_pass_at_k
if __name__ == "__main__":
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"
parser = argparse.ArgumentParser(description="Evaluate trained SFT math model")
parser.add_argument("--model_path", type=str, required=True, help="Path to the trained model checkpoint (e.g., outputs/qwen2.5_math_sft/global_step_2784/)")
args = parser.parse_args()
n_parallel_agents = 64
model_name = args.model_path
# model_name = "Qwen/Qwen2.5-Math-7B-Instruct" # base model eval
tokenizer = AutoTokenizer.from_pretrained(model_name)
agent_args = {
"tools": ["python"],
"parser_name": "qwen",
"system_prompt": ('You are an expert mathematician and programmer. Your goal is to solve challenging math problems, like those from the AIME competition, by breaking them down into logical steps and using Python code for calculations. Strive for clarity and efficiency.\n\nFollow this process for every problem:\n1. **Analyze the Problem**: Read the question carefully. Identify the key information, constraints, and what is being asked.\n2. **Think Step-by-Step**: In the `<think>` block, outline your plan. Decompose the problem into the smallest, most logical steps. **You must not write code or perform calculations in this block.** Your goal is to create a plan that will be executed by the Python tool.\n3. **Write Python Code**: In the `<tool_call>` block, write an efficient Python script to execute your plan. The tool expects a JSON object with `name` and `arguments` keys. The `arguments` should be a dictionary with a single `code` key. Ensure the code is self-contained, runs quickly, and prints the final result.\n4. **State the Final Answer**: After receiving the `<tool_result>`, verify it. Then, state the final answer clearly and concisely in the format \\boxed{answer}.\n\nHere is an example:\nQuestion: What is the largest prime factor of 25! ?\n<think>The problem asks for the largest prime factor of 25 factorial. The largest prime factor of n! is the largest prime number less than or equal to n. In this case, n=25. I will write a Python script to find the largest prime number less than or equal to 25.</think>\n<tool_call>\n{"name": "python", "arguments": {"code": "import math\\ndef is_prime(n):\\n if n <= 1:\\n return False\\n for i in range(2, int(math.sqrt(n)) + 1):\\n if n % i == 0:\\n return False\\n return True\\n\\ndef largest_prime_up_to(n):\\n for i in range(n, 1, -1):\\n if is_prime(i):\\n return i\\n return None\\n\\nprint(largest_prime_up_to(25))"}}\n</tool_call>\n<tool_result>\n23\n</tool_result>\nThe largest prime factor of 25! is the largest prime number less than or equal to 25. The answer is \\boxed{23}.'),
}
env_args = {
"tools": ["python"],
"reward_fn": math_reward_fn,
}
sampling_params = {"temperature": 0.6, "top_p": 0.95, "model": model_name}
engine = AgentExecutionEngine(
agent_class=ToolAgent,
agent_args=agent_args,
env_class=ToolEnvironment,
env_args=env_args,
engine_name="openai",
rollout_engine_args={"base_url": "http://localhost:30000/v1", "api_key": "None"},
tokenizer=tokenizer,
sampling_params=sampling_params,
max_response_length=14000,
max_prompt_length=2000,
n_parallel_agents=n_parallel_agents,
max_steps=20, # Allow more steps for complex problems
)
test_dataset = DatasetRegistry.load_dataset("aime2024", "test")
if test_dataset is None:
print("Dataset not found, preparing dataset...")
from prepare_math_data import prepare_math_data
_, test_dataset = prepare_math_data()
tasks = test_dataset.repeat(n=8) # repeat to evaluate pass@k
results = asyncio.run(engine.execute_tasks(tasks))
compute_pass_at_k(results)
For detailed setup instructions, see the README in the sft example directory.