RL Training with Prime Intellect's Environment Hub
This example shows how to train RL agents using environments from Prime Intellect's Environment Hub with the verifiers library.
Overview
With this example you will:
- Install and configure the verifiers library for RL environments
- Download environments from Prime Intellect's Hub using the Prime CLI
- Train agents using either the verl (local GPU) or Tinker (remote GPU) backend
- Understand how rLLM wraps verifiers environments via
VerifiersWorkflow
Under the hood, the integration works as:
- VerifiersWorkflow: Wraps rLLM's
RolloutEngineas anAsyncOpenAI-compatible client - RolloutEngineAsyncClient: Adapter that lets verifiers environments call rLLM for inference
- Rubric scoring: Verifiers environments define rubrics that score model outputs
Setup
1. Install dependencies
Install rLLM with the verifiers extra:
This installs the verifiers library alongside rLLM.
For Tinker backend (remote GPU training), also install:
2. Install Prime CLI
The Prime CLI lets you download environments from the Hub. Install it via:
Or with pipx:
3. Login to Prime Intellect (optional)
Login is required for private environments:
This opens a browser for authentication. Public environments work without login.
4. Install an environment
Download an environment locally:
Format: prime env install <owner>/<environment-name>
For a specific version:
To see available installation methods:
See the Prime Intellect docs for more details.
Environment Arguments
Each verifiers environment can accept custom arguments via +verifiers.env_args. These are passed directly to vf.load_environment().
Passing environment args
Use Hydra's nested syntax:
python -m examples.verifiers_env.train \
+verifiers.env_id="primeintellect/alphabet-sort" \
'+verifiers.env_args={max_turns: 10, difficulty: "hard"}' \
...
Or in the shell script:
VF_ENV_ARGS='max_turns: 20'
python -m examples.verifiers_env.train \
+verifiers.env_id="${VF_ENV_ID}" \
"+verifiers.env_args={${VF_ENV_ARGS}}" \
...
Common environment arguments
| Argument | Description | Example |
|---|---|---|
max_turns |
Maximum conversation turns | 10 |
tools |
List of tools to enable | ["search", "calculator"] |
system_prompt |
Custom system prompt | "You are a helpful assistant" |
Check each environment's documentation for its specific arguments. You can also inspect an environment locally:
import verifiers as vf
# See what arguments load_environment accepts
env = vf.load_environment("primeintellect/alphabet-sort")
print(env) # Shows environment configuration
Training with verl Backend (Local GPU)
The verl backend runs inference and training on your local GPUs using vLLM.
Run training
Shell configuration:
#!/bin/bash
set -x
# Environment variables for vLLM (managed internally by verl)
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:False"
export VLLM_USE_V1=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_ENGINE_ITERATION_TIMEOUT_S=100000000000
# ============================================================================
# CONFIGURE THESE FOR YOUR SETUP
# ============================================================================
# Model to train (HuggingFace path or local path)
MODEL_PATH="Qwen/Qwen3-4B"
# Verifiers environment configuration
VF_ENV_ID="math" # The verifiers environment to use
# VF_ENV_ARGS can be passed as JSON, e.g., '{"difficulty": "hard"}'
# GPU configuration
N_GPUS=1 # Number of GPUs per node
NNODES=1 # Number of nodes
# Training hyperparameters
TRAIN_BATCH_SIZE=64
VAL_BATCH_SIZE=128
MAX_PROMPT_LENGTH=2048
MAX_RESPONSE_LENGTH=2048
TOTAL_EPOCHS=100
MAX_STEPS=10 # Max turns for multi-turn environments
# Sampling parameters
ROLLOUT_N=8 # Number of samples per prompt (for GRPO/PPO)
TEMPERATURE=0.7
# Logging
PROJECT_NAME="verifiers-rl"
EXPERIMENT_NAME="${VF_ENV_ID}-training"
# ============================================================================
# RUN TRAINING
# ============================================================================
python3 -m examples.verifiers_env.train \
verifiers.env_id="${VF_ENV_ID}" \
algorithm.adv_estimator=grpo \
data.train_batch_size=${TRAIN_BATCH_SIZE} \
data.val_batch_size=${VAL_BATCH_SIZE} \
data.max_prompt_length=${MAX_PROMPT_LENGTH} \
data.max_response_length=${MAX_RESPONSE_LENGTH} \
actor_rollout_ref.model.path=${MODEL_PATH} \
actor_rollout_ref.hybrid_engine=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-sum \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.actor.clip_ratio_high=0.28 \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.mode="async" \
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.rollout.temperature=${TEMPERATURE} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.85 \
actor_rollout_ref.rollout.n=${ROLLOUT_N} \
actor_rollout_ref.rollout.val_kwargs.n=1 \
actor_rollout_ref.rollout.val_kwargs.temperature=0.7 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.actor.entropy_coeff=0 \
algorithm.kl_ctrl.kl_coef=0.001 \
rllm.mask_truncated_samples=False \
rllm.workflow.use_workflow=True \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name="${PROJECT_NAME}" \
trainer.experiment_name="${EXPERIMENT_NAME}" \
trainer.val_before_train=False \
trainer.n_gpus_per_node=${N_GPUS} \
trainer.nnodes=${NNODES} \
trainer.save_freq=40 \
trainer.test_freq=10 \
trainer.default_hdfs_dir=null \
rllm.agent.max_steps=${MAX_STEPS} \
trainer.total_epochs=${TOTAL_EPOCHS}
Key parameters to customize:
MODEL_PATH: HuggingFace model path (e.g.,Qwen/Qwen3-4B)VF_ENV_ID: Verifiers environment ID (e.g.,primeintellect/alphabet-sort)N_GPUS: Number of GPUs to useROLLOUT_N: Samples per prompt (GRPO group size)
Training with Tinker Backend (Remote GPU)
The Tinker backend offloads inference and LoRA training to Prime Intellect's GPU service.
Configure API key
Set your Tinker API key:
Or create a .env file in the example directory:
Run training
Shell configuration:
#!/bin/bash
set -x
# ============================================================================
# TINKER BACKEND TRAINING FOR VERIFIERS
# ============================================================================
#
# This uses Tinker for inference. You need:
# 1. A Tinker API key
# 2. Model available on Tinker
#
# Tinker handles:
# - Model inference (remote)
# - Weight management
# - LoRA training
#
# ============================================================================
# ============================================================================
# CONFIGURE THESE FOR YOUR SETUP
# ============================================================================
# Load API key from .env file
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if [ -f "$SCRIPT_DIR/.env" ]; then
source "$SCRIPT_DIR/.env"
export TINKER_API_KEY
fi
# Tinker service URL (null uses default Tinker cloud)
TINKER_BASE_URL=null
# Model to train (HuggingFace path)
MODEL_NAME="Qwen/Qwen3-4B-Instruct-2507"
# Verifiers environment configuration
VF_ENV_ID="primeintellect/alphabet-sort"
# LoRA configuration
LORA_RANK=32
# Training hyperparameters
GROUP_SIZE=16 # Rollouts per prompt (for GRPO)
LEARNING_RATE=2e-5
MAX_LENGTH=32768
TRAIN_BATCH_SIZE=64
VAL_BATCH_SIZE=32
TOTAL_EPOCHS=10
# Sampling parameters (MUST be 1.0 for Tinker workflow trainer)
TEMPERATURE=1.0
TOP_P=1.0
# Workflow configuration
N_PARALLEL_TASKS=256
RETRY_LIMIT=3
# Logging
PROJECT_NAME="verifiers-tinker"
EXPERIMENT_NAME="${VF_ENV_ID}-training"
# ============================================================================
# RUN TRAINING
# ============================================================================
python3 -m examples.verifiers_env.train \
--config-name=tinker_rl_trainer \
+backend=tinker \
tinker_base_url=${TINKER_BASE_URL} \
model.name="${MODEL_NAME}" \
model.lora_rank=${LORA_RANK} \
+verifiers.env_id="${VF_ENV_ID}" \
algorithm.adv_estimator=grpo \
training.group_size=${GROUP_SIZE} \
training.learning_rate=${LEARNING_RATE} \
training.max_length=${MAX_LENGTH} \
sampling.temperature=${TEMPERATURE} \
sampling.top_p=${TOP_P} \
data.train_batch_size=${TRAIN_BATCH_SIZE} \
data.val_batch_size=${VAL_BATCH_SIZE} \
data.max_prompt_length=2048 \
data.max_response_length=2048 \
workflow.n_parallel_tasks=${N_PARALLEL_TASKS} \
workflow.retry_limit=${RETRY_LIMIT} \
trainer.total_epochs=${TOTAL_EPOCHS} \
trainer.logger=['console','wandb'] \
trainer.project_name="${PROJECT_NAME}" \
trainer.experiment_name="${EXPERIMENT_NAME}" \
trainer.test_freq=5 \
trainer.save_freq=20 \
trainer.val_before_train=true
Key Tinker-specific parameters:
model.name: Model to fine-tune (e.g.,Qwen/Qwen3-4B-Instruct-2507)model.lora_rank: LoRA rank for trainingtraining.group_size: GRPO group size+backend=tinker: Selects the Tinker backend
Quick Test
To verify everything works with minimal resources:
This runs a minimal training loop with:
- 20 samples (
+verifiers.max_samples=20) - Batch size of 4
- Single epoch
Test script:
#!/bin/bash
set -x
# Minimal test script for verifiers training
# Reduced parallelism and batch sizes for quick debugging
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
if [ -f "$SCRIPT_DIR/.env" ]; then
source "$SCRIPT_DIR/.env"
export TINKER_API_KEY
fi
python3 -m examples.verifiers_env.train \
--config-name=tinker_rl_trainer \
+backend=tinker \
tinker_base_url=null \
model.name="Qwen/Qwen3-4B-Instruct-2507" \
model.lora_rank=16 \
+verifiers.env_id="primeintellect/alphabet-sort" \
+verifiers.max_samples=20 \
algorithm.adv_estimator=grpo \
training.group_size=4 \
training.learning_rate=2e-5 \
training.max_length=4096 \
sampling.temperature=1.0 \
sampling.top_p=1.0 \
data.train_batch_size=4 \
data.val_batch_size=4 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
workflow.n_parallel_tasks=4 \
workflow.retry_limit=1 \
trainer.total_epochs=1 \
'trainer.logger=[console,wandb]' \
trainer.project_name="verifiers-test" \
trainer.experiment_name="test-run" \
trainer.test_freq=1 \
trainer.save_freq=100 \
trainer.val_before_train=false
Architecture
How it works
┌─────────────────────────────────────────────────────────────┐
│ AgentTrainer │
│ (orchestrates training loop, handles batching) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ VerifiersWorkflow │
│ - Receives task from trainer │
│ - Wraps RolloutEngine as AsyncOpenAI client │
│ - Calls verifiers env.rollout() │
│ - Converts verifiers State → rllm Episode │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
┌─────────────────────────┐ ┌───────────────────────┐
│ RolloutEngineAsyncClient│ │ Verifiers Env │
│ (AsyncOpenAI adapter) │◄────────►│ (alphabet-sort, │
│ │ │ math, etc.) │
└─────────────────────────┘ └───────────────────────┘
│
▼
┌───────────────────────┐
│ RolloutEngine │
│ (verl / Tinker) │
└───────────────────────┘
Key files
| File | Description |
|---|---|
train.py |
Main entry point, loads environment and starts trainer |
workflow.py |
VerifiersWorkflow - bridges rLLM and verifiers |
openai_wrapper.py |
RolloutEngineAsyncClient - makes RolloutEngine look like AsyncOpenAI |
train_verifiers.sh |
Shell script for verl backend |
train_verifiers_tinker.sh |
Shell script for Tinker backend |
Code Reference
Training entry point
from typing import Any
import hydra
import verifiers as vf
from omegaconf import OmegaConf
from examples.verifiers_env.workflow import VerifiersWorkflow
from rllm.data.dataset import DatasetRegistry
from rllm.trainer.agent_trainer import AgentTrainer
@hydra.main(
config_path="pkg://rllm.trainer.config",
config_name="agent_ppo_trainer",
version_base=None,
)
def main(config):
# Access custom top-level config key with defaults
vf_env_id = OmegaConf.select(config, "verifiers.env_id", default="DefaultEnv")
vf_env_args_raw = OmegaConf.select(config, "verifiers.env_args", default=None)
vf_env_args: dict[str, Any] = (
OmegaConf.to_container(vf_env_args_raw, resolve=True) # type: ignore[assignment]
if vf_env_args_raw is not None
else {}
)
# Get sampling args for verifiers rollouts
vf_sampling_args_raw = OmegaConf.select(config, "verifiers.sampling_args", default=None)
vf_sampling_args: dict[str, Any] | None = (
OmegaConf.to_container(vf_sampling_args_raw, resolve=True) # type: ignore[assignment]
if vf_sampling_args_raw is not None
else None
)
vf_env = vf.load_environment(vf_env_id, **vf_env_args)
# Registering during train time since for some environments, creating some datasets for verifiers environments needs resources to spin up.
# Get max_samples limit (for testing with smaller datasets)
max_samples = OmegaConf.select(config, "verifiers.max_samples", default=None)
if DatasetRegistry.dataset_exists(vf_env_id, "train"):
train_dataset = DatasetRegistry.load_dataset(vf_env_id, "train")
test_dataset = DatasetRegistry.load_dataset(vf_env_id, "test")
else:
vf_train_dataset = vf_env.get_dataset()
vf_eval_dataset = vf_env.get_eval_dataset()
train_dataset = DatasetRegistry.register_dataset(vf_env_id, vf_train_dataset, "train")
test_dataset = DatasetRegistry.register_dataset(vf_env_id, vf_eval_dataset, "test")
# Limit dataset size if max_samples specified
if max_samples:
train_dataset = train_dataset.select(range(min(max_samples, len(train_dataset))))
test_dataset = test_dataset.select(range(min(max_samples, len(test_dataset))))
# Get backend from config (default: verl)
backend = OmegaConf.select(config, "backend", default="verl")
trainer = AgentTrainer(
workflow_class=VerifiersWorkflow,
workflow_args={
"vf_env": vf_env,
"sampling_args": vf_sampling_args,
},
train_dataset=train_dataset,
val_dataset=test_dataset,
config=config,
backend=backend,
)
trainer.train()
if __name__ == "__main__":
main()
VerifiersWorkflow
The workflow that bridges rLLM and verifiers:
"""
VerifiersWorkflow: rLLM Workflow that uses verifiers for RL environments.
This workflow wraps the rllm RolloutEngine as an AsyncOpenAI-compatible client,
allowing seamless integration with verifiers environments during training.
"""
from typing import Any
import verifiers as vf
from verifiers.types import State, TrajectoryStep
from verifiers.utils.async_utils import maybe_semaphore
from examples.verifiers_env.openai_wrapper import RolloutEngineAsyncClient
from rllm.agents.agent import Episode, Step, Trajectory
from rllm.engine.rollout.rollout_engine import RolloutEngine
from rllm.workflows.workflow import Workflow
class VerifiersWorkflow(Workflow):
"""
rLLM Workflow that uses verifiers for RL environments and generating rollouts.
This workflow:
1. Wraps the rllm RolloutEngine as an AsyncOpenAI-compatible client
2. Passes the client to a verifiers Environment for rollout generation
3. Converts verifiers State/TrajectorySteps to rllm Episode/Trajectory/Step
Example usage:
trainer = AgentTrainer(
workflow_class=VerifiersWorkflow,
workflow_args={
"vf_env": vf.load_environment("math", ...),
},
train_dataset=train_dataset,
config=config,
)
"""
def __init__(
self,
rollout_engine: RolloutEngine,
vf_env: vf.Environment,
sampling_args: dict[str, Any] | None = None,
**kwargs,
):
super().__init__(rollout_engine=rollout_engine, **kwargs)
self.vf_env = vf_env
self.sampling_args = sampling_args or {}
async def run(self, task: dict, uid: str, **kwargs) -> Episode:
"""
Run a single rollout using the verifiers environment.
Args:
task: Task dict containing 'prompt', 'example_id', 'task', etc.
uid: Unique identifier for this rollout (format: "task_id:rollout_idx")
Returns:
Episode containing the trajectory with rewards from verifiers
"""
self.reset(task, uid)
# Create verifiers-compatible client from rllm's rollout engine
# TinkerEngine uses model_name, OpenAIEngine uses model
model = getattr(self.rollout_engine, "model", None) or getattr(self.rollout_engine, "model_name", "unknown")
client = RolloutEngineAsyncClient(
rollout_engine=self.rollout_engine,
model=model,
application_id_fn=lambda: uid,
)
# Build verifiers RolloutInput from rllm task
rollout_input = {
"prompt": task.get("prompt") or task.get("messages", []),
"example_id": task.get("example_id", 0),
"task": task.get("task", "default"),
}
if "answer" in task:
rollout_input["answer"] = task["answer"]
if "info" in task:
rollout_input["info"] = task["info"]
# Run verifiers rollout
state: State = await self.vf_env.rollout(
input=rollout_input,
client=client,
model=client._model,
sampling_args=self.sampling_args,
)
score_sem = await maybe_semaphore(-1) # -1 means no limit
await self.vf_env.rubric.score_rollout(state, score_sem=score_sem)
# Convert verifiers State to rllm Episode
episode = self._convert_state_to_episode(state, uid)
return episode
def _convert_state_to_episode(self, state: State, uid: str) -> Episode:
"""
Convert a verifiers State to an rllm Episode.
Args:
state: Completed verifiers State with trajectory and reward
uid: Unique identifier for this episode
Returns:
rllm Episode with converted trajectories
"""
trajectory_steps: list[TrajectoryStep] = state.get("trajectory", [])
# Build rllm Steps from verifiers TrajectorySteps
steps = []
for traj_step in trajectory_steps:
tokens = traj_step.get("tokens")
step = Step(
prompt_ids=tokens["prompt_ids"] if tokens else [],
response_ids=tokens["completion_ids"] if tokens else [],
logprobs=tokens["completion_logprobs"] if tokens else [],
reward=traj_step.get("reward", 0.0),
)
steps.append(step)
# Create trajectory with final reward
trajectory = Trajectory(
steps=steps,
reward=state.get("reward", 0.0),
name="verifiers",
)
# Create episode
episode = Episode(
trajectories=[trajectory],
is_correct=state.get("reward", 0.0) > 0,
)
episode.id = uid
episode.task = state.get("task", {})
# Add metrics if available
if state.get("metrics"):
episode.metrics = state["metrics"]
return episode
RolloutEngine AsyncOpenAI Wrapper
Adapter that makes RolloutEngine compatible with verifiers:
"""Wrapper that makes RolloutEngine look like AsyncOpenAI for verifiers compatibility."""
from __future__ import annotations
import json
import time
from collections.abc import Callable
from typing import TYPE_CHECKING
from openai.types.chat import ChatCompletion
from openai.types.chat.chat_completion import Choice, ChoiceLogprobs, CompletionUsage
from openai.types.chat.chat_completion_message import ChatCompletionMessage
from openai.types.chat.chat_completion_message_tool_call import (
ChatCompletionMessageToolCall,
Function,
)
from openai.types.chat.chat_completion_token_logprob import ChatCompletionTokenLogprob
if TYPE_CHECKING:
from rllm.engine.rollout.rollout_engine import RolloutEngine
class ChatCompletions:
"""Implements client.chat.completions interface."""
def __init__(
self,
engine: RolloutEngine,
model: str,
application_id_fn: Callable[[], str] | None = None,
):
self._engine = engine
self._model = model
self._application_id_fn = application_id_fn or (lambda: "default")
async def create(
self,
*,
model: str,
messages: list[dict],
tools: list[dict] | None = None,
**kwargs,
) -> ChatCompletion:
"""Call RolloutEngine and return OpenAI-compatible ChatCompletion."""
# Normalize sampling args
sampling_args = dict(kwargs)
# max_completion_tokens → max_tokens for engine
if "max_completion_tokens" in sampling_args:
sampling_args["max_tokens"] = sampling_args.pop("max_completion_tokens")
# Remove args the engine doesn't understand
sampling_args.pop("extra_body", None)
sampling_args.pop("modalities", None)
# Call the engine
output = await self._engine.get_model_response(
messages=messages,
application_id=self._application_id_fn(),
tools=tools or [],
**sampling_args,
)
# Convert tool_calls if present
# tool_calls can be:
# - list[ToolCall] (rllm dataclass with name, arguments)
# - list[ChatCompletionMessageToolCall] (raw OpenAI objects)
# - list[dict] (parsed dicts)
tool_calls = None
if output.tool_calls:
converted_tool_calls = []
for i, tc in enumerate(output.tool_calls):
# Already an OpenAI ChatCompletionMessageToolCall - pass through
if isinstance(tc, ChatCompletionMessageToolCall):
converted_tool_calls.append(tc)
# rllm ToolCall dataclass or dict
else:
tc_id = getattr(tc, "id", None) or f"call_{i}"
tc_name = getattr(tc, "name", None) or (tc.get("name") if isinstance(tc, dict) else "")
tc_args = getattr(tc, "arguments", None) or (tc.get("arguments") if isinstance(tc, dict) else {})
# arguments can be dict or string
if isinstance(tc_args, dict):
tc_args = json.dumps(tc_args)
converted_tool_calls.append(
ChatCompletionMessageToolCall(
id=tc_id,
type="function",
function=Function(name=tc_name, arguments=tc_args),
)
)
tool_calls = converted_tool_calls if converted_tool_calls else None
# Build logprobs if available
logprobs = None
if output.logprobs:
token_logprobs = [
ChatCompletionTokenLogprob(
token=f"<token_{i}>", # placeholder, we don't have decoded tokens
bytes=None,
logprob=float(lp) if lp is not None else 0.0,
top_logprobs=[],
)
for i, lp in enumerate(output.logprobs)
]
logprobs = ChoiceLogprobs(content=token_logprobs, refusal=None)
# Build the choice
choice = Choice(
index=0,
message=ChatCompletionMessage(
role="assistant",
content=output.content or output.text,
tool_calls=tool_calls,
),
finish_reason=output.finish_reason or "stop",
logprobs=logprobs,
)
# Add vLLM extensions as attributes
choice.token_ids = output.completion_ids # type: ignore[attr-defined]
# Build the response
response = ChatCompletion(
id=f"chatcmpl-{int(time.time() * 1000)}",
model=model or self._model,
object="chat.completion",
created=int(time.time()),
choices=[choice],
usage=CompletionUsage(
prompt_tokens=output.prompt_length or 0,
completion_tokens=output.completion_length or 0,
total_tokens=(output.prompt_length or 0) + (output.completion_length or 0),
),
)
# Add vLLM extension
response.prompt_token_ids = output.prompt_ids # type: ignore[attr-defined]
return response
class Chat:
"""Implements client.chat interface."""
def __init__(
self,
engine: RolloutEngine,
model: str,
application_id_fn: Callable[[], str] | None = None,
):
self.completions = ChatCompletions(engine, model, application_id_fn)
class RolloutEngineAsyncClient:
"""
Wrapper that makes RolloutEngine look like AsyncOpenAI.
Use this to integrate rllm with libraries that expect an AsyncOpenAI client,
such as the verifiers library.
Example:
client = RolloutEngineAsyncClient(
rollout_engine=self.rollout_engine,
model="Qwen/Qwen3-0.6B",
application_id_fn=lambda: uid,
)
# Now use with verifiers
response = await client.chat.completions.create(
model="...",
messages=[...],
)
"""
def __init__(
self,
rollout_engine: RolloutEngine,
model: str = "",
application_id_fn: Callable[[], str] | None = None,
base_url: str = "http://rllm-internal",
):
self._engine = rollout_engine
self._model = model or getattr(rollout_engine, "model", "unknown")
self._application_id_fn = application_id_fn
# Properties that verifiers accesses
self.base_url = base_url
self.api_key = "rllm-internal"
# Build interface
self.chat = Chat(rollout_engine, self._model, application_id_fn)
Configuration Reference
Verifiers-specific config
| Parameter | Description | Example |
|---|---|---|
+verifiers.env_id |
Environment ID from Hub | primeintellect/alphabet-sort |
+verifiers.env_args |
Environment constructor args (JSON) | '{"difficulty": "hard"}' |
+verifiers.sampling_args |
Sampling args for rollouts | '{"temperature": 0.7}' |
+verifiers.max_samples |
Limit dataset size (for testing) | 100 |
+backend |
Training backend | verl or tinker |
Common training config
| Parameter | Description | Default |
|---|---|---|
data.train_batch_size |
Prompts per batch | 64 |
data.max_prompt_length |
Max prompt tokens | 2048 |
data.max_response_length |
Max response tokens | 2048 |
trainer.total_epochs |
Training epochs | 100 |
trainer.logger |
Logging backends | ['console','wandb'] |
verl-specific config
| Parameter | Description |
|---|---|
actor_rollout_ref.model.path |
HuggingFace model path |
actor_rollout_ref.rollout.n |
Samples per prompt |
trainer.n_gpus_per_node |
GPUs to use |
Tinker-specific config
| Parameter | Description |
|---|---|
model.name |
Model to fine-tune |
model.lora_rank |
LoRA rank |
training.group_size |
GRPO group size |
training.learning_rate |
Learning rate |
Troubleshooting
Environment not found
If prime env install fails:
- Check you're logged in:
prime login - Verify the environment exists on the Hub
- Check the exact name format:
owner/environment-name
Tinker API key errors
Ensure the key is set:
Or check your .env file is in the right location.
Dataset too large
Use +verifiers.max_samples to limit dataset size:
python -m examples.verifiers_env.train \
+verifiers.env_id="primeintellect/alphabet-sort" \
+verifiers.max_samples=1000 \
...
Hydra config errors
When adding new config keys, use +key=value (with plus):