Skip to content

RL Training with Prime Intellect's Environment Hub

This example shows how to train RL agents using environments from Prime Intellect's Environment Hub with the verifiers library.

Overview

With this example you will:

  1. Install and configure the verifiers library for RL environments
  2. Download environments from Prime Intellect's Hub using the Prime CLI
  3. Train agents using either the verl (local GPU) or Tinker (remote GPU) backend
  4. Understand how rLLM wraps verifiers environments via VerifiersWorkflow

Under the hood, the integration works as:

  • VerifiersWorkflow: Wraps rLLM's RolloutEngine as an AsyncOpenAI-compatible client
  • RolloutEngineAsyncClient: Adapter that lets verifiers environments call rLLM for inference
  • Rubric scoring: Verifiers environments define rubrics that score model outputs

Setup

1. Install dependencies

Install rLLM with the verifiers extra:

uv sync --extra verifiers

This installs the verifiers library alongside rLLM.

For Tinker backend (remote GPU training), also install:

uv sync --extra tinker

2. Install Prime CLI

The Prime CLI lets you download environments from the Hub. Install it via:

uv tool install prime

Or with pipx:

pipx install prime

3. Login to Prime Intellect (optional)

Login is required for private environments:

prime login

This opens a browser for authentication. Public environments work without login.

4. Install an environment

Download an environment locally:

prime env install primeintellect/alphabet-sort

Format: prime env install <owner>/<environment-name>

For a specific version:

prime env install primeintellect/alphabet-sort@0.1.0

To see available installation methods:

prime env info primeintellect/alphabet-sort

See the Prime Intellect docs for more details.


Environment Arguments

Each verifiers environment can accept custom arguments via +verifiers.env_args. These are passed directly to vf.load_environment().

Passing environment args

Use Hydra's nested syntax:

python -m examples.verifiers_env.train \
    +verifiers.env_id="primeintellect/alphabet-sort" \
    '+verifiers.env_args={max_turns: 10, difficulty: "hard"}' \
    ...

Or in the shell script:

VF_ENV_ARGS='max_turns: 20'

python -m examples.verifiers_env.train \
    +verifiers.env_id="${VF_ENV_ID}" \
    "+verifiers.env_args={${VF_ENV_ARGS}}" \
    ...

Common environment arguments

Argument Description Example
max_turns Maximum conversation turns 10
tools List of tools to enable ["search", "calculator"]
system_prompt Custom system prompt "You are a helpful assistant"

Check each environment's documentation for its specific arguments. You can also inspect an environment locally:

import verifiers as vf

# See what arguments load_environment accepts
env = vf.load_environment("primeintellect/alphabet-sort")
print(env)  # Shows environment configuration

Training with verl Backend (Local GPU)

The verl backend runs inference and training on your local GPUs using vLLM.

Run training

cd examples/verifiers_env
bash train_verifiers.sh

Shell configuration:

examples/verifiers_env/train_verifiers.sh
#!/bin/bash
set -x

# Environment variables for vLLM (managed internally by verl)
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:False"
export VLLM_USE_V1=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_ENGINE_ITERATION_TIMEOUT_S=100000000000

# ============================================================================
# CONFIGURE THESE FOR YOUR SETUP
# ============================================================================

# Model to train (HuggingFace path or local path)
MODEL_PATH="Qwen/Qwen3-4B"

# Verifiers environment configuration
VF_ENV_ID="math"  # The verifiers environment to use
# VF_ENV_ARGS can be passed as JSON, e.g., '{"difficulty": "hard"}'

# GPU configuration
N_GPUS=1  # Number of GPUs per node
NNODES=1  # Number of nodes

# Training hyperparameters
TRAIN_BATCH_SIZE=64
VAL_BATCH_SIZE=128
MAX_PROMPT_LENGTH=2048
MAX_RESPONSE_LENGTH=2048
TOTAL_EPOCHS=100
MAX_STEPS=10  # Max turns for multi-turn environments

# Sampling parameters
ROLLOUT_N=8  # Number of samples per prompt (for GRPO/PPO)
TEMPERATURE=0.7

# Logging
PROJECT_NAME="verifiers-rl"
EXPERIMENT_NAME="${VF_ENV_ID}-training"

# ============================================================================
# RUN TRAINING
# ============================================================================

python3 -m examples.verifiers_env.train \
    verifiers.env_id="${VF_ENV_ID}" \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=${TRAIN_BATCH_SIZE} \
    data.val_batch_size=${VAL_BATCH_SIZE} \
    data.max_prompt_length=${MAX_PROMPT_LENGTH} \
    data.max_response_length=${MAX_RESPONSE_LENGTH} \
    actor_rollout_ref.model.path=${MODEL_PATH} \
    actor_rollout_ref.hybrid_engine=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-sum \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.clip_ratio_high=0.28 \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.mode="async" \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.temperature=${TEMPERATURE} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.85 \
    actor_rollout_ref.rollout.n=${ROLLOUT_N} \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    actor_rollout_ref.rollout.val_kwargs.temperature=0.7 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.entropy_coeff=0 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    rllm.mask_truncated_samples=False \
    rllm.workflow.use_workflow=True \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name="${PROJECT_NAME}" \
    trainer.experiment_name="${EXPERIMENT_NAME}" \
    trainer.val_before_train=False \
    trainer.n_gpus_per_node=${N_GPUS} \
    trainer.nnodes=${NNODES} \
    trainer.save_freq=40 \
    trainer.test_freq=10 \
    trainer.default_hdfs_dir=null \
    rllm.agent.max_steps=${MAX_STEPS} \
    trainer.total_epochs=${TOTAL_EPOCHS}

Key parameters to customize:

  • MODEL_PATH: HuggingFace model path (e.g., Qwen/Qwen3-4B)
  • VF_ENV_ID: Verifiers environment ID (e.g., primeintellect/alphabet-sort)
  • N_GPUS: Number of GPUs to use
  • ROLLOUT_N: Samples per prompt (GRPO group size)

Training with Tinker Backend (Remote GPU)

The Tinker backend offloads inference and LoRA training to Prime Intellect's GPU service.

Configure API key

Set your Tinker API key:

export TINKER_API_KEY=your_api_key_here

Or create a .env file in the example directory:

# examples/verifiers_env/.env
TINKER_API_KEY=your_api_key_here

Run training

cd examples/verifiers_env
bash train_verifiers_tinker.sh

Shell configuration:

examples/verifiers_env/train_verifiers_tinker.sh
#!/bin/bash
set -x

# ============================================================================
# TINKER BACKEND TRAINING FOR VERIFIERS
# ============================================================================
#
# This uses Tinker for inference. You need:
# 1. A Tinker API key
# 2. Model available on Tinker
#
# Tinker handles:
#   - Model inference (remote)
#   - Weight management
#   - LoRA training
#
# ============================================================================

# ============================================================================
# CONFIGURE THESE FOR YOUR SETUP
# ============================================================================

# Load API key from .env file
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if [ -f "$SCRIPT_DIR/.env" ]; then
    source "$SCRIPT_DIR/.env"
    export TINKER_API_KEY
fi

# Tinker service URL (null uses default Tinker cloud)
TINKER_BASE_URL=null

# Model to train (HuggingFace path)
MODEL_NAME="Qwen/Qwen3-4B-Instruct-2507"

# Verifiers environment configuration
VF_ENV_ID="primeintellect/alphabet-sort"

# LoRA configuration
LORA_RANK=32

# Training hyperparameters
GROUP_SIZE=16          # Rollouts per prompt (for GRPO)
LEARNING_RATE=2e-5
MAX_LENGTH=32768
TRAIN_BATCH_SIZE=64
VAL_BATCH_SIZE=32
TOTAL_EPOCHS=10

# Sampling parameters (MUST be 1.0 for Tinker workflow trainer)
TEMPERATURE=1.0
TOP_P=1.0

# Workflow configuration
N_PARALLEL_TASKS=256
RETRY_LIMIT=3

# Logging
PROJECT_NAME="verifiers-tinker"
EXPERIMENT_NAME="${VF_ENV_ID}-training"

# ============================================================================
# RUN TRAINING
# ============================================================================

python3 -m examples.verifiers_env.train \
    --config-name=tinker_rl_trainer \
    +backend=tinker \
    tinker_base_url=${TINKER_BASE_URL} \
    model.name="${MODEL_NAME}" \
    model.lora_rank=${LORA_RANK} \
    +verifiers.env_id="${VF_ENV_ID}" \
    algorithm.adv_estimator=grpo \
    training.group_size=${GROUP_SIZE} \
    training.learning_rate=${LEARNING_RATE} \
    training.max_length=${MAX_LENGTH} \
    sampling.temperature=${TEMPERATURE} \
    sampling.top_p=${TOP_P} \
    data.train_batch_size=${TRAIN_BATCH_SIZE} \
    data.val_batch_size=${VAL_BATCH_SIZE} \
    data.max_prompt_length=2048 \
    data.max_response_length=2048 \
    workflow.n_parallel_tasks=${N_PARALLEL_TASKS} \
    workflow.retry_limit=${RETRY_LIMIT} \
    trainer.total_epochs=${TOTAL_EPOCHS} \
    trainer.logger=['console','wandb'] \
    trainer.project_name="${PROJECT_NAME}" \
    trainer.experiment_name="${EXPERIMENT_NAME}" \
    trainer.test_freq=5 \
    trainer.save_freq=20 \
    trainer.val_before_train=true

Key Tinker-specific parameters:

  • model.name: Model to fine-tune (e.g., Qwen/Qwen3-4B-Instruct-2507)
  • model.lora_rank: LoRA rank for training
  • training.group_size: GRPO group size
  • +backend=tinker: Selects the Tinker backend

Quick Test

To verify everything works with minimal resources:

cd examples/verifiers_env
bash test_train.sh

This runs a minimal training loop with:

  • 20 samples (+verifiers.max_samples=20)
  • Batch size of 4
  • Single epoch

Test script:

examples/verifiers_env/test_train.sh
#!/bin/bash
set -x

# Minimal test script for verifiers training
# Reduced parallelism and batch sizes for quick debugging

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
if [ -f "$SCRIPT_DIR/.env" ]; then
    source "$SCRIPT_DIR/.env"
    export TINKER_API_KEY
fi

python3 -m examples.verifiers_env.train \
    --config-name=tinker_rl_trainer \
    +backend=tinker \
    tinker_base_url=null \
    model.name="Qwen/Qwen3-4B-Instruct-2507" \
    model.lora_rank=16 \
    +verifiers.env_id="primeintellect/alphabet-sort" \
    +verifiers.max_samples=20 \
    algorithm.adv_estimator=grpo \
    training.group_size=4 \
    training.learning_rate=2e-5 \
    training.max_length=4096 \
    sampling.temperature=1.0 \
    sampling.top_p=1.0 \
    data.train_batch_size=4 \
    data.val_batch_size=4 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    workflow.n_parallel_tasks=4 \
    workflow.retry_limit=1 \
    trainer.total_epochs=1 \
    'trainer.logger=[console,wandb]' \
    trainer.project_name="verifiers-test" \
    trainer.experiment_name="test-run" \
    trainer.test_freq=1 \
    trainer.save_freq=100 \
    trainer.val_before_train=false

Architecture

How it works

┌─────────────────────────────────────────────────────────────┐
│                      AgentTrainer                           │
│  (orchestrates training loop, handles batching)             │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    VerifiersWorkflow                        │
│  - Receives task from trainer                               │
│  - Wraps RolloutEngine as AsyncOpenAI client                │
│  - Calls verifiers env.rollout()                            │
│  - Converts verifiers State → rllm Episode                  │
└─────────────────────────────────────────────────────────────┘
            ┌─────────────────┴─────────────────┐
            ▼                                   ▼
┌─────────────────────────┐          ┌───────────────────────┐
│ RolloutEngineAsyncClient│          │   Verifiers Env       │
│ (AsyncOpenAI adapter)   │◄────────►│   (alphabet-sort,     │
│                         │          │    math, etc.)        │
└─────────────────────────┘          └───────────────────────┘
┌───────────────────────┐
│    RolloutEngine      │
│  (verl / Tinker)      │
└───────────────────────┘

Key files

File Description
train.py Main entry point, loads environment and starts trainer
workflow.py VerifiersWorkflow - bridges rLLM and verifiers
openai_wrapper.py RolloutEngineAsyncClient - makes RolloutEngine look like AsyncOpenAI
train_verifiers.sh Shell script for verl backend
train_verifiers_tinker.sh Shell script for Tinker backend

Code Reference

Training entry point

examples/verifiers_env/train.py
from typing import Any

import hydra
import verifiers as vf
from omegaconf import OmegaConf

from examples.verifiers_env.workflow import VerifiersWorkflow
from rllm.data.dataset import DatasetRegistry
from rllm.trainer.agent_trainer import AgentTrainer


@hydra.main(
    config_path="pkg://rllm.trainer.config",
    config_name="agent_ppo_trainer",
    version_base=None,
)
def main(config):
    # Access custom top-level config key with defaults
    vf_env_id = OmegaConf.select(config, "verifiers.env_id", default="DefaultEnv")
    vf_env_args_raw = OmegaConf.select(config, "verifiers.env_args", default=None)

    vf_env_args: dict[str, Any] = (
        OmegaConf.to_container(vf_env_args_raw, resolve=True)  # type: ignore[assignment]
        if vf_env_args_raw is not None
        else {}
    )

    # Get sampling args for verifiers rollouts
    vf_sampling_args_raw = OmegaConf.select(config, "verifiers.sampling_args", default=None)
    vf_sampling_args: dict[str, Any] | None = (
        OmegaConf.to_container(vf_sampling_args_raw, resolve=True)  # type: ignore[assignment]
        if vf_sampling_args_raw is not None
        else None
    )

    vf_env = vf.load_environment(vf_env_id, **vf_env_args)

    # Registering during train time since for some environments, creating some datasets for verifiers environments needs resources to spin up.

    # Get max_samples limit (for testing with smaller datasets)
    max_samples = OmegaConf.select(config, "verifiers.max_samples", default=None)

    if DatasetRegistry.dataset_exists(vf_env_id, "train"):
        train_dataset = DatasetRegistry.load_dataset(vf_env_id, "train")
        test_dataset = DatasetRegistry.load_dataset(vf_env_id, "test")
    else:
        vf_train_dataset = vf_env.get_dataset()
        vf_eval_dataset = vf_env.get_eval_dataset()

        train_dataset = DatasetRegistry.register_dataset(vf_env_id, vf_train_dataset, "train")

        test_dataset = DatasetRegistry.register_dataset(vf_env_id, vf_eval_dataset, "test")

    # Limit dataset size if max_samples specified
    if max_samples:
        train_dataset = train_dataset.select(range(min(max_samples, len(train_dataset))))
        test_dataset = test_dataset.select(range(min(max_samples, len(test_dataset))))

    # Get backend from config (default: verl)
    backend = OmegaConf.select(config, "backend", default="verl")

    trainer = AgentTrainer(
        workflow_class=VerifiersWorkflow,
        workflow_args={
            "vf_env": vf_env,
            "sampling_args": vf_sampling_args,
        },
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        config=config,
        backend=backend,
    )
    trainer.train()


if __name__ == "__main__":
    main()

VerifiersWorkflow

The workflow that bridges rLLM and verifiers:

examples/verifiers_env/workflow.py
"""
VerifiersWorkflow: rLLM Workflow that uses verifiers for RL environments.

This workflow wraps the rllm RolloutEngine as an AsyncOpenAI-compatible client,
allowing seamless integration with verifiers environments during training.
"""

from typing import Any

import verifiers as vf
from verifiers.types import State, TrajectoryStep
from verifiers.utils.async_utils import maybe_semaphore

from examples.verifiers_env.openai_wrapper import RolloutEngineAsyncClient
from rllm.agents.agent import Episode, Step, Trajectory
from rllm.engine.rollout.rollout_engine import RolloutEngine
from rllm.workflows.workflow import Workflow


class VerifiersWorkflow(Workflow):
    """
    rLLM Workflow that uses verifiers for RL environments and generating rollouts.

    This workflow:
    1. Wraps the rllm RolloutEngine as an AsyncOpenAI-compatible client
    2. Passes the client to a verifiers Environment for rollout generation
    3. Converts verifiers State/TrajectorySteps to rllm Episode/Trajectory/Step

    Example usage:
        trainer = AgentTrainer(
            workflow_class=VerifiersWorkflow,
            workflow_args={
                "vf_env": vf.load_environment("math", ...),
            },
            train_dataset=train_dataset,
            config=config,
        )
    """

    def __init__(
        self,
        rollout_engine: RolloutEngine,
        vf_env: vf.Environment,
        sampling_args: dict[str, Any] | None = None,
        **kwargs,
    ):
        super().__init__(rollout_engine=rollout_engine, **kwargs)
        self.vf_env = vf_env
        self.sampling_args = sampling_args or {}

    async def run(self, task: dict, uid: str, **kwargs) -> Episode:
        """
        Run a single rollout using the verifiers environment.

        Args:
            task: Task dict containing 'prompt', 'example_id', 'task', etc.
            uid: Unique identifier for this rollout (format: "task_id:rollout_idx")

        Returns:
            Episode containing the trajectory with rewards from verifiers
        """
        self.reset(task, uid)

        # Create verifiers-compatible client from rllm's rollout engine
        # TinkerEngine uses model_name, OpenAIEngine uses model
        model = getattr(self.rollout_engine, "model", None) or getattr(self.rollout_engine, "model_name", "unknown")
        client = RolloutEngineAsyncClient(
            rollout_engine=self.rollout_engine,
            model=model,
            application_id_fn=lambda: uid,
        )

        # Build verifiers RolloutInput from rllm task
        rollout_input = {
            "prompt": task.get("prompt") or task.get("messages", []),
            "example_id": task.get("example_id", 0),
            "task": task.get("task", "default"),
        }
        if "answer" in task:
            rollout_input["answer"] = task["answer"]
        if "info" in task:
            rollout_input["info"] = task["info"]

        # Run verifiers rollout
        state: State = await self.vf_env.rollout(
            input=rollout_input,
            client=client,
            model=client._model,
            sampling_args=self.sampling_args,
        )
        score_sem = await maybe_semaphore(-1)  # -1 means no limit

        await self.vf_env.rubric.score_rollout(state, score_sem=score_sem)

        # Convert verifiers State to rllm Episode
        episode = self._convert_state_to_episode(state, uid)

        return episode

    def _convert_state_to_episode(self, state: State, uid: str) -> Episode:
        """
        Convert a verifiers State to an rllm Episode.

        Args:
            state: Completed verifiers State with trajectory and reward
            uid: Unique identifier for this episode

        Returns:
            rllm Episode with converted trajectories
        """
        trajectory_steps: list[TrajectoryStep] = state.get("trajectory", [])

        # Build rllm Steps from verifiers TrajectorySteps
        steps = []
        for traj_step in trajectory_steps:
            tokens = traj_step.get("tokens")
            step = Step(
                prompt_ids=tokens["prompt_ids"] if tokens else [],
                response_ids=tokens["completion_ids"] if tokens else [],
                logprobs=tokens["completion_logprobs"] if tokens else [],
                reward=traj_step.get("reward", 0.0),
            )
            steps.append(step)

        # Create trajectory with final reward
        trajectory = Trajectory(
            steps=steps,
            reward=state.get("reward", 0.0),
            name="verifiers",
        )

        # Create episode
        episode = Episode(
            trajectories=[trajectory],
            is_correct=state.get("reward", 0.0) > 0,
        )
        episode.id = uid
        episode.task = state.get("task", {})

        # Add metrics if available
        if state.get("metrics"):
            episode.metrics = state["metrics"]

        return episode

RolloutEngine AsyncOpenAI Wrapper

Adapter that makes RolloutEngine compatible with verifiers:

examples/verifiers_env/openai_wrapper.py
"""Wrapper that makes RolloutEngine look like AsyncOpenAI for verifiers compatibility."""

from __future__ import annotations

import json
import time
from collections.abc import Callable
from typing import TYPE_CHECKING

from openai.types.chat import ChatCompletion
from openai.types.chat.chat_completion import Choice, ChoiceLogprobs, CompletionUsage
from openai.types.chat.chat_completion_message import ChatCompletionMessage
from openai.types.chat.chat_completion_message_tool_call import (
    ChatCompletionMessageToolCall,
    Function,
)
from openai.types.chat.chat_completion_token_logprob import ChatCompletionTokenLogprob

if TYPE_CHECKING:
    from rllm.engine.rollout.rollout_engine import RolloutEngine


class ChatCompletions:
    """Implements client.chat.completions interface."""

    def __init__(
        self,
        engine: RolloutEngine,
        model: str,
        application_id_fn: Callable[[], str] | None = None,
    ):
        self._engine = engine
        self._model = model
        self._application_id_fn = application_id_fn or (lambda: "default")

    async def create(
        self,
        *,
        model: str,
        messages: list[dict],
        tools: list[dict] | None = None,
        **kwargs,
    ) -> ChatCompletion:
        """Call RolloutEngine and return OpenAI-compatible ChatCompletion."""
        # Normalize sampling args
        sampling_args = dict(kwargs)

        # max_completion_tokens → max_tokens for engine
        if "max_completion_tokens" in sampling_args:
            sampling_args["max_tokens"] = sampling_args.pop("max_completion_tokens")

        # Remove args the engine doesn't understand
        sampling_args.pop("extra_body", None)
        sampling_args.pop("modalities", None)

        # Call the engine
        output = await self._engine.get_model_response(
            messages=messages,
            application_id=self._application_id_fn(),
            tools=tools or [],
            **sampling_args,
        )

        # Convert tool_calls if present
        # tool_calls can be:
        # - list[ToolCall] (rllm dataclass with name, arguments)
        # - list[ChatCompletionMessageToolCall] (raw OpenAI objects)
        # - list[dict] (parsed dicts)
        tool_calls = None
        if output.tool_calls:
            converted_tool_calls = []
            for i, tc in enumerate(output.tool_calls):
                # Already an OpenAI ChatCompletionMessageToolCall - pass through
                if isinstance(tc, ChatCompletionMessageToolCall):
                    converted_tool_calls.append(tc)
                # rllm ToolCall dataclass or dict
                else:
                    tc_id = getattr(tc, "id", None) or f"call_{i}"
                    tc_name = getattr(tc, "name", None) or (tc.get("name") if isinstance(tc, dict) else "")
                    tc_args = getattr(tc, "arguments", None) or (tc.get("arguments") if isinstance(tc, dict) else {})
                    # arguments can be dict or string
                    if isinstance(tc_args, dict):
                        tc_args = json.dumps(tc_args)
                    converted_tool_calls.append(
                        ChatCompletionMessageToolCall(
                            id=tc_id,
                            type="function",
                            function=Function(name=tc_name, arguments=tc_args),
                        )
                    )
            tool_calls = converted_tool_calls if converted_tool_calls else None

        # Build logprobs if available
        logprobs = None
        if output.logprobs:
            token_logprobs = [
                ChatCompletionTokenLogprob(
                    token=f"<token_{i}>",  # placeholder, we don't have decoded tokens
                    bytes=None,
                    logprob=float(lp) if lp is not None else 0.0,
                    top_logprobs=[],
                )
                for i, lp in enumerate(output.logprobs)
            ]
            logprobs = ChoiceLogprobs(content=token_logprobs, refusal=None)

        # Build the choice
        choice = Choice(
            index=0,
            message=ChatCompletionMessage(
                role="assistant",
                content=output.content or output.text,
                tool_calls=tool_calls,
            ),
            finish_reason=output.finish_reason or "stop",
            logprobs=logprobs,
        )

        # Add vLLM extensions as attributes
        choice.token_ids = output.completion_ids  # type: ignore[attr-defined]

        # Build the response
        response = ChatCompletion(
            id=f"chatcmpl-{int(time.time() * 1000)}",
            model=model or self._model,
            object="chat.completion",
            created=int(time.time()),
            choices=[choice],
            usage=CompletionUsage(
                prompt_tokens=output.prompt_length or 0,
                completion_tokens=output.completion_length or 0,
                total_tokens=(output.prompt_length or 0) + (output.completion_length or 0),
            ),
        )

        # Add vLLM extension
        response.prompt_token_ids = output.prompt_ids  # type: ignore[attr-defined]

        return response


class Chat:
    """Implements client.chat interface."""

    def __init__(
        self,
        engine: RolloutEngine,
        model: str,
        application_id_fn: Callable[[], str] | None = None,
    ):
        self.completions = ChatCompletions(engine, model, application_id_fn)


class RolloutEngineAsyncClient:
    """
    Wrapper that makes RolloutEngine look like AsyncOpenAI.

    Use this to integrate rllm with libraries that expect an AsyncOpenAI client,
    such as the verifiers library.

    Example:
        client = RolloutEngineAsyncClient(
            rollout_engine=self.rollout_engine,
            model="Qwen/Qwen3-0.6B",
            application_id_fn=lambda: uid,
        )

        # Now use with verifiers
        response = await client.chat.completions.create(
            model="...",
            messages=[...],
        )
    """

    def __init__(
        self,
        rollout_engine: RolloutEngine,
        model: str = "",
        application_id_fn: Callable[[], str] | None = None,
        base_url: str = "http://rllm-internal",
    ):
        self._engine = rollout_engine
        self._model = model or getattr(rollout_engine, "model", "unknown")
        self._application_id_fn = application_id_fn

        # Properties that verifiers accesses
        self.base_url = base_url
        self.api_key = "rllm-internal"

        # Build interface
        self.chat = Chat(rollout_engine, self._model, application_id_fn)

Configuration Reference

Verifiers-specific config

Parameter Description Example
+verifiers.env_id Environment ID from Hub primeintellect/alphabet-sort
+verifiers.env_args Environment constructor args (JSON) '{"difficulty": "hard"}'
+verifiers.sampling_args Sampling args for rollouts '{"temperature": 0.7}'
+verifiers.max_samples Limit dataset size (for testing) 100
+backend Training backend verl or tinker

Common training config

Parameter Description Default
data.train_batch_size Prompts per batch 64
data.max_prompt_length Max prompt tokens 2048
data.max_response_length Max response tokens 2048
trainer.total_epochs Training epochs 100
trainer.logger Logging backends ['console','wandb']

verl-specific config

Parameter Description
actor_rollout_ref.model.path HuggingFace model path
actor_rollout_ref.rollout.n Samples per prompt
trainer.n_gpus_per_node GPUs to use

Tinker-specific config

Parameter Description
model.name Model to fine-tune
model.lora_rank LoRA rank
training.group_size GRPO group size
training.learning_rate Learning rate

Troubleshooting

Environment not found

If prime env install fails:

  1. Check you're logged in: prime login
  2. Verify the environment exists on the Hub
  3. Check the exact name format: owner/environment-name

Tinker API key errors

Ensure the key is set:

echo $TINKER_API_KEY

Or check your .env file is in the right location.

Dataset too large

Use +verifiers.max_samples to limit dataset size:

python -m examples.verifiers_env.train \
    +verifiers.env_id="primeintellect/alphabet-sort" \
    +verifiers.max_samples=1000 \
    ...

Hydra config errors

When adding new config keys, use +key=value (with plus):

# Correct - adds new key
+backend=tinker
+verifiers.env_id="math"

# Wrong - tries to override existing key
backend=tinker  # Error: key not in struct