Agent and Environment
This guide explains how the two main components in rLLM work together: agents and environments. You'll learn about their core interfaces, interaction patterns, and how to implement your own custom agents and environments.
Overview
rLLM uses a modular approach where agents act as intelligent decision-makers and environments provide tasks and feedback. This separation enables:
- Modular development: Components can be built and tested independently
- Flexible reuse: The same agent can work across different environments
- Consistent patterns: Standardized interfaces ensure compatibility
Agents
Agents are the core components in rLLM that generate intelligent actions based on environmental observations. They serve as the bridge between language models and interactive environments, enabling autonomous problem-solving and decision-making.
BaseAgent Interface
All agents inherit from the BaseAgent class, which defines the essential methods for environment interaction:
from rllm.agents.agent import BaseAgent, Step, Trajectory
class BaseAgent(ABC):
@abstractmethod
def update_from_env(self, observation, reward, done, info, **kwargs):
"""Updates agent state after receiving environment feedback."""
pass
@abstractmethod
def update_from_model(self, response, **kwargs):
"""Updates agent state after receiving model response."""
pass
@abstractmethod
def reset(self):
"""Resets agent's internal state for new episode."""
pass
@property
def chat_completions(self) -> List[Dict[str, str]]:
"""Returns messages formatted for chat completion."""
return []
@property
def trajectory(self) -> Trajectory:
"""Returns complete interaction history."""
return Trajectory()
def get_current_state(self) -> Step | None:
"""Return the most recent step."""
if not self.trajectory.steps:
return None
return self._trajectory.steps[-1]
Key Agent Responsibilities
Each agent manages: - State tracking: Maintaining conversation history and internal state through trajectories - Model interaction: Formatting messages for language model consumption - Response processing: Handling and storing model outputs - Environment adaptation: Updating state based on environmental feedback
Environments
Environments complement agents by defining the tasks, evaluation criteria, and interaction rules. They provide the context within which agents operate and learn.
BaseEnv Interface
All environments in rLLM inherit from the BaseEnv class, which follows the Gymnasium interface with rLLM-specific extensions:
from rllm.environments.base.base_env import BaseEnv
class BaseEnv(ABC):
@abstractmethod
def reset(self) -> Tuple[Any, Dict]:
"""Reset environment and return initial observation and info."""
pass
@abstractmethod
def step(self, action: Any) -> Tuple[Any, float, bool, bool, Dict]:
"""Execute action and return (observation, reward, done, info)."""
pass
def close(self):
"""Clean up environment resources."""
pass
@staticmethod
@abstractmethod
def from_dict(env_args: Dict) -> "BaseEnv":
"""Create environment instance from dictionary. This function is used both during inference and training to instantiate a new environment instance, so it has to be implemented properly."""
pass
Key Environment Responsibilities
Each environment handles: - Task definition: Providing problems or scenarios for agents to solve - Observation generation: Creating meaningful inputs for agent decision-making - Action evaluation: Assessing agent responses and providing rewards - Episode management: Determining when interactions should terminate
The Interaction Cycle
Understanding how agents and environments work together is crucial for effective rLLM usage. The interaction follows a structured cycle that enables learning and adaptation.
Step-by-Step Flow
The agent-environment interaction follows this pattern:
- Initialization: Agent calls
agent.reset()to prepare for a new episode, environment callsenv.reset()to provide initial observation - State Update: Agent processes environment observation via
update_from_env() - Model Interaction: Language model generates response using agent's
chat_completions - Response Processing: Agent updates state via
update_from_model() - Environment Feedback: Environment evaluates the response via
step()and provides reward/next observation - Repeat: Process continues until episode completion or termination criteria are met
This cycle enables sophisticated behaviors like self-correction, multi-turn reasoning, and adaptive problem-solving.
Complete Implementation Example
To illustrate these concepts in action, let's examine a complete implementation that demonstrates self-correction behavior. The MathAgent and MathEnv example shows how agents can learn from mistakes and improve their responses.
MathAgent Implementation
The MathAgent demonstrates how to implement the core agent interface for mathematical problem-solving:
from typing import Any, Dict, List
from rllm.agents.agent import BaseAgent, Step, Trajectory
class MathAgent(BaseAgent):
"""
A math agent that solves mathematical problems step by step, following the BaseAgent interface.
"""
def __init__(self, accumulate_thinking=True):
"""
Initialize the MathAgent.
"""
self.instruction = "Let's think step by step and put the final answer within \\boxed{}."
self._trajectory = Trajectory()
self.messages = []
self.step = 0
self.accumulate_thinking = accumulate_thinking
def update_from_env(self, observation: Any, reward: float, done: bool, info: Dict, **kwargs):
"""Process environment feedback and update internal state."""
# Format observation based on whether it's the initial problem or subsequent feedback
if not self.trajectory.steps:
# Initial problem presentation
assert isinstance(observation, dict) and 'question' in observation
question = observation['question']
formatted_observation = f'{question} {self.instruction}'
else:
# Follow-up correction prompt
formatted_observation = (
"Your previous answer may contain a mistake. "
"Please review it carefully and answer again. "
"Put your final answer within \\boxed{}."
)
# If there are previous steps, update the last step's outcome
if self.trajectory.steps:
prior_step = self._trajectory.steps[-1]
prior_step.next_observation = formatted_observation
prior_step.reward = reward
prior_step.done = done
prior_step.info = info
if done:
return
self.messages.append({
"role": "user",
"content": formatted_observation
})
cur_step = Step(
observation=formatted_observation,
step=self.step
)
self.trajectory.steps.append(cur_step)
def update_from_model(self, response: str, **kwargs):
"""
Updates the agent's internal state based on the model's response.
"""
assert self.trajectory.steps, "Trajectory should not be empty when update_from_model is called."
# Update the current step in the trajectory
cur_step = self.get_current_state()
cur_step.model_response = response
if not self.accumulate_thinking:
_, sep, after = response.partition("</think>")
if sep:
response = after
self.messages.append({"role": "assistant", "content": response})
self.step += 1
def reset(self):
"""Reset agent state for new episode."""
self._trajectory = Trajectory()
self.messages = []
self.step = 0
@property
def chat_completions(self) -> List[Dict[str, str]]:
"""Return conversation history for model interaction."""
return self.messages
@property
def trajectory(self) -> Trajectory:
"""Return complete interaction trajectory."""
return self._trajectory
MathEnv Implementation
The MathEnv demonstrates how to create an environment that supports self-correction through reward-based feedback:
from typing import Any, Dict, Tuple
from rllm.environments.base.base_env import BaseEnv
from rllm.rewards.reward_fn import math_reward_fn
class MathEnv(BaseEnv):
"""
A math environment that presents mathematical problems and evaluates solutions.
Supports self-correction by asking the agent to retry if the answer is incorrect.
"""
def __init__(self, task: Dict, max_attempts: int = 2):
"""
Initialize the MathEnv.
Args:
task: Dictionary containing problem info (question, ground_truth, etc.)
max_attempts: Maximum number of attempts allowed (default: 2)
"""
self.task = task
self.max_attempts = max_attempts
self.current_attempt = 0
self.is_correct = False
def reset(self) -> Tuple[Dict[str, str], Dict]:
"""Reset environment and return initial observation and info."""
self.current_attempt = 0
self.is_correct = False
observation = {"question": self.task["question"]}
info = {"attempt": self.current_attempt, "max_attempts": self.max_attempts}
return observation, info
def step(self, action: str) -> Tuple[Any, float, bool, bool, Dict]:
"""
Execute action (model response) and return (observation, reward, terminated, truncated, info).
"""
self.current_attempt += 1
# Use rllm's math reward function to evaluate the response
reward_output = math_reward_fn(self.task, action)
reward = reward_output.reward
self.is_correct = reward > 0.0
# Determine if episode is done
terminated = self.is_correct or self.current_attempt >= self.max_attempts
truncated = False
observation = None # Will be handled by agent's update_from_env
info = {
"attempt": self.current_attempt,
"max_attempts": self.max_attempts,
"is_correct": self.is_correct,
"reward_metadata": reward_output.metadata
}
return observation, reward, terminated, truncated, info
def close(self):
"""Clean up environment resources."""
pass
@staticmethod
def from_dict(env_args: Dict) -> "MathEnv":
"""Create MathEnv instance from dictionary configuration."""
return MathEnv(
task=env_args["task"],
max_attempts=env_args.get("max_attempts", 2)
)
Detailed Interaction Walkthrough
Let's trace through how the MathAgent and MathEnv handle a complete interaction cycle with self-correction:
Step 0: Initial Problem Solving
- Environment Reset:
env.reset()returns initial observation containing the math question - Agent Processing:
- Agent calls
update_from_env()with the question - Formats question with step-by-step instructions
- Creates Step 0 in trajectory and prepares user message
- Agent calls
- Model Generation: Language model uses
agent.chat_completionsto generate response - Response Integration:
- Agent calls
update_from_model()with model's solution attempt - Updates Step 0 with model response and adds to conversation history
- Agent calls
- Environment Evaluation:
env.step()evaluates response usingmath_reward_fn
Step 1: Self-Correction (if needed)
- Environment Feedback: Environment returns reward (0.0 if incorrect) and episode status
- Agent Processing:
- Agent calls
update_from_env()with feedback - Updates Step 0 with reward and outcome
- If not done, creates Step 1 with correction prompt
- Agent calls
- Model Generation: Model attempts to correct previous answer with new context
- Response Integration: Updates Step 1 with corrected response
- Final Evaluation: Environment provides final reward and terminates episode
This walkthrough demonstrates how the standardized interface enables complex behaviors like self-correction while maintaining clean separation of concerns between agent and environment.
Next Steps
Now that you understand how agents and environments work together, explore these related topics:
- Execution Engine: Learn how to orchestrate parallel agent-environment interactions for efficient batch processing