DeepSWE Software Engineering Agent Example
This example demonstrates training and running DeepSWE, a software-engineering agent trained on top of Qwen3-32B to search, view, and navigate codebases. The model achieves an impressive 59.0% on SWE-Bench-Verified, which is currently #1 in the open-weights category.
Overview
The DeepSWE examples demonstrate:
- How to use rLLM's SWEAgent for software engineering tasks.
- How to train DeepSWE with compact filtering.
- How to evaluate DeepSWE over SWE-Bench-Verified.
Quick Start
Setup Coding Data
First, prepare your coding datasets:
Model Hosting
Start a model server using vLLM:
# Start VLLM server with tensor parallelism across 8 GPUs
export MAX_CONTEXT_LEN=65536
export TENSOR_PARALLEL_SIZE=8
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve agentica-org/DeepSWE-Preview \
--tensor-parallel-size $TENSOR_PARALLEL_SIZE \
--max-model-len $MAX_CONTEXT_LEN \
--hf-overrides '{"max_position_embeddings": '$MAX_CONTEXT_LEN'}' \
--enable_prefix_caching
Run/Evaluate DeepSWE Agent on SWE-Bench-Verified
To fully reproduce DeepSWE's evaluation, see the official R2E-Gym repo for more details.
Train DeepSWE Agent
To train DeepSWE, we suggest deploying a Kubernetes (K8) cluster on AWS/GCP/Azure. Each node should have a large number of CPUs and diskspace. Each node in our K8 cluster contains 200 CPUs and over 6 TB+ of disk space to store 1000s of Docker images.
To run Kubernetes locally, we suggest installing kind and launching it with kind create cluster. However, please do note that this is not sufficient to launch a full training run.
Next, run the bash script below:
Code Reference
SWE Agent Runner
Main script for evaluating SWE-Bench performance:
import asyncio
from transformers import AutoTokenizer
from rllm.agents.swe_agent import SWEAgent
from rllm.data.dataset import DatasetRegistry
from rllm.engine.agent_execution_engine import AgentExecutionEngine
from rllm.environments.swe.swe import SWEEnv
from rllm.utils import compute_pass_at_k
def load_swe_data():
if DatasetRegistry.dataset_exists("SWE_Bench_Verified", "test"):
test_dataset = DatasetRegistry.load_dataset("SWE_Bench_Verified", "test")
return test_dataset.get_data()
raise ValueError("SWE_Bench_Verified dataset not found. Please run `python prepare_swe_data.py` to create the dataset.")
if __name__ == "__main__":
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"
model_name = "agentica-org/DeepSWE-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = {"temperature": 1, "model": model_name}
engine = AgentExecutionEngine(
agent_class=SWEAgent,
env_class=SWEEnv,
agent_args={},
env_args={},
engine_name="openai",
tokenizer=tokenizer,
sampling_params=sampling_params,
rollout_engine_args={
"base_url": "http://localhost:30000/v1",
"api_key": "None",
},
n_parallel_agents=48,
max_response_length=65536,
max_prompt_length=4096,
)
tasks = load_swe_data()
results = asyncio.run(engine.execute_tasks(tasks))
compute_pass_at_k(results)
Training Script
DeepSWE training configuration:
import hydra
from rllm.agents.swe_agent import SWEAgent
from rllm.data import DatasetRegistry
from rllm.environments.swe.swe import SWEEnv
from rllm.trainer.agent_trainer import AgentTrainer
@hydra.main(config_path="pkg://rllm.trainer.config", config_name="agent_ppo_trainer", version_base=None)
def main(config):
# Load SWE datasets - using names from prepare_swe_data.py
train_dataset = DatasetRegistry.load_dataset("R2E_Gym_Subset", "train")
val_dataset = DatasetRegistry.load_dataset("SWE_Bench_Verified", "test")
trainer = AgentTrainer(
agent_class=SWEAgent,
env_class=SWEEnv,
config=config,
train_dataset=train_dataset,
val_dataset=val_dataset,
)
trainer.train()
if __name__ == "__main__":
main()
For detailed setup instructions, see the README in the deepswe example directory.