2026 is the year of multi-agent harness design. MASEval is the framework-agnostic evaluation library that treats the entire agent system as the unit of analysis. Compare architectures, tools, and coordination strategies.
The Next Frontier
We've benchmarked transformers, compared LLMs, and tested single agents. Now the question is: how do agents work together as a system? Which tools, which topologies, which coordination strategies?
Core Capabilities
MASEval is the only evaluation library with full native support for multi-agent orchestration, system-level comparison, framework-agnostic adapters, trace-first evaluation, and structured error attribution.
Evaluate agents from any framework via thin AgentAdapter wrappers. Works with smolagents, LangGraph, LlamaIndex, CAMEL, or your custom code. Adding a new framework takes two methods: _run_agent() and get_messages().
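The two-method surface can be sketched as follows. This is an illustrative stand-in, not MASEval's actual base class: the `AgentAdapter` skeleton and the toy framework agent below are assumptions based only on the contract described above.

```python
# Illustrative sketch only: a minimal adapter in the spirit of the
# two-method AgentAdapter contract. The base class shown here and the
# toy framework agent are stand-ins, not MASEval's real classes.
class AgentAdapter:
    def __init__(self, agent, name):
        self.agent = agent
        self.name = name

    def run(self, query):
        return self._run_agent(query)


class MyFrameworkAdapter(AgentAdapter):
    def _run_agent(self, query):
        # Delegate to the wrapped framework agent
        return self.agent.run(query)

    def get_messages(self):
        # Normalize the framework's native history into role/content dicts
        return [
            {"role": m["role"], "content": m["content"]}
            for m in self.agent.history
        ]


class ToyFrameworkAgent:
    """Stands in for a smolagents/LangGraph/CAMEL/custom agent."""

    def __init__(self):
        self.history = []

    def run(self, query):
        self.history.append({"role": "user", "content": query})
        self.history.append({"role": "assistant", "content": "done"})
        return "done"
```

The point of the shape: all framework-specific knowledge lives in the two overridden methods, so swapping frameworks never touches evaluation code.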
Built for multi-agent systems from the ground up. Automatic per-agent message history tracing with independent conversation contexts that respect partial observability. Each agent sees only its own messages.
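The partial-observability guarantee can be pictured with a toy trace store. This is a sketch of the behavior described above, not MASEval's internal data structure:

```python
# Toy sketch of per-agent message tracing with independent contexts;
# not MASEval internals, just the guarantee described above: each
# agent reads and writes only its own history.
from collections import defaultdict


class TraceStore:
    def __init__(self):
        self._histories = defaultdict(list)

    def record(self, agent_name, message):
        self._histories[agent_name].append(message)

    def messages_for(self, agent_name):
        # An agent's view contains only its own conversation
        return list(self._histories[agent_name])
```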
The complete agent system (agents, framework, coordination logic, and tools) is evaluated as a whole. Compare entirely different architectural approaches on the same benchmark, not just swap LLMs.
Evaluate intermediate steps and tool usage patterns, not just final outputs. Evaluators follow a two-stage pattern: filter_traces() extracts the relevant data, then __call__() computes metrics from the structured traces.
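An illustrative two-stage evaluator in the shape described above. The trace schema used here (a `"tools"` list of call records with an `"ok"` flag) is an assumption for the sketch, not MASEval's actual layout:

```python
# Illustrative two-stage Evaluator; the trace schema (a "tools"
# list of call records with an "ok" flag) is an assumption, not
# MASEval's actual layout.
class ToolSuccessEvaluator:
    def filter_traces(self, traces):
        # Stage 1: extract just the tool-call records we care about
        return traces.get("tools", [])

    def __call__(self, traces):
        # Stage 2: compute metrics from the filtered, structured data
        calls = self.filter_traces(traces)
        ok = sum(1 for c in calls if c.get("ok"))
        return {
            "tool_calls": len(calls),
            "tool_success_rate": ok / len(calls) if calls else 0.0,
        }
```

Separating extraction from metric computation keeps each stage independently testable and lets one filter feed several metrics.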
Structured exception hierarchy distinguishes AgentError (counts against score) from EnvironmentError and UserError (excluded from scoring). Fair evaluation requires knowing who failed.
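The scoring rule this enables can be sketched in a few lines: agent failures count against the score, while environment and user failures are excluded from the denominator entirely. The `"success"` and `"agent_error"` / `"environment_error"` status strings appear in MASEval's result reports; `"user_error"` is assumed here as the analogue of UserError.

```python
# Sketch of the attribution rule described above. Agent failures
# count against the score; environment and user failures are
# excluded from scoring. "user_error" is an assumed status string.
def attributed_success_rate(results):
    excluded = {"environment_error", "user_error"}
    scored = [r for r in results if r["status"] not in excluded]
    if not scored:
        return 0.0
    wins = sum(r["status"] == "success" for r in scored)
    return wins / len(scored)
```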
Extensible BenchmarkCallback, EnvironmentCallback, and AgentCallback provide hooks at every phase. Built-in callbacks for progress bars and result logging. Add WandB, Langfuse, or custom metrics without modifying core logic.
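A hypothetical custom-metric callback, to give the flavor. The hook names `on_task_start` / `on_task_end` are illustrative stand-ins, since MASEval's actual BenchmarkCallback hook signatures are not shown on this page:

```python
# Hypothetical callback sketch: on_task_start / on_task_end are
# illustrative stand-ins for the kind of per-phase hooks a
# BenchmarkCallback exposes, not MASEval's documented names.
import time


class TimingCallback:
    def __init__(self):
        self.durations = {}
        self._started = {}

    def on_task_start(self, task_id):
        self._started[task_id] = time.perf_counter()

    def on_task_end(self, task_id, result):
        # Record wall-clock duration per task without touching core logic
        self.durations[task_id] = time.perf_counter() - self._started.pop(task_id)
```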
See It In Action
Subclass Benchmark to wire your evaluation logic together. Override setup hooks to plug in your agents, environments, and evaluators. MASEval handles orchestration, tracing, and reporting.
```python
from openai import OpenAI

from maseval.core import AgentAdapter, Benchmark, Environment, Task
from maseval.interface.inference import OpenAIModelAdapter

client = OpenAI()


class MyBenchmark(Benchmark):
    def setup_environment(self, agent_data, task, seed_generator):
        return MyEnvironment(task.environment_data)

    def setup_agents(self, agent_data, environment, task, user, seed_generator):
        agent = build_agent(agent_data, environment)
        adapter = MyAdapter(agent, "agent")
        return [adapter], {"agent": adapter}

    def setup_evaluators(self, environment, task, agents, user, seed_generator):
        return [MyEvaluator(task.evaluation_data)]

    def run_agents(self, agents, task, environment, query):
        return agents[0].run(query)

    def get_model_adapter(self, model_id, **kw):
        return OpenAIModelAdapter(client, model_id)
```
```python
# Run it
results = MyBenchmark(seed=42).run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)

# Every result is a structured report
for r in results:
    r["status"]      # "success" | "agent_error"
    r["eval"]        # [{metric: value}, ...]
    r["task_id"]     # deterministic ID
    r["repeat_idx"]  # 0, 1, 2

    # Full execution traces per component
    r["traces"]["agents"]  # per-agent messages
    r["traces"]["tools"]   # tool call logs
    r["traces"]["models"]  # LLM call counts

    # Config snapshot for reproducibility
    r["config"]["agents"]       # agent settings
    r["config"]["models"]       # model IDs, seeds
    r["config"]["environment"]  # env snapshot

# Error attribution: who failed?
errors = [r for r in results if r["status"] != "success"]
for e in errors:
    # agent_error vs environment_error
    print(e["status"], e["error"]["type"])
```
```python
from smolagents import ToolCallingAgent

from maseval.benchmark.tau2 import (
    Tau2Benchmark,
    ensure_data_exists,
    load_tasks,
    compute_benchmark_metrics,
)

ensure_data_exists(domain="retail")
tasks = load_tasks(domain="retail", split="base")


# One agent handles everything
class SingleAgent(Tau2Benchmark):
    def setup_agents(self, agent_data, env, task, user, seed_gen):
        agent = ToolCallingAgent(
            model=model,  # your smolagents model instance
            tools=env.create_tools(),
        )
        # SmolAgentAdapter: MASEval's smolagents wrapper
        a = SmolAgentAdapter(agent, "agent")
        return [a], {"agent": a}


single = SingleAgent(n_task_repeats=3, seed=42)
results_single = single.run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)
```
```python
# Orchestrator delegates to specialists
class MultiAgent(Tau2Benchmark):
    def setup_agents(self, agent_data, env, task, user, seed_gen):
        tools = env.create_tools()
        specialist = ToolCallingAgent(
            model=model,
            tools=tools,
            name="tool_specialist",
        )
        orchestrator = ToolCallingAgent(
            model=model,
            managed_agents=[specialist],
        )
        orch = SmolAgentAdapter(orchestrator, "orch")
        spec = SmolAgentAdapter(specialist, "spec")
        # Run the orchestrator, trace both agents
        return [orch], {"orch": orch, "spec": spec}


multi = MultiAgent(n_task_repeats=3, seed=42)
results_multi = multi.run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)

# Compare on the same benchmark
s = compute_benchmark_metrics(results_single)
m = compute_benchmark_metrics(results_multi)
print(f"Single: {s['success_rate']:.0%}")
print(f"Multi:  {m['success_rate']:.0%}")
```
Architecture
MASEval's three-tier module structure enforces a clean separation between the core runtime, abstract interfaces, and your implementations. The core never imports framework-specific code.
Benchmark Task Lifecycle
1. Create environment, user, agents, and evaluators via overridable hooks
2. Run agents with customizable turn orchestration via run_agents()
3. Gather traces from all registered components automatically
4. Compute metrics from traces using composable Evaluator classes
5. Emit a structured report with status, traces, config snapshot, and eval results
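The lifecycle above can be wired together schematically using the overridable hook names from the Benchmark example on this page. Real MASEval also handles seeding, repeats, callbacks, and error attribution; this sketch passes None for those collaborators and simplifies the signatures:

```python
# Schematic of the task lifecycle, using the hook names from the
# Benchmark subclass example; seeding, callbacks, repeats, and
# error attribution are elided (None stands in for them).
def run_one_task(benchmark, task, agent_data):
    # 1. Setup via overridable hooks
    env = benchmark.setup_environment(agent_data, task, None)
    agents, registry = benchmark.setup_agents(agent_data, env, task, None, None)
    evaluators = benchmark.setup_evaluators(env, task, agents, None, None)

    # 2. Run agents with the benchmark's turn orchestration
    output = benchmark.run_agents(agents, task, env, task.query)

    # 3. Gather traces from registered components
    traces = {name: a.get_messages() for name, a in registry.items()}

    # 4. Compute metrics from traces
    evals = [ev({"agents": traces, "output": output}) for ev in evaluators]

    # 5. Emit a structured report
    return {"status": "success", "eval": evals, "traces": {"agents": traces}}
```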
Why MASEval
MASEval is the only library that achieves full support across all key evaluation dimensions: multi-agent native orchestration, system-level comparison, and framework-agnostic design.
| Library | Multi-Agent | System Eval | Agent-Agnostic | Benchmarks | Flexible Interaction | BYO | Trace-First | Mature |
|---|---|---|---|---|---|---|---|---|
| MASEval | | | | | | | | |
| AnyAgent | | | | | | | | |
| MLflow GenAI | | | | | | | | |
| HAL Harness | | | | | | | | |
| Inspect-AI | | | | | | | | |
| OpenCompass | | | | | | | | |
| AgentGym | | | | | | | | |
| Arize Phoenix | | | | | | | | |
| TruLens | | | | | | | | |
| MARBLE | | | | | | | | |
| DeepEval | | | | | | | | |
| MCPEval | | | | | | | | |
Get Started
A complete, runnable example: load real-world agent tasks from HuggingFace, run the built-in ReAct agent, and drill into per-capability results.
```python
from openai import OpenAI

from maseval.benchmark.gaia2 import (
    DefaultAgentGaia2Benchmark,
    load_tasks,
    configure_model_ids,
    compute_gaia2_metrics,
)
from maseval.interface.inference import (
    OpenAIModelAdapter,
)

# Load 5 "execution" tasks from HuggingFace
tasks = load_tasks(
    capability="execution",
    limit=5,
)
```
```python
# The default agent needs a model adapter:
# subclass and tell it how to create one
class MyGaia2(DefaultAgentGaia2Benchmark):
    def get_model_adapter(self, model_id, **kw):
        adapter = OpenAIModelAdapter(
            OpenAI(),
            model_id=model_id,
        )
        # Register so traces capture model usage
        if "register_name" in kw:
            self.register("models", kw["register_name"], adapter)
        return adapter
```
```python
# Agent gets calendar, email, messaging,
# browser, shopping & 7 more tool apps
benchmark = MyGaia2()
results = benchmark.run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)

metrics = compute_gaia2_metrics(results)
print(f"GSR: {metrics['gsr']:.0%}")
```
```python
# Per-agent traces, e.g. the agent2agent capability
# where two agents coordinate via messaging
r = results[0]
for name, t in r["traces"]["agents"].items():
    print(name, t["message_count"])
# >> orchestrator 34
# >> search_agent 12

# Token usage per model
for name, m in r["traces"]["models"].items():
    print(name, m["total_tokens"])
# >> orch_model 48291
# >> search_model 9140

r["eval"]    # {"gsr": 1.0, "passed": true}
r["config"]  # reproducibility snapshot
```
Links & Authors
MASEval is developed by Parameter Lab and collaborators. MIT licensed, published on PyPI, with CI/CD and comprehensive documentation.