v0.3.0 — MIT Licensed

Evaluate Multi-Agent
Systems, Not Just Models

2026 is the year of multi-agent harness design. MASEval is the framework-agnostic evaluation library that treats the entire agent system as the unit of analysis. Compare architectures, tools, and coordination strategies.

$ pip install maseval

2026 is the year of agent harness evaluation

We've benchmarked transformers, compared LLMs, and tested single agents. The next question is how agents work together as systems: which tools, which topologies, which coordination strategies?

2017 · Transformer: "Attention Is All You Need" introduces a new architecture
2022 · ChatGPT: LLMs go mainstream; MMLU, HellaSwag, and the leaderboard era
2024 · Agentic AI: single-agent benchmarks for tool use, planning, and code generation
2026 · Multi-Agent Harness: evaluate entire systems, including architectures, frameworks, and coordination

Built for system-level evaluation

MASEval is the only evaluation library with full native support for multi-agent orchestration, system-level comparison, framework-agnostic adapters, trace-first evaluation, and structured error attribution.

Flexible: Any Agent Works

Evaluate agents from any framework via thin AgentAdapter wrappers. Works with smolagents, LangGraph, LlamaIndex, CAMEL, or your custom code. Adding a new framework takes two methods: _run_agent() and get_messages().
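As a sketch of how small that surface is, here is a toy adapter around a fake echo agent. The two method names come from the text above; the real base class is omitted, and the constructor shape, EchoAgent, and message format are illustrative assumptions:

```python
# Illustrative adapter sketch. In real use you would subclass MASEval's
# AgentAdapter; everything below besides the two method names is a stand-in.

class EchoAgent:
    """Stand-in for an agent object from any framework."""
    def __init__(self):
        self.history = []

    def reply(self, query):
        answer = f"echo: {query}"
        self.history.append({"role": "user", "content": query})
        self.history.append({"role": "assistant", "content": answer})
        return answer


class EchoAgentAdapter:
    """Thin wrapper: the harness calls _run_agent(), reads get_messages()."""
    def __init__(self, agent, name):
        self.agent = agent
        self.name = name

    def _run_agent(self, query):
        # Translate the harness call into the framework's own invocation.
        return self.agent.reply(query)

    def get_messages(self):
        # Expose the agent's conversation in a uniform message format.
        return list(self.agent.history)


adapter = EchoAgentAdapter(EchoAgent(), "agent")
result = adapter._run_agent("hello")  # "echo: hello"
```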

Multi-Agent Native

Built for multi-agent systems from the ground up. Automatic per-agent message history tracing with independent conversation contexts that respect partial observability. Each agent sees only its own messages.
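A toy sketch of what independent per-agent contexts mean in practice; the TraceStore class and its methods are illustrative, not MASEval's API:

```python
# Per-agent histories keyed by name: an agent reads back only its own
# messages (partial observability). Illustrative storage scheme only.
from collections import defaultdict

class TraceStore:
    def __init__(self):
        self._messages = defaultdict(list)

    def record(self, agent_name, message):
        self._messages[agent_name].append(message)

    def messages_for(self, agent_name):
        # No cross-agent reads: each agent sees only its own context.
        return list(self._messages[agent_name])


store = TraceStore()
store.record("orch", {"role": "assistant", "content": "delegate to spec"})
store.record("spec", {"role": "assistant", "content": "calling tool"})
# messages_for("orch") contains only the orchestrator's message
```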

Compare Whole Systems

The complete agent system (agents, framework, coordination logic, and tools) is evaluated as a whole. Compare entirely different architectural approaches on the same benchmark, not just swap LLMs.

See Every Step

Evaluate intermediate steps and tool usage patterns, not just final outputs. Evaluator uses a two-stage pattern: filter_traces() extracts relevant data, then __call__() computes metrics from structured traces.
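A minimal sketch of the two-stage pattern, assuming an illustrative trace layout and metric (only the stage names filter_traces() and __call__() come from the text above):

```python
# Stage 1 extracts the relevant slice of the traces; stage 2 computes
# metrics from it. The trace schema below is an assumption for the sketch.

class ToolCallCountEvaluator:
    def filter_traces(self, traces):
        # Keep only tool-call records; ignore agent and model traces.
        return traces.get("tools", [])

    def __call__(self, tool_calls):
        total = len(tool_calls)
        failed = sum(1 for c in tool_calls if c.get("error"))
        return {
            "tool_calls": total,
            "tool_success_rate": (total - failed) / total if total else 0.0,
        }


traces = {
    "agents": {"orch": ["..."]},
    "tools": [
        {"name": "search", "error": None},
        {"name": "search", "error": "timeout"},
        {"name": "calculator", "error": None},
    ],
}

evaluator = ToolCallCountEvaluator()
metrics = evaluator(evaluator.filter_traces(traces))
# 3 calls, 1 failed -> success rate of 2/3
```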

Know Who Failed

A structured exception hierarchy distinguishes AgentError (counts against the score) from EnvironmentError and UserError (excluded from scoring). Fair evaluation requires knowing who failed.
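A toy sketch of how such attribution can work. The three class names come from the text above; defining them locally (which shadows Python's built-in EnvironmentError) and the attribute() helper are illustrative:

```python
# Which exception class a failure raises decides whether it counts
# against the agent's score. Local definitions for the sketch only.

class AgentError(Exception):
    """Failure attributable to the agent: counts against the score."""

class EnvironmentError(Exception):  # shadows the builtin in this sketch
    """Failure in the environment: excluded from scoring."""

class UserError(Exception):
    """Failure in the simulated user: excluded from scoring."""


def attribute(failures):
    scored = [f for f in failures if isinstance(f, AgentError)]
    excluded = [f for f in failures if not isinstance(f, AgentError)]
    return scored, excluded


failures = [
    AgentError("bad tool args"),
    EnvironmentError("API timeout"),
    UserError("simulator crashed"),
]
scored, excluded = attribute(failures)
# Only the AgentError counts against the agent's score.
```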

Extend Everything

Extensible BenchmarkCallback, EnvironmentCallback, and AgentCallback provide hooks at every phase. Built-in callbacks for progress bars and result logging. Add WandB, Langfuse, or custom metrics without modifying core logic.
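A sketch of what a custom callback might look like. The hook names on_task_start() and on_task_end() are hypothetical stand-ins for whatever hooks the real BenchmarkCallback exposes; check the API reference for the actual signatures:

```python
# Custom metric via a callback: record wall-clock time per task without
# touching core benchmark logic. Hook names here are hypothetical.
import time

class TimingCallback:
    def __init__(self):
        self.durations = {}
        self._started = {}

    def on_task_start(self, task_id):
        self._started[task_id] = time.perf_counter()

    def on_task_end(self, task_id, report):
        self.durations[task_id] = time.perf_counter() - self._started.pop(task_id)


cb = TimingCallback()
cb.on_task_start("t1")
cb.on_task_end("t1", report={"status": "success"})
# cb.durations["t1"] now holds the task's wall-clock duration
```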

We handle boilerplate, so you do science

Subclass Benchmark to wire your evaluation logic together. Override setup hooks to plug in your agents, environments, and evaluators. MASEval handles orchestration, tracing, and reporting.

Define Your Benchmark
from maseval.core import Benchmark, Task
from maseval.core import AgentAdapter, Environment
from maseval.interface.inference import OpenAIModelAdapter
from openai import OpenAI

client = OpenAI()

class MyBenchmark(Benchmark):
    def setup_environment(self, agent_data,
            task, seed_generator):
        return MyEnvironment(task.environment_data)

    def setup_agents(self, agent_data, environment,
            task, user, seed_generator):
        agent = build_agent(agent_data, environment)
        adapter = MyAdapter(agent, "agent")
        return [adapter], {"agent": adapter}

    def setup_evaluators(self, environment, task,
            agents, user, seed_generator):
        return [MyEvaluator(task.evaluation_data)]

    def run_agents(self, agents, task,
            environment, query):
        return agents[0].run(query)

    def get_model_adapter(self, model_id, **kw):
        return OpenAIModelAdapter(client, model_id)
Structured Reports & Traces
# Run it
results = MyBenchmark(seed=42).run(
    tasks=tasks, agent_data={"model_id": "gpt-4o"},
)

# Every result is a structured report
for r in results:
    r["status"]      # "success" | "agent_error"
    r["eval"]        # [{metric: value}, ...]
    r["task_id"]     # deterministic ID
    r["repeat_idx"]  # 0, 1, 2

    # Full execution traces per component
    r["traces"]["agents"]  # per-agent messages
    r["traces"]["tools"]   # tool call logs
    r["traces"]["models"]  # LLM call counts

    # Config snapshot for reproducibility
    r["config"]["agents"]      # agent settings
    r["config"]["models"]      # model IDs, seeds
    r["config"]["environment"]  # env snapshot

# Error attribution: who failed?
errors = [r for r in results
          if r["status"] != "success"]
for e in errors:
    # agent_error vs environment_error
    print(e["status"], e["error"]["type"])
Single Agent
from maseval.benchmark.tau2 import (
    Tau2Benchmark, ensure_data_exists,
    load_tasks, compute_benchmark_metrics,
)
from smolagents import ToolCallingAgent

ensure_data_exists(domain="retail")
tasks = load_tasks(domain="retail", split="base")

# One agent handles everything
class SingleAgent(Tau2Benchmark):
    def setup_agents(self, agent_data, env,
            task, user, seed_gen):
        agent = ToolCallingAgent(
            model=model, tools=env.create_tools(),
        )
        a = SmolAgentAdapter(agent, "agent")
        return [a], {"agent": a}

single = SingleAgent(n_task_repeats=3, seed=42)
results_single = single.run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)
Multi-Agent (same benchmark)
# Orchestrator delegates to specialists
class MultiAgent(Tau2Benchmark):
    def setup_agents(self, agent_data, env,
            task, user, seed_gen):
        tools = env.create_tools()
        specialist = ToolCallingAgent(
            model=model, tools=tools,
            name="tool_specialist",
        )
        orchestrator = ToolCallingAgent(
            model=model,
            managed_agents=[specialist],
        )
        orch = SmolAgentAdapter(orchestrator, "orch")
        spec = SmolAgentAdapter(specialist, "spec")
        # Run orchestrator, trace both
        return [orch], {"orch": orch, "spec": spec}

multi = MultiAgent(n_task_repeats=3, seed=42)
results_multi = multi.run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)

# Compare on the same benchmark
s = compute_benchmark_metrics(results_single)
m = compute_benchmark_metrics(results_multi)
print(f"Single: {s['success_rate']:.0%}")
print(f"Multi:  {m['success_rate']:.0%}")

Bring your own everything

MASEval's three-tier module structure enforces a clean separation between the core runtime, abstract interfaces, and your implementations. The core never imports framework-specific code.

Your Code: smolagents · LangGraph · LlamaIndex · CAMEL · YourAgent · OpenAI · Anthropic · LiteLLM · WandB · Langfuse
Interface: AgentAdapter · ModelAdapter · Environment · User · Evaluator · BenchmarkCallback
Core Runtime: Orchestration · TraceableMixin · ComponentRegistry · SeedGenerator · Error Attribution
Benchmarks: Tau2Benchmark · GAIA2Benchmark · MultiAgentBenchmark · MACSBenchmark · ConVerseBenchmark · MMLUBenchmark

Benchmark Task Lifecycle

1. Setup: create environment, user, agents, and evaluators via overridable hooks
2. Execute: run agents with customizable turn orchestration via run_agents()
3. Collect: gather traces from all registered components automatically
4. Evaluate: compute metrics from traces using composable Evaluator classes
5. Report: emit a structured report with status, traces, config snapshot, and eval results

The complete evaluation stack

MASEval is the only library that achieves full support across all key evaluation dimensions: multi-agent native orchestration, system-level comparison, and framework-agnostic design.

[Support matrix comparing MASEval, AnyAgent, MLflow GenAI, HAL Harness, Inspect-AI, OpenCompass, AgentGym, Arize Phoenix, TruLens, MARBLE, DeepEval, and MCPEval across: Multi-Agent, System Eval, Agent-Agnostic, Benchmarks, Flexible Interaction, BYO, Trace-First, and Mature. Ratings: full support / partial or limited / not supported.]

Evaluate an agent on GAIA2 in four steps

A complete, runnable example: load real-world agent tasks from HuggingFace, run the built-in ReAct agent, and drill into per-capability results.

1

Load Tasks

from maseval.benchmark.gaia2 import (
    DefaultAgentGaia2Benchmark,
    load_tasks, configure_model_ids,
    compute_gaia2_metrics,
)
from maseval.interface.inference import (
    OpenAIModelAdapter,
)
from openai import OpenAI

# Load 5 "execution" tasks from HuggingFace
tasks = load_tasks(
    capability="execution", limit=5,
)
2

Wire Up a Model

# The default agent needs a model adapter —
# subclass and tell it how to create one
class MyGaia2(DefaultAgentGaia2Benchmark):
    def get_model_adapter(self, model_id, **kw):
        adapter = OpenAIModelAdapter(
            OpenAI(), model_id=model_id,
        )
        # Register so traces capture model usage
        if "register_name" in kw:
            self.register(
                "models", kw["register_name"],
                adapter)
        return adapter
3

Run & Evaluate

# Agent gets calendar, email, messaging,
# browser, shopping & 7 more tool apps
benchmark = MyGaia2()
results = benchmark.run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)
metrics = compute_gaia2_metrics(results)
print(f"GSR: {metrics['gsr']:.0%}")
4

Inspect Traces

# Per-agent traces — e.g. agent2agent capability
# where two agents coordinate via messaging
r = results[0]
for name, t in r["traces"]["agents"].items():
    print(name, t["message_count"])

# >> orchestrator    34 messages
# >> search_agent    12 messages

# Token usage per model
for name, m in r["traces"]["models"].items():
    print(name, m["total_tokens"])

# >> orch_model      48291 tokens
# >> search_model     9140 tokens

r["eval"]   # {"gsr": 1.0, "passed": True}
r["config"] # reproducibility snapshot