v0.3.0 — MIT Licensed

Evaluate Multi-Agent
Systems, Not Just Models

2026 is the year of multi-agent harness design. MASEval is the framework-agnostic evaluation library that treats the entire agent system as the unit of analysis. Compare architectures, tools, and coordination strategies.

$ pip install maseval

2026 is the year of agent harness evaluation

We've benchmarked transformers, compared LLMs, and tested single agents. The next question is how agents work together as systems: which tools, which topologies, which coordination strategies?

2017 · Transformer: "Attention Is All You Need" introduces a new architecture
2022 · ChatGPT: LLMs go mainstream; MMLU, HellaSwag, and the leaderboard era
2024 · Agentic AI: single-agent benchmarks for tool use, planning, and code generation
2026 · Multi-Agent Harness: evaluate entire systems, including architectures, frameworks, and coordination

Built for system-level evaluation

MASEval is the only evaluation library with full native support for multi-agent orchestration, system-level comparison, framework-agnostic adapters, trace-first evaluation, and structured error attribution.

Flexible: Any Agent Works

Evaluate agents from any framework via thin AgentAdapter wrappers. Works with smolagents, LangGraph, LlamaIndex, CAMEL, or your custom code. Adding a new framework takes two methods: _run_agent() and get_messages().
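As a sketch of how small that surface is, here is a toy adapter around a fake echo agent. The two method names come from the text above; the real base class is omitted, and the constructor shape, EchoAgent, and message format are illustrative assumptions:

```python
# Illustrative adapter sketch. In real use you would subclass MASEval's
# AgentAdapter; everything below besides the two method names is a stand-in.

class EchoAgent:
    """Stand-in for an agent object from any framework."""
    def __init__(self):
        self.history = []

    def reply(self, query):
        answer = f"echo: {query}"
        self.history.append({"role": "user", "content": query})
        self.history.append({"role": "assistant", "content": answer})
        return answer


class EchoAgentAdapter:
    """Thin wrapper: the harness calls _run_agent(), reads get_messages()."""
    def __init__(self, agent, name):
        self.agent = agent
        self.name = name

    def _run_agent(self, query):
        # Translate the harness call into the framework's own invocation.
        return self.agent.reply(query)

    def get_messages(self):
        # Expose the agent's conversation in a uniform message format.
        return list(self.agent.history)


adapter = EchoAgentAdapter(EchoAgent(), "agent")
result = adapter._run_agent("hello")  # "echo: hello"
```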

Multi-Agent Native

Built for multi-agent systems from the ground up. Automatic per-agent message history tracing with independent conversation contexts that respect partial observability. Each agent sees only its own messages.
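A toy sketch of what independent per-agent contexts mean in practice; the TraceStore class and its methods are illustrative, not MASEval's API:

```python
# Per-agent histories keyed by name: an agent reads back only its own
# messages (partial observability). Illustrative storage scheme only.
from collections import defaultdict

class TraceStore:
    def __init__(self):
        self._messages = defaultdict(list)

    def record(self, agent_name, message):
        self._messages[agent_name].append(message)

    def messages_for(self, agent_name):
        # No cross-agent reads: each agent sees only its own context.
        return list(self._messages[agent_name])


store = TraceStore()
store.record("orch", {"role": "assistant", "content": "delegate to spec"})
store.record("spec", {"role": "assistant", "content": "calling tool"})
# messages_for("orch") contains only the orchestrator's message
```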

Compare Whole Systems

The complete agent system (agents, framework, coordination logic, and tools) is evaluated as a whole. Compare entirely different architectural approaches on the same benchmark, not just swap LLMs.

See Every Step

Evaluate intermediate steps and tool usage patterns, not just final outputs. Evaluator uses a two-stage pattern: filter_traces() extracts relevant data, then __call__() computes metrics from structured traces.
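A minimal sketch of the two-stage pattern, assuming an illustrative trace layout and metric (only the stage names filter_traces() and __call__() come from the text above):

```python
# Stage 1 extracts the relevant slice of the traces; stage 2 computes
# metrics from it. The trace schema below is an assumption for the sketch.

class ToolCallCountEvaluator:
    def filter_traces(self, traces):
        # Keep only tool-call records; ignore agent and model traces.
        return traces.get("tools", [])

    def __call__(self, tool_calls):
        total = len(tool_calls)
        failed = sum(1 for c in tool_calls if c.get("error"))
        return {
            "tool_calls": total,
            "tool_success_rate": (total - failed) / total if total else 0.0,
        }


traces = {
    "agents": {"orch": ["..."]},
    "tools": [
        {"name": "search", "error": None},
        {"name": "search", "error": "timeout"},
        {"name": "calculator", "error": None},
    ],
}

evaluator = ToolCallCountEvaluator()
metrics = evaluator(evaluator.filter_traces(traces))
# 3 calls, 1 failed -> success rate of 2/3
```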

Know Who Failed

A structured exception hierarchy distinguishes AgentError (counts against the score) from EnvironmentError and UserError (excluded from scoring). Fair evaluation requires knowing who failed.
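A toy sketch of how such attribution can work. The three class names come from the text above; defining them locally (which shadows Python's built-in EnvironmentError) and the attribute() helper are illustrative:

```python
# Which exception class a failure raises decides whether it counts
# against the agent's score. Local definitions for the sketch only.

class AgentError(Exception):
    """Failure attributable to the agent: counts against the score."""

class EnvironmentError(Exception):  # shadows the builtin in this sketch
    """Failure in the environment: excluded from scoring."""

class UserError(Exception):
    """Failure in the simulated user: excluded from scoring."""


def attribute(failures):
    scored = [f for f in failures if isinstance(f, AgentError)]
    excluded = [f for f in failures if not isinstance(f, AgentError)]
    return scored, excluded


failures = [
    AgentError("bad tool args"),
    EnvironmentError("API timeout"),
    UserError("simulator crashed"),
]
scored, excluded = attribute(failures)
# Only the AgentError counts against the agent's score.
```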

Extend Everything

Extensible BenchmarkCallback, EnvironmentCallback, and AgentCallback provide hooks at every phase. Built-in callbacks for progress bars and result logging. Add WandB, Langfuse, or custom metrics without modifying core logic.
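A sketch of what a custom callback might look like. The hook names on_task_start() and on_task_end() are hypothetical stand-ins for whatever hooks the real BenchmarkCallback exposes; check the API reference for the actual signatures:

```python
# Custom metric via a callback: record wall-clock time per task without
# touching core benchmark logic. Hook names here are hypothetical.
import time

class TimingCallback:
    def __init__(self):
        self.durations = {}
        self._started = {}

    def on_task_start(self, task_id):
        self._started[task_id] = time.perf_counter()

    def on_task_end(self, task_id, report):
        self.durations[task_id] = time.perf_counter() - self._started.pop(task_id)


cb = TimingCallback()
cb.on_task_start("t1")
cb.on_task_end("t1", report={"status": "success"})
# cb.durations["t1"] now holds the task's wall-clock duration
```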

We handle boilerplate, so you do science

Subclass Benchmark to wire your evaluation logic together. Override setup hooks to plug in your agents, environments, and evaluators. MASEval handles orchestration, tracing, and reporting.

Define Your Benchmark
from maseval.core import Benchmark, Task
from maseval.core import AgentAdapter, Environment
from maseval.interface.inference import OpenAIModelAdapter
from openai import OpenAI

client = OpenAI()

class MyBenchmark(Benchmark):
    def setup_environment(self, agent_data,
            task, seed_generator):
        return MyEnvironment(task.environment_data)

    def setup_agents(self, agent_data, environment,
            task, user, seed_generator):
        agent = build_agent(agent_data, environment)
        adapter = MyAdapter(agent, "agent")
        return [adapter], {"agent": adapter}

    def setup_evaluators(self, environment, task,
            agents, user, seed_generator):
        return [MyEvaluator(task.evaluation_data)]

    def run_agents(self, agents, task,
            environment, query):
        return agents[0].run(query)

    def get_model_adapter(self, model_id, **kw):
        return OpenAIModelAdapter(client, model_id)
Structured Reports & Traces
# Run it
results = MyBenchmark(seed=42).run(
    tasks=tasks, agent_data={"model_id": "gpt-4o"},
)

# Every result is a structured report
for r in results:
    r["status"]      # "success" | "agent_error"
    r["eval"]        # [{metric: value}, ...]
    r["task_id"]     # deterministic ID
    r["repeat_idx"]  # 0, 1, 2

    # Full execution traces per component
    r["traces"]["agents"]  # per-agent messages
    r["traces"]["tools"]   # tool call logs
    r["traces"]["models"]  # LLM call counts

    # Config snapshot for reproducibility
    r["config"]["agents"]      # agent settings
    r["config"]["models"]      # model IDs, seeds
    r["config"]["environment"]  # env snapshot

# Error attribution: who failed?
errors = [r for r in results
          if r["status"] != "success"]
for e in errors:
    # agent_error vs environment_error
    print(e["status"], e["error"]["type"])
Single Agent
from maseval.benchmark.tau2 import (
    Tau2Benchmark, ensure_data_exists,
    load_tasks, compute_benchmark_metrics,
)
from smolagents import ToolCallingAgent

ensure_data_exists(domain="retail")
tasks = load_tasks(domain="retail", split="base")

# One agent handles everything
class SingleAgent(Tau2Benchmark):
    def setup_agents(self, agent_data, env,
            task, user, seed_gen):
        agent = ToolCallingAgent(
            model=model, tools=env.create_tools(),
        )
        a = SmolAgentAdapter(agent, "agent")
        return [a], {"agent": a}

single = SingleAgent(n_task_repeats=3, seed=42)
results_single = single.run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)
Multi-Agent (same benchmark)
# Orchestrator delegates to specialists
class MultiAgent(Tau2Benchmark):
    def setup_agents(self, agent_data, env,
            task, user, seed_gen):
        tools = env.create_tools()
        specialist = ToolCallingAgent(
            model=model, tools=tools,
            name="tool_specialist",
        )
        orchestrator = ToolCallingAgent(
            model=model,
            managed_agents=[specialist],
        )
        orch = SmolAgentAdapter(orchestrator, "orch")
        spec = SmolAgentAdapter(specialist, "spec")
        # Run orchestrator, trace both
        return [orch], {"orch": orch, "spec": spec}

multi = MultiAgent(n_task_repeats=3, seed=42)
results_multi = multi.run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)

# Compare on the same benchmark
s = compute_benchmark_metrics(results_single)
m = compute_benchmark_metrics(results_multi)
print(f"Single: {s['success_rate']:.0%}")
print(f"Multi:  {m['success_rate']:.0%}")

Bring your own everything

MASEval's three-tier module structure enforces a clean separation between the core runtime, abstract interfaces, and your implementations. The core never imports framework-specific code.

Your Code: smolagents · LangGraph · LlamaIndex · CAMEL · YourAgent · OpenAI · Anthropic · LiteLLM · WandB · Langfuse
Interface: AgentAdapter · ModelAdapter · Environment · User · Evaluator · BenchmarkCallback
Core Runtime: Orchestration · TraceableMixin · ComponentRegistry · SeedGenerator · Error Attribution
Benchmarks: Tau2Benchmark · GAIA2Benchmark · MultiAgentBenchmark · MACSBenchmark · ConVerseBenchmark · MMLUBenchmark

Benchmark Task Lifecycle

1. Setup: create environment, user, agents, and evaluators via overridable hooks
2. Execute: run agents with customizable turn orchestration via run_agents()
3. Collect: gather traces from all registered components automatically
4. Evaluate: compute metrics from traces using composable Evaluator classes
5. Report: emit a structured report with status, traces, config snapshot, and eval results

The complete evaluation stack

MASEval is the only library that achieves full support across all key evaluation dimensions: multi-agent native orchestration, system-level comparison, and framework-agnostic design.

[Support matrix comparing MASEval, AnyAgent, MLflow GenAI, HAL Harness, Inspect-AI, OpenCompass, AgentGym, Arize Phoenix, TruLens, MARBLE, DeepEval, and MCPEval across: Multi-Agent, System Eval, Agent-Agnostic, Benchmarks, Flexible Interaction, BYO, Trace-First, and Mature. Ratings: full support / partial or limited / not supported.]

Evaluate an agent on GAIA2 in four steps

A complete, runnable example: load real-world agent tasks from HuggingFace, run the built-in ReAct agent, and drill into per-capability results.

1

Load Tasks

from maseval.benchmark.gaia2 import (
    DefaultAgentGaia2Benchmark,
    load_tasks, configure_model_ids,
    compute_gaia2_metrics,
)
from maseval.interface.inference import (
    OpenAIModelAdapter,
)
from openai import OpenAI

# Load 5 "execution" tasks from HuggingFace
tasks = load_tasks(
    capability="execution", limit=5,
)
2

Wire Up a Model

# The default agent needs a model adapter —
# subclass and tell it how to create one
class MyGaia2(DefaultAgentGaia2Benchmark):
    def get_model_adapter(self, model_id, **kw):
        adapter = OpenAIModelAdapter(
            OpenAI(), model_id=model_id,
        )
        # Register so traces capture model usage
        if "register_name" in kw:
            self.register(
                "models", kw["register_name"],
                adapter)
        return adapter
3

Run & Evaluate

# Agent gets calendar, email, messaging,
# browser, shopping & 7 more tool apps
benchmark = MyGaia2()
results = benchmark.run(
    tasks=tasks,
    agent_data={"model_id": "gpt-4o"},
)
metrics = compute_gaia2_metrics(results)
print(f"GSR: {metrics['gsr']:.0%}")
4

Inspect Traces

# Per-agent traces — e.g. agent2agent capability
# where two agents coordinate via messaging
r = results[0]
for name, t in r["traces"]["agents"].items():
    print(name, t["message_count"])

# >> orchestrator    34 messages
# >> search_agent    12 messages

# Token usage per model
for name, m in r["traces"]["models"].items():
    print(name, m["total_tokens"])

# >> orch_model      48291 tokens
# >> search_model     9140 tokens

r["eval"]   # {"gsr": 1.0, "passed": True}
r["config"] # reproducibility snapshot