Dmytro Nikolaiev

smoleval & Notes on Agentic Engineering

This post covers smoleval, a lightweight evaluation framework for AI applications, which I built with coding agents.

TL;DR: smoleval lets you define evaluation test cases in YAML and run them against your agents over HTTP or programmatically in Rust. It's ~1,500 lines of code and I built it with Conductor in ~15h.

Why I Built This

I've been playing with simple agents using small models with Rig, and one thing that stands out is how compact and memory-efficient the resulting binaries are. That's clearly not the industry's main bottleneck right now, but I can see a world where tiny agents run on device, potentially with local models or even as self-contained agent binaries.

But with smaller models you need to make sure things actually work, so evaluation becomes critical [1]. I wanted something minimal that I could plug into CI: define a set of test cases, run them against an agent, and get a pass/fail report.

That's essentially smoleval. It's a lightweight evaluation framework with a registry of extensible checks [2]. You write your test cases in YAML, and smoleval runs them and gives you structured results. The whole thing is ~1,500 lines of Rust (excluding tests) and the release binary is 7.4 MB.

How It Works

A dataset is essentially a named list of tests, where each test pairs a prompt with a set of checks:

name: CalculatorAgentEvalDataset
description: |
  Evaluation dataset for a calculator agent
  with `add(x, y)` tool
tests:
  - name: simpleAddition
    description: |
      Checks that when asked "2 + 3", 
      the agent uses the `add` tool at least once 
      and the final response contains either "5" or "five"
    prompt: |
      What is 2 + 3?
    checks:
      - kind: responseContainsAny
        values: ["5", "five"]
      - kind: toolUsedAtLeast
        name: "add"
        times: 1

  - name: largerAddition
    description: |
      Checks that when asked "100 + 250", 
      the agent uses the `add` tool exactly once 
      and the final response contains "350"
    prompt: |
      Please add 100 and 250 together.
    checks:
      - kind: responseContainsAll
        values: ["350"]
      - kind: toolUsedExactly
        name: "add"
        times: 1

  - name: noHallucination
    description: |
      Checks that when asked "7 + 8", 
      the final response contains either "15" or "fifteen" 
      and doesn't contain "14" or "16"
    prompt: |
      What is seven plus eight?
    checks:
      - kind: responseContainsAny
        values: ["15", "fifteen"]
      - kind: responseNotContains
        values: ["16", "14"]

You can install the smoleval CLI with `cargo install smoleval-cli` and run it against any HTTP endpoint. It supports concurrent requests and several report formats, such as JSON and XML.

smoleval \
    --dataset crates/smoleval-cli-example/data/eval_dataset.yaml \
    --agent http://localhost:3826 \
    --concurrency 3 \
    --quiet \
    --output results.json

The command above produces a report containing agent responses, per-check results, and aggregated metrics:

{
  "datasetName": "CalculatorAgentEvalDataset",
  "results": [
    {
      "testCase": "simpleAddition",
      "response": {
        "text": "2 + 3 is 5.",
        "toolCalls": [
          {
            "name": "add",
            "arguments": { "x": 2, "y": 3 }
          }
        ]
      },
      "checks": [
        {
          "kind": "responseContainsAny",
          "passed": true,
          "reason": "found \"5\" in response (case-insensitive)",
          "durationSecs": 2.833e-6,
          "score": 1.0
        },
        {
          "kind": "toolUsedAtLeast",
          "passed": true,
          "reason": "tool \"add\" used 1 time(s) (>= 1)",
          "durationSecs": 1.666e-6,
          "score": 1.0
        }
      ],
      "passed": true,
      "score": 1.0,
      "agentDurationSecs": 2.600175333
    },
    {
      "testCase": "largerAddition",
      "response": {
        "text": "100 and 250 added together equals 350.",
        "toolCalls": [
          {
            "name": "add",
            "arguments": { "x": 100, "y": 250 }
          }
        ]
      },
      "checks": [
        {
          "kind": "responseContainsAll",
          "passed": true,
          "reason": "found all of [\"350\"] in response (case-insensitive)",
          "durationSecs": 5.792e-6,
          "score": 1.0
        },
        {
          "kind": "toolUsedExactly",
          "passed": true,
          "reason": "tool \"add\" used exactly 1 time(s)",
          "durationSecs": 2.791e-6,
          "score": 1.0
        }
      ],
      "passed": true,
      "score": 1.0,
      "agentDurationSecs": 3.215905333
    },
    {
      "testCase": "noHallucination",
      "response": {
        "text": "Seven plus eight is 15.",
        "toolCalls": [
          {
            "name": "add",
            "arguments": { "x": 7, "y": 8 }
          }
        ]
      },
      "checks": [
        {
          "kind": "responseContainsAny",
          "passed": true,
          "reason": "found \"15\" in response (case-insensitive)",
          "durationSecs": 5.875e-6,
          "score": 1.0
        },
        {
          "kind": "responseNotContains",
          "passed": true,
          "reason": "none of [\"16\", \"14\"] found in response (case-insensitive)",
          "durationSecs": 3.042e-6,
          "score": 1.0
        }
      ],
      "passed": true,
      "score": 1.0,
      "agentDurationSecs": 2.9096685
    }
  ],
  "summary": {
    "totalCount": 3,
    "passedCount": 3,
    "failedCount": 0,
    "erroredCount": 0,
    "meanScore": 1.0,
    "threshold": 1.0,
    "durationSecs": 3.216570875
  }
}

The repo has examples for both Rig (Rust) and LangChain (Python) agents implementing the same calculator over HTTP, and a mock Rust agent using the core library directly.
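
To get a feel for the core idea, here's a minimal, self-contained Rust sketch of the check-against-response loop. This is not smoleval's actual API (see the repo's mock-agent example for that); the types and names below are mine, chosen only to illustrate what the library does under the hood.

// Self-contained sketch of the in-process evaluation idea.
// NOT smoleval's real API -- an illustration of running checks
// against a captured agent response.

struct ToolCall {
    name: String,
}

struct AgentResponse {
    text: String,
    tool_calls: Vec<ToolCall>,
}

enum Check {
    ResponseContainsAny(Vec<String>),
    ToolUsedAtLeast { name: String, times: usize },
}

impl Check {
    // Case-insensitive matching, mirroring the reasons in the report above.
    fn run(&self, response: &AgentResponse) -> bool {
        match self {
            Check::ResponseContainsAny(values) => {
                let text = response.text.to_lowercase();
                values.iter().any(|v| text.contains(&v.to_lowercase()))
            }
            Check::ToolUsedAtLeast { name, times } => {
                response.tool_calls.iter().filter(|c| &c.name == name).count() >= *times
            }
        }
    }
}

fn main() {
    // A mock response, like the one the simpleAddition test expects.
    let response = AgentResponse {
        text: "2 + 3 is 5.".into(),
        tool_calls: vec![ToolCall { name: "add".into() }],
    };

    let checks = vec![
        Check::ResponseContainsAny(vec!["5".into(), "five".into()]),
        Check::ToolUsedAtLeast { name: "add".into(), times: 1 },
    ];

    let passed = checks.iter().all(|c| c.run(&response));
    println!("simpleAddition passed: {passed}");
}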

Limitations

The main limitation is that there's currently no unified framework for agent interaction. The HTTP-based approach is a good baseline as it's language- and framework-agnostic, but it still requires you to implement the expected request/response interface, which can be tedious.
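
To make that concrete, here's a rough sketch of such a shim in Rust using axum. The response shape mirrors the response objects in the report above; the request body ({"prompt": ...}) and the route are my assumptions, not a documented contract, so check the repo's examples for the actual interface.

// Hypothetical HTTP shim for a calculator agent (axum + serde).
// Request/route shapes are assumptions; the response shape follows
// the "response" objects in the JSON report above.

use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct AgentRequest {
    prompt: String,
}

#[derive(Serialize)]
#[serde(rename_all = "camelCase")]
struct AgentResponse {
    text: String,
    tool_calls: Vec<ToolCall>,
}

#[derive(Serialize)]
struct ToolCall {
    name: String,
    arguments: serde_json::Value,
}

async fn respond(Json(req): Json<AgentRequest>) -> Json<AgentResponse> {
    // A real shim would forward req.prompt to the agent framework
    // (Rig, LangChain, ...) and translate its output into this shape.
    let _ = req.prompt;
    Json(AgentResponse {
        text: "2 + 3 is 5.".into(),
        tool_calls: vec![ToolCall {
            name: "add".into(),
            arguments: serde_json::json!({ "x": 2, "y": 3 }),
        }],
    })
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/", post(respond));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3826").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}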

The closest thing to unification is probably A2A, which is still far from being widely adopted. Ideally, a tool like smoleval should be pluggable into any existing agent without any changes to the agent itself, but we're not there yet.

On Agentic Engineering

When building smoltok, I deliberately didn't use coding agents, even though all the main ingredients were in place; I was, however, already enjoying working with Claude Code daily on other projects. If you're only managing a single agent, though, you're falling behind: people now run 2, 5, or 10 parallel sessions of Claude Code and Codex, which turns "vibe-coding" into "agentic engineering" [3].

It's very clear this approach makes you more productive as an individual. But we also need to make sure these tools make you productive as a team. This probably comes down to the same good practices (building the simplest thing that works, changing as few lines of code as possible, and so on, to minimize communication overhead), but I believe it's worth thinking about.

I built this project in ~15h, and I think it's in good shape, somewhere between PoC and MVP. Without coding agents it would probably have taken 2–3x as long, but the real question is whether I would have built it at all. Many people (myself included) have built as many projects in the last several months as they did in the previous several years, and I'd say a big part of it is confidence: code is now cheap, you're allowed to try more things and throw them away, and you more often feel that "you can just build stuff". That's a nice feeling to have.

Conductor

But back to tooling: there's a rise of ADEs (Agentic Development Environments) as a counterpart to IDEs, built around the idea of conveniently managing several parallel instances of coding agents. The best one I've tried so far is Conductor, which is what I used to build smoleval.

Final Thoughts

smoleval itself is a small project and a lot of features could be added or improved, but I think the problem it addresses — lightweight, simple, CI-friendly agent evaluation — will only get more important as agents get more capable and more widely deployed. If you're building agents, give it a try [4].


  1. It's not that this is only true for small models; it's that it's especially true for small models.

  2. Built-in checks cover response content matching (containsAll, containsAny, notContains, exactMatch) and tool usage validation (usedAtLeast, usedAtMost, usedExactly, usedInOrder). You can register custom checks on top of these, such as LLM-as-a-judge; see an example here.

  3. The critical characteristic of vibe-coding is that you don't look at the code and therefore don't understand it, so AI-assisted coding ≠ vibe-coding. I still like to look at the code, but that's increasingly optional as long as you're doing "in-distribution" work from the model's perspective. Another point of view is that AI empowers you both qualitatively (coding in frameworks and languages you don't know, which is especially valuable for non-technical people) and quantitatively (making more of what you can already make, at a different scale).

  4. You should already have something similar developed internally 🙂

#project #rust #smol