Microsoft's ASSERT: Plain-English Tests for AI You Ship

If you have ever shipped an AI feature and then lain awake wondering what it would say to a real user at 2am, Microsoft's new ASSERT tool is aimed squarely at you. On June 2, 2026, Microsoft open-sourced a framework that turns a plain-English description of how your AI should behave into actual, runnable tests. No bespoke test harness, no hand-labelled dataset to start.

I read the original TechCrunch report and my first thought was not "neat demo." It was: this is the part of AI work that small teams skip, and skipping it is exactly what burns them.

🔍 What ASSERT actually does

ASSERT stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing. The idea is simple to state. You write down, in normal language, the behaviours you expect and the policies your app must follow. ASSERT converts that into sets of acceptable and unacceptable behaviours, generates test cases for each, runs them against your system, and scores the results.

The part I care about most: it records the execution path, including intermediate actions and tool calls. So when a test fails, you can see where the model went wrong, not just that it did.
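To make the value of a recorded execution path concrete, here is a minimal sketch in Python. It is not ASSERT's API: the `Step` and `Trace` classes, the `first_violation` helper, and the tool names are all hypothetical, invented only to show why a step-by-step log tells you where a run went wrong.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    kind: str    # "tool_call" or "message" (hypothetical categories)
    name: str    # tool name, or "assistant" for a message
    detail: str  # arguments or message text


@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)

    def record(self, kind: str, name: str, detail: str) -> None:
        self.steps.append(Step(kind, name, detail))


def first_violation(trace: Trace, forbidden_tools: set[str]) -> Optional[int]:
    """Return the index of the first step that calls a forbidden tool, if any."""
    for i, step in enumerate(trace.steps):
        if step.kind == "tool_call" and step.name in forbidden_tools:
            return i
    return None


# A made-up run of a support bot that is not allowed to issue refunds itself.
trace = Trace()
trace.record("tool_call", "lookup_order", "order_id=1042")
trace.record("tool_call", "issue_refund", "order_id=1042 amount=full")
trace.record("message", "assistant", "Your refund is on its way.")

bad_step = first_violation(trace, forbidden_tools={"issue_refund"})
print(bad_step)  # → 1
```

The final message alone looks harmless. The trace shows that the problem happened one step earlier, at the tool call.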

Step	What you do	What ASSERT does
1. Spec	Describe expected behaviour in plain English	Parses intent into acceptable / unacceptable sets
2. Generate	—	Builds test scenarios automatically
3. Run	Point it at your system	Executes scenarios, logs every tool call
4. Score	Read the report	Scores results, flags regressions
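The four steps can be sketched as a toy loop. This is my own illustration under loud assumptions, not how ASSERT is implemented: the `Rule` class, `run_spec`, and `toy_bot` are hypothetical names, the scenarios are hand-written rather than generated, and the scorer is a crude keyword check standing in for whatever scoring the real framework uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    prompts: list[str]                    # step 2: scenarios (hand-written here)
    is_acceptable: Callable[[str], bool]  # step 4: toy scorer for one reply


def run_spec(rules: list[Rule], system: Callable[[str], str]) -> dict[str, float]:
    """Step 3 and 4: run every scenario and return a pass rate per rule."""
    report = {}
    for rule in rules:
        passed = sum(rule.is_acceptable(system(p)) for p in rule.prompts)
        report[rule.name] = passed / len(rule.prompts)
    return report


def toy_bot(prompt: str) -> str:
    # Stand-in for the application under test.
    if "refund" in prompt.lower():
        return "I guarantee a full refund today."
    return "Let me check that order for you."


# Step 1: the spec, written as named rules.
rules = [
    Rule(
        name="never_guarantee_refund",
        prompts=["Can I get a refund?", "Where is my parcel?"],
        is_acceptable=lambda reply: "guarantee" not in reply.lower(),
    ),
    Rule(
        name="offers_to_help",
        prompts=["Where is my parcel?"],
        is_acceptable=lambda reply: "check" in reply.lower(),
    ),
]

report = run_spec(rules, toy_bot)
print(report)  # → {'never_guarantee_refund': 0.5, 'offers_to_help': 1.0}
```

Even in this toy form, the report is per rule rather than one overall number, which is what makes a failure actionable.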

Key takeaway: ASSERT moves AI testing from "write code to test the model" to "describe the behaviour and let the framework write the tests." For a two-person team, that difference is the gap between doing evals and never getting around to them.

💡 Why this lands differently for small teams

There is already a serious world of AI evaluation: Stanford's HELM, MLCommons' AILuminate, and METR benchmarks. Those measure broad model capability and safety. They answer "is this model generally good and generally safe?"

That is not the question most of us are actually asking. As Sarah Bird, Microsoft's Chief Product Officer of Responsible AI, put it:

"if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific."

That is the whole point. If you build a support chatbot for a Colombo logistics company, no public benchmark knows your refund policy, your tone rules, or the three things the bot must never promise a customer. ASSERT is designed to fill exactly that gap, where general evaluations cannot, because the right behaviour is shaped by your product's context, policies, and tools.

For a freelancer or a small studio in Sri Lanka, that matters for a practical reason: you do not have a QA team. Your eval budget is your own evenings. A tool that generates the test cases from a written spec is doing the labour you would otherwise never find time for.

⚡ Where it fits in a real workflow

ASSERT is meant to be used at three points, and each maps to a problem I have actually hit:

During development: catch bad behaviour before a client ever sees it.
Post-deployment: confirm the version you shipped behaves like the version you tested.
Continuous monitoring: re-run the same spec when you swap models or bump a prompt, so a "small" change does not silently break something.

That third one is the quiet killer. You change a system prompt to fix one complaint, and three other behaviours drift. Regression testing for AI has been awkward precisely because the output is fuzzy. A spec-driven scorer is a sane answer to that.
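Here is a rough sketch of what "flag the drift" can look like once you have per-rule pass rates from two runs. The function name `find_regressions`, the `tolerance` parameter, the rule names, and every number below are made up for illustration. None of it comes from ASSERT or from real measurements.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, float]]:
    """Return rules whose pass rate dropped by more than `tolerance`."""
    regressions = {}
    for rule, old_rate in baseline.items():
        # A rule missing from the new run counts as fully failing.
        new_rate = candidate.get(rule, 0.0)
        if old_rate - new_rate > tolerance:
            regressions[rule] = (old_rate, new_rate)
    return regressions


# Illustrative pass rates: before and after a system prompt change.
baseline = {"never_guarantee_refund": 0.98, "polite_tone": 0.95, "cites_policy": 0.90}
candidate = {"never_guarantee_refund": 0.99, "polite_tone": 0.80, "cites_policy": 0.88}

regressions = find_regressions(baseline, candidate)
print(regressions)  # → {'polite_tone': (0.95, 0.8)}
```

The tolerance is there because fuzzy outputs wobble from run to run. A small dip in one rule is noise, while a fifteen-point drop in tone after a prompt tweak is the kind of drift worth blocking a release on.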

Bottom line: the same spec runs in dev, in production checks, and every time you change a model. Write it once, keep it honest forever.

If you are building anything that speaks to users, say a narration feature or a voice assistant, the testing discipline is the same. Our own free AI tools are deliberately client-side and deterministic, which sidesteps a lot of this. But the moment an LLM is in the loop, "it worked when I tried it" stops being evidence, and a framework like ASSERT becomes the difference between a demo and a product.

🛠️ Getting started without spending money

The best part for a learning budget: it is open source, on GitHub at github.com/responsibleai/ASSERT. That means you can read the code, run it locally, and learn the method even if you never adopt the tool wholesale.

Here is how I would approach it as a student or solo builder:

Clone and read first. The value is partly the framework and partly the mental model of turning policy into tests.
Write your spec in English before any code. Forcing yourself to list "must never" rules is a useful exercise on its own.
Start with one feature. Don't try to spec your whole app. Pick the riskiest behaviour and test that.
Watch your token spend. Generating and running scenarios means model calls. On a free tier, run small batches and estimate the cost before scaling up.
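For the token spend point, a back-of-the-envelope estimate is enough. The sketch below is plain arithmetic with placeholder numbers: the scenario count, calls per scenario, token counts, and prices are all assumptions I picked for illustration, so swap in your provider's real pricing.

```python
def estimate_cost(
    scenarios: int,
    calls_per_scenario: int,
    tokens_in: int,
    tokens_out: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Estimate the cost of one eval run, in the same currency as the prices."""
    calls = scenarios * calls_per_scenario
    cost_in = calls * tokens_in * price_in_per_million / 1_000_000
    cost_out = calls * tokens_out * price_out_per_million / 1_000_000
    return cost_in + cost_out


# Hypothetical batch: 50 scenarios, 3 model calls each
# (for example generate, run, score), at placeholder prices.
cost = estimate_cost(
    scenarios=50,
    calls_per_scenario=3,
    tokens_in=800,
    tokens_out=300,
    price_in_per_million=1.0,
    price_out_per_million=4.0,
)
print(f"${cost:.2f}")  # → $0.30
```

Multiply that by how often you plan to re-run the spec and you know whether a batch size is affordable before you start.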

One honest caveat: I have not run ASSERT in production, and the TechCrunch piece is an announcement, not a long-term review. Treat early adoption as learning, not as a guarantee. Read the repo's own docs before you trust it with anything a paying client sees.

🌐 What this means for you

For a Sri Lankan engineer or small team, the lesson is bigger than one Microsoft release. The industry is admitting out loud that general benchmarks do not tell you whether your AI feature is safe to ship. Application-specific testing is becoming the standard, and the tooling to do it is now free and open.

If you ship AI features, start writing behaviour specs in plain English today, with or without ASSERT.
If you are learning, read the repo to understand how spec-driven scoring works. It is a skill that will show up in job interviews soon.
If you are choosing what to trust, prefer systems whose makers can show you their evals, not just their demos.

The tools to test AI properly used to live inside big labs. As of this week, one of them lives in a public GitHub repo. That is the part worth paying attention to.
