How this is measured
Every rule we ship is exercised by a real attack technique against a live TinySocs install, on a schedule, in public. This page explains exactly what the numbers on the dashboard mean — including the parts that are unflattering.
What runs
We use Atomic Red Team — an open library of small, self-contained tests, each mapped to a MITRE ATT&CK technique. For each rule in the detection pack that has a mapped technique, the harness:
- executes the Atomic Red Team test for that technique on the machine (e.g. simulates failed logons for brute-force, dumps LSASS for credential access);
- waits for the TinySocs detection pipeline to process the resulting Windows events;
- queries OpenSearch for an alert carrying the expected rule ID;
- records the outcome with a timestamp and the commit that produced it.
The harness source is open:
tests/Test-AtomicDetection.ps1,
and the technique→rule mapping is
tests/atomic-tests.yaml.
What counts as a pass
A test passes when at least one of its expected rules fires an alert within the timeout. That is detection at the technique level: the attack happened, and the pack caught it. If nothing fires, it's a miss.
The categories
Each result lands in exactly one bucket. The dashboard colours by these:
| Category | Meaning |
|---|---|
| Detected | An expected rule fired. The pack caught the attack. |
| Missed | A rule should have fired and didn't. The only alarming category. Every miss gets a written postmortem. |
| Skip — platform | The test environment can't run this test (e.g. it needs a Domain Controller, Sysmon, or the FIM module). Not a rule defect. |
| Skip — prerequisite | An environmental precondition isn't met (e.g. Tamper Protection is on, blocking a Defender change). Expected behaviour, not a defect. |
| Error | The harness itself hit a problem (network, Atomic Red Team install, a query failure). Investigated, but not a rule failure. |
The venue
Validation currently runs on a single Windows 11 workstation with TinySocs, OpenSearch, and Sysmon installed. That's an honest limit: rules that need a Domain Controller (e.g. NTDS extraction) or the optional FIM module show as platform-skips here, not passes. Multi-platform validation (Windows Server, additional OS versions) is planned, not yet live.
Coverage, named
Not every rule in the pack has an Atomic Red Team test yet. The dashboard counts and lists the untested rules rather than hiding them, and we add coverage on a weekly cadence. A rule without a test is marked untested — it is not counted as a pass.
When we miss
A first-time miss is the failure mode that matters, so we treat it openly: same-week investigation, a written postmortem committed alongside the results, and a link from the affected rule. "Yes, we missed; here's why; here's the fix" is the whole point of publishing this.
The raw data
Every weekly run is committed as a JSON file you can read or diff:
results/.
The dashboard is built from those files and nothing else — there's no
separate, prettier source of truth.