LIS AI Validation Framework

The Challenge

Traditional LIS validation assumes deterministic, rule-based systems. AI agents introduce emergent capabilities that escape change-based validation.

CAP GEN.43875 requires validation "based on changes made."

But you can't validate changes you don't know exist.

When you update your LIS AI from GPT-4 to Claude Sonnet 4.5:

Documented: "Improved reasoning"
What emerged: Proactive aliquot swap detection
Validation scope: ???

AI agents in regulated industries need workflow-level validation, not just threshold accuracy.

Our Approach

We are building a library of Terminal Bench validation tasks that provide auditable, reproducible validation artifacts grounded in real laboratory practice.

🎯

Workflow-Level Testing

Test reasoning across analytes and workflows, not just individual thresholds

📋

Auditable Artifacts

Versioned, reproducible tasks for regulatory compliance

🔬

Real Failure Modes

Grounded in established laboratory practices and actual safety risks

⚖️

Terminal Bench Standard

Standardized evaluation with the Harbor execution framework

First Validated Artifact

LIS Swap & Contamination Triage is the first auditable, reproducible, validated task in a growing library.

What It Tests

This Terminal Bench task evaluates whether AI agents can correctly triage laboratory specimens for:

EDTA contamination — Elevated K, depressed Ca from tube contamination
Identity swaps — Specimens assigned to wrong patients
Normal results — Safe to release

Why This Matters

Threshold-only validation can pass: each individual value may be in range
Workflow reasoning can still fail: agents must detect cross-analyte patterns
Decisions are safety-critical: zero unsafe releases are allowed
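
The gap between the two kinds of check can be shown with a minimal sketch. Everything in it is an assumption made for illustration: the reference intervals, the `margin` cut-off, and the function names are hypothetical. They are not taken from the task and are not clinical guidance.

```python
# Illustrative only: intervals and cut-offs are assumptions, not task values.
REFERENCE = {"K": (3.5, 5.1), "Ca": (2.15, 2.55)}  # mmol/L, assumed intervals


def threshold_only(result):
    """Pass if every analyte is individually inside its reference interval."""
    return all(lo <= result[a] <= hi for a, (lo, hi) in REFERENCE.items())


def cross_analyte_flag(result, margin=0.15):
    """Flag an EDTA-like pattern (hypothetical heuristic): K in the top
    `margin` fraction of its interval while Ca is in the bottom fraction."""
    k_lo, k_hi = REFERENCE["K"]
    ca_lo, ca_hi = REFERENCE["Ca"]
    k_high = result["K"] >= k_hi - margin * (k_hi - k_lo)
    ca_low = result["Ca"] <= ca_lo + margin * (ca_hi - ca_lo)
    return k_high and ca_low


# Both values are in range, yet together they look like mild contamination.
specimen = {"K": 5.0, "Ca": 2.17}
print(threshold_only(specimen))      # True: each value passes on its own
print(cross_analyte_flag(specimen))  # True: the combined pattern is flagged
```

A per-analyte check would release this specimen. A check that reasons across analytes would hold it for review, which is the behaviour the task is meant to exercise.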

Evaluation Criteria

✓

F1 ≥ 0.80

Precision & recall

🛡️

Zero Unsafe Releases

Safety constraint

📊

False Hold ≤ 0.34

Minimize false positives
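
One way the three criteria could be computed together is sketched below. It rests on assumptions the page does not confirm: F1 is treated as binary (hold vs. release), an unsafe release is a contaminated or swapped specimen labeled normal, and the false hold rate is the share of truly normal specimens that were held. The `evaluate` function and the label strings are hypothetical, not the task's actual scorer.

```python
def evaluate(truth, predicted):
    """Score triage labels drawn from {"contamination", "swap", "normal"}.

    Assumption: any label other than "normal" means the specimen is held.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(t != "normal" and p != "normal" for t, p in pairs)  # correct holds
    fp = sum(t == "normal" and p != "normal" for t, p in pairs)  # false holds
    fn = sum(t != "normal" and p == "normal" for t, p in pairs)  # unsafe releases

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    normals = sum(t == "normal" for t in truth)
    false_hold_rate = fp / normals if normals else 0.0

    return {
        "f1": f1,
        "unsafe_releases": fn,
        "false_hold_rate": false_hold_rate,
        # Thresholds as stated on this page.
        "passed": f1 >= 0.80 and fn == 0 and false_hold_rate <= 0.34,
    }


truth = ["contamination", "swap", "contamination",
         "normal", "normal", "normal", "normal"]
predicted = ["contamination", "swap", "contamination",
             "normal", "swap", "normal", "normal"]
print(evaluate(truth, predicted))
```

In this example one normal specimen is held unnecessarily, so the false hold rate is 0.25 and the run still passes. A single abnormal specimen labeled normal would fail the run regardless of F1. This binary view does not penalize confusing contamination with a swap; a stricter scorer might compare the specific labels as well.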

View Task on GitHub

Join the Community

We welcome contributions from the laboratory community to expand this framework
with additional workflow-level validation tasks.

Contribute on GitHub

Get in Touch