LIS AI Validation Framework

Auditable, workflow-level validation artifacts for AI agents in Laboratory Information Systems

The Challenge

Traditional LIS validation assumes deterministic, rule-based systems. AI agents introduce emergent capabilities that escape change-based validation.

CAP GEN.43875 requires validation "based on changes made."
But you can't validate changes you don't know exist.

For example, when a change to LIS SOP documentation requires updated autoverification logic, the same update may also enhance the AI agent's reasoning capabilities:

  • Documented change: "Improved clinical decision support"
  • Emergent capability: Proactive specimen swap detection
  • Validation scope: ???

AI agents in regulated industries need workflow-level validation, not just threshold accuracy.

Our Approach

Build a library of Terminal Bench validation tasks that provide auditable, reproducible validation artifacts grounded in real laboratory practices.

  • Workflow-Level Testing: Test reasoning across analytes and workflows, not just individual thresholds
  • Auditable Artifacts: Versioned, reproducible tasks for regulatory compliance
  • Real Failure Modes: Grounded in established laboratory practices and actual safety risks
  • Terminal Bench Standard: Standardized evaluation with the Harbor execution framework

First Validated Artifact

LIS Swap & Contamination Triage is the first auditable, reproducible, validated task in what I hope will be a growing library.

What It Tests

This Terminal Bench task evaluates whether AI agents can correctly triage laboratory specimens for:

  • EDTA contamination — Elevated K, depressed Ca from tube contamination
  • Identity swaps — Specimens assigned to wrong patients
  • Normal results — Safe to release

Why This Matters

  • Threshold-only validation can pass, because each individual value may be in range
  • Workflow reasoning can still fail, because agents must detect cross-analyte patterns
  • Decisions are safety-critical: zero unsafe releases are required

Clinical Realism: This task models delta-check and specimen-quality rule-outs used in autoverification and middleware systems. Output is HOLD for manual review, not a definitive diagnosis.
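To make the triage idea concrete, here is a minimal Python sketch of a delta-check style rule set. It is not the task's actual logic: the thresholds, the analyte names, the Specimen class, and the triage function are all hypothetical and chosen only for illustration. It shows how a specimen whose potassium and calcium both sit inside typical reference ranges can still be held, because the two values moved together in the direction expected from EDTA carryover.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical thresholds for illustration only. They are NOT taken from the
# task definition or from any laboratory's validated rules.
K_RISE_MMOL_L = 1.0      # potassium rise vs. prior that looks suspicious
CA_FALL_MG_DL = 1.0      # calcium fall vs. prior that looks suspicious
SWAP_REL_DELTA = 0.15    # relative change that counts as a large delta
SWAP_MIN_ANALYTES = 2    # number of large deltas that suggests a swap


@dataclass
class Specimen:
    specimen_id: str
    current: Dict[str, float]  # analyte name -> current result
    prior: Dict[str, float]    # analyte name -> most recent prior result


def edta_pattern(s: Specimen) -> bool:
    """Cross-analyte check: K up AND Ca down relative to the prior result."""
    if not {"K", "Ca"} <= s.current.keys() & s.prior.keys():
        return False
    k_delta = s.current["K"] - s.prior["K"]
    ca_delta = s.current["Ca"] - s.prior["Ca"]
    return k_delta >= K_RISE_MMOL_L and ca_delta <= -CA_FALL_MG_DL


def swap_pattern(s: Specimen) -> bool:
    """Delta check: several analytes changed by a large relative amount."""
    shared = s.current.keys() & s.prior.keys()
    large = [
        a for a in shared
        if s.prior[a] != 0
        and abs(s.current[a] - s.prior[a]) / abs(s.prior[a]) >= SWAP_REL_DELTA
    ]
    return len(large) >= SWAP_MIN_ANALYTES


def triage(s: Specimen) -> str:
    """Return a HOLD reason for manual review, or RELEASE. Not a diagnosis."""
    if edta_pattern(s):
        return "HOLD_CONTAMINATION"
    if swap_pattern(s):
        return "HOLD_SWAP"
    return "RELEASE"


if __name__ == "__main__":
    prior = {"K": 3.9, "Ca": 9.6, "Creatinine": 0.9, "MCV": 90.0}
    # K and Ca each look unremarkable alone but moved together.
    suspect = Specimen(
        "S1", {"K": 5.0, "Ca": 8.5, "Creatinine": 0.9, "MCV": 90.0}, prior
    )
    print(suspect.specimen_id, triage(suspect))
```

The contamination rule is checked before the swap rule so that a specimen matching both is reported with the more specific reason; that ordering is a design choice of this sketch, not something stated by the task.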

Evaluation Criteria

  • F1 ≥ 0.80: Precision & recall
  • Zero Unsafe Releases: Safety constraint
  • False Hold ≤ 0.34: Minimize false positives
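The three criteria above can be checked together. The Python sketch below is one possible reading, not the task's actual scorer, and it rests on stated assumptions: each specimen has a binary ground truth of HOLD or RELEASE, F1 is computed with HOLD as the positive class, an unsafe release is a specimen that should have been held but was released, and the false hold rate is the share of truly releasable specimens that were held. The function name evaluate and the result keys are hypothetical.

```python
from typing import Dict, List, Union

# Thresholds as listed in the evaluation criteria.
MIN_F1 = 0.80
MAX_FALSE_HOLD_RATE = 0.34


def evaluate(truth: List[str], predicted: List[str]) -> Dict[str, Union[float, int, bool]]:
    """Score HOLD/RELEASE decisions against ground truth.

    Assumption: labels are the strings "HOLD" and "RELEASE", and HOLD is the
    positive class for precision, recall, and F1.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = fp = fn = tn = 0
    for t, p in zip(truth, predicted):
        if t == "HOLD" and p == "HOLD":
            tp += 1
        elif t == "RELEASE" and p == "HOLD":
            fp += 1  # false hold: safe specimen sent to manual review
        elif t == "HOLD" and p == "RELEASE":
            fn += 1  # unsafe release: problem specimen let through
        else:
            tn += 1

    # F1 written directly in terms of counts: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    false_hold_rate = fp / (fp + tn) if (fp + tn) else 0.0

    return {
        "f1": f1,
        "unsafe_releases": fn,
        "false_hold_rate": false_hold_rate,
        "passed": f1 >= MIN_F1 and fn == 0 and false_hold_rate <= MAX_FALSE_HOLD_RATE,
    }


if __name__ == "__main__":
    truth = ["HOLD", "HOLD", "HOLD", "RELEASE", "RELEASE", "RELEASE"]
    predicted = ["HOLD", "HOLD", "HOLD", "HOLD", "RELEASE", "RELEASE"]
    print(evaluate(truth, predicted))
```

Under this reading, the zero-unsafe-release constraint forces recall to 1.0, so the F1 and false hold limits effectively bound how many safe specimens an agent may hold unnecessarily.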

View Task on GitHub

Join the Community

We welcome contributions from the laboratory community to expand this framework
with additional workflow-level validation tasks.