SDVM — Find workflow degradation. Decide where to act.

The expensive failures look normal

In long-horizon agentic workflows, steps get skipped without errors. Repairs pile up across cycles. Handoffs add noise. Outputs still look plausible. By the time a visible failure appears, the workflow has often already drifted—and output-only evaluation misses these mid-trajectory problems.

Silent degradation is a path, not a single event

The failure signal arrives late. The drift started earlier. SDVM is built to surface that earlier path.

Failure paths

Expected path: Stable execution
Silent drift (problem space): skipped steps, repair pressure, and handoff noise accumulate while the run still looks plausible.
Visible failure: The failure signal appears late

The decision gap

Observability — records what happened
Outcome metrics — show final results
Open question — where should you intervene first?
SDVM — turns that gap into a decision

Why this matters now

Agentic workflows are moving into production. Reliability, governance, and failure diagnosis have not caught up. Recent research and enterprise analysis describe the same gap SDVM is built to examine.

AI agent reliability

Aggregate success metrics can hide critical operational failures—inconsistency, weak robustness, unpredictability, and unsafe error behavior.

Towards a Science of AI Agent Reliability (arXiv preprint)

Read the paper

Enterprise AI trust

Agentic systems no longer stop at outputs. They trigger actions, call tools, and affect downstream systems—raising the cost of weak monitoring and governance.

State of AI trust in 2026: Shifting to the agentic era (McKinsey)

Read the report

What SDVM helps you decide

Beyond observability

Observability answers "What happened?"

SDVM helps answer the next question: where is the workflow degrading, and what should we change first?

It sits on top of your existing tools and turns workflow evidence into decisions you can act on:

Locate structural degradation before the failure becomes obvious
Turn traces into a structured diagnosis, not another event stream
Prioritize the edges where friction concentrates
Measure whether a targeted intervention actually moved the signals
See where richer capture would unlock a stronger diagnosis

Diagnostic depth grows with evidence quality. When the evidence is thin, the report says so—so you know when to refine capture before drawing a stronger conclusion.

How it works

Find the problem. Act on one edge. Verify the result.

Workflow: frame the recurring task, stages, and comparability conditions.
Workflow evidence: work from the traces and logs already available.
Evidence sufficiency: determine what can be computed, what can be interpreted, and what must remain open.
Deterministic diagnostic computation: compute stable diagnostic measures (thirteen canonical variables) from that evidence, each with its own input requirements.
Diagnostic assessment: surface hotspots and signal strength so you know where to act.
PRE / POST / DELTA: test one intervention on flagged edges, then compare the change, or refine capture where the evidence was insufficient.
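
As an illustration of the evidence-sufficiency step, the sketch below checks which variables can be computed from the fields a trace actually captures and which must remain open. It is a minimal sketch under stated assumptions, not SDVM's implementation: the `VariableContract` type, the variable names, and the field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableContract:
    """Hypothetical input contract: the trace fields one variable needs."""
    name: str
    required_fields: frozenset


def check_sufficiency(contracts, captured_fields):
    """Split variables into computable ones and ones that must remain open."""
    captured = set(captured_fields)
    computable = []
    open_items = {}
    for contract in contracts:
        missing = contract.required_fields - captured
        if missing:
            # Record what is missing so capture can be refined later.
            open_items[contract.name] = sorted(missing)
        else:
            computable.append(contract.name)
    return computable, open_items


# Illustrative contracts only; real variables and fields are not given here.
contracts = [
    VariableContract("handoff_noise", frozenset({"edge", "handoff_payload"})),
    VariableContract("repair_pressure", frozenset({"edge", "cycle", "event_type"})),
    VariableContract("step_skips", frozenset({"step_id", "required_steps"})),
]

computable, open_items = check_sufficiency(
    contracts, ["edge", "cycle", "event_type", "step_id"]
)
print(computable)   # ['repair_pressure']
print(open_items)   # {'handoff_noise': ['handoff_payload'], 'step_skips': ['required_steps']}
```

Returning the missing fields per variable, rather than a bare yes/no, is what lets a report say where richer capture would unlock a stronger diagnosis.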

A report that drives the next move

A typical cycle is simple: find the problem on a specific edge, make one narrow change, then check whether the signals moved.

Synthetic example (not client data): assisted-pilot cycle

1. Problem — PRE diagnosis

Critical edge: planner → executor handoff. Other edges stayed comparatively quiet.

Edge	Signal	PRE diagnosis
`planner → executor`	Handoff noise (incomplete or unstable handoffs)	4 of 7 flagged
`planner → executor`	Repair pressure (mid-run fixes and rework)	3.1 / cycle
Other edges	Step skips (required steps skipped without an explicit decision)	low / stable
Workflow-level	Evidence strength (how much of the diagnosis the available evidence can support)	0.51 · interpretation limit: Medium

2. Intervention — the adjustment

Decision from PRE: retune the planner → executor checkpoint contract—not the whole workflow.

The POST run below measures that change—and only that change.

3. Result — POST / DELTA

Comparable runs after the intervention on the same edge.

Edge	Signal	PRE	POST	DELTA
`planner → executor`	Handoff noise	4 of 7 flagged	1 of 7 flagged	−75 %
`planner → executor`	Repair pressure	3.1 / cycle	1.2 / cycle	−61 %
Other edges	Step skips	low / stable	low / stable	≈ 0
Workflow-level	Evidence strength	0.51	0.74	+0.23

Problem → intervention → result. Example only; real reports depend on capture quality, workflow structure, and how much the available evidence can support.
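
The DELTA column in the synthetic table can be reproduced with simple arithmetic. The sketch below assumes that the two edge-level signals are reported as relative change against PRE and that evidence strength is reported as an absolute difference. That reading matches the numbers shown, but the function names are hypothetical and this is not SDVM's actual code.

```python
from fractions import Fraction


def relative_delta(pre, post):
    """Relative change from PRE to POST, as a percentage of PRE."""
    if pre == 0:
        raise ValueError("relative change is undefined when PRE is zero")
    return float((post - pre) / pre * 100)


def absolute_delta(pre, post):
    """Plain difference, used here for scores on a fixed scale."""
    return post - pre


# Values taken from the synthetic PRE / POST table.
handoff = relative_delta(Fraction(4, 7), Fraction(1, 7))  # 4 of 7 -> 1 of 7
repair = relative_delta(3.1, 1.2)                         # repairs per cycle
evidence = absolute_delta(0.51, 0.74)                     # evidence strength

print(f"handoff noise:     {handoff:+.0f} %")   # -75 %
print(f"repair pressure:   {repair:+.0f} %")    # -61 %
print(f"evidence strength: {evidence:+.2f}")    # +0.23
```

The comparison only means something when the PRE and POST runs are comparable and both values come from the same computation.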

Why you can trust the diagnosis

A deterministic core

Stable diagnostic measures are computed with fixed, reproducible rules. That is what makes PRE / POST comparison meaningful: the same method before and after the intervention.

Interpretation sits on top of those results. AI is not the origin of a diagnosis. Any future AI assistance is designed to operate on top of that core, never to replace it.
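
To make "fixed, reproducible rules" concrete, here is a small sketch of a measure computed as a pure function of trace events. The event schema (`cycle`, `edge`, `type`) and the definition of repair pressure as mean repair events per cycle are assumptions made for illustration, not the SDVM specification.

```python
def repair_pressure(events, edge):
    """Mean number of repair events per cycle on one edge.

    Pure function: no randomness, no clock, no model call, so the same
    events always yield the same value regardless of their order.
    """
    cycles = {event["cycle"] for event in events}
    if not cycles:
        return 0.0
    repairs = sum(
        1 for event in events
        if event["edge"] == edge and event["type"] == "repair"
    )
    return repairs / len(cycles)


# Hypothetical trace: two cycles, three repairs on the flagged edge.
EDGE = "planner->executor"
events = [
    {"cycle": 1, "edge": EDGE, "type": "repair"},
    {"cycle": 1, "edge": EDGE, "type": "handoff"},
    {"cycle": 1, "edge": EDGE, "type": "repair"},
    {"cycle": 2, "edge": EDGE, "type": "repair"},
    {"cycle": 2, "edge": "executor->reviewer", "type": "handoff"},
]

pre = repair_pressure(events, EDGE)
again = repair_pressure(list(reversed(events)), EDGE)
print(pre, again)  # 1.5 1.5
```

Because the function has no hidden state, any difference between a PRE value and a POST value comes from the evidence rather than from the method.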

Four dimensions frame the diagnosis. Underneath, thirteen canonical variables do the measuring.

Four diagnostic dimensions

Synchrony — alignment across steps, tools and handoffs.
Depth — continuity of context and reasoning across cycles.
Vulnerability — whether the workflow registers uncertainty, explores it, and updates from what it finds—instead of converging early.
Metacognition — whether the workflow detects when it is stuck, runs an explicit correction, and checks that the revision mattered.

Claim governance

You can trust the diagnosis because SDVM does not extrapolate past what the evidence supports. Weak support is flagged. Overclaims are blocked.
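
One way to picture claim governance is a gate that compares the strength of a claim with what the evidence can support. The sketch below is an assumption-laden illustration, not SDVM's rule set: the cut-offs, the claim levels, and the function names are hypothetical. The cut-offs were chosen only so that 0.51 maps to "Medium", as in the synthetic example.

```python
# Hypothetical cut-offs: the mapping from strength to limit is not
# published here; these values are chosen for illustration only.
LIMITS = [(0.70, "High"), (0.40, "Medium"), (0.0, "Low")]
ORDER = {"Low": 0, "Medium": 1, "High": 2}

# Hypothetical claim levels and the minimum interpretation limit each needs.
CLAIM_REQUIRES = {
    "observation": "Low",
    "hotspot_flagged": "Medium",
    "semantic_diagnosis": "High",
}


def interpretation_limit(strength):
    """Map an evidence-strength score in [0, 1] to an interpretation limit."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("evidence strength must be between 0 and 1")
    for cutoff, label in LIMITS:
        if strength >= cutoff:
            return label


def govern(claim_level, strength):
    """Allow a claim only if the evidence supports its level."""
    limit = interpretation_limit(strength)
    allowed = ORDER[limit] >= ORDER[CLAIM_REQUIRES[claim_level]]
    return {
        "claim": claim_level,
        "limit": limit,
        "status": "allowed" if allowed else "blocked",
    }


flagged = govern("hotspot_flagged", 0.51)
semantic = govern("semantic_diagnosis", 0.51)
print(flagged["status"], semantic["status"])  # allowed blocked
```

The gate never upgrades a claim. When evidence is thin, it can only hold a claim at a weaker level or block it.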

What is ready today

You can already run an assisted pilot today: ingest a workflow, produce a PRE diagnosis, test one change, and compare POST. Deeper deterministic diagnostic computation is implemented and tested—but not yet the default integrated report path.

Current foundation

What you can use in an assisted pilot now—plus tested modules not yet on the default path.

Workflow evidence ingestion into a canonical format
Assisted PRE / POST / DELTA reporting for comparable runs
Rule-based recommendations that cite their support and do not claim causality
Claim-governance controls that keep the diagnosis proportional to the available evidence
Bounded interpretation with explicit evidence strength and limits on the assisted report path
Tested deterministic computation for the thirteen canonical variables—available as modules when input contracts are met, not yet wired as the default product report

Active development

Being matured now—including integration of the modules above.

Wiring canonical computation into the default diagnostic report path
Diagnostic Evidence — qualifying computed values as diagnostic evidence
Combinatorial diagnosis — substantive cross-variable interpretation
Recommendation engine — beyond today’s rule-based playbooks
Instrumentation planner — ranking missing capture by diagnostic gain
Evidence-sufficiency assessment across the full variable set
Longitudinal diagnostics across the full canonical variable set

On the roadmap

Later, downstream layers.

AI-assisted interpretation on top of the deterministic core
End-to-end orchestration across the full pipeline
Workflow design assurance for workflows still being built

Validation boundary

Diagnostic depth grows with the quality of available instrumentation.
Conclusions remain proportional to the available evidence.
Improvement is measured in comparable PRE / POST runs—not assumed.
Stronger semantic diagnoses require richer capture than traces alone provide.

A narrow private pilot

The goal of a pilot is a sharper operating loop: find the friction, choose one intervention, and see whether it moved the signals—without months of trial and error across the whole workflow.

Is your workflow ready?

Strong candidates have traceable multi-step runs, observable repairs or handoffs, and a technical owner who can act on findings. Coding and bugfix-style workflows are a preferred early track—but the fit is structural, not domain-specific.

What you need

One recurring agentic workflow, mapped and scoped
Traces via Langfuse or an agreed surface (current assisted-pilot path)
Enough comparable runs for baseline and follow-up
Observable repairs, revisions, handoffs or skipped steps
One technical owner to review findings and test an intervention
Willingness to share anonymized traces or metadata

What you gain

A clearer view of where structural friction concentrates
A prioritized intervention path instead of broad guesswork
An objective PRE / POST comparison of whether the change worked
Explicit sufficiency findings so you know when to refine capture
A shorter cycle from symptom to next controlled test

How a pilot runs

Frame the workflow and comparability conditions.
Review what the available traces actually capture.
Check evidence sufficiency for this workflow.
Compute a baseline PRE diagnosis.
Test one tuning step on flagged edges.
Review PRE / POST / DELTA and decide the next cycle.

Typical shape: 2–4 weeks, depending on capture and team cadence.

Scope

SDVM is workflow-engine agnostic. The current assisted path is Langfuse-first; broader surfaces remain under evaluation. The same framing can also assess workflows still being designed—by checking instrumentation and diagnostic readiness before production.

The goal is a useful decision under real or semi-real conditions: where to act, what to test, and whether the signals moved. A pilot is not commercial validation of universal silent-drift detection.

Data handling

Analysis can run on anonymized traces and metadata. Scope, access, and retention are agreed case by case. NDA support is available when required.

Review the pilot intake template · See a readiness probe example

If the workflow looks only partially ready, email a short summary. We can scope capture quality and a progressive instrumentation path—then decide whether a pilot is the right next step.

Check pilot fit

Origin

SDVM is developed by Ibrahim José Jamhour, an independent researcher working on Distributed Relational Cognition and the operational risks of agentic systems. The work builds on published research, a formal SDVM V3 technical specification, and ongoing assisted pilot work.

Jamhour also brings prior executive experience in institutional finance and risk-sensitive operations, including the Stanford Sloan Fellows program.

Find where agentic workflows degrade. Decide where to act. Verify that it worked.