Runtime Trust Signals
https://hvtracker.net/spec/runtime-trust/v0.2
1. Purpose
Supply-chain trust tells you whether an agent's code and packages are what they claim. Runtime trust asks a different question: what can this agent reach once it runs? An agent that ships an MCP server, calls many external providers, or exposes a plugin marketplace has a materially different risk surface than a self-contained library — regardless of how clean its build provenance is.
This spec documents the four runtime-trust signals HVTracker discovers for every tracked agent and the scoring that incorporates them. It was published so the methodology would be public before any of it affected the production ranking.
2. Status: Live in the Production Rank
Since methodology v4.0 (2026-07-02), the runtime-calibrated score in §4 is the production trust_score/rank/evidence_grade — on the leaderboard, agent pages, the /data API, badges, and signed credentials. Promotion followed the evidence gate in §6 (an upset review); the pre-calibration baseline stays visible on the leaderboard for comparison.
v4.1 (2026-07-05) added a soft ceiling and an evidence-first tie-break (§4). No change ships as a silent reweight: every adjustment is documented here and in the methodology.
3. Runtime Discovery Fields
Each tracked agent carries four runtime fields, discovered by static analysis of the repository and its published package metadata. Every field reports a status, a confidence (high / medium / low), and an evidence array of human-readable findings, so any consumer can audit why a value was assigned.
| Field | Statuses | What it captures |
|---|---|---|
| mcp_server_support | implemented, declared, none | Whether the project ships or declares a Model Context Protocol server |
| external_service_dependencies | providers list + requires_api_keys | Third-party services the agent calls at runtime (LLM providers, APIs) |
| tool_plugin_surface | plugin_system: marketplace, extension-based, declared, none; plus tool_tags | How much third-party code the agent can load and execute |
| package_provenance_drift | match, partial, unknown, not_applicable, warning | Whether the published package matches the tracked repository |
These fields are recorded in every daily history snapshot, building an append-only time series of runtime-surface drift per agent.
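The common shape of a runtime field (a status, a confidence, and an evidence array) could be modeled roughly as below. This is a minimal sketch: the class name `RuntimeField` and the validation behaviour are assumptions for illustration, not the tracker's actual schema, and the evidence string in the usage line is invented.

```python
from dataclasses import dataclass, field
from typing import List

# Confidence levels named in this section.
CONFIDENCE_LEVELS = ("high", "medium", "low")


@dataclass
class RuntimeField:
    """Hypothetical container for one runtime field; not the real schema."""

    status: str
    confidence: str
    evidence: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject confidence values outside the three documented levels.
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence: {self.confidence!r}")


# Illustrative usage with a made-up finding.
mcp = RuntimeField(
    status="implemented",
    confidence="high",
    evidence=["example finding: server entry point found in repository"],
)
```

Keeping the evidence next to the status means a consumer can audit a value without a second lookup.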
4. Production Scoring
The runtime-calibrated score is the base trust score plus a bounded runtime adjustment, clamped to [0, 100]. Since methodology v4.0 this IS the production trust_score/rank/evidence_grade. Reference implementation: compute_trust_score_v2 in fetch_and_build.py. The per-dimension adjustments are:
| Dimension | Adjustment |
|---|---|
| MCP server support | implemented +2.0 · declared 0 · none 0 |
| External dependencies | −0.5 per provider beyond the first, capped at −3.0; additional −1.0 if API keys are required |
| Tool/plugin surface | −0.3 per tool tag, capped at −1.5; plus marketplace −1.0, extension-based −0.6, declared −0.3 |
| Provenance drift | match +4.0 · partial +2.0 · unknown/not_applicable 0 · warning −5.0 |
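The table above can be read as four small functions. The sketch below is not the reference `compute_trust_score_v2`; the function names, argument shapes, and breakdown keys are assumptions. It also assumes each cap applies only to the per-item part of a penalty, with the API-key and plugin-system terms added after the cap.

```python
from typing import Dict, List

MCP_ADJUSTMENT = {"implemented": 2.0, "declared": 0.0, "none": 0.0}
PLUGIN_SYSTEM_ADJUSTMENT = {
    "marketplace": -1.0,
    "extension-based": -0.6,
    "declared": -0.3,
    "none": 0.0,
}
PROVENANCE_ADJUSTMENT = {
    "match": 4.0,
    "partial": 2.0,
    "unknown": 0.0,
    "not_applicable": 0.0,
    "warning": -5.0,
}


def external_dependency_adjustment(providers: List[str], requires_api_keys: bool) -> float:
    # -0.5 per provider beyond the first, capped at -3.0.
    extra = max(0, len(providers) - 1)
    penalty = max(-3.0, -0.5 * extra)
    # Assumption: the API-key penalty is added on top of the capped part.
    if requires_api_keys:
        penalty -= 1.0
    return penalty


def tool_surface_adjustment(plugin_system: str, tool_tags: List[str]) -> float:
    # -0.3 per tool tag, capped at -1.5, plus the plugin-system term.
    tag_penalty = max(-1.5, -0.3 * len(tool_tags))
    return tag_penalty + PLUGIN_SYSTEM_ADJUSTMENT[plugin_system]


def runtime_breakdown(
    mcp_status: str,
    providers: List[str],
    requires_api_keys: bool,
    plugin_system: str,
    tool_tags: List[str],
    provenance: str,
) -> Dict[str, float]:
    # Unknown status strings raise KeyError rather than silently scoring 0.
    return {
        "mcp_server_support": MCP_ADJUSTMENT[mcp_status],
        "external_service_dependencies": external_dependency_adjustment(
            providers, requires_api_keys
        ),
        "tool_plugin_surface": tool_surface_adjustment(plugin_system, tool_tags),
        "package_provenance_drift": PROVENANCE_ADJUSTMENT[provenance],
    }


example = runtime_breakdown(
    mcp_status="implemented",
    providers=["provider-a", "provider-b", "provider-c"],
    requires_api_keys=True,
    plugin_system="marketplace",
    tool_tags=["shell", "browser"],
    provenance="match",
)
```

For this hypothetical agent the dimensions come to +2.0 (MCP), −2.0 (external dependencies), −1.6 (tool surface), and +4.0 (provenance).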
Soft ceiling (v4.1). The positive terms above are scaled by remaining headroom — factor = min(1, (100 − base) / 20), i.e. full effect at base ≤ 80, phasing to zero at base 100 — before being added. Penalties are not scaled. This prevents bonuses from clamping multiple strong agents onto an identical 100.0.
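A short sketch of the soft ceiling, assuming the per-dimension adjustments arrive as a mapping from dimension name to signed value. The function names are hypothetical, and the lower bound of 0 on the factor is a defensive assumption for a base above 100, not part of the stated formula.

```python
from typing import Dict


def headroom_factor(base: float) -> float:
    # factor = min(1, (100 - base) / 20); the max(0, ...) guard is an assumption.
    return max(0.0, min(1.0, (100.0 - base) / 20.0))


def calibrated_score(base: float, breakdown: Dict[str, float]) -> float:
    factor = headroom_factor(base)
    # Only positive terms are scaled by headroom; penalties pass through unscaled.
    bonuses = sum(v for v in breakdown.values() if v > 0)
    penalties = sum(v for v in breakdown.values() if v < 0)
    adjusted = base + bonuses * factor + penalties
    # Final clamp to [0, 100].
    return max(0.0, min(100.0, adjusted))


breakdown = {
    "mcp_server_support": 2.0,
    "package_provenance_drift": 4.0,
    "external_service_dependencies": -2.0,
}
score_at_80 = calibrated_score(80.0, breakdown)   # factor 1.0: 80 + 6 - 2
score_at_90 = calibrated_score(90.0, breakdown)   # factor 0.5: 90 + 3 - 2
score_at_100 = calibrated_score(100.0, breakdown)  # factor 0.0: 100 + 0 - 2
```

With the same breakdown, a base of 80 gains the full +6.0 in bonuses, a base of 90 gains half of it, and a base of 100 gains none, while the −2.0 penalty applies in all three cases.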
Tie-break (v4.1). Agents with an exactly equal score are ordered by hardest-to-fake evidence first: trust_confidence → OSSF scorecard_score → signed_commits_ratio → activity/momentum → stars → slug. They share a rank (=N) on the leaderboard.
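The ordering can be expressed as a composite sort key. The sketch below rests on several assumptions: every tie-break field is numeric with higher meaning stronger evidence, the activity/momentum signal lives under a hypothetical `momentum` key, slugs sort ascending, and tied agents share the position of the first agent in their group.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Agent = Dict[str, Any]


def sort_key(agent: Agent) -> Tuple:
    # Negate numeric fields so that higher values sort first.
    return (
        -agent["trust_score"],
        -agent["trust_confidence"],
        -agent["scorecard_score"],
        -agent["signed_commits_ratio"],
        -agent["momentum"],  # hypothetical field name for activity/momentum
        -agent["stars"],
        agent["slug"],
    )


def rank_agents(agents: List[Agent]) -> List[Tuple[str, str]]:
    ordered = sorted(agents, key=sort_key)
    ranks: List[int] = []
    for position, agent in enumerate(ordered, start=1):
        # Agents with an exactly equal score share the rank of the first in the group.
        if ranks and ordered[position - 2]["trust_score"] == agent["trust_score"]:
            ranks.append(ranks[-1])
        else:
            ranks.append(position)
    counts = Counter(ranks)
    # Shared ranks are displayed as "=N", unique ranks as "N".
    return [
        (f"={rank}" if counts[rank] > 1 else str(rank), agent["slug"])
        for rank, agent in zip(ranks, ordered)
    ]


agents = [
    {"slug": "a", "trust_score": 95.0, "trust_confidence": 0.7, "scorecard_score": 8.0,
     "signed_commits_ratio": 0.9, "momentum": 1.0, "stars": 500},
    {"slug": "b", "trust_score": 95.0, "trust_confidence": 0.9, "scorecard_score": 6.0,
     "signed_commits_ratio": 0.2, "momentum": 0.5, "stars": 100},
    {"slug": "c", "trust_score": 90.0, "trust_confidence": 0.9, "scorecard_score": 9.0,
     "signed_commits_ratio": 1.0, "momentum": 2.0, "stars": 900},
]
leaderboard = rank_agents(agents)
```

Here `b` is listed ahead of `a` because its trust_confidence is higher, even though `a` has more stars; both display as `=1`, and `c` follows at rank 3.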
Each agent publishes trust_score, the net trust_v2_adjustment, the applied trust_v2_headroom_factor, and a per-dimension trust_v2_breakdown, so every point of difference from the base score is attributable.
5. Data Access
- GET /data/latest.json: all agents, including runtime fields and v2 scores
- GET /data/agents/{slug}.json: per-agent record
- GET /data/history/YYYY-MM-DD.json: daily snapshots (runtime-drift time series)
- Methodology (Runtime-Trust Calibration): the human-readable adjustment reference
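A minimal client sketch for the endpoints listed above, using only the Python standard library. The base URL is an assumption taken from the host in this spec's own URL, the helper names are hypothetical, and the shape of the returned JSON is not assumed.

```python
import json
from datetime import date
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the data endpoints are served from the same host as this spec.
BASE_URL = "https://hvtracker.net"


def latest_url() -> str:
    return f"{BASE_URL}/data/latest.json"


def agent_url(slug: str) -> str:
    # Percent-encode the slug so it cannot alter the path.
    return f"{BASE_URL}/data/agents/{quote(slug, safe='')}.json"


def history_url(day: date) -> str:
    # date.isoformat() yields the YYYY-MM-DD form used by the snapshot path.
    return f"{BASE_URL}/data/history/{day.isoformat()}.json"


def fetch_json(url: str, timeout: float = 10.0):
    # Performs a plain GET and parses the body as JSON.
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Fetching `history_url(day)` for consecutive dates and comparing the runtime fields of one agent is one way to reconstruct the drift time series described in §3.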
6. Calibration and Promotion Criteria
Runtime signals moved into the production rank only after an upset review demonstrated the change was evidence-backed, against published criteria: maximum acceptable rank churn, protection of high-grade agents from unexplained drops, no single dimension dominating the adjustment, and this spec being published first. The review is re-run after any recalibration — including the v4.1 soft-ceiling and tie-break change, whose near-zero churn (no grade flips) cleared the gate.
7. Versioning
This spec versions independently of the scoring methodology. Any change to the adjustment table (§4) requires a version bump and a changelog entry; promotion into production rank requires a new major section documenting the cutover and the evidence that gated it.