Public Trust Benchmark

Evaluation health, explanation coverage, data trust, and known limits.

Recommendation eval top-3 hit rate is 100%; explanation checks are 12/12; data trust is 81/100, with 12 review queue items.


Top-1 hit rate: 92.9%

28 CI-safe eval cases.

Top-3 hit rate: 100%

Shortlist health for recommendation and search flows.

Explanation coverage: 100%

12/12 explanation checks passed.

Data trust: 81/100

Current corpus risk is high.

Benchmark

Category accuracy

100%

Eval cases match expected category classifications.

Benchmark

Deployment accuracy

100%

Eval cases match expected deployment classifications.

Benchmark

Cloudflare readiness

100%

Eval cases match expected Cloudflare readiness.

Benchmark

Release score

100/100

Quality release gate based on warnings and errors.

Benchmark

Category coverage

13/13

Tracked taxonomy coverage in the loaded corpus.

Benchmark

Review queue

Projects needing classification, collection, or signal review.

Known Limitations

What agents should disclose

Eval quality is a CI-safe baseline over curated cases and generated fixtures; it is not a live benchmark over every GitHub repository.

Top-1 misses remain for prompt-tooling and coding-agent search cases; use top-3 health when evaluating shortlist quality.

Quality signal confidence can be snapshot, partial, estimated, or unknown depending on available GitHub sync depth.

Current data trust risk is high; inspect quality and review queue details before high-confidence claims.

12 projects are in the low-confidence review queue and may need classification or collection-semantics review.

Review Focus

Current top-1 misses

search-prompt-tooling · top-1 miss
Expected: promptfoo/promptfoo, guardrails-ai/guardrails, dottxt-ai/outlines, 567-labs/instructor
Observed: openai/openai-structured-outputs-samples, guardrails-ai/guardrails, 567-labs/instructor, microsoft/TypeChat, BoundaryML/baml

search-coding-agent · top-1 miss
Expected: openai/codex, cline/cline, aider-ai/aider, OpenHands/OpenHands
Observed: smol-ai/developer, OpenHands/OpenHands, cline/cline, google-gemini/gemini-cli, anomalyco/opencode
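To make the difference between a top-1 miss and a healthy top-3 result concrete, here is a minimal sketch of how such a check might be scored. It assumes a case counts as a hit at k when any of the first k observed repositories appears in the expected list; the benchmark's actual scoring rule is not stated on this page, and `hit_at_k` and `cases` are hypothetical names used only for illustration. The count of 26 top-1 hits is inferred from the two misses listed above out of 28 cases.

```python
from typing import Sequence


def hit_at_k(expected: Sequence[str], observed: Sequence[str], k: int) -> bool:
    """Assumed rule: a hit if any of the first k observed repos is expected."""
    expected_set = set(expected)
    return any(repo in expected_set for repo in observed[:k])


# The two top-1 misses listed above, copied from the page.
cases = {
    "search-prompt-tooling": {
        "expected": [
            "promptfoo/promptfoo",
            "guardrails-ai/guardrails",
            "dottxt-ai/outlines",
            "567-labs/instructor",
        ],
        "observed": [
            "openai/openai-structured-outputs-samples",
            "guardrails-ai/guardrails",
            "567-labs/instructor",
            "microsoft/TypeChat",
            "BoundaryML/baml",
        ],
    },
    "search-coding-agent": {
        "expected": [
            "openai/codex",
            "cline/cline",
            "aider-ai/aider",
            "OpenHands/OpenHands",
        ],
        "observed": [
            "smol-ai/developer",
            "OpenHands/OpenHands",
            "cline/cline",
            "google-gemini/gemini-cli",
            "anomalyco/opencode",
        ],
    },
}

# For each case: (hit at top-1, hit at top-3) under the assumed rule.
results = {
    name: (
        hit_at_k(case["expected"], case["observed"], 1),
        hit_at_k(case["expected"], case["observed"], 3),
    )
    for name, case in cases.items()
}

# Aggregate top-1 rate: 28 eval cases, 2 of which miss at top-1.
total_cases = 28
top1_misses = sum(1 for top1, _ in results.values() if not top1)
top1_rate = round(100 * (total_cases - top1_misses) / total_cases, 1)

for name, (top1, top3) in results.items():
    print(f"{name}: top-1 hit={top1}, top-3 hit={top3}")
print(f"top-1 hit rate: {top1_rate}%")
```

Under this assumed rule, both cases miss at top-1 but hit within the top 3, which is consistent with the reported 92.9% top-1 and 100% top-3 rates.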

Top Review Queue Items

Highest-impact data work

modelcontextprotocol/servers · impact 107
Repository is a collection and may need curation semantics review. Confirm collection scope, freshness, and whether the category should represent resources rather than runtime code.

n8n-io/self-hosted-ai-starter-kit · impact 48
Repository is a collection and may need curation semantics review. Confirm collection scope, freshness, and whether the category should represent resources rather than runtime code.

vonzosten/awesome-LangGraph · impact 40
Repository is a collection and may need curation semantics review. Confirm collection scope, freshness, and whether the category should represent resources rather than runtime code.

arabicapp/everything-claude-code · impact 32
Repository is a collection and may need curation semantics review. Confirm collection scope, freshness, and whether the category should represent resources rather than runtime code.

GoDiao/Free-Way · impact 32
Repository is a collection and may need curation semantics review. Confirm collection scope, freshness, and whether the category should represent resources rather than runtime code.