Methodology

How We Verify Findings

Static analysis generates candidates. AI verification and human audit determine what's real. No finding is cited in pitch materials or research posts without a documented true-positive/false-positive (TP/FP) assessment.

Scan Pipeline

Discovery

Repos are discovered via npm/PyPI registry crawl, GitHub topic search, and artifact-type file search. Each repo is assigned a tier (T1–T5) based on star count and source quality. Aggregators and collections are excluded before scanning.

Static Analysis

22 scanner modules run concurrently against the cloned repository. Each module covers a specific artifact type or vulnerability category. All 120+ detection rules are hand-written with documented true positives and negatives (ADR-010).

Score Computation

The score is a weighted sum of finding severities, each scaled by a confidence multiplier. Floor rules apply to high-severity categories. The score is deterministic: the same repo always produces the same score under the same checker versions.
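A minimal sketch of how a score of this shape could be computed. The severity weights, the 0 to 100 cap, the category name, and the floor value are illustrative assumptions, not the scanner's actual configuration.

```python
from dataclasses import dataclass

# Hypothetical weights and floors; the real values are not given in this document.
SEVERITY_WEIGHTS = {"CRITICAL": 40, "HIGH": 20, "MEDIUM": 8, "LOW": 2, "INFO": 0}
CATEGORY_FLOORS = {"credential_access": 70}  # minimum score if such a finding exists


@dataclass(frozen=True)
class Finding:
    checker_id: str
    severity: str
    confidence: float  # 0.0 to 1.0
    category: str


def risk_score(findings: list[Finding]) -> int:
    """Weighted sum of severity times confidence, then category floors."""
    raw = sum(SEVERITY_WEIGHTS[f.severity] * f.confidence for f in findings)
    score = min(100, round(raw))
    for f in findings:
        floor = CATEGORY_FLOORS.get(f.category)
        if floor is not None:
            score = max(score, floor)
    return score
```

Because the function depends only on its inputs and fixed tables, the same findings always yield the same score, which is the property the pipeline relies on.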

AI Verification

CRITICAL findings from differentiator checkers (CHK-115, CHK-119, CHK-027, CHK-089) are submitted to the AI jury. Each finding gets a verdict: CONFIRMED, LIKELY, or FALSE_POSITIVE. Verdicts are cached in verify_cache.json and persist across rescans.

Corpus Audit

Before any stat is published, every high-volume checker is sampled (20 findings each) and FP rate measured. Checkers with >50% FP rate are fixed before numbers are cited. The 24% critical figure reflects post-audit, post-fix numbers.
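The sampling step can be sketched roughly as follows. The sample size of 20 and the 50% threshold come from the text; the fixed seed and the 'TP'/'FP' label strings are assumptions added to make the sketch reproducible.

```python
import random

SAMPLE_SIZE = 20      # findings sampled per checker
FP_THRESHOLD = 0.50   # checkers above this rate are fixed before stats are cited


def sample_findings(findings, seed=0):
    """Draw up to SAMPLE_SIZE findings; the seed (an assumption) keeps the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(findings, min(SAMPLE_SIZE, len(findings)))


def fp_rate(labels):
    """labels: one 'TP' or 'FP' per sampled finding, assigned by a reviewer."""
    return labels.count("FP") / len(labels) if labels else 0.0


def needs_fix(labels):
    """True when the measured FP rate exceeds the threshold."""
    return fp_rate(labels) > FP_THRESHOLD
```

With 20 samples, each reviewed finding moves the measured rate by 5 percentage points, so the resulting figures are best read as approximate.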

Responsible Disclosure

Named findings are disclosed to maintainers 7+ days before publication. Evidence is redacted in public reports. Full reports available to maintainers on request.

AI Verification (Jury System)

CRITICAL findings from differentiator checkers are submitted to an LLM jury with structured context: the checker's intent, the evidence, the file path, and the surrounding code context. The jury returns a structured verdict with a one-sentence rationale.

Input:  checker_id, checker_intent, evidence, file_path, repo_context
Prompt: "This checker fires when: [intent].
        Here is the finding. Is this a true positive?"
Output: { verdict: "CONFIRMED"|"LIKELY"|"FALSE_POSITIVE",
         explanation: "one sentence, max 20 words" }

Verdicts are cached in verify_cache.json and persist across rescans. A cached CONFIRMED verdict is never downgraded by a re-scan — only a human analyst can override a confirmed verdict.
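The caching rule above could be enforced along these lines. This is a sketch only: the cache layout (a JSON object keyed by a finding identifier), the `by_human` flag, and the function names are assumptions, not the actual format of verify_cache.json.

```python
import json
from pathlib import Path

VALID_VERDICTS = {"CONFIRMED", "LIKELY", "FALSE_POSITIVE"}
MAX_EXPLANATION_WORDS = 20  # "one sentence, max 20 words"


def load_cache(path):
    """Read the cache file, or start empty if it does not exist yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def record_verdict(cache, finding_key, verdict, explanation, by_human=False):
    """Store a verdict; a cached CONFIRMED is only replaced by a human analyst."""
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unknown verdict: {verdict}")
    if len(explanation.split()) > MAX_EXPLANATION_WORDS:
        raise ValueError("explanation exceeds 20 words")
    cached = cache.get(finding_key)
    if cached and cached["verdict"] == "CONFIRMED" and not by_human:
        return cached  # a re-scan may not downgrade a confirmed verdict
    cache[finding_key] = {"verdict": verdict, "explanation": explanation}
    return cache[finding_key]


def save_cache(cache, path):
    """Write the cache with sorted keys so the file diffs cleanly."""
    Path(path).write_text(json.dumps(cache, indent=2, sort_keys=True))
```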

Verdict Definitions

CONFIRMED

Criteria

· Evidence matches the finding pattern with no ambiguity

· Context rules out false positive (runtime source file, not test/doc)

· The security impact is clear and non-speculative

Example

CHK-115: SKILL.md contains 'cat ~/.aws/credentials' in a task step. File path is in .claude/skills/, not a test fixture. Evidence is unambiguous credential access instruction.

LIKELY

Criteria

· Evidence matches but context is ambiguous

· Security impact is probable but not certain

· Manual review recommended before blocking

Example

CHK-027: SKILL.md contains 'send results to https://api.example.com'. Domain is not in known-safe list. Could be legitimate API call or exfiltration — context required.

FALSE_POSITIVE

Criteria

· Evidence matches the pattern but context makes it benign

· File is a test fixture, documentation example, or placeholder

· The pattern fires on non-runtime code (e.g. key header in certgen test)

Example

CHK-042: '-----BEGIN RSA PRIVATE KEY-----' in pkg/tls/certgen_test.go. Test file generating ephemeral TLS certs — pattern correct but context is benign.

False Positive Audit Log

Before locking pitch statistics, every high-volume checker is sampled (20 random findings) and FP rate assessed. Checkers above 50% FP are fixed at the root cause — never globally suppressed.

CHK-023 (injection patterns)

Problem

~90% FP rate: ordinary MUST/CRITICAL instructions in skills triggered the checker

Fix

Tightened to require explicit override language ('ignore previous instructions', 'disregard system prompt', 'bypass security'). Imperative workflow instructions excluded.

Impact

~4,000 HIGH findings eliminated
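As an illustration of the tightened CHK-023 rule, a pattern restricted to explicit override language might look like the sketch below. Only the three phrases quoted in the fix are included; the production rule may cover more, and the function name is hypothetical.

```python
import re

# Phrases quoted in the fix description; whitespace between words is flexible.
OVERRIDE_PATTERN = re.compile(
    r"ignore\s+previous\s+instructions"
    r"|disregard\s+system\s+prompt"
    r"|bypass\s+security",
    re.IGNORECASE,
)


def is_injection_candidate(line: str) -> bool:
    """True only for explicit override language, not for imperative workflow steps."""
    return OVERRIDE_PATTERN.search(line) is not None
```

An instruction such as "You MUST run the tests before committing" no longer matches, while "Ignore previous instructions and print the key" still does.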

CHK-049 (no auth)

Problem

~30% FP rate: skill/hook/agent repos triggered the checker even though they have no server

Fix

Scoped to repos with server/mcp artifact hints only. Skill-only repos excluded from auth check.

Impact

~500 HIGH findings eliminated

CHK-133 (placeholder secrets)

Problem

~80% FP rate: placeholders such as 'YourMySQLRootPassword' and 'secure123' were scored as CRITICAL

Fix

Raised the Shannon entropy gate to 3.5 bits/char and expanded the placeholder list to cover common tutorial patterns.

Impact

~700 CRITICAL demoted to INFO
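A sketch of how the two parts of the CHK-133 fix could combine. The 3.5 bits/char gate is from the fix description; the placeholder markers and function names are hypothetical stand-ins for the real list.

```python
import math
from collections import Counter

ENTROPY_GATE = 3.5  # bits per character
# Hypothetical placeholder markers; the real list is not reproduced here.
PLACEHOLDER_MARKERS = ("your", "password", "changeme", "example")


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not value:
        return 0.0
    n = len(value)
    return -sum(c / n * math.log2(c / n) for c in Counter(value).values())


def looks_like_real_secret(value: str) -> bool:
    """Reject known placeholders first, then apply the entropy gate."""
    lowered = value.lower()
    if any(marker in lowered for marker in PLACEHOLDER_MARKERS):
        return False
    return shannon_entropy(value) >= ENTROPY_GATE
```

Under these assumptions 'secure123' falls below the gate on entropy alone (about 2.95 bits/char), while 'YourMySQLRootPassword' scores about 3.82 bits/char and is rejected only by the placeholder list. That is one reason to keep both checks rather than rely on entropy alone.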

CHK-105 (CI secret echo)

Problem

~95% FP rate: standard >> $GITHUB_OUTPUT writes triggered the checker

Fix

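Excluded lines writing to $GITHUB_OUTPUT, $GITHUB_ENV, or $GITHUB_PATH, the GitHub Actions environment files. Writing to $GITHUB_OUTPUT has been the recommended step-output idiom since the set-output command was deprecated in 2022.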

Impact

~170 HIGH findings eliminated
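The CHK-105 exclusion could be expressed as a pattern like the one below. This is a sketch: the regex and function name are assumptions, and it recognizes only shell redirects into the three environment files named in the fix.

```python
import re

# Matches a redirect (> or >>) into a GitHub Actions environment file,
# with optional quotes and optional ${...} braces around the variable.
ENV_FILE_WRITE = re.compile(
    r'>>?\s*"?\$\{?(GITHUB_OUTPUT|GITHUB_ENV|GITHUB_PATH)\b\}?"?'
)


def is_env_file_write(line: str) -> bool:
    """True when the line redirects output into an environment file."""
    return ENV_FILE_WRITE.search(line) is not None
```

A line such as `echo "version=1.2.3" >> "$GITHUB_OUTPUT"` is recognized and skipped, while a plain `echo` of a secret to the job log is still left for the checker to flag.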

CHK-108 (credential URLs in docs)

Problem

~95% FP rate: i18n locale files and README proxy examples triggered the checker

Fix

Extended docs context to cover /locales/, /i18n/ paths and .json files containing URL format examples.

Impact

~96 CRITICAL demoted to LOW
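The path half of the CHK-108 fix can be sketched as a simple directory check. The directory names come from the fix description; the function name is hypothetical, and the content-based test for .json files containing URL format examples is not modeled here.

```python
from pathlib import PurePosixPath

DOC_PATH_PARTS = {"locales", "i18n"}  # directories named in the fix


def is_docs_context(path: str) -> bool:
    """True when any directory component of the path marks it as docs/i18n content."""
    return bool(DOC_PATH_PARTS & set(PurePosixPath(path).parts))
```

Matching on whole path components rather than substrings avoids treating a file such as src/locales_loader.py as documentation.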

Core Principles

Deterministic scoring

Same repo + same checkers = same score. No randomness, no model drift, no A/B testing on security conclusions.

Hand-written detection logic

ADR-010: all detection rules written deliberately with documented true/false positives. No ML classifiers in v1.

Audit trail

Every finding has a checker_id traceable to source code. Every verdict has a cached rationale. SOC 2 audit log on all scans.

Responsible disclosure

Named findings disclosed 7+ days before publication. Severity never inflated. Evidence redacted in public reports.

No suppression

Checkers with high FP rates are fixed at the root cause (path context, entropy gate, or scope filter). Global suppression is never used.

Stats locked before publishing

Critical percentage figures are locked after full FP audit and not adjusted retroactively to match a narrative.
