Closed
1 change: 1 addition & 0 deletions docs/planning/README.md
@@ -9,6 +9,7 @@
- `docs/planning/agent-integration-surface.md` - external agent-consumable VRDex skill, API, website navigation, and MCP direction
- `docs/planning/world-discovery.md` - world pages, creator attribution, active-world discovery, and creator-commerce boundaries
- `docs/planning/marketplace-api-research.md` - marketplace sync gate, provider posture, and disallowed storefront data
- `docs/planning/moderation-audit-and-adjudication.md` - auditable moderation/intelligence ledger and scoped false-positive overrides
- `docs/planning/engineering-strategy.md` - stack, testing, verification, and agentic factory plan
- `docs/planning/docs-strategy.md` - Docusaurus and source-of-truth documentation plan
- `docs/planning/epics.md` - phased epic breakdown for v0.5, v1, and v1.5
14 changes: 14 additions & 0 deletions docs/planning/architecture.md
@@ -373,6 +373,20 @@ Current recommendation:

- abuse, impersonation, suspicious link, toxic content, or mismatch signals
- can be raised by rules, LLM review, user reports, or admin actions
- should be backed by append-only moderation events and adjudications so false positives, test artifacts, reversals, and ignored signals are auditable instead of deleted
- see `docs/planning/moderation-audit-and-adjudication.md`

### `moderation_events` later

- immutable ledger for detections, reviews, restrictions, bans, reversals, dismissals, appeals, and other moderation/intelligence facts
- scoped globally, per server, per community, or per future product scope
- used as the raw evidence source for moderation history and trust/risk accounting

### `moderation_adjudications` later

- append-only reviewer decisions attached to one or more moderation events
- records whether an event should count for future scoring in a given scope
- supports decisions such as false positive, test artifact, policy exempt, locally ignored, globally ignored, upheld, reversed, or stale

### `profile_revisions`

3 changes: 3 additions & 0 deletions docs/planning/epics.md
@@ -386,11 +386,14 @@ Includes:
- set-time extraction
- entity matching suggestions
- moderation/confirmation workflow
- auditable moderation/intelligence event ledger
- scoped adjudication decisions for false positives, test artifacts, policy exemptions, reversals, and non-countable history

Acceptance criteria:

- AI can propose useful matches and extracted structure
- uncertain AI output is never silently treated as verified fact
- prior detections that reviewers marked non-countable do not continue to count as suspicious history

### EPIC-17 Insights and premium polish

208 changes: 208 additions & 0 deletions docs/planning/moderation-audit-and-adjudication.md
@@ -0,0 +1,208 @@
# Moderation Audit And Adjudication

## Status

Current recommendation.

This document captures the requirement that moderation and intelligence signals must be auditable and correctable before they become durable trust history. It is intentionally broader than profile moderation because the same model applies to future bot detections, server-level actions, global intelligence, bans, and appeal/review workflows.

## Problem

Operators and test accounts can accumulate suspicious detections while exercising the system. If those old detections keep flowing into future scoring, the system can repeatedly flag a user even after a human has determined the earlier events were false positives, test artifacts, policy-exempt activity, or otherwise not relevant.

Deleting old detections is the wrong fix. The system needs to preserve what happened while letting authorized reviewers decide whether a detection or action should count against a user, profile, account, server, or community in future accounting.

## Locked Decisions

- every detection, review, restriction, ban, dismissal, appeal, reversal, and override should be auditable
- adjudication should be additive, not destructive; do not erase historical evidence to restore standing
- future trust scoring should consume only active/countable signals, not raw historical detections by default
- global admins can adjudicate global intelligence and cross-server standing
- server admins can adjudicate server-scoped detections and server-local actions within their authority
- server-local adjudication must not silently erase global intelligence or another server's evidence
- AI/LLM output is evidence or a recommendation, not the final authority for trust state

## Core Concepts

### `moderation_events`

Immutable event ledger for facts observed or actions taken.

Examples:

- suspicious content detection
- suspicious activity detection
- user report
- model risk summary
- manual case opened
- warning issued
- restriction applied
- ban applied
- ban lifted
- message dismissed
- appeal submitted

Important fields:

- `scopeType`: `global`, `server`, `community`, or future product scope
- `scopeId`: optional scope identifier, such as Discord server id or community profile id
- `subjectType`: `user`, `profile`, `message`, `event`, `world`, `account`, or future subject type
- `subjectId`: stable internal subject id where available
- `sourceType`: `rule`, `llm`, `report`, `admin`, `integration`, `system`
- `sourceId`: model run id, rule id, report id, integration id, or actor id where relevant
- `eventType`: concrete event/action type
- `confidence`: optional normalized confidence for detections
- `reasonCodes`: normalized reason codes when available
- `summary`: short human-readable summary
- `evidenceRefs`: references to immutable evidence snapshots, rather than only to uncontrolled live links
- `createdBy`: actor or system identity
- `createdAt`: event time
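
The field list above could be sketched as a TypeScript type. This is a minimal illustration, not a schema decision: the `id` field, the ISO 8601 string timestamps, and the `appendEvent` helper are assumptions, and only the field names come from the list.

```typescript
// Sketch only: field names follow the list above; concrete types are assumptions.
type ScopeType = "global" | "server" | "community";
type SubjectType = "user" | "profile" | "message" | "event" | "world" | "account";
type SourceType = "rule" | "llm" | "report" | "admin" | "integration" | "system";

interface ModerationEvent {
  readonly id: string; // assumed ledger-assigned identifier
  readonly scopeType: ScopeType;
  readonly scopeId?: string;
  readonly subjectType: SubjectType;
  readonly subjectId: string;
  readonly sourceType: SourceType;
  readonly sourceId?: string;
  readonly eventType: string;
  readonly confidence?: number;
  readonly reasonCodes: readonly string[];
  readonly summary: string;
  readonly evidenceRefs: readonly string[];
  readonly createdBy: string;
  readonly createdAt: string; // assumed ISO 8601 timestamp
}

// Hypothetical helper: returns a new ledger with the event appended.
// Existing entries are never changed, and the stored copy is shallowly frozen.
function appendEvent(
  ledger: readonly ModerationEvent[],
  event: ModerationEvent,
): readonly ModerationEvent[] {
  if (ledger.some((existing) => existing.id === event.id)) {
    throw new Error(`event ${event.id} already recorded`);
  }
  return [...ledger, Object.freeze({ ...event })];
}
```

Corrections would then be expressed as new events or adjudications, never as edits to an entry already in the ledger.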

### `moderation_adjudications`

Append-only review decisions attached to one or more moderation events.

Examples:

- `upheld`
- `false_positive`
- `test_artifact`
- `policy_exempt`
- `duplicate`
- `stale`
- `locally_ignored`
- `globally_ignored`
- `reversed`
- `partial`

Important fields:

- `eventIds`: one or more moderation event ids covered by the decision
- `scopeType` and `scopeId`: adjudication scope
- `decision`: normalized decision
- `countsForScoring`: whether the covered events should affect future risk/accounting in this scope
- `effectiveFrom`: when the decision applies
- `expiresAt`: optional expiration for temporary overrides
- `reviewerId`: admin/reviewer identity
- `reviewerRole`: global admin, server owner, server admin, moderator, automated migration, etc.
- `rationale`: concise reviewer explanation
- `createdAt`: adjudication time
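
As a rough illustration, the adjudication record and its time window could look like the following. The `id` field, the string timestamp format, the `isInEffect` helper, and the choice to treat `expiresAt` as exclusive are assumptions. What should happen to scoring after an override expires is still an open question, so the sketch only reports whether the record is inside its window.

```typescript
// Sketch only: decision values and field names come from the lists above; types are assumptions.
type Decision =
  | "upheld" | "false_positive" | "test_artifact" | "policy_exempt" | "duplicate"
  | "stale" | "locally_ignored" | "globally_ignored" | "reversed" | "partial";

interface ModerationAdjudication {
  readonly id: string; // assumed identifier
  readonly eventIds: readonly string[];
  readonly scopeType: "global" | "server" | "community";
  readonly scopeId?: string;
  readonly decision: Decision;
  readonly countsForScoring: boolean;
  readonly effectiveFrom: string; // assumed ISO 8601 timestamp
  readonly expiresAt?: string; // assumed ISO 8601 timestamp, treated as exclusive
  readonly reviewerId: string;
  readonly reviewerRole: string;
  readonly rationale: string;
  readonly createdAt: string;
}

// Hypothetical helper: true when `at` falls inside [effectiveFrom, expiresAt).
function isInEffect(
  adjudication: Pick<ModerationAdjudication, "effectiveFrom" | "expiresAt">,
  at: Date,
): boolean {
  const from = Date.parse(adjudication.effectiveFrom);
  const until =
    adjudication.expiresAt === undefined ? Infinity : Date.parse(adjudication.expiresAt);
  const time = at.getTime();
  return time >= from && time < until;
}
```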

### `moderation_cases`

Human workflow container for grouping events, actions, appeals, notes, and adjudications.

Cases should support:

- global view for super admins
- server-scoped view for server admins
- subject history timeline
- linked detections and actions
- current effective standing
- action buttons such as dismiss, restrict, ban, reverse, mark false positive, mark test artifact, and exclude from future scoring

## Effective Standing

Trust/risk calculations should use an effective view, not raw event history.

Current recommendation:

- raw event history remains immutable
- each event has an effective adjudication state per relevant scope
- scoring excludes events where the latest applicable adjudication has `countsForScoring: false`
- global adjudications apply everywhere unless a narrower rule explicitly needs local context
- server adjudications apply only to that server scope
- model prompts and feature inputs should avoid presenting ignored detections as negative evidence
- when useful, prompts may include a compact neutral note such as `3 prior detections were adjudicated false-positive/test-artifact and excluded from scoring`

This avoids the failure mode where a user is repeatedly flagged for prior suspicious history after an authorized reviewer already corrected the record.
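
One possible reading of the effective view, as a sketch under stated assumptions: the "latest applicable adjudication" is taken to be the most recent by `createdAt` among global adjudications and adjudications made in the scope being evaluated, and events with no adjudication remain countable. The type and function names are hypothetical, and `effectiveFrom`/`expiresAt` handling is left out for brevity.

```typescript
// Sketch only: names and precedence rule are assumptions, not a settled design.
interface Scope {
  scopeType: "global" | "server" | "community";
  scopeId?: string;
}

interface AdjudicationRef extends Scope {
  eventIds: readonly string[];
  countsForScoring: boolean;
  createdAt: string; // assumed ISO 8601 timestamp
}

// Global adjudications apply everywhere; scoped ones apply only to the same scope.
function appliesIn(adjudication: AdjudicationRef, scope: Scope): boolean {
  if (adjudication.scopeType === "global") return true;
  return adjudication.scopeType === scope.scopeType && adjudication.scopeId === scope.scopeId;
}

// Splits events into countable and ignored groups for one scope.
// Raw events are only read, never modified or dropped.
function partitionForScoring<E extends { id: string }>(
  events: readonly E[],
  adjudications: readonly AdjudicationRef[],
  scope: Scope,
): { countable: E[]; ignored: E[] } {
  const countable: E[] = [];
  const ignored: E[] = [];
  for (const event of events) {
    const latest = adjudications
      .filter((a) => a.eventIds.includes(event.id) && appliesIn(a, scope))
      .sort((a, b) => Date.parse(b.createdAt) - Date.parse(a.createdAt))[0];
    if (latest !== undefined && !latest.countsForScoring) {
      ignored.push(event);
    } else {
      countable.push(event);
    }
  }
  return { countable, ignored };
}
```

Returning both groups, instead of filtering ignored events out, keeps the ignored history available for timelines and neutral prompt notes while scoring reads only `countable`.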

## Permission Model

Global admins can:

- view all moderation events and adjudications
- adjudicate global and server-scoped events
- reverse or override any automated action
- mark events as globally ignored for future scoring
- audit reviewer behavior

Server admins can:

- view events scoped to their server
- view limited global standing summaries when needed for safety decisions
- adjudicate server-scoped detections and actions
- mark server-local events as locally ignored for future scoring
- reverse server-local actions where the product grants that capability

Server admins cannot:

- erase global events
- modify another server's events
- globally clear a user's account standing
- hide their own reviewer actions from global audit
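
The authority rules above could be checked with something like the sketch below. The role names, the `Reviewer` shape, and both function names are hypothetical. Community scope is left out because the lists above do not say who may adjudicate it, and self-review is not handled because it is still an open question.

```typescript
// Sketch only: role names and shapes are assumptions for illustration.
type ReviewerRole = "global_admin" | "server_admin";

interface Reviewer {
  id: string;
  role: ReviewerRole;
  serverId?: string; // the server a server admin has authority over
}

interface ScopedEvent {
  scopeType: "global" | "server";
  scopeId?: string;
}

// Global admins may adjudicate global and server-scoped events.
// Server admins may adjudicate only events scoped to their own server.
function canAdjudicate(reviewer: Reviewer, event: ScopedEvent): boolean {
  if (reviewer.role === "global_admin") return true;
  return (
    event.scopeType === "server" &&
    reviewer.serverId !== undefined &&
    event.scopeId === reviewer.serverId
  );
}

// Only global admins may mark events as globally ignored for future scoring.
function canMarkGloballyIgnored(reviewer: Reviewer): boolean {
  return reviewer.role === "global_admin";
}
```

Denying by default for anything not explicitly granted keeps a server-local decision from silently affecting global intelligence or another server's evidence.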

## Existing Data Migration

When this system is introduced, existing detections and actions should be backfilled into `moderation_events` instead of left as opaque legacy state.

Migration should preserve:

- original detection/action timestamps
- original confidence where available
- original reason text or normalized reason codes where available
- original actor/system source where available
- current action state, such as active ban or lifted ban
- a migration source marker when original provenance is incomplete

If old detections are known test artifacts, add adjudication records rather than deleting the migrated events.
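
A backfill step might look roughly like this. The `LegacyDetection` shape is entirely hypothetical, since the existing storage format is not described here, and the `migration:legacy-backfill` marker value is an assumption. The sketch shows two things: original timestamps and provenance are preserved where present, and known test artifacts get an adjudication record instead of being dropped.

```typescript
// Sketch only: the legacy row shape and marker string are hypothetical.
const MIGRATION_MARKER = "migration:legacy-backfill";

interface LegacyDetection {
  legacyId: string;
  userId: string;
  serverId?: string;
  detectedAt: string; // original detection timestamp
  confidence?: number;
  reason?: string;
  sourceType?: "rule" | "llm" | "report" | "admin" | "integration" | "system";
  sourceId?: string;
  knownTestArtifact: boolean;
}

function backfill(legacy: LegacyDetection, migratedAt: string) {
  const scopeType: "server" | "global" = legacy.serverId ? "server" : "global";
  const event = {
    id: `legacy-${legacy.legacyId}`,
    scopeType,
    scopeId: legacy.serverId,
    subjectType: "user" as const,
    subjectId: legacy.userId,
    // Fall back to the migration marker only when provenance is incomplete.
    sourceType: legacy.sourceType ?? ("system" as const),
    sourceId: legacy.sourceType ? legacy.sourceId : MIGRATION_MARKER,
    eventType: "legacy_detection",
    confidence: legacy.confidence,
    summary: legacy.reason ?? "Migrated legacy detection",
    createdBy: MIGRATION_MARKER,
    createdAt: legacy.detectedAt, // preserve the original timestamp
  };
  // Known test artifacts are kept as events and excluded from scoring by adjudication.
  const adjudication = legacy.knownTestArtifact
    ? {
        eventIds: [event.id],
        scopeType,
        scopeId: legacy.serverId,
        decision: "test_artifact" as const,
        countsForScoring: false,
        effectiveFrom: migratedAt,
        reviewerId: MIGRATION_MARKER,
        reviewerRole: "automated migration",
        rationale: "Known test artifact at migration time",
        createdAt: migratedAt,
      }
    : undefined;
  return { event, adjudication };
}
```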

## UI Requirements

Global admin console:

- subject search by user/account/profile/server
- global timeline of detections, actions, adjudications, and appeals
- effective standing summary with countable vs ignored signal breakdown
- filters for unresolved, active action, ignored, false positive, test artifact, and stale
- bulk adjudication for known test runs or duplicate imports
- reviewer/auditor activity log

Server admin console:

- server-local timeline for a subject
- local effective standing summary
- actions to dismiss, mark false positive, mark test artifact, restrict, ban, reverse, and open case
- clear labels when global intelligence exists but cannot be modified locally

Detection cards and alerts:

- show recent countable history separately from ignored/adjudicated history
- do not list ignored events as active reasons for suspicion
- expose a `History` path that explains why an alert triggered and what can be corrected

## Non-Goals For The First Slice

- building a giant Discord-sized permission matrix
- making AI the final judge of appeals or standing
- deleting historical detections as the normal correction path
- exposing private evidence across server boundaries without a need-to-know rule
- using broad natural-language heuristics in deterministic code to decide whether something is suspicious

## First Implementation Slice

1. Add immutable moderation events for detections and actions.
2. Add adjudication records with `countsForScoring` and scoped authority.
3. Update risk/history queries to return countable and ignored signal groups separately.
4. Update model/tool inputs so that non-countable prior detections are not presented as negative evidence.
5. Add global admin actions to mark events false-positive, test-artifact, policy-exempt, or ignored.
6. Add server admin actions for server-scoped false-positive/test-artifact/local-ignore decisions.
7. Backfill existing detections/actions into the event ledger.
8. Add tests that prove ignored historical detections no longer trigger recent-suspicious-history reasons.
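
Steps 3, 4, and 8 could be exercised with a small sketch like the one below. The `HistorySignal` shape and both function names are hypothetical, and the note wording is one possible phrasing of the neutral prompt note. The property to test is that a signal marked non-countable never appears as a reason for suspicion.

```typescript
// Sketch only: shapes and names are assumptions for illustration.
interface HistorySignal {
  eventId: string;
  reasonCode: string;
  countable: boolean; // result of the effective adjudication for the evaluated scope
}

// Only countable signals may surface as recent-suspicious-history reasons.
function recentHistoryReasons(signals: readonly HistorySignal[]): string[] {
  return signals.filter((signal) => signal.countable).map((signal) => signal.reasonCode);
}

// Compact neutral note for prompts; undefined when nothing was excluded.
function ignoredHistoryNote(signals: readonly HistorySignal[]): string | undefined {
  const ignoredCount = signals.filter((signal) => !signal.countable).length;
  if (ignoredCount === 0) return undefined;
  const noun = ignoredCount === 1 ? "detection was" : "detections were";
  return `${ignoredCount} prior ${noun} adjudicated non-countable and excluded from scoring`;
}
```

A test for step 8 would then build a history in which every prior detection is non-countable and assert that `recentHistoryReasons` returns an empty list.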

## Open Questions

- Should global admins be able to mark a server-local event as globally ignored, or should that require promoting the event to global scope first?
- Should expired adjudications restore scoring automatically, or require re-review before counting again?
- Which evidence fields are safe for server admins to see when global intelligence contributed to a server-local alert?
- How should self-review be handled when the affected user is also a server admin or global admin?
3 changes: 3 additions & 0 deletions docs/planning/product-spec.md
@@ -685,6 +685,9 @@ Every field or block should support owner-configured visibility and source attri
- restrict certain sensitive fields after claim
- allow report / correction requests
- clearly mark AI-extracted event links as suggested, confirmed, or disputed
- keep an auditable ledger for detections, reviews, restrictions, bans, dismissals, reversals, and false-positive decisions
- allow authorized global and server-scoped admins to mark prior detections or actions as non-countable for future trust/risk accounting without deleting history
- see `docs/planning/moderation-audit-and-adjudication.md`

## API goals
