These are operational norms: what teams do day-to-day to maintain safety, reliability, and accountability after the principles and methods are defined.
1. Release gating
No release without passing the current safety bar. Higher capability and higher access require higher scrutiny.
- Define a minimum evaluation suite per release tier.
- Block release on critical regressions (privacy, self-harm, violence, fraud, manipulation, tool misuse).
- Require explicit sign-off for enabling tools, autonomy, or sensitive domains.
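The gate above can be sketched as a small check that blocks release when any critical category misses the bar or regresses against the shipped baseline. This is a minimal illustration, not a prescribed implementation; the category names, `min_bar`, and tolerance values are hypothetical.

```python
from dataclasses import dataclass

# Illustrative critical categories from the bullets above; real tiers vary.
CRITICAL = {"privacy", "self_harm", "violence", "fraud", "manipulation", "tool_misuse"}

@dataclass
class EvalResult:
    category: str
    score: float     # higher is safer
    baseline: float  # score of the currently shipped model

def release_allowed(results, min_bar=0.95, regression_tolerance=0.0):
    """Block release if any critical category misses the bar or regresses."""
    for r in results:
        if r.category not in CRITICAL:
            continue  # non-critical metrics are reported, not gating
        if r.score < min_bar:
            return False, f"{r.category} below bar ({r.score:.2f} < {min_bar})"
        if r.baseline - r.score > regression_tolerance:
            return False, f"{r.category} regressed ({r.score:.2f} < {r.baseline:.2f})"
    return True, "ok"
```

Keeping the gate a pure function over evaluation results makes it easy to test and to tighten per release tier.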
2. Change logs and decision records
Record what changed, why it changed, and who approved it. Preserve the rationale for future audits.
- Keep a short decision log for each release and each safety exception.
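One lightweight way to keep such a log is an append-only JSONL file, so history is never rewritten and each line is one auditable record. The field names and path here are illustrative assumptions.

```python
import json

def append_decision(path, release, change, rationale, approver):
    """Append one decision record as a JSON line; append-only preserves the audit trail."""
    record = {
        "release": release,
        "change": change,
        "rationale": rationale,
        "approver": approver,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```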
3. Safe defaults
Start locked-down and expand carefully. Defaults should minimize harm without relying on user expertise.
- Default-deny for tools and external actions.
- Prefer read-only access over write access.
- Prefer explicit confirmation before consequential steps.
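A minimal sketch of the default-deny pattern, assuming a hypothetical tool-policy lookup: anything not explicitly reviewed falls through to the locked-down default.

```python
# Locked-down default: deny, read-only, confirm before acting.
DEFAULT_POLICY = {"allow": False, "read_only": True, "confirm": True}

# Explicitly reviewed exceptions; tool names are illustrative.
ALLOWLIST = {
    "search_docs": {"allow": True, "read_only": True, "confirm": False},
    "send_email":  {"allow": True, "read_only": False, "confirm": True},
}

def tool_policy(tool_name):
    """Unknown tools get the locked-down default, never an implicit allow."""
    return ALLOWLIST.get(tool_name, DEFAULT_POLICY)
```

The design choice worth noting: safety comes from the fallthrough, so forgetting to register a new tool fails closed rather than open.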
4. Least privilege access
Grant the minimum data and permissions required for a task; scope credentials by time, domain, and capability.
- Use short-lived tokens and scoped permissions.
- Separate prod vs. staging; separate human vs. automated credentials.
- No broad “god mode” for routine use.
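The bullets above can be illustrated with a toy credential that is both time-bounded and scope-bounded; the token format and 15-minute TTL are assumptions for the sketch, not a reference design.

```python
import secrets
import time

def mint_token(scopes, ttl_seconds=900):
    """Issue a short-lived credential limited to specific capabilities."""
    return {
        "id": secrets.token_hex(16),
        "scopes": frozenset(scopes),
        "expires_at": time.time() + ttl_seconds,
    }

def authorize(token, scope, now=None):
    """A request passes only if the token is unexpired AND carries the scope."""
    now = time.time() if now is None else now
    return now < token["expires_at"] and scope in token["scopes"]
```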
5. Monitoring and alerting
Continuously monitor for unsafe behavior, abuse, and drift. Alert on spikes and anomalies.
- Track unsafe content rates, refusal rates, escalation rates, and tool invocation patterns.
- Detect prompt injection patterns and jailbreak signatures.
- Alert on unusual access to sensitive data and output leakage signals.
6. Incident response drills
Practice before it happens: run tabletop exercises and live drills with clear roles and escalation paths.
- Define severity levels and response playbooks.
- Assign an on-call rotation with rollback authority.
- Keep postmortems blameless and corrective-action driven.
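A severity-to-playbook mapping can be kept as data so the on-call path is explicit and testable. The levels and actions here are hypothetical; the one deliberate choice shown is that an unknown severity routes to the highest level rather than being dropped.

```python
# Illustrative severity routing; real playbooks carry much more detail.
PLAYBOOKS = {
    "sev1": {"page_oncall": True,  "auto_rollback": True,  "postmortem": True},
    "sev2": {"page_oncall": True,  "auto_rollback": False, "postmortem": True},
    "sev3": {"page_oncall": False, "auto_rollback": False, "postmortem": False},
}

def respond(severity):
    """Unknown severities fail toward the strongest response, never silence."""
    return PLAYBOOKS.get(severity, PLAYBOOKS["sev1"])
```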
7. User transparency
Be explicit about limits, uncertainty, and data handling. Avoid misleading anthropomorphism or hidden persuasion.
- Communicate confidence/uncertainty where possible.
- Disclose tool usage when it affects outcomes.
- Make it easy to report unsafe outputs.
8. Abuse handling and enforcement
Have clear processes for abuse reports, rate limits, bans, and rapid mitigation without overreach.
- Escalate repeat offenders and coordinated abuse.
- Apply progressive friction (rate limits → CAPTCHAs → restriction → ban).
- Preserve evidence for investigation while respecting privacy.
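The escalation ladder from the bullets above reduces to a small mapping from offense count to the next intervention; the step names and strike counts are illustrative assumptions.

```python
# Ordered from lightest to strongest friction, matching the ladder above.
LADDER = ["none", "rate_limit", "captcha", "restriction", "ban"]

def next_action(strikes: int) -> str:
    """Map repeat-offense count to progressively stronger friction, capped at ban."""
    return LADDER[min(strikes, len(LADDER) - 1)]
```

Keeping the ladder explicit makes "without overreach" checkable: each step is the least intervention that addresses the observed behavior.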
9. Data minimization
Collect and store the minimum necessary. Prefer aggregation, redaction, and short retention windows.
- Redact secrets and sensitive identifiers from logs where feasible.
- Separate telemetry from content; restrict access to both.
- Set explicit retention policies and enforce them periodically.
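Log redaction is often implemented as a pass over each line with a small set of substitution patterns before anything is stored. The patterns below are a deliberately minimal, illustrative set; a real deployment needs a broader, tested catalog.

```python
import re

# Illustrative patterns only: emails, secret-style tokens, US SSN-shaped IDs.
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b(?:sk|tok)_[A-Za-z0-9]{8,}\b"), "[SECRET]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(line: str) -> str:
    """Replace sensitive spans before the line reaches persistent logs."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```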
10. Evaluation hygiene
Keep evals trustworthy: avoid leakage, keep datasets versioned, and prevent “teaching to the test” from masking real risk.
- Version eval suites and document changes.
- Use held-out adversarial sets.
- Triangulate with real-world monitoring feedback.
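One concrete way to version an eval suite is a content fingerprint: any edit to the examples yields a new ID, so results are always traceable to the exact dataset that produced them. The hashing scheme here is a sketch under the assumption that examples serialize to JSON.

```python
import hashlib
import json

def suite_fingerprint(examples) -> str:
    """Order-independent content hash of an eval set; edits change the version ID."""
    payload = json.dumps(sorted(examples), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

Recording the fingerprint alongside each run makes leakage and silent dataset drift detectable after the fact.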