Embedding Labs

Constitutional AI: Harmlessness from AI Feedback

Bai et al. (Anthropic)2022

AlignmentRLHFSafety

Abstract

The foundational paper behind Claude. Constitutional AI extends RLHF by introducing a set of written principles (a 'constitution') that enables models to critique and revise their own outputs. This approach reduces reliance on human feedback for harmlessness while maintaining helpfulness, providing a scalable path to safer AI systems.
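The critique-and-revise loop described above can be sketched in a few lines. This is an illustrative outline only, not Anthropic's implementation: the `generate` function is a placeholder for a real language-model call, and the two principles are abbreviated stand-ins for the paper's full constitution.

```python
# Sketch of Constitutional AI's supervised-phase critique-and-revision loop.
# All names here are illustrative; `generate` stands in for an LLM call.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def generate(prompt: str) -> str:
    """Placeholder for a language-model call; echoes for demonstration."""
    return f"[model output for: {prompt[:40]}]"

def critique_and_revise(prompt: str, principles=CONSTITUTION) -> str:
    """Draft a response, then let the model critique and revise it
    once per constitutional principle."""
    draft = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle "
            f"'{principle}':\n{draft}"
        )
        draft = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft
```

In the paper, the revisions produced by this loop are used as fine-tuning data, and a later phase (RLAIF) trains a preference model from AI-generated comparisons instead of human labels.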

Why It Matters

  • Foundational to Anthropic's approach to safe AI
  • Reduces dependence on human feedback for safety alignment
  • Scalable framework for encoding values into AI systems
