Constitutional AI: Harmlessness from AI Feedback
Bai et al. (Anthropic) • 2022
Tags: Alignment, RLHF, Safety
Abstract
The foundational paper behind Claude. Constitutional AI extends RLHF by introducing a set of written principles (a 'constitution') that enables models to critique and revise their own outputs. This approach reduces reliance on human feedback for harmlessness while maintaining helpfulness, providing a scalable path to safer AI systems.
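The critique-and-revise loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical stand-in for a language model call, and the two principles merely paraphrase the kind of rules a constitution might contain.

```python
# Sketch of the Constitutional AI critique-and-revise loop.
# `generate` is a placeholder for an LLM call (hypothetical); here it
# returns canned strings so the control flow can be run end to end.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def generate(prompt: str) -> str:
    """Placeholder LLM: dispatches on the kind of prompt it receives."""
    if prompt.startswith("Revise"):
        return "Revised answer incorporating the critique."
    if prompt.startswith("Critique"):
        return "The draft could be more cautious."
    return "Initial draft answer."

def critique_and_revise(user_prompt: str) -> str:
    """Draft a response, then critique and revise it once per principle."""
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique the response below under this principle: {principle}\n"
            f"Response: {draft}"
        )
        draft = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    # In the paper, these final revisions become supervised fine-tuning data.
    return draft
```

In the full method, pairs of responses ranked by an AI judge under the same constitution then train a preference model for the RL stage (RLAIF), replacing most human harmlessness labels.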
Why It Matters
- Foundational to Anthropic's approach to safe AI
- Reduces dependence on human feedback for safety alignment
- Scalable framework for encoding values into AI systems
