Constitutional AI: Harmlessness from AI Feedback
Bai et al. (Anthropic) • 2022
Alignment · RLHF · Safety
Summary
The foundational paper behind Claude. Constitutional AI extends RLHF by introducing a set of written principles (a 'constitution') that enables models to critique and revise their own outputs. This approach reduces reliance on human feedback for harmlessness while maintaining helpfulness, providing a scalable path to safer AI systems.
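The critique-and-revise loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical stand-in for a real language-model call, and the two principles are simplified examples of constitutional principles.

```python
# Minimal sketch of the Constitutional AI critique-and-revise loop.
# `generate` is a hypothetical placeholder for an actual LLM call.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def generate(prompt: str) -> str:
    # Placeholder: a real system would query a language model here.
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(initial_response: str, principles=CONSTITUTION) -> str:
    """Run one critique/revision pass per constitutional principle."""
    response = initial_response
    for principle in principles:
        # The model critiques its own output against the principle...
        critique = generate(
            f"Critique this response per the principle '{principle}':\n{response}"
        )
        # ...then revises the output to address that critique.
        response = generate(
            f"Revise to address this critique:\nCritique: {critique}\nOriginal: {response}"
        )
    return response
```

The revised outputs from loops like this one are then used as training data, which is how the approach substitutes AI feedback for human harmlessness labels.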
Why It Matters
- Foundational to Anthropic's approach to safe AI
- Reduces dependence on human feedback for safety alignment
- Scalable framework for encoding values into AI systems