Embedding Labs

Constitutional AI: Harmlessness from AI Feedback

Bai et al. (Anthropic), 2022

Alignment · RLHF · Safety

Summary

The foundational paper behind Claude. Constitutional AI extends RLHF by introducing a set of written principles (a "constitution") that the model uses to critique and revise its own outputs. This approach reduces reliance on human feedback for harmlessness while maintaining helpfulness, providing a scalable path to safer AI systems.
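The critique-and-revise loop at the heart of the supervised phase can be sketched in a few lines. This is an illustrative skeleton, not the paper's implementation: `query_model` is a hypothetical stand-in for a real LLM call, and the two principles and prompt templates are invented for the example.

```python
# Minimal sketch of Constitutional AI's critique-and-revise loop.
# All names and prompts here are illustrative assumptions, not the
# paper's actual constitution or prompting format.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def query_model(prompt: str) -> str:
    # Placeholder for a real language-model call. Canned replies keep
    # the loop runnable end to end for demonstration purposes.
    if prompt.startswith("Critique"):
        return "The draft could be more cautious."
    if prompt.startswith("Revise"):
        return "Revised answer incorporating the critique."
    return "Initial draft answer."

def constitutional_revision(user_prompt: str, rounds: int = 2) -> str:
    """Draft a response, then repeatedly critique and revise it
    against principles drawn from the constitution."""
    response = query_model(user_prompt)
    for i in range(rounds):
        principle = CONSTITUTION[i % len(CONSTITUTION)]
        critique = query_model(
            f"Critique the response below using this principle: "
            f"{principle}\nResponse: {response}"
        )
        response = query_model(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

The revised transcripts produced by a loop like this are then used as supervised fine-tuning data, so the model learns the revised behavior directly rather than relying on per-example human labels.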

Why It Matters

  • Foundational to Anthropic's approach to safe AI
  • Reduces dependence on human feedback for safety alignment
  • Scalable framework for encoding values into AI systems
