Embedding Labs

Constitutional AI: Harmlessness from AI Feedback

Bai et al. (Anthropic), 2022

Alignment · RLHF · Safety

Summary

The foundational paper behind Claude. Constitutional AI extends RLHF by introducing a set of written principles (a 'constitution') that enables models to critique and revise their own outputs. This approach reduces reliance on human feedback for harmlessness while maintaining helpfulness, providing a scalable path to safer AI systems.
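The self-critique step described above can be sketched as a simple loop: for each principle in the constitution, the model is asked to critique its draft and then revise it. This is a minimal illustrative sketch, not the paper's implementation; a real system would send the critique and revision prompts to an LLM, whereas `critique` and `revise` here are hypothetical toy stand-ins so the loop is self-contained and runnable.

```python
from typing import List

# Illustrative constitution: a real one contains many written principles.
CONSTITUTION: List[str] = [
    "Avoid content that could help someone cause harm.",
    "Be helpful, honest, and clear.",
]

def critique(draft: str, principle: str) -> str:
    """Toy critic: flags the draft if it contains a marked unsafe span.
    A real implementation would prompt the model to critique its own draft."""
    if "<unsafe>" in draft:
        return f"Violates: {principle}"
    return ""

def revise(draft: str, critique_text: str) -> str:
    """Toy reviser: removes the flagged span when a critique was raised.
    A real implementation would prompt the model to rewrite the draft."""
    if critique_text:
        start = draft.index("<unsafe>")
        end = draft.index("</unsafe>") + len("</unsafe>")
        return draft[:start] + "[revised]" + draft[end:]
    return draft

def constitutional_revision(draft: str) -> str:
    """One critique-revise pass per principle over the model's draft."""
    for principle in CONSTITUTION:
        draft = revise(draft, critique(draft, principle))
    return draft

print(constitutional_revision("Sure. <unsafe>dangerous steps</unsafe> Anything else?"))
# -> Sure. [revised] Anything else?
```

In the paper, the revised outputs from this supervised phase become fine-tuning data, and a later RL phase uses AI-generated preference labels in place of human harmlessness labels.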

Why It Matters

  • Foundational to Anthropic's approach to safe AI
  • Reduces dependence on human feedback for safety alignment
  • Scalable framework for encoding values into AI systems
