AI Alignment
AI Alignment investigates failure modes, adversarial vulnerabilities, and emergent misbehavior in large language models through systematic red teaming. The project develops testing frameworks and attack strategies to uncover risks before deployment, contributing to the broader effort of building safe and reliable AI systems.