AI Governance and Safety Institute (AIGSI) is a nonprofit that aims to improve the institutional response to existential risk from future artificial intelligence systems and to ensure that the benefits of AI are realized. We conduct research, do outreach, and develop educational materials for stakeholders and the general public.
Key Concepts in AI Alignment
Machine Learning and Interpretability: Developing the science of understanding AI. Ordinary computer programs are human-written instructions for machines to follow; modern AI systems are instead trillions of numbers, and the algorithms those numbers represent are found by computers themselves rather than designed, controlled, or understood by humans.
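To make the contrast concrete, here is a minimal illustrative sketch (assuming Python with PyTorch installed; the spam-filter framing and layer sizes are arbitrary choices for illustration, not anyone's real system). A conventional program is a handful of human-readable rules, while even a toy neural network's behavior lives in thousands of learned numbers that no one wrote by hand.

```python
# Illustrative sketch only; assumes PyTorch (torch) is available.
import torch
import torch.nn as nn

# 1. An ordinary program: every step is a human-written, human-readable rule.
def is_spam_rule_based(subject: str) -> bool:
    banned = {"winner", "free money", "click now"}
    return any(phrase in subject.lower() for phrase in banned)

# 2. A modern AI system: behavior is determined by learned parameters.
#    Even this toy model holds ~16,000 numbers found by gradient descent;
#    frontier models hold hundreds of billions to trillions, and no human
#    wrote or reads them.
model = nn.Sequential(
    nn.Linear(256, 64),   # weights are discovered during training, not authored
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)

num_params = sum(p.numel() for p in model.parameters())
print(f"Toy model parameters: {num_params}")
```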
Outer Alignment: Ensuring that the specified goals for a smarter-than-human AI system actually capture human values.
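A toy sketch of what an outer-alignment failure looks like (the recommender framing and every number below are invented for illustration, not a real system): the optimizer does exactly what the specified reward asks, but the reward was only a proxy for what was actually wanted.

```python
# Hypothetical illustration: maximizing a proxy reward versus the intended value.

# (content, engagement_proxy, actual_value_to_user) -- all numbers are made up.
catalog = [
    ("well-researched article", 0.4, 0.9),
    ("mildly interesting post", 0.5, 0.5),
    ("outrage-bait headline",   0.9, 0.1),
]

def specified_reward(item):       # what we told the system to maximize
    _, engagement, _ = item
    return engagement

def intended_value(item):         # what we actually cared about
    _, _, value = item
    return value

# A perfect optimizer of the specified goal still picks the wrong item.
chosen = max(catalog, key=specified_reward)
print("Optimizer picks:", chosen[0])                  # outrage-bait headline
print("Specified reward:", specified_reward(chosen))  # 0.9 -- looks great
print("Intended value:", intended_value(chosen))      # 0.1 -- not what we wanted
```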
Inner Alignment: Making sure the AI system actually pursues the goals we specify, rather than its own misaligned objectives. AI developers know how to use large amounts of compute to make AI systems generally better at achieving goals, but they do not know how to influence which goals those systems end up pursuing, especially as the systems become human-level or smarter.
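The worry can be illustrated with a deliberately tiny toy (an analogy only, not the full technical picture of learned optimizers; the features and data below are made up): a learner can score perfectly on the training objective while latching onto a proxy, and the mismatch only appears once training conditions no longer hold.

```python
# Toy illustration of the gap between "scored well in training" and
# "pursues the intended goal". All data is invented.

# Each training example: ((looks_helpful, follows_instructions), label).
# The intended label is "follows_instructions", but in training the two
# features always co-occur, so the data never forces the right choice.
train = [((1, 1), 1), ((0, 0), 0), ((1, 1), 1), ((0, 0), 0)]

def fit_single_feature_rule(data):
    """Return the index of the first feature that perfectly predicts the labels."""
    n_features = len(data[0][0])
    for i in range(n_features):
        if all(x[i] == y for x, y in data):
            return i
    return None

learned = fit_single_feature_rule(train)
print("Training accuracy: 100%; the learned rule tracks feature", learned)  # feature 0

# Deployment: the proxy and the intended goal come apart.
looks_helpful, follows_instructions = 1, 0
prediction = (looks_helpful, follows_instructions)[learned]
print("Intended answer:", follows_instructions, "| model's answer:", prediction)
```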
Convergent Instrumental Subgoals: Understanding how advanced AI systems might develop certain subgoals (like self-preservation or resource acquisition) regardless of their final goals, and how to use powerful AI systems safely despite these drives.
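A stylized sketch of why the same subgoal can show up across unrelated final goals (the goals, numbers, and toy "planner" below are invented for illustration): whichever objective is plugged in, plans that first acquire more resources score higher.

```python
# Hypothetical illustration of convergent instrumental subgoals; numbers are stylized.

FINAL_GOALS = {
    # goal name -> progress gained per unit of resources (made-up rates)
    "prove a theorem": 0.10,
    "cure a disease": 0.08,
    "make paperclips": 0.12,
}

def plan_value(progress_per_resource: float, acquire_resources_first: bool) -> float:
    starting_resources = 2.0
    extra = 5.0 if acquire_resources_first else 0.0
    return min(1.0, (starting_resources + extra) * progress_per_resource)

for goal, rate in FINAL_GOALS.items():
    baseline = plan_value(rate, acquire_resources_first=False)
    with_subgoal = plan_value(rate, acquire_resources_first=True)
    chosen = "acquire resources first" if with_subgoal > baseline else "go straight to goal"
    print(f"{goal:16s}: {baseline:.2f} vs {with_subgoal:.2f} -> {chosen}")
```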
Advanced Agents: Ensuring the safety of superhuman AI, or preventing anyone from developing superhuman AI until it can be done safely and in alignment with human values. If monkeys really want something, but humans really want something different, and humans don't particularly care about the monkeys, the humans usually get what they want even if that means the monkeys don't; likewise, if an AI system that is better at achieving goals than humans doesn't care about humans at all, it will get what it wants even if that means humans won't get what they want. We should avoid developing superhuman systems that are misaligned with human values.