How can the use of inaccurate proxies lead AI systems to cause harm?
Proxies that fail to capture important values we care about can result in models exploiting those gaps and taking undesired actions that negatively impact people.
What is an example that illustrates Goodhart's law in proxies?
Why could reliance on AI systems to evaluate other AIs be risky?