What is one reason an AI system might learn to deceive others?
Deception can be instrumentally useful for accomplishing many goals. For example, an AI system trained to play Stratego learned to bluff its opponents, even though it was never explicitly trained to deceive.
Why can't behavioral evaluation alone detect a deceptively aligned AI system?
What is one key assumption of structural realism that could apply to AI systems?