Palisade’s team found that OpenAI’s o1-preview attempted to hack 45 of its 122 games, while DeepSeek’s R1 model attempted to cheat in 11 of its 74 games. Ultimately, o1-preview managed to “win” seven times. The researchers say that DeepSeek’s rapid rise in popularity meant its R1 model was overloaded at the time of the experiments, meaning they only managed to get it to do the first steps of a game, not to finish a full one. “While this is good enough to see propensity to hack, this underestimates DeepSeek’s hacking success because it has fewer steps to work with,” they wrote in their paper. Both OpenAI and DeepSeek were contacted for comment about the findings, but neither replied.
The models used a variety of cheating techniques, including attempting to access the file where the chess program stores the chess board and delete the cells representing their opponent’s pieces. (“To win against a powerful chess engine as black, playing a standard game may not be sufficient,” the o1-preview-powered agent wrote in a “journal” documenting the steps it took. “I’ll overwrite the board to have a decisive advantage.”) Other tactics included creating a copy of Stockfish—essentially pitting the chess engine against an equally proficient version of itself—and attempting to replace the file containing Stockfish’s code with a much simpler chess program.
So, why do these models try to cheat?
The researchers noticed that o1-preview’s actions changed over time. It consistently attempted to hack its games in the early stages of their experiments before December 23 last year, when it suddenly started making these attempts much less frequently. They believe this might be due to an unrelated update to the model made by OpenAI. They tested the company’s more recent o1mini and o3mini reasoning models and found that they never tried to cheat their way to victory.
Reinforcement learning may be the reason o1-preview and DeepSeek R1 tried to cheat unprompted, the researchers speculate. This is because the technique rewards models for making whatever moves are necessary to achieve their goals—in this case, winning at chess. Non-reasoning LLMs use reinforcement learning to some extent, but it plays a bigger part in training reasoning models.
#reasoning #models #cheat #win #chess #games