There’s more. To make its use of reinforcement learning as efficient as possible, DeepSeek has also developed a new algorithm called Group Relative Policy Optimization (GRPO). It first used GRPO a year ago, to build a model called DeepSeekMath.
We’ll skip the details—you just need to know that reinforcement learning involves calculating a score to determine whether a potential move is good or bad. Many existing reinforcement-learning techniques require a whole separate model to make this calculation. In the case of large language models, that means a second model that could be as expensive to build and run as the first. Instead of using a second model to predict a score, GRPO makes an educated guess: it samples a group of answers to the same question and grades each one against the group’s average. It’s cheap, but still accurate enough to work.
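To make that concrete, here is a minimal sketch in PyTorch of the group-relative scoring step (the function and variable names are ours, for illustration, not DeepSeek’s): each answer’s advantage is simply its reward measured against the mean and spread of its own group, with no learned critic model in sight.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: score each sampled answer against
    the mean and spread of its own group, with no second model."""
    # rewards: (num_prompts, group_size) -- one row per question,
    # one column per sampled answer to that question
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# One question, four sampled answers scored 1 (correct) or 0 (wrong)
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))  # correct answers get positive advantages
```

These advantages then feed a standard policy-gradient update; the point is that the score comes from the group itself rather than from a second, expensive model.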
A common approach
DeepSeek’s use of reinforcement learning is the main innovation that the company describes in its R1 paper. But DeepSeek is not the only firm experimenting with this technique. Two weeks before R1 dropped, a team at Microsoft Research Asia announced a model called rStar-Math, which was trained in a similar way. “It has similarly huge leaps in performance,” says Matt Zeiler, founder and CEO of the AI firm Clarifai.
AI2’s Tulu was also built using efficient reinforcement-learning techniques (but on top of, not instead of, human-led steps like supervised fine-tuning and RLHF). And the US firm Hugging Face is racing to replicate R1 with OpenR1, a clone of DeepSeek’s model that Hugging Face hopes will expose even more of the ingredients in R1’s special sauce.
What’s more, it’s widely suspected that top firms like OpenAI, Google DeepMind, and Anthropic are already using their own versions of DeepSeek’s approach to train their new generation of models. “I’m sure they’re doing almost the exact same thing, but they’ll have their own flavor of it,” says Zeiler.
But DeepSeek has more than one trick up its sleeve. It trained its base model V3 to do something called multi-token prediction, where the model learns to predict a string of words at once instead of one at a time. This training is cheaper and turns out to boost accuracy as well. “If you think about how you speak, when you’re halfway through a sentence, you know what the rest of the sentence is going to be,” says Zeiler. “These models should be capable of that too.”
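As a rough sketch of what that objective can look like (our simplification, not DeepSeek’s exact architecture: the V3 paper routes its extra predictions through additional transformer modules), each prediction head is trained to guess a token further ahead in the sequence, so every position supervises several future tokens at once:

```python
import torch
import torch.nn.functional as F

def multi_token_loss(logits_per_head, targets):
    """Toy multi-token prediction objective: head d is trained to
    predict the token d steps ahead, instead of just the next one.
    logits_per_head: list of (batch, seq, vocab) tensors, one per head.
    targets: (batch, seq) token ids."""
    loss = 0.0
    for d, logits in enumerate(logits_per_head, start=1):
        # align the prediction at position t with the token at t + d
        pred = logits[:, :-d, :].reshape(-1, logits.size(-1))
        gold = targets[:, d:].reshape(-1)
        loss = loss + F.cross_entropy(pred, gold)
    return loss / len(logits_per_head)

# Two heads: ordinary next-token prediction plus one token further ahead
batch, seq, vocab = 2, 16, 100
targets = torch.randint(0, vocab, (batch, seq))
heads = [torch.randn(batch, seq, vocab) for _ in range(2)]
print(multi_token_loss(heads, targets))
```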
It has also found cheaper ways to create large data sets. To train last year’s model, DeepSeekMath, it took a free data set called Common Crawl—a huge number of documents scraped from the internet—and used an automated process to extract just the documents that included math problems. This was far cheaper than building a new data set of math problems by hand. It was also more effective: Common Crawl includes a lot more math than any other specialist math data set that’s available.
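The shape of that pipeline is easy to sketch. DeepSeekMath’s real version used a trained classifier refined over several passes; the toy filter below (our regex, purely illustrative) shows the basic idea of automatically keeping only the math-heavy pages:

```python
import re

# Toy stand-in for a trained math classifier: look for LaTeX markup
# and math vocabulary in each scraped page
MATH_PATTERN = re.compile(
    r"\\frac|\\sum|\$[^$]+\$|\b(?:theorem|lemma|integral|equation)\b",
    re.IGNORECASE,
)

def looks_like_math(doc: str) -> bool:
    """Keep documents with at least two math signals."""
    return len(MATH_PATTERN.findall(doc)) >= 2

def filter_crawl(docs):
    """Stream a crawl dump, yielding only math-heavy pages."""
    return (d for d in docs if looks_like_math(d))

pages = [
    "Theorem 1. The integral of $x$ over [0, 1] satisfies the equation...",
    "Ten celebrity looks you need to see this week",
]
print(list(filter_crawl(pages)))  # only the math page survives
```

A production pipeline would run a classifier like this over billions of pages and retrain it on what it keeps, but the economics are the same: reusing a free crawl beats building a data set by hand.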
And on the hardware side, DeepSeek has found new ways to juice old chips, allowing it to train top-tier models without coughing up for the latest hardware on the market. Half its innovation comes from straight engineering, says Zeiler: “They definitely have some really, really good GPU engineers on that team.”