DeepSeek has achieved OpenAI o1-level reasoning through a blend of pure reinforcement learning (RL) and a multi-stage training process, tackling longstanding challenges in AI training along the way. This explainer breaks down how the approach works and why it makes advanced reasoning models more accessible than ever, no AI PhD required.

Explained
DeepSeek just made a breakthrough: you can train a model to match OpenAI o1-level reasoning using pure reinforcement learning (RL), without any labeled data (DeepSeek-R1-Zero). But RL alone isn’t perfect; it can lead to challenges like poor readability. A mix of methods in a multi-stage training process fixes these issues (DeepSeek-R1).
The launch of GPT-4 forever changed the AI industry. But today, it feels like an iPhone 4 compared to the next wave of reasoning models (e.g. OpenAI o1).
These “reasoning models” introduce a chain-of-thought (CoT) thinking phase before generating an answer at inference time, which in turn improves their reasoning performance.
While OpenAI kept their methods under wraps, DeepSeek is taking the opposite approach: sharing their progress openly and earning praise for staying true to the open-source mission. Or as Marc put it:
This open-source reasoning model is as good as OpenAI’s o1 in tasks like math, coding, and logical reasoning, which is a huge win for the open-source community… and the world (Marc, your words not ours!)
As someone who spends a lot of time working with LLMs and guiding others on how to use them, I decided to take a closer look at the DeepSeek-R1 training process. Using their paper as my guide, I pieced it all together and broke it down into something anyone can follow—no AI PhD required. Hopefully you’ll find it useful!
Now, let’s start with the fundamentals.
A quick primer
To better understand the backbone of DeepSeek-R1, let’s cover the basics:
Reinforcement Learning (RL): A model learns by receiving rewards or penalties based on its actions, improving through trial and error. In the context of LLMs, this can involve traditional RL methods like policy optimization (e.g., Proximal Policy Optimization, PPO), value-based approaches (e.g., Q-learning), or hybrid strategies (e.g., actor-critic methods). Example: When training on a prompt like “2 + 2 =”, the model receives a reward of +1 for outputting “4” and a penalty of -1 for any other answer. In modern LLMs, rewards are often determined by human-labeled feedback (RLHF) or as we’ll soon learn, with automated scoring methods like GRPO.
Supervised fine-tuning (SFT): A base model is re-trained using labeled data to perform better on a specific task. Example: Fine-tune an LLM using a labeled dataset of customer support questions and answers to make it more accurate in handling common queries. Great to use if you have an abundance of labeled data.
Cold start data: A minimally labeled dataset used to help the model get a general understanding of the task. Example: Fine-tune a chatbot with a simple dataset of FAQ pairs scraped from a website to establish a foundational understanding. Useful when you don’t have a lot of labeled data.
Multi-stage training: A model is trained in phases, each focusing on a specific improvement, such as accuracy or alignment. Example: Train a model on general text data, then refine it with reinforcement learning on user feedback to improve its conversational abilities.
Rejection sampling: A method where a model generates multiple potential outputs, but only those that meet specific criteria, such as quality or relevance, are kept for further use. Example: After an RL process, a model generates several responses but keeps only the ones that are useful for retraining the model (a minimal sketch of this idea follows this list).
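Here’s a minimal rejection-sampling sketch in Python. The candidate generator, the scoring rule, and the threshold are all stand-ins; real pipelines sample from an actual model and score outputs with reward models, rule checks, or human review.

```python
import random

def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    # Stand-in for sampling n completions from a model (outputs are fake).
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def quality_score(completion: str) -> float:
    # Stand-in for a scoring rule (a reward model, heuristics, or human review).
    return random.random()

def rejection_sample(prompt: str, n: int = 4, threshold: float = 0.7) -> list[str]:
    # Keep only the completions whose score clears the threshold.
    return [c for c in generate_candidates(prompt, n) if quality_score(c) >= threshold]

kept = rejection_sample("Explain why 2 + 2 = 4")
print(f"Kept {len(kept)} of 4 candidates for the next fine-tuning round")
```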
First model: DeepSeek-R1-Zero
The team at DeepSeek set out to test whether it’s possible to train a powerful reasoning model using pure reinforcement learning (RL), that is, RL that works without any labeled data.
Skipping labeled data? Seems like a bold move for RL in the world of LLMs.
I’ve learned that pure RL is slower upfront (trial and error takes time), but it eliminates the costly, time-intensive labeling bottleneck. In the long run, it’ll be faster, more scalable, and far more efficient for building reasoning models, mostly because they learn on their own.
DeepSeek pulled off a successful pure-RL training run, matching OpenAI o1’s performance.
Calling this a ‘huge accomplishment’ feels like an understatement; it’s the first time anyone’s made this work. Then again, maybe OpenAI did it first with o1, but we’ll never know, will we?
The biggest question on my mind was: ‘How did they make it work?’
Let’s cover what I found out.
Using the GRPO RL framework
Traditionally, RL for training LLMs has been most successful when combined with labeled data (e.g., the PPO RL framework). This approach employs a critic model that acts like an “LLM coach”, giving feedback on each move to help the model improve. It evaluates the LLM’s actions against labeled data, estimating how likely the model is to succeed (the value function) and guiding the model’s overall strategy.
The challenge?
This approach is limited by the labeled data it uses to evaluate decisions. If the labeled data is incomplete, biased, or doesn’t cover the full range of tasks, the critic can only provide feedback within those constraints — and it won’t generalize well.
Enter GRPO!
The authors used the Group Relative Policy Optimization (GRPO) RL framework (invented by the same team, wild!) which eliminates the critic model.
With GRPO, you skip the ‘coach’: the model’s outputs are scored over multiple rounds using predefined rules like coherence and/or fluency, and the model learns by comparing each output’s score to the group’s average (a minimal sketch of this group-relative scoring follows below).
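Here is a minimal sketch of that group-relative scoring idea. This is not DeepSeek’s implementation, and the reward numbers are made up:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # Score each sampled output relative to its group's mean and spread,
    # instead of asking a separate critic model for a value estimate.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero spread
    return [(r - mean) / std for r in rewards]

# Rule-based rewards for a group of outputs sampled from the same prompt (made up).
group_rewards = [0.2, 0.9, 0.5, 0.1]
print(group_relative_advantages(group_rewards))
# Above-average outputs get a positive advantage and are reinforced;
# below-average outputs are pushed down.
```

Because the baseline comes from the group itself, there is no separate value network to train, which is part of what makes GRPO cheaper than critic-based setups like PPO.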
But wait, how did they know if these rules are the right rules?
In this method, the rules aren’t perfect—they’re just a best guess at what “good” looks like. These rules are designed to catch patterns that usually make sense, like:
- Does the answer make sense? (Coherence)
- Is it in the right format? (Completeness)
- Does it match the general style we expect? (Fluency)
For example, on mathematical tasks, the DeepSeek-R1-Zero model could be rewarded for producing outputs that adhered to mathematical principles or logical consistency, even without knowing the exact answer. A toy version of such a rule-based reward is sketched below.
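In this sketch, the `<think>`/`<answer>` tags and the reward weights are illustrative assumptions, not DeepSeek’s exact reward function:

```python
import re

def rule_based_reward(output: str, reference_answer: str | None = None) -> float:
    # Toy reward: check formatting, and check accuracy when a reference exists.
    reward = 0.0
    # Format rule: the reasoning should sit inside <think> ... </think> tags.
    if re.search(r"<think>.+?</think>", output, re.DOTALL):
        reward += 0.5
    # Accuracy rule: the marked final answer matches the reference.
    answer = re.search(r"<answer>(.+?)</answer>", output, re.DOTALL)
    if reference_answer and answer and answer.group(1).strip() == reference_answer:
        reward += 1.0
    return reward

sample = "<think>2 + 2 groups two pairs, so the total is 4.</think><answer>4</answer>"
print(rule_based_reward(sample, reference_answer="4"))  # 1.5
```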
It makes sense... and it works!
The DeepSeek-R1-Zero model had great performance on reasoning benchmarks. It also achieved an 86.7% pass@1 score (the share of problems solved on the first attempt) on AIME 2024 (a prestigious mathematics competition for high school students), matching the performance of OpenAI-o1-0912. While this seems like the biggest breakthrough in the paper, the R1-Zero model came with a few challenges: poor readability and language mixing.
Second model: DeepSeek-R1
Poor readability and language mixing are issues you’d expect from pure RL, which lacks the structure and formatting provided by labeled data.
Now, with this paper, we can see that multi-stage training can mitigate these challenges. In the case of the DeepSeek-R1 model, several training methods were combined.

Here’s a quick explanation of each training stage and what it did (a high-level sketch of the whole pipeline follows the steps):
Step 1: They fine-tuned a base model (DeepSeek-V3-Base) using thousands of cold-start data points, which is a tiny fraction compared to the millions or even billions of labeled data points needed for large-scale supervised learning.
Step 2: Applied pure RL (like R1-Zero) to improve reasoning skills.

Step 3: Near RL convergence, they utilized rejection sampling, allowing the model to generate its own labeled synthetic data by selecting the best examples from the recent successful RL run.
Step 4: Merged the new synthetic data with supervised data from DeepSeek-V3-Base in areas such as writing and factual QA, enabling the model to learn from high-quality outputs and diverse domain-specific knowledge.
Step 5: Following fine-tuning with the new data, the model undergoes a final RL process across varied prompts and scenarios.
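Put together, the recipe looks roughly like this. Every function below is a stub standing in for a real training stage; none of this is an actual training API:

```python
def sft(model, dataset):                       # supervised fine-tuning stage
    return f"{model} + sft({len(dataset)} examples)"

def rl(model, prompts, reward):                # reinforcement-learning stage
    return f"{model} + rl({reward})"

def rejection_sample_outputs(model, prompts):  # keep only the best RL outputs
    return [f"synthetic example from {model}" for _ in prompts]

def train_deepseek_r1(base_model, cold_start, supervised, prompts):
    model = sft(base_model, cold_start)                    # Step 1: cold-start SFT
    model = rl(model, prompts, reward="rule-based")        # Step 2: pure RL
    synthetic = rejection_sample_outputs(model, prompts)   # Step 3: rejection sampling
    model = sft(model, synthetic + supervised)             # Step 4: SFT on merged data
    return rl(model, prompts, reward="varied scenarios")   # Step 5: final RL pass

print(train_deepseek_r1("DeepSeek-V3-Base", ["faq"] * 3, ["qa"] * 2, ["p1", "p2"]))
```

Reading it this way makes the “why” clearer: the cold-start SFT gives the RL stage a readable starting point, and the later stages recycle the RL model’s own best outputs into supervised data.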
This multi-stage process might feel like a hack, so why does DeepSeek-R1 use it?
Because each step builds on the last, enhancing foundation, reasoning, data quality, and overall generalization.
These steps led to the DeepSeek-R1 model achieving high scores across reasoning benchmarks.
CoT at inference time relies on RL
To use chain-of-thought reasoning effectively at inference time, models need RL training that encourages step-by-step reasoning; it’s crucial for achieving high-level reasoning. The question arises: why did OpenAI keep their training methods secret, given the seemingly straightforward multi-stage process behind the o1 model?
They clearly employed RL, synthesized data from RL checkpoints, and used supervised training for clarity. What was the strategic advantage in delaying competition (R1) by just a couple of months?
I guess time will tell.
How to use DeepSeek-R1
To use DeepSeek-R1, you can test it on their free platform or acquire an API key for integration with AI development platforms like Vellum or Fireworks AI.
The hosted model costs $0.55 per million input tokens and $2.19 per million output tokens, making it about 27 times cheaper for inputs and nearly 27.4 times cheaper for outputs compared to OpenAI’s o1 model.
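To make the pricing concrete, here’s a quick back-of-the-envelope calculation. The token counts are made up; reasoning models tend to produce long outputs:

```python
# Back-of-the-envelope cost for one request at the published per-token prices.
INPUT_PRICE = 0.55 / 1_000_000   # $ per input token
OUTPUT_PRICE = 2.19 / 1_000_000  # $ per output token

input_tokens, output_tokens = 2_000, 8_000  # hypothetical request
cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"${cost:.4f}")  # ~$0.0186 for this request
```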
This API version supports a maximum context length of 64K but lacks function calling and JSON outputs. It allows retrieval of both the “reasoning” and the final answer, albeit slowly, which is acceptable for reasoning models as speed isn’t the priority.
Additionally, it does not support several other parameters like temperature, top_p, presence_penalty, frequency_penalty, logprobs, and top_logprobs, which complicates production use.
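For reference, here’s a minimal sketch of calling the hosted model through an OpenAI-compatible client. The base URL, the `deepseek-reasoner` model name, and the `reasoning_content` field reflect DeepSeek’s documentation at the time of writing, so double-check the current docs before relying on them:

```python
from openai import OpenAI

# Assumed endpoint and model name; verify against DeepSeek's current docs.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "How many prime numbers are below 30?"}],
)

message = response.choices[0].message
print("Reasoning:", getattr(message, "reasoning_content", None))  # chain-of-thought
print("Answer:", message.content)                                 # final answer
```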
Conclusion
DeepSeek has demonstrated that significant improvements in LLM reasoning can be achieved purely through reinforcement learning (RL), without the need for labeled data. Their post-training techniques enhance performance even further.
Expect a surge of models like R1 and o1 soon. While it seemed model scaling had plateaued, this method is reopening doors for faster advancements. OpenAI took 6 months from GPT-3.5 to GPT-4, but DeepSeek achieved o1-level performance in just 2 months without prior knowledge of OpenAI’s methods.
Get ready for a new wave of models that will put o1 to shame.
Source: Materials provided by Anita Kirkovska on Vellum.AI



