Monday, May 5, 2025

The Algorithm Behind DeepSeek’s Success


Introduction to Reinforcement Learning

Reinforcement learning is a type of machine learning in which an agent learns to take actions that achieve a goal by interacting with an environment. Traditional reinforcement learning methods, however, have limitations when applied to large language models. One widely used method, Proximal Policy Optimisation (PPO), relies on two key components: a policy model and a critic model. The policy model generates outputs, while the critic model estimates the value of those outputs, and that value estimate serves as the baseline when computing policy updates.
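For reference, PPO's standard clipped surrogate objective makes the dependence on the critic explicit: the advantage estimate \(\hat{A}_t\) is computed from the critic's value predictions (commonly via generalised advantage estimation). A textbook form of the objective is:

```latex
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```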

Challenges with Traditional Methods

The dual-model approach of PPO creates several challenges, especially when training large language models. These challenges include:

  • Memory Inefficiency: Running both a policy and a critic consumes substantially more GPU memory, a serious constraint for models with hundreds of billions of parameters.
  • Training Instability: The critic's value estimates can be noisy and inconsistent, and this instability tends to grow with the scale of the training data.
  • Complex Implementation: Keeping policy and critic training in balance requires careful hyperparameter tuning, which becomes harder at large-language-model scale.

What is Group Relative Policy Optimisation (GRPO)?

GRPO is a new approach that addresses the limitations of traditional reinforcement learning methods. It removes the critic model entirely and instead relies on relative comparisons within groups of samples. GRPO uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. This makes GRPO more memory- and compute-efficient, especially when working with large language models.
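Concretely, for a single question the old policy samples \(G\) outputs with rewards \(r_1, \dots, r_G\), and each output's advantage is its reward normalised within the group (this is the group-relative form used in DeepSeek's published description of GRPO):

```latex
\hat{A}_i = \frac{r_i - \mathrm{mean}\!\left(\{r_1, \dots, r_G\}\right)}{\mathrm{std}\!\left(\{r_1, \dots, r_G\}\right)}
```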

How GRPO Works

GRPO works by sampling a group of outputs from the old policy for each question and then optimizing the policy model by maximizing a group-relative objective. Because the baseline is estimated from the group's own scores rather than from a learned critic, this simple but significant shift lets GRPO do without the critic model while still reducing training resources and improving training stability.
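To make this concrete, below is a minimal, illustrative sketch of a single GRPO-style update in PyTorch. It is not DeepSeek's actual training code: the function names (grpo_advantages, grpo_loss), the rule-based rewards, and the simple KL estimate toward a frozen reference policy are assumptions made for illustration.

```python
# Illustrative GRPO-style update step (a minimal sketch, not DeepSeek's code).
import torch

def grpo_advantages(rewards):
    """Normalize rewards within one group to get per-sample advantages."""
    rewards = torch.tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(new_logps, old_logps, ref_logps, advantages,
              clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate loss with a KL penalty toward a reference policy.

    new_logps, old_logps, ref_logps: (G,) summed log-probs of each sampled output
    advantages: (G,) group-relative advantages
    """
    ratio = torch.exp(new_logps - old_logps)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Naive per-sample estimate of KL(current policy || frozen reference).
    kl = new_logps - ref_logps
    return -(surrogate - kl_coef * kl).mean()

# Toy usage: one group of G = 4 sampled answers to the same question.
rewards = [1.0, 0.0, 0.5, 0.0]                       # e.g. from a rule-based checker
adv = grpo_advantages(rewards)
new_logps = torch.tensor([-3.2, -4.1, -3.8, -4.0], requires_grad=True)
old_logps = torch.tensor([-3.3, -4.0, -3.9, -4.1])
ref_logps = torch.tensor([-3.5, -4.2, -4.0, -4.2])
loss = grpo_loss(new_logps, old_logps, ref_logps, adv)
loss.backward()                                      # gradients flow to the policy
```

In a real training loop the log-probabilities would come from the language model's forward pass over each sampled completion, and the group size G is a tunable hyperparameter; the key point is that no separate value network is needed to form the baseline.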

Benefits of GRPO

The benefits of GRPO include:

  • Improved Memory Efficiency: GRPO requires less GPU memory than traditional methods, making it more suitable for large language models.
  • Increased Training Stability: Because the baseline comes from actual group rewards rather than a learned critic, advantage estimates are more consistent, which improves training stability and reduces the risk of training failure.
  • Simplified Implementation: With no critic to train, there is no policy-critic balance to tune, which simplifies implementation and reduces the hyperparameter burden.

Conclusion

GRPO is a powerful new approach to reinforcement learning that addresses the limitations of traditional methods. By removing the critic model and relying on relative comparisons within groups of sampled outputs, GRPO improves memory efficiency and training stability while simplifying implementation. Whether you're working on a small project or a large language model, GRPO is definitely worth considering. With its ability to reduce training resources and stabilise training, GRPO has the potential to revolutionize the field of reinforcement learning.
