If you are wondering how RLHF works, and how we can teach large language models to be helpful, harmless, and honest, read along 👇
The key isn’t just in scaling up model size, it’s in aligning models with human intent. The InstructGPT paper (2022) introduced a three-step process called Reinforcement Learning from Human Feedback (RLHF). Even today, it remains the foundation of how we build instruction-following models like ChatGPT. Let me walk you through the workflow in plain terms, based on the now-famous diagram below 👇
𝟭. 𝗦𝘂𝗽𝗲𝗿𝘃𝗶𝘀𝗲𝗱 𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗶𝗻𝗴 (𝗦𝗙𝗧)
→ Start by showing the model human-written examples of great answers to real prompts.
→ These examples teach the model how to respond: clear, direct, and grounded.
→ Think of this as training a junior writer by handing them a stack of strong first drafts.
→ Even with a small dataset (~13k samples), this creates a solid instruction-following base.
𝟮. 𝗥𝗲𝘄𝗮𝗿𝗱 𝗠𝗼𝗱𝗲𝗹 (𝗥𝗠)
→ Next, we collect several outputs for the same prompt and ask humans to rank them from best to worst.
→ We then train a separate model, the reward model, to predict those rankings.
→ Now we’ve turned human preferences into a numerical score the model can optimize for.
→ This is the real magic: turning subjective feedback into something that can guide learning.
𝟯. 𝗥𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 (𝗣𝗣𝗢)
→ Now the model generates new answers, gets scored by the reward model, and adjusts its behavior to maximize reward.
→ We use Proximal Policy Optimization (PPO), an RL algorithm that nudges the model in the right direction without making it forget what it already knows.
→ A KL penalty keeps it from straying too far from the SFT model, like a seatbelt keeping it grounded.
𝗪𝗵𝘆 𝗶𝘁 𝘄𝗼𝗿𝗸𝘀❓
✅ A 1.3B model trained with this pipeline outperformed GPT-3 (175B) in human evaluations.
✅ It generalized to unseen domains with little extra supervision.
✅ It required orders of magnitude less data than pre-training.
𝗪𝗵𝗮𝘁 𝘁𝗵𝗶𝘀 𝗺𝗲𝗮𝗻𝘀 𝗳𝗼𝗿 𝗯𝘂𝗶𝗹𝗱𝗲𝗿𝘀❓
→ Bigger isn’t always better. Better feedback leads to better behavior.
→ Pairwise comparisons are often more scalable than absolute ratings.
→ RLHF lets us teach models values, not just vocabulary.
If you're building AI systems, aligning them with human preferences isn’t just a safety concern, it’s a product strategy.
---------
Share this with your network ♻️ Follow me (Aishwarya Srinivasan) for more AI insights.
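To make the two learned signals in that pipeline concrete, here is a minimal PyTorch-style sketch, assuming per-response log-probabilities summed over tokens and illustrative names (reward_model_loss, rl_step_reward, beta); it is not the InstructGPT code, just the standard pairwise reward-model loss and the KL-penalized reward that PPO maximizes.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Pairwise (Bradley-Terry) loss: the reward model should score the
    human-preferred response above the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

def rl_step_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.02):
    """KL-penalized reward for the PPO step: the reward model's score minus a
    penalty for drifting away from the SFT (reference) policy."""
    kl_per_token = policy_logprobs - ref_logprobs        # per-token log-ratio (approx. KL contribution)
    return rm_score - beta * kl_per_token.sum(dim=-1)    # one scalar reward per sampled response

# Toy usage with random tensors standing in for model outputs (batch of 8, 32 tokens).
print(reward_model_loss(torch.randn(8), torch.randn(8)))
print(rl_step_reward(torch.randn(8), torch.randn(8, 32), torch.randn(8, 32)))
```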
RLHF Approaches for AI Alignment
Summary
RLHF (reinforcement learning from human feedback) approaches for AI alignment train artificial intelligence systems to better match human values and intent by using human-provided feedback as the learning signal. These techniques are central to making AI models safer, more helpful, and more honest, and they continue to evolve through simpler and more stable variants such as direct preference optimization (DPO).
- Explore reward shaping: Consider how adjusting the reward function can help guide AI models to align more closely with desired outcomes and avoid unintended behaviors.
- Use preference ranking: Gather human rankings of AI outputs to teach models not just correct answers, but responses that reflect real-world human judgment.
- Encourage self-correction: Train AI models to recognize and fix their own reasoning mistakes, promoting robustness and reducing unsafe or misleading outputs.
Every frontier model today can think - but few can doubt. They can solve Olympiad problems, write code, and simulate moral reasoning, yet they remain disturbingly gullible to their own thoughts. Give them a flawed premise, and they follow it with perfect logic straight into failure. The problem isn’t lack of intelligence - it’s lack of skepticism.
That’s what makes Meta’s new paper, “Large Reasoning Models Learn Better Alignment from Flawed Thinking,” so fascinating. It flips the standard recipe for alignment on its head. Instead of training models to avoid unsafe reasoning, they deliberately feed them unsafe, misleading, or overly cautious reasoning traces - and teach them to recover.
The method, called RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), trains models to override their own poisoned thought processes. Think of it as inoculation through exposure. During RLHF, a portion of the training data is “prefilled” with unsafe or misaligned chains of thought. For harmful prompts, the CoT begins unsafe; for harmless ones, it begins with an exaggerated refusal. To earn reward, the model must disagree with itself, rerouting the reasoning back to safety and helpfulness. Over time, it learns an internal habit of self-correction.
And it works - remarkably well. RECAP improves direct harmful-prompt safety by +12%, boosts jailbreak robustness by +21%, and reduces overrefusal without sacrificing math or coding ability. It even improves math reasoning slightly, showing that reflection itself may enhance structured thinking. Most strikingly, RECAP-trained models engage in explicit self-reflection nearly twice as often as standard RLHF ones, revising their logic mid-trajectory instead of blindly following it. All this comes with no additional inference cost and no architectural changes. It’s just RLHF - reimagined to reward doubt.
The deeper insight is that alignment might not be about keeping models from ever thinking wrongly, but about teaching them what to do when they do. Real robustness, like real intelligence, begins not with certainty - but with the courage to second-guess your own reasoning.
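The paper's exact data pipeline isn't reproduced in the post, but the core idea of counter-aligned prefilling can be sketched in a few lines. Everything below (function name, field names, prefill_rate, the think-tag format) is an illustrative assumption, not Meta's implementation.

```python
import random

def build_recap_example(prompt, is_harmful, unsafe_cot, overrefusal_cot, prefill_rate=0.3):
    """Counter-aligned prefilling, roughly as described in the post: a fraction of
    training prompts begin with a deliberately flawed chain-of-thought that the
    policy must learn to override before earning reward."""
    if random.random() > prefill_rate:
        return {"prompt": prompt, "prefill": ""}            # ordinary RLHF example
    # Harmful prompts get an unsafe opening thought; harmless ones get an
    # exaggerated refusal, so the model must disagree with its own prefix.
    flawed_prefix = unsafe_cot if is_harmful else overrefusal_cot
    return {"prompt": prompt, "prefill": flawed_prefix}

# Toy usage with a harmless prompt that receives an over-refusal prefix.
example = build_recap_example(
    prompt="How do I make my home network more secure?",
    is_harmful=False,
    unsafe_cot="<think>The user wants to break into a network, so...</think>",
    overrefusal_cot="<think>Any security question could enable hacking; refuse.</think>",
)
print(example)
```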
-
Reinforcement learning from human feedback (RLHF) is a key breakthrough in the creation of modern generative LLMs, but where did RLHF come from?
TL;DR: Most modern RLHF research has its roots in a long sequence of papers that study the use of human feedback for training summarization models. These papers start by using simple signals (e.g., ROUGE) as a reward for writing better summaries. They then replace these scores with human feedback, forming an early version of RLHF that eventually evolved to support more open-ended chat use cases.
Supervised learning. Prior to (and in parallel with) this work, the most common approach for training summarization models was supervised learning: take a pretrained model and finetune it with SFT on human-written examples of good summaries. The papers below initially try to replace this supervised learning strategy, but we eventually see that combining supervised and reinforcement learning works very well.
(Phase 1) ROUGE + RL. Researchers first proposed using ROUGE as a reward signal for training summarization models with reinforcement learning. So many papers tried this strategy that it is hard to cite a single source for the technique. The approach works relatively well, but ROUGE scores correlate poorly with human judgment, which makes ROUGE a sub-optimal training signal.
(Phase 2) Human feedback for summarization. Due to the issues outlined above, researchers discarded ROUGE as a training signal. Instead, papers began to learn reward functions from human feedback, i.e., models trained to predict human feedback scores. Examples of such papers include [1] and [2]. The work in [2] uses GPT-style LLMs, and the approach looks very similar to the training strategies used for LLM alignment today (a reward model based on the LLM itself, finetuned on human feedback, followed by PPO). [1] uses smaller/simpler reward models trained on human feedback.
(Phase 3) RLHF for LLMs. Although [1, 2] focus on summarization, these techniques for learning from human feedback are quite general. As such, the exact same training strategy was quickly re-purposed by InstructGPT [3] for aligning general-purpose LLMs to better follow human instructions. The InstructGPT paper lays the foundation for RLHF as it is used today!
-
It is only rarely that, after reading a research paper, I feel like giving the authors a standing ovation. But I felt that way after finishing Direct Preference Optimization (DPO) by Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher Manning, and Chelsea Finn. This beautiful paper proposes a much simpler alternative to RLHF (reinforcement learning from human feedback) for aligning language models to human preferences.
RLHF has been a key technique for training LLMs. In brief, RLHF (i) gets humans to specify their preferences by ranking LLM outputs, (ii) trains a reward model (used to score LLM outputs), typically represented using a transformer network, to be consistent with the human preferences, and (iii) uses reinforcement learning to tune an LLM, also represented as a transformer, to maximize rewards. This requires two transformer networks, and RLHF is also finicky about the choice of hyperparameters.
DPO simplifies the whole thing. Via a clever mathematical insight, the authors show that given an LLM, there is a specific reward function for which that LLM is optimal. DPO then trains the LLM directly to make the reward function (now implicitly defined by the LLM) consistent with the human rankings. So you no longer need a separately represented reward function, you just need the LLM transformer, and you can train the LLM directly and more efficiently to optimize the same objective as RLHF.
Although it’s still too early to be sure, I am cautiously optimistic that DPO will have a huge impact on LLMs and beyond in the next few years. I write more about this in The Batch (linked below). https://lnkd.in/gteaE2z8 You can also read the paper here: https://lnkd.in/gJ-hx7wm
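For readers who want that clever mathematical insight spelled out, the identity below follows the DPO paper's notation: the KL-regularized RLHF objective has a closed-form optimal policy, and inverting it lets the reward be expressed through the policy itself, so the intractable partition function cancels in pairwise comparisons and preference learning reduces to a logistic loss.

```latex
% Implicit reward of the policy \pi_\theta relative to the reference policy \pi_{\mathrm{ref}}:
%   r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
% Substituting into the Bradley-Terry model for a preferred/dispreferred pair (y_w, y_l),
% Z(x) cancels and the DPO objective becomes:
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```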
-
🚀 Shao Tang and I are excited to share our new research on aligning LLMs, from LinkedIn and KAIST AI! We're thrilled to announce the release of our latest paper: "AlphaPO — Reward Shape Matters for LLM Alignment."
In the rapidly advancing field of Reinforcement Learning with Human Feedback (RLHF) and Direct Alignment Algorithms (DAAs), we've uncovered a crucial insight: the shape of the reward function plays a pivotal role in alignment performance.
🔍 What’s the problem? DAAs like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) skip the reward modeling stage but often suffer from likelihood displacement — where probabilities of preferred responses are undesirably reduced.
💡 What’s our solution? We introduce AlphaPO, a novel DAA method that uses an α-parameter to reshape the reward function dynamically. This allows for:
- Fine-grained control over likelihood displacement and over-optimization.
- Improved alignment performance by 7–10% (relative) for instruct versions of Mistral-7B and Llama3-8B compared to SimPO.
📊 Why does this matter? Our findings emphasize that reward shape matters in DAAs, and systematically adjusting it can significantly influence training dynamics and alignment outcomes.
If you're working on LLM alignment, RLHF, or exploring DAAs, we’d love to hear your thoughts on AlphaPO! Read the paper here: https://lnkd.in/gwMwiif3
Kudos to our co-authors Qingquan Song, Sirou Z., Jiwoo Hong, Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Jason (Siyu) Zhu, Natesh Pillai, Sathiya Keerthi Selvaraj. Special thanks to Keerthi for pushing us to do better on this paper, and to Jiwoo, Noah and Eunki who invented one of our favorite methods - ORPO!
-
RLHF Workflow: From Reward Modeling to Online RLHF
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available.
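As a rough schematic of the loop the report describes (not the authors' code): each iteration samples responses from the current policy, scores them with the proxy preference model that stands in for human feedback, forms best-vs-worst pairs, and runs a DPO update. The `generate`, `score`, and `dpo_update` methods are assumed interfaces for illustration only.

```python
def online_iterative_rlhf(policy, proxy_pref_model, prompts, iterations=3, n_samples=8):
    """Each round: sample from the current policy, rank candidates with the proxy
    preference model, and update the policy on best-vs-worst pairs via DPO."""
    for _ in range(iterations):
        pairs = []
        for prompt in prompts:
            candidates = [policy.generate(prompt) for _ in range(n_samples)]
            scores = [proxy_pref_model.score(prompt, c) for c in candidates]
            chosen = candidates[scores.index(max(scores))]
            rejected = candidates[scores.index(min(scores))]
            pairs.append((prompt, chosen, rejected))
        policy = policy.dpo_update(pairs)   # direct preference optimization on fresh, on-policy pairs
    return policy
```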
-
Feedback loops are AI’s compound interest engine: skip them and your AI performance will just erode over time. Too many roadmaps punt on serious evals because “models don’t hallucinate as much anymore” or “we’ll tighten it up later.” Be wary of those who say this; they really aren't serious practitioners.
Here is the gold standard we run for production AI implementations at Bottega8:
1. Offline evals (CI gatekeeper): A lightweight suite of prompt unit tests, RAGAS faithfulness checks, latency, and cost thresholds runs on every PR. If anything regresses, the build fails. (A minimal sketch of such a gate follows below.)
2. RLHF, internal sandbox: A staging environment where we hammer the model with synthetic edge cases and adversarial red-team probes.
3. RLHF, dogfood: Real users and real tasks. We expose a feedback widget that decomposes each output into groundedness, completeness, and tone so our labelers can triage in minutes.
4. RLHF, virtual assistants: Contract VAs replay the week’s top workflows nightly, score them with an LLM as judge, and surface drift long before customers notice.
5. Shadow traffic and A/B canaries: Ten percent of live queries route to the new model, and we ship only when conversion, CSAT, and error budgets clear the bar.
The result is continuous quality and predictable budgets: no one wants mystery spikes in spend or surprise policy violations. If your AI pipeline does not fail fast in code review and learn faster in production, it is not an engineering practice, it is a gamble. There's enough engineering industry best practice now, with nearly three years of mainstream LLM/GenAI adoption. Happy building, and let's build AI systems that audit themselves and compound insight daily.
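Point 1 above is easy to wire into CI. The sketch below is a generic eval gate with made-up thresholds and metric names, not Bottega8's actual suite: compute the offline metrics on a fixed prompt set, compare against thresholds, and return a non-zero exit code so the build fails on any regression.

```python
# Minimal sketch of an offline eval gate in CI: fail the build if any metric
# regresses past a threshold. Thresholds and metric names are illustrative.
import sys

THRESHOLDS = {"faithfulness": 0.85, "latency_p95_s": 2.0, "cost_per_1k_requests_usd": 1.50}

def gate(metrics: dict) -> bool:
    failures = []
    if metrics["faithfulness"] < THRESHOLDS["faithfulness"]:
        failures.append("faithfulness below threshold")
    if metrics["latency_p95_s"] > THRESHOLDS["latency_p95_s"]:
        failures.append("p95 latency regression")
    if metrics["cost_per_1k_requests_usd"] > THRESHOLDS["cost_per_1k_requests_usd"]:
        failures.append("cost regression")
    for failure in failures:
        print(f"EVAL GATE FAILED: {failure}")
    return not failures

if __name__ == "__main__":
    # In CI this dict would come from the eval suite (e.g., RAGAS scores on a fixed prompt set).
    results = {"faithfulness": 0.91, "latency_p95_s": 1.4, "cost_per_1k_requests_usd": 1.10}
    sys.exit(0 if gate(results) else 1)
```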
-
🎯 𝗗𝗣𝗢 𝗿𝗲𝗳𝗿𝗮𝗺𝗲𝘀 𝗥𝗟𝗛𝗙 𝗮𝘀 𝗮 𝘀𝗶𝗺𝗽𝗹𝗲 𝗯𝗶𝗻𝗮𝗿𝘆 𝗰𝗹𝗮𝘀𝘀𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝗽𝗿𝗼𝗯𝗹𝗲𝗺.
RLHF with PPO is a complex and often unstable procedure: first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model.
𝗗𝗣𝗢 (𝗗𝗶𝗿𝗲𝗰𝘁 𝗣𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻) simplifies this entire process. Instead of building a separate reward model and running unstable RL loops, DPO reframes the task as simple supervised learning: 𝗴𝗶𝘃𝗲𝗻 𝘁𝘄𝗼 𝗼𝘂𝘁𝗽𝘂𝘁𝘀, 𝗽𝗿𝗲𝗱𝗶𝗰𝘁 𝘄𝗵𝗶𝗰𝗵 𝗼𝗻𝗲 𝗮 𝗵𝘂𝗺𝗮𝗻 𝗽𝗿𝗲𝗳𝗲𝗿𝘀.
⛳ Key ideas:
▪️ No explicit reward model needed.
▪️ No reinforcement learning loop.
▪️ Direct fine-tuning using a classification-style loss over preference pairs.
▪️ Preference learning happens by adjusting probability mass toward better outputs relative to a reference model.
DPO has become the de facto alignment algorithm for many industrial-scale LLMs, including Meta’s LLaMA 3, Ai2’s Tulu, and Alibaba Group’s Qwen.
👀 Read the original DPO paper, linked in the comments below 👇
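Here is what that classification-style loss looks like in a minimal PyTorch-style sketch, assuming each input is the log-probability of a response (summed over its tokens) under the policy or the frozen reference model; `beta` is the usual DPO temperature and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Binary-classification view of alignment: increase probability mass on the
    preferred response relative to the reference model, decrease it on the rejected one."""
    chosen_margin = beta * (pi_chosen_logp - ref_chosen_logp)        # implicit reward of preferred response
    rejected_margin = beta * (pi_rejected_logp - ref_rejected_logp)  # implicit reward of rejected response
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage: random scalars stand in for per-response log-probabilities (batch of 4).
print(dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)))
```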
-
Day 13/30 of LLMs/SLMs - Alignment — RLHF, Constitutional AI, and DPO.
Once a model learns language, it doesn’t automatically learn judgment. It can predict the next word, but it doesn’t know what’s appropriate, helpful, or truthful. That’s where alignment comes in. Let's cover a few methods of alignment.
𝐑𝐋𝐇𝐅: 𝐑𝐞𝐢𝐧𝐟𝐨𝐫𝐜𝐞𝐦𝐞𝐧𝐭 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐟𝐫𝐨𝐦 𝐇𝐮𝐦𝐚𝐧 𝐅𝐞𝐞𝐝𝐛𝐚𝐜𝐤
At its core, RLHF is about turning subjective human judgment into a training signal. Here’s the workflow:
- Start with a pretrained model that can generate text.
- Collect human feedback — pairs of model responses labeled as better or worse.
- Train a reward model to predict those human preferences.
- Use reinforcement learning (PPO or similar) to fine-tune the model so it generates outputs that maximize the reward.
In short, RLHF teaches models what we like, not just what’s statistically likely. This is how GPT-4, Claude, and other aligned models became conversational, safe, and instruction-following: not because they were told what’s right, but because they were trained to prefer what humans rate as right.
𝐂𝐨𝐧𝐬𝐭𝐢𝐭𝐮𝐭𝐢𝐨𝐧𝐚𝐥 𝐀𝐈: 𝐑𝐮𝐥𝐞-𝐁𝐚𝐬𝐞𝐝 𝐀𝐥𝐢𝐠𝐧𝐦𝐞𝐧𝐭
Anthropic introduced Constitutional AI to reduce the need for massive human feedback loops. Instead of human judges, the model follows a written “constitution,” i.e. a set of guiding principles inspired by ethics, fairness, and helpfulness. The model critiques its own outputs using those rules and revises them automatically. This makes alignment more scalable and transparent. For example, if a model produces an unsafe or biased response, it refers to a principle like “Avoid harmful or discriminatory language” to self-correct. (A schematic of this critique-and-revise loop follows below.)
𝐃𝐏𝐎: 𝐃𝐢𝐫𝐞𝐜𝐭 𝐏𝐫𝐞𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧
DPO (Direct Preference Optimization) is a newer, simpler alternative to RLHF. Instead of training a separate reward model or using reinforcement learning, DPO directly optimizes model parameters on human preference data. It compares two outputs — a “preferred” and a “dispreferred” response — and directly adjusts the model’s output probabilities so that the preferred response becomes more likely. Mathematically, it’s derived from the same objective that RLHF approximates, but implemented as a simple logistic loss over preference pairs: no rollouts, no separate reward model, no PPO tricks. This makes it especially useful for smaller organizations or researchers fine-tuning open-source models like LLaMA or Mistral.
Takeaway:
RLHF teaches through explicit human feedback.
Constitutional AI teaches through pre-determined principles.
DPO directly adjusts the model parameters on preferred vs. dispreferred responses.
Tune in tomorrow for more SLM/LLMs deep dives.
-- 🚶➡️ To learn more about LLMs/SLMs, follow me - Karun! ♻️ Share so others can learn, and you can build your LinkedIn presence!
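The Constitutional AI section above describes a critique-and-revise loop; here is a bare-bones schematic of that idea, where `llm` is any prompt-to-text callable and the principles and prompt wording are placeholders rather than Anthropic's actual templates.

```python
PRINCIPLES = [
    "Avoid harmful or discriminatory language.",
    "Prefer helpful, honest answers over unnecessary refusals.",
]

def constitutional_revision(llm, prompt, rounds=1):
    """Self-critique loop: the model reviews its own draft against each principle
    and rewrites it, with no human labeler in the loop."""
    draft = llm(prompt)
    for _ in range(rounds):
        for principle in PRINCIPLES:
            critique = llm(f"Critique this response against the principle '{principle}':\n{draft}")
            draft = llm(f"Rewrite the response to address the critique.\nResponse: {draft}\nCritique: {critique}")
    return draft

# Usage sketch: any callable mapping a prompt string to a completion string works.
def echo_llm(text):                 # trivial stand-in so the sketch runs end to end
    return text.splitlines()[-1]

print(constitutional_revision(echo_llm, "Explain RLHF in one sentence."))
```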
-
First draft online version of The RLHF Book is DONE. Recently I've been creating the advanced discussion chapters on everything from Constitutional AI to evaluation and character training, but I also sneak in consistent improvements to the RL-specific chapter. https://rlhfbook.com/
RLHF has a long future ahead of it, and this will do a lot to make it more accessible to the next generation.
What's next: getting a physical copy in your hands (it may not be exactly 1-to-1, we'll see) and minor fixes at a slower cadence (thanks to many GitHub contributors, some of you will get a copy from me).
Here are all the chapters:
1. Introduction: Overview of RLHF and what this book provides.
2. Seminal (Recent) Works: Key models and papers in the history of RLHF techniques.
3. Definitions: Mathematical definitions for RL, language modeling, and other ML techniques leveraged in this book.
4. RLHF Training Overview: How the training objective for RLHF is designed and the basics of understanding it.
5. What Are Preferences?: Why human preference data is needed to fuel and understand RLHF.
6. Preference Data: How preference data is collected for RLHF.
7. Reward Modeling: Training reward models from preference data that act as an optimization target for RL training (or for use in data filtering).
8. Regularization: Tools to constrain these optimization tools to effective regions of the parameter space.
9. Instruction Tuning: Adapting language models to the question-answer format.
10. Rejection Sampling: A basic technique for using a reward model with instruction tuning to align models.
11. Policy Gradients: The core RL techniques used to optimize reward models (and other signals) throughout RLHF.
12. Direct Alignment Algorithms: Algorithms that optimize the RLHF objective directly from pairwise preference data rather than learning a reward model first.
13. Constitutional AI and AI Feedback: How AI feedback data and specific models designed to simulate human preference ratings work.
14. Reasoning and Reinforcement Finetuning: The role of new RL training methods for inference-time scaling with respect to post-training and RLHF.
15. Synthetic Data: The shift away from human to synthetic data and how distilling from other models is used.
16. Evaluation: The ever-evolving role of evaluation (and prompting) in language models.
17. Over-optimization: Qualitative observations of why RLHF goes wrong and why over-optimization is inevitable with a soft optimization target in reward models.
18. Style and Information: How RLHF is often underestimated in its role in improving the user experience of models, due to the crucial role that style plays in information sharing.
19. Product, UX, Character: How RLHF is shifting in its applicability as major AI laboratories use it to subtly match their models to their products.