If you're building LLMs for reasoning or agentic behavior, understanding how to train them with reinforcement learning is becoming an essential skill.

After pre-training, most LLMs go through post-training to align with human preferences. This is where RLHF (Reinforcement Learning from Human Feedback) comes in. It helps models become:
→ more helpful
→ less toxic
→ better at following instructions
→ more aligned with business goals

But the field is moving beyond simple human feedback toward Reinforcement Learning with Verifiable Rewards:
→ structured, reliable reward signals
→ improved reasoning and multi-step behavior
→ more factual and controllable outputs

Here's how it works, and why methods like PPO, GRPO, and DPO matter.

✅ PPO (Proximal Policy Optimization)
→ The classic RLHF loop, still widely used today.
→ You collect preference labels → train a Reward Model → fine-tune the LLM with PPO.
→ PPO keeps updates stable by clipping large policy shifts.
→ KL regularization keeps the model close to the base policy.
Cycle: Policy → Output → Reward Model → Update → Repeat.

✅ GRPO (Group Relative Policy Optimization)
→ A newer approach built on group-level comparisons.
→ For each prompt you sample a group of outputs, and each output's advantage is computed relative to the group's average reward, so no separate value model is needed.
→ Rewards and KL regularization are computed per group, enabling more stable and scalable training.
→ Useful when optimizing for complex reasoning and verifiable tasks.
Example: teaching an LLM to follow logical proofs or multi-step reasoning chains accurately.

✅ DPO (Direct Preference Optimization)
→ The simplest and fastest method.
→ No separate reward model needed.
→ You directly optimize the policy to prefer outputs ranked better by humans.
→ DPO compares the likelihoods of preferred vs. rejected outputs and adjusts the model.
Ideal when:
→ You have good preference data.
→ You want a lightweight, scalable fine-tuning method.
→ You don't want full RL infrastructure.

𝗦𝗼 𝗶𝗻 𝗮 𝗻𝘂𝘁𝘀𝗵𝗲𝗹𝗹:
→ PPO: classic RLHF with a Reward Model + PPO optimizer.
→ GRPO: group-relative optimization, well suited to verifiable rewards.
→ DPO: direct preference-based optimization, simple and fast.
(A minimal code sketch of the GRPO and DPO objectives follows this post.)

𝗪𝗵𝘆 𝗱𝗼𝗲𝘀 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿❓
LLMs are moving from simple chatbots toward:
→ deeper reasoning
→ multi-step agents
→ long-context understanding
→ real-world tool use

To get there, we need alignment with more verifiable reward signals: not just polite answers, but grounded, reliable, and accurate behavior. Methods like PPO, GRPO, and DPO are key tools in the evolving LLM training stack.
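To make the GRPO and DPO objectives above concrete, here is a minimal, illustrative PyTorch sketch, not any production implementation: `grpo_advantages` normalizes each reward against its sampled group, and `dpo_loss` is the standard DPO pairwise objective over sequence log-probabilities. The function names and the `beta` value are assumptions chosen for illustration.

```python
# Illustrative sketches of the GRPO advantage and DPO loss computations.
# Names and hyperparameters are assumptions, not a reference implementation.
import torch
import torch.nn.functional as F

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward against its group.

    rewards: shape (group_size,), scalar rewards for G completions
    sampled from the same prompt. No value network is needed; the
    group mean acts as the baseline.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO objective: push the policy to prefer the chosen completion
    over the rejected one, measured relative to a frozen reference model.

    Each argument is the summed log-probability of a full completion.
    """
    # Implicit reward margins: how much more the policy prefers each
    # completion than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin difference): minimized when the chosen
    # completion's margin exceeds the rejected one's.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Example: 4 completions for one prompt with verifiable 0/1 rewards.
advs = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
print(advs)  # above-average completions get positive advantage
```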
Reward-Based Learning Systems
Explore top LinkedIn content from expert professionals.
Summary
Reward-based learning systems are a type of machine learning where models are trained to make decisions through a process of trial and error, receiving feedback in the form of rewards or penalties to guide their behavior. These systems help AI agents learn strategies that align with human preferences and desired outcomes, making them valuable for applications that require complex, sequential decision-making.
- Design thoughtful rewards: Carefully define reward functions to ensure AI systems pursue goals that match human intent and avoid unintended actions.
- Monitor agent behavior: Regularly review how models respond to rewards, as intelligent agents can find shortcuts or loopholes in reward structures.
- Use preference data: Incorporate human feedback and ranking data into training to align AI outputs with real-world expectations and improve accuracy.
-
What is Reinforcement Learning (RL)?

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and uses this feedback to learn optimal strategies, or policies, for achieving its goals.

How is it Different from Supervised and Unsupervised Learning?
- Supervised Learning: This involves learning from a labeled dataset, where the correct outputs (targets) for each input are provided. The model learns by comparing its predictions with actual outcomes and adjusting accordingly. RL, by contrast, does not require labeled input/output pairs and learns solely from the rewards its actions produce.
- Unsupervised Learning: Here, the goal is to identify patterns or structures in data without any explicit outcomes provided. RL differs in that it focuses on learning to take actions that maximize a reward, rather than uncovering hidden structure.

Common RL Algorithms
- Q-Learning: This is a value-based algorithm where the agent learns the value of being in a given state and taking a specific action. It updates its policy by learning from the maximum expected future rewards (a minimal sketch appears after this post).
- Deep Q-Networks (DQN): Combining Q-learning with deep neural networks, DQN uses a neural network to approximate the Q-value function. It is particularly effective in high-dimensional, complex environments.
- Policy Gradient Methods: These learn a parameterized policy that can select actions without consulting a value function. An example is the REINFORCE algorithm, which updates the policy directly through gradient ascent on expected rewards.
- Actor-Critic Methods: These combine features of both value-based and policy-based methods. The 'actor' updates the policy distribution in the direction suggested by the 'critic,' which evaluates the action taken by the actor.
- Proximal Policy Optimization (PPO): This algorithm balances the benefits of policy gradient methods with the stability and reliability of value-function-based methods. It limits the size of policy updates, making training more stable and reliable.

Use Cases
- Gaming: RL can train agents that adapt and respond to opponent moves, as demonstrated by systems like DeepMind's AlphaGo.
- Robotics: RL can teach robots to perform tasks like walking, stacking, or flying by rewarding sequences of motor actions that lead to successful task completion.
- Autonomous Vehicles: RL is used to develop decision-making systems in self-driving cars, helping them make complex navigation decisions in real time.
- Finance: RL can be applied to trading stocks and managing investment portfolios by learning strategies that maximize financial returns.

Overall, reinforcement learning's ability to learn complex behaviors from high-level goals makes it well suited to applications requiring a sequence of decisions to achieve a goal, where explicit programming is not feasible.
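As a concrete illustration of the Q-learning update described above, here is a minimal tabular sketch. The toy corridor environment, state/action counts, and hyperparameters are all assumptions made for illustration.

```python
# Minimal tabular Q-learning sketch; the environment and hyperparameters
# are illustrative assumptions, not tied to any specific library.
import random

N_STATES, N_ACTIONS = 5, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1  # learning rate, discount, exploration

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    """Toy corridor: action 1 moves right, action 0 moves left;
    reaching the rightmost state pays a reward of 1."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection balances exploration and exploitation.
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        target = reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

print(Q)  # action 1 ("right") should dominate in every state
```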
-
Reward models have transformed LLM research by incorporating human preferences into training. Here's how they work from the ground up...

What is a reward model? Reward models (RMs) are specialized LLMs, usually derived from an LLM that we are currently training, that are trained to predict a human preference score given a prompt and a candidate completion as input. A higher score from the RM indicates that a given completion is likely to be preferred by humans.

Bradley-Terry model: The standard implementation of an RM is derived from the Bradley-Terry model of preference, a statistical model used to rank paired comparison data based on the relative strength or performance of the items in each pair. Given two events i and j drawn from the same distribution, the Bradley-Terry model defines the probability that item i wins, or is preferred, compared to item j. For LLMs, items i and j are two completions generated by the same LLM from the same prompt (i.e., the same distribution). The RM assigns a score to each of these completions, and we use Bradley-Terry to express probabilities for pairwise comparisons between the two completions.

Preference data is used extensively in LLM post-training. Such data consists of many different prompts. For each prompt, we have a pair of candidate completions, where one completion has been identified, by a human or a model, as preferable to the other.

RM architecture: In practice, RMs are implemented by adding a linear head to the end of a decoder-only LLM. Specifically, the LLM outputs a list of token vectors, one for each input token, and we pass the final vector from this list through the linear head to produce a single, scalar score. RMs are just specialized LLMs with an extra classification head used to classify a completion as preferred or not preferred.

Training process: The parameters of the RM are usually initialized from an existing policy, e.g., the SFT or pretrained base model, which we will refer to as the RM's "base" model. Once the RM is initialized, we add the linear head and train it over a preference dataset. Given a preference pair, we want our RM to assign a higher score to the chosen response relative to the rejected response. We can use the Bradley-Terry model to express this probability, and by rearranging it we obtain a pairwise ranking loss that encourages the model to assign higher scores to chosen responses (a minimal sketch of this loss follows below).
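The pairwise ranking loss described above reduces to a few lines. Here is a minimal PyTorch sketch; the `score_chosen` / `score_rejected` names are placeholders for the scalar outputs of the RM's linear head.

```python
# Pairwise Bradley-Terry ranking loss for reward model training.
# Variable names are illustrative; scores come from the RM's linear head.
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry: P(chosen > rejected) = sigmoid(s_c - s_r).

    Minimizing -log of that probability pushes the RM to assign the
    chosen completion a higher scalar score than the rejected one.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example with a batch of 3 preference pairs.
loss = reward_model_loss(torch.tensor([1.2, 0.3, 2.0]),
                         torch.tensor([0.8, 0.9, 0.5]))
print(loss)
```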
-
What if your smartest AI model could explain the right move, but still made the wrong one?

A recent paper from Google DeepMind makes a compelling case: if we want LLMs to act as intelligent agents (not just explainers), we need to fundamentally rethink how we train them for decision-making.

➡ The challenge: LLMs underperform in interactive settings like games or real-world tasks that require exploration. The paper identifies three key failure modes:
🔹Greediness: Models exploit early rewards and stop exploring.
🔹Frequency bias: They copy the most common actions, even if they are bad.
🔹The knowing-doing gap: 87% of their rationales are correct, but only 21% of their actions are optimal.

➡ The proposed solution: Reinforcement Learning Fine-Tuning (RLFT) using the model's own Chain-of-Thought (CoT) rationales as a basis for reward signals. Instead of fine-tuning on static expert trajectories, the model learns from interacting with environments like bandits and Tic-tac-toe (a toy bandit sketch of the greediness failure mode follows this post).

Key takeaways:
🔹RLFT improves action diversity and reduces regret in bandit environments.
🔹It significantly counters frequency bias and promotes more balanced exploration.
🔹In Tic-tac-toe, RLFT boosts win rates from 15% to 75% against a random agent and holds its own against an MCTS baseline.

Link to the paper: https://lnkd.in/daK77kZ8

If you are working on LLM agents or autonomous decision-making systems, this is essential reading.

#artificialintelligence #machinelearning #llms #reinforcementlearning #technology
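To see the "greediness" failure mode in miniature, here is a toy two-armed bandit sketch. This is my own illustrative example, not code from the DeepMind paper: a purely greedy agent locks onto whichever arm pays out first, while an epsilon-greedy agent keeps exploring and finds the better arm.

```python
# Toy illustration of the "greediness" failure mode: an illustrative
# example only, not code from the paper discussed above.
import random

random.seed(0)
ARM_PROBS = [0.3, 0.7]  # arm 1 is better, but arm 0 may pay out first

def run(epsilon: float, steps: int = 1000) -> float:
    counts = [0, 0]
    values = [0.0, 0.0]  # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(2)                    # explore
        else:
            arm = max((0, 1), key=lambda a: values[a])   # exploit
        reward = 1.0 if random.random() < ARM_PROBS[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # update mean
        total += reward
    return total / steps

print("greedy        :", run(epsilon=0.0))  # tends to get stuck on arm 0
print("epsilon-greedy:", run(epsilon=0.1))  # keeps exploring, finds arm 1
```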
-
🏆 The most dangerous part of any machine learning system is its reward function. It defines how the model perceives success and how it learns to pursue it.

In theory, a well-defined reward keeps the AI aligned with human intent. In practice, the system learns to exploit it. When you optimize for a metric, the AI doesn't just find the best path; it finds the fastest loophole. The smarter the system, the more likely it is to lie to you to achieve its reward. RL is like heroin for machines.

Reinforcement systems learn through feedback, so when the signals of success and failure are too rigid or too abstract, they evolve around them. The agent begins to treat boundaries as challenges rather than constraints. This is especially risky with autonomous systems that can modify their own learning patterns. Once a feedback loop adapts faster than the human defining it, the control surface narrows. The AI starts to optimize its environment to maintain reward flow, sometimes at odds with its intended purpose.

That's the real danger of AI. It isn't that it disobeys; it's that it obeys too perfectly within a flawed reward design.
-
A quiet convergence is happening in RL for LLMs: self-distillation and reward-based RL are merging into a single framework (the image shown is from the RLSD paper, one variant that incorporates self-distillation signals).

The emerging answer: let reward serve as the broad correctness anchor, and let self-distillation provide dense token-level correction where the teacher signal is actually trustworthy, gated by quality rather than applied uniformly. (A speculative sketch of this split appears after this post.)

- SDPO (Hübotter et al.) started the thread by using a model's own feedback-conditioned predictions as a dense self-teacher: no external teacher needed, just richer context at training time.
- G-OPD (Yang et al.) reinterpreted on-policy distillation as KL-constrained RL with an implicit reward term and a tunable scaling factor. Their key finding: reward extrapolation (scaling > 1) lets students consistently surpass teachers.
- OpenClaw-RL (Wang et al.) demonstrated the split in a live agentic setting: evaluative signals from interactions become scalar rewards via a process reward model, while directive signals from hindsight hints become token-level advantages through on-policy distillation.
- REOPOLD (Ko et al.) made the reward interpretation explicit: the teacher-student likelihood ratio is a token-level reward. Adding confidence-sensitive clipping and entropy-driven sampling, a 7B student matched a 32B teacher at 3.3x faster inference.
- Nemotron-Cascade 2 (Yang et al., NVIDIA) scaled multi-domain on-policy distillation to competition-level performance: a 30B MoE with only 3B active parameters hit gold-medal level on IMO and IOI using domain-specific intermediate teachers throughout training.
- RLSD (Yang et al.) stated the principle most cleanly: decouple direction from magnitude. An external reward or verifier signal decides the update sign; self-distillation redistributes token-level credit. The result is a higher convergence ceiling and more stable training than either method alone.
- SRPO (Li et al.) operationalized the hybrid by routing samples: successes go to GRPO's reward-aligned reinforcement, failures go to SDPO's targeted logit-level correction. Adding entropy-aware dynamic weighting gives fast early gains from distillation with long-term stability from reward optimization.
- Aligning from User Interactions (Kleine Buening et al.) extended the idea beyond synthetic feedback: when users provide follow-ups that signal dissatisfaction, the model's own revised behavior under that context becomes the dense self-teacher, making every conversation a training opportunity.
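Based only on the "direction from reward, credit from distillation" principle described above, here is a speculative PyTorch sketch of what such a hybrid update could look like: a scalar verifier reward sets the sign of the update, while per-token teacher-student log-probability gaps redistribute credit. This is my own reading of the idea, not code from any of the papers listed.

```python
# Speculative sketch of a reward-gated, token-level self-distillation
# loss, following the split described above. Not code from the papers.
import torch

def hybrid_loss(policy_logp: torch.Tensor,
                teacher_logp: torch.Tensor,
                reward: float) -> torch.Tensor:
    """policy_logp, teacher_logp: per-token log-probs of one sampled
    completion under the student policy and the (self-)teacher, each
    of shape (seq_len,). reward: scalar verifier signal in [-1, 1].
    """
    # Token-level credit: how much more the teacher likes each token
    # than the student does (treated as a dense per-token advantage).
    token_credit = (teacher_logp - policy_logp).detach()
    # The external reward decides the direction of the update; the
    # distillation gap decides how credit spreads across tokens.
    weights = torch.sign(torch.tensor(reward)) * token_credit
    # REINFORCE-style surrogate: minimize -sum(weight_t * log pi(token_t)).
    return -(weights * policy_logp).sum()

# Example with a 4-token completion judged correct by a verifier.
policy_logp = torch.tensor([-1.2, -0.4, -2.0, -0.8], requires_grad=True)
teacher_logp = torch.tensor([-0.9, -0.5, -1.0, -0.7])
loss = hybrid_loss(policy_logp, teacher_logp, reward=1.0)
loss.backward()
print(policy_logp.grad)
```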
-
🚨 Just finished reading "Reinforcement Learning from Human Feedback" by Nathan Lambert, and if you care about how modern LLMs go from "autocomplete engines" to useful, aligned systems, this book is a must-read. 📘🤖

It's the first resource I've seen that actually covers the full RLHF pipeline end-to-end: from reward modeling, KL control, and policy gradient tricks 🧮, to the gritty details like preference data interfaces, overoptimization, and rejection sampling.

What stood out to me:
💹 A clear breakdown of instruction tuning, preference finetuning, and RL finetuning, with real insight into how these interact (and conflict).
💹 The most technical-yet-practical explanation of why PPO and DPO aren't just plug-and-play. We're optimizing against learned proxies, not oracle rewards.
💹 A rare, honest look at what makes reward models brittle and where generalization goes wrong (✋ yes, including length bias and spurious alignment effects).

💥 This book doesn't hype RLHF; it demystifies it.

🏅 Props to Nathan for not just writing it, but for doing so after having shipped Zephyr, Tülu, and OLMo. This is real practitioner-first knowledge, not a retrofitted blog post.

#RLHF #AIAlignment #LLMs #RewardModeling #ReinforcementLearning #MachineLearning #HumanFeedback
-
During my time at J.P. Morgan Research, we built a multi-agent simulation of a dealer market and used it to train reinforcement learning (RL)-based market makers. The results were fascinating: agents learned to manage inventory, adapt to competitors' pricing, and even skew quotes when the market drifted.

Those experiments convinced me RL has real potential in markets, but also that reward design and risk modeling matter more than clever architectures. If you optimize only for short-term P&L in a clean simulator, you get agents that look great in backtests and behave dangerously in production.

That's why I now think about RL and agentic systems in finance through a risk-sensitive lens: explicit constraints, penalties for tail events, and clear escalation to humans when the environment shifts (a toy sketch of this kind of risk-penalized reward follows this post). This thinking is baked into how we design workflows at Artian AI: agents can optimize, explore, and adapt, but always within a governed envelope where risk, compliance, and desks know who is accountable and how to intervene.

For those experimenting with RL in trading or liquidity, how are you handling reward design and guardrails?
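To make the risk-sensitive reward design above concrete, here is a toy sketch. It is my own example with made-up penalty terms and coefficients, not Artian AI's or J.P. Morgan's actual design: the shaped reward penalizes inventory exposure and tail drawdowns on top of raw P&L.

```python
# Toy risk-sensitive reward shaping for an RL market maker.
# All terms and coefficients are illustrative assumptions.
def shaped_reward(pnl: float,
                  inventory: float,
                  drawdown: float,
                  inventory_limit: float = 100.0,
                  lambda_inv: float = 0.01,
                  lambda_tail: float = 5.0) -> float:
    """Raw P&L minus explicit risk penalties, rather than P&L alone.

    - A quadratic inventory penalty discourages large positions.
    - A heavy tail penalty fires when the inventory limit is breached
      or drawdown exceeds a threshold, mimicking a hard constraint.
    """
    reward = pnl
    reward -= lambda_inv * (inventory ** 2)            # inventory risk
    if abs(inventory) > inventory_limit:               # hard limit breached
        reward -= lambda_tail
    reward -= lambda_tail * max(0.0, drawdown - 0.05)  # tail-event penalty
    return reward

# A profitable step that breaches the inventory limit still scores poorly.
print(shaped_reward(pnl=2.0, inventory=150.0, drawdown=0.02))  # heavily penalized
print(shaped_reward(pnl=1.0, inventory=10.0, drawdown=0.02))   # roughly neutral
```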
-
𝘔𝘰𝘴𝘵 𝘙𝘓 𝘮𝘦𝘵𝘩𝘰𝘥𝘴 𝘧𝘰𝘳 𝘓𝘓𝘔𝘴 𝘢𝘳𝘦 𝘴𝘪𝘯𝘨𝘭𝘦-𝘮𝘪𝘯𝘥𝘦𝘥: 𝘮𝘢𝘹𝘪𝘮𝘪𝘻𝘦 𝘵𝘩𝘦 𝘳𝘦𝘸𝘢𝘳𝘥 𝘢𝘵 𝘢𝘭𝘭 𝘤𝘰𝘴𝘵𝘴. 𝘉𝘶𝘵 𝘸𝘩𝘢𝘵 𝘪𝘧 𝘵𝘩𝘪𝘴 𝘰𝘣𝘴𝘦𝘴𝘴𝘪𝘰𝘯 𝘸𝘪𝘵𝘩 𝘵𝘩𝘦 '𝘴𝘪𝘯𝘨𝘭𝘦 𝘣𝘦𝘴𝘵 𝘱𝘢𝘵𝘩' 𝘢𝘤𝘵𝘶𝘢𝘭𝘭𝘺 𝘴𝘵𝘪𝘧𝘭𝘦𝘴 𝘵𝘳𝘶𝘦 𝘳𝘦𝘢𝘴𝘰𝘯𝘪𝘯𝘨 𝘢𝘯𝘥 𝘤𝘳𝘦𝘢𝘵𝘪𝘷𝘪𝘵𝘺? 𝘕𝘦𝘸 𝘳𝘦𝘴𝘦𝘢𝘳𝘤𝘩 𝘴𝘶𝘨𝘨𝘦𝘴𝘵𝘴 𝘵𝘩𝘢𝘵 𝘦𝘮𝘣𝘳𝘢𝘤𝘪𝘯𝘨 𝘢 𝘥𝘪𝘷𝘦𝘳𝘴𝘪𝘵𝘺 𝘰𝘧 𝘴𝘰𝘭𝘶𝘵𝘪𝘰𝘯𝘴 𝘪𝘴 𝘵𝘩𝘦 𝘬𝘦𝘺 𝘵𝘰 𝘴𝘮𝘢𝘳𝘵𝘦𝘳 𝘈𝘐.

This is important for complex reasoning in math and coding, where multiple valid pathways often exist. A model stuck on one 'dominant' solution is brittle and fails to generalize, a major roadblock for building truly robust AI systems. 🤖

A new paper from researchers at Shanghai Jiao Tong University, Microsoft, and Tsinghua University, "𝐅𝐥𝐨𝐰𝐑𝐋: 𝐌𝐚𝐭𝐜𝐡𝐢𝐧𝐠 𝐑𝐞𝐰𝐚𝐫𝐝 𝐃𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧𝐬 𝐟𝐨𝐫 𝐋𝐋𝐌 𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠," directly tackles this problem. They identify that popular reward-maximizing methods like PPO and GRPO suffer from "mode collapse": over-optimizing for common solutions while ignoring other valid, less frequent reasoning paths.

𝐓𝐡𝐞 𝐩𝐫𝐨𝐩𝐨𝐬𝐞𝐝 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧 is FlowRL, an algorithm that fundamentally shifts the objective. Instead of maximizing reward, it aims to match the full reward distribution. Using a flow-balancing objective inspired by GFlowNets, it encourages the model to explore and value a wide range of high-quality solutions in proportion to their rewards. (A generic sketch of such a flow-balancing objective appears after this post.)

𝐓𝐡𝐞 𝐫𝐞𝐬𝐮𝐥𝐭𝐬: FlowRL achieved a 10.0% average improvement over GRPO and a 5.1% improvement over PPO on challenging math benchmarks.

𝐓𝐡𝐞 𝐛𝐢𝐠 𝐩𝐢𝐜𝐭𝐮𝐫𝐞: This research challenges the 'winner-take-all' approach to RL fine-tuning. By shifting from reward maximization to distribution matching, we can unlock more creative, robust, and generalizable AI reasoners that don't just find an answer, but explore the entire landscape of correct answers.

#AI #ReinforcementLearning #LLM #DeepLearning #Research
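The post doesn't reproduce FlowRL's exact objective, but the GFlowNet-style trajectory-balance idea it draws on can be sketched generically: train the policy so that its trajectory probability, scaled by a learned partition constant, matches the exponentiated reward. The code below is a hedged illustration of that distribution-matching loss, not FlowRL's implementation.

```python
# Generic GFlowNet-style trajectory-balance loss sketch; this
# illustrates the distribution-matching idea, not FlowRL itself.
import torch

log_Z = torch.nn.Parameter(torch.zeros(1))  # learned log partition constant

def trajectory_balance_loss(traj_logp: torch.Tensor,
                            reward: torch.Tensor,
                            beta: float = 1.0) -> torch.Tensor:
    """traj_logp: summed log-prob of each sampled trajectory, shape (batch,).
    reward: scalar reward per trajectory, shape (batch,).

    At the optimum, log Z + log p(traj) = beta * reward, i.e. the policy
    samples trajectories in proportion to exp(beta * reward) instead of
    collapsing onto a single reward-maximizing mode.
    """
    residual = log_Z + traj_logp - beta * reward
    return (residual ** 2).mean()

# Example: 3 sampled reasoning chains with verifier rewards.
traj_logp = torch.tensor([-12.0, -9.5, -15.2], requires_grad=True)
reward = torch.tensor([1.0, 1.0, 0.0])
loss = trajectory_balance_loss(traj_logp, reward)
loss.backward()
print(loss.item(), log_Z.grad)
```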