Feedback-Driven Learning Models

Explore top LinkedIn content from expert professionals.

Summary

Feedback-driven learning models are AI systems that use ongoing feedback—from users, self-evaluation, or automated processes—to repeatedly improve their performance and adapt to new challenges. These models are designed to learn, reflect, and refine their outputs over time, making them more responsive and relevant in real-world situations.

  • Build in feedback: Create workflows that capture user ratings, reviews, and corrections to guide the AI's ongoing development and keep it up to date.
  • Encourage self-reflection: Allow the model to critique its own outputs and make iterative improvements, so it learns from mistakes without waiting for human intervention.
  • Simulate learning cycles: Use AI-driven tools to recreate real-world learning processes online, letting learners experience, reflect, and apply new approaches in practical scenarios.
Summarized by AI based on LinkedIn member posts
  • Karan Chandra Dey

    AI Product Builder | Designing & Shipping Full-Stack GenAI SaaS Products | Human-AI Interaction • RAG Systems • LLM Apps • UI/UX

    2,334 followers

    Excited to announce my new (free!) white paper: “Self-Improving LLM Architectures with Open Source” – the definitive guide to building AI systems that continuously learn and adapt. If you’re curious how Large Language Models can critique, refine, and upgrade themselves in real time using fully open source tools, this is the resource you’ve been waiting for.

    I’ve put together a comprehensive deep dive on:

    - Foundation Models (Llama 3, Mistral, Google Gemma, Falcon, MPT, etc.): How to pick the right LLM as your base and unlock reliable instruction-following and reasoning capabilities.
    - Orchestration & Workflow (LangChain, LangGraph, AutoGen): Turn your model into a self-improving machine with step-by-step self-critiques and automated revisions (a minimal sketch of such a loop follows this post).
    - Knowledge Storage (ChromaDB, Qdrant, Weaviate, Neo4j): Seamlessly integrate vector and graph databases to store semantic memories and advanced knowledge relationships.
    - Self-Critique & Reasoning (Chain-of-Thought, Reflexion, Constitutional AI): Empower LLMs to identify errors, refine outputs, and tackle complex reasoning by exploring multiple solution paths.
    - Evaluation & Feedback (LangSmith Evals, RAGAS, W&B): Monitor and measure performance continuously to guide the next cycle of improvements.
    - ML Algorithms & Fine-Tuning (PPO, DPO, LoRA, QLoRA): Transform feedback into targeted model updates for faster, more efficient improvements, without catastrophic forgetting.
    - Bias Amplification: Discover open source strategies for preventing unwanted biases from creeping in as your model continues to adapt.

    In this white paper, you’ll learn how to:

    - Architect a complete self-improvement workflow, from data ingestion to iterative fine-tuning.
    - Deploy at scale with optimized serving (vLLM, Triton, TGI) to handle real-world production needs.
    - Maintain alignment with human values and ensure continuous oversight to avoid rogue outputs.

    Ready to build the next generation of AI? Download the white paper for free and see how these open source frameworks come together to power unstoppable, ever-learning LLMs. Drop a comment below or send me a DM for the link! Let’s shape the future of AI, together. #AI #LLM #OpenSource #SelfImproving #MachineLearning #LangChain #Orchestration #VectorDatabases #GraphDatabases #SelfCritique #BiasMitigation #Innovation #aiagents
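The orchestration pattern this post describes (generate, self-critique, revise) fits in a few lines. Below is a minimal sketch, independent of the white paper's actual code: `llm` is any text-in/text-out callable you supply, for example a thin wrapper around a local Llama 3 or Mistral endpoint, and the prompts are illustrative.

```python
# Minimal generate -> critique -> revise loop (an illustrative sketch, not the
# white paper's implementation). `llm` is any text-in/text-out callable.
from typing import Callable

def self_improve(llm: Callable[[str], str], task: str, max_rounds: int = 3) -> str:
    answer = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_rounds):
        critique = llm(
            f"Task: {task}\nDraft answer: {answer}\n"
            "List concrete errors or weaknesses in the draft. Reply DONE if none."
        )
        if "DONE" in critique:
            break  # the self-critic found nothing left to fix
        answer = llm(
            f"Task: {task}\nDraft answer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer, fixing every point in the critique."
        )
    return answer
```

Frameworks like LangGraph or AutoGen mainly add state management, tool calls, and retries around this same generate/critique/revise skeleton.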

  • Aishwarya Naresh Reganti

    Founder & CEO @ LevelUp Labs | Ex-AWS | Consulting, Training & Investing in AI

    123,797 followers

    🤔 What if, instead of using prompts, you could fine-tune LLMs to incorporate self-feedback and improvement mechanisms more effectively?

    Self-feedback and improvement have been shown to be highly beneficial for LLMs and agents, allowing them to reflect on their behavior and reasoning and correct their mistakes as more computational resources or interactions become available. The authors note that commonly used test-time self-improvement methods, such as prompt tuning and few-shot learning, often fail to enable models to correct their mistakes in complex reasoning tasks.

    ⛳ The paper introduces RISE: Recursive Introspection, an approach that improves LLMs by teaching them to introspect on and iteratively improve their responses.

    ⛳ RISE leverages principles from online imitation learning and reinforcement learning to develop a self-improvement mechanism within LLMs. By treating each prompt as part of a multi-turn Markov decision process (MDP), RISE allows models to learn from their previous attempts and refine their answers over multiple turns, ultimately improving their problem-solving capabilities.

    ⛳ It models the fine-tuning process as a multi-turn MDP, where the initial state is the prompt and subsequent states incorporate recursive improvements.

    ⛳ It employs a reward-weighted regression (RWR) objective to learn from both high- and low-quality rollouts, enabling models to improve over turns (a sketch of this weighting follows this post). The approach uses data generated by the learner itself, or by more capable models, to supervise improvements iteratively.

    RISE significantly improves the performance of LLMs like LLaMa2, LLaMa3, and Mistral on math reasoning tasks, outperforming single-turn strategies with the same computational resources. Link: https://lnkd.in/e2JDQr8M
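Based on the post's description, RWR weights each rollout's likelihood by an exponentiated reward. Below is a minimal sketch of that weighting in PyTorch; the tensor layout and the `beta` temperature are my assumptions for illustration, not details from the RISE paper.

```python
# Minimal reward-weighted regression (RWR) loss sketch (illustrative, not the
# RISE authors' code): rollouts with higher reward pull the model harder.
import torch

def rwr_loss(token_logprobs: torch.Tensor,  # (batch, seq) log-prob per token
             mask: torch.Tensor,            # (batch, seq) 1 on response tokens
             rewards: torch.Tensor,         # (batch,) scalar reward per rollout
             beta: float = 1.0) -> torch.Tensor:
    weights = torch.exp(rewards / beta)                 # exponentiated rewards
    weights = weights / weights.sum()                   # normalize over batch
    seq_logprob = (token_logprobs * mask).sum(dim=-1)   # per-rollout log-likelihood
    return -(weights * seq_logprob).sum()               # weighted NLL to minimize
```

Both high- and low-reward rollouts contribute, but the exponential weighting means the model regresses mostly toward its better attempts, which is the mechanism the post attributes to RISE's turn-over-turn improvement.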

  • Karen Kim

    CEO @ Human Managed, the AI Service Platform for Cyber, Risk, and Digital Ops.

    5,892 followers

    User Feedback Loops: the missing piece in AI success?

    AI is only as good as the data it learns from -- but what happens after deployment? Many businesses focus on building AI products but miss a critical step: ensuring their outputs continue to improve with real-world use. Without a structured feedback loop, AI risks stagnating, delivering outdated insights, or losing relevance quickly.

    Instead of treating AI as a one-and-done solution, companies need workflows that continuously refine and adapt based on actual usage. That means capturing how users interact with AI outputs, where the AI succeeds, and where it fails.

    At Human Managed, we’ve embedded real-time feedback loops into our products, allowing customers to rate and review AI-generated intelligence. Users can flag insights as:
    🔘 Irrelevant
    🔘 Inaccurate
    🔘 Not Useful
    🔘 Others

    Every input is fed back into our system to fine-tune recommendations, improve accuracy, and enhance relevance over time (a generic sketch of one way to capture such events follows this post). This is more than a quality check -- it’s a competitive advantage.
    - For CEOs & Product Leaders: AI-powered services that evolve with user behavior create stickier, high-retention experiences.
    - For Data Leaders: Dynamic feedback loops ensure AI systems stay aligned with shifting business realities.
    - For Cybersecurity & Compliance Teams: User validation enhances AI-driven threat detection, reducing false positives and improving response accuracy.

    An AI model that never learns from its users is already outdated. The best AI isn’t just trained -- it continuously evolves.
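Human Managed's internal schema isn't public, so the snippet below is only a generic sketch of what capturing such flags might look like; all names (`FeedbackEvent`, `record_feedback`) are hypothetical.

```python
# Generic sketch of a user-feedback capture record (hypothetical names, not
# Human Managed's schema). Flagged outputs are stored with enough context to
# be replayed later as fine-tuning or evaluation data.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

FLAGS = {"irrelevant", "inaccurate", "not_useful", "other"}

@dataclass
class FeedbackEvent:
    output_id: str          # which AI-generated insight was rated
    user_id: str
    flag: str               # one of FLAGS
    comment: str = ""       # optional free-text correction
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        if self.flag not in FLAGS:
            raise ValueError(f"unknown flag: {self.flag}")

feedback_log: List[FeedbackEvent] = []

def record_feedback(event: FeedbackEvent) -> None:
    """Append to the log; a batch job can later turn it into training data."""
    feedback_log.append(event)
```

The important design point is the loop closure: the same record a user files must be routable back into retraining or evaluation, otherwise the "feedback" is just a support ticket.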

  • Rudina Seseri

    Venture Capital | Technology | Board Director

    20,450 followers

    For years, fine-tuning LLMs has required large amounts of data and human oversight. Small improvements can disrupt existing systems, requiring humans to go through and flag errors in order to fit the model to pre-existing workflows. This might work for smaller use cases, but it is clearly unsustainable at scale. However, recent research suggests that everything may be about to change.

    I have been particularly excited about two papers from Anthropic and the Massachusetts Institute of Technology, which propose new methods that enable LLMs to reflect on their own outputs and refine performance without waiting for humans. Instead of passively waiting for correction, these models create an internal feedback loop, learning from their own reasoning in a way that could match, or even exceed, traditional supervised training in certain tasks.

    If these approaches mature, they could fundamentally reshape enterprise AI adoption. From chatbots that continually adjust their tone to better serve customers to research assistants that independently refine complex analyses, the potential applications are vast. In today’s AI Atlas, I explore how these breakthroughs work, where they could make the most immediate impact, and what limitations we still need to overcome.

  • Antonina Panchenko

    Learning Experience Designer | Learning & Development Consultant | Instructional Designer

    13,858 followers

    Many people believe live trainings work better simply because people can talk to each other face-to-face, but that’s not the real reason. In reality, their effectiveness comes from something else entirely: they naturally follow a powerful learning rhythm.

    Great offline trainings follow one simple logic: action → reflection → understanding → application. This is Kolb’s Cycle. And it’s incredibly powerful.

    The problem? It was almost impossible to implement in online learning. That’s why 90% of online courses look like “interactive lectures”: nice slides, videos, quizzes. But that’s content consumption, not transformation.

    And now, the unexpected twist: for the first time, online learning has caught up with offline experiences, because AI removed the main barrier. It finally allows learners to get experience, reflection, and practice in a personalized way.

    Here’s how Kolb’s Cycle looks in modern learning design:

    1️⃣ Concrete Experience — action
    Essence: the learner must do something, live through a situation, face a task — ideally experiencing difficulty or making a mistake that shows their current model doesn’t work.
    How online: role-based dialogue, scenario simulation.

    2️⃣ Reflective Observation — reflection
    Essence: pause and think about what happened, what actions were taken, and why the result turned out this way.
    How online: interactive reflection prompts; an AI coach provides feedback based on performance and the learner’s own reflections.

    3️⃣ Abstract Conceptualisation — understanding
    Essence: form a new behavioural model — concepts, principles, and algorithms that explain how to act more effectively.
    How online: short video lecture, model breakdown, interactive frameworks, checklists, interactive infographics.

    4️⃣ Active Experimentation — application
    Essence: try the new model in a safe environment and observe the result.
    How online: AI-based simulation, situational exercise, case-solving with the new approach; an AI coach supports and adjusts.

    The outcome? Online learning stops being “content” and becomes a behaviour tracker. A course becomes a training simulator, not a film. Kolb’s Cycle finally becomes real in digital learning.

    Do you use this framework? What results have you seen?

  • Pascal Biese

    AI Lead at PwC </> Daily AI highlights for 80k+ experts 📲🤗

    85,068 followers

    This could be huge: researchers from MIT can now predict AI degradation. A new paper from the Massachusetts Institute of Technology reveals that when teaching AI models new tasks, the method we use matters far more than we thought - not just for learning the new skill, but for preserving everything the model already knows.

    The researchers compared two common approaches: supervised fine-tuning (SFT), where you show the model exactly what to do, and reinforcement learning (RL), where the model learns through trial and feedback. Even when both methods achieve identical performance on a new task, RL-trained models retain dramatically more of their original capabilities. In one experiment, models fine-tuned with RL maintained 94% similarity to their original representations, while SFT models dropped to just 56%.

    The key insight is simple: it's not about the training method itself, but about how far the model moves from its starting point. The researchers identified that KL divergence - essentially, how different the model's outputs become - predicts forgetting with 96% accuracy in controlled settings (a sketch of this measurement follows this post). RL naturally stays closer to the original model because it samples from the model's own distribution and shifts it gradually, rather than pulling it toward an arbitrary external target the way SFT does.

    They call this principle "RL's Razor": among all ways to solve a new task, RL prefers those closest to the original model. They validated it across language models and robotic systems, showing consistent results regardless of scale or domain.

    This matters because as our AI ecosystem matures, we will increasingly want AI to continuously adapt and learn. Whether it's a customer service bot learning about new products or a medical AI incorporating the latest research, the ability to learn without forgetting is crucial.

    For organizations deploying AI, this means reconsidering your fine-tuning strategies: the extra computational cost of RL might be worth it if you need models that can evolve without degrading. And for researchers, it opens a new design principle: future training methods should explicitly minimize distributional shift, not just maximize task performance.

    ↓ 𝐖𝐚𝐧𝐭 𝐭𝐨 𝐤𝐞𝐞𝐩 𝐮𝐩? Join my newsletter with 50k+ readers and be the first to learn about the latest AI research: llmwatch.com 💡
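The post doesn't give the paper's exact measurement protocol, so the following is only a sketch of the general idea: estimate the KL divergence between a base and a fine-tuned model's next-token distributions on held-out prompts. It assumes Hugging Face-style models whose forward pass returns `.logits`; everything else is standard PyTorch.

```python
# Sketch: mean next-token KL(base || finetuned) as a forgetting proxy.
# Illustrative only; the MIT paper's actual protocol may differ.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_next_token_kl(base_model, tuned_model, input_ids: torch.Tensor) -> float:
    """input_ids: (batch, seq) token ids from held-out prompts. Both models
    are assumed to return logits of shape (batch, seq, vocab)."""
    p = F.softmax(base_model(input_ids).logits, dim=-1)           # reference dist
    log_q = F.log_softmax(tuned_model(input_ids).logits, dim=-1)  # shifted dist
    kl = F.kl_div(log_q, p, reduction="none").sum(-1)             # KL per position
    return kl.mean().item()  # larger values predict more forgetting
```

Tracking a number like this during fine-tuning is the actionable takeaway: if the distributional shift stays small, the "RL's Razor" result predicts the model keeps more of what it knew.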

  • Nathan Lambert

    Building Open Language Models @ Allen Institute for AI

    33,975 followers

    First draft online version of The RLHF Book is DONE. Recently I've been creating the advanced discussion chapters on everything from Constitutional AI to evaluation and character training, but I also sneak in consistent improvements to the RL-specific chapter. https://rlhfbook.com/

    RLHF has a long future ahead of it, and this will do a lot to make it more accessible to the next generation.

    What's next: getting a physical copy in your hands (may not be exactly 1-to-1, we'll see) and minor fixes at a slower cadence (thanks to many GitHub contributors; some of you will get a copy from me).

    Here are all the chapters:

    1. Introduction: Overview of RLHF and what this book provides.
    2. Seminal (Recent) Works: Key models and papers in the history of RLHF techniques.
    3. Definitions: Mathematical definitions for RL, language modeling, and other ML techniques leveraged in this book.
    4. RLHF Training Overview: How the training objective for RLHF is designed and the basics of understanding it (the standard form is sketched after this list).
    5. What are preferences?: Why human preference data is needed to fuel and understand RLHF.
    6. Preference Data: How preference data is collected for RLHF.
    7. Reward Modeling: Training reward models from preference data that act as an optimization target for RL training (or for use in data filtering).
    8. Regularization: Tools to constrain these optimization tools to effective regions of the parameter space.
    9. Instruction Tuning: Adapting language models to the question-answer format.
    10. Rejection Sampling: A basic technique for using a reward model with instruction tuning to align models.
    11. Policy Gradients: The core RL techniques used to optimize reward models (and other signals) throughout RLHF.
    12. Direct Alignment Algorithms: Algorithms that optimize the RLHF objective directly from pairwise preference data rather than learning a reward model first.
    13. Constitutional AI and AI Feedback: How AI feedback data and specific models designed to simulate human preference ratings work.
    14. Reasoning and Reinforcement Finetuning: The role of new RL training methods for inference-time scaling with respect to post-training and RLHF.
    15. Synthetic Data: The shift away from human to synthetic data and how distilling from other models is used.
    16. Evaluation: The ever-evolving role of evaluation (and prompting) in language models.
    17. Over-optimization: Qualitative observations of why RLHF goes wrong and why over-optimization is inevitable with a soft optimization target in reward models.
    18. Style and Information: How RLHF is often underestimated in its role in improving the user experience of models, due to the crucial role that style plays in information sharing.
    19. Product, UX, Character: How RLHF is shifting in its applicability as major AI laboratories use it to subtly match their models to their products.
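For readers new to the field, the objective that chapters 4, 8, and 11 revolve around is the well-known KL-regularized reward maximization; the formulation below is the standard one from the literature, paraphrased rather than quoted from the book.

```latex
% Standard KL-regularized RLHF objective: maximize reward under the learned
% reward model while penalizing drift from the reference (SFT) policy.
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\left[ r_\phi(x, y) \right]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
```

Direct alignment algorithms (chapter 12) optimize this same objective in closed form from preference pairs, skipping the explicit reward model.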

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,610 followers

    One of the biggest barriers to deploying LLM-based agents in real workflows is their poor performance on long-horizon reasoning. Agents often generate coherent short responses but struggle when a task requires planning, tool use, or multi-step decision-making. The issue is not just accuracy at the end, but the inability to reason through the middle. Without knowing which intermediate steps helped or hurt, agents cannot learn to improve. This makes long-horizon reasoning one of the hardest and most unsolved problems for LLM generalization.

    It is relatively easy for a model to retrieve a document, answer a factual question, or summarize a short email. It is much harder to solve a billing dispute that requires searching, interpreting policy rules, applying edge cases, and adjusting the recommendation based on prior steps. Today’s agents can generate answers, but they often fail to reflect, backtrack, or reconsider earlier assumptions.

    A new paper from Google DeepMind and Stanford addresses this gap with a method called SWiRL: Step-Wise Reinforcement Learning. Rather than training a model to get the final answer right, SWiRL trains the model to improve each step in a reasoning chain. It does this by generating synthetic multi-step problem-solving traces, scoring every individual step using a reward model (Gemini 1.5 Pro), and fine-tuning the base model to favor higher-quality intermediate steps (a sketch of this filtering follows this post).

    This approach fundamentally changes the way we train reasoning agents. Instead of optimizing for final outcomes, the model is updated based on how good each reasoning step was in context. For example, if the model generates a search query or a math step that is useful, that step is rewarded and reinforced even if the final answer is wrong. Over time, the agent learns not just to answer, but to reason more reliably. This is a major departure from standard RLHF, which only gives feedback at the end.

    SWiRL improves performance by 9.2 percent on HotPotQA, 16.9 percent on GSM8K when trained on HotPotQA, and 11 to 15 percent on other multi-hop and math datasets like MuSiQue, BeerQA, and CofCA. It generalizes across domains, works without golden labels, and outperforms both supervised fine-tuning and single-step RL methods.

    The implications are substantial: we can now train models to reason better by scoring and optimizing their intermediate steps. Better reward models, iterative reflection, tool-assisted reasoning, and trajectory-level training will lead to more robust performance in multi-step tasks. This is not about mere performance improvement. It shows how we can begin to train agents not to mimic outputs, but to improve the quality of their thought process. That’s essential if we want to build agents that work through problems, adapt to new tasks, and operate autonomously in open-ended environments.
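The post describes SWiRL's recipe at a high level. The sketch below shows the step-wise scoring-and-filtering idea; `generate_trace` and `score_step` are hypothetical stand-ins for the synthetic-trace generator and the reward model (Gemini 1.5 Pro in the paper), and the threshold rule is a simplification, not the paper's exact procedure.

```python
# Sketch of step-wise filtering for process supervision, in the spirit of
# SWiRL as described above (hypothetical helpers, not the paper's code).
from typing import Callable, List, Tuple

Step = str          # one reasoning step, search query, or tool call
Trace = List[Step]

def collect_good_steps(
    tasks: List[str],
    generate_trace: Callable[[str], Trace],           # model rollout -> steps
    score_step: Callable[[str, Trace, Step], float],  # reward model per step
    threshold: float = 0.5,
) -> List[Tuple[str, Trace, Step]]:
    """Keep (task, context, step) triples whose step scores well, even when
    the final answer is wrong; these become fine-tuning targets."""
    kept = []
    for task in tasks:
        trace = generate_trace(task)
        for i, step in enumerate(trace):
            context = trace[:i]  # the steps taken so far
            if score_step(task, context, step) >= threshold:
                kept.append((task, context, step))
    return kept
```

The contrast with outcome-only RLHF is visible in the inner loop: credit is assigned per step in context, so a useful intermediate search query is kept even when the rollout's final answer fails.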

  • Maxime Labonne

    Head of Post-Training @ Liquid AI

    68,276 followers

    🔁 Self-Distillation for RL: LLMs as their own teacher

    Current RLVR methods like GRPO learn from binary rewards (pass/fail), which creates a bottleneck: the model gets no signal about why it failed. This paper proposes SDPO (Self-Distillation Policy Optimization), which leverages rich textual feedback (runtime errors, failed tests) and the model's own in-context learning ability to provide dense credit assignment without any external teacher.

    → The core trick is elegant: show the model its own mistake + the error message, then ask it to re-score its original answer. Where the "informed" model disagrees with the original predictions, that's where the bug likely is. No extra models needed. (A rough sketch of this re-scoring follows this post.)

    → Even without rich feedback, SDPO works by using successful rollouts from the same batch as "feedback" for failed attempts. This lets the self-teacher identify what the student should have done differently, and allows SDPO to reach GRPO's accuracy 10x faster with 7x shorter responses (no "wait...").

    → On LiveCodeBench v6, SDPO reaches 48.8% vs GRPO's 41.2%, surpassing Claude Sonnet 4 (40.5%). The gains scale with model size: larger models are better in-context learners, so the self-teacher's retrospection becomes more accurate. On weak models (Qwen2.5-1.5B), SDPO can actually underperform GRPO. 🥲

    Unlike GRPO, which needs a success before learning starts, SDPO learns from failures immediately. This is a super interesting pivot from "traditional" RLVR with better credit assignment. I'm curious to see results with math, which is weirdly skipped in this paper.
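The paper's exact objective isn't given in the post, so the following is only a rough sketch of the re-scoring trick as described: score the same failed answer with and without the error feedback in context, and treat the disagreement as per-token credit. `token_logprobs` is a hypothetical helper, and the prompt format is invented for illustration.

```python
# Rough sketch of the self-teacher trick described above (not SDPO's actual
# code): the same failed answer is scored twice, and tokens the "informed"
# model now finds less likely are treated as the suspect ones.
from typing import Callable, List

def self_teacher_credit(
    token_logprobs: Callable[[str, str], List[float]],  # (context, answer) -> log p per token
    task: str,
    failed_answer: str,
    error_feedback: str,   # e.g. a runtime error or failed-test output
) -> List[float]:
    plain_ctx = task
    informed_ctx = (
        f"{task}\nYour previous answer:\n{failed_answer}\n"
        f"It failed with:\n{error_feedback}\nRe-evaluate that answer."
    )
    lp_plain = token_logprobs(plain_ctx, failed_answer)
    lp_informed = token_logprobs(informed_ctx, failed_answer)
    # Negative deltas mark tokens the informed model disavows; these dense
    # per-token signals can then drive the policy update.
    return [li - lp for li, lp in zip(lp_informed, lp_plain)]
```

This is the "no extra models needed" point: teacher and student are the same network, differing only in what sits in the context window.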

  • Reuven Cohen

    ♾️ Agentic Engineer / CAiO @ Cognitum One

    60,854 followers

    ♾️ People often ask how I’m building machine learning systems without neural networks. The answer is in recursive feedback loops.

    Instead of stacking layers of weights, I use a Q-table: a structured grid that learns through experience. Each row represents a state, each column represents a possible action, and each cell holds a value showing how effective that action has been in that situation. The system continuously updates these values after every interaction: good results increase the value, poor results reduce it. Over time, it builds a dynamic memory of cause and effect. (A minimal sketch of this update rule follows this post.)

    In AgentDB, this process runs through a high-speed OODA feedback loop: Observe, Orient, Decide, Act. Each cycle refines the system’s understanding and accelerates convergence toward better decisions. By hyper-optimizing these loops, I can make decisions in milliseconds that would take traditional neural networks or large language models hundreds or even thousands of times longer.

    This difference isn’t just speed; it changes what’s possible. Real-time decisions, adaptive behavior, and instantaneous reinforcement become the default, not the exception. Paired with embeddings, the system recognizes patterns across similar states, enabling it to generalize intelligently.

    You can try it directly using npx agentdb, which creates a local reinforcement learning environment that evolves in real time. Intelligence here doesn’t come from scale but from precision, timing, and feedback.
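AgentDB's internals aren't shown in the post, but the description matches textbook tabular Q-learning, so here is a minimal sketch of that classic update rule. The epsilon-greedy action choice and the specific constants are standard defaults, not details from AgentDB.

```python
# Textbook tabular Q-learning: a minimal sketch of the Q-table idea described
# above, not AgentDB's actual implementation.
import random
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))  # Q[state][action] -> learned value
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1       # learning rate, discount, exploration

def choose_action(state, actions):
    """Epsilon-greedy: mostly exploit the best-known action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def update(state, action, reward, next_state, next_actions):
    """Good results raise Q[state][action]; poor results lower it."""
    best_next = max((Q[next_state][a] for a in next_actions), default=0.0)
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
```

Because the table is just a nested dictionary, each decision is a dictionary lookup plus a max, which is where the millisecond-scale latency claim comes from relative to a neural network forward pass.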
