CMT Strategies for Ongoing LLM Training


Summary

CMT strategies for ongoing LLM training refer to practical methods for continuously updating and improving large language models (LLMs) so they learn better reasoning, retain knowledge, and adapt to new challenges over time. These approaches move beyond basic prompting and fine-tuning, focusing on teaching models how to learn from their own outputs, develop new strategies, and avoid common pitfalls like factual errors.

  • Implement cumulative learning: Encourage the model to build a database of past problem-solving strategies and refine them as it encounters new challenges, improving its expertise with experience.
  • Encourage self-feedback: Teach the model to review and revise its own outputs, enabling it to identify and correct reasoning mistakes across multiple steps or turns.
  • Use diverse training data: Generate varied examples and study strategies based on existing knowledge so the model can gain a stronger grasp of facts without falling into over-reliance on standard fine-tuning.
Summarized by AI based on LinkedIn member posts

  • View profile for Aishwarya Srinivasan
    Aishwarya Srinivasan is an Influencer
    628,023 followers

    If you’re building LLM applications today, reasoning is where the real leverage lies. And yet, I see a lot of engineers still treating LLM outputs as a single-shot black box. LLMs can reason, but only if you give them the right scaffolding and the right post-training. Here’s a mental model I’ve been using to think about LLM reasoning methods (see chart below):

    ✅ Inference-time reasoning methods: techniques applied at inference time, without needing to retrain your model:
    → Tree of Thoughts (ToT): search through reasoning paths
    → Chain of Thought (CoT) prompting: prompt models to generate intermediate reasoning steps
    → Reasoning + Acting: use tools or function calls during reasoning
    → Self-feedback: prompt the model to critique and refine its own output
    → Episodic memory agents: maintain a memory buffer to improve multi-step reasoning
    → Self-consistency: sample multiple reasoning paths and select the most consistent answer (sketched in code below)

    ✅ Training-time enhancements: where things get really powerful is when you post-train your model to improve reasoning, using human annotation or policy optimization:
    → Use preference pairs and reward models to tune for better reasoning (RFT, Proximal Policy Optimization, KL regularization)
    → Apply RLHF, PPO + KL, rejection sampling + SFT, advantage estimation, and other advanced techniques to guide the model’s policy
    → Leverage multiple paths, offline trajectories, and expert demonstrations to expose the model to rich reasoning signals during training

    Here are my 2 cents 🫰 If you want production-grade LLM reasoning, you’ll need both:
    → Smart inference-time scaffolds to boost reasoning without slowing latency too much
    → Carefully tuned post-training loops to align the model’s policy with high-quality reasoning patterns
    → We’re also seeing increasing use of Direct Preference Optimization (DPO) and reference-free grading to further improve reasoning quality and stability.

    I’m seeing more and more teams combine both strategies, and the gap between "vanilla prompting" and "optimized reasoning loops" is only getting wider.

    〰️〰️〰️ Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
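    To make the inference-time side concrete, here is a minimal sketch of self-consistency, one of the methods listed above: sample several reasoning paths at nonzero temperature, then keep the majority answer. The `generate` stub and the "ANSWER:" convention are assumptions for illustration, not a specific vendor API.

    ```python
    from collections import Counter

    def generate(prompt: str, temperature: float = 0.8) -> str:
        """Placeholder for any chat/completion call (hypothetical stub)."""
        raise NotImplementedError("wire up your model client here")

    def self_consistency(question: str, n_samples: int = 5) -> str:
        # Sample multiple chain-of-thought paths, then majority-vote the answer.
        prompt = f"{question}\nThink step by step, then give the final answer after 'ANSWER:'."
        answers = []
        for _ in range(n_samples):
            completion = generate(prompt, temperature=0.8)
            # Keep only the final answer so differently worded reasoning still agrees.
            answers.append(completion.rsplit("ANSWER:", 1)[-1].strip())
        return Counter(answers).most_common(1)[0][0]
    ```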

  • View profile for Asankhaya Sharma

    Creator of OptiLLM and OpenEvolve | Founder of Patched.Codes (YC S24) & Securade.ai | Pioneering inference-time compute to improve LLM reasoning | PhD | Ex-Veracode, Microsoft, SourceClear | Professor & Author | Advisor

    7,263 followers

    🧠 We just implemented the "third paradigm" for LLM learning, and the results are promising.

    Most of us know that leading AI applications like ChatGPT, Claude, and Grok achieve their impressive performance partly through sophisticated system prompts containing detailed reasoning strategies and problem-solving frameworks. Yet most developers and researchers work with basic prompts, missing out on these performance gains.

    🚀 Introducing System Prompt Learning (SPL)
    Building on Andrej Karpathy's vision of a "third paradigm" for LLM learning, SPL enables models to automatically learn and improve problem-solving strategies through experience, rather than relying solely on pre-training or fine-tuning.

    ⚙️ How it works (a simplified sketch follows below):
    🔍 Automatically classifies incoming problems into 16 types
    📚 Builds a persistent database of effective solving strategies
    🎯 Selects the most relevant strategies for each new query
    📊 Evaluates strategy effectiveness and refines them over time
    👁️ Maintains human-readable, inspectable knowledge

    📈 Results across mathematical benchmarks:
    OptILLMBench: 61% → 65% (+4 points)
    MATH-500: 85% → 85.6% (+0.6 points)
    Arena Hard: 29% → 37.6% (+8.6 points)
    AIME24: 23.33% → 30% (+6.67 points)

    After just 500 training queries, our system developed 129 strategies, refined 97 existing ones, and achieved 346 successful problem resolutions.

    ✨ What makes this approach unique:
    🔄 Cumulative learning that improves over time
    📖 Transparent, human-readable strategies
    🔌 Works with any OpenAI-compatible API
    🔗 Can be combined with other optimization techniques
    ⚡ Operates in both inference and learning modes

    📝 Example learned strategy for word problems:
    1. Understand: read carefully, identify unknowns
    2. Plan: define variables, write equations
    3. Solve: step by step, with units
    4. Verify: check reasonableness

    This represents early progress toward AI systems that genuinely learn from experience in a transparent, interpretable way, moving beyond static models to adaptive systems that develop expertise through practice.

    🛠️ Implementation: SPL is available as an open-source plugin in optillm, our inference optimization proxy. Integration is as simple as adding the "spl-" prefix to your model name.

    The implications extend beyond current capabilities: imagine domain-specific expertise development, collaborative strategy sharing, and human expert contributions to AI reasoning frameworks.

    💭 What are your thoughts on LLMs learning from their own experience? Have you experimented with advanced system prompting in your work?

    #ArtificialIntelligence #MachineLearning #LLM #OpenSource #TechInnovation #ProblemSolving #AI #Research
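    The actual plugin lives in the optillm repo; as a hedge, here is only a minimal sketch of the cumulative-learning loop the post describes (classify, retrieve stored strategies, update effectiveness scores). The names `Strategy`, `select_strategies`, and `record_outcome` are invented for illustration, not the plugin's API.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Strategy:
        problem_type: str
        text: str            # human-readable strategy, inspectable by design
        successes: int = 0
        attempts: int = 0

        @property
        def score(self) -> float:
            return self.successes / self.attempts if self.attempts else 0.0

    strategy_db: list[Strategy] = []   # persistent store in the real system

    def select_strategies(problem_type: str, k: int = 3) -> list[Strategy]:
        # Pick the k most effective stored strategies for this problem type.
        candidates = [s for s in strategy_db if s.problem_type == problem_type]
        return sorted(candidates, key=lambda s: s.score, reverse=True)[:k]

    def record_outcome(strategy: Strategy, solved: bool) -> None:
        # Effectiveness feedback: strategies are refined or deprecated over time.
        strategy.attempts += 1
        if solved:
            strategy.successes += 1
    ```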

  • View profile for Max Buckley

    Head of Knowledge Research at Exa

    31,536 followers

    Fine-tuning for making expert, domain-specific models? Not so fast!

    I often get asked whether companies should fine-tune LLMs to internalize the knowledge required for their particular use case or domain. The answer I give is: probably not.

    There is research suggesting that large language models struggle to acquire new factual knowledge through fine-tuning. Novel knowledge is learned more slowly than knowledge consistent with what the model already knows. The same research also showed that when knowledge is eventually learned from novel examples, the model's tendency to hallucinate increases linearly. Ouch!

    So what can you do? What should you do?

    RAG is one approach, but it comes with complexity and its own challenges: RAG pipelines are more complex, with larger storage costs, higher memory and compute requirements (due to the longer contexts the retrieved material demands), and higher latency from querying an external index. In the long term, storing knowledge natively in the model's parameters may also provide generalization advantages, as the model can relate different pieces of knowledge within its parameters. This is particularly apparent for complex or indirect queries, where simple retrieval augmentation may fall short.

    A very exciting recent paper from Meta introduces a new approach called Active Reading. It leverages synthetic data: the LLM generates a range of diverse training data based on a closed body of knowledge. By having the LLM read and restructure the data in many varied ways, and then training on that enlarged, restructured corpus, you can significantly improve the model's retention of the contained facts.

    Active Reading applies the same principles observed in human studying, allowing the model itself to propose multiple study strategies (e.g., paraphrasing, knowledge linking, active recall) and instantiating these strategies on a document-by-document basis. The result is a highly diverse and contextually grounded training signal. (A simplified sketch of the idea follows below.)

    The authors demonstrate huge gains vs. vanilla fine-tuning: +313% and +160% relative improvement on SimpleQA and FinanceBench respectively. They also trained a SOTA 8B model for factual QA, demonstrating the utility of the technique at pre-training scale (1T tokens).

    It should be noted that the Active Reading paper focuses on knowledge acquisition; traditional fine-tuning can still be useful for instilling style, format, reasoning patterns, or other behaviors.

    Learning Facts at Scale with Active Reading: https://lnkd.in/e7FCAq-3
    Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? https://lnkd.in/e_REAVZB
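    A minimal sketch of the Active Reading idea as described above: prompt the model to propose study strategies for a document, then instantiate each strategy as synthetic training text. The prompt wording and the `generate` stub are assumptions for illustration, not the paper's exact recipe.

    ```python
    def generate(prompt: str) -> str:
        """Placeholder for any LLM call (hypothetical stub)."""
        raise NotImplementedError("wire up your model client here")

    def active_reading_examples(document: str) -> list[str]:
        # 1. Let the model propose study strategies for this specific document.
        strategies = generate(
            "List 5 study strategies (e.g. paraphrasing, active recall, "
            f"knowledge linking) suited to this document:\n{document}"
        ).splitlines()
        # 2. Instantiate each strategy as restructured training text.
        corpus = []
        for strategy in strategies:
            corpus.append(generate(
                "Apply this study strategy to the document and write the result.\n"
                f"Strategy: {strategy}\nDocument:\n{document}"
            ))
        return corpus  # fine-tune on this enlarged, diversified corpus
    ```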

  • View profile for Maxime Labonne

    Head of Post-Training @ Liquid AI

    68,274 followers

    🎓 On-Policy Distillation

    New blog post from Thinking Machines Lab discussing on-policy distillation for LLM post-training. It consists of training a student model on its own generated outputs with dense token-level feedback from a teacher, combining RL's on-policy relevance with distillation's compute efficiency.

    → RL provides sparse feedback that doesn't pinpoint where mistakes happened; traditional distillation, on the other hand, trains on teacher trajectories the student will rarely encounter. This work combines the two by having the teacher grade each token the student actually generates. (A minimal sketch of the loss follows below.)

    → The method achieves performance similar to RL with dramatically fewer training steps, because distillation shortcuts the expensive search process.

    → On-policy distillation naturally handles continual learning better than SFT, because the student converges toward a fixed teacher's behavior without the drift that happens when training on finite batches of its own outputs.

    → The experiments show you can match teacher performance on math reasoning (AIME problems) by training on a single prompt for 20 consecutive steps. This is extreme overfitting by normal standards, but the dense supervision extracts the semantic strategy efficiently.

    My take: on-policy distillation is not new, but this blog post has a lot of interesting results and an elegant framework to introduce it. They also provide end-to-end examples with models, parameters, and datasets for reproducibility. It's a great read if you're interested in post-training.
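    A minimal sketch of the core idea under stated assumptions: the student samples its own completion, and both models score every generated token, giving a per-token reverse KL loss toward the teacher. This is illustrative PyTorch assuming Hugging Face-style causal LMs; real implementations batch, mask, and schedule this differently.

    ```python
    import torch
    import torch.nn.functional as F

    def on_policy_distill_step(student, teacher, prompt_ids, optimizer):
        """One on-policy distillation step (HF-style causal LMs assumed)."""
        prompt_len = prompt_ids.shape[1]
        # 1. On-policy: the student samples its own trajectory.
        with torch.no_grad():
            seq = student.generate(prompt_ids, max_new_tokens=128, do_sample=True)
        # 2. Dense feedback: full next-token distributions from both models.
        s_logp = F.log_softmax(student(seq).logits, dim=-1)
        with torch.no_grad():
            t_logp = F.log_softmax(teacher(seq).logits, dim=-1)
        # 3. Per-token reverse KL, restricted to positions the student generated,
        #    so the teacher grades the student's actual tokens one by one.
        kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)[:, prompt_len - 1 : -1]
        loss = kl.mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```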

  • View profile for Aishwarya Naresh Reganti

    Founder & CEO @ LevelUp Labs | Ex-AWS | Consulting, Training & Investing in AI

    123,793 followers

    🤔 What if, instead of relying on prompts, you could fine-tune LLMs to incorporate self-feedback and improvement mechanisms more effectively?

    Self-feedback and improvement have been shown to be highly beneficial for LLMs and agents, allowing them to reflect on their behavior and reasoning and to correct their mistakes as more computational resources or interactions become available. The authors note that the test-time methods commonly used for self-improvement, like prompt tuning and few-shot learning, often fail to enable models to correct their mistakes in complex reasoning tasks.

    ⛳ The paper introduces RISE (Recursive Introspection), an approach that improves LLMs by teaching them to introspect on and iteratively refine their responses.

    ⛳ RISE leverages principles from online imitation learning and reinforcement learning to develop a self-improvement mechanism within LLMs. By treating each prompt as part of a multi-turn Markov decision process (MDP), RISE allows models to learn from their previous attempts and refine their answers over multiple turns, ultimately improving their problem-solving capabilities.

    ⛳ It models the fine-tuning process as a multi-turn MDP, where the initial state is the prompt and subsequent states involve recursive improvements.

    ⛳ It employs a reward-weighted regression (RWR) objective to learn from both high- and low-quality rollouts, enabling models to improve over turns. (A minimal sketch of the RWR idea follows below.) The approach uses data generated by the learner itself, or by more capable models, to supervise improvements iteratively.

    RISE significantly improves the performance of LLMs like Llama 2, Llama 3, and Mistral on math reasoning tasks, outperforming single-turn strategies given the same computational resources.

    Link: https://lnkd.in/e2JDQr8M
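    A minimal sketch of the reward-weighted regression idea RISE builds on: weight each rollout's imitation loss by its exponentiated reward, so both high- and low-quality multi-turn attempts contribute, the better ones more strongly. The scalar-reward setup and `log_prob_fn` callable are illustrative assumptions, not the paper's exact objective.

    ```python
    import math

    def rwr_weight(reward: float, beta: float = 1.0) -> float:
        # Reward-weighted regression: exp(reward / beta) scales the imitation
        # loss, so low-quality rollouts still teach, just with less weight.
        return math.exp(reward / beta)

    def rwr_loss(rollouts, log_prob_fn, beta: float = 1.0) -> float:
        """rollouts: (prompt, response, reward) triples generated by the
        learner itself or by a stronger model; log_prob_fn scores a response
        under the current model (hypothetical callable)."""
        total = 0.0
        for prompt, response, reward in rollouts:
            total -= rwr_weight(reward, beta) * log_prob_fn(prompt, response)
        return total / len(rollouts)
    ```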

  • View profile for Ashu Garg

    Enterprise VC-engineer-company builder. Early investor in @databricks, @tubi and 6 other unicorns - @cohesity, @eightfold, @turing, @anyscale, @alation, @amperity, | GP@Foundation Capital

    42,146 followers

    Investment is moving from pre-training to post-training.

    Early LLM budgets focused on training ever-larger base models. Post-deployment, models were mostly fixed; improvements came via scaling next-gen models. Now labs split resources between pre-training and RL fine-tuning, with careful inference optimization. In post-training, models learn reasoning, follow preferences, and refine outputs for specific tasks. Models update more frequently, weekly or monthly vs. yearly. Each iteration collects new data, extends RL training, and adjusts rewards to improve reasoning. New economic models charge for updated reasoning engines vs. static models. Serving costs and latency are key, so providers favor efficient pro models with the most accuracy at lower compute.

    3 more takeaways from ICML:

    1️⃣ RL is only beginning to cross into "unverifiable" domains. Traditional RL used tasks with clear automatic checks (like code compilers or math calculations) for rewards. Domains like Math Olympiad proofs or legal arguments have complex solutions that can't be auto-verified. These need richer reward models scoring reasoning steps, argument clarity, or persuasiveness rather than just correctness. With solid reward models, RL could teach coherent proofs, experiment design, or legal strategies. This is still early work: current systems often break tasks into verifiable parts or use human preferences, but broader paths are emerging. (A minimal preference-based reward-model sketch follows below.)

    2️⃣ Personalization is a safer near-term goal than true continual learning. Interest in adapting models to users is growing, with two key approaches. Personalization tailors outputs by learning user-specific rewards from a few pieces of feedback, or by adding curiosity rewards that prompt questions about user tastes. These adjust behavior without changing model weights, improving perceived helpfulness and empathy. Continual learning updates parameters in real time from all user input, but it poses safety and privacy risks: overfitting, bias amplification, and data leakage. Personalization and context windows (remembering interactions without weight changes) are more practical and responsible.

    3️⃣ RL scaling will follow multiple paths.
    → One path applies current RL methods (PPO, guided reward models, diffusion regularization) to more domains. This incremental path needs no new algorithms, just better rewards, bigger and more varied datasets, and optimization tweaks.
    → The second tackles sparse-reward problems with long feedback delays. Success depends on credit assignment over long sequences and off-policy learning from varied experiences. Off-policy RL supports multi-datacenter training, with clusters handling acting, collection, and learning.
    → The third involves continual learning with frequent updates using user feedback and new data.
    Each has trade-offs: incremental and safe; risky but domain-expanding; adaptive but complex, with safety issues.

    What'd I miss?
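    For the "unverifiable domains" point, the usual building block is a preference-based reward model: instead of checking correctness automatically, train a scorer from pairwise human judgments. A minimal Bradley-Terry-style loss sketch in PyTorch, where `reward_model` is a hypothetical module mapping tokenized text to a scalar score per example:

    ```python
    import torch.nn.functional as F

    def preference_loss(reward_model, chosen_ids, rejected_ids):
        # Bradley-Terry objective: the preferred response should outscore the
        # rejected one. No automatic verifier needed, only human preference pairs.
        r_chosen = reward_model(chosen_ids)       # shape: [batch]
        r_rejected = reward_model(rejected_ids)   # shape: [batch]
        return -F.logsigmoid(r_chosen - r_rejected).mean()
    ```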

  • View profile for Ankit Agarwal

    Founder | CEO | Gen AI Board Advisor | Investor | Ex-Amazon

    16,893 followers

    𝗗𝗲𝗲𝗽 𝗗𝗶𝘃𝗲 𝗶𝗻𝘁𝗼 𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀

    A very enlightening survey authored by a team of researchers specializing in computer vision and NLP. It underscores that pretraining, while fundamental, only sets the stage for LLM capabilities. The paper then highlights 𝗽𝗼𝘀𝘁-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗺𝗲𝗰𝗵𝗮𝗻𝗶𝘀𝗺𝘀 (𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴, 𝗿𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴, 𝗮𝗻𝗱 𝘁𝗲𝘀𝘁-𝘁𝗶𝗺𝗲 𝘀𝗰𝗮𝗹𝗶𝗻𝗴) as the real game-changer for aligning LLMs with complex real-world needs.

    It offers:
    ◼️ A structured taxonomy of post-training techniques
    ◼️ Guidance on challenges such as hallucinations, catastrophic forgetting, reward hacking, and ethics
    ◼️ Future directions in model alignment and scalable adaptation

    In essence, it's a playbook for making LLMs truly robust and user-centric.

    𝗞𝗲𝘆 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆𝘀

    𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗶𝗻𝗴 𝗕𝗲𝘆𝗼𝗻𝗱 𝗩𝗮𝗻𝗶𝗹𝗹𝗮 𝗠𝗼𝗱𝗲𝗹𝘀
    While raw pretrained LLMs capture broad linguistic patterns, they may lack domain expertise or the ability to follow instructions precisely. Targeted fine-tuning methods, like instruction tuning and chain-of-thought tuning, unlock more specialized, high-accuracy performance for tasks ranging from creative writing to medical diagnostics.

    𝗥𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗳𝗼𝗿 𝗔𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁
    The authors show how RL-based methods (e.g., RLHF, DPO, GRPO) turn human or AI feedback into structured reward signals, nudging LLMs toward higher-quality, less toxic, or more logically sound outputs. This structured approach helps mitigate "hallucinations" and ensures models better reflect human values or domain-specific best practices. (A minimal DPO loss sketch follows below.)

    ⭐ 𝗜𝗻𝘁𝗲𝗿𝗲𝘀𝘁𝗶𝗻𝗴 𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀
    ◾ 𝗥𝗲𝘄𝗮𝗿𝗱 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴 𝗜𝘀 𝗞𝗲𝘆: Rather than using absolute numerical scores, ranking-based feedback (e.g., pairwise preferences or partial ordering of responses) often gives LLMs a crisper, more nuanced way to learn from human annotations. Process vs. outcome rewards: it's not just about the final answer; rewarding each step in a chain of thought fosters transparency and better explainability.
    ◾ 𝗠𝘂𝗹𝘁𝗶-𝗦𝘁𝗮𝗴𝗲 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴: The paper discusses iterative techniques that combine RL, supervised fine-tuning, and model distillation. This multi-stage approach lets a single strong "teacher" model pass its refined skills on to smaller, more efficient architectures, democratizing advanced capabilities without requiring massive compute.
    ◾ 𝗣𝘂𝗯𝗹𝗶𝗰 𝗥𝗲𝗽𝗼𝘀𝗶𝘁𝗼𝗿𝘆: The authors maintain a GitHub repo tracking the rapid developments in LLM post-training, great for staying up to date on the latest papers and benchmarks.

    Source: https://lnkd.in/gTKW4Jdh

    ☃ To continue getting such interesting Generative AI content/updates: https://lnkd.in/gXHP-9cW

    #GenAI #LLM #AI RealAIzation
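    Since the survey highlights DPO among the RL-based alignment methods, here is a minimal sketch of the standard published DPO loss over preference pairs; it assumes the per-response log-probabilities (policy and frozen reference) are precomputed tensors, and it mirrors the formula rather than any particular library's code.

    ```python
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        # DPO: widen the policy's margin for the chosen response relative to
        # a frozen reference model; beta controls the implicit KL constraint.
        chosen_ratio = policy_chosen_logp - ref_chosen_logp
        rejected_ratio = policy_rejected_logp - ref_rejected_logp
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
    ```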

  • View profile for Bala Selvam

    I make my own rules 100% of the time

    8,690 followers

    After about a year and a half working with LLMs, I've picked up a few tips on how to turn a commercial LLM into your in-house expert. My six-step playbook is below:

    1️⃣ Pick the lightest customization that does the job:
    • Retrieval-Augmented Generation keeps the base model frozen and pipes in your own documents at run time.
    • Fine-tuning bakes stable expertise directly into the weights.
    • Hybrid approaches freeze what rarely changes and retrieve what does.

    2️⃣ Obsess over data quality: Clean, permission-cleared text matters more than GPU hours. Redact PII, keep training chunks under two thousand tokens, and label a handful of gold-standard examples for every task.

    3️⃣ Choose a training method that matches your budget: a full fine-tune for "mission-critical or bust," Low-Rank Adaptation (LoRA) when you have one GPU and a deadline (see the sketch after this list), instruction tuning for conversational agents, and reinforcement learning if safety and tone need tight control.

    4️⃣ Stand up an evaluation pipeline before launch: Automated test suites (DeepEval, RAGAs, MLflow Evaluate) score every new checkpoint for accuracy, relevance, bias, and hallucination. Treat prompts like code: unit-test them nightly.

    5️⃣ Build guardrails in, not on: Add content filters, prompt-injection shields, and telemetry hooks that log inputs, outputs, and confidence scores. Compliance teams sleep better when monitoring is automatic.

    6️⃣ Iterate in production: Canary releases send five percent of traffic to the new model and compare KPIs. Active-learning loops capture low-confidence answers and route them back into the next training batch. Schedule quarterly refreshes so improvement is routine, not heroic.

    Key takeaway: start with data and evaluation, layer on the lightest customization path that meets your accuracy bar, and measure everything. Do that, and your "off-the-shelf" LLM will start speaking your organization's language in record time.

    What's your go-to tactic for customizing large language models? Drop it below so we can all learn faster. Thoughts?
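    For step 3, a minimal sketch of the LoRA path using the widely used Hugging Face peft API. The model name and hyperparameters are illustrative placeholders, not recommendations:

    ```python
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base checkpoint; swap in whatever model you are licensed to tune.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    # LoRA trains small low-rank adapters instead of all weights:
    # the "one GPU and a deadline" option from step 3.
    config = LoraConfig(
        r=8,                 # adapter rank: capacity vs. memory trade-off
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% of the base model
    ```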

  • View profile for Sarthak Rastogi

    AI engineer | Posts on agents + advanced RAG | Experienced in LLM research, ML engineering, Software Engineering

    25,245 followers

    OpenAI released a guide on how to improve LLMs' accuracy and consistency. Here are some lesser-known tactics I found very interesting:

    1. Prompt baking: log the inputs and outputs during a pilot phase to identify the most effective examples. This helps you refine and prune the data into a more efficient training set, which improves the model's performance.

    2. Scaling prompting with long contexts: a long context can cause the LLM to struggle to maintain attention over all the tokens in the input, especially if the instructions are very complex. In such cases it's important to evaluate your LLM on its ability to retrieve info from varying depths in long-context documents. Needle in a Haystack is one such model evaluation you can use. (A minimal sketch follows below.)

    3. Fine-tuning with RAG examples: they recommend incorporating your RAG context examples into the fine-tuning process. This teaches the model to leverage retrieved info effectively and generate more relevant outputs.

    The guide also mentions common recommendations like:
    - Splitting complex tasks into separate calls
    - Using chain-of-thought prompting (you can use: https://lnkd.in/gN5eHby5)
    - Using GPT-4 itself to evaluate and score its outputs for iterative improvement

    Here's the full guide: https://lnkd.in/gAzjKdyp

    #AI #LLMs #OpenAI
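    A minimal sketch of the needle-in-a-haystack idea from tactic 2: plant a known fact at varying depths inside long filler context and check whether the model can still retrieve it. The `generate` stub and the needle text are assumptions for illustration.

    ```python
    def generate(prompt: str) -> str:
        """Placeholder for any LLM call (hypothetical stub)."""
        raise NotImplementedError("wire up your model client here")

    NEEDLE = "The access code for the archive is 7319."
    QUESTION = "What is the access code for the archive?"

    def needle_in_haystack(filler: str, depths=(0.1, 0.5, 0.9)) -> dict[float, bool]:
        # Insert the needle at several relative depths and test retrieval at each.
        results = {}
        for depth in depths:
            cut = int(len(filler) * depth)
            context = filler[:cut] + "\n" + NEEDLE + "\n" + filler[cut:]
            answer = generate(f"{context}\n\n{QUESTION}")
            results[depth] = "7319" in answer
        return results
    ```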

  • View profile for Paul Iusztin

    Senior AI Engineer • Founder @ Decoding AI • Author @ LLM Engineer’s Handbook ~ I ship AI products and teach you about the process.

    98,298 followers

    A blueprint for designing production LLM systems: from notebooks to production.

    As an example, we will fine-tune an LLM and do RAG on social media data, but the design can easily be adapted to any data. We have 4 core components, following the feature/training/inference (FTI) pipeline architecture. (A minimal sketch of the three core pipelines follows below.)

    𝟭. 𝗗𝗮𝘁𝗮 𝗖𝗼𝗹𝗹𝗲𝗰𝘁𝗶𝗼𝗻 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲
    It is based on an ETL that:
    - crawls your data from blogs and socials
    - standardizes it
    - loads it into a NoSQL database (e.g., MongoDB)
    Because we work with text data, which is naturally unstructured, and no analytics are required, a NoSQL database fits like a glove.

    𝟮. 𝗙𝗲𝗮𝘁𝘂𝗿𝗲 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲
    It takes raw articles, posts, and code data points from the data warehouse, processes them, and loads them into a logical feature store.
    Let's focus on the logical feature store. As with any RAG-based system, a vector database is one of the central pieces of the infrastructure, so we use a vector database directly as the logical feature store. Unfortunately, the vector database doesn't offer the concept of a training dataset; to implement this, we wrap the retrieved data into a versioned, tracked, and shareable MLOps artifact.
    To conclude:
    - the training pipeline uses the instruct datasets as artifacts (offline)
    - the inference pipeline queries the vector DB for RAG (online)

    𝟯. 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲
    It consumes instruct datasets from the feature store, fine-tunes an LLM on them, and stores the tuned LLM weights in a model registry.
    More concretely, when a new instruct dataset is available in the logical feature store, we trigger the training pipeline, consume the artifact, and fine-tune the LLM. We run multiple experiments to find the best model and hyperparameters, using an experiment tracker to compare and select them. After the experimentation phase, we store and reuse the best hyperparameters for continuous training (CT). The LLM candidate's testing pipeline is then triggered for a detailed analysis; if it passes, the model is tagged as accepted and deployed to production. The modular design lets us leverage an ML orchestrator to schedule and trigger the pipelines for CT.

    𝟰. 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲
    It is connected to the model registry and the logical feature store: from the model registry it loads a fine-tuned LLM, and from the logical feature store it accesses the vector DB for RAG. It receives client requests as queries through a REST API and answers them by running RAG with the fine-tuned LLM and the vector DB. Everything is sent to a prompt monitoring system to analyze, debug, and understand the system.

    #artificialintelligence #machinelearning #mlops
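    A minimal sketch of the FTI separation described above: three functions whose only shared interfaces are the feature store (vector DB plus artifact store) and the model registry. All component names and methods here are duck-typed placeholders for illustration; the post's real stack uses MongoDB, a vector DB, an experiment tracker, and an ML orchestrator.

    ```python
    def feature_pipeline(raw_docs, embedder, vector_db, artifact_store):
        """Raw data -> logical feature store (vector DB + versioned artifact)."""
        chunks = [d.strip()[:2000] for d in raw_docs]            # stand-in cleaning
        vector_db.upsert([(c, embedder(c)) for c in chunks])     # online: RAG index
        artifact_store.save("instruct_dataset_v1", chunks)       # offline: training artifact

    def training_pipeline(trainer, artifact_store, model_registry):
        """Feature-store artifact -> fine-tuned weights in the model registry."""
        dataset = artifact_store.load("instruct_dataset_v1")
        model = trainer.fine_tune(dataset)                       # experiments tracked here
        if trainer.passes_tests(model):                          # gate before production
            model_registry.push(model, tag="accepted")

    def inference_pipeline(query, embedder, vector_db, model_registry):
        """REST handler: fine-tuned LLM + vector DB -> RAG answer."""
        model = model_registry.pull(tag="accepted")
        context = vector_db.search(embedder(query), k=5)         # RAG at request time
        return model.answer(query, context)                      # also sent to monitoring
    ```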
