LLM Model Training Using Hidden Labels


Summary

LLM model training using hidden labels refers to methods that allow large language models (LLMs) to improve and learn without relying on costly human-labeled datasets. By using techniques like reinforcement learning, reward functions, or synthetic data, these approaches tap into existing inputs or generate their own judgments, letting models grow smarter and more personalized automatically.

  • Experiment with self-supervision: Consider using teacher models or reward functions to create training signals, so your LLM can align with user preferences or reasoning tasks without manual labeling.
  • Iterate with synthetic judgments: Use iterative schemes where models evaluate their own outputs and generate new data, helping them improve their accuracy and consistency over time.
  • Personalize with unlabeled inputs: Make use of real-world user data, like chat histories or prompts, to fine-tune models for scalable personalization and better user alignment—no labeling required.
Summarized by AI based on LinkedIn member posts
  • Jonathan Frankle

    Chief AI Scientist at Databricks

    The hardest part about finetuning LLMs is that people generally don't have high-quality labeled data. Today, we at Databricks introduced TAO, a new finetuning method that needs only inputs, no labels necessary. Best of all, it actually beats supervised finetuning on labeled data. TAO has its roots in reinforcement learning and uses test-time compute methods as a source of training data. We built an enterprise-specific reward model (DBRM) and did a ton of work to find a process and algorithms that could train without labels.

    Why train without labels? We want to meet our customers where they are. Getting a high-quality labeled dataset for SFT is hard, but anyone who has deployed an LLM - even as a prototype - has an abundance of in-distribution inputs. Here's one scenario where TAO is great:
    1. Deploy a prompt-engineered LLM.
    2. Collect inputs.
    3. Use TAO on Llama.
    4. Deploy that.
    5. Repeat steps 2-4.
    The more people use your LLM, the better it gets, with no expensive data labeling required.

    On our eval suite, which included popular open-source tasks and custom in-house benchmarks, TAO matched or beat SFT, even without access to labels. This is a triumph for TAO, RL, and DBRM, yet it also shows that even our team's best attempt at getting labeled data can fall short. For all the triumphs of modern AI, I think we need to do a better job meeting users where they are if we want AI to really matter. That means helping users:
    * Specify what they're trying to accomplish
    * Leverage the data they already have
    * Find signals for non-verifiable tasks

    TAO is an important early result in our agenda at Databricks focused on data intelligence: helping enterprises make use of their data, in combination with AI, to solve their specific problems. There's lots more research where that came from. Stay tuned!
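
The post does not spell out TAO's internals, so the following is only a minimal sketch of the general idea it describes: spend test-time compute by sampling several candidate responses per unlabeled input, score them with a reward model, and fine-tune on the best-scoring ones. The model name, the `reward` helper, and the best-of-N selection are illustrative assumptions, not the actual TAO/DBRM pipeline.

```python
# Illustrative sketch (not the actual TAO algorithm): turn unlabeled inputs into
# a fine-tuning set by sampling N candidates per input and keeping the one a
# reward model scores highest.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"  # assumption: any open chat model works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def sample_candidates(prompt: str, n: int = 4, max_new_tokens: int = 256) -> list[str]:
    """Spend test-time compute: draw several diverse completions for one input."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

def reward(prompt: str, response: str) -> float:
    """Stand-in for an enterprise reward model like DBRM (not public); a trivial
    heuristic here so the sketch runs end to end."""
    return float(len(response.split()))  # placeholder score only

def build_training_set(unlabeled_prompts: list[str]) -> list[dict]:
    """Create (prompt, best response) pairs from inputs alone, with no human labels."""
    dataset = []
    for prompt in unlabeled_prompts:
        candidates = sample_candidates(prompt)
        best = max(candidates, key=lambda r: reward(prompt, r))
        dataset.append({"prompt": prompt, "response": best})
    return dataset  # feed this to your usual fine-tuning loop
```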

  • Bijit Ghosh

    CTO | CAIO | Leading AI/ML, Data & Digital Transformation

    I’ve been exploring a question that has sparked endless debates in the AI community: how do we actually teach LLMs to reason, without leaning on endless human labels or black-box pipelines? My latest blog post dives into this problem head-on, walking through how I built a reasoning LLM completely from scratch - 100% local, no labeled data, no human feedback loops - using Group Relative Policy Optimization (GRPO). For those unfamiliar, GRPO is a reinforcement fine-tuning method that replaces expensive human preference labels with deterministic reward functions. That makes it especially powerful for domains like math and logic, where correctness is inherently verifiable.

    I shared the full pipeline: loading a base open-weight model (Qwen3-4B-Base in my case), applying LoRA for parameter-efficient fine-tuning with UnslothAI, creating a structured reasoning dataset from the Open R1 Math corpus, and then defining simple but effective reward functions that check answers, numbers, and formats. With HuggingFace’s TRL library handling the GRPO trainer, the loop becomes elegant: generate → score → reinforce. The results speak for themselves: what started as a generic base model quickly transformed into a reasoning-focused specialist, capable of producing step-by-step solutions with consistency and accuracy.

    I also reflect on when to choose reinforcement fine-tuning over supervised fine-tuning, why reward shaping often matters more than raw model size, and how GRPO opens the door for label-free reasoning across other domains like code, logic, or theorem proving. For me, the biggest takeaway was how empowering it felt to run the entire workflow locally, with complete transparency and reproducibility. If you’re curious about the future of reasoning LLMs beyond the hype, this piece offers a grounded, hands-on narrative. https://lnkd.in/g2BdmPJr
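
As a rough illustration of the generate → score → reinforce loop described above, here is a hedged sketch using TRL's GRPOTrainer with deterministic reward functions. The model name, the toy dataset, and the exact reward details are assumptions for illustration, not the author's code; the post's actual pipeline uses Qwen3-4B-Base, LoRA via UnslothAI, and the Open R1 Math corpus.

```python
# Hedged sketch of GRPO with deterministic, verifiable rewards (no human
# preference labels): candidates are sampled per prompt and scored against
# each other, and the reward checks the answer and the output format.
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Tiny stand-in dataset: "prompt" is required; the extra "answer" column is
# forwarded to the reward functions as a keyword argument.
train_dataset = Dataset.from_list([
    {"prompt": "What is 12 * 7? Answer with just the number.", "answer": "84"},
    {"prompt": "What is 9 + 16? Answer with just the number.", "answer": "25"},
])

def correctness_reward(completions, answer, **kwargs):
    """1.0 if the corpus answer appears in the completion, else 0.0 --
    a deterministic check instead of a human preference label."""
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

def format_reward(completions, **kwargs):
    """Small bonus for completions that end with a bare number, nudging the format."""
    return [0.2 if re.search(r"\d+\s*$", c) else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-math-sketch",
    num_generations=4,        # candidates sampled and compared per prompt
    max_completion_length=64,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # assumption: a small open model for illustration
    reward_funcs=[correctness_reward, format_reward],
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()  # generate -> score -> reinforce
```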

  • Aadharsh Kannan

    Making AI agents reliable, trustworthy, and safe at scale for your enterprise

    I’ve been exploring how we can make LLMs genuinely reflect individual users without the need for manual data labeling. In this piece, I introduce a method called Adversarial Contrastive Distillation (ACD) to enable Self-Supervised Persona Fine-Tuning. The idea is straightforward: use a teacher model to generate “contrarian” examples, then fine-tune a smaller model to better align with a user’s tone, style, or beliefs automatically. To put this into practice, I fine-tuned a DistilBERT model on four years of WhatsApp chat history with a close friend. The result was an accuracy jump from 66.91 percent to 85.69 percent, requiring zero manual labeling. If you are working on scalable personalization in LLMs or just curious about how self-supervision can make AI more you, I’d love to hear your thoughts. https://lnkd.in/djPuWJXZ
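
The post does not spell out the exact ACD objective, so the following is only a minimal sketch of one plausible reading of the setup: a teacher produces "contrarian" rewrites of the user's real messages, and DistilBERT is fine-tuned to separate authentic user text from the rewrites. The `contrarian_rewrite` stub, the sample messages, and the binary-classification framing are illustrative assumptions.

```python
# Hedged sketch of a self-supervised persona setup: contrast real user messages
# with teacher-generated "contrarian" versions and fine-tune DistilBERT to tell
# them apart. No manual labeling; the labels come from how the data was built.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def contrarian_rewrite(message: str) -> str:
    """Stand-in for the teacher model that produces a contrasting example
    (opposite stance / different tone). A real pipeline would prompt an LLM."""
    return "I completely disagree: " + message  # placeholder only

user_messages = [
    "honestly the new framework release looks great",
    "let's just ship it and fix the edge cases later",
]

# Label 1 = written by the user, 0 = contrarian rewrite from the teacher.
examples = (
    [{"text": m, "label": 1} for m in user_messages]
    + [{"text": contrarian_rewrite(m), "label": 0} for m in user_messages]
)
dataset = Dataset.from_list(examples)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="persona-distilbert", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```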

  • Sachin Kumar

    Senior Data Scientist III at LexisNexis | Experienced Agentic AI and Generative AI Expert

    Self-Taught Evaluators: improving LLM-as-Judge evaluators without human annotations

    Training evaluators requires a large amount of human preference judgments over model responses, which is costly, and the data becomes stale as models improve. To address this, the authors of this paper present an approach that improves evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, the proposed iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions.

    𝗠𝗲𝘁𝗵𝗼𝗱𝗼𝗹𝗼𝗴𝘆
    An iterative training scheme that bootstraps improvements by annotating the current model’s judgments with constructed synthetic data, so that the Self-Taught Evaluator is more performant on the next iteration. The pipeline (sketched in code below):
    i) Initialization - assume access to a large set of human-written user instructions and an initial seed LLM
    ii) Instruction Selection - select a challenging, balanced distribution of user instructions from the uncurated set by categorizing them via LLM
    iii) Response Pair Construction - for each user instruction (example), create a preference pair of two model responses (chosen & rejected), generating them via prompting such that the rejected response is likely of lower quality than the chosen response
    iv) Judgment Annotation - for each example, sample up to N LLM-as-a-Judge reasoning traces and judgments from the current model; if a correct judgment is found, add that example to the training set, otherwise discard it
    v) Model Fine-tuning (Iterative Training) - fine-tune the model on the newly constructed training set, which yields an updated model for the next iteration

    𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗥𝗲𝘀𝘂𝗹𝘁𝘀
    - the model is initialized from Llama3-70B-Instruct
    - the Self-Taught Evaluator improves a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench
    - it outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples

    𝗟𝗶𝗺𝗶𝘁𝗮𝘁𝗶𝗼𝗻𝘀
    - a large LLM (70B params) was used; whether smaller models work with this approach was not explored
    - the approach requires a capable instruction fine-tuned model that is already reasonably aligned to human preferences
    - LLM-as-a-Judge models usually produce longer outputs, and thus higher inference cost, because they generate reasoning chains

    𝗕𝗹𝗼𝗴: https://lnkd.in/e_yWKX4y 𝗣𝗮𝗽𝗲𝗿: https://lnkd.in/eN4hYZBb
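
A compressed sketch of one iteration of the loop above. The LLM calls are stubbed so the control flow is self-contained and runnable; the prompt wording, the corruption trick for building the rejected response, and the verdict-parsing logic are assumptions rather than the paper's exact prompts.

```python
# Skeleton of one Self-Taught Evaluator iteration: build synthetic preference
# pairs from unlabeled instructions, keep only examples where the current judge
# produces the known-correct verdict, then fine-tune on that set and repeat.
import random
from dataclasses import dataclass

@dataclass
class JudgeExample:
    instruction: str
    chosen: str
    rejected: str
    judgment: str  # reasoning trace + verdict produced by the current judge

def generate(prompt: str) -> str:
    """Stub for the seed/current LLM; replace with a real generation call."""
    return f"[model output for: {prompt[:40]}...]"

def build_pair(instruction: str) -> tuple[str, str]:
    """Chosen = answer to the original instruction; rejected = answer to a
    slightly corrupted instruction, so it is likely of lower quality."""
    chosen = generate(instruction)
    rejected = generate(instruction + " (answer a subtly different question)")
    return chosen, rejected

def judge(instruction: str, a: str, b: str) -> str:
    """Stub judge: returns a reasoning trace ending in a verdict, 'A' or 'B'."""
    return "Reasoning... Verdict: " + random.choice(["A", "B"])

def one_iteration(instructions: list[str], n_samples: int = 4) -> list[JudgeExample]:
    train_set = []
    for instr in instructions:
        chosen, rejected = build_pair(instr)
        for _ in range(n_samples):  # sample up to N judgments
            verdict = judge(instr, chosen, rejected)
            if verdict.strip().endswith("A"):  # "A" (chosen) is known correct by construction
                train_set.append(JudgeExample(instr, chosen, rejected, verdict))
                break  # keep the first correct judgment; otherwise discard the example
    return train_set  # fine-tune the judge on this set, then run the next iteration

print(len(one_iteration(["Explain overfitting to a new engineer."])))
```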

  • Andrei Lopatenko

    VP, Applied AI @ Govini | Transforming Defense with AI | Ex-Google, Apple, eBay, Zillow | Hiring AI Leaders

    “In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on the AIME 2024 with only unlabeled test data”
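
TTRL's central idea is to use majority voting over the model's own sampled answers as a pseudo-reward on unlabeled test prompts. Below is a minimal sketch of that reward computation only; the RL update itself is omitted, and the answer-extraction regex and helper names are assumptions for illustration.

```python
# Minimal sketch of TTRL-style pseudo-rewards: with no ground-truth labels,
# the majority answer among the model's own samples serves as the label, and
# each sample is rewarded for agreeing with it. The RL update is not shown.
import re
from collections import Counter

def extract_answer(completion: str) -> str:
    """Pull the final number out of a chain-of-thought completion (assumed format)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def majority_vote_rewards(completions: list[str]) -> list[float]:
    """Reward 1.0 for samples whose extracted answer matches the majority answer."""
    answers = [extract_answer(c) for c in completions]
    voted = Counter(a for a in answers if a)
    if not voted:
        return [0.0] * len(completions)
    majority, _ = voted.most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Example: 3 of 4 samples agree on "84", so they get reward 1.0 and the outlier 0.0.
samples = ["... so the answer is 84", "therefore 84", "I get 85", "final answer: 84"]
print(majority_vote_rewards(samples))  # [1.0, 1.0, 0.0, 1.0]
```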
