Last week we ran an $18,000 clinical query for just over $180.

A team we work with wanted to extract a set of clinical terms from molecular profiling notes, but they had so many notes that doing it directly with an LLM would have cost over $18,000. After some optimization, we ran the same query for less than $200. Here are a few of the changes that made this possible.

1. Agent-Driven Relevance Pruning

LLM costs here are driven primarily by the huge quantity of tokens that clinical data can contain. Discharge summaries and lab reports are full of boilerplate, so we trim those tokens long before the frontier model sees them. In practice, this means:

- One model is responsible for planning the pruning process.
- Irrelevant notes and text chunks are pruned using tools ranging from regular expressions to inexpensive LLM prompts. For example, one agent might auto-generate regexes for section headers it finds (e.g., "Operative Findings:", "Pathology:"). In other cases, we might use a small, cheap model to decide whether to include negating text like "no evidence of DVT". For some document types we found that this step can remove up to ~90% of irrelevant text. (A minimal sketch of this kind of pruning follows below.)
- Lastly, an advanced reasoning model takes the combined pruned and transformed data for each patient and reduces it to the answer for the query.

As a result, the vast majority of tokens never reach our most expensive models, while sensitivity and specificity are completely unchanged.

2. Incremental Context Construction

Vector search alone misses subtle synonyms in clinical language. Instead of relying on vector stores (which mostly amounts to hoping the embeddings align), we intentionally build up the necessary context for each label. In practice, we split notes into chunks and tag them with section metadata. We then search every chunk and tune the amount of surrounding context we include for each hit. Some types of notes don't require much additional context, while others benefit from the entire document being passed in. Because we're actively involved in the needle-in-a-haystack search, we have fine-grained control over the process. We found that this yields higher-quality signals from large document sets.

3. Continuous Token Accounting

One token-saving tweak is inserting line numbers into the inputs and having the model cite line numbers instead of quoting text. This minor fix alone has saved us double-digit percentages in cost. Another optimization is routing tasks to different LLMs depending on complexity: with Sonnet costing roughly 100x as much as Nova, efficient triaging saves a great deal as well.

If you run large-scale chart reviews, directly using a frontier model can be like lighting cash on fire. With these improvements we were able to reduce ongoing model costs by two orders of magnitude while improving sensitivity. Have you dealt with similar problems using LLMs for clinical notes? Have any of these worked for you, or did you run into something I didn't mention? Let me know!
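A minimal sketch of two of the ideas above, under assumed inputs: pruning note sections with auto-generated header regexes, and prefixing line numbers so the expensive model can cite lines instead of quoting them. The header patterns, sample note, and function names are hypothetical, not the team's actual pipeline.

```python
import re

# Hypothetical header patterns an agent might auto-generate for one document type.
RELEVANT_HEADER = re.compile(r"^(Operative Findings|Pathology|Impression):", re.IGNORECASE)
BOILERPLATE_HEADER = re.compile(r"^(Patient Instructions|Billing Codes|Signature):", re.IGNORECASE)

def prune_note(note: str) -> str:
    """Keep only the sections whose headers look clinically relevant."""
    kept, keep = [], False
    for line in note.splitlines():
        if RELEVANT_HEADER.match(line):
            keep = True
        elif BOILERPLATE_HEADER.match(line):
            keep = False
        if keep:
            kept.append(line)
    return "\n".join(kept)

def number_lines(text: str) -> str:
    """Prefix line numbers so the model can cite "L2-L3" instead of quoting whole spans."""
    return "\n".join(f"L{i}: {line}" for i, line in enumerate(text.splitlines(), start=1))

note = (
    "Patient Instructions: Follow up in 2 weeks.\n"
    "Pathology: Invasive ductal carcinoma, ER+/PR+.\n"
    "Billing Codes: 99213, 88305.\n"
)
print(number_lines(prune_note(note)))  # only the Pathology section survives, line-numbered
```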
Automating Relevance Labeling With LLMs
Explore top LinkedIn content from expert professionals.
Summary
Automating relevance labeling with large language models (LLMs) means using advanced AI to quickly and accurately determine how relevant data or search results are to specific questions or needs—without needing manual review for each item. This process makes evaluating huge amounts of information faster, cheaper, and more consistent for industries like healthcare, search engines, and knowledge management.
- Streamline data processing: Use LLMs to prune irrelevant content from large datasets before deeper analysis, which can save significant costs and reduce workload.
- Build context thoughtfully: Adjust the amount of surrounding information considered by LLMs for each label to improve the quality of relevance signals, especially when handling complex documents or queries.
- Combine automation and human review: Let LLMs handle repetitive relevance labeling tasks, but involve people to guide model improvements and check results where decisions are tricky or highly nuanced.
Building useful Knowledge Graphs will long be a Humans + AI endeavor. A recent paper lays out how best to implement automation, the specific human roles, and how the two are combined. The paper, "From human experts to machines: An LLM supported approach to ontology and knowledge graph construction", provides clear lessons, including:

🔍 Automate KG construction with targeted human oversight: Use LLMs to automate repetitive tasks like entity extraction and relationship mapping. Human experts should step in at two key points: early, to define scope and competency questions (CQs), and later, to review and fine-tune LLM outputs, focusing on complex areas where LLMs may misinterpret data. Combining automation with human-in-the-loop ensures accuracy while saving time.

❓ Guide ontology development with well-crafted Competency Questions (CQs): CQs define what the Knowledge Graph (KG) must answer, like "What preprocessing techniques were used?" Experts should create CQs to ensure domain relevance and review LLM-generated CQs for completeness. Once validated, these CQs guide the ontology's structure, reducing errors in later stages.

🧑‍⚖️ Use LLMs to evaluate outputs, with humans as quality gatekeepers: LLMs can assess KG accuracy by comparing answers to ground-truth data, with humans reviewing outputs that score below a set threshold (e.g., 6/10). This setup lets LLMs handle initial quality control while humans focus only on edge cases, improving efficiency and ensuring quality.

🌱 Leverage reusable ontologies and refine with human expertise: Start with pre-built ontologies like PROV-O to structure the KG, then refine it with domain-specific details. Humans should guide this refinement, ensuring that the KG remains accurate and relevant to the domain's nuances, particularly in specialized terms and relationships.

⚙️ Optimize prompt engineering with iterative feedback: Prompts for LLMs should be carefully structured, starting simple and iterating based on feedback. Use in-context examples to reduce variability and improve consistency. Human experts should refine these prompts so they lead to accurate entity and relationship extraction, combining automation with expert oversight for best results.

These lessons provide a solid foundation for optimally applying human and machine capabilities to the important task of building robust, useful ontologies.
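As a rough illustration of the "LLM evaluates, human gatekeeps" point above, here is a minimal sketch of threshold-gated review routing. The `llm_grade` parameter is an assumed scoring function (any LLM judge returning 0-10 against ground truth), not something defined in the paper.

```python
REVIEW_THRESHOLD = 6  # answers scoring below this go to a human reviewer

def route_for_review(cq, kg_answer, ground_truth, llm_grade):
    """llm_grade(question, answer, reference) is assumed to return an integer 0-10."""
    score = llm_grade(question=cq, answer=kg_answer, reference=ground_truth)
    if score < REVIEW_THRESHOLD:
        return ("human_review", score)   # edge case: a domain expert checks and corrects
    return ("accepted", score)           # LLM-verified, no human time spent
```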
-
Search is a crucial part of many modern internet products, as companies strive to improve the relevance of search results to enhance customer experience and increase retention. The first step in making these improvements is to measure relevance accurately. This blog, written by the data science team at Faire, shares their approach to using large language models (LLMs) to measure semantic relevance in search.

- To define semantic relevance, the team uses a tiered approach based on the ESCI framework, which classifies each search result as "Exact," "Substitute," "Complement," or "Irrelevant." This classification allows for flexible relevance labeling and can fit a variety of downstream application needs.
- To measure semantic relevance, the team initially relied on human annotators. However, this method was costly and slow, providing only a general measurement on a monthly cadence. With recent advancements in LLMs, the team transitioned to using these models to assess the relevance between search queries and products automatically. They fine-tuned a leading LLM to align with the human labelers and measured agreement; the higher the agreement, the better the LLM performance. This LLM could then scale out much more effectively to provide daily evaluations of search's semantic performance.
- The team's LLM approach underwent multiple iterations, including adopting more advanced models (e.g., LLaMA 3) and more complex techniques (like quantization and horizontal scaling). With these efforts, the solution reached reasonable accuracy with good scalability and serves the team's purpose of measuring semantic performance to guide their improvements.

This case study highlights that successful LLM applications need clear problem definitions, high-quality labeled data, and iterative model improvements, similar to standard machine learning product integration. It also demonstrates the potential of fine-tuned LLMs in the AI era, making it a compelling read!

#machinelearning #datascience #llm #ai #search #relevance

Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- Youtube: https://lnkd.in/gcwPeBmR
https://lnkd.in/gwaxRs2r
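To make the "measure agreement with human labelers" step concrete, here is a minimal sketch using scikit-learn. The label lists are hypothetical toy data, not Faire's evaluation set.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

ESCI = ["Exact", "Substitute", "Complement", "Irrelevant"]

# Hypothetical evaluation set: human annotations vs. LLM-assigned labels for the same pairs.
human_labels = ["Exact", "Substitute", "Irrelevant", "Exact", "Complement"]
llm_labels   = ["Exact", "Substitute", "Irrelevant", "Substitute", "Complement"]

print("raw agreement:", accuracy_score(human_labels, llm_labels))
print("cohen's kappa:", cohen_kappa_score(human_labels, llm_labels, labels=ESCI))
```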
-
Evaluating Retrieval-Augmented Generation (RAG) systems has long been a challenge, given the complexity and subjectivity of long-form responses. A recent collaborative research paper from institutions including the University of Waterloo, Microsoft, and Snowflake presents a promising solution: the AutoNuggetizer framework. This approach leverages Large Language Models (LLMs) to automate the "nugget evaluation methodology," originally proposed by TREC in 2003 for assessing responses to complex questions. Here's a technical breakdown of how it works under the hood:

1. Nugget Creation:
- LLMs automatically extract "nuggets," or atomic pieces of essential information, from a set of related documents.
- Nuggets are classified as "vital" (must-have) or "okay" (nice-to-have) based on their importance in a comprehensive response.
- An iterative prompt-based approach using GPT-4o ensures the nuggets are diverse and cover different informational facets.

2. Nugget Assignment:
- LLMs then automatically evaluate each system-generated response, marking each nugget as "support," "partial support," or "no support."
- This semantic evaluation allows the model to recognize supported facts even without direct lexical matching.

3. Evaluation and Correlation:
- Automated evaluation scores strongly correlated with manual evaluations, particularly at the system-run level, suggesting this methodology could scale efficiently for broad usage.
- Interestingly, automating the nugget assignment alone significantly increased alignment with manual evaluations, highlighting its potential as a cost-effective evaluation approach.

Through rigorous validation against human annotations, the AutoNuggetizer framework demonstrates a practical balance between automation and evaluation quality, providing a scalable, accurate method to advance RAG system evaluation. The research underscores not just the potential of automating complex evaluations, but also opens avenues for future improvements in RAG systems.
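A small sketch of how nugget-based scoring could work in spirit: the vital/okay weights and half credit for partial support are illustrative assumptions, not the paper's exact formula.

```python
# Illustrative weights, not the paper's scoring: 'vital' nuggets count fully,
# 'okay' nuggets count less, and partial support earns half credit.
IMPORTANCE = {"vital": 1.0, "okay": 0.5}
SUPPORT = {"support": 1.0, "partial_support": 0.5, "no_support": 0.0}

def nugget_score(assignments):
    """assignments: list of (importance, support_level) pairs for one system response."""
    total = sum(IMPORTANCE[imp] for imp, _ in assignments)
    earned = sum(IMPORTANCE[imp] * SUPPORT[sup] for imp, sup in assignments)
    return earned / total if total else 0.0

print(nugget_score([("vital", "support"), ("vital", "partial_support"), ("okay", "no_support")]))
```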
-
Pinterest improved search relevance by 2.18% and global search fulfillment by 1.5%, without needing labeled data for most countries. The key? A distilled LLM-based relevance system that combines deep semantics with real-world scalability. Here's how it works:

🔍 Cross-Encoder Teacher Model
Pinterest fine-tuned multilingual LLMs using 5-scale human relevance labels. Here, LLaMA-3 8B outperformed XLM-R and mDeBERTa-v3. These models outperformed embedding-based baselines by up to 19.7% in accuracy, especially on long, nuanced queries.

📎 Enriched Pin Text Representations
Each Pin is described using:
• User-written titles and descriptions
• Synthetic image captions from BLIP
• High-engagement past query terms
• Board titles from user curation
• Titles and metadata from linked websites
These textual signals drastically boosted relevance accuracy in ablation studies.

⚡ Distilled Student Model for Real-Time Use
LLMs are expensive and slow for production, so Pinterest distilled this teacher into a lightweight student model using billions of search impression logs via semi-supervised learning. The student model uses:
• Query features (SearchSAGE embeddings, interest vectors)
• Pin features (visual embeddings, PinSAGE)
• Interaction features (BM25, text match scores, engagement rates)
These are fed into a feed-forward network to predict 5-scale relevance.

🌍 Multilingual Generalization by Design
Even though training data was mostly US English, the teacher model's multilingual nature enabled robust generalization to unseen countries and languages.

📈 Real-World Impact
• +2.18% in nDCG@20 from human relevance evaluations
• +1.5% in search fulfillment rate, driven by better ranking of high-quality content
• Improvements seen globally, including markets with no annotated data

Pinterest shows how to make LLMs useful in production: use them to teach, not to serve. Source blog in comments.
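A minimal PyTorch sketch of the distillation idea (teacher soft labels, lightweight feed-forward student). The feature dimensions and the KL-divergence objective are assumptions for illustration, not Pinterest's published setup.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real system concatenates query, Pin, and
# interaction features (SearchSAGE, PinSAGE, BM25, engagement rates, ...).
FEATURE_DIM, NUM_LEVELS = 256, 5

student = nn.Sequential(
    nn.Linear(FEATURE_DIM, 128), nn.ReLU(),
    nn.Linear(128, NUM_LEVELS),          # logits over the 5-scale relevance labels
)

def distillation_step(features, teacher_probs, optimizer):
    """Train the student to match the teacher's soft relevance distribution."""
    log_probs = torch.log_softmax(student(features), dim=-1)
    loss = nn.functional.kl_div(log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```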
-
LLMs are exceptional at zero-shot classification, and the ability to add a label to a text is extremely valuable in recruiting. However, there are limits. You cannot use zero-shot classification directly if you have a huge taxonomy (say, 10K labels): putting 10K possible labels in a prompt is expensive and will also produce bad results. You need a way to reduce the candidate labels to a set of 60-100. You have multiple options, but two have worked best for me:

1) Compute a fixed vector for each label by embedding the label name + description with an "off-the-shelf" model, compare this to the new document's embedding, and retrieve the top-K similar labels. Use these candidate labels in your zero-shot prompt. My vector DB of choice is Qdrant for its speed and the simplicity of getting started. (A minimal sketch of this option follows below.)

2) Transform your taxonomy into a 2-level hierarchy. Use the LLM to pick the first level of the taxonomy in prompt one, then submit all available labels from that branch in prompt two for the actual classification.

In the second option, you can train a "lightweight" classifier (using FastText, for example) instead of an LLM. You can also embed the top-level categories using the same logic as in option 1.

Let me know in the comments if you are building classifiers for complex taxonomies and how you solve these challenges!
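A minimal sketch of option 1, assuming sentence-transformers for the off-the-shelf embeddings and a plain in-memory similarity search (a vector DB like Qdrant would slot in the same way). The taxonomy slice and document are toy examples.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Off-the-shelf embedding model; any label-level vector store (e.g., Qdrant) works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical slice of a large taxonomy: label name + description per entry.
labels = {
    "Backend Engineer": "Builds and maintains server-side services and APIs.",
    "Data Engineer": "Designs pipelines that move and transform data at scale.",
    "Recruiter": "Sources, screens, and hires candidates for open roles.",
}
label_vecs = model.encode([f"{name}: {desc}" for name, desc in labels.items()],
                          normalize_embeddings=True)

def shortlist(document: str, top_k: int = 2):
    """Return the top-K candidate labels to place into the zero-shot classification prompt."""
    doc_vec = model.encode([document], normalize_embeddings=True)[0]
    scores = label_vecs @ doc_vec            # cosine similarity on normalized vectors
    order = np.argsort(-scores)[:top_k]
    return [list(labels)[i] for i in order]

print(shortlist("Seeking someone to own our Kafka ingestion and dbt models"))
```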
-
Using LLMs to assist RecSys and to generate data with them!

TLDR: The most effective use of LLMs in large-scale systems is as a component in a data-generation flywheel. Use large models offline to create high-quality labels, metadata, and synthetic data. Then distill this into specialized, cost-effective models that can be deployed as part of a robust, hybrid production architecture. This approach maximizes semantic quality while respecting latency and cost constraints.

Here are the 3 technical patterns observed across the industry:

1. LLM-Powered Features for Hybrid Rankers: LLM-derived signals are integrated as features into existing, battle-tested ranking systems (e.g., GBDTs), rather than replacing them entirely. This combines the semantic power of LLMs with the performance of traditional rankers.
- Bing: An LLM-finetuned MiniLM cross-encoder generates a quality score, which is linearly combined with a click prediction score from a LightGBM ranker. This explicitly injects a quality signal to counteract CTR bias.
- Spotify: LLM-generated synthetic queries are candidates in a system ranked by a point-wise ranker that uses a mix of lexical, statistical, and user vector features.

2. Offline Generation for Post-Processing Filters: When online LLM inference is too slow, LLMs are used to generate labels for a lightweight classifier that acts as a real-time filter.
- A finetuned GPT-3.5 with 6.7s latency was too slow for online use. They used it to generate millions of "bad match" labels to train a lightweight eBadMatch classifier. This classifier runs in production as a post-processing filter, removing low-quality recommendations with an AUC-ROC of 0.86.

3. Scaling with Caching and Tiered Models: To manage cost and latency across a power-law distribution of traffic, a tiered approach is used:
- Yelp: Pre-computed and cached LLM responses (e.g., query segmentations, synonym expansions) for common "head" queries, covering 95% of traffic. For the long tail, they use smaller, real-time models like BERT and T5.
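To illustrate the first pattern, here is a tiny sketch of linearly blending an LLM-derived quality score with a click-prediction score. The weights and candidate records are hypothetical, not Bing's actual values.

```python
# Hypothetical blend weights; the pattern injects a quality signal alongside
# the existing click predictor instead of replacing the production ranker.
QUALITY_WEIGHT, CLICK_WEIGHT = 0.3, 0.7

def hybrid_score(click_score: float, llm_quality_score: float) -> float:
    """Linear blend: the semantic quality signal counteracts pure CTR bias."""
    return CLICK_WEIGHT * click_score + QUALITY_WEIGHT * llm_quality_score

candidates = [
    {"doc": "clickbait listicle", "ctr_pred": 0.9, "quality": 0.2},
    {"doc": "in-depth guide",     "ctr_pred": 0.6, "quality": 0.9},
]
ranked = sorted(candidates, key=lambda c: hybrid_score(c["ctr_pred"], c["quality"]), reverse=True)
print([c["doc"] for c in ranked])
```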
-
🎇 #New #Publication! DIRAS: Efficient LLM Annotation of Document Relevance in Retrieval Augmented Generation

I'm super happy that this important piece was accepted to NAACL 2025! DIRAS explores a largely overlooked yet significant aspect of #Retrieval #Augmented #Generation and general #Large #Language #Model usage: how do we determine the importance of a source for a question?

❓ Why does it matter? When we use LLMs (or even humans) to answer questions, we know that more (irrelevant) information is confusing, and too little information is insufficient. Give me the right sources and I'll give you the right answer!

↘️ How do classical approaches handle this? Usually, people just take the most similar text paragraphs in a document as sources to answer a question.

🚨 #PROBLEM: If I want to know the name of the CEO of a company, I need one sentence of its business report. If I want to know about new business innovations, I likely need two to three pages. The top-most-similar approach doesn't account for that! It uses a fixed number of passages in both cases, which confuses or falsifies the answer.

🗝️ #SOLUTION: DIRAS! In this paper, we propose a system that tells you #HOW #RELEVANT the source is. This way, we can filter based on relevance thresholds. We show how to create such an evaluator for specific tasks in a scalable way, reaching GPT-4-level performance (or Deepseek-level performance lol). Besides, DIRAS helps to create rankings that order the sources from most to least important.

🔗 Much more can be found in the paper: https://lnkd.in/dSNsFnM3

It's a super nice recognition to be accepted to NAACL. I think the importance of DIRAS for researchers and practitioners is very high, so I'm happy to hear feedback. Big thanks to my co-authors Jingwei Ni, Meihong Lin, Mrinmaya Sachan, Elliott Ash, and Markus Leippold! University of Zurich | ETH Zürich #NAACL25 #ACL #RAG #InformationRetrieval
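A minimal sketch of the thresholding idea (filter sources by a relevance score instead of a fixed top-k). Here `score_fn` stands in for whatever relevance annotator you have, and the threshold is an arbitrary example, not the DIRAS implementation.

```python
def select_context(chunks, question, score_fn, threshold=0.6):
    """score_fn(question, chunk) -> float in [0, 1]; keep everything above the threshold,
    so a CEO lookup may return one sentence while an open question returns several pages."""
    scored = [(score_fn(question, chunk), chunk) for chunk in chunks]
    return [chunk for score, chunk in sorted(scored, reverse=True) if score >= threshold]
```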
-
Query understanding and ranking & relevancy are integral and interconnected components of search and chat systems. 𝗤𝘂𝗲𝗿𝘆 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 is about interpreting a user's query to identify their intent and information needs. 𝗥𝗮𝗻𝗸𝗶𝗻𝗴 𝗮𝗻𝗱 𝗥𝗲𝗹𝗲𝘃𝗮𝗻𝗰𝘆 involves ordering retrieved content based on how well it aligns with that intent. Two recent works illustrate how LLMs can be leveraged to improve query understanding and ranking systems.

𝟭. In "Search Query Understanding with LLMs: From Ideation to Production", Yelp leverages LLMs to improve query understanding and features such as review highlights. For a query like "Ramen Upper West Side", the LLM can recognize "Ramen" as food and "Upper West Side" as a location in Manhattan instead of just "New York". In healthcare, "top dermatologists in Buckhead" would be segmented as "dermatologists" (provider type/specialty) and "Buckhead" (a location in Atlanta) instead of a generic Atlanta, GA location. Yelp also trains an LLM with curated phrase examples for review highlights. For the query "restaurants near me with great outdoor patio", reviews mentioning "beautiful outdoor patio" or "charming patio" might be highlighted. In healthcare, a query for pediatricians could highlight reviews containing "gentle approach" or "good with kids".

𝟮. In "Improving Pinterest Search Relevance Using Large Language Models", Pinterest enhances search relevance by considering additional Pin text like captions, context from user boards, and past user interactions to better understand search intent. They fine-tune a multilingual LLM using a human-curated dataset with a 5-level relevance scale. For example, in healthcare, an LLM analyzing "knee pain" would examine not only related search queries but also medical image captions, associated research papers, and content from knee pain support groups to gain a deeper understanding.

𝟯. Practical considerations:
- Yelp: start with a proof of concept (POC) for a targeted use case where LLMs can add value, validated via testing and evaluation. Balance cost-effectiveness by caching LLM results for complex queries while fine-tuning a smaller model for simpler and long-tail queries, trading off latency against computational cost.
- Pinterest: uses knowledge distillation to create a smaller, faster "student" model for real-time search by having it mimic a larger LLM. To handle the limited labeled data for LLM training, they use semi-supervised learning, combining a small amount of human-labeled data with a large amount of unlabeled search log data: the model learns from the labeled data, and its predictions on the unlabeled data serve as additional training examples for the student model.
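As a rough sketch of the query-segmentation idea, here is a hypothetical prompt-plus-parse helper. `call_llm` is an assumed chat-completion wrapper and the JSON schema is illustrative, not Yelp's production prompt.

```python
import json

# Hypothetical segmentation prompt in the spirit of the Yelp example.
SEGMENTATION_PROMPT = """Segment the search query into structured fields.
Return JSON with keys: topic, location, modifiers.
Query: {query}"""

def segment_query(query: str, call_llm):
    """call_llm(prompt) is assumed to return the model's raw text output."""
    raw = call_llm(SEGMENTATION_PROMPT.format(query=query))
    return json.loads(raw)  # e.g. {"topic": "dermatologists", "location": "Buckhead", "modifiers": ["top"]}
```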