Machine Learning Model Tuning

Explore top LinkedIn content from expert professionals.

  • View profile for Terezija Semenski, MSc

    Helping 300,000+ people master AI and Math fundamentals faster | LinkedIn [in]structor 15 courses | Author @ Math Mindset newsletter

    31,107 followers

I taught myself machine learning more than 10 years ago. If I had to start again today, I wouldn't touch models, LLMs, or agents first, as many AI experts suggest. I'd start with the math and the code. Ugly truth: 90% of people skip the foundations, then wonder why everything feels like magic or falls apart in production. If you want to be different and actually understand ML, not just copy-paste, this is the roadmap I'd follow.

Start with fundamentals: no matter how fast LLMs or GenAI evolve, your math, code, and logic will keep you relevant. Here's what to focus on:

📐 1. Linear Algebra
Learn these core ideas: vectors, matrices, tensors; matrix multiplication (dot products, broadcasting); transpose, inverse, rank, determinants; eigenvalues & eigenvectors (especially for PCA & embeddings); projections and orthogonality.
✅ Use NumPy to implement everything yourself → practice matrix ops, dot products, and visualizing transformations with Matplotlib.

🔁 2. Calculus
Focus on: derivatives & partial derivatives; the chain rule (for backpropagation in neural nets); gradient descent; convex functions, minima/maxima.
✅ Use SymPy or JAX to visualize and compute derivatives → plot functions and their gradients to develop deep intuition.

🎲 3. Probability
You need a solid grip on: random variables (discrete & continuous); conditional probability & Bayes' rule; joint & marginal probability; the chain rule of probability; expectation, variance, entropy; common distributions (Bernoulli, Binomial, Gaussian, Poisson); the central limit theorem; the law of large numbers.
✅ Simulate simple probability experiments in Python with NumPy → e.g., sample from distributions.

📊 4. Statistics
Must-know topics: descriptive stats (mean, median, mode, standard deviation); hypothesis testing (p-values, confidence intervals, t-tests); correlation vs. causation; sampling, bias, and variance; overfitting/underfitting; A/B testing basics.
✅ Use Pandas & SciPy to explore real datasets → calculate descriptive stats, create histograms/box plots, run t-tests.

🔧 Essential Python libraries to learn early
- NumPy – vectorized math and fast array ops
- Pandas – loading, cleaning, and analyzing tabular data
- Matplotlib / Seaborn – plotting and visualizing distributions, relationships, and trends
- SymPy – symbolic math and calculus
- SciPy – stats, optimization, and numerical methods
- Jupyter Notebooks – to combine math, code, and visuals in one place

📚 Best resources to nail the fundamentals:
✅ Machine Learning Foundations math series (ML Foundations: Linear Algebra, Calculus, Probability, and Statistics) – a series of 4 courses I created together with LinkedIn Learning
✅ Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron
✅ The Hundred-Page Machine Learning Book by Andriy Burkov

If you want to become an actual ML engineer, not just someone who watches and copies demos, start here. ♻️ Repost to help others 💚
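To make the "implement it yourself" advice concrete, here is a minimal NumPy sketch in the spirit of the post (not from it): an eigendecomposition of a covariance matrix as a toy PCA, followed by gradient descent on a convex least-squares loss. The toy data, step size, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Linear algebra / PCA: eigenvectors of a covariance matrix ---
# Toy data stretched along three axes with very different variances.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)          # symmetric matrix -> use eigh
order = np.argsort(eigvals)[::-1]               # sort by explained variance
print("explained variance:", np.round(eigvals[order], 2))
print("principal directions:\n", np.round(eigvecs[:, order], 2))

# --- Calculus: gradient descent on the convex loss f(w) = ||Aw - b||^2 ---
A, b = rng.normal(size=(50, 2)), rng.normal(size=50)
w, lr = np.zeros(2), 0.005
for _ in range(1000):
    grad = 2 * A.T @ (A @ w - b)                # analytic gradient
    w -= lr * grad
print("gradient-descent solution:", np.round(w, 4))
print("closed-form least squares: ", np.round(np.linalg.lstsq(A, b, rcond=None)[0], 4))
```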

  • View profile for Avi Chawla

    Co-founder DailyDoseofDS | IIT Varanasi | ex-AI Engineer MastercardAI | Newsletter (150k+)

    172,659 followers

4 techniques to compress ML models for production (actively used to save 1000s of $$ in costs).

Training the best-performing model is just a small part of model building. Much of our engineering effort goes into making the model production-friendly, because the model that gets shipped is NEVER solely determined by performance (a common misconception). Instead, we also consider several operational and feasibility metrics, such as:
- Inference latency
- Model size
- Ease of scalability, etc.

For instance, consider the results in the image below (taken from my personal experiment). It compares the accuracy and size of a large neural network with its pruned (or reduced) versions. Looking at these results, don't you strongly prefer deploying the model that is 72% smaller but still (almost) as accurate as the large model? Which model to proceed with still depends on various business considerations, but in many cases it makes very little sense to deploy the large model when one of its heavily pruned versions performs equally well.

The techniques that help us achieve this are called model compression techniques. Four widely popular ones are:
1) Knowledge distillation: train a large model (the teacher) and transfer its knowledge to a smaller model (the student).
2) Model pruning: remove irrelevant edges and nodes from a network. Three popular pruning techniques are zero pruning, activation pruning, and redundancy pruning.
3) Low-rank factorization: decompose weight matrices into smaller "low-rank" matrices.
4) Quantization: reduce the model's memory usage by storing parameters in a lower-bit representation.

I have linked an article in the comments to dive deeper.

👉 Over to you: assuming you are not dealing with any sensitive use case and cost is a consideration, which model will you deploy—Model A or Model B? Answer using the image below.
____
Find me → Avi Chawla. Every day, I share tutorials and insights on DS, ML, LLMs, and RAG.
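For one of the four techniques, here is a minimal, self-contained sketch of low-rank factorization (not from the post): a weight matrix that is approximately low-rank is replaced by the product of two thin matrices obtained from a truncated SVD. The matrix sizes and the rank are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A weight matrix that is approximately low-rank (as trained dense layers
# often are), plus a little noise.
W = rng.normal(size=(512, 32)) @ rng.normal(size=(32, 512)) \
    + 0.1 * rng.normal(size=(512, 512))

# Truncated SVD: keep only the top-r singular values/vectors.
r = 32
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]          # shape (512, r)
B = Vt[:r, :]                 # shape (r, 512)

# The layer y = W x becomes y = A (B x): two small matmuls instead of one big one.
params_before = W.size
params_after = A.size + B.size
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"parameters: {params_before:,} -> {params_after:,} "
      f"({100 * params_after / params_before:.0f}% of original), "
      f"relative reconstruction error: {rel_err:.3f}")
```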

  • View profile for Elie Bakouch

    ML Research at Prime Intellect

    7,060 followers

Mistral just dropped the Ministral 3 tech report, and it's a great example of how you don't need massive compute to build competitive small models. They trained their 3B/8B/14B models on only 1-3 trillion tokens. The trick? Smart pruning + distillation.

The approach is pretty clean: start with their 24B model, progressively prune it down to smaller sizes, then use the original model as a teacher to recover performance through distillation. Each smaller model is initialized from the pruned weights of the previous one, but they all learn from the same 24B instruct teacher.

The pruning itself is interesting. They prune depth, hidden dim, and FFN dim, each with a different method. For layers, they look at how much each layer transforms its input (the output/input norm ratio). For hidden dimensions, they use PCA to find the important directions, since features aren't axis-aligned, then prune the low-variance ones. For the FFN, they look at the gated activation score, since in SwiGLU a high value can still be killed by a low gate.

They have some cool ablations too. Using an instruct model as teacher works better than a base model for STEM tasks. And they show a capacity-gap effect where a bigger teacher can actually hurt: Medium 3 (a larger model, though its size is not public) as teacher performed worse than Small 3.1 for pretraining, but better for post-training.

Solid paper with good ablations. Bravo Mistral AI!
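As a rough sketch of the depth-pruning idea (my own toy interpretation, not the paper's code): score each block by how strongly it transforms its input, here via the norm of its residual update relative to the input norm, and drop the lowest-scoring blocks. The stand-in "layers" and the exact scoring formula are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(128, 64))            # a batch of hidden states

def make_block(scale):
    """Stand-in for a transformer block: input plus a residual update."""
    W = scale * rng.normal(size=(64, 64)) / np.sqrt(64)
    return lambda x: x + x @ W

blocks = [make_block(s) for s in (1.0, 0.05, 0.8, 0.02, 0.6)]

# Score each block by how much it changes its input: ||output - input|| / ||input||.
scores, x = [], hidden
for block in blocks:
    y = block(x)
    scores.append(float(np.linalg.norm(y - x) / np.linalg.norm(x)))
    x = y

n_drop = 2
keep = sorted(int(i) for i in np.argsort(scores)[n_drop:])   # drop least important
print("block scores:", np.round(scores, 3))
print("blocks kept: ", keep)
```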

  • View profile for Santiago Valdarrama

    Computer scientist and writer. I teach hard-core Machine Learning at ml.school.

    121,950 followers

Fine-tuning a model with just a prompt sounds like a joke until you try it.

Prompt engineering with a general-purpose model can only get you so far. It influences how a model uses its knowledge, but it does not introduce new knowledge into the mix. If you want complete control over the results of your model, you need fine-tuning. But fine-tuning is hard:
• You need a curated dataset (hard)
• You need distributed training pipelines (hard + expensive)
• You need a lot of compute (hard)

Fine-tuning takes time, money, and skill. Most companies are short on all three. Here is where the idea of vibe-tuning comes in.

Vibe-tuning is a method for fine-tuning a small language model using only a natural-language prompt. You describe what you want, and the tuner generates synthetic data, sets up distillation, fine-tunes the model, and evaluates the results.

The first time I heard about this was from DistilLabs. They are currently automating the entire fine-tuning process:
1. You provide a prompt describing the task
2. The platform generates and labels synthetic training data
3. You pick a teacher model (say gpt-oss-120b) and a student model (say llama-3.2-3B)
4. The platform distills, fine-tunes, benchmarks, and delivers a downloadable small language model
5. You deploy this model and start using it right away

The technique builds on model distillation: transferring knowledge from a large "teacher" model to a compact "student" model that's cheaper and faster.

Honestly, this is huge. You can literally teach a model your company's tone, classification rules, or tool-calling logic by writing a few sentences in English.

Here is an article explaining how this works: https://lnkd.in/eDNTBg2F

  • View profile for Aishwarya Srinivasan
    627,879 followers

If you're an AI engineer trying to optimize your LLMs for inference, here's a quick guide for you 👇

Efficient inference isn't just about faster hardware; it's a multi-layered design problem. From how you compress prompts to how your memory is managed across GPUs, everything impacts latency, throughput, and cost. Here's a structured taxonomy of inference-time optimizations for LLMs:

1. Data-Level Optimization
Reduce redundant tokens and unnecessary output computation.
→ Input Compression:
 - Prompt pruning: remove irrelevant history or system tokens
 - Prompt summarization: use model-generated summaries as input
 - Soft prompt compression: encode static context using embeddings
 - RAG: replace long prompts with retrieved documents plus compact queries
→ Output Organization:
 - Pre-structure output to reduce decoding time and minimize sampling steps

2. Model-Level Optimization
(a) Efficient Structure Design
→ Efficient FFN design: use gated or sparsely activated FFNs (e.g., SwiGLU)
→ Efficient attention: FlashAttention, linear attention, or sliding-window attention for long context
→ Transformer alternates: e.g., Mamba, Reformer for memory-efficient decoding
→ Multi/Grouped-Query Attention: share keys/values across heads to reduce KV cache size
→ Low-complexity attention: replace full softmax with approximations (e.g., Linformer)
(b) Model Compression
→ Quantization:
 - Post-training quantization: no retraining needed
 - Quantization-aware training: better accuracy, especially below 8-bit
→ Sparsification: weight pruning, sparse attention
→ Structure optimization: neural architecture search, structure factorization
→ Knowledge distillation:
 - White-box: student learns internal states
 - Black-box: student mimics output logits
→ Dynamic inference: adaptive early exits or skipping blocks based on input complexity

3. System-Level Optimization
(a) Inference Engine
→ Graph & operator optimization: use ONNX, TensorRT, BetterTransformer for op fusion
→ Speculative decoding: use a smaller model to draft tokens, validate with the full model
→ Memory management: KV cache reuse, paging strategies (e.g., PagedAttention in vLLM)
(b) Serving System
→ Batching: group requests with similar lengths for throughput gains
→ Scheduling: token-level preemption (e.g., TGI, vLLM schedulers)
→ Distributed systems: use tensor, pipeline, or model parallelism to scale across GPUs

My Two Cents 🫰
→ Always benchmark end-to-end latency, not just token decode speed
→ For production, 8-bit or 4-bit quantized models with MQA and PagedAttention give the best price/performance
→ If using long context (>64k), consider sliding-window attention plus RAG, not full dense memory
→ Use speculative decoding and batching for chat applications with high concurrency
→ LLM inference is a systems problem. Optimizing it requires thinking holistically, from tokens to tensors to threads.

Image inspo: A Survey on Efficient Inference for Large Language Models
----
Follow me (Aishwarya Srinivasan) for more AI insights!
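To illustrate one system-level item from the taxonomy, here is a toy, greedy-only sketch of speculative decoding (an illustration, not a production implementation): a cheap "draft" model proposes a few tokens, and the "target" model keeps the longest prefix it agrees with. Both models here are random stand-ins, and a real system would batch the verification into a single forward pass.

```python
import numpy as np

VOCAB = 100
rng = np.random.default_rng(0)

def target_logits(ctx):
    """Stand-in for the large target model: deterministic per context."""
    return np.random.default_rng(abs(hash(ctx)) % (2**32)).normal(size=VOCAB)

def draft_logits(ctx):
    """Stand-in for the small draft model: a noisy copy of the target."""
    return target_logits(ctx) + 0.3 * rng.normal(size=VOCAB)

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) < len(prompt) + n_tokens:
        # 1) Draft k tokens cheaply with the small model.
        ctx, draft = list(out), []
        for _ in range(k):
            tok = int(np.argmax(draft_logits(tuple(ctx))))
            draft.append(tok)
            ctx.append(tok)
        # 2) Verify with the target model: keep the matching prefix, then take
        #    the target's own token at the first disagreement.
        ctx = list(out)
        for tok in draft:
            target_tok = int(np.argmax(target_logits(tuple(ctx))))
            if target_tok != tok:
                out.append(target_tok)
                break
            out.append(tok)
            ctx.append(tok)
    return out[len(prompt):len(prompt) + n_tokens]

print(speculative_decode(prompt=(1, 2, 3), n_tokens=12))
```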

  • View profile for Zain Hasan

    I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: Vector DBs, Data Scientist, Lecturer & Health Tech Founder | 🇺🇸🇨🇦🇵🇰

    19,603 followers

An explanation of language model distillation: how it works, why it's useful, and examples of how you can perform distillation.

What is distillation?
Distillation is a model compression technique where a smaller "student" model is trained to mimic the behavior of a larger "teacher" model. This is achieved by transferring knowledge from the teacher to the student, usually through methods like logit-based or hidden-states-based distillation. These methods help the student model replicate the teacher's output distribution or internal representations, often leading to a more efficient model with comparable performance.

When would we use this?
Distillation is commonly used when deploying large models is impractical due to resource constraints, such as in real-time applications or on edge devices. For instance, a smaller student model can be distilled from a powerful teacher like Llama 3.1 405B, retaining much of the original model's capability but with significantly lower computational demands. Distillation is also useful when adapting models to specific tasks or domains, as in domain-specific cases like function calling, where specialized knowledge from a teacher model is transferred to a smaller model for a specific use case.

What's the benefit?
Distillation offers a significant reduction in model size and computational requirements while maintaining a high level of performance. This is especially valuable where memory and processing power are limited. Moreover, distillation allows flexibility in model architecture choices; for example, distilling knowledge from a Llama-3.1-70B model into a much smaller StableLM-2-1.6B model. Distillation methods like those provided in Arcee-AI's DistillKit, including logit-based and hidden-states-based distillation, can lead to substantial performance gains over traditional training routines without requiring additional data.

Examples of distillation techniques:

(1) Logit-based distillation: knowledge is transferred by using both the hard targets (actual labels) and soft targets (teacher logits) to guide the student model. The student is trained to minimize the difference between its output distribution and the teacher's, typically using Kullback-Leibler (KL) divergence. This method is particularly effective for keeping performance close to the teacher model while improving the student's generalization.

(2) Hidden-states-based distillation: here, the focus is on aligning the intermediate-layer representations of the student with those of the teacher. This layer-wise guidance helps the student capture similar features and improves its performance and generalization. It also allows for cross-architecture distillation, enabling knowledge transfer between different model architectures, such as distilling from a Llama-3.1-70B model into a StableLM-2-1.6B model.
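A minimal sketch of the logit-based recipe described in (1), assuming PyTorch: the student loss combines cross-entropy on hard labels with a temperature-scaled KL divergence against the teacher's logits. The temperature, mixing weight, and toy tensors are illustrative assumptions, not values from any particular paper or toolkit.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * ce + (1 - alpha) * kl

# Toy usage with random logits for a batch of 8 examples over a 100-way vocabulary.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(float(loss))
```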

  • View profile for Venkata Naga Sai Kumar Bysani

    Data Scientist | 300K+ Data Community | 3+ years in Predictive Analytics, Experimentation & Business Impact | Featured on Times Square, Fox, NBC

    241,672 followers

90% of ML projects never make it to production. Here's the 8-step framework that works.

𝐒𝐭𝐞𝐩 𝟏: 𝐃𝐞𝐟𝐢𝐧𝐞 𝐭𝐡𝐞 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐏𝐫𝐨𝐛𝐥𝐞𝐦
↳ Start with WHY, not HOW
↳ Is ML even the right solution?
↳ Define success criteria upfront

𝐒𝐭𝐞𝐩 𝟐: 𝐃𝐚𝐭𝐚 𝐂𝐨𝐥𝐥𝐞𝐜𝐭𝐢𝐨𝐧 & 𝐄𝐱𝐩𝐥𝐨𝐫𝐚𝐭𝐢𝐨𝐧
↳ Check data quality: missing values, duplicates, outliers
↳ EDA: distributions, correlations, patterns
↳ Document your data sources and limitations

𝐒𝐭𝐞𝐩 𝟑: 𝐅𝐞𝐚𝐭𝐮𝐫𝐞 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠
↳ Handle missing values (imputation, dropping)
↳ Encode categorical variables
↳ Create new features from domain knowledge
↳ This alone can improve performance by 20-30%

𝐒𝐭𝐞𝐩 𝟒: 𝐓𝐫𝐚𝐢𝐧-𝐓𝐞𝐬𝐭 𝐒𝐩𝐥𝐢𝐭 & 𝐕𝐚𝐥𝐢𝐝𝐚𝐭𝐢𝐨𝐧
↳ Split: 70% train, 15% validation, 15% test
↳ Use stratified splits for imbalanced data
↳ Never touch test data until final evaluation

𝐒𝐭𝐞𝐩 𝟓: 𝐌𝐨𝐝𝐞𝐥 𝐒𝐞𝐥𝐞𝐜𝐭𝐢𝐨𝐧 & 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠
↳ Start simple (logistic regression, decision tree)
↳ Try XGBoost, LightGBM, Random Forest
↳ Track experiments with MLflow or W&B

𝐒𝐭𝐞𝐩 𝟔: 𝐌𝐨𝐝𝐞𝐥 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧
↳ Use appropriate metrics (F1, ROC-AUC, RMSE)
↳ Analyze errors: confusion matrix, feature importance
↳ Does 85% accuracy actually solve the business problem?

𝐒𝐭𝐞𝐩 𝟕: 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
↳ Build an API endpoint (FastAPI, Flask)
↳ Containerize with Docker
↳ Deploy to the cloud (AWS, GCP, Azure)

𝐒𝐭𝐞𝐩 𝟖: 𝐌𝐨𝐧𝐢𝐭𝐨𝐫𝐢𝐧𝐠 & 𝐌𝐚𝐢𝐧𝐭𝐞𝐧𝐚𝐧𝐜𝐞
↳ Track prediction accuracy over time
↳ Monitor for data drift and concept drift
↳ Retrain periodically with fresh data

𝐂𝐨𝐦𝐦𝐨𝐧 𝐏𝐢𝐭𝐟𝐚𝐥𝐥𝐬 𝐭𝐨 𝐀𝐯𝐨𝐢𝐝:
❌ Data leakage (using future info to predict the past)
❌ Ignoring class imbalance
❌ Deploying without monitoring
❌ Optimizing metrics without business context

𝐏𝐫𝐨 𝐭𝐢𝐩: Your first end-to-end project will be messy; that's normal. Focus on completing the full cycle, then iterate.

𝐖𝐚𝐧𝐭 𝐭𝐨 𝐬𝐭𝐚𝐫𝐭 𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐌𝐋? Here are 5 resources I recommend:
1. Machine Learning by Andrew Ng - https://lnkd.in/diqSeD-k
2. Codebasics ML Playlist - https://lnkd.in/dBiYAeN7
3. Krish Naik ML Playlist - https://lnkd.in/dcpAS5gA
4. StatQuest with Joshua Starmer - https://lnkd.in/dhZ3aVhf
5. Sentdex ML Tutorials - https://lnkd.in/dCFPtDv8

Which step do you find most challenging? 👇
♻️ Repost to help someone starting their ML journey
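A minimal sketch of Steps 4-6, assuming scikit-learn and a simple tabular classification task (the dataset and model below are placeholder choices, not part of the post): a stratified 70/15/15 split, a simple baseline, and a test set that is touched only once at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

X, y = load_breast_cancer(return_X_y=True)

# 70% train, 15% validation, 15% test, stratified to preserve class balance.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Start simple: a logistic-regression baseline, checked on the validation set.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print("val F1: ", f1_score(y_val, model.predict(X_val)))

# Touch the test set exactly once, for the final report.
print("test F1:", f1_score(y_test, model.predict(X_test)))
```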

  • View profile for Rishabh Misra

    Principal ML Lead - Generative Personalization | ML Book and Course Author | Researcher - LLMs & RecSys - 1k+ citations | Advisory @ Startups | Featured in TechCrunch, NBC, TheSun | AI Consultant

    6,616 followers

I watched a senior engineer spend three weeks quantizing an LLM to 4-bit. The P99 latency got worse.

The issue wasn't the technique; it was treating quantization as a storage problem instead of a memory-bandwidth problem.

At Twitter, I spent a month debugging why our "optimized" models ran slower than the originals. The models were smaller. The math was correct. Yet latency regressed. The missing piece: the *unpacking tax*.

Here's the reality most benchmarks hide:

Time ≈ Total bytes moved / Memory bandwidth

On paper, moving from FP16 (16-bit) to INT4 (4-bit) means 4× less data moving across the memory bus per token. In a memory-bound regime, that translates to 3–4× higher throughput.

But there's a catch. GPUs don't compute in 4-bit or 8-bit. Those weights are dequantized back to FP16/BF16 in the local cache before computation. That dequantization costs clock cycles and creates production surprises:
→ High batch sizes: time saved on memory movement dominates = throughput improves
→ Batch size of 1: unpacking overhead dominates = latency gets worse

Quantization is not a free win. It's a tradeoff. If you're choosing a method, align it with your deployment reality:
→ GPTQ: effective for static weights, but sensitive to outliers
→ AWQ: preserves critical weights at higher precision for better quality
→ GGUF: excellent for CPU/Metal inference, less relevant for H100/A100 clusters

This is Part 4 of a deep dive into inference optimization. Previous posts:
Memory Wall: https://lnkd.in/gdT26UTV
KV Cache: https://lnkd.in/gKkrqVzf
Paged Attention: https://lnkd.in/gX5JNZhn

Next up: I will break down the closest thing to "cheating physics" in ML - Speculative Decoding.

What's the most expensive quantization mistake you've seen in production - latency, quality, or operability?
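Here is a back-of-envelope sketch of the "time ≈ bytes moved / bandwidth" rule for a memory-bound decode step. The parameter count, bandwidth figure, and especially the dequantization-overhead numbers are illustrative placeholders to show the shape of the tradeoff, not measurements.

```python
# Rough memory-bound decode-time estimate: each generated token has to stream
# the model weights through the memory bus once (batch size 1, no KV-cache term).
PARAMS = 7e9                 # illustrative 7B-parameter model
BANDWIDTH = 2.0e12           # ~2 TB/s, roughly an H100-class accelerator

def token_time_ms(bytes_per_param, dequant_overhead_ms=0.0):
    bytes_moved = PARAMS * bytes_per_param
    return bytes_moved / BANDWIDTH * 1e3 + dequant_overhead_ms

fp16 = token_time_ms(2.0)
# INT4 moves 4x fewer bytes but pays an unpacking cost on every decode step;
# the overhead figures below are arbitrary placeholders to show the effect.
int4_amortized = token_time_ms(0.5, dequant_overhead_ms=0.2)   # large batch
int4_bs1 = token_time_ms(0.5, dequant_overhead_ms=6.0)         # batch size 1

print(f"FP16:               {fp16:.1f} ms/token")
print(f"INT4 (large batch): {int4_amortized:.1f} ms/token")
print(f"INT4 (batch = 1):   {int4_bs1:.1f} ms/token  <- unpacking tax dominates")
```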

  • View profile for Greg Coquillo

    AI Infrastructure Product Leader | Scaling GPU Clusters for Frontier Models | Microsoft Azure AI & HPC | Former AWS, Amazon | Startup Investor | Linkedin Top Voice | I build the infrastructure that allows AI to scale

    228,959 followers

Nothing changed in the product. But the AI bill doubled overnight. That's when most teams learn the hard truth: 𝐭𝐨𝐤𝐞𝐧 𝐮𝐬𝐚𝐠𝐞 𝐝𝐨𝐞𝐬𝐧’𝐭 𝐞𝐱𝐩𝐥𝐨𝐝𝐞 𝐛𝐞𝐜𝐚𝐮𝐬𝐞 𝐨𝐟 𝐨𝐧𝐞 𝐛𝐢𝐠 𝐦𝐢𝐬𝐭𝐚𝐤𝐞, 𝐢𝐭 𝐜𝐫𝐞𝐞𝐩𝐬 𝐢𝐧 𝐭𝐡𝐫𝐨𝐮𝐠𝐡 𝐝𝐨𝐳𝐞𝐧𝐬 𝐨𝐟 𝐬𝐦𝐚𝐥𝐥 𝐨𝐧𝐞𝐬.

Here's a simple breakdown of the core strategies that keep AI systems fast, affordable, and predictable as they scale:

𝐂𝐨𝐬𝐭 𝐑𝐞𝐝𝐮𝐜𝐭𝐢𝐨𝐧 𝐅𝐨𝐜𝐮𝐬
‣ Shorten System Prompts: cut the unnecessary instructions. Smaller system prompts mean lower cost on every single call.
‣ Use Structured Prompts: bullets, schemas, and clear formats reduce ambiguity and prevent the model from generating long, wasteful responses.
‣ Trim Conversation History: only include the parts relevant to the current task. Long-running agents often burn tokens without you noticing.
‣ Budget Your Context Window: divide the context into strict sections so one part doesn't overwhelm the whole window.

𝐋𝐚𝐭𝐞𝐧𝐜𝐲 & 𝐄𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐜𝐲 𝐅𝐨𝐜𝐮𝐬
‣ Compress Retrieved Content: summaries → key chunks → only then full text. This keeps retrieval grounded without ballooning token usage.
‣ Metadata-First Retrieval: start with summaries or metadata; pull full documents only when required.
‣ Replace Text with IDs: instead of resending repeated text, reference IDs, states, or steps.
‣ Limit Tool Output Size: filter tool returns so agents only receive the data they actually need.

𝐂𝐨𝐧𝐭𝐞𝐱𝐭 & 𝐒𝐩𝐞𝐞𝐝 𝐅𝐨𝐜𝐮𝐬
‣ Use Smaller Models Smartly: not every step needs your biggest model. Route simple tasks to lighter ones.
‣ Stop Over-Explaining: if you don't ask for long reasoning, the model won't generate it. Huge hidden token savings.
‣ Cache Stable Responses: if an instruction doesn't change, don't regenerate it. Cache it.
‣ Enforce Max Output Tokens: set strict caps so the model never produces more than required.

Costs rarely spike because AI got more expensive; they spike because your system became less disciplined. Optimizing tokens isn't optional anymore. It's how you build AI products that scale without burning your budget.
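A minimal sketch of two of these disciplines, trimming conversation history to a token budget and capping output tokens, using a crude whitespace word count as a stand-in for a real tokenizer. The budget numbers and the request structure are illustrative assumptions.

```python
def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; good enough to illustrate budgeting.
    return len(text.split())

def trim_history(messages, budget_tokens=1500):
    """Keep the system prompt, then the most recent turns that fit the budget."""
    system, turns = messages[0], messages[1:]
    kept, used = [], approx_tokens(system["content"])
    for msg in reversed(turns):                  # newest turns first
        cost = approx_tokens(msg["content"])
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))

history = [{"role": "system", "content": "You are a concise support assistant."}]
history += [{"role": "user", "content": f"question {i} " * 200} for i in range(20)]

request = {
    "messages": trim_history(history, budget_tokens=1500),
    "max_tokens": 300,        # hard cap on output so responses can't balloon
}
print(len(history), "->", len(request["messages"]), "messages sent")
```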

  • View profile for Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,024 followers

Groundbreaking Research Alert: 4-bit Quantization for RAG Systems

A fascinating new paper from San José State University introduces an innovative approach to optimizing Retrieval-Augmented Generation (RAG) systems through 4-bit quantization of vector embeddings.

>> Technical Deep Dive:
The research tackles a critical challenge in RAG systems: the massive memory requirements for storing high-dimensional embedding vectors. Current top-ranked models on MTEB typically use embedding dimensions between 512 and 4096, consuming substantial memory resources. Consider this: a standard dbpedia dataset with 1M entries and 1536 dimensions requires 6.1 GB of RAM just for the embeddings.

The proposed solution? A sophisticated 4-bit quantization approach that:
- Reduces memory footprint by up to 87.5%
- Maintains search accuracy within 4% of original performance
- Implements group-wise quantization for enhanced precision
- Outperforms the HNSW algorithm in accuracy with group sizes ≤ 128

>> Under the Hood:
The system employs symmetric linear quantization with group-wise processing, where vectors are split into equal-sized groups with individual quantization scales. This approach significantly outperforms traditional Product Quantization methods, maintaining correlation coefficients above 0.82 across multiple semantic textual similarity datasets.

>> Impact:
This breakthrough enables RAG deployment in resource-constrained environments while maintaining high accuracy. The research demonstrates that intelligent quantization can dramatically reduce infrastructure costs without compromising performance.
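A minimal sketch of group-wise symmetric linear quantization of embeddings, assuming NumPy: each vector is split into fixed-size groups, each group gets its own scale, and values are mapped to 4-bit integers. The group size, toy data, and evaluation metric are illustrative choices; the paper's exact scheme may differ in details.

```python
import numpy as np

def quantize_4bit_groupwise(vecs, group_size=128):
    """Symmetric 4-bit quantization with one scale per group of `group_size` dims."""
    n, d = vecs.shape
    assert d % group_size == 0
    groups = vecs.reshape(n, d // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0   # int4 range: -7..7
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    # Stored as int8 here for simplicity; real storage packs two 4-bit values per byte.
    return q, scales

def dequantize(q, scales):
    n, g, s = q.shape
    return (q * scales).reshape(n, g * s)

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 1536)).astype(np.float32)   # toy embedding matrix

q, scales = quantize_4bit_groupwise(emb, group_size=128)
recon = dequantize(q, scales)

# Check how well cosine similarities to a query survive quantization.
query = emb[0]
cos = lambda A, v: (A @ v) / (np.linalg.norm(A, axis=1) * np.linalg.norm(v))
corr = np.corrcoef(cos(emb, query), cos(recon, query))[0, 1]
print(f"similarity correlation after 4-bit quantization: {corr:.3f}")
```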
