Scientific Methodological Standards

Explore top LinkedIn content from expert professionals.

  • View profile for Sara Weston, PhD

    Data Scientist who designs experiments and fixes broken metrics | Causal Inference | 50+ publications, 1 federal policy change | R, SQL

    7,074 followers

    There's a statement about A/B testing that gets repeated so confidently that nobody pushes back on it. It's not wrong, exactly. It's just not as true as people think.

    "Random assignment ensures the groups are equivalent."

    This gets said in every A/B testing primer, every experimentation course, every stakeholder meeting where someone asks "but how do we know the groups are comparable?", and in some of the comments on my LinkedIn posts. And it's... almost true.

    Random assignment ensures the groups are equivalent in expectation, meaning if you repeated the randomization thousands of times, the average difference between groups on any variable would be zero. In stats language, we would say, "There's no systematic bias." Practically, we can be confident that nobody is cherry-picking who gets treatment.

    But you don't run thousands of randomizations. You only run one. And one randomization is one draw from that distribution. Big sample? The draw is almost certainly fine. Small sample? You can get groups that look nothing alike, and the randomization didn't fail. That's just how probability works at small N.

    I ran a simulation. Take 40 people with a known covariate — say, prior engagement score — and randomly split them 20/20. Do it 500 times. Some splits are nearly perfect. Others are off by more than half a standard deviation. Every single one of those is "correctly randomized." Some of them will absolutely give you misleading results if you don't deal with it. Do the same thing at 1,000 per group and the distribution of imbalances basically disappears. That's the Law of Large Numbers doing its thing. But n=20 isn't large, so you can't count on the LLN to save you.

    So what do you actually do when your experiment isn't huge?

    𝗖𝗵𝗲𝗰𝗸 𝗯𝗮𝗹𝗮𝗻𝗰𝗲 𝗯𝗲𝗳𝗼𝗿𝗲 𝘆𝗼𝘂 𝗹𝗼𝗼𝗸 𝗮𝘁 𝗼𝘂𝘁𝗰𝗼𝗺𝗲𝘀. Compare the groups on everything you measured pre-treatment. Age, tenure, prior usage, whatever. If something is meaningfully off, you need to know before you interpret results.

    𝗦𝘁𝗿𝗮𝘁𝗶𝗳𝘆 𝗼𝗿 𝗯𝗹𝗼𝗰𝗸 𝘁𝗵𝗲 𝗿𝗮𝗻𝗱𝗼𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻. If you know certain variables matter, force balance on them upfront. Randomize within strata.

    𝗔𝗱𝗷𝘂𝘀𝘁 𝗳𝗼𝗿 𝗯𝗮𝘀𝗲𝗹𝗶𝗻𝗲 𝗰𝗼𝘃𝗮𝗿𝗶𝗮𝘁𝗲𝘀 𝗶𝗻 𝘁𝗵𝗲 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀. Be smart about using information you already have; it almost always gives you tighter estimates. (But make sure you're not adjusting for colliders or post-treatment variables!)

    And if you ran a small experiment without doing any of this — just be honest about it. The results might be fine. But you're trusting luck more than you think. Randomization solves the systematic bias problem. It doesn't solve the bad luck problem. Those are different things, and small experiments are exactly where the difference shows up.
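    A minimal sketch of the simulation described above, assuming a standard-normal engagement score (the covariate name and the 0.5 SD threshold just mirror the post):

    ```python
    import numpy as np

    rng = np.random.default_rng(42)

    def imbalances(n_per_group, n_sims=500):
        """Standardized mean difference in a pre-treatment covariate
        across repeated, perfectly valid randomizations."""
        diffs = []
        for _ in range(n_sims):
            score = rng.normal(size=2 * n_per_group)   # prior engagement score
            idx = rng.permutation(2 * n_per_group)     # one random 50/50 split
            a, b = score[idx[:n_per_group]], score[idx[n_per_group:]]
            diffs.append((a.mean() - b.mean()) / score.std())
        return np.abs(diffs)

    # Roughly 1 in 10 small splits is badly imbalanced; large splits almost never are.
    print(f"n=20 per group:   {(imbalances(20) > 0.5).mean():.0%} of splits off by >0.5 SD")
    print(f"n=1000 per group: {(imbalances(1000) > 0.5).mean():.0%} of splits off by >0.5 SD")
    ```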

  • View profile for Hao Hoang

    Daily AI Interview Questions | Senior AI Researcher & Engineer | ML, LLMs, NLP, DL, CV, ML Systems | 56k+ AI Community

    55,210 followers

    You're in a final round interview for a Machine Learning Engineer role at Walmart. The interviewer sets a trap: "We have 5 petabytes of transaction history spanning 5 years. Train a model to predict next month's purchases."

    90% of candidates walk right into the trap. They say: "Awesome. More data equals better generalization. I'll ingest the whole 5-year history, feature engineer 𝘙𝘦𝘤𝘦𝘯𝘤𝘺, 𝘍𝘳𝘦𝘲𝘶𝘦𝘯𝘤𝘺, and 𝘔𝘰𝘯𝘦𝘵𝘢𝘳𝘺 𝘷𝘢𝘭𝘶𝘦 (𝘙𝘍𝘔), and train a massive 𝘟𝘎𝘉𝘰𝘰𝘴𝘵 𝘮𝘰𝘥𝘦𝘭." The interviewer stops writing. They just failed.

    Why? Because they assumed the historical logs represent reality. They don't. A 5-year transaction log isn't a complete history. It's a list of survivors. The candidates fell victim to 𝐓𝐡𝐞 𝐒𝐢𝐥𝐞𝐧𝐭 𝐆𝐫𝐚𝐯𝐞𝐲𝐚𝐫𝐝 𝐄𝐟𝐟𝐞𝐜𝐭.

    By training only on transaction logs, your dataset systematically excludes every user who got annoyed and churned over the last five years. They stopped transacting, so they vanished from your logs. Your model is now over-indexing on loyalist behavior and is completely blind to the pre-churn signals of at-risk users. When deployed, it will fail exactly where the business needs it most: retaining wavering customers.

    The Senior Engineer knows that "𝘣𝘪𝘨 𝘥𝘢𝘵𝘢" often means "𝘣𝘪𝘨 𝘣𝘪𝘢𝘴." The fix involves "𝐓𝐢𝐦𝐞-𝐓𝐫𝐚𝐯𝐞𝐥 𝐅𝐞𝐚𝐭𝐮𝐫𝐞 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠":

    1️⃣ You don't take the end-state of 5 years.
    2️⃣ You take a snapshot at T-minus-2 years.
    3️⃣ You identify everyone active then.
    4️⃣ You label them based on whether they made a purchase in the following month, regardless of whether they still exist today.
    5️⃣ You must force the "failures" back into the training distribution.

    𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝: "Historical logs suffer from severe survivorship bias. To predict future purchasing behavior, we cannot just look at retained users. We must explicitly reconstruct historical states to include the 'ghosts', the users who subsequently churned; otherwise, the model will never learn to spot an exit risk."

    #MachineLearning #MLEngineering #DataScience #BigData #FeatureEngineering #XGBoost #SurvivorshipBias
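    A minimal sketch of that snapshot-and-label construction in pandas, under an assumed schema (one row per transaction with `user_id`, `ts`, `amount`) and an illustrative snapshot date and activity window:

    ```python
    import pandas as pd

    txns = pd.read_parquet("transactions.parquet")   # hypothetical path / schema

    snapshot = pd.Timestamp("2023-06-01")            # illustrative T-minus-2-years cutoff
    label_end = snapshot + pd.DateOffset(months=1)

    history = txns[txns["ts"] < snapshot]
    # Cohort = everyone active shortly before the snapshot, including users who
    # later churned and would be invisible in an end-state extract.
    active = history[history["ts"] >= snapshot - pd.Timedelta(days=90)]["user_id"].unique()

    # RFM features computed only from data visible at the snapshot (no leakage).
    feats = (history[history["user_id"].isin(active)]
             .groupby("user_id")
             .agg(frequency=("ts", "size"),
                  monetary=("amount", "sum"),
                  last_seen=("ts", "max")))
    feats["recency_days"] = (snapshot - feats["last_seen"]).dt.days
    feats = feats.drop(columns="last_seen")

    # Label: purchase in the month after the snapshot. The "ghosts" get label 0
    # instead of silently dropping out of the training set.
    buyers = txns[(txns["ts"] >= snapshot) & (txns["ts"] < label_end)]["user_id"]
    feats["label"] = feats.index.isin(buyers).astype(int)
    ```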

  • View profile for Prukalpa ⚡
    Prukalpa ⚡ is an Influencer

    Founder & Co-CEO at Atlan | Forbes30, Fortune40, TED Speaker

    53,922 followers

    15 people sent me the same article in the last 24 hours: OpenAI's announcement of how they built their own internal in-house data agent.

    Why does everyone think I need to see this? Beyond just being interesting, it validates something I've been saying for years: The model isn't the hard part. Context is.

    When we started talking about the idea of context being king for AI at Atlan, people would sometimes respond with blank stares: "Why are you building a context platform? Just plug in GPT." Finally, I can send them this article from OpenAI as a response. As they put it, "CONTEXT IS EVERYTHING. High-quality answers depend on rich, accurate context. Without context, even strong models can produce wrong results, such as vastly misestimating user counts or misinterpreting internal terminology. To avoid these failure modes, the agent is built around multiple layers of context that ground it in OpenAI’s data and institutional knowledge."

    To make their data agent successful, OpenAI needed to unify many different types of context from different sources, both within and beyond their data platform. They call it "multilayered contextual grounding." Here's what that means:

    → Table usage: Going beyond table names to understand how data flows and gets used (e.g. table schemas, relationships, lineage, usage patterns, and historical queries)
    → Human annotations: Pulling from domain-expert knowledge for each table that goes beyond metadata (e.g. semantics, business meaning, and known caveats)
    → Codex enrichment: Examining the code behind each data table to understand insights like scope and granularity, which can highlight important differences between tables that look similar on the surface
    → Institutional knowledge: Pulling context from Slack, Google Docs, and Notion to understand company specifics (e.g. launches, reliability incidents, internal codenames, key metrics)
    → Memory: Saving and learning from prior user corrections and agent discoveries over time via saved, editable memories
    → Runtime context: Live queries to the data warehouse or other data platform systems when context is missing or stale

    Can't wait for the next time someone tells me that context is easy. I'll just send them this article! Great work by Bonnie Xu, Aravind Suresh and Emma Tang.
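    To make "multilayered contextual grounding" concrete, here is a toy sketch of what layering context sources into a single grounding block before the model call can look like. Every name and interface here is illustrative, not OpenAI's implementation:

    ```python
    def build_context(question: str, stores) -> str:
        """Assemble several context layers into one prompt block.
        `stores` is a hypothetical object exposing retrieval helpers."""
        layers = {
            "table usage": stores.catalog.lineage_and_queries(question),
            "human annotations": stores.catalog.expert_notes(question),
            "institutional knowledge": stores.docs.search(question),  # Slack/Docs/Notion
            "memory": stores.memory.recall(question),                 # prior corrections
        }
        # Keep only the layers that returned something, then append the question.
        sections = [f"## {name}\n{text}" for name, text in layers.items() if text]
        return "\n\n".join(sections + [f"## question\n{question}"])
    ```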

  • View profile for Pan Wu
    Pan Wu is an Influencer

    Senior Data Science Manager at Meta

    51,378 followers

    In causal inference, one of the most important decisions is figuring out what to control for. Choosing the right covariates can make or break an analysis. Include too few, and the results may be biased. Include too many—or the wrong ones—and you risk blocking the causal pathway itself.

    In a recent tech blog, the data science team at Booking.com explored this challenge and explained how to identify the right variables when estimating treatment effects. They emphasized that not all covariates play the same role, and understanding these roles is key to drawing valid conclusions.

    Confounders are variables that influence both the treatment and the outcome, and they should be included to reduce bias. Mediators lie along the causal path from treatment to outcome, so controlling for them can remove part of the very effect we want to measure and should therefore be handled with care. There are also treatment-only predictors, which are related to the treatment but not the outcome; outcome-only predictors, which can improve precision without introducing bias; and colliders, which are caused by both the treatment and the outcome.

    By distinguishing among these different types of covariates and investigating how each affects bias and variance through simulation, the team demonstrated that causal inference isn’t just about adding more variables to a model. Thoughtful covariate selection is a crucial step for generating reliable insights and enabling smarter, evidence-based business decisions.

    #DataScience #MachineLearning #CausalInference #Analytics #ABTesting #Experimentation #SnacksWeeklyonDataScience

    – – –

    Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
    -- Spotify: https://lnkd.in/gKgaMvbh
    -- Apple Podcast: https://lnkd.in/gFYvfB8V
    -- Youtube: https://lnkd.in/gcwPeBmR

    https://lnkd.in/gyE_Kr5a
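    A minimal simulation in the spirit of the post (not the Booking.com team's code), showing how controlling for a confounder removes bias while controlling for a collider reintroduces it. The data-generating process and coefficients are illustrative:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, true_effect = 100_000, 1.0

    confounder = rng.normal(size=n)                      # affects treatment AND outcome
    treatment = 0.8 * confounder + rng.normal(size=n)
    outcome = true_effect * treatment + 1.5 * confounder + rng.normal(size=n)
    collider = treatment + outcome + rng.normal(size=n)  # caused by both; never control for it

    def treatment_coef(*controls):
        """OLS coefficient on treatment, controlling for the given covariates."""
        X = np.column_stack([np.ones(n), treatment, *controls])
        beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
        return beta[1]

    print(f"unadjusted:           {treatment_coef():.2f}")                        # ~1.73, confounded
    print(f"control confounder:   {treatment_coef(confounder):.2f}")              # ~1.00, correct
    print(f"+ control collider:   {treatment_coef(confounder, collider):.2f}")    # biased again
    ```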

  • View profile for Dr. Alexander Krannich

    Statistician | Clinical Research Expert

    17,076 followers

    How Many Stratification Factors Are ‘Too Many’ to Use in a Randomisation Plan?

    Stratification is often used in randomized controlled trials. This means that one or more factors are taken into account in the randomization so that their proportion in the groups is approximately equal. For example, using gender as a stratum can ensure that the same proportion of men and women is randomized into both groups. You can see this in the figure when randomized patients are included in the lists one after the other. Occasionally, study center, age, and previous illness are also used as stratification factors.

    However, as the number of strata increases, the number of randomization lists required also increases, depending on the levels of each stratification factor. For gender with 3 levels (male, female, diverse), age with 2 levels (<65, ≥65), and 10 study centers, this results in 3*2*10 = 60 lists. If the number of patients is relatively small, it is possible that some lists are filled only slightly or not at all, which can lead to unbalanced groups and strata.

    This problem is also addressed in the EMA guideline on adjustment for baseline covariates in clinical trials (EMA/CHMP/295050/2013): ‘With an increasing number of strata the chance of empty / infrequently occupied strata increases, thus the targeted treatment allocation within strata might not be achieved.’

    The question is therefore: How Many Stratification Factors Are ‘Too Many’ to Use in a Randomisation Plan? Terry M. Therneau answers this question in a publication with the corresponding title. He states that balance also depends on the method of randomization and that no problems arise with suitable minimization methods as long as the number of factor combinations remains less than n/2.

    Do you have experience with stratified randomization? Any questions?

    #statistics #ClinicalTrials
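    A quick back-of-the-envelope check of the list count from the post and Therneau's n/2 rule of thumb (the patient number here is illustrative):

    ```python
    from math import prod

    factor_levels = {"gender": 3, "age_group": 2, "study_center": 10}
    n_strata = prod(factor_levels.values())
    print(n_strata)  # 3 * 2 * 10 = 60 randomization lists

    n_patients = 100  # illustrative trial size
    # Therneau's rule of thumb: keep the number of factor combinations below n/2.
    verdict = "ok" if n_strata < n_patients / 2 else "too many"
    print(f"{n_strata} strata vs. n/2 = {n_patients / 2:.0f}: {verdict}")
    ```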

  • View profile for Kevin Hartman

    Associate Teaching Professor at the University of Notre Dame, Former Chief Analytics Strategist at Google, Author "Digital Marketing Analytics: In Theory And In Practice"

    24,647 followers

    #ThrowBackThursdayDataviz: A 3D Data Visualization Masterpiece from 1880s Italy

    Long before modern software brought data to life in 3D, Italian statistician Luigi Perozzo created a remarkable stereogram — a three-dimensional population pyramid using Swedish census data from 1750 to 1875. This visualization remains a masterpiece of how innovation and design can transform raw data into a compelling story.

    What are we looking at? This stereogram shows the number of surviving male births across 125 years. Through intersecting gridlines and isometric projections, Perozzo visualized three key variables:
    • Vertical axis: Number of individuals
    • Horizontal axes: Age (from birth to old age) and time (1750–1875)

    This multidimensional view reveals population dynamics over time—showing trends in birth rates, survival rates, and the impact of historical events like wars, famines, and medical advances.

    Why is it groundbreaking?
    1. Temporal and Demographic Insights: The visualization tracks survival rates across different cohorts over 125 years. You can follow how each year's births fared over decades—an innovative early example of time-series analysis in visual form.
    2. Innovation in Design: Perozzo blended art with science, using layered grids and color-coded lines to clarify complex patterns. Key features like "Linea delle Nascite" ("Line of Births") highlight important trends.
    3. Pioneering 3D Visualization: In an era before computers, Perozzo showed how combining dimensions (age, time, and population) could reveal insights beyond traditional 2D charts.

    Historical Significance: Sweden led the way in maintaining systematic population records, and Perozzo's work transformed this data into something revolutionary. His stereogram highlights the rising importance of demography in the late 19th century and data visualization's emerging role in understanding society.

    A Legacy of Innovation: While modern tools like R, Python, Tableau, and Excel make creating visualizations straightforward, Perozzo's stereogram reminds us that data visualization's foundations lie in creativity and purpose. It exemplifies the enduring mission to make data both accessible and meaningful.

    Art+Science Analytics Institute | University of Notre Dame | University of Notre Dame - Mendoza College of Business | University of Illinois Urbana-Champaign | University of Chicago | D'Amore-McKim School of Business at Northeastern University | ELVTR | Grow with Google - Data Analytics #Analytics #DataStorytelling #TBTD

  • View profile for Aishwarya Srinivasan
    Aishwarya Srinivasan is an Influencer
    628,134 followers

    Evaluating LLMs is not like testing traditional software. Traditional systems are deterministic → pass/fail. LLMs are probabilistic → same input, different outputs, shifting behaviors over time. That makes model selection and monitoring one of the hardest engineering problems today.

    This is where Eval Protocol (EP), developed by Fireworks AI, is so powerful. It’s an open-source framework for building an internal model leaderboard, where you can define, run, and track evals that actually reflect your business needs.

    → Simulated Users – generate synthetic but realistic user interactions to stress-test models under lifelike conditions.
    → evaluation_test – pytest-compatible evals (pointwise, groupwise, all) so you can treat model behavior like unit tests in CI/CD.
    → MCP Extensions – evaluate agents that use tools, multi-step reasoning, or multi-turn dialogue via Model Context Protocol.
    → UI Review – a dashboard to visualize eval results, compare across models, and catch regressions before they ship.

    Instead of relying on generic benchmarks, EP lets you encode your own success criteria and continuously measure models against them. If you’re serious about scaling LLMs in production, this is worth a look: evalprotocol.io
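    The pytest analogy is concrete enough to sketch. Below is a generic illustration of treating probabilistic model behavior as a unit test; this is not Eval Protocol's actual API (the `query_model` helper, cases, and thresholds are all hypothetical — see evalprotocol.io for the real interface):

    ```python
    import pytest

    def query_model(prompt: str) -> str:
        """Hypothetical helper: call your deployed LLM and return its text output."""
        raise NotImplementedError  # wire up to your inference endpoint

    CASES = [
        ("How do I get a refund?", "refund"),
        ("Can I return an item after 30 days?", "return"),
    ]

    @pytest.mark.parametrize("prompt,required_term", CASES)
    def test_support_answers_stay_on_topic(prompt, required_term):
        # Probabilistic output, deterministic assertion: sample several times
        # and require the success criterion to hold on a minimum fraction.
        outputs = [query_model(prompt).lower() for _ in range(5)]
        hit_rate = sum(required_term in o for o in outputs) / len(outputs)
        assert hit_rate >= 0.8, f"only {hit_rate:.0%} of samples mentioned '{required_term}'"
    ```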

  • View profile for Woojin Kim
    Woojin Kim is an Influencer

    LinkedIn Top Voice · Chief Strategy Officer & CMIO at HOPPR · CMO at ACR DSI · MSK Radiologist · Serial Entrepreneur · Keynote Speaker · Advisor/Consultant · Transforming Radiology Through Innovation

    11,020 followers

    🚨 Several months ago, I shared my perspective on the paper “Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine” to emphasize the need to avoid testing LLMs and LMMs on web-sourced data, and called for more rigorous research methods to prevent overestimating these models' capabilities.

    🧐 Recently, a paper in Radiology also used the NEJM Image Challenge (the authors referenced the aforementioned paper)—you could already guess some of the things I am going to say in this post. Here’s my take on the paper, with the hope of improving the quality of radiology research using foundation models in the future.

    ❌ The authors stated, “NEJM website was searched.” This poses a significant risk of data contamination. Ironically, they ensured “none of the human readers had experience with the NEJM Image Challenge cases.” This same rigor was not applied to the test set. To minimize data contamination risks, don’t simply copy and use test questions (❗️that also have answers) from the web verbatim. If you can Google it, the LMM you’re using probably has seen it.

    ❌ The authors noted, “LLMs achieved similar accuracies regardless of the image input,” which should have triggered concerns about data contamination and limited vision capabilities. Studies have shown LLM performance declines when evaluation problems are paraphrased or recontextualized. Modifying questions and answers could have helped assess contamination risks. Also, while it may take considerable effort (good research often does), one should try to develop their own test cases. As I discussed during the RSNA 2024 Radiology AI Fireside Chat, while MCQs can serve as a testing method, we need to explore beyond this type of model evaluation methodology, especially in medicine, to better reflect how we practice medicine.

    ❓ What is disappointing is that the authors knew and acknowledged these limitations, which their previous paper on Radiology Diagnosis Please cases also described. Why continue repeating these issues?

    🔍 The use of foundation models in radiology presents exciting opportunities, but to drive the field forward responsibly, we must ensure that evaluations are rigorous and free from contamination.

    🔗 to all mentioned resources are in the first comment. 👇🏼

    #GenAI #radiology #RadiologyResearch #LLMs #LMMs
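    One low-effort contamination probe along the lines the post suggests: compare accuracy on verbatim items versus paraphrased/recontextualized ones; a sharp drop hints at memorization rather than reasoning. A minimal sketch, with `ask_model`, the item format, and the paraphraser all as hypothetical placeholders:

    ```python
    import random

    def ask_model(question: str, choices: list[str]) -> str:
        """Hypothetical wrapper around the LMM under evaluation."""
        raise NotImplementedError

    def accuracy(items, rewrite=lambda q: q):
        """Score the model on MCQ items, optionally rewriting each question."""
        correct = 0
        for item in items:
            choices = item["choices"][:]
            random.shuffle(choices)  # recontextualize: shuffle answer order too
            correct += ask_model(rewrite(item["question"]), choices) == item["answer"]
        return correct / len(items)

    # items: [{"question": ..., "choices": [...], "answer": ...}, ...]
    # A large gap between these two numbers suggests the benchmark leaked
    # into the training data:
    #   verbatim    = accuracy(items)
    #   paraphrased = accuracy(items, rewrite=paraphrase)  # paraphrase() assumed
    ```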

  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    41,888 followers

    Researchers at UC San Diego and Tsinghua just solved a major challenge in making LLMs reliable for scientific tasks: knowing when to use tools versus solving problems directly.

    Their method, called Adapting While Learning (AWL), achieves this through a novel two-component training approach:
    (1) World knowledge distillation - the model learns to solve problems directly by studying tool-generated solutions
    (2) Tool usage adaptation - the model learns to intelligently switch to tools only for complex problems it can't solve reliably

    The results are impressive:
    * 28% improvement in answer accuracy across scientific domains
    * 14% increase in tool usage precision
    * Strong performance even with 80% noisy training data
    * Outperforms GPT-4 and Claude on custom scientific datasets

    Current approaches either make LLMs over-reliant on tools or prone to hallucinations when solving complex problems. This method mimics how human experts work - first assessing if they can solve a problem directly before deciding to use specialized tools.

    Paper https://lnkd.in/g37EK3-m

    —

    Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI http://aitidbits.ai
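    At inference time, the idea reduces to a confidence-gated policy. A tiny sketch in the spirit of AWL, not the paper's implementation (the `answer_with_confidence` API and threshold are hypothetical):

    ```python
    def solve(problem: str, llm, tool, threshold: float = 0.8) -> str:
        """Answer directly when the model is confident; otherwise defer to a tool."""
        answer, confidence = llm.answer_with_confidence(problem)  # hypothetical API
        if confidence >= threshold:
            return answer            # distilled world knowledge is enough
        return tool.run(problem)     # hard case: switch to the specialized tool
    ```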

  • View profile for Milan Janosov

    The New Science of Maps · Geospatial AI Consultant & Educator · Forbes 30U30 · TEDx Speaker · Bestselling Author

    94,512 followers

    As I did my PhD in network science, this field within data science is particularly dear to me. Complementing my two courses << links in comments >>, I am recapping this ever-growing list of network analytics tools. Comment and add your favorite ones! What else would you add?

    𝐏𝐨𝐢𝐧𝐭-𝐚𝐧𝐝-𝐜𝐥𝐢𝐜𝐤 𝐬𝐨𝐟𝐭𝐰𝐚𝐫𝐞:
    - Cytoscape - https://cytoscape.org
    - Gephi - https://gephi.org
    - Graphia - https://graphia.app
    - GraphInsight - https://lnkd.in/d5XnkWJr
    - NodeXL - https://nodexl.com
    - Orange - https://lnkd.in/dZU8Zx3D
    - SemSpect - https://www.semspect.de
    - SocNetV - https://socnetv.org
    - Tulip - https://lnkd.in/dtc_BD33
    - Ucinet - https://lnkd.in/dE8k34v7
    - VOSviewer - https://www.vosviewer.com

    𝐎𝐧𝐥𝐢𝐧𝐞 𝐭𝐨𝐨𝐥𝐬:
    - Gephisto - https://lnkd.in/diSp3BWN
    - Gephi Lite - https://lnkd.in/dHJ3F-r6
    - Kumu - https://kumu.io
    - Graphistry - https://www.graphistry.com
    - Cosmograph - https://lnkd.in/dUBJS4w3

    𝐏𝐲𝐭𝐡𝐨𝐧 𝐥𝐢𝐛𝐫𝐚𝐫𝐢𝐞𝐬:
    - networkx - https://lnkd.in/dKCCXjif
    - graph-tool - https://lnkd.in/dvytUzdu
    - graphviz - https://lnkd.in/d3GqtmQn
    - ipycytoscape - https://lnkd.in/dvTwmySk
    - ipydagred3 - https://lnkd.in/diXgFWMD
    - ipysigma - https://lnkd.in/dP55J5et
    - ipyvolume - https://lnkd.in/dq52_wdr
    - netwulf - https://lnkd.in/dsgKDHPh
    - nxviz - https://lnkd.in/duHbKGPN
    - Py3Plex - https://lnkd.in/dhwe7f_g
    - py4cytoscape - https://lnkd.in/d7NwU8_Y
    - pydot - https://lnkd.in/d8w6VfyP
    - pyGraphistry - https://lnkd.in/dz-NfFf7
    - pygsp - https://lnkd.in/dS7s-A_v
    - python-igraph - https://lnkd.in/dCGsRXh2
    - PyTorch Geometric - https://lnkd.in/duT3y8-U
    - pyvis - https://lnkd.in/duJ5kWAd
    - scikit-network - https://lnkd.in/dKPXenCk
    - SNAP - https://lnkd.in/duM5uHnr
    - visjs.org - https://visjs.org
    - visNetwork - https://zurl.co/zY3O
    - 3D Force-Directed Graph - https://zurl.co/AYks

    More reading: https://lnkd.in/dWYgERtK

    #datascience #networkscience #datavisualization #data #analytics #networkvisualization #ai
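    Since networkx tops the Python list, here is a tiny, self-contained taste of what these libraries do (the toy graph is illustrative):

    ```python
    import networkx as nx

    # Build a small collaboration graph and pull out basic network metrics.
    G = nx.Graph()
    G.add_edges_from([("Ann", "Bob"), ("Bob", "Cat"), ("Cat", "Ann"), ("Cat", "Dan")])

    print(nx.density(G))                     # how connected the graph is overall
    print(nx.degree_centrality(G))           # normalized degree per node
    print(nx.betweenness_centrality(G))      # who brokers between otherwise distant nodes
    print(list(nx.connected_components(G)))  # coarsest community structure
    ```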
