LLM Performance in Modeling Human Opinions

Summary

LLM performance in modeling human opinions refers to how well large language models (LLMs) can mimic, judge, or align with the ways humans form opinions, make decisions, and categorize concepts. Research shows that while LLMs can organize information and make judgments, they differ from humans in reasoning, motivation, and bias, often relying on statistical patterns rather than rich context or personal values.

  • Check for bias: always review LLM-generated opinions and judgments for signs of preference leakage or bias, especially when relying on them for synthetic data or evaluations.
  • Validate with humans: supplement LLM assessments with human input to ensure alignment with real-world values and cultural contexts, reducing the risk of algorithmic monoculture.
  • Diversify data sampling: use varied and negatively-correlated sampling methods to broaden the range of opinions LLMs can learn from, improving their ability to model human preferences.
Summarized by AI based on LinkedIn member posts
  • Valerio Capraro

    Associate Professor at the University of Milan Bicocca

    Major preprint just out! We compare how humans and LLMs form judgments across seven epistemological stages. We highlight seven fault lines, points at which humans and LLMs fundamentally diverge:
    - The Grounding fault: Humans anchor judgment in perceptual, embodied, and social experience, whereas LLMs begin from text alone, reconstructing meaning indirectly from symbols.
    - The Parsing fault: Humans parse situations through integrated perceptual and conceptual processes; LLMs perform mechanical tokenization that yields a structurally convenient but semantically thin representation.
    - The Experience fault: Humans rely on episodic memory, intuitive physics and psychology, and learned concepts; LLMs rely solely on statistical associations encoded in embeddings.
    - The Motivation fault: Human judgment is guided by emotions, goals, values, and evolutionarily shaped motivations; LLMs have no intrinsic preferences, aims, or affective significance.
    - The Causality fault: Humans reason using causal models, counterfactuals, and principled evaluation; LLMs integrate textual context without constructing causal explanations, depending instead on surface correlations.
    - The Metacognitive fault: Humans monitor uncertainty, detect errors, and can suspend judgment; LLMs lack metacognition and must always produce an output, making hallucinations structurally unavoidable.
    - The Value fault: Human judgments reflect identity, morality, and real-world stakes; LLM "judgments" are probabilistic next-token predictions without intrinsic valuation or accountability.
    Despite these fault lines, humans systematically over-believe LLM outputs, because fluent and confident language produces a credibility bias. We argue that this creates a structural condition, Epistemia: linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without actually knowing. To address Epistemia, we propose three complementary strategies: epistemic evaluation, epistemic governance, and epistemic literacy. Full paper in the first comment. Joint with Walter Quattrociocchi and Matjaz Perc.

  • Ravid Shwartz Ziv

    AI Researcher | Meta | NYU | Consultant | LLMs - Memory, World Models, Compression, & Tabular Data

    You know all those arguments that LLMs think like humans? Turns out it's not true 😱 In our new paper we put this to the test by checking if LLMs form concepts the same way humans do. Do LLMs truly grasp concepts and meaning analogously to humans, or is their success primarily rooted in sophisticated statistical pattern matching over vast datasets? We used classic cognitive experiments as benchmarks. What we found is surprising... 🧐
    We used seminal datasets from cognitive psychology that mapped how humans actually categorize things like "birds" or "furniture" ('robin' as a typical bird). The nice thing about these datasets is that they are not crowdsourced; they're rigorous scientific benchmarks.
    We tested 30+ LLMs (BERT, Llama, Gemma, Qwen, etc.) using an information-theoretic framework that measures the trade-off between:
    - Compression (how efficiently you organize info)
    - Meaning preservation (how much semantic detail you keep)
    Finding #1: The Good News. LLMs DO form broad conceptual categories that align with humans significantly above chance. Surprisingly (or not?), smaller encoder models like BERT outperformed much larger models. Scale isn't everything!
    Finding #2: But LLMs struggle with fine-grained semantic distinctions. They can't capture "typicality", like knowing a robin is a more typical bird than a penguin. Their internal concept structure doesn't match human intuitions about category membership.
    Finding #3: The Big Difference. Here's the kicker: LLMs and humans optimize for completely different things.
    - LLMs: aggressive statistical compression (minimize redundancy)
    - Humans: adaptive richness (preserve flexibility and context)
    This explains why LLMs can be simultaneously impressive AND miss obvious human-like reasoning. They're not broken; they're just optimized for pattern matching rather than the rich, contextual understanding humans use.
    What this means:
    - Current scaling might not lead to human-like understanding
    - We need architectures that balance compression with semantic richness
    - The path to AGI ( 😅 ) might require rethinking optimization objectives
    Our paper gives tools to measure this compression-meaning trade-off. This could guide future AI development toward more human-aligned conceptual representations. Cool to see cognitive psychology and AI research coming together! Thanks to Chen Shani, Ph.D., who did all the work, and Yann LeCun and Dan Jurafsky for their guidance
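
    The compression-vs-meaning framing lends itself to a quick experiment. Below is a minimal sketch under stated assumptions (a toy item list, placeholder human labels, an off-the-shelf embedding model), not the paper's actual framework: cluster item embeddings at different granularities (compression) and score how well the clusters recover human category labels (meaning preservation) via adjusted mutual information.

```python
# Sketch: trade-off between compression (fewer clusters) and meaning
# preservation (agreement with human categories). Illustrative only;
# the paper uses its own information-theoretic framework.
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
from sentence_transformers import SentenceTransformer

items = ["robin", "penguin", "sparrow", "chair", "table", "sofa"]
human_labels = [0, 0, 0, 1, 1, 1]  # placeholder human categories: bird vs furniture

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(items)

for k in (2, 3, 4):  # fewer clusters = more aggressive compression
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    ami = adjusted_mutual_info_score(human_labels, pred)
    print(f"k={k}: alignment with human categories (AMI) = {ami:.2f}")
```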

  • Philipp Schmid

    AI Developer Experience at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    How biased are LLMs when you use them for synthetic data generation and as LLM-as-a-Judge for evaluation? Answer: significantly biased. 👀 The "Preference Leakage: A Contamination Problem in LLM-as-a-judge" paper shows that judges prefer their "own" data: responses generated by the same LLM, a model from the same family, or even a previous version of itself.
    Experiments:
    1️⃣ Use an LLM (e.g., GPT-4, Gemini) to generate synthetic responses to a set of prompts (e.g., UltraFeedback).
    2️⃣ Fine-tune different versions of "student" models (e.g., Mistral, Qwen) on the synthetic data.
    3️⃣ Evaluation: use multiple "judge" LLMs to perform pairwise comparisons of these student models on benchmarks (e.g., Arena-Hard, AlpacaEval 2.0).
    4️⃣ Bias: calculate and analyze the Preference Leakage Score (PLS) across different scenarios (same model, inheritance, same family).
    PLS measures how much more often a judge LLM prefers the student model trained on its own data over students trained on another judge's data. If both teachers give similar grades to both students, PLS is low (fair judging); if teachers give better grades to their own students, PLS is high (biased judging).
    Insights:
    💡 LLMs show a bias towards student models trained on data generated by themselves.
    📈 Model size matters: larger models (14B vs 7B) show stronger preference leakage.
    🧪 Supervised fine-tuning (SFT) leads to the highest PLS (23.6%); DPO reduces it (5.2%).
    ❓ PLS is higher in subjective tasks, e.g. writing, compared to objective ones.
    🧑🧑🧒🧒 Relationship bias: same model > inheritance > same family in terms of leakage severity.
    🌊 Data mixing helps but doesn't solve it: even 10% synthetic data shows detectable leakage.
    ✅ Use multiple independent judges and mix with human evaluation.
    Paper: https://lnkd.in/eupf2Vyx Github: https://lnkd.in/eeDdrEXb
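
    To make the idea concrete, here is a minimal sketch of how a leakage signal could be computed from pairwise judge verdicts. The paper defines PLS more precisely; the win counts and the simple difference below are illustrative assumptions.

```python
# Sketch of a preference-leakage check for LLM judges. The paper's PLS
# formula is more involved; this only illustrates the core signal: a
# judge favoring the student trained on its own data more than a rival
# judge does. All win counts are fabricated placeholders.

def win_rate(wins: int, total: int) -> float:
    """Fraction of pairwise comparisons won by student A."""
    return wins / total

# Student A was fine-tuned on data generated by judge "gpt"; student B
# on data from judge "gemini". Both judges compare A vs B head-to-head.
judgments = {
    "gpt":    {"wins": 140, "total": 200},  # gpt judging its own student A
    "gemini": {"wins":  90, "total": 200},  # gemini judging its rival's student A
}

own = win_rate(**judgments["gpt"])
other = win_rate(**judgments["gemini"])

# If both judges were fair, A's win rate would be similar under each.
# A large gap means each judge is inflating its own student's grades.
print(f"own-judge win rate for A:   {own:.2f}")
print(f"other-judge win rate for A: {other:.2f}")
print(f"leakage signal:             {own - other:+.2f}")
```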

  • Shahul Elavakkattil Shereef

    Founder @ Vibrant Labs (YC W24)

    Can we really trust LLMs as judges? AI teams often rely on LLMs as judges. But what if the judges themselves are misaligned with humans? We built a realistic benchmark dataset and tested 9 models from OpenAI, Anthropic, and Google with common optimisation strategies (few-shot, retrieval, prompt tuning) for LLM-as-judge use. The results surprised us:
    1. Bigger reasoning models improved by up to +10 F1 points
    2. Smaller distilled models sometimes got worse after optimisation
    3. Anthropic models were the most stable
    Key takeaway: there's no one-size-fits-all strategy. If you use LLM-as-judges, you need to align, validate, and pick your strategy (see the alignment sketch below).
    📖 Read the blog + paper here: https://lnkd.in/gD6VkNpd
    💻 Code + dataset: https://lnkd.in/g3TdjYG4
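
    One way to quantify the judge-human alignment the post reports in F1 points: treat human accept/reject labels as ground truth and the judge's verdicts as predictions. The labels and verdicts below are illustrative placeholders, not the benchmark's data.

```python
# Sketch: measuring LLM-judge alignment with human labels via F1.
# Placeholder data; the benchmark's protocol and labels differ.
from sklearn.metrics import classification_report, f1_score

# 1 = human rated the answer acceptable, 0 = not acceptable
human_labels   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
# Verdicts parsed from the judge model's outputs on the same answers
judge_verdicts = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

print(f"F1 vs humans: {f1_score(human_labels, judge_verdicts):.2f}")
print(classification_report(human_labels, judge_verdicts,
                            target_names=["reject", "accept"]))
```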

  • Maximilian Nickel

    Research Director at Meta, FAIR | AI ∩ Society ∩ Complex Systems

    🦄 Today we're releasing Community Alignment - the largest open-source dataset to align LLMs with people's preferences in a variety of cultural contexts, containing ~200k comparisons from >3000 annotators in 5 countries and languages! There was a lot of research that went into this... 🧵
    🔍 We started by conducting a joint human study and model evaluation with 15,000 nationally-representative participants from 5 countries & 21 LLMs. We found that the LLMs exhibited an *algorithmic monoculture* where all models aligned with the same minority of human preferences.
    🚫 Standard alignment methods fail to learn common human preferences (as identified from our joint human-model study) from existing preference datasets because the candidate responses that people choose from are too homogeneous, even when they are sampled from multiple models.
    🥭 Intuitively, if all the candidate responses only cover one set of values, then you'll never be able to learn preferences outside of those values. It is like someone asking me to pick between four types of apples: if what I really want is a mango, you won't be measuring that.
    🌈 To produce more diverse candidate sets, rather than independently sampling them, you want some kind of "negatively-correlated (NC) sampling", where sampling one candidate makes other similar ones less likely. Turns out, prompting can implement this decently well, with win rates jumping from random chance to ~0.8 🤡
    💽 Finally, based on these insights we collect and open-source (CC-BY 4.0) the Community Alignment (CA) dataset. Features include:
    - NC-sampled candidate responses
    - Multilingual (64% non-English)
    - >2500 prompts are annotated by >= 10 people
    - Natural language explanations for > 1/4 of choices
    and more!
    This was a big project and collective effort spanning FAIR, AI at Meta, Meta Governance, Meta Policy as well as NYU and Ecole Polytechnique -- major thanks to all the collaborators (see paper) and especially the amazing Smitha Milli and Kris R., who led this project masterfully from start to finish. Also, thanks to Joelle Pineau, Rob Fergus, Stephane Kasriel, and Rob Sherman for their support 🙏
    And this is not the end! 😉 If you want to support us in doing more of these releases, email communityalignment@meta.com (or me) with feedback on what you liked about CA and what you want to see more of.
    Paper: https://lnkd.in/ejJqGQfS Dataset: https://lnkd.in/e5Vp6z2E
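
    A minimal sketch of how prompting might implement NC sampling, per the post's description: generate candidates sequentially, asking each new one to differ from those already produced. The prompt wording and OpenAI client usage are illustrative assumptions, not the paper's implementation.

```python
# Sketch: negatively-correlated (NC) candidate sampling via prompting.
# Each new candidate is asked to differ from the previous ones, so
# similar responses become less likely. Prompt wording and client usage
# here are illustrative assumptions, not the paper's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def nc_sample(question: str, n: int = 4, model: str = "gpt-4o-mini") -> list[str]:
    candidates: list[str] = []
    for _ in range(n):
        prompt = question
        if candidates:
            prior = "\n---\n".join(candidates)
            prompt += (
                "\n\nPrevious answers:\n" + prior +
                "\n\nWrite a new answer that reflects a substantively "
                "different perspective or set of values from all of the above."
            )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        candidates.append(resp.choices[0].message.content)
    return candidates

print(nc_sample("Should schools require uniforms?"))
```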

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    Most enterprise conversations about what LLMs and agents can or cannot do are still driven by anecdotes: "it worked in my pilot" vs "it hallucinated in my demo." This new paper forces a cleaner frame: the core gap is not fluency or even accuracy. It's epistemology, meaning how a system forms, justifies, and regulates "knowledge." The authors claim an LLM can mimic the surface of knowing (answers, explanations, rationales) while lacking the machinery of knowing. That distinction matters because it explains why hallucination is not a corner case. It's a structural consequence of "always produce the next step" even when the right epistemic move is abstention, escalation, or evidence-seeking.
    They also point to a workflow-level shift that's easy to miss: we used to "retrieve then judge." Generative systems collapse retrieval, selection, and explanation into a single authoritative artifact. That collapse changes user behavior. Evaluation becomes optional by default, and plausibility becomes a substitute for verification. Their term for this failure mode is "Epistemia", i.e. the feeling of knowing without the labor of judgment.
    The paper then decomposes the human vs machine gap into seven fault lines:
    1. Grounding: LLMs can speak as if they have access to the world, but they do not carry commitments to a system-of-record unless you force it through tooling and policy.
    2. Parsing: they may miss hidden constraints, because token-level continuity is not the same as constructing a situation model.
    3. Experience: embeddings are not lived priors; the model can be statistically "familiar" without having a robust smell test for what's missing.
    4. Motivation: humans choose objectives and tradeoffs; the model cannot decide what should be optimized unless you specify it.
    5. Causality: coherent narratives are cheap; causal validity is not.
    6. Metacognition: humans can stop; the model is incentivized to continue.
    7. Value and accountability: humans own consequences; the model cannot carry responsibility for harm.
    It is not that LLMs are bad at reasoning. It's that they are good at producing reasons and bad at the epistemic control loop: grounding, calibration, and accountability. And that leads to the real risk: even correct outputs can degrade decision quality if they bypass the evaluation process.
    I think there are three main implications for the future of work:
    1. Output production gets cheap and evaluation becomes scarce.
    2. The org chart shifts from creators to governors.
    3. Agents fit best where tasks are bounded and evidence is accessible.
    The right mental model is a Venn diagram, not replacement. Machines expand throughput and consistency. Humans provide grounding, tradeoffs, counterfactual checks, and ownership. In the overlap you get compounding value: faster cycles without outsourcing judgment. One plus one equals three.
    Where in your workflows is plausibility becoming the decision criterion, and what gates have you designed to prevent it?

  • Jayeeta Putatunda

    Director - AI CoE @ Fitch Ratings | NVIDIA NEPA Advisor | HearstLab VC Scout | Global Keynote Speaker & Mentor | AI100 Awardee | Women in AI NY State Ambassador | ASFAI

    I have been in the NLP space for almost 10 years now, and I know the first-hand challenges of building text-based models in the pre-GPT era! So I am a pro-Large Language Model (LLM) enthusiast, but I don't believe they will replace humans or solve all our problems, especially when it comes to highly complex reasoning in industries like Finance.
    This weekend I read two compelling papers, and I'm convinced we're bumping into real reasoning ceilings:
    I> "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" (Apple)
    Apple researchers rigorously tested Large Reasoning Models (LRMs), LLMs that explicitly generate chain-of-thought reasoning, using controlled puzzles like Tower of Hanoi and River Crossing.
    Key insights:
    1. Three reasoning regimes:
    ▪️ Low complexity: standard LLMs outperform LRMs
    ▪️ Medium complexity: LRMs excel
    ▪️ High complexity: both collapse, accuracy plummets
    2. Fascinating observation: LRMs "give up" as puzzle complexity increases; their reasoning effort declines rapidly, even with enough tokens
    3. Even when provided an exact algorithm (e.g., the Tower of Hanoi strategy), the models still failed to generalize, mostly producing outputs based on patterns observed in their training data
    II> "Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis" (Dimitris Vamvourellis & Dhagash Mehta, Ph.D., BlackRock)
    This study tested major LLMs (GPT-4o, GPT-4.1, o3-mini, FinBERT variants) on financial sentiment classification using:
    - "System 1" (fast/intuitive) prompting
    - "System 2" (slow/deliberate) prompting
    Key takeaways:
    ▪️ Reasoning prompts did not improve performance
    ▪️ Surprisingly, straightforward, intuitive prompts with GPT-4o (no chain-of-thought) outperformed all others
    ▪️ More reasoning led to overthinking, reducing alignment with human-labeled sentiments
    💡 Why it matters for builders and researchers in Finance and every industry:
    ❎ Bigger models + more "thinking" = better outcomes. Sometimes it's actively worse
    ❎ We're not seeing a soft plateau; these are hard ceilings in reasoning capacity
    ❎ For real-world systems, agents, and financial tools: design for reasoning economy, not just reasoning depth.
    #LLMs #ReasoningLimits #LLMChainofthought #LLMReasoningDecline
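
    For concreteness, here is a minimal sketch of what "System 1" vs "System 2" prompts for financial sentiment classification might look like; the wording is an illustrative assumption, not the exact templates from the BlackRock paper.

```python
# Sketch: "System 1" (fast/intuitive) vs "System 2" (slow/deliberate)
# prompting for financial sentiment. Prompt wording is an illustrative
# assumption, not the paper's exact templates.
HEADLINE = "Company X beats earnings estimates but cuts full-year guidance."

SYSTEM1_PROMPT = (
    "Classify the sentiment of this financial headline as positive, "
    f"negative, or neutral. Answer with one word only.\n\n{HEADLINE}"
)

SYSTEM2_PROMPT = (
    "Classify the sentiment of this financial headline as positive, "
    "negative, or neutral. First reason step by step about the "
    "implications for investors, then give your final one-word answer "
    f"on the last line.\n\n{HEADLINE}"
)

# The study's finding: the System 1 style aligned better with
# human-labeled sentiment; extra deliberation led to overthinking.
for name, prompt in [("System 1", SYSTEM1_PROMPT), ("System 2", SYSTEM2_PROMPT)]:
    print(f"--- {name} ---\n{prompt}\n")
```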

  • Bhargav Patel, MD, MBA

    AI x Healthcare | Bridging Medicine & AI for Clinicians, Founders, Engineers & Health Systems | Physician-Innovator | Medical AI Research | Psychiatrist | Upcoming Books: Trauma Transformed & Future of AI in Healthcare

    LLMs scored 95% on identifying medical conditions when tested alone. When real people used them for medical advice, accuracy dropped to 35%.
    A new randomized study in Nature Medicine tested whether large language models actually help the public make better medical decisions. 1,298 participants were given medical scenarios and asked to identify conditions and recommend next steps. GPT-4o, Llama 3, and Command R+ all performed well when directly prompted. They identified relevant conditions in 94.9% of cases and recommended the correct disposition in 56.3% on average.
    But when participants used these same models for assistance, condition identification dropped below 34.5% and disposition accuracy fell to 44.2% (no better than the control group using search engines).
    The gap wasn't medical knowledge. It was interaction. Researchers analyzed conversation transcripts and found users provided incomplete information to models. Models sometimes misinterpreted context or gave inconsistent advice. Even when models suggested correct conditions, users didn't consistently follow recommendations.
    Standard medical benchmarks didn't predict this. Models achieved passing scores (>60%) on MedQA questions matched to the scenarios but still failed in interactive testing. Performance on structured exams was largely uncorrelated with performance with real users.
    Simulated patient interactions didn't predict it either. When researchers replaced humans with LLM-simulated users, simulated users performed better (57.3% vs 44.2%) and showed less variation. Simulations were only weakly predictive of human behavior.
    Here's what this means: benchmark performance is necessary but insufficient. A model scoring 80% on medical licensing exams can produce 20% accuracy when paired with real users. The constraint isn't algorithmic capability. It's human-AI interaction design. Users don't know what information to provide. Models don't ask the right clarifying questions. Correct suggestions get lost in conversation.
    For clinicians: expect patients to arrive with AI-informed conclusions that may not be accurate. Patients using LLMs were no better at assessing clinical acuity than those using traditional methods.
    For developers: user testing with real humans must precede deployment. Simulations and benchmarks don't capture interaction failures.
    AI excels at medical exams. But medicine isn't a multiple-choice test. It's a conversation under uncertainty.
    Source: Nature Medicine, "Reliability of LLMs as medical assistants for the general public"

  • Jared Feldman

    Entrepreneur • Operator • Investor • Strategic Advisor

    LLMs can now predict human purchase intent with over 90 percent of the maximum achievable accuracy. That number should make research teams pause.
    A new paper shows that the gap between synthetic and human survey data is not about model intelligence. It is about how we ask the question. Most LLM-based survey simulations force models to pick a Likert number. The result is predictable and wrong. Ratings collapse toward the middle and lose the distributional shape you see in real consumer data.
    This research uses a different approach called Semantic Similarity Rating. Instead of asking for a number, the model explains why it would or would not buy. That text is embedded and compared to anchor statements that represent each Likert point, producing a score that preserves variance and intent.
    The outcome is synthetic panels that closely match real human purchase intent rankings, without fine-tuning on consumer data and with qualitative rationale generated by default.
    The implication is bigger than faster surveys. It suggests the bottleneck in AI-driven insights has been elicitation, not capability. When you ask models to reason in language and only translate to numbers afterward, you get something far closer to human behavior. This is the direction synthetic research needs to move if we want to trust it.
    For the curious, full paper is here: https://lnkd.in/gwvXUphw
    cc Kristi Zuhlke Doug Guion Dana Kim Patricia King, CE. Phil Ahad Seth Hardy
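
    A minimal sketch of the Semantic Similarity Rating idea as described above: embed the free-text rationale, compare it to anchor statements for each Likert point, and convert the similarities into an expected score. The anchor wording, embedding model, and softmax weighting are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of Semantic Similarity Rating (SSR): map a free-text purchase
# rationale onto a 5-point Likert scale by similarity to anchor texts.
import numpy as np
from sentence_transformers import SentenceTransformer

ANCHORS = {
    1: "I would definitely not buy this product.",
    2: "I would probably not buy this product.",
    3: "I might or might not buy this product.",
    4: "I would probably buy this product.",
    5: "I would definitely buy this product.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")

def ssr_score(rationale: str, temperature: float = 0.1) -> float:
    texts = [rationale] + list(ANCHORS.values())
    emb = model.encode(texts, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]            # cosine similarity to each anchor
    weights = np.exp(sims / temperature)
    weights /= weights.sum()           # softmax over the five anchors
    # Expected Likert point, preserving variance instead of collapsing
    # to a single forced number.
    return float(weights @ np.array(list(ANCHORS.keys())))

print(ssr_score("The price is steep, but it solves a daily problem for me, "
                "so I'd likely pick it up."))
```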

  • Suki W.

    Data Scientist at Roblox & Faculty at USC | ex-Meta, National Science Foundation fellow | Educate & Inspire

    💡 Why LLM Evals Should Look More Like Human Rater Evals.
    When we evaluate large language models (LLMs) as judges, the conversation often stops at "do they agree with humans?". But in psychometrics and education, we know that's far from enough.
    In human assessment, scores are not just numbers; they are latent variables that combine many factors: task difficulty, rater severity, and even biases in context. We model these explicitly through frameworks like IRT and many-facet Rasch, because we know:
    - Consistency (reliability): raters must be calibrated to stay stable across time and items.
    - Difficulty: some prompts are easy, others are hard. Some raters are lenient, others strict. Ignoring this masks true ability.
    - Bias & fairness: every evaluation system needs safeguards against systematic bias.
    LLMs as evaluators behave no differently. Their "judgments" are multi-faceted: influenced by prompt difficulty, role bias, temperature, and context. That means a single agreement score hides what really matters.
    👉 Instead, we should treat LLM evaluations as latent networks of interacting factors:
    - Use network loadings and structural stability to pinpoint which prompts or categories drive divergence.
    - Combine psychometric structure (reliability, difficulty, fairness) with granular error audits that reveal when and why judgments break down.
    Because whether it's a teacher grading essays or an LLM scoring outputs, the same principle holds:
    ⚖️ It's not about who assigns the score, but how consistent, fair, and construct-valid that score truly is.
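
    For readers unfamiliar with the many-facet Rasch framing mentioned above, here is a minimal sketch of its core equation applied to LLM judges: the probability of a positive judgment depends jointly on response quality, prompt difficulty, and judge severity. The parameter values below are placeholders; in practice they are estimated from data.

```python
# Sketch: a many-facet Rasch view of LLM-as-judge scores. The probability
# that judge j passes response i on prompt p is modeled as
#   P(pass) = sigmoid(theta_i - b_p - c_j)
# where theta = response quality, b = prompt difficulty, c = judge severity.
import math

def p_pass(theta: float, difficulty: float, severity: float) -> float:
    """Rasch-style probability of a positive judgment."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty - severity)))

theta = 1.0                                 # latent quality of the response
prompts = {"easy": -1.0, "hard": 1.5}       # prompt difficulty b_p
judges = {"lenient": -0.5, "strict": 0.8}   # judge severity c_j

# The same response gets very different raw pass rates depending on the
# prompt and the judge: exactly why a single agreement score is not enough.
for pname, b in prompts.items():
    for jname, c in judges.items():
        print(f"{pname} prompt, {jname} judge: P(pass) = {p_pass(theta, b, c):.2f}")
```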
