We are happy to share the results of our exhaustive benchmarking study on forecasting models, in which we assessed 87 models across 24 varied datasets. The project evaluated univariate forecasting models ranging from naive baselines to sophisticated neural networks, using a comprehensive set of metrics: RMSE, RMSSE, MAE, MASE, sMAPE, WAPE, and R-squared.

The 24 datasets cover a wide range of frequencies: hourly (4 datasets), daily (5), weekly (2), monthly (4), quarterly (2), and yearly (3), plus 4 synthetic datasets without a specific frequency. Some datasets also contain covariates (exogenous features) of a static, past, and/or future nature.

For each model, we aimed to identify hyperparameters that were effective globally, across all datasets. Dataset-specific hyperparameter tuning for each model was not performed due to budget constraints on this project. We use a simple train/test split along the temporal dimension, ensuring models are trained on historical data and assessed on unseen future data.

The attached chart shows a heatmap of the average RMSSE scores for each model, grouped by dataset frequency. The results are filtered to 43 models for brevity, excluding noticeably inferior models and redundant implementations. RMSSE is a scaled version of RMSE: a model's RMSE is divided by the RMSE of a naive model, so lower scores are better, and a score of 1.0 indicates performance on par with the naive baseline (a short sketch of the computation follows at the end of this post).

Key findings:
- Machine-learning dominance: Extra trees and random forest models demonstrate the best overall performance.
- Neural network success: Variational Encoder, PatchTST, and MLP emerged as the top neural network models, with Variational Encoder showing the best results, notably aided by pretraining on synthetic data.
- Efficacy of simplicity: DLinear and Ridge regression models show strong performance, highlighting their efficiency in specific contexts.
- Statistical models' relevance: TBATS stands out among statistical models for its forecasting accuracy.
- Yearly datasets insight: On yearly datasets, none of the advanced models surpassed the naive mean model, highlighting the difficulty of forecasting series that lack conspicuous seasonal patterns.
- Pretraining advantage: The improvement in models like Variational Encoder and NBeats from pretraining on synthetic data suggests a promising avenue for enhancing neural networks' forecasting abilities.

All models and datasets are open-source. For a detailed examination of models, datasets, and scores, visit https://lnkd.in/d6mMSudJ. Registration is free, requiring only your email. Our platform is open to anyone interested in benchmarking their models. Any feedback or questions are welcome. Let's raise the state of the art in forecasting!
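For readers who want to reproduce the headline metric, below is a minimal sketch of RMSSE as described in the post. The function name and the choice of a one-step naive forecast scaled on the training series are our assumptions; the study's exact implementation may differ.

```python
import numpy as np

def rmsse(y_train, y_test, y_pred):
    """Root Mean Squared Scaled Error (lower is better).

    RMSE of the forecast on the test window, divided by the
    in-sample RMSE of a one-step naive forecast (each value
    predicted by its predecessor). A score of 1.0 means parity
    with the naive baseline.
    """
    y_train, y_test, y_pred = map(np.asarray, (y_train, y_test, y_pred))
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
    naive_rmse = np.sqrt(np.mean(np.diff(y_train) ** 2))
    return rmse / naive_rmse
```

Scaling by the naive error makes scores unit-free and comparable across series of different magnitudes, which is what allows averaging over 24 heterogeneous datasets in a single heatmap.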
Benchmarking Studies
Explore top LinkedIn content from expert professionals.
Summary
Benchmarking studies are systematic comparisons that measure and evaluate the performance of methods, models, or technologies using standardized datasets and metrics. These studies help organizations and researchers identify strengths, weaknesses, and improvement areas by providing clear, data-driven insights.
- Compare consistently: Use standardized benchmarks and data splits to ensure fair and meaningful comparisons between models or technologies.
- Recognize limitations: Be mindful of variability and uncertainty in benchmark standards, as these can affect how results are interpreted and validation targets are set.
- Drive progress: Utilize open-source datasets and transparent methodologies to focus community efforts and support reproducible research across fields like forecasting, finance, clinical trials, genomics, and toxicology.
> Sharing Resource < Interesting benchmark for finance: "Quantum vs. Classical Machine Learning: A Benchmark Study for Financial Prediction" by Rehan Ahmad, Muhammad Kashif, Nouhaila I., Muhammad Shafique

Abstract: In this paper, we present a reproducible benchmarking framework that systematically compares QML models with architecture-matched classical counterparts across three financial tasks: (i) directional return prediction on U.S. and Turkish equities, (ii) live-trading simulation with Quantum LSTMs versus classical LSTMs on the S&P 500, and (iii) realized volatility forecasting using Quantum Support Vector Regression. By standardizing data splits, features, and evaluation metrics, our study provides a fair assessment of when current-generation QML models can match or exceed classical methods. Our results reveal that quantum approaches show performance gains when data structure and circuit design are well aligned. In directional classification, hybrid quantum neural networks surpass the parameter-matched ANN by +3.8 AUC and +3.4 accuracy points on AAPL stock and by +4.9 AUC and +3.6 accuracy points on the Turkish stock KCHOL. In live trading, the QLSTM achieves higher risk-adjusted returns in two of four S&P 500 regimes. For volatility forecasting, an angle-encoded QSVR attains the lowest QLIKE on KCHOL and remains within ~0.02-0.04 QLIKE of the best classical kernels on the S&P 500 and AAPL. Our benchmarking framework clearly identifies the scenarios where current QML architectures offer tangible improvements and where established classical methods continue to dominate.

Link: https://lnkd.in/e4WUdr-n

#quantummachinelearning #machinelearning #research #paper #benchmark #finance
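As context for the volatility-forecasting result above: QLIKE is a standard robust loss for evaluating variance forecasts. Here is a minimal sketch using one common parameterization (the Patton form); the paper may use a variant, so treat this as illustrative:

```python
import numpy as np

def qlike(realized_var, forecast_var):
    """Average QLIKE loss for variance forecasts (lower is better).

    Patton-form loss: log(h) + r / h, with h the forecast variance
    and r the realized variance. Heavily penalizes under-forecasting
    volatility and is robust to noise in the realized-variance proxy.
    """
    r = np.asarray(realized_var, dtype=float)
    h = np.asarray(forecast_var, dtype=float)
    return float(np.mean(np.log(h) + r / h))
```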
Machine learning research requires benchmarks. Clinical trial outcome prediction has lacked one. Different papers use different data, different labels, different evaluation protocols. Results aren't comparable. Progress is hard to measure. We're releasing CTO, a comprehensive benchmark covering ~125,000 drug and biologics trials.

Why benchmarks matter: ImageNet transformed computer vision. GLUE transformed NLP. Standardized benchmarks enable apples-to-apples comparison and focus community effort on hard problems.

What CTO provides: multi-source outcome labeling from four signals.
- LLM interpretations: we processed trial publications and extracted outcomes using structured prompting.
- Phase progression tracking: trials that advance to the next phase succeeded (mostly).
- News sentiment analysis: FDA decisions, partnership announcements, termination reports.
- Stock price movements: market reactions encode collective intelligence about trial prospects.

The methodology insight: no single signal is perfectly reliable. Publication bias affects papers. Phase progression has exceptions. News coverage is selective. We combine signals using evidential reasoning: each source contributes evidence with associated uncertainty, and multiple confirming sources increase confidence (a purely illustrative sketch follows at the end of this post).

Why this enables progress: researchers can now train on consistent data, evaluate on held-out sets, and compare methods fairly.

Code: https://lnkd.in/gUAvp-Ws
Paper: https://lnkd.in/gDZ7EuMN

#OpenScience #ClinicalTrials #MachineLearning #Benchmarks #DrugDevelopment #ReproducibleResearch
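The post does not spell out the exact fusion rule, so the following is only an illustrative sketch of one way such evidential combination could look, using reliability-weighted log-odds pooling. All names, weights, and probabilities are hypothetical; this is not the CTO methodology.

```python
import math

def fuse_signals(signals, prior=0.5):
    """Fuse noisy outcome labels into a single success probability.

    signals: list of (p, reliability) pairs, where p is one source's
    probability that the trial succeeded and reliability in (0, 1]
    down-weights less trustworthy sources. Agreeing, reliable sources
    push the posterior further from the prior. Illustrative only;
    NOT the CTO fusion rule.
    """
    eps = 1e-6

    def logit(p):
        return math.log((p + eps) / (1 - p + eps))

    total = logit(prior) + sum(
        w * (logit(p) - logit(prior)) for p, w in signals
    )
    return 1 / (1 + math.exp(-total))

# Example: publication suggests success (p=0.9, reliability 0.7),
# trial advanced a phase (p=0.8, reliability 0.9),
# news sentiment neutral (p=0.5, reliability 0.4).
print(round(fuse_signals([(0.9, 0.7), (0.8, 0.9), (0.5, 0.4)]), 3))  # ~0.942
```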
🚀 How good is good enough for long-read sequencing in clinical microbiology? 🧬 🧫

➡️ Whole-genome sequencing is now routine in many diagnostic labs, but are all sequencing platforms equally fit for purpose? And where does Oxford Nanopore Technologies (ONT) really stand in 2026? In our new paper, we systematically benchmark Illumina vs. ONT for bacterial WGS across ESKAPE pathogens and ATCC strains, focusing on what actually matters in clinical microbiology: base accuracy, genome assembly quality, AMR gene detection, and outbreak analysis (cgMLST).

🔍 What we found (short teaser):
- Illumina still delivers the highest base accuracy (conventionally quoted on the Phred scale; see the sketch after this post).
- Modern ONT chemistries (R10.4.1) with SUP basecalling (Dorado/Rerio) now produce highly contiguous genomes and resolve complex regions (e.g. rRNA operons) far better.
- Hybrid assemblies remain the most robust solution when accuracy and completeness are critical.
- For cgMLST and outbreak detection, ONT performance is species-dependent: good enough for some pathogens, still risky for others.

💡 Why this matters: If you are using, or planning to use, ONT in routine diagnostics, surveillance, or outbreak investigations, this study provides practical, evidence-based guidance on where ONT already works well, where caution is needed, and why validation must be species-specific.

📖 The full benchmarking study is now available: https://lnkd.in/ecsidnu9

Special thanks to Srinithi Purushothaman (who did this work as part of her PhD thesis!), and also a special thanks to Schweizerischer Nationalfonds SNF for funding!

👉 If you’re running a clinical genomics lab, working on AMR surveillance, or wondering whether ONT-only workflows are ready for prime time, this paper is for you.
👉 Curious to hear: Where do you see ONT making the biggest impact in your lab right now - speed, completeness, or decentralisation?

#Innovation #Research #Diagnostics #Sequencing #Microbiology #teamUZH #QC #benchmark #Illumina #ONT #AMR #typing
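As a quick reference for the base-accuracy comparisons above: per-base accuracy is conventionally reported as a Phred-scale quality value, and the conversion is a one-liner. This is a generic sketch, not code from the paper:

```python
import math

def phred_qv(error_rate):
    """Convert a per-base error rate to a Phred-scale quality value.

    QV = -10 * log10(error_rate): QV20 = 1 error in 100 bases,
    QV30 = 1 in 1,000, QV40 = 1 in 10,000.
    """
    return -10 * math.log10(error_rate)

print(phred_qv(0.001))  # 30.0
```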
We ask NAMs to prove they are "equivalent or better" than animal studies. But equivalent or better than what, exactly?

Karmaus, Nicole Kleinstreuer, and colleagues just published the first comprehensive review of replicability across in vivo toxicological guideline studies. The paper compiles decades of retrospective analyses: ocular irritation, dermal sensitization, acute lethality, repeated dose, carcinogenicity, neurotoxicity, genotoxicity.

The numbers deserve attention:
- Draize rabbit eye irritation: GHS Category 2B classification replicated only 16% of the time.
- Carcinogenicity between species (rat and mouse): 36% replicability.
- Acute oral lethality categories: as low as 49%.
- Subchronic and chronic organ-level concordance: 38.5% to 90%, depending on organ and species.
- DNT motor activity in negative controls: coefficients of variation from 20% to 140% (a short sketch of the CV calculation follows this post).

These are not outliers. These are the guideline studies we use to make regulatory safety decisions. The same studies we use as the gold-standard benchmark when we evaluate whether a NAM is good enough.

The authors make a critical point. Study variability and the uncertainty it introduces are not typically recognized in current regulatory frameworks. Uncertainty factors may indirectly absorb some of this variance. But the variance itself had never been systematically characterized across study types until now.

This matters for one specific reason. If the benchmark is variable, the validation target is variable. You cannot hold a NAM to a standard of precision that the reference method does not meet. That is not scientific rigor. That is asymmetric burden of proof.

The paper also identifies where variability comes from: species, strain, vehicle, dose spacing, endpoint selection, protocol flexibility allowed within OECD test guidelines, and biological variability that no protocol can eliminate.

None of this means animal studies are useless. It means we have been operating without a quantified baseline for the methods we trust most. And that baseline, now that we have it, should recalibrate how we design validation frameworks for NAMs.

This is exactly the kind of work the field needed. Not advocacy. Not ideology. Data.

Full reference: Karmaus et al. (2026) Perspectives on variability of in vivo toxicology studies: considerations for next-generation toxicology. Frontiers in Toxicology, 8:1778353.

#NAMs #RegulatoryScience #Toxicology #AnimalResearch #Replicability #DigitalBiomarkers #PreclinicalResearch #DrugDevelopment
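For readers unfamiliar with the statistic behind the 20% to 140% figure: the coefficient of variation is the standard deviation expressed as a percentage of the mean. A minimal sketch with made-up replicate values (not data from the paper):

```python
import numpy as np

def coefficient_of_variation(values):
    """CV as a percentage: sample standard deviation over the mean.

    A CV of 100% means the spread between replicates is as large
    as the average signal itself.
    """
    values = np.asarray(values, dtype=float)
    return 100 * values.std(ddof=1) / values.mean()

# Hypothetical motor-activity counts from negative-control groups
print(round(coefficient_of_variation([120, 95, 210, 60, 150]), 1))  # ~44.9
```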
#AI benchmarking conversations I have often turn into a false choice: #Synthetic vs. #real_world. The truth? Both matter.

At Dell Technologies, our TME labs not only analyze benchmarks but also work directly with customers’ production environments to see real-world challenges.

Synthetic benchmarks show what the infrastructure can do under ideal conditions. They expose architectural limits, scaling characteristics, peak throughput, and theoretical efficiency. That’s important. It tells you the ceiling.

But production environments don’t operate at the ceiling. Real-world benchmarking shows what the system will do under concurrency, mixed workloads, thermal pressure, network contention, and operational overhead. That’s where latency, sustained throughput, and cost per token actually determine business value.

One shows maximum capability. The other reveals operational truth. If you’re evaluating AI infrastructure and only looking at peak tokens per second, you’re missing half the story (a minimal measurement sketch follows this post).

The best enterprise AI strategies validate both:
• Lab performance to understand headroom
• Production performance to understand reality

Peak numbers sell slides. Sustained performance drives outcomes.

#AI #AIBenchmarking #EnterpriseAI #Infrastructure #DataCenter #IWork4Dell
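To make the peak-versus-sustained distinction concrete, here is a minimal sketch of how one might measure both against any text-generation callable. The `generate` interface is a hypothetical placeholder, not a Dell tool, and a sequential loop like this only approximates production concurrency:

```python
import time

def measure_throughput(generate, prompts, warmup=3):
    """Contrast peak vs. sustained tokens/sec for a generate() callable.

    generate(prompt) must return the number of tokens produced.
    Peak = best single-request rate (the "ceiling").
    Sustained = total tokens / total wall time over the whole run,
    which absorbs queuing, contention, and other overhead.
    """
    for p in prompts[:warmup]:      # warm caches before timing
        generate(p)
    per_request_rates = []
    total_tokens = 0
    t_start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        n_tokens = generate(p)
        per_request_rates.append(n_tokens / (time.perf_counter() - t0))
        total_tokens += n_tokens
    sustained = total_tokens / (time.perf_counter() - t_start)
    return max(per_request_rates), sustained
```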
FDIC data shows insured deposits grew just 0.5% in 2024. But some banks still managed 5× that growth. Same economy. Same rate environment. Vastly different outcomes. So what separates the winners from the underperformers?

After analyzing hundreds of institutions through our Infusion Norm benchmarking, three patterns stand out:

1. Winners track dollars, not impressions. They can tell you - down to the account - what last quarter’s marketing delivered in deposits. They measure cost per funded dollar, not just cost per lead. Every dollar spent ties back to balance-sheet impact. Underperformers? They celebrate impressions and clicks, or ignore marketing measurement entirely. Campaigns “look good” on slides, but deposits quietly flow to competitors who offer measurable value.

2. Winners concentrate spend on proven channels. They double down on what works. They use first-party data and actual behavior insights to reach the right households without repricing the entire book. Underperformers? They spread budget like peanut butter - billboards, shiny objects, broad digital - without attribution or optimization.

3. Winners obsess over retention. In our Infusion Norms dataset, top-quartile programs show double-digit year-1 and high single-digit year-2 retention lifts. They know acquisition is just the start - profitable growth comes from nurturing relationships. Underperformers? They count front-door wins. New accounts look great until 40%+ vanish within 12 months.

Here’s the quick test for your bank (a quick sketch of the arithmetic follows this post):
- Can you calculate cost per deposit dollar in under 5 minutes?
- Do you know your marketing-attributed growth last quarter?
- Can you benchmark campaign-specific performance against peers?

If not, you’re already behind. And the pressure is mounting: ~46% of banks expect flat or declining marketing budgets, and small to mid-sized banks typically allocate only around 2.7%–2.9% of non-interest expense to marketing. With less spend and more scrutiny, ROI is non-negotiable.

Benchmarking shows you where you stand. But knowing you’re behind doesn’t close the gap. At Infusion Marketing, we’ve helped move banks from bottom-quartile to top-quartile performance by applying these differentiators. One regional bank uncovered a $264M gap in money market share versus peers. With a precise plan, they closed it in 18 months - without repricing their book. That’s the power of benchmarking paired with disciplined execution. Our pay-for-performance model ensures accountability: no balance-sheet growth, no fee.

Curious where your bank really stands? Reach out to me - we’ll review your position and map a path to top-quartile growth.
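The "quick test" above is simple arithmetic once marketing spend can be attributed to funded balances. A minimal sketch; the numbers and the retention adjustment are illustrative, not Infusion's model:

```python
def cost_per_funded_dollar(spend, funded_deposits, retention_12m=1.0):
    """Marketing cost per dollar of funded deposits.

    Optionally discount by 12-month retention, since balances that
    vanish within a year are not durable growth.
    """
    return spend / (funded_deposits * retention_12m)

# $250k spend, $10M funded, 60% of balances still on the books at month 12
print(cost_per_funded_dollar(250_000, 10_000_000, 0.60))  # ~0.0417
```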