Pedro Alves

San Mateo, California, United States
24K followers 500+ connections

About

Chief Technology Officer @ Thoth AI | Building the world’s most advanced learning…

Experience

  • Thoth AI

    San Francisco Bay Area

  • Earlier roles (titles and dates not shown): San Mateo, California; San Francisco Bay Area; Redwood City, CA; Lenexa, KS

Education

  • Yale University

    Working in various areas within the field of computational biology, such as gene networks, gene expression data, regulatory networks, and single nucleotide polymorphisms. Focusing on innovative approaches to machine learning, combining graph-analysis information with predictive modeling, ensemble learning, and network analysis.

Publications

  • Multiple-Swarm Ensembles: Improving the Predictive Power and Robustness of Predictive Models and Its Use in Computational Biology

    IEEE/ACM Transactions on Computational Biology and Bioinformatics

    Machine learning is an integral part of computational biology, and has already shown its use in various applications, such as prognostic tests. In the last few years in the non-biological machine learning community, ensembling techniques have shown their power in data mining competitions such as the Netflix challenge; however, such methods have not found wide use in computational biology. In this work we endeavor to show how ensembling techniques can be applied to practical problems, including problems in the field of bioinformatics, and how they often outperform other machine learning techniques in both predictive power and robustness. Furthermore, we develop a methodology of ensembling, Multi-Swarm Ensemble (MSWE) by using multiple particle swarm optimizations and demonstrate its ability to further enhance the performance of ensembles.

  • Being comfortable with no assumptions

    CIO Review

    The article discusses a commonly employed problem-solving technique: making assumptions. Assumptions are a powerful tool that can speed up the search for a solution. One problem with assumptions is that people forget to revisit them when a solution is not found; another is that some assumptions are made subconsciously.
  • Architecture of the human regulatory network derived from ENCODE data

    Nature

    Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.

  • AlleleSeq: analysis of allele-specific expression and binding in a network framework.

    Molecular Systems Biology

    To study allele-specific expression (ASE) and binding (ASB), that is, differences between the maternally and paternally derived alleles, we have developed a computational pipeline (AlleleSeq). Our pipeline initially constructs a diploid personal genome sequence (and corresponding personalized gene annotation) using genomic sequence variants (SNPs, indels, and structural variants), and then identifies allele-specific events with significant differences in the number of mapped reads between maternal and paternal alleles. There are many technical challenges in the construction and alignment of reads to a personal diploid genome sequence that we address, for example, bias of reads mapping to the reference allele. We have applied AlleleSeq to variation data for NA12878 from the 1000 Genomes Project as well as matched, deeply sequenced RNA-Seq and ChIP-Seq data sets generated for this purpose. In addition to observing fairly widespread allele-specific behavior within individual functional genomic data sets (including results consistent with X-chromosome inactivation), we can study the interaction between ASE and ASB. Furthermore, we investigate the coordination between ASE and ASB from multiple transcription factors events using a regulatory network framework. Correlation analyses and network motifs show mostly coordinated ASB and ASE.

  • Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project.

    Science

    We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.

  • Fast and accurate identification of semi-tryptic peptides in shotgun proteomics.

    Bioinformatics

    Motivation: One of the major problems in shotgun proteomics is the low peptide coverage when analyzing complex protein samples. Identifying more peptides, e.g. non-tryptic peptides, may increase the peptide coverage and improve protein identification and/or quantification that are based on the peptide identification results. Searching for all potential non-tryptic peptides is, however, time consuming for shotgun proteomics data from complex samples, and poses a challenge for a routine data analysis.

    Results: We hypothesize that non-tryptic peptides are mainly created from the truncation of regular tryptic peptides before separation. We introduce the notion of truncatability of a tryptic peptide, i.e. the probability of the peptide to be identified in its truncated form, and build a predictor to estimate a peptide's truncatability from its sequence. We show that our predictions achieve useful accuracy, with the area under the ROC curve from 76% to 87%, and can be used to filter the sequence database for identifying truncated peptides. After filtering, only a limited number of tryptic peptides with the highest truncatability are retained for non-tryptic peptide searching. By applying this method to identification of semi-tryptic peptides, we show that a significant number of such peptides can be identified within a searching time comparable to that of tryptic peptide identification.

    Other authors
    • Randy Arnold
    • David Clemmer
    • Yixue Li
    • James Reilly
    • Quanhu Sheng
    • Haixu Tang
    • Zhiyin Xun
    • Rong Zeng
    • Predrag Radivojac
  • Advancement in protein inference from shotgun proteomics using peptide detectability.

    Pacific Symposium Biocomputing

    A major challenge in shotgun proteomics has been the assignment of identified peptides to the proteins from which they originate, referred to as the protein inference problem. Redundant and homologous protein sequences present a challenge in being correctly identified, as a set of peptides may in many cases represent multiple proteins. One simple solution to this problem is the assignment of the smallest number of proteins that explains the identified peptides. However, it is not certain that a natural system should be accurately represented using this minimalist approach. In this paper, we propose a reformulation of the protein inference problem by utilizing the recently introduced concept of peptide detectability. We also propose a heuristic algorithm to solve this problem and evaluate its performance on synthetic and real proteomics data. In comparison to a greedy implementation of the minimum protein set algorithm, our solution that incorporates peptide detectability performs favorably.

    Other authors
    • Randy Arnold
    • Milos Novotny
    • Predrag Radivojac
    • Haixu Tang

Projects

  • Fraud Detection - Healthcare Insurance

    This was an interesting problem in identifying fraudulent entities at any step of the health insurance process, from patient to pharmacist to pharmacy. This project was heavy on feature engineering. Anomaly detection and one-class classifiers were used in this unsupervised learning problem. The final solution involved an innovative approach of creating a network between all entities and overlaying the features onto the network, which allowed for more complex network analysis and enabled the approach to identify fraudulent entities with individual signals too low to be detected otherwise.
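    The network-overlay idea can be sketched as follows; the entity names, feature values, and the neighbor-averaging rule are illustrative stand-ins, not the project's actual feature engineering:

    ```python
    import numpy as np

    # Hypothetical claims network: edges connect entities that interact.
    edges = [("patient_1", "pharmacy_A"), ("patient_2", "pharmacy_A"),
             ("patient_3", "pharmacy_A"), ("patient_4", "pharmacy_B")]

    # A per-entity anomaly feature (e.g. claim volume vs. a peer baseline).
    feature = {"patient_1": 1.1, "patient_2": 1.3, "patient_3": 1.2,
               "patient_4": 0.1, "pharmacy_A": 0.9, "pharmacy_B": 0.2}

    # Build an adjacency list for the entity network.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    # Overlay features on the network: each entity's score combines its own
    # signal with the mean signal of its neighbors, so several weak signals
    # clustered around one hub reinforce each other.
    score = {n: feature[n] + np.mean([feature[m] for m in neighbors[n]])
             for n in neighbors}
    ```

    Here pharmacy_A, linked to three mildly anomalous patients, ends up with a much higher combined score than pharmacy_B, even though no single entity is a strong outlier on its own.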

  • Partial AUC Optimization through ensembles of decision trees

    When developing a new predictive model there are a multitude of evaluation metrics that can be used. Usually, most scientists have one favorite metric that they use consistently throughout different projects; however, the choice of evaluation metric can be as important as the model itself.

    In the field of clinical informatics, where someone's healthcare might be affected by a model's prediction, the scores that are closer to the extreme values are the ones that might actually change a doctor's decision. With this in mind, the overall AUC is not as important as the area under the ROC with low false positive rate.

    In this project I developed and coded two solutions to optimize the partial AUC directly. The first was to use PSO (Particle Swarm Optimization) to train a neural network that directly improves the user-specified partial AUC. The second was to train various random decision trees and combine them into one ensemble model with respective weights; these weights are optimized through PSO to improve the partial AUC.

    The final result was two algorithms that scale well to big data (and are parallelizable) and improve the user-specified partial AUC.
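    A minimal sketch of the second approach, with a simple random search standing in for the PSO weight optimizer (function and variable names are illustrative):

    ```python
    import numpy as np

    def partial_auc(y_true, scores, max_fpr=0.1):
        """ROC area restricted to FPR <= max_fpr, normalized by max_fpr so a
        perfect ranker scores 1.0 (a random one scores ~max_fpr / 2)."""
        y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
        tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))
        fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (len(y) - y.sum())))
        keep = fpr <= max_fpr  # low false-positive-rate region only
        fpr_c = np.concatenate((fpr[keep], [max_fpr]))
        tpr_c = np.concatenate((tpr[keep], [np.interp(max_fpr, fpr, tpr)]))
        # trapezoidal area under the clipped ROC segment
        return np.sum(np.diff(fpr_c) * (tpr_c[1:] + tpr_c[:-1]) / 2) / max_fpr

    def optimize_weights(preds, y_true, max_fpr=0.1, n_iter=500, seed=0):
        """Search convex combinations of base-model scores for the best
        partial AUC; random search here stands in for PSO."""
        rng = np.random.default_rng(seed)
        best_w, best = None, -1.0
        for _ in range(n_iter):
            w = rng.random(preds.shape[0])
            w /= w.sum()  # keep the ensemble a convex combination
            pauc = partial_auc(y_true, w @ preds, max_fpr)
            if pauc > best:
                best_w, best = w, pauc
        return best_w, best
    ```

    The key design point is that the ensemble weights are fitted against the partial AUC itself rather than a surrogate loss, so the metric that matters clinically is the one being optimized.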

  • Readmission Prediction

    Goal: Predicting probability of patients to have readmissions to hospitals after being discharged with specific conditions.

    Methods: Data mining, ensemble learning, feature engineering, feature selection, sub-population specific feature selection, metrics evaluations, partial AUC optimization.

    Results: The final model outperformed models trained on individual sub-populations when compared on those sub-populations, and outperformed current market competitors by yielding actionable scores to population sizes 300–400% of competitors'.

    Using data mining and feature selection to build models to predict the probability of patients to readmit to hospitals after being discharged with specific conditions: Chronic Obstructive Pulmonary Disease, Pneumonia and AMI.

    Various feature selection methods were used to reduce the number of features from a few thousand to about 100. An initial pass with a greedy backwards search was followed by genetic algorithms run on sub-populations, in order to capture features that were predictive for sub-groups of the population but had weak predictive power for the population as a whole.
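    The greedy backwards pass can be sketched as follows; the OLS-based utility score is a simplified stand-in for the cross-validated model evaluation a real pipeline would use:

    ```python
    import numpy as np

    def fit_score(X, y):
        """R^2 of an ordinary least-squares fit -- a cheap stand-in for a
        cross-validated model score."""
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1.0 - np.var(y - X @ w) / np.var(y)

    def backward_select(X, y, keep):
        """Greedy backwards search: repeatedly drop the feature whose
        removal hurts the score least, until `keep` features remain."""
        active = list(range(X.shape[1]))
        while len(active) > keep:
            trials = [(fit_score(X[:, [f for f in active if f != j]], y), j)
                      for j in active]
            _, worst = max(trials)  # removing `worst` loses the least score
            active.remove(worst)
        return active
    ```

    On synthetic data where only one column drives the target, the search discards the noise columns and retains that one.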

    Various machine learning approaches were tested including:
    - Neural networks
    - Logistic regression
    - SVMs
    - Tree-based algorithms, bagging, forests, boosting, Alternating decision trees
    - Ensembles: stacking, grading, voting, greedy-search

    The final model used was a combination of bagging and random forests. This proved to yield the highest levels of accuracy and least overfitting.

    The final results were models with AUCs about 10% higher than the current competition. More important than AUC is the size of the actionable population: in clinical work, a doctor will not change their plan unless the prediction for a patient differs enough from that of the average patient, and the set of such patients is called the "actionable population". The actionable population of my models was 300–400% of that of the competitors.

Honors & Awards

  • Top Kaggler (within top 0.4%)

    Kaggle

    Ranked in the 99.6th percentile of ~225,000 data scientists on Kaggle.

  • Wired Magazine - Best scientific figures 2012

    Wired Magazine

    Wired magazine's choice award for top 10 scientific figures in 2012.
    http://www.wired.com/2012/12/science-figures-2012/

  • Alzheimer’s Drug Discovery Foundation Young Investigator Award

    Alzheimer’s Drug Discovery Foundation

  • Genome Scholars Award

    SACNAS, funded by the NHGRI (NIH)

    Awarded a fellowship funded by the National Institutes of Health.

  • NLM (National Library of Medicine) fellowship award

    NLM (National Library of Medicine)

  • Mensa Membership

    Mensa High IQ Society

Languages

  • English

    Native or bilingual proficiency

  • Portuguese

    Native or bilingual proficiency

Recommendations received

14 people have recommended Pedro
