Pedro Alves

San Mateo, California, United States
24K followers 500+ connections

About

Chief Technology Officer @ Thoth AI | Building the world’s most advanced learning…

Experience

  • Thoth AI

    San Francisco Bay Area

  • Earlier roles (titles and dates not shown): San Mateo, California; San Francisco Bay Area; Redwood City, CA; Lenexa, KS

Education

  • Yale University

    Working in various areas within the field of computational biology, such as gene networks, gene expression data, regulatory networks, and single nucleotide polymorphisms. Focusing on innovative approaches to machine learning, combining graph-analysis information with predictive modeling, ensemble learning, and network analysis.

Publications

  • Multiple-Swarm Ensembles: Improving the Predictive Power and Robustness of Predictive Models and Its Use in Computational Biology

    IEEE/ACM Transactions on Computational Biology and Bioinformatics

    Machine learning is an integral part of computational biology, and has already shown its use in various applications, such as prognostic tests. In the last few years in the non-biological machine learning community, ensembling techniques have shown their power in data mining competitions such as the Netflix challenge; however, such methods have not found wide use in computational biology. In this work we endeavor to show how ensembling techniques can be applied to practical problems, including problems in the field of bioinformatics, and how they often outperform other machine learning techniques in both predictive power and robustness. Furthermore, we develop a methodology of ensembling, Multi-Swarm Ensemble (MSWE) by using multiple particle swarm optimizations and demonstrate its ability to further enhance the performance of ensembles.

  • Being comfortable with no assumptions

    CIO Review

    The article discusses a commonly employed problem-solving technique: making assumptions. Assumptions are a powerful tool that can speed up the search for a solution. One problem with assumptions is that people forget to revisit them when a solution is not found; another is that some assumptions are made subconsciously.
  • Architecture of the human regulatory network derived from ENCODE data

    Nature

    Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.

  • AlleleSeq: analysis of allele-specific expression and binding in a network framework.

    Molecular Systems Biology

    To study allele-specific expression (ASE) and binding (ASB), that is, differences between the maternally and paternally derived alleles, we have developed a computational pipeline (AlleleSeq). Our pipeline initially constructs a diploid personal genome sequence (and corresponding personalized gene annotation) using genomic sequence variants (SNPs, indels, and structural variants), and then identifies allele-specific events with significant differences in the number of mapped reads between maternal and paternal alleles. There are many technical challenges in the construction and alignment of reads to a personal diploid genome sequence that we address, for example, bias of reads mapping to the reference allele. We have applied AlleleSeq to variation data for NA12878 from the 1000 Genomes Project as well as matched, deeply sequenced RNA-Seq and ChIP-Seq data sets generated for this purpose. In addition to observing fairly widespread allele-specific behavior within individual functional genomic data sets (including results consistent with X-chromosome inactivation), we can study the interaction between ASE and ASB. Furthermore, we investigate the coordination between ASE and ASB from multiple transcription factors events using a regulatory network framework. Correlation analyses and network motifs show mostly coordinated ASB and ASE.

  • Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project.

    Science

    We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.

  • Fast and accurate identification of semi-tryptic peptides in shotgun proteomics.

    Bioinformatics

    Motivation: One of the major problems in shotgun proteomics is the low peptide coverage when analyzing complex protein samples. Identifying more peptides, e.g. non-tryptic peptides, may increase the peptide coverage and improve protein identification and/or quantification that are based on the peptide identification results. Searching for all potential non-tryptic peptides is, however, time consuming for shotgun proteomics data from complex samples, and poses a challenge for a routine data analysis.

    Results: We hypothesize that non-tryptic peptides are mainly created from the truncation of regular tryptic peptides before separation. We introduce the notion of truncatability of a tryptic peptide, i.e. the probability of the peptide to be identified in its truncated form, and build a predictor to estimate a peptide's truncatability from its sequence. We show that our predictions achieve useful accuracy, with the area under the ROC curve from 76% to 87%, and can be used to filter the sequence database for identifying truncated peptides. After filtering, only a limited number of tryptic peptides with the highest truncatability are retained for non-tryptic peptide searching. By applying this method to identification of semi-tryptic peptides, we show that a significant number of such peptides can be identified within a searching time comparable to that of tryptic peptide identification.

    Other authors
    • Randy Arnold
    • David Clemmer
    • Yixue Li
    • James Reilly
    • Quanhu Sheng
    • Haixu Tang
    • Zhiyin Xun
    • Rong Zeng
    • Predrag Radivojac
  • Advancement in protein inference from shotgun proteomics using peptide detectability.

    Pacific Symposium Biocomputing

    A major challenge in shotgun proteomics has been the assignment of identified peptides to the proteins from which they originate, referred to as the protein inference problem. Redundant and homologous protein sequences present a challenge in being correctly identified, as a set of peptides may in many cases represent multiple proteins. One simple solution to this problem is the assignment of the smallest number of proteins that explains the identified peptides. However, it is not certain that a natural system should be accurately represented using this minimalist approach. In this paper, we propose a reformulation of the protein inference problem by utilizing the recently introduced concept of peptide detectability. We also propose a heuristic algorithm to solve this problem and evaluate its performance on synthetic and real proteomics data. In comparison to a greedy implementation of the minimum protein set algorithm, our solution that incorporates peptide detectability performs favorably.

    Other authors
    • Randy Arnold
    • Milos Novotny
    • Predrag Radivojac
    • Haixu Tang

Projects

  • Fraud Detection - Healthcare Insurance

    This was an interesting problem in identifying fraudulent entities at any step of the health insurance process, from patient to pharmacist to pharmacy. This project was heavy on feature engineering. Anomaly detection and one-class classifiers were used in this unsupervised learning problem. The final solution involved an innovative approach of creating a network between all entities and overlaying the features onto the network, which allowed for more complex network analysis and enabled the approach to identify fraudulent entities with individual signals too low to be detected otherwise.
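    The network-overlay idea can be sketched as follows; the entity names, feature values, and the neighbor-averaging rule are illustrative stand-ins, not the project's actual feature engineering:

    ```python
    import numpy as np

    # Hypothetical claims network: edges connect entities that interact.
    edges = [("patient_1", "pharmacy_A"), ("patient_2", "pharmacy_A"),
             ("patient_3", "pharmacy_A"), ("patient_4", "pharmacy_B")]

    # A per-entity anomaly feature (e.g. claim volume vs. a peer baseline).
    feature = {"patient_1": 1.1, "patient_2": 1.3, "patient_3": 1.2,
               "patient_4": 0.1, "pharmacy_A": 0.9, "pharmacy_B": 0.2}

    # Build an adjacency list for the entity network.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    # Overlay features on the network: each entity's score combines its own
    # signal with the mean signal of its neighbors, so several weak signals
    # clustered around one hub reinforce each other.
    score = {n: feature[n] + np.mean([feature[m] for m in neighbors[n]])
             for n in neighbors}
    ```

    Here pharmacy_A, linked to three mildly anomalous patients, ends up with a much higher combined score than pharmacy_B, even though no single entity is a strong outlier on its own.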

  • Partial AUC Optimization through ensembles of decision trees

    When developing a new predictive model there are a multitude of evaluation metrics that can be used. Usually, most scientists have one favorite metric that they use consistently throughout different projects; however, the choice of evaluation metric can be as important as the model itself.

    In the field of clinical informatics, where someone's healthcare might be affected by a model's prediction, the scores that are closer to the extreme values are the ones that might actually change a doctor's decision. With this in mind, the overall AUC is not as important as the area under the ROC with low false positive rate.

    In this project I developed and coded two solutions to optimize the partial AUC directly. The first was to use PSO (Particle Swarm Optimization) to train a neural network that directly improves the user-specified partial AUC. The second was to train various random decision trees and combine them into one ensemble model with respective weights; these weights are optimized through PSO to improve the partial AUC.

    The final result was two algorithms that scale well to big data (and are parallelizable) and improve the user-specified partial AUC.
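    A minimal sketch of the second approach, with a simple random search standing in for the PSO weight optimizer (function and variable names are illustrative):

    ```python
    import numpy as np

    def partial_auc(y_true, scores, max_fpr=0.1):
        """ROC area restricted to FPR <= max_fpr, normalized by max_fpr so a
        perfect ranker scores 1.0 (a random one scores ~max_fpr / 2)."""
        y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
        tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))
        fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (len(y) - y.sum())))
        keep = fpr <= max_fpr  # low false-positive-rate region only
        fpr_c = np.concatenate((fpr[keep], [max_fpr]))
        tpr_c = np.concatenate((tpr[keep], [np.interp(max_fpr, fpr, tpr)]))
        # trapezoidal area under the clipped ROC segment
        return np.sum(np.diff(fpr_c) * (tpr_c[1:] + tpr_c[:-1]) / 2) / max_fpr

    def optimize_weights(preds, y_true, max_fpr=0.1, n_iter=500, seed=0):
        """Search convex combinations of base-model scores for the best
        partial AUC; random search here stands in for PSO."""
        rng = np.random.default_rng(seed)
        best_w, best = None, -1.0
        for _ in range(n_iter):
            w = rng.random(preds.shape[0])
            w /= w.sum()  # keep the ensemble a convex combination
            pauc = partial_auc(y_true, w @ preds, max_fpr)
            if pauc > best:
                best_w, best = w, pauc
        return best_w, best
    ```

    The key design point is that the ensemble weights are fitted against the partial AUC itself rather than a surrogate loss, so the metric that matters clinically is the one being optimized.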

  • Readmission Prediction

    Goal: Predicting probability of patients to have readmissions to hospitals after being discharged with specific conditions.

    Methods: Data mining, ensemble learning, feature engineering, feature selection, sub-population specific feature selection, metrics evaluations, partial AUC optimization.

    Results: The final model outperformed models trained on individual sub-populations when compared on those sub-populations, and outperformed current market competitors by yielding actionable scores to population sizes 300–400% of competitors'.

    Using data mining and feature selection to build models to predict the probability of patients to readmit to hospitals after being discharged with specific conditions: Chronic Obstructive Pulmonary Disease, Pneumonia and AMI.

    Various feature selection methods were used to reduce the number of features from a few thousand to about 100. An initial pass with a greedy backwards search was followed by genetic algorithms run on sub-populations, in order to capture features that were predictive for sub-groups of the population but had weak predictive power for the population as a whole.
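    The greedy backwards pass can be sketched as follows; the OLS-based utility score is a simplified stand-in for the cross-validated model evaluation a real pipeline would use:

    ```python
    import numpy as np

    def fit_score(X, y):
        """R^2 of an ordinary least-squares fit -- a cheap stand-in for a
        cross-validated model score."""
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1.0 - np.var(y - X @ w) / np.var(y)

    def backward_select(X, y, keep):
        """Greedy backwards search: repeatedly drop the feature whose
        removal hurts the score least, until `keep` features remain."""
        active = list(range(X.shape[1]))
        while len(active) > keep:
            trials = [(fit_score(X[:, [f for f in active if f != j]], y), j)
                      for j in active]
            _, worst = max(trials)  # removing `worst` loses the least score
            active.remove(worst)
        return active
    ```

    On synthetic data where only one column drives the target, the search discards the noise columns and retains that one.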

    Various machine learning approaches were tested including:
    - Neural networks
    - Logistic regression
    - SVMs
    - Tree-based algorithms, bagging, forests, boosting, Alternating decision trees
    - Ensembles: stacking, grading, voting, greedy-search

    The final model used was a combination of bagging and random forests. This proved to yield the highest levels of accuracy and least overfitting.

    The final results were models with AUCs about 10% higher than the current competition. More important than AUC is the size of the actionable population: in clinical work, a doctor will not change their plan unless the prediction for a patient differs enough from that of the average patient, and the set of such patients is called the "actionable population". The actionable population of my models was 300–400% of that of the competitors.

Honors & Awards

  • Top Kaggler (within top 0.4%)

    Kaggle

    Ranked in the 99.6th percentile of ~225,000 data scientists on Kaggle.

  • Wired Magazine - Best scientific figures 2012

    Wired Magazine

    Wired magazine's choice award for top 10 scientific figures in 2012.
    http://www.wired.com/2012/12/science-figures-2012/

  • Alzheimer’s Drug Discovery Foundation Young Investigator Award

    Alzheimer’s Drug Discovery Foundation

  • Genome Scholars Award

    SACNAS, funded by the NHGRI (NIH)

    Awarded a fellowship funded by the National Institutes of Health.

  • NLM (National Library of Medicine) fellowship award

    NLM (National Library of Medicine)

  • Mensa Membership

    Mensa High IQ Society

Languages

  • English

    Native or bilingual proficiency

  • Portuguese

    Native or bilingual proficiency

Recommendations received

14 people have recommended Pedro
