Python and the holy grail: A proteomics venture with scripting

Xin Li

Published Sep 4, 2021

Freaking happy with the data analysis! I had a case that required determining the sequence of a truncated protein by top-down. However, the protein turned out to be highly glycosylated, and the fragmentation pattern was driving me nuts: all I got were sky-high sugar peaks and nothing major from the peptide backbone. I have an in-house script (as always) that does de novo sequencing of 3-4 amino acid stretches, but in this case all the low-mass peaks were buried by the sugars. That sucks...

Then one day, while looking at all the multiply charged peaks, I told myself these were definitely peptide chunks. I just had no tool to read out the sequence, because I did not know where the N-terminus was: it could be anywhere in a protein of more than 400 AAs. I also knew that top-down always produces truncated fragments that do not appear to start from the real N-terminus. So I had no choice but to use every AA in the full sequence as a pseudo N-terminus and, from there, add the next AA, then the one after it, and so on, to see whether the growing stretch matched any of the peptide fragments. The idea is pure brute force, but it is simple enough for my apprentice-level Python. Why not, since all my de novo attempts had failed?

So eventually I did it (thanks to the experience from ProSightPC). The script needs three inputs: 1. the FASTA sequence, 2. the modification table (from UniProt), 3. the deconvoluted masses of the multiply charged peaks. It generates all possible proteoforms from combinations of the modifications, then iterates with every AA as a pseudo N-terminus: it adds the next AA and checks whether the running mass matches the peak mass (within a user-defined mass tolerance). If not, it keeps adding the next AA from the sequence and checks again, until the running mass goes beyond the query mass. If a pseudo N-terminus gives no good matches, the script moves on to the next AA as the new pseudo N-terminus.

So at 1:00 am, after testing on my enolase top-down data (enolase because we are also doing cross-linking here; you are welcome to send your samples :)), I got my hands on the cursed truncated glycoprotein. Thanks to my i7-9700, the brute-force attack ran smoothly! In a few minutes I got several good matches (may the HRAM be with you, young Skywalker). Some matches consistently pointed to the same threonine, and the observation was frequent enough to suggest a theory.
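For readers who want to try the idea, the pseudo N-terminus search described above can be sketched roughly as follows. This is not my actual script, just a minimal illustration under some assumptions: the names `find_stretches`, `mods`, and `offset` are made up for this example, the query is a deconvoluted neutral mass compared against the plain sum of residue masses (use `offset` for other ion types), and one `mods` dictionary stands for a single proteoform rather than the full set of modification combinations.

```python
from typing import Dict, List, Optional, Tuple

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def find_stretches(
    sequence: str,
    query_mass: float,
    tol: float = 0.010,                       # mass tolerance in Da (10 mDa)
    mods: Optional[Dict[int, float]] = None,  # hypothetical: 1-based position -> mass delta
    offset: float = 0.0,                      # hypothetical: constant ion-type offset in Da
) -> List[Tuple[int, int, str, float]]:
    """Try every residue as a pseudo N-terminus and extend until the mass is exceeded.

    Returns (start, end, stretch, error_da) tuples with 1-based positions.
    Assumes all modification deltas are positive, so the running mass only grows.
    """
    mods = mods or {}
    hits = []
    for start in range(len(sequence)):          # each AA as pseudo N-terminus
        running = offset
        for end in range(start, len(sequence)):  # add the next AA, then the next...
            running += RESIDUE_MASS[sequence[end]] + mods.get(end + 1, 0.0)
            if running > query_mass + tol:       # went beyond the query mass
                break
            if abs(running - query_mass) <= tol:
                stretch = sequence[start:end + 1]
                hits.append((start + 1, end + 1, stretch, running - query_mass))
    return hits


# Toy example with a made-up sequence: search for the mass of the stretch "TAYI".
toy_sequence = "MKTAYIAKQR"
toy_query = sum(RESIDUE_MASS[aa] for aa in "TAYI")
for start, end, stretch, err in find_stretches(toy_sequence, toy_query):
    print(f"{start}-{end} {stretch} error {err * 1000:.2f} mDa")
```

The early `break` is what keeps the brute force cheap: each pseudo N-terminus is abandoned as soon as the running mass passes the query, so the inner loop rarely runs far. To cover several proteoforms, one would call the function once per `mods` dictionary.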

603.7843 (+4): 2.16 mDa, 705.2784 (+2): 4.98 mDa, 811.5366 (+6): 8.92 mDa; within 10 mDa, I got everything from b2 to b15. I then checked the theoretical mass of the proposed truncated sequence again, and it tallied with the intact mass result. Puzzle solved.
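As a side note on the numbers above, here is a small sketch of how a multiply charged peak relates to a neutral mass and how an error in mDa is computed. It assumes the listed values are m/z of protonated ions with the charge state in brackets; the function names are made up for this example, and real deconvolution software does considerably more than this.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def neutral_mass(mz: float, charge: int) -> float:
    """Neutral mass of an ion observed at m/z `mz` carrying `charge` protons."""
    return charge * (mz - PROTON_MASS)


def error_mda(observed: float, theoretical: float) -> float:
    """Signed mass error in millidalton (1 mDa = 0.001 Da)."""
    return (observed - theoretical) * 1000.0


# Example: the +4 peak at m/z 603.7843 corresponds to a neutral mass of about 2411.108 Da.
print(f"{neutral_mass(603.7843, 4):.4f}")
```

A match counts as good when the absolute value of `error_mda` stays under the chosen tolerance, which was 10 mDa in my case.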
