**PROTREC is an algorithm for predicting and validating missing proteins in proteomics data, based on
Kong W, Wong B J H, Gao H, et al. PROTREC: A probability-based approach for recovering missing proteins based
on biological networks. Journal of Proteomics, 2022, 250: 104392. Here, we also include
three other commonly used network-based methods (the same as in the paper) for
comparison.**

- Probability (or p-value) score matrix
- Comparison heatmap
- Recovery rate
- Score distribution

- PROTREC
- FCS (Functional Class Scoring)
- HE (Hypergeometric Enrichment)
- GSEA (Gene Set Enrichment Analysis)

PROTREC is a novel probability-based scoring scheme that estimates the probability of a protein being present in a screen.

Specifically, to calculate the PROTREC probability for a protein \(x\), we first find all the protein
complexes \(z_i\) containing \(x\). Not every complex is considered, because some
complexes contain too few proteins. __The default complex size threshold is 5; you may change
it via "Min size of protein complex" in the parameter settings under the "Run your
own datasets" section.__

Then, we calculate the probability of a complex \(z_i\in z\) being
present: \[p(z_i)=\frac{\sum\limits_{x_i\in z_i\cap L}(1-FDR)}{|z_i|}\] Here, \(z\) denotes the set of
complexes containing \(x\), \(x_i\) denotes a protein inside the complex \(z_i\), and \(L\) denotes
the set of proteins reported by the proteomic screen. If \(x_i\) is reported by the screen, then
its prior probability is \((1-FDR)\), where \(FDR\) is the false discovery rate of \(L\); members
of \(z_i\) that are not reported contribute nothing to the sum. **The default \(FDR\) is 0.01; you may
change it via "FDR of proteomic screen" in the parameter settings under the "Run your
own datasets" section.**
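As a minimal sketch of the complex probability above, the following Python function counts the reported members of a complex and weights each by \((1-FDR)\). The function and argument names are illustrative assumptions and are not taken from the PROTREC implementation.

```python
def complex_probability(complex_members, reported, fdr=0.01):
    """Sketch of p(z_i): reported members contribute (1 - FDR), others contribute 0.

    complex_members: iterable of protein IDs forming the complex z_i
    reported:        iterable of protein IDs reported by the screen (L)
    fdr:             false discovery rate of the screen
    """
    members = set(complex_members)
    if not members:
        raise ValueError("complex must contain at least one protein")
    n_reported = len(members & set(reported))
    return n_reported * (1.0 - fdr) / len(members)


# Hypothetical example: a 5-member complex with 4 members reported.
p_z = complex_probability(["A", "B", "C", "D", "E"], ["A", "B", "C", "D", "X"])
```

With the default \(FDR\) of 0.01, this example gives \(4\times 0.99/5 = 0.792\).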

PROTREC assumes that the probability of protein \(x\) being present in a sample depends on the joint
probability of it being present when its complex is formed and the probability of it being present
when its complex is not formed. Since \(x\) might belong to multiple protein complexes, PROTREC computes the
probability of \(x\) being present in the screened sample using each of the
complexes that \(x\) is a member of and returns the maximum: \[p(x)=\max_{z_i\in
z}\{p(x|z_i)p(z_i)+p(x|\overline{z_i})p(\overline{z_i})\}\] In this way, we can calculate the
PROTREC score of every protein. We can then sort the proteins by score and predict unreported proteins
above a given PROTREC score threshold as missing proteins. __By default, we use
0.95 as the cutoff. You may set your own cutoff via "PROTREC score threshold"
in the parameter settings under the "Run your own datasets" section.__
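The maximisation above can be sketched in Python as follows. This text does not state how \(p(x|z_i)\) and \(p(x|\overline{z_i})\) are obtained, so the sketch takes them as caller-supplied values (an assumption made purely for illustration), uses \(p(\overline{z_i})=1-p(z_i)\), and treats "above the threshold" as "greater than or equal to". All names are hypothetical.

```python
def protrec_score(complex_probs, p_x_given_z, p_x_given_not_z):
    """Sketch of p(x) = max over z_i of p(x|z_i) p(z_i) + p(x|not z_i) (1 - p(z_i)).

    complex_probs:    p(z_i) for each complex that contains protein x
    p_x_given_z:      assumed value of p(x | z_i), supplied by the caller
    p_x_given_not_z:  assumed value of p(x | not z_i), supplied by the caller
    """
    if not complex_probs:
        raise ValueError("protein must belong to at least one complex")
    return max(
        p_x_given_z * p_z + p_x_given_not_z * (1.0 - p_z)
        for p_z in complex_probs
    )


def predict_missing(scores, reported, threshold=0.95):
    """Return unreported proteins whose score reaches the threshold, best first."""
    reported = set(reported)
    hits = [p for p, s in scores.items() if p not in reported and s >= threshold]
    return sorted(hits, key=lambda p: scores[p], reverse=True)


# Hypothetical inputs: protein "E" sits in two complexes.
score_e = protrec_score([0.792, 0.99], p_x_given_z=0.99, p_x_given_not_z=0.01)
predicted = predict_missing({"E": score_e, "F": 0.40, "A": 0.99}, reported=["A"])
```

In this made-up example the second complex dominates, giving a score of about 0.98, so "E" is predicted as a missing protein while the reported protein "A" and the low-scoring "F" are not.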

FCS tests whether a network is significantly enriched given the observed proteins. Given a set of
observed proteins in a proteomics screen \(S\) and a list of component proteins \(M\) from
protein complex \(C\), the observed overlap \(O\) is expressed as: \[O=\frac{|S\cap
M|}{|M|}\] To determine whether the overlap \(O\) is significant, \(N\) randomized complexes
of matching size are generated using a reference pool of unique proteins drawn from the complexes \(C\).
**By default, \(N\) is 1000; you may change it via "Number of
iterations" in the parameter settings under the "Run your own datasets" section.**
From the randomized complexes, a vector of null overlaps \(N_j\) is generated. For the
\(j^{th}\) randomized complex, which comprises the set of proteins \(K_j\), \(N_j\) is defined
as follows: \[N_j=\frac{|S\cap K_j|}{|K_j|}\] The empirical p-value is the proportion of null
overlaps \(N_j\) greater than or equal to the observed overlap \(O\). For the \(i^{th}\)
complex \(C_i\) in the complex vector, its p-value \(pval_i\)
is: \[pval_i=\frac{\sum\limits_{j=1}^{N}[N_j\ge O]}{N}\] where \(O\) is the observed overlap of
\(C_i\) and \([\cdot]\) equals 1 when the condition holds and 0 otherwise. If the FCS p-value falls below a
significance p-value threshold, then all member proteins of the complex, including the
unobserved ones, are predicted as present.
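The FCS procedure can be sketched as a small permutation test in Python. The sketch assumes that each randomized complex has the same size as the tested complex and is sampled without replacement from the reference pool; neither detail, nor any of the names, is confirmed by this page.

```python
import random


def overlap(observed, members):
    """Fraction of a complex's members that appear in the observed set."""
    members = set(members)
    return len(set(observed) & members) / len(members)


def fcs_pvalue(observed, complex_members, pool, n_iter=1000, seed=0):
    """Empirical p-value: share of null overlaps N_j that are >= the observed overlap O."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    members = set(complex_members)
    pool = sorted(set(pool))           # reference pool of unique proteins
    observed = set(observed)
    o = overlap(observed, members)
    hits = 0
    for _ in range(n_iter):
        k_j = rng.sample(pool, len(members))   # assumed: same size as the complex
        if overlap(observed, k_j) >= o:
            hits += 1
    return hits / n_iter


# Hypothetical example: 100 pool proteins, a 5-member complex fully observed.
pool = [f"P{i}" for i in range(100)]
p_enriched = fcs_pvalue(observed=pool[:5], complex_members=pool[:5], pool=pool)
p_trivial = fcs_pvalue(observed=pool, complex_members=pool[:5], pool=pool)
```

When every pool protein is observed, each null overlap equals 1 and the p-value is 1; when only the complex members are observed, a random complex almost never matches the observed overlap and the p-value is close to 0.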

In HE, the set of observed proteins is compared against a vector of protein complexes. Given a total number of proteins \(N\), with \(M\) of these belonging to a complex and \(n\) of these proteins in the differential set, the probability \(P\) that \(b\) or more proteins from the differential set are associated with the complex by chance is given by: \[P(X\ge b)=\sum\limits_{i=b}^{\min(n,M)}\frac{C_n^iC_{N-n}^{M-i}}{C_N^M}\] \(P(X\ge b)\) is the HE p-value. A complex is declared significant if the HE p-value falls below the threshold.
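The hypergeometric tail above can be evaluated directly with exact integer binomial coefficients. The following Python sketch mirrors the formula term by term; the function name is an illustrative assumption.

```python
from math import comb


def he_pvalue(N, M, n, b):
    """P(X >= b) for the hypergeometric tail written above.

    N: total number of proteins
    M: number of proteins belonging to the complex
    n: number of proteins in the differential set
    b: observed number of differential-set proteins in the complex
    """
    upper = min(n, M)
    total = comb(N, M)
    tail = sum(comb(n, i) * comb(N - n, M - i) for i in range(b, upper + 1))
    return tail / total


# Hypothetical example: 10 proteins, a 3-member complex, 4 differential proteins,
# all 3 complex members found in the differential set.
p_he = he_pvalue(N=10, M=3, n=4, b=3)
```

Here only the \(i=3\) term remains, so the p-value is \(C_4^3 C_6^0 / C_{10}^3 = 4/120\), which falls below the default 0.05 threshold.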

Gene Set Enrichment Analysis (GSEA) uses a Kolmogorov-Smirnov (KS) statistic. Here, the two-sample KS test is used to evaluate whether the distribution of ranks based on the t-statistic of proteins in a complex differs from that of proteins outside the complex. Denoting proteins in the complex as the set \(C\) and proteins outside the complex as the set \(C'\), the KS statistic \(KS_{C,C'}\) is expressed as: \[KS_{C,C'}=\max\limits_x|F_{1,C}(x)-F_{2,C'}(x)|\] where \(F_{1,C}(x)\) and \(F_{2,C'}(x)\) are respectively the fraction of proteins in \(C\) and \(C'\) whose rank is higher than the rank \(x\). The null hypothesis is rejected at a significance threshold if \[KS_{C,C'}\ge c(\alpha)\sqrt{\frac{|C|+|C'|}{|C|\,|C'|}}\] where \(c(\alpha)\) is the critical value at a given alpha level.
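A minimal Python sketch of the KS statistic and rejection rule follows. For \(c(\alpha)\) it uses the standard large-sample approximation \(c(\alpha)=\sqrt{-\ln(\alpha/2)/2}\), which is an assumption here, since this page does not say how the critical value is obtained. Function names are illustrative.

```python
from math import log, sqrt


def ks_statistic(ranks_in, ranks_out):
    """Largest gap between the two groups' fractions of ranks higher than x."""
    points = sorted(set(ranks_in) | set(ranks_out))
    n_in, n_out = len(ranks_in), len(ranks_out)
    d = 0.0
    for x in points:
        f_in = sum(r > x for r in ranks_in) / n_in      # fraction in C above x
        f_out = sum(r > x for r in ranks_out) / n_out   # fraction in C' above x
        d = max(d, abs(f_in - f_out))
    return d


def ks_critical_value(n_in, n_out, alpha=0.05):
    """c(alpha) * sqrt((|C| + |C'|) / (|C| |C'|)), with the asymptotic c(alpha)."""
    c_alpha = sqrt(-log(alpha / 2.0) / 2.0)
    return c_alpha * sqrt((n_in + n_out) / (n_in * n_out))


def ks_reject(ranks_in, ranks_out, alpha=0.05):
    """True when the KS statistic reaches the critical value."""
    d = ks_statistic(ranks_in, ranks_out)
    return d >= ks_critical_value(len(ranks_in), len(ranks_out), alpha)


# Hypothetical ranks: complex members occupy ranks 1-3, all others ranks 4-6.
d_separated = ks_statistic([1, 2, 3], [4, 5, 6])
d_identical = ks_statistic([1, 2], [1, 2])
```

The asymptotic critical value is only reliable for reasonably large groups, so small complexes may call for an exact test instead.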

Notably, __for FCS, HE and GSEA, the default p-value threshold is 0.05. You may set your
own cutoff via "p-val cutoff for FCS, HE and GSEA" in the parameter settings
under the "Run your own datasets" section.__

One proteomics expression dataset is provided. The renal cancer dataset (RC) comprises 12 normal (RC_N) and 12 cancer (RC_C) samples. We provide the RC_N samples as an example. Press the button to see what the sample data look like.

Protein complexes are obtained from CORUM. We use the human dataset from the CORUM 2018 release as our default complex database. You may check what the complexes look like by clicking here.

You can submit your own dataset and protein complexes to get the protein inference result. If this is your first time using this tool, you may refer to the Example section to see how it works. If no file is submitted, the program will use the default files. We encourage users to use the default protein complexes.

Dataset (Default: example dataset)

Protein Complex (Default: CORUM database 2018 release)

PROTREC
ALL

Raw predicted score matrix
0/1 predicted score matrix

Recovery rate
Score distribution

Upload your reference list

Click here to get your verification key after putting in your email address.

Click here to send your data to us. We will execute your data and send results back to your email.

We are a research group comprised of biodata scientists, computational biologists and education technologists in the School of Biological Sciences and Lee Kong Chian School of Medicine, Nanyang Technological University.

Our lab is focused on the development of statistical approaches for analysing and resolving platform-specific idiosyncrasies in multi-omics data; identifying and resolving confounding issues such as batch effects, technical bias and missing values in high-dimensional data; and developing robust biomarker and drug target prediction techniques using a combination of machine learning and enhanced in silico validation techniques. Our lab is also interested in Bio-education, with an emphasis on the use of new AI-based technologies, text-mining and high-impact pedagogical practices (experiential learning), to enhance the quality of biological and biotechnological education. Here is more information about our lab.

Director, PhD Programmes, Lee Kong Chian School of Medicine

Director, Biomedical Data Science Graduate Programme

Co-Director, Centre for Biomedical Informatics

Head, Good Research Practice Office

Group Leader, Bio-Data Science and Education Laboratory