Proteomic Engineering
 

Automation, Engineering, and Science for Clinical Applications


Computation

Supplemental Information

Here is a brief description of some of the methods examined.

- Support Vector Machines (SVM)

Support Vector Machines (SVM) [1] are a supervised learning technique that can be viewed as a Tikhonov regularization problem with a hinge loss function.  That is, it can be expressed as:

$$\min_{f \in H} \; C \sum_{i=1}^{n} V(y_i, f(x_i)) + \frac{1}{2} \|f\|_K^2$$

where the loss function is $V(y, f(x)) = \max(0, 1 - y f(x))$.  Here H is the Reproducing Kernel Hilbert Space (RKHS) of candidate functions and K is the kernel that defines its norm.  C is the regularization constant.  Each $x_i$ represents the biomarker peak value(s) of sample i, $y_i$ is the actual cancer diagnosis (coded as +1 or -1), and $f(x_i)$ is the predicted cancer status.

To solve this regularization problem, one can rewrite it as a constrained quadratic programming problem with Lagrange multipliers $\alpha$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \alpha^T Q \alpha$$

where $Q_{ij} = y_i y_j K(x_i, x_j)$ and $K(x_i, x_j)$ is the kernel function,

with the constraint that $\alpha$ is orthogonal to $y$ (i.e. $\sum_{i} y_i \alpha_i = 0$), where $0 \le \alpha_i \le C$.

This can be solved using standard quadratic programming techniques (e.g. as implemented in Matlab).
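As a rough illustration, the sketch below evaluates the hinge loss and the standard soft-margin dual objective (sum of the multipliers minus half the quadratic form in y_i y_j K(x_i, x_j)) and checks the box and orthogonality constraints. It does not solve the quadratic program; a QP solver would still be needed for that. The two-sample dataset, the linear kernel, and all function names are assumptions made for the example, not part of the original analysis.

```python
def linear_kernel(u, v):
    """Linear kernel K(u, v) = u . v (an assumed choice for this sketch)."""
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(y, fx):
    """Hinge loss V(y, f(x)) = max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * fx)


def dual_objective(alpha, X, y, kernel=linear_kernel):
    """Dual objective: sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def is_feasible(alpha, y, C, tol=1e-9):
    """Check 0 <= alpha_i <= C and that alpha is orthogonal to y."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    orthogonal = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    return in_box and orthogonal


def decision_function(x, alpha, X, y, b=0.0, kernel=linear_kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b; its sign is the predicted class."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b


# Hypothetical toy data: one peak value per sample, +1 = cancer, -1 = control.
X = [[1.0], [-1.0]]
y = [1, -1]
alpha = [0.5, 0.5]  # the maximizer for this toy problem, derived by hand

print(dual_objective(alpha, X, y))
print(is_feasible(alpha, y, C=1.0))
print(decision_function([2.0], alpha, X, y))
```

For this toy problem the orthogonality constraint forces both multipliers to be equal, so the objective reduces to a one-dimensional quadratic whose maximum can be found by hand; larger problems need a numerical solver.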

 

- Logistic regression

A logistic regression model was also built to use protein markers as predictor features for disease state.  This model can be expressed as a generalized linear model with a logit link function [2].  The coefficients are estimated by logistic regression on the training data and then applied to an unseen test set for validation.
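A minimal sketch of this idea follows, assuming plain gradient ascent on the log-likelihood as the fitting procedure (the original analysis may have used a different solver). The marker values, learning rate, and function names are hypothetical.

```python
import math


def sigmoid(z):
    """Inverse of the logit link: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Predicted probability of disease for one sample x."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Estimate coefficients by gradient ascent on the mean log-likelihood."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = yi - predict_proba(xi, w, b)  # residual on the probability scale
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad_w)]
        b += lr * grad_b / len(X)
    return w, b


# Hypothetical training set: one marker intensity per sample, 1 = disease.
train_X = [[0.2], [0.4], [0.5], [1.5], [1.7], [2.0]]
train_y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(train_X, train_y)

# Apply the fitted coefficients to unseen (hypothetical) test samples.
for x in ([0.3], [1.8]):
    print(x, round(predict_proba(x, w, b), 3))
```

Because this toy set is perfectly separable, the coefficients keep growing with more iterations; a fixed number of epochs (or a penalty term) keeps them finite.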

 

- Principal Component Analysis (PCA)

In Principal Component Analysis (PCA), each principal component is an eigenvector consisting of weighted parameters (peaks in this case) [3].  The importance of a given principal component in terms of explaining the data variance is represented via eigenvalues which are determined as explained below. 

 

The principal components are found as follows.  First, the covariance matrix of the data matrix is calculated.  Next, the eigenvalues and eigenvectors of the covariance matrix are found.  The eigenvectors are arranged as the columns of a matrix, sorted in descending order of their eigenvalues.  Next, the first n eigenvectors (those with the largest eigenvalues) are selected based on a scree plot to form a reduced matrix.  A scree plot shows the eigenvalue magnitude for each eigenvector; comparing the differences between successive eigenvalues lets one select the components that rise above a noise baseline (the slope is typically shallower across the less significant components).  Within each selected principal component, the magnitudes of the eigenvector coefficients (loadings) are ranked, and the corresponding biomarker peaks can then be determined.
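The steps above can be sketched in pure Python. To avoid external dependencies, this example finds the eigenpairs with power iteration and deflation, a swapped-in technique that is only practical for small matrices; a numerical library's symmetric eigensolver would normally be used instead. The peak intensities and function names are hypothetical.

```python
import math


def covariance_matrix(data):
    """Sample covariance of the columns (peaks) of a samples-by-peaks matrix."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
            for a in range(p)]


def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def top_eigenpair(m, iters=1000):
    """Largest eigenvalue and its eigenvector of a symmetric matrix (power iteration)."""
    p = len(m)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))  # Rayleigh quotient
    return eigenvalue, v


def pca(data):
    """Return eigenvalues (descending) and matching eigenvectors of the covariance."""
    m = covariance_matrix(data)
    p = len(m)
    eigvals, eigvecs = [], []
    for _ in range(p):
        lam, v = top_eigenpair(m)
        eigvals.append(lam)
        eigvecs.append(v)
        # Deflation: remove the component just found before the next pass.
        m = [[m[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    return eigvals, eigvecs


# Hypothetical data: 5 samples (rows) by 3 peaks (columns).
peaks = [
    [2.0, 1.0, 0.1],
    [4.0, 2.1, 0.0],
    [6.0, 2.9, 0.1],
    [8.0, 4.0, 0.0],
    [10.0, 5.1, 0.1],
]
eigvals, eigvecs = pca(peaks)
print("eigenvalues (scree values):", eigvals)

# Rank the peaks within the first principal component by loading magnitude.
ranked_peaks = sorted(range(len(eigvecs[0])),
                      key=lambda j: abs(eigvecs[0][j]), reverse=True)
print("peaks ranked by |loading| in PC1:", ranked_peaks)
```

Plotting the printed eigenvalues against component index gives the scree plot; the choice of n is still a visual judgment and is not automated here.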

 

- General Bayesian Network approaches (i.e. basic idea of a Bayesian Network)

Bayesian probabilistic assumptions and relationships can be visualized through graphical models (known as Bayesian Networks) [4].  A Bayesian Network is essentially a graphical representation of probabilistic dependencies.  Let G = (V, E) be a directed acyclic graph (DAG), with V the set of vertices and E the set of directed edges.  In such a graph, the vertices typically encode the variables and the directed edges imply probabilistic dependence.  Because each variable depends directly only on its parents in the graph, the joint probability factors into a smaller number of terms, which reduces the amount of computation needed for inference.
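To make the factorization concrete, the sketch below encodes a hypothetical three-variable chain A -> B -> C with binary variables and computes the joint probability as a product of one conditional term per vertex. The structure, variable names, and probability values are invented for illustration only.

```python
from itertools import product

# Hypothetical DAG: each vertex maps to the tuple of its parents.
parents = {"A": (), "B": ("A",), "C": ("B",)}

# Conditional probability tables: P(node = True | parent values).
cpt = {
    "A": {(): 0.3},
    "B": {(True,): 0.8, (False,): 0.1},
    "C": {(True,): 0.9, (False,): 0.2},
}


def joint_probability(assignment):
    """P(A, B, C) as the product over vertices of P(vertex | its parents)."""
    prob = 1.0
    for node, pars in parents.items():
        p_true = cpt[node][tuple(assignment[p] for p in pars)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


def marginal(node, value=True):
    """P(node = value), by summing the joint over all other variables."""
    names = list(parents)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if assignment[node] == value:
            total += joint_probability(assignment)
    return total


print(joint_probability({"A": True, "B": True, "C": True}))
print(marginal("C"))

# The factorized form needs one number per table row: 1 + 2 + 2 = 5,
# versus 2**3 - 1 = 7 for an unrestricted joint over three binary variables.
n_parameters = sum(len(table) for table in cpt.values())
print(n_parameters)
```

The saving is small for three variables but grows quickly: an unrestricted joint over N binary variables needs 2^N - 1 numbers, while a sparse network needs only a few per vertex.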

 

- Naïve Bayesian classifier

A Naïve Bayesian Classifier (NBC) treats attributes X1 to XN (e.g. protein markers) as conditionally independent given the class Y, which takes one of two mutually exclusive values (Cancer or Control).  This independence assumption reduces the number of terms in the joint probability and hence the amount of computation needed for inference.  Through the calculation of conditional probabilities (i.e. the probability of a protein marker being upregulated given that its class is cancer), Bayes’ Rule can be used to calculate the probability that the class is cancer given new data features (i.e. new protein marker values from the test set).
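A small sketch of this calculation is given below, assuming each marker has been reduced to a binary "upregulated or not" feature and that Laplace (add-one) smoothing is applied to the counts. Both choices, along with the toy samples and function names, are assumptions for the example rather than details from the original study.

```python
from collections import Counter


def train_nbc(samples, labels, smoothing=1.0):
    """Estimate class priors and P(marker upregulated | class) from counts."""
    class_counts = Counter(labels)
    n_features = len(samples[0])
    up_counts = {c: [0] * n_features for c in class_counts}
    for x, c in zip(samples, labels):
        for j, up in enumerate(x):
            up_counts[c][j] += int(up)
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    likelihoods = {
        c: [(up_counts[c][j] + smoothing) / (class_counts[c] + 2 * smoothing)
            for j in range(n_features)]
        for c in class_counts
    }
    return priors, likelihoods


def posterior(x, priors, likelihoods):
    """Bayes' Rule: P(class | x) proportional to P(class) * prod_j P(x_j | class)."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for j, up in enumerate(x):
            p *= likelihoods[c][j] if up else 1.0 - likelihoods[c][j]
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical training data: (marker 1 upregulated, marker 2 upregulated).
samples = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
labels = ["cancer", "cancer", "cancer", "control", "control", "control"]
priors, likelihoods = train_nbc(samples, labels)

# Posterior for a new (hypothetical) test sample with both markers upregulated.
result = posterior((1, 1), priors, likelihoods)
print(result)
```

With continuous peak intensities, the per-class likelihood of each marker would instead be modeled with a density (for example a Gaussian) or after discretization.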

 



 

 

- Decision tree approach

Another method used was a decision tree approach.  A decision tree encodes rules that separate the dataset into the two classes.  It can be formed using the C4.5 method [5, 6], which uses entropy-based metrics to choose which rule to apply to which feature component.  For example, a rule might be: If gene marker X’s probe has a value greater than 0.5, then the class of the sample is more likely to be cancer.  Rules are combined to form a tree.  The tree can then be pruned, computationally and manually, to reduce the number of rules at some cost to predictor accuracy.
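As an illustration of the entropy metric behind such rules, the sketch below scores candidate thresholds on a single marker by information gain and picks the best one. This is only the core of one split: full C4.5 normalizes the gain by the split information (the gain ratio), handles many features, builds the tree recursively, and prunes it. The marker values and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from the rule 'value > threshold'."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; keep the best gain."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: information_gain(values, labels, t))


# Hypothetical probe values for one marker and the sample classes.
values = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = ["control", "control", "control", "cancer", "cancer", "cancer"]

threshold = best_threshold(values, labels)
print("rule: marker >", threshold, "-> cancer")
print("information gain:", information_gain(values, labels, threshold))
```

On this toy data the chosen rule separates the classes perfectly, so the gain equals the full entropy of the labels; real data rarely splits this cleanly, which is why several rules are combined in a tree.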

 

- K-Nearest Neighbor (KNN)

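K-nearest neighbor algorithms classify a new input according to its k closest training samples (typically by majority vote among their classes), where closeness is measured with a distance measure such as Euclidean distance or Signal-to-Noise Ratios (SNR).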
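A minimal sketch, assuming Euclidean distance and a simple majority vote among the k nearest training samples; the two-marker toy data and function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(x, train_X, train_y, k=3, distance=euclidean):
    """Return the majority class among the k training samples closest to x."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: distance(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training samples: two marker values per sample.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
           [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]]
train_y = ["control", "control", "control", "cancer", "cancer", "cancer"]

print(knn_predict([1.0, 1.0], train_X, train_y))
print(knn_predict([0.2, 0.2], train_X, train_y))
```

The `distance` argument is left as a parameter so that a different measure can be substituted without changing the voting logic.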

Please see the additional research presentation at the Ciphergen Conference (by members of the current collaboration).
 
 

REFERENCES

[1]      C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.

[2]      P. McCullagh and J. A. Nelder, Generalized Linear Models. New York: Chapman and Hall, 1989.

[3]      I. T. Jolliffe, Principal Component Analysis. New York, NY: Springer-Verlag, 1986.

[4]      A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis. New York: Chapman & Hall, 1995.

[5]      J. R. Quinlan, "Simplifying decision trees," International Journal of Man-Machine Studies, vol. 27, pp. 221-234, 1987.

[6]      J. R. Quinlan, C4.5: Programs for Machine Learning. New York: Morgan Kaufmann, 1993.

 

Authors/contributors:
Gil Alterovitz, Manuel Aivado, Towia Libermann, Marco Ramoni, Isaac S. Kohane

 
 


 

 

Copyright (c) 2004-2006. All Rights Reserved.