
Microarray Data Analysis in BarleyBase


1. Data Transformation

2. Hierarchical Clustering

3. Partitioning Methods

4. Self-Organizing Maps (SOM)

5. Principal Components Analysis

6. Sammon's Non-Linear Mapping


1. Data Transformation

Prior to starting an analysis, certain data transformations can be applied to improve analysis performance. These include log transformation, mean- or median-centering, and scaling. Usually, mean- or median-centering is done on log2-transformed data.

Different types of adjustments may be applied on top of one another in the following sequence: log2 transformation, centering, and scaling.

Transformation options are described below:

Log2 Transform: This takes the log2 of every expression value in the expression matrix. This adjustment may improve the normality of the data distribution, which is a required assumption for some statistical computations.

Mean-Centering: This will replace each value by [value – Mean (expression values of the probe set across hybridizations)].

Median-Centering: This will replace each value by [value – Median (expression values of the probe set across hybridizations)].

Scaling: This will divide each value by the standard deviation of the expression values of the probe set across hybridizations.

If both mean-centering and scaling are performed, this is equivalent to standardization.
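
As a rough illustration, the sequence of adjustments described above can be reproduced in R as sketched below; the matrix name "expr" and the toy values are only placeholders, and BarleyBase applies its own implementation internally.

    # Toy probe-set x hybridization matrix on the raw scale (placeholder data)
    expr <- matrix(2^rnorm(20, mean = 8), nrow = 4)

    logged   <- log2(expr)                          # log2 transformation
    centered <- logged - apply(logged, 1, median)   # median-centering of each probe set
    scaled   <- centered / apply(centered, 1, sd)   # scaling by the per-probe-set standard deviation

    # Mean-centering followed by scaling is the same as standardization;
    # scale() works column-wise, hence the transposes.
    standardized <- t(scale(t(logged)))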



2. Hierarchical Clustering

Hierarchical cluster analysis is conducted with the R amap package.

2.1. Distance measures

Given two vectors x and y, the distance between them can be calculated in the following ways (a short sketch follows the list):

  • Non-centered Pearson: 1 - sum(x_i y_i) / sqrt[sum(x_i^2) sum(y_i^2)].
  • Correlation (centered Pearson): 1 - corr(x, y), where corr(x, y) is the Pearson correlation coefficient.
  • Euclidean: Usual distance between the two vectors (2-norm).
  • Maximum: Maximum distance between two components of x and y (supremum norm).
  • Manhattan: Absolute distance between the two vectors (1-norm).
  • Canberra: sum(|x_i - y_i| / |x_i + y_i|). Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.
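
The sketch below computes these measures with base R functions for two toy vectors; BarleyBase itself obtains them through the amap package, so the object names here are only illustrative.

    x <- c(1.2, 0.5, 3.1, 2.0)
    y <- c(0.9, 0.7, 2.8, 2.4)

    d_euclidean <- dist(rbind(x, y), method = "euclidean")  # 2-norm
    d_maximum   <- dist(rbind(x, y), method = "maximum")    # supremum norm
    d_manhattan <- dist(rbind(x, y), method = "manhattan")  # 1-norm
    d_canberra  <- dist(rbind(x, y), method = "canberra")   # Canberra distance

    d_centered    <- 1 - cor(x, y)                                # centered Pearson distance
    d_noncentered <- 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))   # non-centered Pearson distance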

2.2. Clustering methods

  • Single Linkage: The distances are measured between each member of one cluster and each member of the other cluster. The minimum of these distances is taken as the cluster-to-cluster distance. Single linkage adopts a 'friends of friends' clustering strategy, tends to find "loose" clusters, and may suffer from a "chaining" effect in microarray data analysis.

  • Average Linkage: The average distance of each member of one cluster to each member of the other cluster is used as a measure of cluster-to-cluster distance.

  • Complete Linkage: The distances are measured between each member of one cluster and each member of the other cluster. The maximum of these distances is taken as the cluster-to-cluster distance. Complete linkage tends to find compact, spherical clusters.

  • Ward's minimum variance method aims at finding compact, spherical clusters. The distance between two clusters is the ANOVA sum of squares between the two clusters added up over all the variables. At each generation, the within-cluster sum of squares is minimized over all partitions obtainable by merging two clusters from the previous generation. Ward's method tends to join clusters with a small number of observations, and it is strongly biased toward producing clusters with roughly the same number of observations. It is also very sensitive to outliers (Milligan 1980).
  • Centroid method: The distance between two clusters is defined as the (squared) Euclidean distance between their centroids or means. The centroid method is more robust to outliers than most other hierarchical methods but in other respects may not perform as well as Ward's method or average linkage (Milligan 1980). The centroid method was originated by Sokal and Michener (1958).
  • The remaining methods can be regarded as aiming for clusters with characteristics somewhere between those of the single and complete linkage methods.

Based on previous experience, average linkage and complete linkage may be the preferred methods for microarray data analysis.
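
A minimal sketch of such an analysis with the amap package is given below, assuming "expr" is an already transformed probe-set x hybridization matrix; the toy data and the choices of distance and linkage are placeholders mirroring the options described above.

    library(amap)

    expr <- matrix(rnorm(200), nrow = 40)   # toy data: 40 probe sets, 5 hybridizations
    hc   <- hcluster(expr, method = "correlation", link = "average")  # centered-Pearson distance, average linkage

    plot(hc)                        # dendrogram with all observations
    clusters <- cutree(hc, k = 4)   # cut the tree into a user-specified number of sub-clusters
    table(clusters)                 # number of probe sets in each sub-cluster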

2.3. Output:

  • Dendrogram with all observations.
  • Heatmap of all observations.
  • The expression line graph and heatmap for each of the sub-clusters, for a user-specified number of clusters.
  • The probe sets in each sub-cluster, which can be saved as data sets for refined analysis or comparison.



3. Partitioning Methods

There are two partitioning methods provided. Both require the user to pre-define the number of cluster centers.

3.1 PAM, or Partitioning Around Medoids, is one of the k-medoids methods. Unlike the usual k-means approach, it also accepts a dissimilarity matrix, and it is more robust because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.


The PAM-algorithm is based on the search for 'k' representative objects or medoids among the observations of the dataset, which should represent the structure of the data. After finding a set of 'k' medoids, 'k' clusters are constructed by assigning each observation to the nearest medoid. The goal is to find 'k' representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.

The distance metrics available for calculating dissimilarities between observations are "euclidean" and "manhattan". Euclidean distances are root sums of squares of differences, and Manhattan distances are sums of absolute differences.
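
A minimal sketch of a PAM run with the R cluster package is given below; the matrix "expr" and the choice of k = 4 are placeholders.

    library(cluster)

    expr <- matrix(rnorm(200), nrow = 40)           # toy data: 40 probe sets, 5 hybridizations
    fit  <- pam(expr, k = 4, metric = "manhattan")  # "euclidean" or "manhattan" dissimilarities

    fit$medoids               # the k representative observations (medoids)
    fit$clustering            # cluster assignment of every probe set
    fit$silinfo$avg.width     # average silhouette width, used to compare different k
    plot(fit)                 # bivariate cluster plot and silhouette plot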

3.2 The k-means method chooses a predefined number of cluster centers so as to minimize the within-class sum of squares from those centers. It uses Euclidean distance. When finished, all cluster centers are at the mean of their Voronoi sets (the set of data points nearest to the cluster center). The algorithm of Hartigan and Wong (1979) is used. It is most appropriate for suitably scaled continuous variables.

The start points can be chosen with hierarchical clustering, which uses Euclidean distance and average linkage, or they can be selected randomly during computation.
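
The sketch below outlines this procedure in R: hierarchical clustering with Euclidean distance and average linkage supplies the starting centers, and kmeans() then runs the Hartigan-Wong algorithm. The matrix "expr" and k = 4 are placeholders.

    expr <- matrix(rnorm(200), nrow = 40)   # toy data: 40 probe sets, 5 hybridizations
    k    <- 4

    # Starting centers: group means from a hierarchical clustering cut into k clusters
    hc     <- hclust(dist(expr, method = "euclidean"), method = "average")
    starts <- apply(expr, 2, function(col) tapply(col, cutree(hc, k), mean))

    fit <- kmeans(expr, centers = starts, algorithm = "Hartigan-Wong")
    fit$centers    # final cluster centers (means of the Voronoi sets)
    fit$withinss   # within-cluster sums of squares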

3.3. Output:

  • For PAM, a plot showing the average silhouette width over different cluster numbers, aiding in finding the optimal number of clusters.
  • For each tried number of partitions, a bivariate cluster plot visualizing the partition (clustering) of the data. All observations are represented by points in the plot, using principal components or multidimensional scaling, and an ellipse is drawn around each cluster.
  • The expression line graph and heatmap for each of the sub-clusters, for a user-specified number of clusters.
  • The probe sets in each sub-cluster, which can be saved as data sets for refined analysis or comparison.



4. Self-Organizing Maps (SOM)

4.1. Description

SOM was proposed by Kohonen (1995) and applied to microarray data analysis by Tamayo et al. (1999). The implementation used by BarleyBase is GeneSOM, the R package by Jun Yan <jyan@stat.uiowa.edu>. Default settings are used, except for the x-dimension and y-dimension, which can be set by the user.

The original paper for SOM can be viewed at:

Kohonen, Hynninen, Kangas, and Laaksonen (1995), SOM-PAK, the Self-Organizing Map Program Package (version 3.1). http://www.cis.hut.fi/research/papers/som_tr96.ps.Z
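
A minimal sketch of fitting such a map is shown below, using the som package by the same author; the exact GeneSOM interface used by BarleyBase may differ, and the 3 x 2 map size is only an example of the user-settable x- and y-dimensions.

    library(som)

    expr <- matrix(rnorm(200), nrow = 40)   # toy (transformed) expression matrix
    expr <- normalize(expr)                 # row-wise normalization of the profiles
    fit  <- som(expr, xdim = 3, ydim = 2)   # fit a 3 x 2 map of nodes

    plot(fit)      # 2-D map of per-node expression profiles
    fit$visual     # node (x, y) assignment for each probe set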

4.2. Output:

  • A plot showing the SOM as a 2-dimensional map with means and standard deviation bars, as well as the number of observations in each node.
  • The expression line plot and heatmap for probe sets in each of the nodes.
  • The list of probe sets in each node, which can be saved as data sets for refined analysis or comparison.



5. Principal Components Analysis

Principal components analysis (PCA) is often used as a data reduction technique. It finds a new coordinate system for multivariate data such that the first coordinate, a linear combination of the columns of the data matrix, has maximal variance, the second coordinate has maximal variance subject to being orthogonal to the first, and so on. A singular value decomposition (SVD) is carried out.

Eight methods are provided, differing in the matrix to which the SVD is applied:

  • method = 1:  No transformation of the data matrix. SVD is carried out on a sum-of-squares and cross-products matrix.
  • method = 2:  The observations are centered to zero mean. SVD is carried out on a variance-covariance matrix.
  • method = 3: The observations are standardized by centering to mean 0 and variance 1. SVD is carried out on a correlation matrix.
  • method = 4:  The observations are normalized by dividing by their range, and then the variance-covariance matrix is used in the SVD.
  • method = 5:  SVD is carried out on a Kendall (rank-order) correlation matrix.
  • method = 6:  SVD is carried out on a Spearman (rank-order) correlation matrix.
  • method = 7:  SVD is carried out on the sample covariance matrix.
  • method = 8:  SVD is carried out on the sample correlation matrix.

The results are shown as 2-D or 3-D scatter plots, where the first 2 or 3 principal components are used as the axes. The 3-D plots are shown from all 6 different sides.
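
As a rough illustration, methods 2 and 3 correspond to the two standard options of R's prcomp(), which performs the SVD internally; the matrix "expr" below is a placeholder.

    expr <- matrix(rnorm(200), nrow = 40)   # toy (transformed) expression matrix

    pc_cov <- prcomp(expr, center = TRUE, scale. = FALSE)  # method 2: centered, covariance-based
    pc_cor <- prcomp(expr, center = TRUE, scale. = TRUE)   # method 3: standardized, correlation-based

    summary(pc_cov)        # variance explained by each principal component
    plot(pc_cov$x[, 1:2])  # 2-D scatter plot on the first two principal components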



6. Sammon's Non-Linear Mapping

6.1. Description:

It is one of the multidimensional scaling (MDS) methods. It finds a new, reduced-dimensionality coordinate system for multivariate data such that an error criterion between the distances in the original space and the distances in the reduced space is minimized.

It is provided to help users get an idea of whether clear cluster structures exist in the data, and how many clusters are likely.

6.2. Output:

The analysis is run 30 times, and the 3 best results are shown as 3-D scatter plots, each displayed from all 6 different sides. Unfortunately, this analysis is very slow.
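
A single run of the mapping can be sketched with MASS::sammon() as below; the repeated-runs, best-of-30 selection is handled by BarleyBase itself and is not reproduced here, and "expr" is a placeholder matrix.

    library(MASS)

    expr <- matrix(rnorm(200), nrow = 40)   # toy data; sammon() requires no duplicate observations
    d    <- dist(expr)                      # distances in the original space
    fit  <- sammon(d, k = 3)                # map into 3 dimensions

    fit$stress          # the error criterion (Sammon stress) that was minimized
    pairs(fit$points)   # pairwise scatter plots of the 3 new coordinates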


 

Copyright © 2001-2005 The BarleyBase Group
All rights reserved.

For problems with the webpages, contact barleybasewebmaster