Help for data analysis in WebArrayDB


1  Work Flow

2  Group Assignment

Currently WebArrayDB is dedicated to differential analysis. Samples (arrays or specific channels) must be assigned to different groups so that the groups can be compared. Genes that are differentially expressed among groups are reported at the end of the analysis.


3  Cross-platform Probe Alignment

3.1  Match probes

One notable feature of WebArrayDB is cross-platform probe alignment. Probes from different platforms can be aligned by any ID that was pre-defined as a reference ID, such as gene symbol, GenBank ID or RefSeq ID. After alignment, a data matrix is built in which each column comes from one channel of an array and each row holds the values of one aligned probe/gene. Currently only probes present in all platforms are kept in the data matrix for further analysis.
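The idea of keeping only probes present in all platforms can be illustrated with a minimal Python sketch. It is not WebArrayDB's implementation: the function name, the data layout and the RefSeq-like IDs are made up for illustration, and each platform is assumed to have at most one probe per reference ID (replicates are covered in section 3.2).

```python
def align_by_reference(platforms):
    """Build a data matrix from the probes shared by every platform.

    `platforms` maps a platform name to {reference_id: [channel values]}
    (a hypothetical layout chosen for this sketch).
    """
    names = list(platforms)
    # Keep only the reference IDs that occur in all platforms.
    shared = set(platforms[names[0]])
    for name in names[1:]:
        shared &= set(platforms[name])
    matrix = {}
    for ref_id in sorted(shared):
        row = []
        for name in names:
            row.extend(platforms[name][ref_id])  # one column per channel
        matrix[ref_id] = row
    return matrix


# Made-up example: platform A has two channels, platform B has one.
platform_a = {"NM_0001": [7.1, 6.8], "NM_0002": [9.3, 9.0]}
platform_b = {"NM_0002": [8.7], "NM_0003": [5.2]}
matrix = align_by_reference({"A": platform_a, "B": platform_b})
print(matrix)  # {'NM_0002': [9.3, 9.0, 8.7]}
```

Only "NM_0002" survives because it is the only ID found in both platforms.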

There are six ways to match probes in WebArrayDB:
1) quick alignment,
2) match by “idx” or “unique_id”,
3) match by shared “Reference IDs”,
4) match by “user-specified columns”,
5) match by probe-mapping files,
6) “automatic” match.

3.1.1  Quick alignment

If all involved platforms have the same number of probes, and all probes come from the same printing material, share the same sequences and are printed in the same order, users can choose the quick alignment option. WebArrayDB will then skip the matching step and treat these platforms as a single platform. This can save considerable time, since regular alignment can be very time consuming.

3.1.2  Match by “idx” or “unique_id”

The “idx” column contains the printing order (or logical positions) of probes, so probes will be matched by printing order if “idx” is chosen. “unique_id” refers to the “id” or “unique_id” column in the probe file.

3.1.3  Match by shared Reference IDs

In order to use “reference IDs” for alignment, users must provide reference IDs in the probe files when defining platforms. The steps are:

Probes are considered “identical” if they share the same reference IDs in WebArrayDB. When doing an analysis, users can choose to use one of these reference columns to align probes from different platforms, e.g., RefSeq IDs. That means probes in different platforms with the same RefSeq ID are considered as the same probe. In the eventual data matrix, these probes will be matched in one or more rows, depending on the method to match replicates.

“Identical” probes detected by reference IDs will be assigned the same “cross-platform ID”, an integer that is unique in the database. Cross-platform IDs conform to the transitivity and evolution principles and will be used for “automatic” match.
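Transitivity means that if probe X is identical to probe Y, and Y is identical to Z, then X and Z end up with the same cross-platform ID. The sketch below shows one common way to obtain this behaviour, a union-find structure. It is an illustration only: the class, its method names and the probe labels are hypothetical, how WebArrayDB actually stores and assigns the integer IDs is not described here, and the "evolution" principle is not modelled.

```python
class CrossPlatformIds:
    """Group probes that were marked identical, transitively."""

    def __init__(self):
        self._parent = {}

    def _find(self, probe):
        # Every probe starts as its own representative.
        self._parent.setdefault(probe, probe)
        while self._parent[probe] != probe:
            # Path halving keeps later lookups short.
            self._parent[probe] = self._parent[self._parent[probe]]
            probe = self._parent[probe]
        return probe

    def mark_identical(self, probe_a, probe_b):
        root_a, root_b = self._find(probe_a), self._find(probe_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_id(self, probe_a, probe_b):
        return self._find(probe_a) == self._find(probe_b)


ids = CrossPlatformIds()
ids.mark_identical(("platformA", "p1"), ("platformB", "p7"))
ids.mark_identical(("platformB", "p7"), ("platformC", "p3"))
# A and C were never compared directly, yet they share one ID.
print(ids.same_id(("platformA", "p1"), ("platformC", "p3")))  # True
```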

3.1.4  Match by “user-specified columns”

This method is similar to matching by shared “Reference IDs”, but the columns used for matching are not necessarily reference columns. In principle, any column in the probe files can be used for matching. In particular, users can select a different column for each involved platform and match probes as if these columns contained reference IDs from the same database. This makes the method very flexible.

3.1.5  Match by probe-mapping files

Steps to use probe-mapping files:

“Identical” probes detected by probe-mapping files will be assigned the same “mapping ID”, an integer that is unique in the database. Like cross-platform IDs, mapping IDs conform to the transitivity and evolution principles. They will be used for “automatic” match as well.

3.1.6  “automatic” match

Users can also choose the “automatic” option, in which WebArrayDB will use existing alignments from probe-mapping files or align probes by all available reference columns, including gene_symbol, unique_id and other reference IDs. The “idx” column in the probe file is not among the columns used for this alignment.

WebArrayDB tries mapping IDs first. If no matching probes are found, it tries cross-platform IDs. If there is still no match, the “idx” column is used.
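The fallback order described above (mapping IDs, then cross-platform IDs, then “idx”) can be sketched as follows. This is a simplified illustration, not WebArrayDB code: the field names "mapping_id", "cross_platform_id" and "name" are hypothetical, and the sketch assumes the fallback is decided per pair of platforms rather than per probe, which the text does not state.

```python
def automatic_match(platform_a, platform_b):
    """Return (key_used, matches) for the first key that yields any match."""
    for key in ("mapping_id", "cross_platform_id", "idx"):
        # Index platform B by the current key, skipping probes without it.
        by_value = {}
        for probe in platform_b:
            if probe.get(key) is not None:
                by_value.setdefault(probe[key], []).append(probe["name"])
        matches = [
            (probe["name"], other)
            for probe in platform_a
            if probe.get(key) is not None
            for other in by_value.get(probe[key], [])
        ]
        if matches:
            return key, matches
    return None, []


# Made-up probes: no mapping IDs exist, so cross-platform IDs are used.
platform_a = [
    {"name": "a1", "cross_platform_id": 12, "idx": 1},
    {"name": "a2", "cross_platform_id": 15, "idx": 2},
]
platform_b = [{"name": "b1", "cross_platform_id": 15, "idx": 1}]
print(automatic_match(platform_a, platform_b))
# ('cross_platform_id', [('a2', 'b1')])
```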

3.2  Match replicates

When a probe has replicates in one or more involved platforms, the probe alignment among different platforms can be complex because of many-to-many relationships, including cases where there are duplicate spots on the array. WebArrayDB provides six options to deal with such multiplex alignments: "median", "mean", "log mean", "shortest", "longest" and "cartesian product". For example, suppose two platforms are aligned by RefSeq ID, and one gene is represented by two probes (A1 and A2) in platform A and by three probes (B1, B2 and B3) in platform B. When "median", "mean" or "log mean" is chosen, the median, mean or log mean value of the probes with the same RefSeq ID is used to represent that gene. If "shortest" is chosen, there will be two matches: A1 vs B1, and A2 vs B2. The option "longest" makes one more match in addition to those made for "shortest": A3 vs B3, where "A3" is the log mean value of A1 and A2. For "cartesian product", each probe with that RefSeq ID in platform A is matched to every probe with the same RefSeq ID in platform B, resulting in 6 (2 x 3) matches in total for this example.
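The pairing rules for "shortest", "longest" and "cartesian product" in the example above can be reproduced with a short Python sketch. It works on probe labels only and is an illustration, not WebArrayDB's code. The function name is hypothetical, the filler label "logmean(...)" merely stands for the log mean value mentioned in the text, and reusing one filler for every surplus probe is an assumption for cases with more than one extra probe.

```python
from itertools import product


def match_replicates(probes_a, probes_b, option):
    """Pair replicate probes of one gene across two platforms (labels only)."""
    if option == "cartesian product":
        return list(product(probes_a, probes_b))
    # "shortest": pair probes in order and stop at the shorter side.
    pairs = list(zip(probes_a, probes_b))
    if option == "shortest":
        return pairs
    if option == "longest":
        n = len(pairs)
        # The shorter side is padded with a summary of its own replicates.
        fill_a = "logmean(" + ",".join(probes_a) + ")"
        fill_b = "logmean(" + ",".join(probes_b) + ")"
        pairs += [(fill_a, b) for b in probes_b[n:]]
        pairs += [(a, fill_b) for a in probes_a[n:]]
        return pairs
    raise ValueError(f"unsupported option: {option}")


a, b = ["A1", "A2"], ["B1", "B2", "B3"]
print(match_replicates(a, b, "shortest"))     # [('A1', 'B1'), ('A2', 'B2')]
print(match_replicates(a, b, "longest")[-1])  # ('logmean(A1,A2)', 'B3')
print(len(match_replicates(a, b, "cartesian product")))  # 6
```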

4  Data Normalization

Normalization is a procedure to minimize systematic noise before differential analysis and is strongly recommended. Users are encouraged to read the details of each normalization option before deciding which one to use. In general, there are four steps of normalization.

  1. Background correction
  2. Within-array normalization
  3. Between-array normalization (within a platform)
  4. Cross-platform normalization

The first three steps are done before probe alignment, and the last step is done after cross-platform probe alignment.

Background correction, within-array normalization and between-array normalization (within platform) are also among the functions provided in WebArray.

Cross-platform normalization means data normalization for arrays from different platforms. All between-array normalization methods are available for cross-platform normalization. In addition, three further cross-platform methods are implemented in WebArrayDB:

For QD normalization, the parameter “number of bin” has to be set. It must be a power of 2 (2, 4, 8, …), and its default is 8.
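The constraint on this parameter can be checked with a few lines of Python. The function below is a hypothetical helper written for illustration, not part of WebArrayDB. It assumes that 2 is the smallest accepted value, as the list in the text starts there.

```python
def validate_number_of_bins(value=8):
    """Return `value` if it is a power of 2 (2, 4, 8, ...), else raise ValueError."""
    is_int = isinstance(value, int) and not isinstance(value, bool)
    # A positive integer is a power of 2 when exactly one bit is set.
    if not is_int or value < 2 or (value & (value - 1)) != 0:
        raise ValueError(f"number of bin must be a power of 2, got {value!r}")
    return value


print(validate_number_of_bins())    # 8
print(validate_number_of_bins(16))  # 16
```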

For homologous platforms (e.g. different development versions of a user-spotted slide with a few probes changed), any between-array normalization method may be used. In such cases, between-array normalization within a platform is unnecessary.

5  Differential Analysis

5.1  Algorithms

WebArrayDB offers several algorithms for differential analysis: Student’s t-test, eBayes-moderated t-test, SAM, ANOVA/ANCOVA and non-parametric tests.

5.2  Blocked or paired data

If the data cannot be treated as blocked/paired, WebArrayDB will ignore this option and perform a regular analysis based on intensity.

The meaning of blocked/paired data depends on the selected algorithm.

5.3  ANOVA model

ANOVA/ANCOVA can be used to investigate the effects of multiple factors/variables. Here variables differ from factors in the type of their values: the values of a variable are numbers (integer or float), while the values of a factor are strings, even if they consist of digits. Using factors/variables defined by the user or found in the database, a user can define the linear model for ANOVA/ANCOVA or help WebArrayDB to define it.
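The distinction by value type can be made concrete with a small Python sketch. The function is a hypothetical illustration of the rule stated above, not WebArrayDB's own check. Rejecting columns with mixed types is an assumption of the sketch.

```python
def classify(values):
    """Return "variable" for numeric values and "factor" for string values."""
    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(is_number(v) for v in values):
        return "variable"
    if all(isinstance(v, str) for v in values):
        return "factor"
    raise ValueError("values must be all numbers or all strings")


print(classify([1, 2.5, 3]))      # variable
print(classify(["1", "2", "3"]))  # factor (strings, although they are digits)
```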

Note that the model will be a mixed-effect model if any random-effect factor/variable is used. Fitting a mixed-effect model can be computationally very time consuming.

Factors/variables fall into three categories:

The factor “group”
    This is the basic factor WebArrayDB aims to investigate. When the user defines two or more groups, a factor “group” is defined automatically.
Factors in the database
    Information stored in the database can be used as factors. Typically, these are platform, sample, dye, array and individual (sample individual).
User-defined factors/variables
    A table allows users to define factors/variables if “ANOVA” is chosen as the algorithm.

Based on experiment designs, there are four options to build a model for ANOVA/ANCOVA in WebArrayDB.

5.3.1  Use “group” only

In this case, “group” is considered the only factor that affects the intensity data. This option is suitable for simple experiment designs.

5.3.2  Try to use factors in database

WebArrayDB will attempt to use the “group” factor and as many other factors in the database as possible to build a model for ANOVA. Currently the following factors will be tried:

When users select some of these factors, WebArrayDB will try to use them to build a model for ANOVA. Any factor that is not suitable will be removed automatically.

5.3.3  User-defined factors/variables

WebArrayDB will use the “group” factor and all user-defined factors/variables to build the ANOVA model. This option offers a flexible way to analyze data when the information in the database is insufficient for a specific experiment design.

5.3.4  User-defined model

There are several significant features/advantages of a user-defined model:

5.4  Contrasts

This option describes the comparisons users want to make between “groups”. For example, “group2 - group1” means the user wants to compare group2 with group1. In the analysis results, M will represent the log base 2 ratio of group2/group1. Multiple comparisons can be separated by “,” or “;”. By default, if the Contrast box is left empty, all other groups are compared to the first one, i.e. “group2 - group1; group3 - group1; group4 - group1” if there are four groups in total.
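The default rule and the two separators can be sketched in Python as follows. Both helper functions are hypothetical and only mirror the behaviour described above. They are not WebArrayDB's parser.

```python
import re


def default_contrasts(groups):
    """Compare every other group to the first one."""
    first = groups[0]
    return "; ".join(f"{group} - {first}" for group in groups[1:])


def split_contrasts(text):
    """Split a Contrast box entry on "," or ";" and drop empty pieces."""
    return [part.strip() for part in re.split(r"[;,]", text) if part.strip()]


groups = ["group1", "group2", "group3", "group4"]
print(default_contrasts(groups))
# group2 - group1; group3 - group1; group4 - group1
print(split_contrasts("group2 - group1, group3 - group1"))
# ['group2 - group1', 'group3 - group1']
```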

Generally, a comparison is defined by group names separated by “+” and/or “-”. Do not repeat a group name within one comparison. These limitations are removed if you use LIMMA-based analysis, which allows more flexible comparisons built with pairs of parentheses “()”, “/” and numbers. For example, experienced users can try something like “(group4 - group3) - (group2 - group3)”, or “group3 - (group2 + group1)/2”.
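To see what such an expression means, it helps to expand it into one coefficient per group. The Python sketch below does this by evaluating the expression once per group with that group set to 1 and all others set to 0. It is an illustration under stated assumptions, not the way WebArrayDB or LIMMA parse contrasts: the function name is hypothetical, the expression must be linear in the group names, and `eval` is used only because this is a sketch with trusted input.

```python
from fractions import Fraction


def contrast_coefficients(expression, groups):
    """Return the coefficient of each group in a linear contrast expression."""
    def evaluate(active):
        values = {g: Fraction(1 if g == active else 0) for g in groups}
        # Only the group names are visible to the expression.
        return eval(expression, {"__builtins__": {}}, values)

    baseline = evaluate(None)  # all groups set to 0
    return {g: evaluate(g) - baseline for g in groups}


coefs = contrast_coefficients(
    "group3 - (group2 + group1)/2", ["group1", "group2", "group3"]
)
print({g: str(c) for g, c in coefs.items()})
# {'group1': '-1/2', 'group2': '-1/2', 'group3': '1'}
```

The result reads as: group3 is compared with the average of group1 and group2.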

6  Other analysis tools

Some other analysis methods have already been introduced into WebArrayDB, e.g., hierarchical clustering, heatmap, correspondence analysis, between group analysis, and genome or comparative genomic hybridization (CGH) plotting. These analyses share a common option:

6.1  Cluster data

This analysis produces a clustering chart. Depending on the users’ requirements, WebArrayDB can cluster groups or data channels. A successful clustering requires at least three groups or data channels.

6.2  Heatmap

This analysis produces a heat map with two-dimensional clustering. The groups (or data channels) are clustered in the horizontal direction, which gives a cluster chart similar to the one produced by “Cluster data” above. Probes are clustered in the vertical direction.

6.3  Correspondence Analysis (COA)

COA finds outliers of probes. It requires at least two groups. Please refer to [5, 6].

6.4  Between Group Analysis (BGA)

BGA finds outliers of probes. It requires at least three groups. Please refer to [2, 3].

6.5  Plot genome

This function is designed for plotting intensity values or ratios along with locations of probes on the genome. Generally, the plotting is based on two-channel data - the first two groups, in which “group1” is used as the green channel (G, or the input channel), and “group2” as the red channel (R, or the output channel).

6.5.1  Options

The main options are listed below.

6.5.2  Outputs

At each plotting unit, i.e., a genome segment (see “Plot genome segment” in section 6.5), three charts will be plotted (see Figure 1 and Table 1):

  1. Ratios of intensity values. (Note that the ratios are log base 2 transformed.)
  2. Intensity values on the positive strand
  3. Intensity values on the negative strand

Figure 1: Genome charts

Table 1: Legends for genome plotting

  Spots indicate the log base 2 transformed ratios for each probe. Colors are used to indicate probe orientations and significance of p values.
      red: positive strand, with significant p values
      pink: positive strand, without significant p values
      blue: negative strand, with significant p values
      light blue: negative strand, without significant p values

  ▵, ▿ Triangles indicate locations of genes. Upwards for start, downwards for end. The name (or number) between a pair of triangles specifies a gene using “gene_symbol” or a part of it.
      red: positive strand
      blue: negative strand

  Squares specify locations with significant p values from differential analysis. Colors are used to indicate probe orientations and regulated direction: output (group2) - input (group1).
      red: positive strand, down regulated
      pink: positive strand, up regulated
      blue: negative strand, down regulated
      light blue: negative strand, up regulated

  Purple horizontal bars indicate adjacent probes associated with significant p values.

  | Purple vertical bars indicate locations of potential transposons.

  Curve: Intensity values (after log base 2 transformation) of probes along the genome. Colors are used to indicate probe orientations and regulated direction: output (group2) - input (group1).
      brown curve: the input data (group1)
      green curve: the output data (group2)

6.6  Transposon analysis

The purpose of “transposon analysis” is to identify the locations of transposons on the genome. This analysis can be carried out only when the nucleic acids for hybridization were amplified by primers on transposons. In addition, the probe file for the involved microarray platform must contain the necessary information for genome plotting.

6.6.1  Options

Transposon analysis is done on the basis of genome plotting (see section 6.5). While all options in section 6.5 are applicable for transposon analysis, two additional options are used as well:

6.6.2  Outputs

Transposon analysis outputs “genome plotting” (see Figure 1) with vertical bars indicating locations of transposons (see Table 1).

Five additional TAB-delimited files will also be created:

  1. The first file, whose name ends with “transposon_table_all_probes.txt”, is the largest file produced by transposon analysis. It contains all probes that appear in the result file from differential analysis.
  2. The second file, whose name ends with “transposon_table.txt”, contains only the first two probes for each transposon. It is a subset of the first file.
  3. The third file, whose name ends with “transposon_table_all_genes.txt”, contains the first occurrence of every gene_symbol in the second file. If a gene_symbol is not found in the second file, its first occurrence in the first file is used instead. It therefore contains all genes, whether or not a transposon is present. This file is a subset of the first file.
  4. The fourth file, whose name ends with “transposon_table_by_gene.txt”, contains the first occurrence of every gene_symbol in the second file. It is a subset of both the second and the third file.
  5. The fifth file, whose name follows the pattern “transposon_table_p_X.XX.txt”, contains the rows with significant p values (from differential analysis). It is a subset of the second file.

6.7  Bacterium CGH analysis

6.7.1  Options

The users can set up three break points to make four intervals (spaces) for the ratios. Ratios in different spaces will be plotted in distinguished colors: RED, BLUE, GRAY, and GREEN.
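The mapping from a ratio to a color can be sketched with Python's `bisect` module. The color order follows the legend in Table 2. The function name and the break point values are made up, and sending a ratio that equals a break point to the upper interval is an assumption, since the text does not say how boundaries are handled.

```python
from bisect import bisect_right

# Interval colors from lowest to highest, as listed in Table 2.
COLORS = ("red", "blue", "gray", "green")


def ratio_color(log_ratio, break_points):
    """Return the plotting color for a log ratio, given three sorted break points."""
    if len(break_points) != 3 or sorted(break_points) != list(break_points):
        raise ValueError("expected three break points in ascending order")
    # bisect_right counts how many break points lie at or below the ratio.
    return COLORS[bisect_right(break_points, log_ratio)]


breaks = [-1.0, 0.0, 1.0]  # made-up break points
print([ratio_color(r, breaks) for r in (-2.0, -0.5, 0.5, 2.0)])
# ['red', 'blue', 'gray', 'green']
```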

6.7.2  Outputs

Bacterium CGH analysis outputs a slightly different “genome plotting” in which only ratios are plotted (see Figure 2). The differences in legends are listed in Table 2.

Figure 2: Genome charts

Table 2: Legends for bacterium CGH plotting

  Spots indicate the log base 2 transformed ratios for each probe. Colors indicate the interval, defined by the break points, in which a ratio falls.
      red: ratios below the first (lowest) break point
      blue: ratios between the first two break points
      gray: ratios between the last two break points
      green: ratios above the last (highest) break point

An additional TAB-delimited file, whose name ends with “bacCGH_table.txt”, is produced with a gene-wise (by “gene_symbol”) summary of the number of probes and the intensity values (log base 2 transformed).

7  Miscellaneous options

Some global options can be defined in this section.

8  FAQs

8.1  How to deal with technical replicates?

Here is an example of using technical replicates. Assume a project involves three biological replicates for each of two conditions (i.e. 6 samples in total), compared on two-channel arrays, condition 1 versus condition 2. Every sample was also dye-swapped, so we end up with six arrays with two channels each, and each sample has two technical replicates:

       samples for condition 1:     A1, A1, A2, A2, A3, A3
       samples for condition 2:     B1, B1, B2, B2, B3, B3

We will explain how to use ANOVA and the eBayes-moderated t-test (in LIMMA) to compare samples between the two conditions.

8.1.1  Using ANOVA

Create two groups with samples in order:

       group1:    A1, A1, A2, A2, A3, A3
       group2:    B1, B1, B2, B2, B3, B3

Choose the third option for ANOVA model - Use following user-defined factors as well as "Group", and define a factor with the following values:

       Factor name:     facsamp
       Factor type:     random
       Data type:     string
       Factor values:     A1, A1, A2, A2, A3, A3, B1, B1, B2, B2, B3, B3
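The factor values are listed in the same order as the samples, group1 first and then group2. The Python sketch below lays this out as one row per channel and checks that every biological sample has two technical replicates. It only illustrates the bookkeeping. The row layout is hypothetical and no ANOVA is fitted.

```python
from collections import Counter

group1 = ["A1", "A1", "A2", "A2", "A3", "A3"]
group2 = ["B1", "B1", "B2", "B2", "B3", "B3"]
facsamp = ["A1", "A1", "A2", "A2", "A3", "A3",
           "B1", "B1", "B2", "B2", "B3", "B3"]

# One row per channel: the sample, its group, and its random-factor level.
channels = group1 + group2
group_labels = ["group1"] * len(group1) + ["group2"] * len(group2)
rows = [
    {"sample": s, "group": g, "facsamp": f}
    for s, g, f in zip(channels, group_labels, facsamp)
]

# Each level of the random factor should cover two technical replicates.
replicates = Counter(row["facsamp"] for row in rows)
print(len(rows), len(replicates), set(replicates.values()))  # 12 6 {2}
```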

8.1.2  Using LIMMA

Create six groups:

       group1:    A1, A1
       group2:    A2, A2
       group3:    A3, A3
       group4:    B1, B1
       group5:    B2, B2
       group6:    B3, B3

The comparison to make is:
       (group4 + group5 + group6 - group1 - group2 - group3)/3


References

[1] Gary A Churchill. Fundamentals of experimental design for cDNA microarrays. Nat Genet, 32 Suppl 2:490–495, Dec 2002.
[2] Aedín C Culhane, Guy Perrière, Elizabeth C Considine, Thomas G Cotter, and Desmond G Higgins. Between-group analysis of microarray data. Bioinformatics, 18(12):1600–1608, Dec 2002.
[3] Aedín C Culhane, Guy Perrière, and Desmond G Higgins. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics, 4:59, Nov 2003.
[4] H. Liu, F. Hussain, C. L. Tan, and M. Dash. Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6:393–423, 2002.
[5] G. Perrière, J. R. Lobry, and J. Thioulouse. Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Comput Appl Biosci, 12(6):519–524, Dec 1996.
[6] Guy Perrière and Jean Thioulouse. Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Comput Methods Programs Biomed, 70(2):99–105, Feb 2003.
[7] G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3:Iss. 1, Article 3, 2004.
[8] G. K. Smyth, J. Michaud, and H. Scott. The use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics, 21(9):2067–2075, 2005.
[9] Jörn Tödling and Rainer Spang. Assessment of five microarray experiments on gene expression profiling of breast cancer. 2003.
[10] V. G. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 98(9):5116–5121, Apr 2001.
