Help of WebArray

Content

Background Subtraction

Background correction methods for microarray expression intensities.


Edwards
This method is designed to produce positive corrected intensities. A log-linear interpolation method is used to adjust lower intensities as in (Edwards 2003).
half
This method is designed to produce positive corrected intensities. Any intensity which is less than 0.5 after background subtraction is reset to be equal to 0.5.
minimum
This method is designed to produce positive corrected intensities. Any intensity which is zero or negative after background subtraction is set equal to half the minimum of the positive corrected intensities for that array.
movingmin
The background estimates are replaced with the minimums of the backgrounds of the spot and its eight neighbors, i.e., the background is replaced by a moving minimum of 3x3 grids of spots.
none
The corrected intensities are equal to the foreground intensities, i.e., the background intensities are treated as zero.
normexp
This method is designed to produce positive corrected intensities. A convolution of normal and exponential distributions is fitted to the foreground intensities using the background intensities as a covariate, and the expected signal given the observed foreground becomes the corrected intensity. This results in a smooth monotonic transformation of the background subtracted intensities such that all the corrected intensities are positive.
subtract
This function simply subtracts the background intensities from the foreground intensities which is the usual background correction method.
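The simpler methods above ("none", "subtract", "half" and "minimum") can be written down in a few lines. The following is only a minimal sketch of the rules as described on this page, not the code WebArray or limma actually runs; the function name background_correct and the sample values are made up for illustration.

```python
def background_correct(foreground, background, method="subtract"):
    """Return corrected intensities for one channel of one array.

    Hypothetical helper: only the four simplest methods are sketched.
    """
    if method == "none":
        # Background is treated as zero.
        return list(foreground)
    corrected = [f - b for f, b in zip(foreground, background)]
    if method == "subtract":
        return corrected
    if method == "half":
        # Anything below 0.5 after subtraction is reset to 0.5.
        return [max(c, 0.5) for c in corrected]
    if method == "minimum":
        # Zero or negative values become half the smallest positive value.
        positive = [c for c in corrected if c > 0]
        if not positive:
            raise ValueError("no positive corrected intensities on this array")
        floor = min(positive) / 2
        return [c if c > 0 else floor for c in corrected]
    raise ValueError(f"unsupported method: {method}")


fg = [120.0, 80.0, 50.0, 40.0]   # made-up foreground intensities
bg = [100.0, 80.0, 60.0, 30.0]   # made-up background intensities
print(background_correct(fg, bg, "subtract"))
print(background_correct(fg, bg, "half"))
print(background_correct(fg, bg, "minimum"))
```

With these values, "subtract" leaves one zero and one negative intensity, which is why the positive-only methods exist: log-ratios cannot be computed from non-positive intensities.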

Between-Array Normalization

Normalizes expression intensities so that the intensities or log-ratios have similar distributions across a series of arrays. Normalization between arrays is usually, but not necessarily, applied after normalization within arrays. An exception is vsn normalization.


Aquantile Normalization
Aquantile normalization ensures that the A-values (average intensities) have the same empirical distribution across arrays leaving the M-values (log-ratios) unchanged.
Gquantile Normalization
Gquantile normalization ensures that the green (first) channel has the same empirical distribution across arrays, leaving the M-values (log-ratios) unchanged.
PCA Normalization
Uses principal component analysis to normalize data, as proposed by (Stoyanova, Querec et al. 2004). This normalization method was originally designed for single color array experiments. In a dual color experiment, the two channels are treated as two different single color arrays for the purpose of normalization.
Rquantile Normalization
Rquantile normalization ensures that the red (second) channel has the same empirical distribution across arrays, leaving the M-values (log-ratios) unchanged.
Quantile Normalization
Quantile normalization ensures that the intensities have the same empirical distribution across arrays and across channels. Quantile normalization was proposed by (Bolstad, Irizarry et al. 2003) for Affymetrix-style single-channel arrays and by (Yang and Thorne 2003) for two-color cDNA arrays.
none
No between-array normalization.
Scale Normalization
The scale normalization method scales the log-ratios to have the same median-absolute-deviation (MAD) across arrays. The scale normalization method was proposed by (Yang, Dudoit et al. 2002) and is further explained by (Smyth and Speed 2003).
vsn Normalization
vsn normalization uses the “vsn” function from the vsn package in Bioconductor.
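To make the idea behind quantile normalization concrete, here is a minimal sketch assuming complete data with no ties and no missing values. It is not the limma implementation; the function name quantile_normalize and the sample values are illustrative only.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    arrays: list of equal-length lists of intensities, one list per array.
    Sketch only: ties are broken by position and missing values are not handled.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            # Each value is replaced by the reference value of the same rank.
            normalized[i] = reference[rank]
        result.append(normalized)
    return result


arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]   # made-up intensities
normalized = quantile_normalize(arrays)
print(normalized)
```

After normalization every array contains the same set of values, so only the ranking of spots within each array is retained.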

Contrast Design

The contrast design is used to generate the contrast matrix for linear model analysis. It is required for Affymetrix data analysis and is optional for dual color array analysis if no complex experiment design is desired. It must be used along with a design file.
For example:
If design matrix is
exp1    exp2
1       0
1       0
0       1
0       1

And you create a contrast design such as
exp2-exp1
then you will get contrast matrix as
        exp2-exp1
exp1    -1
exp2    1
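The mapping from a contrast expression to a contrast matrix column can be sketched as follows. This is only an illustration of the example above under a strong simplifying assumption: each contrast has the form "a-b" with exactly one minus sign, and coefficient names contain no hyphens. The function name contrast_matrix is hypothetical and is not part of WebArray or limma.

```python
def contrast_matrix(coefficients, contrasts):
    """Build a contrast matrix from simple 'a-b' expressions.

    Returns a dict mapping each design-matrix column name to its row of
    the contrast matrix (one entry per contrast).
    """
    matrix = {name: [0] * len(contrasts) for name in coefficients}
    for j, expression in enumerate(contrasts):
        plus, minus = (part.strip() for part in expression.split("-"))
        matrix[plus][j] += 1    # coefficient entering with weight +1
        matrix[minus][j] -= 1   # coefficient entering with weight -1
    return matrix


result = contrast_matrix(["exp1", "exp2"], ["exp2-exp1"])
print(result)
```

For the example above this gives -1 for exp1 and 1 for exp2, matching the contrast matrix shown.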

Control Gene List File

The Control Gene List File contains a list of control spots that will be used for Within-Array Normalization. The file should contain the Spot ID (spot number in array print order) of each control spot.
For example (download here):
11
38
45
57
58
108
116
155

Design File

This file is optional if no complex experiment design is desired, or no targets files were used.
A design file is a tab-delimited text file that contains a design matrix for linear model analysis.
Example 1, direct comparison of two samples.
If targets file (download here) contains
SlideNumber  FileName      Ch1                 Ch2                 Date
1            export_1.gpr  HeLa(Ctrl)          HeLa(drug-treated)  4/18/2005
2            export_2.gpr  HeLa(drug-treated)  HeLa(Ctrl)          4/18/2005

and you want to compare drug-treated (HeLa) and Ctrl (HeLa) samples,
then design file (download here) should contain
exp1
1
-1
and M (ratio) you will get is log2(drug-treated/Ctrl).
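To see why the design column (1, -1) yields log2(drug-treated/Ctrl), note that each slide's M is log2(Ch2/Ch1), so the second (dye-swapped) slide measures the same ratio with the opposite sign. The sketch below shows the least-squares estimate for a one-column design; it illustrates the arithmetic only and is not the limma fitting code. The function name and the M-values are made up.

```python
def fit_single_coefficient(design, m_values):
    """Least-squares estimate for a design matrix with a single column."""
    numerator = sum(d * m for d, m in zip(design, m_values))
    denominator = sum(d * d for d in design)
    return numerator / denominator


design = [1, -1]          # slide 1 as is, slide 2 is the dye swap
m_values = [1.5, -0.5]    # made-up log2(Ch2/Ch1) for one gene on each slide
estimate = fit_single_coefficient(design, m_values)
print(estimate)
```

Here the dye-swapped slide contributes with its sign flipped, so the estimate is the average of 1.5 and 0.5.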
Example 2, comparison of two samples using a common reference.
If targets file (download here) contains
SlideNumber  FileName      Ch1       Ch2                 Date
1            export_1.gpr  RNA(ref)  HeLa(Ctrl)          4/18/2005
2            export_2.gpr  RNA(ref)  HeLa(Ctrl)          4/18/2005
3            export_3.gpr  RNA(ref)  HeLa(drug-treated)  4/18/2005
4            export_4.gpr  RNA(ref)  HeLa(drug-treated)  4/18/2005

then design file (download here) can contain
exp1    exp2
1       0
1       0
0       1
0       1

and you will get two M (ratio) values: the first is log2(Ctrl/ref) and the second is log2(drug-treated/ref). A contrast design of exp2-exp1 then gives log2(drug-treated/Ctrl).

Expression measures

This function converts probe level data to expression values. Typically this is achieved through the following sequence:
1. reading in probe level data.
2. background correction.
3. normalization.
4. probe specific background correction, e.g. subtracting MM.
5. summarizing the probe set values into one expression measure and, in some cases, a standard error for this summary.

RMA
One of the quickest ways to compute expression using the affy package is to use the rma function. The function serves as an interface to a hard coded C implementation of the RMA method (Irizarry et al., 2003b). Currently the rma function implements RMA in the following manner:
1. Probe specific correction of the PM probes using a model based on observed intensity being the sum of signal and noise.
2. Normalization of corrected PM probes using quantile normalization (Bolstad et al., 2003).
3. Calculation of Expression measure using median polish.
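Step 3 above, median polish, can be illustrated with a small sketch. It fits an additive model (overall level + probe effect + array effect) to the log2 intensities of one probe set by alternately removing row and column medians. This is a plain textbook version written for illustration, not the C code behind the rma function; the function name, iteration limit and sample matrix are assumptions.

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-8):
    """Median polish of a probes-by-arrays matrix of log2 intensities.

    Returns (overall, probe_effects, array_effects, residuals).
    """
    nrow, ncol = len(matrix), len(matrix[0])
    resid = [row[:] for row in matrix]
    overall = 0.0
    row_eff = [0.0] * nrow
    col_eff = [0.0] * ncol
    for _ in range(max_iter):
        # Sweep out row (probe) medians.
        row_med = [median(r) for r in resid]
        for i in range(nrow):
            row_eff[i] += row_med[i]
            resid[i] = [x - row_med[i] for x in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep out column (array) medians.
        col_med = [median(resid[i][j] for i in range(nrow)) for j in range(ncol)]
        for j in range(ncol):
            col_eff[j] += col_med[j]
            for i in range(nrow):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        if max(abs(x) for x in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


# Made-up probe set: 3 probes (rows) measured on 3 arrays (columns).
probe_set = [
    [10.0, 12.0, 11.0],
    [11.0, 13.0, 12.0],
    [12.0, 14.0, 13.0],
]
overall, probe_eff, array_eff, residuals = median_polish(probe_set)
expression = [overall + a for a in array_eff]
print(expression)
```

The expression measure for each array is the overall level plus that array's effect; using medians rather than means keeps a single outlying probe from dominating the summary.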

MAS 5.0
Computes expression values that correspond to those from MAS 5.0. A detailed comparison between the MAS 5.0 values that are computed by the affy package and by Affymetrix's software can be found at http://stat-www.berkeley.edu/~bolstad/MAS5diff/Mas5difference.html.
To obtain MAS 5.0 presence calls you can use the mas5calls method.

dchip
Li and Wong's MBEI (dchip) is obtained by using bg.correct=FALSE, normalize.method="invariantset", pmcorrect.method="pmonly", summary.method="liwong".

User defined
The function expresso performs the steps background correction, normalization, probe specific correction, and summary value computation.
Important parameters for the expresso function are:
bgcorrect.method. The background correction method to use. The available methods are "mas", "none", "rma" and "rma2".
normalize.method. The normalization method to use. The available methods are "constant", "contrasts", "invariantset", "loess", "qspline", "quantiles" and "quantiles.robust".
pmcorrect.method. The method for probe specific correction. The available methods are "mas", "pmonly" and "subtractmm".
summary.method. The summary method to use. The available methods are "avgdiff", "liwong", "mas", "medianpolish" and "playerout".

Gene List File

This file is required if intensity files are created by "QuantArray".
A gene list file is a tab-delimited text file containing a list of gene IDs and associated information. The file must contain data columns (case sensitive) labeled “Block”, “Column”, “Row”, “ID” and “Name”.
For example (download here):
Block  Row  Column  ID       Name
1      1    1       MM00001  NM_018903
1      1    2       MM00005  NM_000067

GenePix Array List (GAL) file (download example here) is acceptable too.
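Because the column labels are case sensitive, it can help to check a gene list file before uploading it. The sketch below is one possible check using Python's csv module; it is not part of WebArray, the function name read_gene_list is hypothetical, and the sample rows assume the example above reads as Block 1, Row 1, Columns 1 and 2.

```python
import csv
import io

REQUIRED_COLUMNS = ("Block", "Column", "Row", "ID", "Name")


def read_gene_list(handle):
    """Read a tab-delimited gene list and verify the required column labels."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    # Comparison is exact, so "block" or "id" would be reported as missing.
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    return list(reader)


sample = (
    "Block\tRow\tColumn\tID\tName\n"
    "1\t1\t1\tMM00001\tNM_018903\n"
    "1\t1\t2\tMM00005\tNM_000067\n"
)
genes = read_gene_list(io.StringIO(sample))
print(len(genes), genes[0]["ID"], genes[1]["Name"])
```

For a file on disk, the same function can be called with an open file handle instead of the in-memory sample.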

Genome/Chromosome Location File

Genome/chromosome location file is optional and will be used for genome mapping.
It is a tab-delimited text file containing a list of genes with information about their genome location. The file should have two or more columns:

  1. The first column should be the gene SpotID (corresponding to spot number in array print order).
  2. The second column should be the chromosome number. For a single-chromosome genome, put “1” for all genes. For a multi-chromosome genome, use digits to replace “X” or “Y”. For example, use “23” for chromosome X and “24” for chromosome Y in the human genome, “20” for chromosome X and “21” for chromosome Y in the mouse genome, and “21” for chromosome X in the rat genome.
  3. The third column is optional. If available, it will be used as the order of the probes on genome, otherwise all the probes should be sorted according to their location on genome, for example, from the start of chromosome 1 to chromosome 22, chromosome X and end of chromosome Y for human genome.
It is not necessary to use all the spots on the array for this file.
For example (download here):
SpotID  Chr
2764    1
8633    1
11267   1
6591    1

Genome Mapping

Genome mapping will be executed only if a genome/chromosome location file is provided.
In the genome/chromosome location file, the spots are sorted based on their chromosome location. The microarray data are subjected to statistical analysis as described by (Wang, Yu et al. 2005) and (Clark, Edwards et al. 2003). First, a quadratic loess curve, which can be viewed as a locally weighted polynomial regression curve, is fitted through each data set. The parameter alpha in the loess function, which controls the degree of smoothing, can be adjusted with “span”. Next, regions are identified in which contiguous segments of the loess curve are consistently more than “threshold 1” standard deviations above (or below) the mean of all the data points. Having located these regions of interest, the Mann-Whitney U test is used to determine whether each selected region differs significantly (p<“threshold 2”) from the set of data points from regions that were not selected for examination by this test.
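The region-finding step can be sketched as follows, assuming the loess curve has already been computed (the smoothing itself and the Mann-Whitney U test are not shown). This is an illustration of the rule described above, not WebArray's code; the function name flag_regions, the use of the sample standard deviation and the sample values are assumptions.

```python
from statistics import mean, stdev


def flag_regions(smoothed, raw, threshold):
    """Find contiguous runs where the smoothed curve leaves the band
    mean(raw) +/- threshold * sd(raw).

    Returns a list of (start_index, end_index, sign) with sign +1 for
    runs above the band and -1 for runs below it.
    """
    center = mean(raw)
    spread = stdev(raw)
    upper = center + threshold * spread
    lower = center - threshold * spread
    regions = []
    start, sign = None, 0
    for i, value in enumerate(smoothed):
        current = 1 if value > upper else -1 if value < lower else 0
        if current != sign:
            if sign != 0:
                regions.append((start, i - 1, sign))   # close the open run
            start, sign = i, current
    if sign != 0:
        regions.append((start, len(smoothed) - 1, sign))
    return regions


# Made-up log2 ratios in genome order and a made-up smoothed curve.
raw = [0.0] * 8 + [4.0, 4.0]
smoothed = [0.1, 0.0, -0.1, 0.0, 0.1, 0.0, 0.2, 1.5, 3.2, 3.5]
regions = flag_regions(smoothed, raw, threshold=1.0)
print(regions)
```

Each flagged run would then be compared against the unflagged data points with the Mann-Whitney U test, using “threshold 2” as the p-value cutoff.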

Intensity File

A series of text files containing spot intensities exported from an image analysis program, including Affymetrix GCOS, Agilent Feature Extraction, ArrayVision, GenePix, ImaGene, QuantArray, Stanford Microarray Database (SMD), SPOT and spot.close.open. Data from other image analysis programs can be read only if the files use the appropriate column names.


Affymetrix Intensity File
Affymetrix CEL file.
Agilent Feature Extraction
Text files contain spot intensities exported from the image analysis program Agilent Feature Extraction.
ArrayVision
Text files contain spot intensities exported from the image analysis program ArrayVision.
GenePix
Text files contain spot intensities exported from the image analysis program GenePix. The column names read are “F633 Mean”, “F543 Mean”, “B633 Mean” and “B543 Mean”.
ImaGene
Text files contain spot intensities exported from the image analysis program ImaGene.
QuantArray
Text files contain spot intensities exported from the image analysis program QuantArray.
Stanford Microarray Database (SMD)
Text files contain spot intensities exported from Stanford Microarray Database (SMD).
SPOT
Text files contain spot intensities exported from the image analysis program SPOT.
spot.close.open
Text files contain spot intensities exported from spot.close.open.
User Defined Microarray Intensity File
Text files contain spot intensities. Each file must be a tab-delimited text file with the appropriate column names (case sensitive): “ID” (spot number in array print order), “ch1.Intensity” (foreground intensity for channel 1), “ch2.Intensity” (foreground intensity for channel 2), “ch1.Background” (background intensity for channel 1) and “ch2.Background” (background intensity for channel 2).
For example (download here):
IDch1.Intensitych1.Backgroundch2.Intensitych2.Background
1564061930
210057585448
3392040023

Microarray Data Analysis


Option 1: Linear Model Statistical Analysis
Using the limma package from Bioconductor, WebArray computes moderated t-statistics and log-odds of differential expression by empirical Bayes shrinkage of the standard errors towards a common value.
The package limma uses an approach called linear models to analyze designed microarray experiments. This approach allows very general experiments to be analyzed just as easily as a simple replicated experiment. The approach is outlined in (Smyth 2004) and (Yang and Speed 2002). The approach requires one or two matrices to be specified. The first is the design matrix, which indicates which samples have been applied to each array. The second is the contrast matrix, which specifies which comparisons you would like to make between the samples. For very simple experiments, you may not need to specify the contrast matrix.
If you have data from Affymetrix experiments, from single-channel spotted microarrays or from spotted microarrays using a common reference, then linear modeling is the same as ordinary analysis of variance or multiple regression except that a model is fitted for every gene. With data of this type you can create design matrices as one would do for ordinary modeling with univariate data. If you have data from spotted microarrays using a direct design, i.e., a connected design with no common reference, then the linear modeling approach is very powerful but the creation of the design matrix may require more statistical knowledge.
For statistical analysis and assessing differential expression, limma uses an empirical Bayes method to moderate the standard errors of the estimated log-fold changes. This results in more stable inference and improved power, especially for experiments with small numbers of arrays (Smyth 2004).
In addition, the false discovery rate (FDR) is estimated using the spacings LOESS histogram (SPLOSH) (Pounds and Cheng 2004). SPLOSH was proposed for estimating the conditional FDR (cFDR), the expected proportion of false positives conditioned on having k 'significant' findings.
Option 2: Normalize data only
The microarray data will be normalized and a text file will be generated. No figures will be plotted. If the output file is opened with MS-Excel, then make sure to manually move the first row one cell to the right.
The output file will contain following columns:
Block: same as “Block” column in the gene list file.
Row: same as “Row” column in the gene list file.
Column: same as “Column” column in the gene list file.
ID: same as “ID” column in the gene list file.
Name: same as “Name” column in the gene list file.
M: the log-differential expression ratio. M=log2(R) - log2(G). R: red, ch2; G: green, ch1.
A: the log-intensity of the spot, a measure of overall brightness of the spot. A=(log2(R) + log2(G))/2. R: red, ch2; G: green, ch1.
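The M and A definitions above translate directly into code. A minimal sketch, with an illustrative function name and made-up intensities:

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red (ch2) and green (ch1)
    intensities, which must both be positive."""
    log_red = math.log2(red)
    log_green = math.log2(green)
    m = log_red - log_green          # log-ratio
    a = (log_red + log_green) / 2    # average log-intensity
    return m, a


m, a = ma_values(red=1024.0, green=256.0)
print(m, a)
```

A spot four times brighter in red than in green has M = 2, and equal intensities in both channels give M = 0.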

Output Data Format

The statistically analyzed microarray data can be either not ranked (the same order as array print order) or ranked by M (ratio), p value or B statistics.
If the output file is opened with MS-Excel, then make sure to manually move the first row one cell to the right.
The output file will contain following columns:
Block: same as “Block” column in the gene list file.
Row: same as “Row” column in the gene list file.
Column: same as “Column” column in the gene list file.
ID: same as “ID” column in the gene list file.
Name: same as “Name” column in the gene list file.
M: the log-differential expression ratio. M=log2(R) - log2(G). R: red, ch2; G: green, ch1.
A: the log-intensity of the spot, a measure of overall brightness of the spot. A=(log2(R) + log2(G))/2. R: red, ch2; G: green, ch1.
t: penalized t-statistics.
p: p-values corresponding to the t-statistics.
B: B statistics, the log-odds of differential expression.
fdr: Estimated FDR incurred by setting threshold at the corresponding p value.
fp: Estimated number of false positives incurred by setting threshold at the corresponding p value.
fn: Estimated number of false negatives incurred by setting threshold at the corresponding p value.
CGH: statistically significant genes detected by CGH analysis. 1 means the gene is present or has multiple copies. -1 means the gene is absent or has fewer copies. 0 means no difference. “NA” means the gene was not used for analysis.
“M”, “A”, “t”, “p” and “B” are calculated with linear model statistical analysis.
“fdr”, “fp”, “fn” are estimated with SPLOSH.

Plots

Array image plot
Creates an image of colors that represent the values of a statistic for each spot on a spotted microarray. This function can be used to explore any spatial effects across the microarray. Four figures will be plotted for each array, including channel 1 foreground, channel 1 background, channel 2 foreground and channel 2 background.
Box plot
M box-plot will be produced before and after between-array normalization. Each box represents one single array.
CGH plot
Two figures will be plotted for CGH analysis. Figure 1 plots loess curve and figure 2 plots raw hybridization ratios. x-axis is genome/chromosome location and y-axis is log2 hybridization ratio. The y-axis scale can be adjusted with “figure 1 y-axis scale” and “figure 2 y-axis scale”.
Density plot
Creates a plot of the densities of single-channels from two-color microarray data.
Three figures will be plotted: the first plots un-normalized data, the second plots within-array normalized data, and the last plots within-array and between-array normalized data.
Histogram plot
Histogram of PM intensities for Affymetrix arrays. Each curve represents one array.
M-A plot
Creates an MA-plot with color coding for control spots. Two figures will be plotted for each array, one before within-array normalization and the other after.
Printtiploess plot
For each single array, M-A plot and M box-plot will be produced before and after print-tip loess normalization. Superimposed on M-A plot are individual lowess smoother lines, one for each pin group. In the box plot, each box represents one print pin group.
Results chart
Contains four figures for each statistical analysis. M-A plot, M-B plot, histogram of M and histogram of A.
RNA degradation plot
Uses ordered probes in probeset to detect possible RNA degradation. Within each probeset, probes are numbered directionally from the 5' end to the 3' end. Probe intensities are averaged by probe number, across all genes. Then plotAffyRNAdeg simply shows these means for each chip.

See Output Data Format for explanation of “M”, “A” and “B”.

Spot Quality Weight

The weight for each spot between 0 and 1 which indicates the reliability of the acquired intensities at that spot.
If the SPOT image analysis program is used and the size of an ideal perfectly circular spot is known to be 100 pixels, then “100” gives full weight to spots with area 100 pixels and down-weights smaller and larger spots. Spots which have zero area or are more than twice the ideal size are given zero weight.
With GenePix data, weight 0.1 will be given to any spots which receive a negative flag from the GenePix program.
With QuantArray data, weights are set equal to the column “Ignore Filter”: the weights are 0 for spots to be ignored and 1 otherwise.
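The SPOT and GenePix rules above can be sketched as small weight functions. The text states only the end points for the area rule (full weight at the ideal area, zero weight at zero area and beyond twice the ideal area); the linear fall-off in between is an assumption made here for illustration, and both function names are hypothetical.

```python
def area_weight(area, ideal):
    """Weight a spot by its area in pixels.

    Assumes a linear ramp: 0 at area 0, 1 at the ideal area,
    and back to 0 at twice the ideal area.
    """
    if area <= 0 or area >= 2 * ideal:
        return 0.0
    if area <= ideal:
        return area / ideal
    return (2 * ideal - area) / ideal


def genepix_weight(flag, flagged_weight=0.1):
    """Down-weight spots that GenePix marked with a negative flag."""
    return flagged_weight if flag < 0 else 1.0


for area in (0, 50, 100, 150, 250):
    print(area, area_weight(area, ideal=100))
print(genepix_weight(-50), genepix_weight(0))
```

With an ideal area of 100 pixels, spots of 50 and 150 pixels both receive half weight under this assumed ramp.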

Spot Type File

The Spot Type File (STF) is another optional tab-delimited text file which allows you to identify different types of spots from the gene list. This file is only used for the M-A plot, where different types of spots can be plotted in different colors.
The STF should have a column “SpotType” giving the names of the different spot types. The STF should also have columns “ID” and “Name”, matching the columns of the same names in the gene list file. Another column, “Color”, contains the color to be associated with each spot type.
For example (download here):
SpotType      ID      Name  Color
Promoter Seq  MM*     NM*   black
cDNA clones   CDNA*   *     red
Neg.Ctrl      C_SAL*  *     blue
Blank         blank*  *     yellow

In this example, the asterisks are wildcards which can represent anything. Be careful to use upper or lower case as appropriate and don't insert any extra spaces.
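The wildcard matching can be illustrated with Python's fnmatch module, which treats the asterisk the same way and offers a case-sensitive variant. This is a sketch of the idea rather than WebArray's matching code: the rows below assume the example file reads as shown in the comments, and letting later rows override earlier ones is an assumption. Note that fnmatchcase also gives special meaning to "?" and "[...]", which the STF description does not mention.

```python
from fnmatch import fnmatchcase

# (SpotType, ID pattern, Name pattern, Color), as read from the example STF.
SPOT_TYPES = [
    ("Promoter Seq", "MM*", "NM*", "black"),
    ("cDNA clones", "CDNA*", "*", "red"),
    ("Neg.Ctrl", "C_SAL*", "*", "blue"),
    ("Blank", "blank*", "*", "yellow"),
]


def classify_spot(spot_id, name, spot_types=SPOT_TYPES):
    """Return (spot_type, color) for a spot, or None if no row matches.

    Matching is case sensitive; if several rows match, the last one wins
    (an assumption made for this sketch).
    """
    matched = None
    for spot_type, id_pattern, name_pattern, color in spot_types:
        if fnmatchcase(spot_id, id_pattern) and fnmatchcase(name, name_pattern):
            matched = (spot_type, color)
    return matched


print(classify_spot("MM00001", "NM_018903"))
print(classify_spot("C_SAL01", "salmon DNA"))
print(classify_spot("mm00001", "NM_018903"))
```

The last call returns None because "mm00001" does not match "MM*" when case is respected, which is the pitfall the note about upper and lower case warns against.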

Targets File

This file is optional if no complex experiment design is desired. A design file must be used along with this file.
The targets file is a tab-delimited text file containing information about the samples used as targets in the microarray experiment. The file should have a row of column headers and one row for each array in the experiment. It must have a column named “FileName” that lists the files from the image analysis program containing raw foreground and background intensities for each slide.
Normally there are columns labeled as “Cy3” (or “Channel 1”) and “Cy5” (or “Channel 2”) or similar specifying which samples are hybridized to each channel of each array. Other columns may contain any other information associated with the arrays or targets used in the experiment.
Example 1 (download here):
SlideNumber  FileName      Ch1                 Ch2                 Date
1            export_1.gpr  HeLa(Ctrl)          HeLa(drug-treated)  4/18/2005
2            export_2.gpr  HeLa(drug-treated)  HeLa(Ctrl)          4/18/2005

Example 2 (download here):
SlideNumber  FileName      Ch1       Ch2                 Date
1            export_1.gpr  RNA(ref)  HeLa(Ctrl)          4/18/2005
2            export_2.gpr  RNA(ref)  HeLa(Ctrl)          4/18/2005
3            export_3.gpr  RNA(ref)  HeLa(drug-treated)  4/18/2005
4            export_4.gpr  RNA(ref)  HeLa(drug-treated)  4/18/2005

Within-Array Normalization

Normalize the expression log-ratios for one or more two-color spotted microarray experiments so that the log-ratios average to zero within each array or sub-array. Normalization is intended to remove from the expression measures any systematic trends which arise from the microarray technology rather than from differences between the probes or between the target samples hybridized to the arrays.
This function normalizes M-values (log-ratios) for dye-bias within each array. Apart from “none” and “median”, all the normalization methods make use of the relationship between dye-bias and intensity. The loess normalization methods were proposed by (Yang, Dudoit et al. 2001; Yang, Dudoit et al. 2002). (Smyth and Speed 2003) give a detailed statement of the methods.


Composite Loess Normalization
If a suitable set of control spots is available which are known not to be differentially expressed, the loess curve through the control spots offers security that the curve is not biased by differentially expressed genes. To be of most use in loess normalization, the control spots should span as wide a range of intensities as possible.
Global Loess Normalization
Global loess normalization is a simpler form of print-tip loess normalization that does not take sub-array variation into account.
Median Normalization
This function normalizes the median of M-values (log-ratios) within each array to be zero.
none
No within-array normalization.
Print-tip Loess Normalization
Print-tip loess normalization is recommended as a routine normalization for most of the two-color microarray experiments.
The purpose of this normalization is to adjust spatial and intensity trend (dye-bias) by considering separate loess curves for each of the print tip groups. Each M-value is normalized by subtracting from it the corresponding value of the tip group curve.
Robust Splines Normalization
Normalize the M-values for a single microarray using robustly fitted regression splines and empirical Bayes shrinkage.
This function implements an idea similar to print-tip loess normalization but uses regression splines in place of the loess curves and uses empirical Bayes ideas to shrink the individual print-tip curves towards a common value. This allows the technique to introduce less noise into good quality arrays with little spatial variation while still giving good results on arrays with strong spatial variation.
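Of the within-array methods listed above, median normalization is the simplest to write down, and it shows the common goal of centering the log-ratios of each array at zero. A minimal sketch, assuming missing M-values are represented as None; the function name and values are illustrative, and the loess and spline methods additionally model the dependence on intensity, which is not shown here.

```python
from statistics import median


def median_normalize(m_values):
    """Shift the M-values of one array so that their median is zero.

    Missing values (None) are ignored when computing the median and
    are passed through unchanged.
    """
    observed = [m for m in m_values if m is not None]
    center = median(observed)
    return [None if m is None else m - center for m in m_values]


m_values = [0.5, 1.5, None, 2.5]   # made-up log-ratios for one array
print(median_normalize(m_values))
```

Every observed M-value is shifted by the same amount, so differences between spots on the same array are unchanged.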

Within-Array Replicate Spots

Number of replicate sub-arrays within each array. This is based on the assumption that all the genes are replicated the same number of times and the spacing between the replicates is constant. A spacing of 1 means the replicates of each gene are printed consecutively. For arrays with within-array replicate spots, limma uses a pooled correlation method to make full use of the duplicated spots (Smyth, Michaud et al. 2005).
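The way the number of replicates and the spacing determine which spots belong to the same gene can be sketched as follows. Spot positions are zero-based indices in print order; the function name replicate_groups is hypothetical and this is an illustration of the layout assumption only, not of the pooled correlation method itself.

```python
def replicate_groups(n_spots, ndups, spacing):
    """Group spot indices (print order, zero-based) into replicate sets.

    Assumes every gene is printed ndups times with a constant spacing
    between its replicates.
    """
    block = ndups * spacing
    if n_spots % block != 0:
        raise ValueError("n_spots must be a multiple of ndups * spacing")
    groups = []
    for start in range(0, n_spots, block):
        for offset in range(spacing):
            # Replicates of one gene sit 'spacing' positions apart.
            groups.append([start + offset + k * spacing for k in range(ndups)])
    return groups


print(replicate_groups(4, ndups=2, spacing=1))
print(replicate_groups(4, ndups=2, spacing=2))
```

With a spacing of 1, spots 0 and 1 are replicates of the same gene; with a spacing of 2, spots 0 and 2 are.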

Reference

Bolstad, B. M., Irizarry, R. A., et al. (2003). "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias." Bioinformatics 19 (2): 185-93.
Clark, J., Edwards, S., et al. (2003). "Genome-wide screening for complete genetic loss in prostate cancer by comparative hybridization onto cDNA microarrays." Oncogene 22 (8): 1247-52.
Edwards, D. (2003). "Non-linear normalization and background correction in one-channel cDNA microarray studies." Bioinformatics 19 (7): 825-33.
Pounds, S. and Cheng, C. (2004). "Improving false discovery rate estimation." Bioinformatics 20 (11): 1737-45.
Smyth, G. K. (2004). "Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments." Statistical Applications in Genetics and Molecular Biology 3 (1): Article 3.
Smyth, G. K., Michaud, J., et al. (2005). "Use of within-array replicate spots for assessing differential expression in microarray experiments." Bioinformatics.
Smyth, G. K. and Speed, T. (2003). "Normalization of cDNA microarray data." Methods 31 (4): 265-73.
Stoyanova, R., Querec, T. D., et al. (2004). "Normalization of single-channel DNA array data by principal component analysis." Bioinformatics 20 (11): 1772-84.
Wang, Y., Yu, Q.J., et al. (2005). "Survey of differentially methylated promoters in prostate cancer cell lines." Neoplasia 8: 746-780.
Yang, Y. H., Dudoit, S., et al. (2001). "Normalization for cDNA microarray data." In _Microarrays: Optical Technologies and Informatics_, M. L. Bittner, Y. Chen, A. N. Dorsel, and E. R. Dougherty (eds), Proceedings of SPIE. 4266: 141-152.
Yang, Y. H., Dudoit, S., et al. (2002). "Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation." Nucleic Acids Res 30 (4): e15.
Yang, Y. H. and Speed, T. (2002). "Design and analysis of comparative microarray experiments." In T. P. Speed (ed.), Statistical Analysis of Gene Expression Microarray Data. CRC Press.
Yang, Y. H. and Thorne, N. P. (2003). "Normalization for two-color cDNA microarray data." In: D. R. Goldstein (ed.), _Science and Statistics: A Festschrift for Terry Speed_, IMS Lecture Notes - Monograph Series 40: 403-418.