Asuragen Glossary of Terms
MicroRNA (miRNA)
miRNA Biology
Transcription: miRNAs are initially expressed as part of transcripts
termed primary miRNAs (pri-miRNAs) (Lee 2002). They are apparently transcribed by
RNA Polymerase II, and include 5' caps and 3' poly(A) tails (Smalheiser 2003, Cai
2004). The miRNA portion of the pri-miRNA transcript likely forms a hairpin with
signals for dsRNA-specific nuclease cleavage.
Hairpin release in the nucleus: The dsRNA-specific ribonuclease
Drosha digests the pri-miRNA in the nucleus to release the hairpin precursor miRNA
(pre-miRNA) (Lee 2003). Pre-miRNAs appear to be approximately 70 nt RNAs with 1–4
nt 3' overhangs, 25–30 bp stems, and relatively small loops. Drosha also generates
either the 5' or 3' end of the mature miRNA, depending on which strand of the pre-miRNA
is selected by RISC (Lee 2003, Yi 2003).
Export to the cytoplasm: Exportin-5 (Exp5) seems to be responsible
for the export of pre-miRNAs from the nucleus to the cytoplasm. Exp5 has been shown
to bind directly and specifically to correctly processed pre-miRNAs. It is required
for miRNA biogenesis, with a probable role in coordination of nuclear and cytoplasmic
processing steps (Lund 2003, Yi 2003).
Dicer processing: Dicer is a member of the RNase III superfamily
of bidentate nucleases that has been implicated in RNA interference in nematodes,
insects, and plants. Once the pre-miRNA is in the cytoplasm, Dicer cleaves it approximately
19 bp from the Drosha cut site (Lee 2003, Yi 2003). The resulting double-stranded
RNA has 1–4 nt 3' overhangs at either end (Lund 2003). Only one of the two strands
is the mature miRNA; some mature miRNAs derive from the leading strand of the pri-miRNA
transcript, while for others the lagging strand is the mature miRNA.
Strand selection by RISC: To control the translation of target mRNAs,
the double-stranded RNA produced by Dicer must separate into single strands, and the
single-stranded mature miRNA must associate with the RISC (Hutvagner 2002). Selection of the active
strand from the dsRNA appears to be based primarily on the stability of the termini
of the two ends of the dsRNA (Schwarz 2003, Khvorova 2003). The strand with lower
stability base pairing of the 2–4 nt at the 5' end of the duplex preferentially
associates with RISC and thus becomes the active miRNA (Schwarz 2003).
miRNA Regulation of Translation: Virtually all of the miRNAs that
have been studied in animals reduce steady state protein levels for the targeted
gene(s) without impacting the corresponding levels of mRNA (Olsen 1999). The mechanism
by which miRNAs reduce protein levels is not fully understood, but one study involving
the C. elegans lin-4 miRNA/lin-14 mRNA pair indicates that lin-4 miRNA does not
affect the poly(A) tail length, transport to the cytoplasm, or entry into polysomes
of the lin-14 mRNA (Olsen 1999). If this observation holds true for all animal
miRNAs, then downstream steps such as translational elongation, translational termination,
or protein stability are likely influenced by miRNAs. Mounting evidence suggests
that miRNAs function via an enzyme complex similar to the one used by siRNAs.
miRNA Data Analysis
Deliverables
Project Description: The Project Description includes the
Asuragen RNA description (RNA_desc), hybridization ID (hyb_ID), and the experimental
parameters for your samples. It also provides a key, which can be used
to associate your sample names with the Asuragen ID for all data, results, and
figure files.
BioArray QC: Information on the sample background, signal intensities,
and threshold values for each array is reported. The percentage of miRNAs
determined to be present in a given sample is also provided.
Raw Signal: The “raw signal” for each miRNA is obtained by subtracting
the maximum of the local background and negative control signals from the foreground
signal, averaged across the two technical replicate spots for each miRNA on the array.
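The raw signal calculation described above can be sketched as follows. This is a minimal illustration and not Asuragen's implementation: the function name, the argument layout, and the choice to subtract per spot before averaging are assumptions, since the description does not state the order of subtraction and averaging. The intensity values are invented for the example.

```python
from statistics import mean

def raw_signal(foreground, local_background, negative_control):
    """Background-subtracted signal for one miRNA (hypothetical helper).

    foreground, local_background: one value per technical replicate spot.
    negative_control: negative control signal for the array (assumed scalar).
    """
    # Subtract whichever is larger: the spot's local background or the
    # negative control signal. Then average over the replicate spots.
    corrected = [
        fg - max(bg, negative_control)
        for fg, bg in zip(foreground, local_background)
    ]
    return mean(corrected)

# Two replicate spots for one miRNA (made-up intensities).
example = raw_signal([1200.0, 1100.0], [150.0, 250.0], 200.0)
```

In this example the first spot is corrected by the negative control (200) and the second by its local background (250), because each is the larger of the two candidates for that spot.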
Array Normalized Signal: The median-scaled, log2-transformed
intensities for each miRNA on the array. The median of the present signals for each
array (spots above the threshold value) is used for the scaling. Blank cells
correspond to miRNAs that were below the signal threshold.
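A minimal sketch of the array normalization described above is shown here. It assumes that "median-scaled" means dividing each present signal by the median of the present signals before taking log2; the actual scaling target is not stated in this glossary. The function name and miRNA names are hypothetical, and None stands in for a blank cell.

```python
from math import log2
from statistics import median

def array_normalize(signals, threshold):
    """Median-scale and log2-transform the signals of one array.

    signals: dict mapping miRNA name -> raw signal.
    Returns a dict with None for miRNAs at or below the threshold.
    """
    present = [s for s in signals.values() if s > threshold]
    if not present:
        return {name: None for name in signals}
    scale = median(present)  # median of present signals only
    return {
        name: (log2(s / scale) if s > threshold else None)
        for name, s in signals.items()
    }

# Made-up raw signals for four hypothetical miRNAs on one array.
signals = {"miR-A": 400.0, "miR-B": 100.0, "miR-C": 1600.0, "miR-D": 20.0}
normalized = array_normalize(signals, threshold=50.0)
```

Here miR-D falls below the threshold, so it is excluded from the median and reported as blank.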
Global Normalized Signal: The global normalized signal is generated by
applying Variance Stabilization Normalization (VSN; Huber et al., 2002) to
all the arrays within the project. These numbers provide the basis of further figures
and analysis. Please note that the values are on a generalized logarithm base 2
(glog2) scale. To convert to a generalized fold-change, raise 2 to the power of the
difference in glog2 values.
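As a small worked example of the conversion described above, the sketch below turns a difference in glog2 values into a generalized fold-change. The function name and the input values are illustrative only.

```python
def generalized_fold_change(glog2_a, glog2_b):
    """Generalized fold-change of group A relative to group B."""
    return 2.0 ** (glog2_a - glog2_b)

fc_up = generalized_fold_change(10.5, 8.5)   # glog2 difference of +2
fc_down = generalized_fold_change(7.0, 8.0)  # glog2 difference of -1
```

A difference of +2 on the glog2 scale corresponds to a generalized fold-change of 4, and a difference of -1 corresponds to 0.5.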
DEM (Differentially Expressed microRNA): The miRNAs for which significant
differences in expression are identified between groups. Significance is defined
by statistical analysis (ANOVA or t-test), with the false discovery rate set to
0.05. The mean values, differences in expression in glog2 scale, p-values
with significant flags, and miRNA annotations for a complete pair-wise comparison
of those miRNAs are reported.
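The sketch below shows one common way to flag significant results at a false discovery rate of 0.05, starting from p-values that have already been computed by a t-test or ANOVA. It uses the Benjamini-Hochberg step-up procedure, which is an assumption: this glossary does not name the FDR method used. The function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True if significant at the given FDR."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is flagged as significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant

# Made-up p-values for five miRNAs.
p_values = [0.001, 0.045, 0.035, 0.20, 0.012]
flags = benjamini_hochberg(p_values, fdr=0.05)
```

Note that 0.045 and 0.035 are below 0.05 but are not flagged, because the procedure compares each p-value against a rank-dependent threshold rather than against 0.05 directly.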
Statistics: The VSN-transformed glog2 expression values
within the project are summarized. Also reported are the mean, maximum, and minimum
expression intensities of the samples within each experimental group, along with
the % present calls for all miRNAs in the experiment.
mRNA
mRNA Normalization/Summarization
MAS 5.0: The Affymetrix MAS 5.0 algorithm calculates the signal
value from the combined, background-adjusted PM and MM values of the probes in one
probe set.
The process is outlined as follows:
- Cell intensities are preprocessed for global background
- An ideal mismatch value is calculated and subtracted to adjust
the PM intensity
- The adjusted PM intensities are log-transformed to stabilize the
variance
- Tukey’s biweight estimator is used to provide a robust mean
of the resulting values
- Signal is output as the antilog of the resulting value
- Finally, the signal is scaled using a trimmed mean
The MAS 5.0 algorithm is applied on a chip-by-chip basis, not across an
entire set of chips.
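The log-transform, robust-mean, and antilog steps listed above can be sketched as follows. This is a simplified illustration, not the Affymetrix implementation: it uses a one-step Tukey biweight with tuning constants c = 5 and epsilon = 0.0001, which should be treated as assumptions here. It also omits the background, ideal mismatch, and trimmed-mean scaling steps. The function names and intensities are hypothetical.

```python
from math import log2
from statistics import median

def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a mean that down-weights outliers."""
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - center) / (c * mad + epsilon)  # scaled distance from center
        # Bisquare weight; values far from the center get weight 0.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def probe_set_signal(adjusted_pm):
    """Log-transform adjusted PM values, take a robust mean, return the antilog."""
    log_values = [log2(v) for v in adjusted_pm]
    return 2.0 ** tukey_biweight(log_values)

# Five adjusted PM intensities (made up); the last one is an outlier.
signal = probe_set_signal([100.0, 110.0, 95.0, 105.0, 5000.0])
```

In this example the outlying probe receives a weight of zero, so the signal stays within the range of the other four probes instead of being pulled upward as an ordinary mean would be.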
RMA: Robust Multi-array Average (RMA), proposed by Irizarry et al. (2003), adjusts
gene expression values obtained from hybridization of Affymetrix® GeneChip®
arrays. The method fits a robust linear model
to the probe-level data, analyzing each hybridized chip in the context of the other
chips in the experiment. The algorithm consists of three steps: a model-based background
correction stage that neutralizes the effects of background noise, a
quantile normalization stage that aligns expression values to a common distribution,
and an iterative median polishing procedure that summarizes the data and generates
a single expression value for each probe set.
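The quantile normalization and median polish stages described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the published RMA implementation: the background correction stage is omitted, ties in quantile normalization are not handled specially, and the function names and data are hypothetical.

```python
from statistics import mean, median

def quantile_normalize(columns):
    """Give every array (column) the same distribution of values."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [mean([col[k] for col in sorted_cols]) for k in range(n)]
    normalized = []
    for col in columns:
        ranks = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(ranks):
            out[i] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized

def median_polish(matrix, max_iter=10, tol=1e-6):
    """Summarize a probes x arrays matrix (log scale) to one value per array."""
    rows, cols = len(matrix), len(matrix[0])
    resid = [row[:] for row in matrix]
    overall, row_eff, col_eff = 0.0, [0.0] * rows, [0.0] * cols
    for _ in range(max_iter):
        change = 0.0
        for r in range(rows):  # sweep out row (probe) medians
            m = median(resid[r])
            row_eff[r] += m
            resid[r] = [x - m for x in resid[r]]
            change += abs(m)
        m = median(col_eff)
        overall += m
        col_eff = [e - m for e in col_eff]
        for c in range(cols):  # sweep out column (array) medians
            m = median([resid[r][c] for r in range(rows)])
            col_eff[c] += m
            for r in range(rows):
                resid[r][c] -= m
            change += abs(m)
        m = median(row_eff)
        overall += m
        row_eff = [e - m for e in row_eff]
        if change < tol:
            break
    return [overall + e for e in col_eff]

normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
# Three probes (rows) measured on three arrays (columns), log2 scale.
probe_matrix = [[8.0, 10.0, 9.0], [9.0, 11.0, 10.0], [7.0, 9.0, 8.0]]
expression = median_polish(probe_matrix)
```

In the median polish example, each probe differs from the others by a constant offset, so the procedure separates the probe effects from the per-array expression values.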
GC-RMA: GC-RMA is a modification of the RMA algorithm that replaces
the model used in the background correction stage with a more sophisticated computation.
It uses each probe’s sequence information to adjust the measured intensity for
the effects of non-specific binding due to the differences in bond strength between
the two types of base pairs. It also takes into account the optical noise present
in data acquisition for greater accuracy and sensitivity. The two steps
of the RMA algorithm following background correction, namely the global, cross-chip
normalization and the summarization through median polishing, remain unchanged.
PLIER: The PLIER method improves expression estimates by accounting
for experimentally observed patterns in probe behavior and by handling error
appropriately at low and high signal values.
Benefits include:
- Higher reproducibility of signal (lower coefficient of variation)
without loss of accuracy
- Higher differential sensitivity for low expressors, specifically
below 2 picomolar concentration
- Dynamic estimation of most informative probes to determine signal
This method was developed by building upon many of the concepts that have been published
recently in the field of GeneChip data analysis, including model-based expression
analysis and robust multichip analysis. It also builds upon the signal algorithm
provided in MAS 5.0 by taking into account experimental data in weighting probes
to determine the overall probe set signal. Like other model-based approaches, PLIER
accounts for the difference between probes by means of a parameter called probe
affinity. (Probe affinity represents the strength of signal produced at a specific
concentration for a given probe.) PLIER estimates the signal for the entire probe
set more accurately by utilizing these inherent probe affinities, empirical probe
performance, and by handling error appropriately across low and high concentrations.
Probe affinities are calculated using experimental data across multiple arrays.
PLIER also utilizes an error model that assumes error depends on the probe, rather
than on the signal alone.
All of the methods listed above seek to normalize the signal values across arrays.
Bioinformatics: Clustering/Classification
Clustering
Clustering is a method that partitions genes or samples into clusters such that members
of a cluster are more similar to each other than they are to members of other
clusters. Clustering is a useful exploratory technique for gene-expression data
when there is an expectation that there are patterns of gene expression but the
exact nature of those patterns is unknown, as it groups similar objects together and
allows the biologist to identify potentially meaningful relationships between the
objects (either genes or experiments or both). This differs from classification,
where the identity (or the within-group pattern) of the groups is known beforehand.
The following clustering methods can be employed:
Hierarchical Clustering: Hierarchical clustering creates a hierarchical
tree of similarities between the samples, called a dendrogram, which is often displayed
alongside a heatmap. The most common implementation is agglomerative hierarchical
clustering, which starts with a family of clusters with one sample each, and merges
the clusters iteratively, based on some distance measure, until there is only one
cluster left. Array quality can be roughly assessed using hierarchical clustering.
Ideally, common samples should cluster into similar classes.
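The agglomerative procedure described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and average linkage, neither of which is specified in this glossary. The function names, sample names, and expression values are hypothetical.

```python
from math import dist

def average_linkage(members_a, members_b, samples):
    """Mean pairwise Euclidean distance between two clusters."""
    total = sum(dist(samples[a], samples[b]) for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))

def agglomerative(samples):
    """Merge the two closest clusters until one remains.

    samples: dict mapping sample name -> list of expression values.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = {name: [name] for name in samples}  # start with one sample each
    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        best = None
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                a, b = labels[i], labels[j]
                d = average_linkage(clusters[a], clusters[b], samples)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters with their union.
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges

# Made-up expression profiles (two genes) for four samples.
samples = {
    "tumor_1": [1.0, 1.0],
    "tumor_2": [1.0, 2.0],
    "normal_1": [8.0, 8.0],
    "normal_2": [9.5, 8.0],
}
merges = agglomerative(samples)
```

The merge history is the information a dendrogram displays: the two tumor samples join first, then the two normal samples, and the two groups join last.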
K-means Clustering: The K-means method is known as a partitional
method. The user predefines the number of clusters, after which
the algorithm partitions the data iteratively until the cluster assignments converge.
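A minimal sketch of the K-means iteration is given below. To keep the example reproducible it takes the first k points as the starting centroids, which is an assumption made for illustration; practical implementations usually choose starting centroids at random and repeat the run several times. The function name and data are hypothetical.

```python
from math import dist

def k_means(points, k, max_iter=100):
    """Partition points into k clusters; returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic start
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Six made-up two-dimensional expression profiles.
points = [[1.0, 1.0], [9.0, 9.0], [1.5, 2.0], [8.0, 8.0], [1.0, 0.5], [9.0, 8.5]]
assignment, centroids = k_means(points, k=2)
```

The algorithm alternates between assigning points and updating centroids, and it stops when a full pass leaves every assignment unchanged.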
Principal Components Analysis (PCA): PCA is designed to capture
the variance in a dataset in terms of principal components, reducing the dimensionality
of the data from many thousands to only a handful of the most informative components.
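The idea can be sketched for the first principal component as follows. This illustration finds the leading eigenvector of the covariance matrix by power iteration, a technique chosen here only because it needs no external library; a practical analysis of thousands of genes would use an optimized numerical library instead. The function name and data are hypothetical, and the sketch assumes the starting vector is not orthogonal to the first component.

```python
from math import sqrt

def first_principal_component(data, n_iter=200):
    """Return (direction, explained variance fraction, per-sample scores).

    data: list of samples, each a list of expression values (one per gene).
    """
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix between genes.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0 / sqrt(d)] * d  # starting direction
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # renormalize to unit length
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, explained, scores

# Four samples x three genes (made up). Gene 2 is twice gene 1,
# and gene 3 is nearly constant.
data = [
    [2.0, 4.0, 1.0],
    [4.0, 8.0, 1.1],
    [6.0, 12.0, 0.9],
    [8.0, 16.0, 1.0],
]
direction, explained, scores = first_principal_component(data)
```

Because the first two genes vary together, a single component captures nearly all of the variance, and each sample can be described by one score instead of three expression values.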
Affymetrix File Terms
GCOS: GeneChip Operating Software automates the control of GeneChip
Fluidics Stations and Scanners, acquires data, and manages sample and experimental
information. It also performs analysis of gene expression files utilizing the Affymetrix
Statistical Algorithm. This software generates the files listed below.
Experiment File (*.EXP): This file contains the parameters of the
experiment such as Array Type, Experiment Name, Equipment Parameters, Sample Description,
and others.
Image Data File (*.DAT): This file is the pixelated image file
generated by the scanner from the array after processing on the Fluidics Station.
Cell Intensity File (*.CEL): The cell file contains the processed
cell intensities from the primary image in the *.DAT file. Asuragen uses this file
for further analysis.
Probe Array Results File (*.CHP): The .chp file is the output file
from the GeneChip Operating Software expression analysis of the probe array utilizing
the Affymetrix Statistical Algorithm. The .chp file contains the data that can be used
for other analyses.
Report File (*.RPT): The report file is generated from the .chp
file. This expression report summarizes information about expression analysis settings,
probe set hybridization intensity data, and other quality metrics for sample and
array performance.
Data Transfer Tool File (*.DTT): This file contains your complete
raw data, packaged and ready to be imported into your copy of Affymetrix GCOS software.
It consists of compressed files containing any combination of .dat, .cel and .chp
files of a database / project / sample / experiment.
Library Files (.cif, .cdf, .psi): These files are unique to each
probe array type and contain information for scanning and analysis parameters, array
design, and probe information.