Weizmann Institute of Science
GeneAnnot:
Version 2.2
Release: May, 2012
Affymetrix GeneChips
HG-U95Av2-E
HG-U133A-B
HG-U133 Plus2
Synchronized with
GeneCards Version 3.08
GeneAnnot Overview
General
The GeneAnnot system explores the many-to-many relationship
between probe-sets and genes, by directly comparing the
individual probe sequences with publicly available cDNAs
and predicted genes from GenBank, RefSeq and Ensembl. The
transcript sequences are then assigned to GeneCards genes
using the GeneLoc system, which merges the LocusLink and Ensembl gene indices on
the basis of their genomic positions.
Algorithm
GeneAnnot is implemented for the human HG-U95, HG-U133 and HG-U133 Plus 2.0 array sets,
but the approach is applicable to any oligonucleotide set. The algorithm was
executed as a three-step procedure:
- Probe-to-transcript mapping. Probes were mapped to
full-length transcripts or ESTs as follows: all 25-mer probe
sequences from the array set (typically 11-16 per probe-set)
were downloaded from the Affymetrix web site and compared,
using the BLAT program, to all transcript
sequences from the following resources:
  - Human non-genomic sequences from GenBank's 'primate' division.
  - NCBI RefSeq sequences.
  - Ensembl transcripts.
Probe/transcript matches were accepted if the probe alignment
was in the mRNA orientation and had no more than one
mismatch. For probe-sets with no matching transcripts, the
EST accessions of their representative sequences were stored.
- Transcript-to-gene mapping. Transcripts were mapped
to GeneCards genes if possible, or otherwise to UniGene clusters, as follows:
  - RefSeq, Ensembl and about half of the GenBank transcripts
    were mapped to their corresponding
    LocusLink/Ensembl genes, and these were further
    associated with GeneCards genes according to GeneLoc.
  - GenBank entries for which there was no information on the
    corresponding LocusLink gene were annotated as follows:
    their genomic coordinates were retrieved from UCSC, and
    GeneLoc was used to generate a link to a GeneCards entry,
    whenever at least one GeneLoc-recorded exon overlapped
    with the UCSC coordinates.
  - ESTs were mapped to their associated UniGene cluster.
- Summarized probe-set-to-GeneCards annotation. A probe is marked
as associated with a GeneCards gene if it matches at least
one of the transcripts related to that gene. Each probe-set/gene
pair may be 'connected' via a variable number of probes.
Also, each probe-set may be 'connected' to more than one gene.
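The acceptance rule and the probe-to-gene association described above can be sketched as follows. This is a minimal illustration, not the GeneAnnot implementation: the `Alignment` record, the use of "+" to denote the mRNA orientation, the function names and all probe, transcript and gene identifiers are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple

class Alignment(NamedTuple):
    """One probe/transcript alignment (hypothetical record layout)."""
    probe_id: str
    transcript_id: str
    strand: str       # assumption: "+" means the probe aligns in the mRNA orientation
    mismatches: int

MAX_MISMATCHES = 1    # "no more than one mismatch", as stated in the text

def accept(aln):
    """Keep an alignment only if it is in mRNA orientation with at most one mismatch."""
    return aln.strand == "+" and aln.mismatches <= MAX_MISMATCHES

def probes_to_genes(alignments, transcript_to_gene):
    """Associate each probe with every gene that has at least one accepted matching transcript."""
    probe_genes = defaultdict(set)
    for aln in alignments:
        if not accept(aln):
            continue
        gene = transcript_to_gene.get(aln.transcript_id)
        if gene is not None:          # transcripts without a gene assignment are skipped
            probe_genes[aln.probe_id].add(gene)
    return dict(probe_genes)

# Hypothetical input data
alignments = [
    Alignment("p1", "tx_a", "+", 0),  # accepted
    Alignment("p1", "tx_b", "+", 1),  # accepted (one mismatch)
    Alignment("p2", "tx_a", "-", 0),  # rejected: wrong orientation
    Alignment("p3", "tx_a", "+", 2),  # rejected: too many mismatches
    Alignment("p4", "tx_c", "+", 0),  # accepted, but tx_c has no gene assignment
]
transcript_to_gene = {"tx_a": "GENE_A", "tx_b": "GENE_B"}

result = probes_to_genes(alignments, transcript_to_gene)
print(result)
```

In this example only probe p1 ends up linked to genes, and it is linked to two of them, which illustrates the many-to-many relationship between probes and genes.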
The quality of the connection between a probe-set and a gene is defined by the following scoring system:
  - The Sensitivity score
    is the fraction of probes in a probe-set that match a given gene.
    Namely, it is the number of probes in the probe-set that match the gene,
    divided by the total number of probes in the probe-set
    (usually 16 in probe-sets of the U95A-E array set and 11 in
    probe-sets of the U133A-B and U133 Plus 2.0 array sets).
  - The Specificity score
    indicates to what extent the probes of a probe-set are specific to the gene.
    It sums the matching probes, giving lower weight to
    probes that also match additional genes, and divides the sum by the total
    number of probes that matched any gene.
  - Genes number is the total number of genes that
    match a given probe-set.
Example of score calculation:
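The three scores can be illustrated with a small sketch. The text does not give the exact weighting used for the Specificity score, so the code assumes that a probe matching N genes contributes 1/N to each of them; this weighting, the function name and the probe and gene identifiers are all assumptions for illustration.

```python
from collections import defaultdict

def score_probeset(probe_to_genes, probeset_size):
    """Compute per-gene scores for one probe-set.

    probe_to_genes: dict mapping probe id -> set of genes matched by that probe
                    (an empty set means the probe matched no gene).
    probeset_size:  total number of probes in the probe-set.
    """
    matched_any = [genes for genes in probe_to_genes.values() if genes]
    count = defaultdict(int)       # probes matching each gene
    weight = defaultdict(float)    # weighted probe count for each gene
    for genes in matched_any:
        for gene in genes:
            count[gene] += 1
            weight[gene] += 1.0 / len(genes)   # ASSUMED weighting: 1/N for a probe matching N genes
    return {
        gene: {
            "sensitivity": count[gene] / probeset_size,
            "specificity": weight[gene] / len(matched_any),
            "genes_number": len(count),
        }
        for gene in count
    }

# Hypothetical probe-set of 11 probes: 8 match GENE_A only,
# 2 match both GENE_A and GENE_B, and 1 matches no gene.
example = {f"p{i}": {"GENE_A"} for i in range(1, 9)}
example.update({"p9": {"GENE_A", "GENE_B"}, "p10": {"GENE_A", "GENE_B"}, "p11": set()})

scores = score_probeset(example, probeset_size=11)
for gene, s in sorted(scores.items()):
    print(gene, s)
```

Under the assumed weighting, GENE_A has Sensitivity 10/11 and Specificity (8 + 2 x 0.5)/10 = 0.9, GENE_B has Sensitivity 2/11 and Specificity (2 x 0.5)/10 = 0.1, and Genes number is 2 for this probe-set.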