 


Weizmann Institute of Science

GeneAnnot: Version 2.2
Release: May, 2012
Affymetrix GeneChips
HG-U95Av2-E
HG-U133A-B
HG-U133 Plus2

Synchronized with
GeneCards Version 3.08




GeneAnnot Overview


General

The GeneAnnot system explores the many-to-many relationship between probe-sets and genes by directly comparing the individual probe sequences with publicly available cDNAs and predicted genes from GenBank, RefSeq and Ensembl. The transcript sequences are further identified as GeneCards genes using the GeneLoc system, which merges LocusLink and Ensembl gene indices on the basis of their genomic position.

Algorithm

GeneAnnot is implemented for the human HG-U95, HG-U133 and HG-U133 Plus 2.0 array sets, but it is applicable to any oligonucleotide set. The algorithm was executed as a three-step procedure:

  1. Probe-to-transcript mapping. Probes were mapped to full-length transcripts or ESTs as follows: all 25-mer probe sequences from the array set (typically 11-16 per probe-set) were downloaded from the Affymetrix web site and compared, using the BLAT program, to all transcript sequences from the following resources:
    • Human non-genomic sequences from GenBank's 'primate' division.
    • NCBI RefSeq sequences.
    • Ensembl transcripts.
    Probe/transcript matches were accepted if the probe alignment was in the mRNA orientation, and had no more than one mismatch. For probe-sets with no matching transcripts, the EST accessions of their representative sequences were stored.

  2. Transcript-to-gene mapping. Transcripts were mapped to GeneCards genes if possible, or otherwise to UniGene clusters, as follows:
    1. RefSeq, Ensembl and about half of the GenBank transcripts were mapped to their corresponding LocusLink/Ensembl genes, and these were further associated with GeneCards genes according to GeneLoc.
    2. GenBank entries for which there was no information on the corresponding LocusLink gene were annotated as follows: their genomic coordinates were retrieved from UCSC, and GeneLoc was used to generate a link to a GeneCards entry, whenever at least one GeneLoc-recorded exon overlapped with the UCSC coordinates.
    3. ESTs were mapped to their associated UniGene cluster.

  3. Summarized probe-set-to-GeneCards annotation. A probe is marked as associated with a GeneCards gene if it matches at least one of the transcripts related to that gene. Each probe-set/gene pair may be 'connected' via a variable number of probes, and each probe-set may be 'connected' to more than one gene. The quality of the connection between a probe-set and a gene is defined by the following scoring system:
    • The Sensitivity score is the fraction of probes in a probe-set that match a given gene. That is, it is the number of probes in the probe-set that match the gene, divided by the total number of probes in the probe-set (usually 16 in probe-sets of the U95A-E array set and 11 in probe-sets of the U133A-B and U133 Plus 2.0 array sets).
    • The Specificity score indicates to what extent the probes of a probe-set match the given gene rather than other genes. It sums the probes that match the gene, giving lower weight to probes that also match additional genes, and divides this sum by the total number of probes in the probe-set that match any gene.
    • Genes number is the total number of genes that match a given probe-set.
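
The three scores above can be sketched in a few lines of Python. This is an illustrative sketch only, not the GeneAnnot implementation: the function name score_probe_set, the input layout (one set of matched gene symbols per probe) and the gene symbols are hypothetical. The text says only that probes matching additional genes receive "lower weight"; the sketch assumes a weight of 1/k for a probe that matches k genes.

```python
from fractions import Fraction


def score_probe_set(probe_hits, gene):
    """Score one probe-set against one gene.

    probe_hits: list with one entry per probe in the probe-set; each entry
                is the set of genes that probe matches (empty if none).
    gene:       the gene for which the scores are computed.
    Returns (sensitivity, specificity, genes_number).
    """
    if not probe_hits:
        raise ValueError("a probe-set must contain at least one probe")

    matching_gene = [genes for genes in probe_hits if gene in genes]
    matching_any = [genes for genes in probe_hits if genes]

    # Sensitivity: probes matching this gene / all probes in the probe-set.
    sensitivity = Fraction(len(matching_gene), len(probe_hits))

    # Specificity: weighted count of probes matching this gene, divided by
    # the number of probes matching any gene.
    # ASSUMPTION: a probe matching k genes is weighted 1/k.
    weighted = sum((Fraction(1, len(genes)) for genes in matching_gene),
                   Fraction(0))
    specificity = weighted / len(matching_any) if matching_any else Fraction(0)

    # Genes number: distinct genes matched by any probe of the probe-set.
    genes_number = len(set().union(*probe_hits))

    return sensitivity, specificity, genes_number


# Hypothetical 11-probe probe-set: 8 probes match only GENE_A,
# 2 probes match both GENE_A and GENE_B, 1 probe matches nothing.
hits = [{"GENE_A"}] * 8 + [{"GENE_A", "GENE_B"}] * 2 + [set()]
for g in ("GENE_A", "GENE_B"):
    sens, spec, n_genes = score_probe_set(hits, g)
    print(g, sens, spec, n_genes)
```

Under the assumed 1/k weighting, this hypothetical probe-set scores a sensitivity of 10/11 and a specificity of 9/10 for GENE_A, a sensitivity of 2/11 and a specificity of 1/10 for GENE_B, and a genes number of 2. Exact fractions are used so the scores can be checked by hand.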


    Example of scores calculation: