 



Weizmann Institute of Science

Version: v1.3
Release:
24 April 2007
Synchronized with: GeneCards v2.36
GeneLoc v2.36
GeneAnnot v1.5




  GeneTide - Terra Incognita Discovery Endeavor

GeneTide is an automated system for the annotation of human transcripts (mRNAs and ESTs) and the elucidation of de-novo genes.



  Background

To date, GeneCards, like many other major gene indices, including its two major resources LocusLink and Ensembl, is based mainly on full-length mRNA sequences and genes predicted from genomic data. Since sequencing full-length mRNA is both time-consuming and costly, the mRNA sequences for many genes are not yet available, and numerous genes are therefore absent from the gene index.

Since the early 1990s, high-throughput methods have generated a substantial number (>5 million) of Expressed Sequence Tags (ESTs), which now offer the most extensive window onto the entire human transcriptome and the genes encoded within it. Unfortunately, given their fragmentary nature (typically 400-600 bases) and their error rate (1-3% sequencing errors), assigning each of these ESTs to a gene has proven difficult. Previous and ongoing projects designed to address this problem have resulted in various gene lists that exhibit only partial overlap.



  Goals

GeneTIDE aims to integrate various data resources in order to create a comprehensive list of human genes. This is done by associating the set of approximately 5.5 million human ESTs currently available from dbEST, together with mRNA sequences from GenBank, with the set of ~35,000 human genes as defined in GeneCards. Each transcript (mRNA or EST) can thereby be :

  1. Proven to belong to an existing GeneCards gene

  2. Used to define de-novo genes

  3. Demonstrated to be an artifact or a contaminant (genomic DNA, vector, etc.), and therefore discarded.



  Methods

  GeneTIDE's generation process consists of two major stages.

  1. Association of transcripts with existing GeneCards genes. This is done by integration of data from several resources.

    • Resources :

      • UniGene - UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Some of these UniGene clusters are associated with LocusLink gene identifiers, which are in turn connected to a specific GeneCards gene. This gene is therefore applied to all the transcript members of the cluster.

      • DoTS - DoTS (Database Of Transcribed Sequences) is a human and mouse transcript index created from all publicly available transcript sequences. The input sequences are clustered and assembled to form the DoTS Consensus Transcripts ("dots") that comprise the index. As with UniGene clusters, the "dots" are associated with a LocusLink gene and hence with a GeneCards gene.

      • AceView - AceView offers an integrated view of the human and nematode genes as reconstructed by alignment of all publicly available mRNAs and ESTs on the genome sequence. The association of each transcript with a gene is extracted and incorporated into GeneTIDE.

      • GeneAnnot - GeneAnnot is an in-house automated system for annotation of Affymetrix HG-U95 probe sets, which is revised and improved by direct sequence comparison of probes to GenBank, RefSeq and Ensembl mRNA sequences. As probe sets are derived from transcripts (mRNA or EST), the annotation given by GeneAnnot to a specific probe set is applied in GeneTIDE to the transcript from which it was derived.

      • GeneLoc - Genomic locations for the transcripts in question, obtained using BLAT, were downloaded from UCSC's genome browser database and compared with data from GeneLoc. GeneLoc is an exon-based system that forms part of the GeneCards suite of databases and integrates data from LocusLink and Ensembl to create a unified location for each gene. The genes located in the same genomic region as the BLAT hit for a specific EST were recorded.

    • Scoring scheme :

      The various gene annotations obtained for each transcript by the five aforementioned methods (UniGene, DoTS, AceView, BLAT & GeneLoc, and GeneAnnot) are integrated into a single scoring scheme which is composed of two major parameters each ranging between 0 and 1. Consensus (Co) and Uniqueness (Uq) are defined for a pair (E,G), where E is an EST and G is a GeneCards gene. A third value, score, summarizes these two parameters into one. For a more detailed view on how these parameters are calculated click here.

  2. Definition of new transcript clusters as a basis for new genes. The remaining, previously unannotated transcripts are subjected to a second stage of analysis. To define groups of transcripts that form a new gene, clustering data from UniGene, DoTS and AceView is used. A group of more than one transcript (ESTs and mRNA; the size threshold is a quality consideration) in which every member belongs to the same UniGene, DoTS and AceView clusters is defined as an EST based Gene Candidate (EGC).

    EGCs are annotated with various parameters that help determine the probability that they truly form a new, previously undefined gene. Alternatively, an EGC may be shown to represent a previously unknown exon (e.g. due to alternative splicing) of an existing gene. The annotation includes :

    • Identifiers of the UniGene, DoTS and AceView clusters which contain the EGC's transcripts
    • Genomic location of each transcript
    • Indication of whether the transcripts are spliced or not
    • Indication of whether the transcripts include sequences that were detected by RepeatMasker as contaminated
    • Expression vectors according to GeneNote data of probe sets derived from this EGC's member transcripts
    • The identity and distance (in base pairs) of the GeneCards gene closest to the majority of member transcripts
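
The EGC definition in stage 2 can be illustrated with a minimal Python sketch. It rests on one interpretation of the text: transcripts are grouped by an identical (UniGene, DoTS, AceView) triple of cluster identifiers, and transcripts missing from any of the three clusterings are skipped. The function name, the input layout, and the accessions and cluster names in the example are hypothetical and not taken from GeneTIDE itself.

```python
from collections import defaultdict


def build_egcs(transcripts):
    """Group unannotated transcripts into candidate EGCs.

    `transcripts` maps a GenBank accession to a hypothetical
    (unigene, dots, aceview) tuple of cluster identifiers.
    """
    groups = defaultdict(list)
    for accession, clusters in transcripts.items():
        # Assumption: a transcript absent from any clustering cannot join an EGC.
        if None in clusters:
            continue
        groups[clusters].append(accession)
    # Keep only groups with more than one transcript (the quality threshold).
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Made-up accessions and cluster names, for illustration only.
example = {
    "X00001": ("Hs.A", "DT.A", "ace.A"),
    "X00002": ("Hs.A", "DT.A", "ace.A"),
    "X00003": ("Hs.A", "DT.B", "ace.A"),  # differs in its DoTS cluster
    "X00004": ("Hs.C", None, "ace.C"),    # missing from DoTS
}
egcs = build_egcs(example)
```

Only the first two transcripts share all three clusters, so they form the single candidate in this example.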


  Searching GeneTIDE

GeneTIDE is a transcript centered database. Therefore, most annotation data is given from a transcript point of view.

  Simple query

GeneTIDE can be queried by using one of the following search keys :

  • GenBank ID - GenBank accession, e.g. AA001049, AA844004

  • U95 Probe Set - The name of an Affymetrix U95 probe set, e.g. 1001_at

  • U133 Probe Set - The name of an Affymetrix U133 probe set, e.g. 201042_at

  • GeneCards gene - The symbol of an existing GeneCards gene, e.g. FMR1, CFTR

  • TIDE EGC number - The accession of an EGC, e.g. EGC18988, EGC7750

  • UniGene Hs. number - The name of a UniGene cluster, e.g. Hs.2, Hs.7245

  • DoTS transcript - The name of a DoTS cluster, e.g. DT.104964


  Multiple Sequences Queries

Multiple sequences queries are useful for finding the genes associated with more than one transcript. One common use of this function is the annotation of microarray probe sets: a probe set is assigned the same gene that GeneTide assigned to the transcript from which the probe set was derived.

To run a batch query, either upload a file containing GenBank IDs (see example) or simply paste the GenBank IDs into the query window. After submitting the multiple sequences query you will get a link to the result file, which can be downloaded for your use. Click here for more details on the output file format.


  EGC List

The EGC list displays all the EST based GeneCards Candidates (EGCs) defined by GeneTide in the current version. The list is sorted by the size (number of transcripts) of the EGCs. For each EGC, various annotation parameters are displayed. These include :

  1. EGC - The name of the EGC.
  2. Size - Number of transcripts contained within the EGC.
  3. RepMask - The number of transcripts within the EGC found by RepeatMasker to contain repetitive elements.
  4. Expressed - The number of transcripts that had probe sets derived from them and that, according to GeneNote data, show an expression pattern.
  5. Spliced - The number of transcripts, among the EGC's members, which have a known splice site.
  6. CNG_GC_ID - The Common Nearest Gene (CNG) is the gene that is the nearest gene for the greatest number of the EGC's member transcripts. The GC ID of this gene is displayed here.
  7. CNG_Symbol - The gene name (or symbol) of the Common Nearest Gene.
  8. CNG_distance - The average distance (in bp) from the Common Nearest Gene of those transcripts for which it is the nearest gene.
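
As a rough illustration of the Common Nearest Gene columns, the following Python sketch picks the gene that is nearest to the most member transcripts and averages the distances of those transcripts only. The input layout (one `(gene_symbol, distance_bp)` pair per member transcript) and the tie-breaking rule (first gene encountered wins) are assumptions; the distances are made up.

```python
from collections import Counter


def common_nearest_gene(nearest):
    """Return (gene, average distance in bp) for an EGC.

    `nearest` holds one (gene_symbol, distance_bp) pair per member
    transcript, naming that transcript's nearest GeneCards gene.
    """
    counts = Counter(gene for gene, _ in nearest)
    # most_common keeps first-seen order among ties (assumed tie-break).
    gene, _ = counts.most_common(1)[0]
    # Average only over transcripts whose nearest gene is the CNG.
    distances = [dist for g, dist in nearest if g == gene]
    return gene, sum(distances) / len(distances)


# Made-up distances, for illustration only.
cng = common_nearest_gene([("FMR1", 1000), ("CFTR", 500), ("FMR1", 3000)])
```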



  Interpreting the results

Query results will be displayed in accordance with the search key used :

  GenBank ID


  General links

The first table offers links to commonly requested EST related data :


  Validation Level

An indication of the probability that this is a valid transcript and not an artifact or contamination. This value ranges from 0 to 4, corresponding to the following verbal descriptions :

For annotation purposes, we consider levels 3 and 4 as valid transcripts, levels 0 and 1 as invalid transcripts, and level 2 as inconclusive.

The validation level is calculated in the following manner. Each of the three resources (UniGene, DoTS and AceView) that has not filtered this transcript out of its analysis earns the transcript one point. Next, RepeatMasker is applied to the transcript's genomic location. If RepeatMasker reports any repetitive element in this region, the transcript loses a point. The highest (most valid) level is 4.
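
The calculation can be sketched in Python as follows. The text does not say where the fourth point comes from, so this sketch assumes that a transcript free of repeats keeps one point that a RepeatMasker hit would remove, which yields the stated 0 to 4 range. The function and parameter names are hypothetical.

```python
def validation_level(in_unigene, in_dots, in_aceview, repeat_masked):
    """Return a validation level between 0 and 4.

    Each boolean `in_*` flag says the resource did NOT filter the
    transcript out; `repeat_masked` says RepeatMasker found a repeat.
    """
    # One point per resource that kept the transcript.
    points = sum([in_unigene, in_dots, in_aceview])
    # Assumption: the RepeatMasker point is held unless a repeat is found.
    if not repeat_masked:
        points += 1
    return points


def classify(level):
    """Map a level to the annotation categories given in the text."""
    if level >= 3:
        return "valid"
    if level <= 1:
        return "invalid"
    return "inconclusive"


level = validation_level(True, True, False, False)
verdict = classify(level)
```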


  Resources Annotation

A summary table showing the gene association given by each of the five resources (UniGene, DoTS, AceView, GeneAnnot, BLAT & GeneLoc) to this transcript.


  Genomic location based data

Information regarding the queried transcript, acquired by aligning it against the genome using BLAT.

  Results Summary


  Gene Association -

The results summary displays the "bottom line" of associations between the transcript in question and existing GeneCards genes. The table shows all genes that were associated with the transcript via any of the resources. This is followed by an indication of which resources support the specific association, and the Consensus (Co) and Uniqueness (Uq) values that summarize them. Score is a value ranging between 0 and 1 that combines the previous two values into one. Rank is the position of the specific gene among all genes associated with this transcript. Genes are ranked by Score and then by Uniqueness (Uq).
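
The ranking step can be sketched in Python. The formula for Score is not given on this page; the sketch uses the root mean square of Co and Uq, which happens to reproduce the two sample rows in the Multiple Sequences Queries output (CAPZB and MADH2) but is an inference from those rows, not a documented definition. The function names and tuple layout are hypothetical.

```python
import math


def combined_score(co, uq):
    # Root mean square of Co and Uq: inferred from sample rows, not documented.
    return math.sqrt((co ** 2 + uq ** 2) / 2)


def rank_genes(associations):
    """Rank (gene_symbol, co, uq) tuples by score, then by Uq, highest first."""
    scored = [(gene, co, uq, combined_score(co, uq)) for gene, co, uq in associations]
    scored.sort(key=lambda row: (row[3], row[2]), reverse=True)
    return [
        (rank, gene, round(score, 6))
        for rank, (gene, _co, _uq, score) in enumerate(scored, start=1)
    ]


# Co and Uq values copied from the sample output rows for BE874217.
ranked = rank_genes([("MADH2", 0.2, 0.5), ("CAPZB", 0.6, 0.833333)])
```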

  Contamination level -

The probability that this transcript is contaminated is given in terms of 3 categories, according to the contamination score calculated earlier :

  • Not contaminated : Contamination levels 0-1
  • Inconclusive : Contamination level 2
  • Contaminated : Contamination levels 3-4

  U95 Probe Set

Querying Affymetrix U95 GeneChip probe sets will yield the GenBank ID result page for the transcript from which the requested probe set was derived.

  U133 Probe Set

Same as for U95 probe sets.

  GeneCards gene

Querying for a GeneCards gene symbol will yield a table in which all transcripts associated with this GeneCards gene are displayed. The transcripts are sorted in descending order of Rank, Score and Uniqueness (Uq) values. Each table row shows the GenBank accession of the transcript, which resources support its association with the current gene, and the Co, Uq and Score values of this association as they appear in the transcript's own annotation page.

  TIDE EGC number

EST based Gene Candidates (EGCs) are clusters of transcripts suggested by GeneTIDE as putative novel genes, accompanied by a large variety of annotations that help determine the likelihood that the EGC is a true new gene.


  EGC gene properties

The first table offers links to the UniGene, DoTS and AceView clusters from which this EGC was constructed.


  Member transcripts

A list of transcripts constituting this EGC. For each transcript we show :

  • Transcript type - EST or full length mRNA.
  • Genomic location - Chromosome, strand and coordinates.
  • Splicing - Whether the transcript is spliced or not.
  • RepeatMasker - Whether the transcript is contaminated by a repetitive element, according to RepeatMasker.


  Expression patterns

If any of the transcripts within this EGC had Affymetrix U95 GeneChip probe sets derived from them, the expression pattern of these probe sets across 12 normal human tissues, according to GeneNote data, will be displayed here. For more information regarding GeneNote click here.


  Common Nearest Gene

This field shows the GeneCards gene that is the nearest gene for the largest number of transcripts in this EGC.

  UniGene Hs. number

Displays a short description of the requested UniGene cluster and the GenBank accessions contained within this cluster, along with hyperlinks to the individual GeneTIDE pages of these transcripts.

  DoTS transcript

Displays a short description of the requested 'DoTS Transcript' cluster and the GenBank accessions contained within it, along with hyperlinks to the individual GeneTIDE pages of these transcripts.

  Multiple Sequences Queries

The output file generated after conducting a multiple sequences query is divided into up to 3 parts - Annotated transcripts, EGCs, and Unknowns.
Sample output:

  genbank_id gc_id gene_symbol Co Uq score rank
  BE874217 GC01M019135 CAPZB 0.6 0.833333 0.726101 1
  BE874217 GC18M043619 MADH2 0.2 0.5 0.380789 2
  #########EGC#############
  AA001179 EGC6762
  AA001185 EGC5458
  AA001199 EGC9493
  AA001205 EGC9460
  #########UNKNOWN#########
  DE874217

The first section, containing the gene associations of annotated transcripts, is equivalent to the integrated scores table obtained when querying each GenBank ID individually. In the second section, EGC, transcripts that belong to a newly defined EST based GeneCards Candidate (EGC) are displayed in the first column, and the EGC identifier is displayed in the second column. The third section, Unknown, lists the queried GenBank IDs that were neither associated with a known gene nor belong to an EGC.
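
For downstream use, the result file can be split into its three sections with a short Python sketch like the one below. It assumes the layout of the sample output: one record per line, whitespace-separated fields, a header line starting with `genbank_id`, and section markers made of `#` characters around the section name. The function name and the dictionary keys are hypothetical.

```python
SAMPLE = """\
genbank_id gc_id gene_symbol Co Uq score rank
BE874217 GC01M019135 CAPZB 0.6 0.833333 0.726101 1
BE874217 GC18M043619 MADH2 0.2 0.5 0.380789 2
#########EGC#############
AA001179 EGC6762
AA001185 EGC5458
#########UNKNOWN#########
DE874217
"""


def parse_output(text):
    """Split a result file into annotated, EGC and unknown sections."""
    sections = {"annotated": [], "egc": [], "unknown": []}
    current = "annotated"  # the file starts with the annotated transcripts
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("genbank_id"):
            continue  # skip blank lines and the column header
        if line.startswith("#"):
            current = line.strip("#").lower()  # e.g. "egc" or "unknown"
            sections.setdefault(current, [])
            continue
        fields = line.split()
        if current == "annotated":
            acc, gc_id, symbol, co, uq, score, rank = fields
            sections[current].append({
                "genbank_id": acc, "gc_id": gc_id, "gene_symbol": symbol,
                "Co": float(co), "Uq": float(uq),
                "score": float(score), "rank": int(rank),
            })
        elif current == "egc":
            sections[current].append((fields[0], fields[1]))
        else:
            sections[current].append(fields[0])
    return sections


result = parse_output(SAMPLE)
```

Splitting on any whitespace means the same code handles both tab-separated and space-separated files.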



  About the author

GeneTide was created by Maxim Shklar (maxim.shklar@weizmann.ac.il, shklar@bigfoot.com) during 2003-2004, as part of the thesis for an M.Sc. degree in Life Science - Bioinformatics at the Feinberg Graduate School, Weizmann Institute of Science, Rehovot, Israel.

  Current GeneTide team

GeneTide is currently maintained by Shany Ron, Tsippi Iny Stein and Ohad Greenshpan from the GeneCards team.
GeneTide is a member of the GeneCards suite of databases.


  Last updated : 25 July 2005
