Genome Research (2003): Comparative Gene Prediction in Human and Mouse
SUPPLEMENTARY MATERIALS FOR
Comparative gene prediction
in human and mouse
G. Parra, P. Agarwal, J.F. Abril,
T. Wiehe, J.W. Fickett and R. Guigó *.
Genome Research 13(1):108-117 (Jan 1, 2003)
* To whom correspondence should be addressed.
Email: rguigo@imim.es. Ph: +034 93-224-0877.
Summary |
SGP2 is a program that predicts genes by comparing anonymous genomic sequences from two different species. It combines tblastx (WU-Blast), a sequence similarity search program, with geneid, an ab initio gene prediction program. In asymmetric mode, genes are predicted in one sequence from one species (the target sequence), using a set of sequences (possibly only one) from the other species (the reference set). Essentially, geneid is used to predict all potential exons along the target sequence. Exon scores are computed as log-likelihood ratios, as a function of the splice sites defining the exon, the coding bias in the composition of the exon sequence as measured by a Markov model of order five, and the optimal alignment at the amino acid level between the target exon sequence and its homologous counterpart in the reference set. From the set of predicted exons, the gene structure (possibly multiple genes on both strands) is assembled by maximizing the sum of the scores of the assembled exons.
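The assembly step can be illustrated with a deliberately simplified sketch. It assumes that each candidate exon already carries a single combined score, and it only enforces that chosen exons do not overlap; the real geneid assembly also checks frame compatibility, exon types and strand, none of which is modeled here. The names `Exon` and `best_chain` are hypothetical and are not part of SGP2 or geneid.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # combined score (assumed to be precomputed)


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Pick non-overlapping exons that maximize the sum of scores."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)   # best[i]: best total using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is in that solution
    prev = [0] * (n + 1)      # prev[i]: how many exons end before exon i starts
    for i, exon in enumerate(ordered, start=1):
        prev[i] = bisect_left(ends, exon.start)
        with_exon = best[prev[i]] + exon.score
        if with_exon > best[i - 1]:
            best[i], take[i] = with_exon, True
        else:
            best[i] = best[i - 1]
    # Walk back through the table to recover the chosen exons.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


candidates = [
    Exon(1, 100, 5.0),
    Exon(50, 150, 8.0),
    Exon(120, 200, 4.0),
    Exon(210, 300, -1.0),
]
total, chain = best_chain(candidates)
print(total, chain)
```

In this toy input the two compatible exons (1-100 and 120-200) together outscore the single higher-scoring exon that overlaps both, and the negatively scored exon is left out.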
CONTENTS
- SGP2: describing the algorithm.
- GFF2PS: how we made PostScript files for this page.
- SGP2 test sets
- Gene predictions
SGP2 Test Sets |
IMOG dataset
This is a list of 15 pairs of single gene sequences, with little overlap with the Sanger Center data set [Jareborg et al., Genome Research 9(9):815, 1999]. The gene accession-pairs associate each gene id with the corresponding human-mouse pair (first the human sequence, then the mouse).
You can download a tarball containing all of the above from here.
BI dataset
These are three pairs of multigene sequences. Annotation is not available for all the sequences, and its reliability is unknown.
You can download a tarball containing all of the above from here.
SCIMIT dataset
This set contains 129 pairs of single gene sequences and combines non-overlapping sequences from IMOG (see above), the Sanger Center [Jareborg et al., Genome Research 9(9):815, 1999] and the MIT [Batzoglou et al., Genome Research 10(7):950, 2000] data sets. The gene accession-pairs associate each gene id with the corresponding human-mouse pair (first the human sequence, then the mouse).
You can download a tarball grouping all of the above from here.
Gene predictions: FINISHED HOMOLOGOUS SEQUENCES |
Finished Orthologous
SGP2 predictions on the eight human/mouse homologous sequences retrieved from http://pipeline.lbl.gov/TESTS/ (including MHC). Unfortunately, that URL is no longer available. We have therefore added a column (see Sequences and Annotations) to the table below, containing the human and mouse fasta files and the corresponding human annotations that we obtained from there.
Each human sequence was compared against the corresponding homologous mouse sequence.
We introduced a few changes in the PostScript maps:
- We show the real length of annotated genes (taking into account first and last UTR coords), but we still display only the annotated CDSs.
- As we ran our programs on the original masked sequence, we are displaying only the masked regions for each sequence without labeling them (in the central axes of each block).
- We also included Twinscan results for the human/mouse homologous set.
FA+RS | The set of human and mouse fasta sequences (masked and unmasked), plus the human RefSeqs mapped onto the human sequences in GFF format. There are two GFF files for each region, the *.pipeline_refseq.gff having the original annotations produced at Berkeley, and the *.Korf_refseqs.gff which contains the subset of hand-curated annotations for the same regions (except for the MHC region that was too big). Those annotations were curated by Ian Korf, see further information at: http://sapiens.wustl.edu/~ikorf/annotation/ |
TBX | contains the raw tblastx (WU-Blast) results of each human sequence against the corresponding homologous mouse sequence. tblastx was run with -nogap and the blosum62 matrix, where the penalty for aligning with stop codons was set to -500. |
HSP | contains the resulting HSPs in GFF format (but with frames 1,2,3 as in blast). |
GFF | `General Feature Format' (GFF) is described on the Sanger Centre gff definition page. |
GTF2 | `Gene Transfer Format' (GTF) borrows from GFF, but has additional structure that warrants a separate definition and format name. GTF2 is based on Ensembl GTF, and is described in detail at this link. |
A4/A3 | contains a PostScript map showing SGP2 predictions together with geneid and genscan predictions, tblastx matches, repeat locations and, when available, annotations of the known genes. Maps were obtained using gff2ps. Two sizes are provided, for printing on a4 or a3 paper, but we recommend a3 for viewing with ghostview or similar programs. |
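For readers who want to process the HSP files, the following is a minimal sketch of reading one tab-separated GFF record, assuming the standard column order (seqname, source, feature, start, end, score, strand, frame, then an optional group field). The frame column is checked against 1, 2, 3 because the files described above keep blast-style frames. The sample line, the `Hsp` type and `parse_hsp_line` are hypothetical illustrations, not taken from the actual data files.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: float
    strand: str
    frame: int  # blast-style frame: 1, 2 or 3


def parse_hsp_line(line: str) -> Hsp:
    """Parse one tab-separated GFF record describing an HSP."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated GFF columns")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    frame_number = int(frame)
    if frame_number not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3 (blast-style)")
    return Hsp(seqname, source, feature, int(start), int(end),
               float(score), strand, frame_number)


# Hypothetical record, for illustration only.
sample = "human_region\ttblastx\thsp\t1200\t1349\t152\t+\t2\tmouse_region"
print(parse_hsp_line(sample))
```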
Finished human vs. mouse reads
SGP2 predictions on the eight human sequences retrieved from http://pipeline.lbl.gov/TESTS/ (including MHC) against the mouse WGS 3X (a database of about 13 million mouse reads). Unfortunately, that URL is no longer available; see the previous table for the sequences and annotation of the orthologous datasets.
The human fasta sequences used here were masked slightly differently from those used in the human/mouse orthologous section.
TBX | contains the raw tblastx (WU-Blast) results of each human sequence against a database of about 13 million mouse reads. tblastx was run with -nogap and the blosum62 matrix, where the penalty for aligning with stop codons was set to -500. |
HSP | contains the resulting HSPs in GFF format (but with frames 1,2,3 as in blast). |
GFF | `General Feature Format' (GFF) is described on the Sanger Centre gff definition page. |
GTF2 | `Gene Transfer Format' (GTF) borrows from GFF, but has additional structure that warrants a separate definition and format name. GTF2 is based on Ensembl GTF, and is described in detail at this link. |
A4/A3 | contains a PostScript map showing SGP2 predictions together with geneid and genscan predictions, tblastx matches, repeat locations and, when available, annotations of the known genes. Maps were obtained using gff2ps. Two sizes are provided, for printing on a4 or a3 paper, but we recommend a3 for viewing with ghostview or similar programs. |
Gene predictions: HUMAN CHROMOSOME 22 |
This section contains SGP2 predictions on human chromosome 22. Chromosome 22 annotation was compiled by Victoria Haghighi from the Columbia Genome Center. The data was downloaded from http://www.cs.columbia.edu/~vic/sanger2gbd.
There are two sets of SGP2 predictions. The first consists of raw predictions along the whole Chromosome 22 sequence (Homology Only). The second is a set of predictions confined to regions void of annotated genes or pseudogenes (Homology + Evidences). The goal here is to predict novel genes while minimizing chimeric predictions. In this case, annotations are taken from the Combined Gene + CDS Set (879 genes).
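As an illustration of what "regions void of annotated genes or pseudogenes" means in coordinates, the sketch below computes the gaps left between annotated intervals on a sequence of known length. It is only a plausible reading of the idea, not the procedure actually used for the Chromosome 22 predictions; `unannotated_regions` and the example coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based, inclusive


def unannotated_regions(length: int, annotations: List[Interval]) -> List[Interval]:
    """Return the stretches of a sequence not covered by any annotation."""
    regions: List[Interval] = []
    cursor = 1  # first position not yet known to be covered
    for start, end in sorted(annotations):
        if start > cursor:
            regions.append((cursor, start - 1))
        # Overlapping or nested annotations never move the cursor backwards.
        cursor = max(cursor, end + 1)
    if cursor <= length:
        regions.append((cursor, length))
    return regions


# Hypothetical annotations on a 1000-bp sequence.
print(unannotated_regions(1000, [(100, 200), (150, 300), (600, 700)]))
```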
TBX | contains the raw tblastx (WU-Blast) results of each human sequence against a database of about 19 million mouse reads (WGS). tblastx was run with the following parameters: -nogap, Z=3000000000, E=0.01, W=5, B=10000, V=10000, -hspmax=4, -topcomboN=4, -filter=xnu, and a modified blosum62 matrix where the penalty for aligning with stop codons was set to -500. |
HSP | contains the resulting HSPs in GFF format (but with frames 1,2,3 as in blast). |
SR | similarity regions in GFF format (but with frames 1,2,3 as in blast), as they were projected from the HSPs (see how they are obtained and how they influence the exons score in the SGP2 algorithm description page). |
GFF | `General Feature Format' (GFF) is described on the Sanger Centre gff definition page. |
GTF2 | `Gene Transfer Format' (GTF) borrows from GFF, but has additional structure that warrants a separate definition and format name. GTF2 is based on Ensembl GTF, and is described in detail at this link. |
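The modified scoring matrix mentioned in the TBX entry can be pictured with a small sketch. It takes a substitution matrix held as a dictionary of residue pairs and overrides every pair that involves the stop symbol with a fixed penalty. Two things are assumptions here: that the stop codon is written `*`, and that a stop aligned to another stop is penalized as well, which the description above does not specify. The toy matrix holds only a handful of entries and `penalize_stops` is a hypothetical helper, not part of WU-Blast or SGP2.

```python
from typing import Dict, Tuple

Matrix = Dict[Tuple[str, str], int]


def penalize_stops(matrix: Matrix, penalty: int = -500, stop: str = "*") -> Matrix:
    """Return a copy of the matrix with every stop-containing pair set to penalty."""
    return {
        pair: (penalty if stop in pair else score)
        for pair, score in matrix.items()
    }


# Toy fragment of a substitution matrix, for illustration only.
toy: Matrix = {
    ("A", "A"): 4,
    ("A", "R"): -1,
    ("R", "A"): -1,
    ("A", "*"): -4,
    ("*", "A"): -4,
    ("*", "*"): 1,
}
print(penalize_stops(toy))
```

With such a large penalty, any ungapped alignment that would run across a stop codon is effectively cut at that position.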
Whole-Genome Gene-Predictions |
The results of SGP2 on human and mouse genomes are available from our new Gene-Prediction section. Follow these links to download them:
Homo sapiens | SGP2 results on H.sapiens based on the M.musculus MGSC version-3 assembly. Version of the Human genome used: golden_path_20011222 (22nd of December 2001). Version of the Mouse genome used: goldenPath assembly (mmFeb2002-MGSCv3-February, 2002). Predictions were obtained on the masked version of the genome. These are the predictions for the v3 of the mouse genome assembly. NOTE: These SGP2 predictions combine geneid predictions with tblastx comparison of the Human genome against the Mouse genome. |
Homo sapiens | SGP2 results on H.sapiens based on the M.musculus Sanger Phusion assembly. Version of the Human genome used: golden_path_20010806 (6th of August 2001). Version of the Mouse genome used: sanger_phusion_20011109 (9th of November 2001). Predictions were obtained on the masked version of the genome. NOTE: These SGP2 predictions combine geneid predictions with tblastx comparison of the Human genome against the Mouse genome. |
Mus musculus | SGP2 results on M.musculus based on the H.sapiens December Golden Path assembly. Version of the Mouse genome used: goldenPath assembly (mmFeb2002-MGSCv3-February, 2002). Version of the Human genome used: golden_path_20011222 (22nd of December 2001). Predictions were obtained on the masked version of the genome. NOTE: These SGP2 predictions combine geneid predictions with tblastx comparison of the Mouse genome (v3) against the Human genome. |