Processing and analysis of Illumina sequencing data
- Illumina Sequencing Analysis Pipeline
- Read processing with non-Illumina software
- Application Specific Files
- File Formats
- Data Storage
Illumina Sequencing Analysis Pipeline
Illumina provides a standard pipeline used to conduct image analysis and base calling for all types of samples. This process converts the images taken by the Illumina GA sequencer to digital intensity levels, and subsequently uses these intensity levels for base calling.
Data were processed with the following Illumina pipeline versions:
- 2008-01-07 to 2008-03-12: version 0.2.6
- 2008-04-25 to 2008-11-20: version 0.3.0
- 2008-12-23: version 1.0
- 2008-11-25 to 2009-05-28: version 1.3.2
- 2009-05-28 to 2009-11-19: version 1.4.0 (within SCS 2.3)
- 2009-11-19 to 2010-04-07: version 1.5.0 (within SCS 2.5)
- 2010-04-07 to 2010-07-30: version 1.6.0 (within SCS 2.6)
- 2010-07-30 to 2010-08-05: version 1.7.0 (within SCS 2.6)
- 2010-08-05 to present: version 1.7.0 (within SCS 2.8)
Two types of files are created containing the read sequences. The seq.txt file is a file containing only the sequence information. The qseq file contains both sequence and base by base quality information. Please consult us for directions on how to interpret the quality information. A summary of standard files is provided below.
In general only the seq.txt files are provided to users, but qseq files are available on request.
NOTE: For applications (e.g. miRNA sequencing) where the fragment length is lower than the read length, the adapter sequence must be located and trimmed. For this reason we recommend using the sequence files we provide with the adapter sequence removed, as described below.
Read data are passed through the Illumina alignment and sequence analysis program, GERALD, where they are aligned to a reference sequence using the Eland software. Depending on the sample, the reference sequence will be a genome sequence, RefSeq dataset, or mirbase sequences, etc.
NOTE: For the Illumina Sequencing Analysis Software and MIRO, the most recent version of genomes has been used from May 2008 to May 7, 2009. The only exception to this is human, which was updated in our references to hg_19 on April 30th, 2009. As genomes are update and changed, a record of when which version was used will appear here.
This software creates files containing alignment locations and information, which can be provided to the user at their request. One file of interest may be the s_[lane#]_sequence.txt file, which contains only reads passing quality filtering, in FastQ format, and is available upon request. Reads pass filtering when the base-calling software is confident in the first X bases (12 by default) of the read.
Before providing any data, we review the results of the GERALD analysis to be sure the reads are of high quality and sufficient in number. We do so by reviewing intensity level and sequencing error reports, as well as reviewing the number of reads which match the reference unambiguously, ambiguously, and not at all. If all these standards are passed, we then pass the sequence data on to the user. Eland is a convenient tool to assess the quality of read datasets. However, it has limitations (e.g. it tolerates only two mismatches, does not allow indel errors, has read length limitation, and trims all reads to the same length). We therefore recommend to users to consider the analysis of their data employing further mapping algorithms without these restrictions (e.g. SeqMap, Gem, or others).
Details regarding the Illumina Sequencing Analysis Software were provided by the Illumina Sequencing Analysis Software User Guide For Pipeline Version 1.3 and CASAVA Version 1.0 by Illumina, Inc.
Read processing with non-Illumina software
We generally use the output from the MIRO package (Kofler et al. manuscript in prep.) which uses Eland, GEM, or SeqMap as alignment programs.
The first step involved is to trim the adapter for samples with small insert sizes (e.g. miRNA and small RNA samples, NlaIII or DpnII Digital Gene Expression). No sequence in the resulting file contains the adapter sequence, apart from rare cases where the adapter sequence was not detected. In this file, only one copy of identical reads is kept (along with the the read count).
After preprocessing, the reads are aligned using the most appropriate alignment software (Eland, GEM, SeqMap). Three separate files are created containing reads which mapped unambiguously (1 place in the reference), ambiguously (2 or more places), or not at all, as well as a statistics file describing the results. MIRO also possesses an option to map reads to multiple references, and a set of these 3 files is created for each reference used. All of these files are provided to users.
A URL from which the aforementioned files can be downloaded is emailed to the user once available. If such an email is not received, please contact Debayan Datta.
Application Specific Files
Below is a description of standard files available for a given application. Any additional analysis should be discussed with Heinz Himmelbauer.
ChIP-Seq and whole-genome sequencing
These reads are generally aligned using the GEM alignment program (Ribeca, Guigo et al., in preparation), called from within MIRO. Bed and wig files are then generated for all ChIP-Seq samples. These files can be loaded into various genome browsers to view the location of the reads sequenced. They can also be used in conjunction with other software, such as that of Genomatix, and various programs designed for peak calling.
Several different files are created by MIRO during analysis of miRNA samples. By default, the following files are created for quantile normalization, linear-scaling normalization and for non-normalized data:
1. A heatmap pdf with a colour coded representation of miRNA abundance, and a file containing a legend for the heatmap.
2. A lineplot pdf. This is primarily intended for viewing miRNA abundance for samples which represent progressive time points.
3. Scatterplot pdfs showing the correlation between samples.
4. Tables containing a list of miRNAs, and the number of reads found for each miRNA.
The following files may be created upon request:
1. A text file containing clusters of overlapping reads.
2. A text file containing clustered reads using a sliding window approach
3. Logos may be created for selected miRNAs (see WebLogo 3.0)
4. miRNA profiles which show the position of the reads on the reference sequence. The RNA folding prediction is also displayed.
5. 'bed' and 'wig' files for display with the UCSC Genome Browser
6. Files which are compatible with miRDeep for de novo miRNA identification
In depth descriptions of all these files, and the columns in each table, can be found in the MIRO documentation
File Formats
4 1 23 1566 ACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCG
The first 4 columns are an ID that provides location details of the cluster in the flow cell, followed by the read sequence. Seq.txt file size typically ranges between 250 Megabytes and 1 Gigabyte. One file is generated per lane.
qseq.txt (Gerald)
SOLEXA 90403 4 1 23 1566 0 1 ACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCG `aaaaa```aZa^`]a``a``a]a^`a\Y^`^^]V` 1
Relevant information is contained in the last 4 columns. They are:
-
read number (1 or 2 for paired end runs)
- sequence
- quality string
- filtering (1 for passed, 0 for failed).
These files typically range between 500 Megabytes and 1.5 Gigabytes in size each. A single file will be created for each lane if requested.
s_[lane]_sequence.txt (Gerald)
@SOLEXA:4:1:22:769#0/1 ATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGC +SOLEXA:4:1:22:769#0/1
aaba_^``^^\_`_`Z_^_^Z^_^[^\WZW`^UW_Z
Essentially the same information as in the qseq file, but in FastQ format and only including reads passing filtering. These files typically range between 250 Mb and 1 Gigabyte in size each.
unambiguous/ambiguous.txt (Miro)
3070||Count=45 GCCAAGCCCGTTCCCTTGGCTGTGGTTTCGCTTGATAGTA chr8 1 2 F 70764945
The columns in these files are as follows:
-
ID||Count, where count equals the total number of identical reads found
- Sequence
- Reference sequence (Chromosome number, transcript name, etc)
- Number of locations in the reference to which the read was mapped (1 for unambiguous, 2 or more for ambiguous)
- Number of mismatches in alignment
- Strand
- Location within reference sequence
These files typically range between 50 and 100 Megabytes in size each. Typically these are created for each lane, but can be pooled by sample if desired.
Bed files (Miro)
track name=region 1 description="unknown" useScore=1 chr22 23901283 23901298 id:715 600 -
The columns in these files are as follows:
-
Reference ID (eg chromosome)
- Start
- End
- Sequence ID
- Score
- Strand (+ or -)
These files typically range between 100 and 500 Megabytes. For more details on bed files created, visit, http://seq.crg.es/download/software/Miro/doc/Hit2Bed.html .
Wig files (Miro)
track type=wiggle_0 name="unknown" description="variableStep format" visibility=full variableStep chrom=chr7 span=1000 156001 30
The columns in these files are as follows:
-
Start
- Number of sequences found at Start + span (ie 156001 to 15700 0)
These files typically range between 5 and 50 Megabytes. For more details on bed files created, visit http://seq.crg.es/download/software/Miro/doc/Hit2Wiggle.html .
Data Storage
All images are backed up onto tapes, and can be restored if need be for repeated image analysis. According to current policy, all images will be stored on tape indefinitely. All other data will be stored long-term on our servers. Users will be notified before any of these data are removed.