Second- versus first-generation sequencing
From Zbio
When people talk about second-generation (SG) sequencing, they sometimes imagine it as first-generation (FG) sequencing, only more powerful. Let's use the common analogy "nucleotide sequence ~ text in a book" to show how far this picture is from reality.
To restate typical laboratory-scale tasks for first- and second-generation sequencing in book terms, let's take as our "book" something similar to the journal Nature: ~150 letters per line, ~60 lines per page, ~160 pages per journal.
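The conversion behind the analogy can be sketched in a few lines. The journal dimensions come from the text above; the function name `in_book_terms` is only an illustrative choice.

```python
# Dimensions of the "journal" used in the analogy (values from the text).
LETTERS_PER_LINE = 150
LINES_PER_PAGE = 60
PAGES_PER_JOURNAL = 160

LETTERS_PER_PAGE = LETTERS_PER_LINE * LINES_PER_PAGE        # 9,000 letters
LETTERS_PER_JOURNAL = LETTERS_PER_PAGE * PAGES_PER_JOURNAL  # 1,440,000 letters


def in_book_terms(nucleotides):
    """Express a sequence length as lines, pages and journals of text."""
    return {
        "lines": nucleotides / LETTERS_PER_LINE,
        "pages": nucleotides / LETTERS_PER_PAGE,
        "journals": nucleotides / LETTERS_PER_JOURNAL,
    }


if __name__ == "__main__":
    examples = [
        ("700 nt insert", 700),
        ("40 kb BAC clone", 40_000),
        ("haploid human genome", 3_000_000_000),
    ]
    for label, length in examples:
        terms = in_book_terms(length)
        print(f"{label}: {terms['lines']:.1f} lines, "
              f"{terms['pages']:.1f} pages, {terms['journals']:.1f} journals")
```

With these dimensions a 40 kb clone comes to roughly 4.4 pages and a haploid human genome to roughly 2000 journals.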
FG-sequencing of a BAC clone
|Step||In sequencing terms||In book terms|
|task||Shotgun sequencing of a 40 kb BAC clone||Determine the text of a ~4.5-page article|
|library preparation||Prepare a random library with ~700 nt inserts.||Take ~1000 reprints of the article and cut them at random into fragments of ~4.5 lines each.|
|sequencing with 5x coverage||Pick ~300 clones and sequence them. This gives about 200 kb of raw nucleotides.||Collect and read ~300 of the "~4.5-line" fragments.|
|bioinformatics[fgs 1]||Combine all reads into one contig.||Combine all text fragments into one uninterrupted text.|
- ↑ A computer helps in generating the contig, but it is obvious that this task could also be solved manually.
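The coverage figure in the table follows from simple arithmetic: total raw nucleotides divided by the length of the target. A minimal sketch, with hypothetical function names:

```python
def coverage(n_reads, read_length, target_length):
    """Average depth: total raw nucleotides divided by the target length."""
    return n_reads * read_length / target_length


def reads_needed(target_coverage, read_length, target_length):
    """Number of reads required to reach a given average depth."""
    return target_coverage * target_length / read_length


if __name__ == "__main__":
    # ~300 clones with ~700 nt reads over a 40 kb BAC clone.
    print(f"coverage: {coverage(300, 700, 40_000):.2f}x")
    print(f"reads for 5x: {reads_needed(5, 700, 40_000):.0f}")
```

The same functions apply to any read length and target size, so they can also be used for the short-read case.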
SG fragment-resequencing (35nt reads) of a human genome
|Step||In sequencing terms||In book terms|
|task||Resequencing of a human genome with ~5x coverage||Read (with ~2.5x coverage) the text of a 2-volume superbook (one volume from the mother, the other from the father); each volume corresponds to ~2×10³ individual journals[sgs 1].|
|library preparation||Prepare a random library.||Take ~20 real genomes (each genome is a 2-volume superbook) and put them through a paper shredder, generating random pieces one line high and 35 letters long (~3.5×50 mm²).|
|sequencing||To provide ~5x coverage of the genome, ~19 Gb (Illumina GA II) or 33 Gb (SOLiD 3+) of raw nucleotides must be collected.||Collect and read ~400 million 35-letter phrases (~3000 kg).|
|bioinformatics[sgs 2]||Align the reads to the reference genome, determine the state of known SNP loci, and try to find new SNPs and structural variations.||Find the original position in the superbook of every collected 35-letter phrase. Determine the state of all text changes (homozygous or heterozygous) and try to characterize all text rearrangements.|
- ↑ Human genome: ~3×10⁹ bp of haploid sequence packed into ~20 chromosomes. In "journal" terms this is a superbook containing ~2000 individual journals, organized into 20 volumes, each ~0.5 m thick (a 10 m bookshelf with a total weight of ~600 kg).
- ↑ Special sequencing libraries, an SNP database, a powerful computer and good algorithms are absolutely necessary for this work. A primitive shotgun library would not help in revealing structural variations or in analysing repetitive sequences.
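The task "find the original position of every phrase" can be illustrated with a toy exact-match mapper. This is a simplified sketch and not the method of any particular aligner: the names `build_index` and `map_reads` are hypothetical, every read is assumed to be exactly `k` letters long, and real tools also tolerate mismatches so that reads carrying a SNP can still be placed.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-letter substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_reads(reads, index):
    """Classify each read as unique, repeat (several positions) or unmapped."""
    result = {}
    for read in reads:
        positions = index.get(read, [])
        if len(positions) == 1:
            result[read] = ("unique", positions)
        elif positions:
            result[read] = ("repeat", positions)
        else:
            result[read] = ("unmapped", positions)
    return result


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"  # toy reference containing a repeated "ACGT"
    index = build_index(reference, k=4)
    for read, (status, positions) in map_reads(["TTGA", "ACGT", "GGGG"], index).items():
        print(read, status, positions)
```

The read "ACGT" occurs twice in the toy reference, so its origin cannot be decided from the read alone. This is the same difficulty that repetitive sequences pose for short reads in a real genome.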
FG-sequencing can directly read only short clones. The phrase "to read the sequence of a BAC clone" by FG-sequencing is shorthand, because reading 40 kb directly is impossible. It is not too misleading, however, because standard, reliable technologies for reconstructing such clones are available. The phrase "to read a human genome using SG-sequencing" is totally misleading, because:
- there is no standard algorithm for such sequencing;
- an absolutely accurate, 100% complete sequence cannot be generated with current technologies;
- the accuracy and completeness of the sequence depend on sequence coverage, library construction technology and analysis algorithms.
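How completeness depends on coverage can be sketched under an idealized model. The code assumes that reads land uniformly at random, so that the depth at a given position is approximately Poisson-distributed with a mean equal to the average coverage. This assumption is not taken from the text above, and real libraries deviate from it.

```python
import math


def fraction_with_depth_at_least(min_depth, mean_coverage):
    """P(depth >= min_depth) when depth ~ Poisson(mean_coverage)."""
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


if __name__ == "__main__":
    for depth in (1, 4, 8):
        share = fraction_with_depth_at_least(depth, 5.0)
        print(f"positions with depth >= {depth} at 5x: {share:.3f}")
```

Under this model, even at 5x average coverage a small share of positions is never read, and a much larger share is read fewer than four times. This is one reason why a heterozygous change cannot always be called reliably.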
No clones, only libraries
All SG-sequencing platforms use clonally amplified DNA libraries for sequencing. Individual clones are never handled or stored separately. The "minimal unit" of SG-sequencing is a DNA library. To obtain good sequencing results, it is necessary to use a library with
- the desired clone length distribution, and
- sufficient complexity.
- All SG-system libraries are collections of relatively short DNA fragments (<600 bp for Illumina and <300 bp for SOLiD) flanked by adaptors of known sequence.
- All SG platforms rely on array-based sequencing. Clones (454, Illumina, SOLiD) or individual molecules (Helicos) are distributed over a two-dimensional surface. For Illumina, an individual clone is the result of on-surface PCR amplification: ~1000 DNA molecules in an area of ~1 µm². For SOLiD, a clone is a ~1 µm paramagnetic bead bearing ~10,000 DNA molecules.
- The sequencing reaction is performed step by step for all clones simultaneously. All platforms except 454 use fluorescence-based reading; 454 relies on luminescence. Each step results in specific fluorescence (luminescence) of the clones. A CCD camera photographs the two-dimensional surface (the process differs for 454). Optical filters (channels) are used to visualize individual fluorophores. The images are always black-and-white. Each sequencing step results in thousands of images.
- During base calling, the positions of all clones are determined, and for each clone:
  - the fluorescence intensities in all optical channels are recorded;
  - the nucleotide (colour dinucleotide for SOLiD) and the read quality are determined.
- Sequence analysis. Normally, a reference genome is used for analysing the sequencing data. De novo assembly is possible for small (~10⁶ bp) genomes. Powerful computing facilities for data analysis and storage are important for SG-sequencing.
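The base-calling step described above can be illustrated with a toy example for a four-channel platform. This is a sketch under stated assumptions, not the algorithm of any vendor's software: one optical channel per nucleotide is assumed, the "purity" score is a hypothetical stand-in for a real quality value, and the threshold `min_purity` is arbitrary.

```python
CHANNELS = "ACGT"  # assumed: one optical channel per nucleotide


def call_base(intensities):
    """Call one base from four channel intensities.

    Returns (base, purity), where purity is the brightest channel's share
    of the total signal: 1.0 is a clean call, 0.25 is uninformative.
    """
    total = sum(intensities)
    if total <= 0:
        return "N", 0.0
    best = max(range(len(CHANNELS)), key=lambda i: intensities[i])
    return CHANNELS[best], intensities[best] / total


def call_read(cycles, min_purity=0.6):
    """Call a read from per-cycle intensities; low-purity calls become 'N'."""
    bases = []
    for intensities in cycles:
        base, purity = call_base(intensities)
        bases.append(base if purity >= min_purity else "N")
    return "".join(bases)


if __name__ == "__main__":
    clone_cycles = [
        [900, 50, 30, 20],     # clear signal in the first channel
        [10, 20, 950, 20],     # clear signal in the third channel
        [300, 280, 210, 210],  # mixed signal, no confident call
    ]
    print(call_read(clone_cycles))
```

Each inner list stands for the intensities recorded for one clone in one sequencing step. A real pipeline first has to locate the clones in the images and correct for crosstalk between channels, which this sketch leaves out.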