
 Problem list:

  • list of house-keeping genes
  • normalization theory




  • at present only expression profiling is part of the "standard analysis"
  • structure analysis is taken into account during library preparation and sequencing (narrow fragment length range, PE-sequencing), but no analysis is actually performed
  • SNPs, InDels, etc.: also no analysis (and no special effort is made for it either)


  • annotation
  • functional conclusions

It is necessary to distinguish between measures aimed at "comparing samples within one project" and measures that allow data from different projects to be compared. In the second case both analysis and annotation are much harder, and matters are greatly complicated by the absence of industry-wide standards: even if you come up with a very useful innovation, it is far from certain that the field will adopt it.

RNA types

rRNA: transcription in the nucleolus
  • free: cytoplasmic proteins
  • membrane-bound: membrane and expressed proteins

RNA content of eukaryotic // prokaryotic cell

The number of protein-coding genes does not differ much among higher eukaryotes.

Removal of rRNA

RiboMinus, RNase, oligo(dT) (cellulose, particles, etc.)

  • bias toward the 3' end

Synthesis of long cDNA: strong secondary structure decreases synthesis on rRNA. So 90-95% rRNA in the sample results in only ~50% of sequencing reads.

Our results: oligo(dT) ~ RiboMinus

Question: are there any differences?

RNA fragmentation: decreases the influence of secondary structure

mRNA purification schemes

rRNA

  • the rRNA level is not constant; it lies in the 90-95% range
  • normally nobody controls it
  • there are several types of rRNA molecules: 5S, 5.8S, 18S, 28S

Biological replicas: NGS is a stable technology. It does not require "biological replicas" in the sense of "independent measurements of the same thing to increase the measurement accuracy". If "biological replicas" are identical, they are useless. Biological replicas should be used for studying a variable process, to distinguish "natural variability" from changes caused by disease, stimulation, etc. Some experimental objects are significantly non-uniform (a tumor, for example).


Sequential ligation of 5'/3' adapters: bias against long molecules with secondary structure (50-100 nt; does not work at all for >100 nt[1]) because of intramolecular self-ligation.

Comparison with other technologies

RNA-Seq coexists with microarrays and RT-PCR. SAGE is not considered, because it has no advantages compared with RNA-Seq[2]

Comparison of RNA-Seq, microarrays and RT-PCR:

dynamic range
  • RNA-Seq: good, ~10^4-10^5[3]
» may be extended; limited by the number of sequencing reads and by background material (gDNA, unprocessed nuclear RNA) in the sequencing library
  • microarrays: smallest, ~10^2
» can't be extended; limited by background (unspecific sorption of labelled cDNA to the array), cross-hybridization and saturation of high signals
  • RT-PCR: largest, ~10^6[4]
» limited by background material (gDNA, unprocessed nuclear RNA) in the sample

sensitivity
  • RNA-Seq: depends on sequencing scale; 1 RNA per ??? cells for 12x10^6 reads
  • microarrays: 1 RNA per ??? cells
  • RT-PCR: highest, 1 RNA per ??? cells

accuracy
  • RNA-Seq: about 1/sqrt(number of hits); highly depends on expression level
  • microarrays: ~30% ???; depends on hybridization reproducibility
  • RT-PCR: ~30% ???

set of genes analysed
  • RNA-Seq: whole-transcriptome analysis; it is difficult to exclude some genes (highly expressed genes, rRNA, tRNA, etc.)
  • microarrays: limited number of genes; tiling arrays for hypothesis- or annotation-free analysis
  • RT-PCR: "one gene" per "one reaction"

sample throughput
  • RNA-Seq: low, ~10-50 per week per researcher; complex library preparation and long sequencing time
  • microarrays: high, ~10-40 per day per researcher; fast and automatable protocol
  • RT-PCR: highest, ~100-1000 samples per day per researcher; completely automatable protocol

de novo analysis for non-model organisms
  • RNA-Seq: possible
  • microarrays: impossible

absolute measure of gene expression (copy number per cell)
  • RNA-Seq: possible
  • microarrays: practically impossible
  • RT-PCR: possible

combining results of different laboratories
  • RNA-Seq: trivial with the same library preparation protocol; possible with different protocols
  • microarrays: possible but difficult with the same microarray system; very difficult with different systems

genomic sequence variations
  • RNA-Seq: low influence
  • microarrays: may influence the results of analysis

annotation of new genes
  • RNA-Seq: possible, with "several nucleotides" resolution
  • microarrays: only with tiling arrays; low resolution

resolving transcripts from repeated sequences
  • RNA-Seq: limited by distinguishable regions
  • microarrays: SNP-array
  • RT-PCR: two-color RT-PCR assay

allele-specific expression
  • RNA-Seq: a hypothesis-free search for new imprinted genes is possible; it is difficult to restrict the analysis to particular genes
  • microarrays: large SNP-arrays for search of new cases; allele-specific arrays for analysis of particular genes

state of the technology
  • RNA-Seq: quickly developing technology; the price has been dropping over the last 4 years
  • microarrays: both technology and price are on a plateau
  • RT-PCR: has a specialized application area


  • "RNA-Seq" and "Microarray hybridization (MH)" have overlapping application areas. Microarray hybridization is the older technology, but RNA-Seq has better prospects. In the near future many MH procedures are likely to switch to RNA-Seq
  • Microarray hybridization has some obvious advantages compared with RNA-Seq in such areas as
» environmental or medical tests
» ???
  • RT-PCR has a specialized application area where neither RNA-Seq nor microarrays work well:
» analysis of a limited number of genes in a very large number of samples
» hypothesis testing
» specific medical tests
» etc.

Typical protocol

  1. RNA sequencing
  2. alignment to the genome
  3. unaligned sequences: alignment to an in-silico collection of splice junctions
  4. counting the number of reads within annotated regions of known genes
  5. normalization relative to
    » the total number of mapped reads (normalization without using another sample) // RPKM: Reads Per Kilobase per Million mapped reads
    » expression levels of most genes from a list of house-keeping genes (sample-to-sample normalization)
  6. comparison of normalized expression levels in different conditions
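
The RPKM normalization from step 5 can be sketched as below. This is only a minimal illustration of the definition (reads per kilobase of gene per million mapped reads), not the pipeline actually used; the function name `rpkm`, the gene names and all numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length in kb) / (mapped reads in millions)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: two genes with the same number of hits
# but different lengths, in a sample with 10 million mapped reads.
counts = {"geneA": 500, "geneB": 500}
lengths_bp = {"geneA": 1000, "geneB": 4000}
total_mapped = 10_000_000

values = {gene: rpkm(counts[gene], lengths_bp[gene], total_mapped) for gene in counts}
for gene, value in values.items():
    print(gene, value)
```

With equal hit counts the four-times-longer gene gets a four-times-lower RPKM, which is the point of dividing by target length.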

Expression profiling

Comparison of RNA content in different conditions. The relative error of RNA-Seq expression profiling is ~ 1/sqrt(number of hits): the higher the expression level, the more accurate the expression profiling.
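
A small sketch of the 1/sqrt(number of hits) rule. It assumes that read counts behave like Poisson counts, which is an assumption made here for illustration (the notes give only the rule itself); the function name `relative_error` is hypothetical.

```python
import math


def relative_error(hits):
    """Approximate relative counting error for a gene with the given number of hits.

    Assumption: hits are Poisson-distributed, so the standard deviation is sqrt(hits)
    and the relative error is sqrt(hits) / hits = 1 / sqrt(hits).
    """
    if hits <= 0:
        raise ValueError("number of hits must be positive")
    return 1.0 / math.sqrt(hits)


for hits in (10, 100, 10_000):
    print(hits, round(relative_error(hits), 3))
```

A gene with 100 hits is measured to about 10%, while a gene with 10 hits is measured only to about 30%.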

Normalization for measurement of expression levels of low-expressed genes

  • Typical results:

started from 30 µg of total RNA: polyA+ RNA: ?? ng; library complexity
sequencing: one PE-seq. lane on Illumina
?? seq. reads
?? mapped reads

Histogram: number of hits per gene (from high to low)
Number of genes in different hit ranges

  • normally "one sequencing read" is "one hit" for expression profiling. In this respect long sequencing reads are impractical, because even relatively short sequences (40-50 bp) are good enough for alignment. Long sequences cost more but give about the same information. "Transcriptome alignment" is easy compared with "genome alignment", because the transcriptome is ~1% of the genome.

Question: why do we not align sequences against the transcriptome first, and only the rest against the genome?

  • for expression profiling, PE- or MP-sequencing has no obvious advantages compared with single-read sequencing. On the contrary, it somewhat decreases sequencing efficiency, because the second read from the same clone can't be treated as an independent hit.

  • for non-model organisms expression profiling may be done without preliminary DNA analysis. It requires de novo transcript assembly; PE-sequencing simplifies gene reconstruction. The obtained transcripts may be compared with annotated genes for homology search and function prediction. This is difficult, with the same problems as for resequencing.

  • sequence variants (SNPs, InDels) may affect read alignment and therefore the measured expression level.

  • MMR (Multilocation Mappable Reads) match several different locations (10-40% of reads???):
» comparison of expression levels of the same gene in different conditions: MMR should be excluded from the analysis
» absolute expression measurement: MMR should be distributed among genes in the proportion determined by unique sequences (Uniquely Mappable Reads, UMR)

  • different software packages give slightly different numbers of hits.
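
The proportional rule for multi-location reads in absolute expression measurement can be sketched as follows. This is an illustration under stated assumptions, not the behaviour of any particular aligner or counting package: the function name `distribute_multireads`, the input layout and the equal-split fallback for genes without any unique reads are all hypothetical choices.

```python
def distribute_multireads(unique_counts, multiread_groups):
    """Split multi-location reads among genes in proportion to their unique reads.

    unique_counts: dict gene -> number of uniquely mappable reads (UMR)
    multiread_groups: list of (genes, n_reads), where n_reads reads match
        every gene in the tuple `genes` equally well
    Returns: dict gene -> estimated number of hits (float)
    """
    totals = {gene: float(count) for gene, count in unique_counts.items()}
    for genes, n_reads in multiread_groups:
        weights = [unique_counts.get(gene, 0) for gene in genes]
        weight_sum = sum(weights)
        for gene, weight in zip(genes, weights):
            if weight_sum > 0:
                share = weight / weight_sum
            else:
                # Assumption: with no unique evidence, split the reads equally.
                share = 1.0 / len(genes)
            totals[gene] = totals.get(gene, 0.0) + n_reads * share
    return totals


# Hypothetical example: 40 reads match both geneA and geneB.
unique = {"geneA": 300, "geneB": 100}
groups = [(("geneA", "geneB"), 40)]
estimated = distribute_multireads(unique, groups)
print(estimated)
```

Here geneA has three times more unique reads than geneB, so it receives 30 of the 40 shared reads and geneB receives 10.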

Differential expression

The main goal of many studies. We simplify the task: instead of description we do comparison, and deal only with what differs.

  • statistical significance as a function of the number of hits
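
One simple way to see how significance depends on the number of hits is an exact binomial test on the two hit counts of a gene. This test is an illustrative choice made here, not a method named in these notes, and it ignores biological variability between samples. The function names and all numbers are hypothetical.

```python
import math


def binomial_pmf(k, n, p):
    """Binomial probability, computed in log space so large hit counts do not overflow."""
    log_prob = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_prob)


def differential_p_value(hits_a, hits_b, total_a, total_b):
    """Two-sided exact binomial p-value for one gene in two samples.

    Null hypothesis: the gene has the same expression level in both samples, so each
    of its hits falls into sample A with probability total_a / (total_a + total_b).
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("total mapped reads must be positive")
    n = hits_a + hits_b
    if n == 0:
        return 1.0
    p = total_a / (total_a + total_b)
    observed = binomial_pmf(hits_a, n, p)
    p_value = 0.0
    for k in range(n + 1):
        prob = binomial_pmf(k, n, p)
        # Sum all outcomes that are at most as likely as the observed one.
        if prob <= observed * (1.0 + 1e-9):
            p_value += prob
    return min(1.0, p_value)


# The same two-fold difference at two different expression levels.
p_few = differential_p_value(20, 10, 10_000_000, 10_000_000)
p_many = differential_p_value(200, 100, 10_000_000, 10_000_000)
print(p_few, p_many)
```

With equal sequencing depth, a two-fold difference of 20 vs 10 hits is not significant (p is about 0.1), while the same two-fold difference at 200 vs 100 hits is highly significant.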

Differential splicing

- exon skipping
- alternative 5' or 3' border
- different 5' or 3' exons (promoters and terminators)
- mutually exclusive exons
- intron retention

There are no "typical results". By default, structural analysis is not performed.

Question: are there any programs which use PE data for structural analysis? What is the output?

Three types of data are used for prediction of splice variants:

  • coverage of particular exons
  • PE-sequencing (not covered even in reviews, but looks like the best source of information)
  • alignment to a collection of splice junctions (only part of the data is useful)

Very limited results for "whole-transcriptome" splice-characterization. Only limited predictions for:

  • highly expressed genes
  • predefined list of genes with gene-specific manual analysis.

  • PE-sequencing helps to analyse splice junctions. It gives information for differential splicing only when the two reads belong to different exons. We suggest using PE-fragments of about the same length as the mean exon size. For mammals the mean exon size is ~170 bp:
» it is dangerous to make the PE-fragment length larger, because small genes will be missed. Also, length variation rises with the size of the fragments
» if the PE-fragment length is made shorter, fewer fragments will overlap a splice junction.

  • longer sequencing reads help to recognize "unpredicted junctions". For mapping to an in-silico collection of splice junctions read length is not important.

  • as for SNPs, a database of transcript forms would be very helpful

  • coverage is not distributed equally along genes. The variation is higher than for genomic sequencing. In addition to the factors that also affect genomic sequencing:
» different library-preparation protocols have different biases (during fragmentation, amplification). There are also special protocols for sequencing only the 5' or 3' regions
» 3' bias for cDNA synthesis from oligo(dT) primers
» uneven distribution of reads for random-primed cDNA synthesis on long RNA (because single-stranded RNA forms secondary structure which prevents primer binding)
» different "random primers" differ from each other because of oligo synthesis biases
» degraded RNA after polyA+ selection: 3' bias

  • normally total RNA is isolated from tissue and nobody splits it into nuclear and cytoplasmic fractions. This means that processing is not finished for some part of the RNA. It is difficult to distinguish "different functional forms" from "unfinished processing". Quite often some particular introns have a lot of hits, but it is unclear how to interpret this.

Fusion transcripts

There are no "typical results". By default, fusion transcript analysis is not performed.

  • important for cancer research
  • PE-reads and double gel selection are obligatory

Annotation of new genes, correction of old annotations

There are no "typical results". By default, annotation of new genes is not performed.

Questions: nobody really works on this; our own approach, made up out of thin air, is being discussed in all seriousness

  • algorithms for discovery of new genes?
- some minimal density of hits in an intergenic (intronic?) region
- some minimum length (70 bp?)
- visual inspection and manual curation are required
  • what are the rules for annotation? How should a new annotation be submitted to the database?
  • data from different experiments in different laboratories may be pooled to detect weak transcripts
  • it is possible that a gene has a different structure in different tissues/conditions. So it is unclear what is "a variant" and what is "an error in the description"
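
The first two criteria for discovering new genes (a minimal density of hits and a minimum length) can be sketched as a scan over per-base coverage of an intergenic region. This is a rough illustration, not an established algorithm: "density" is simplified here to a per-base depth threshold, `min_depth=2` is an arbitrary assumption, `min_length=70` follows the tentative "70 bp?" above, and the function name is hypothetical. The candidates would still need visual inspection and manual curation.

```python
def candidate_regions(coverage, min_depth=2, min_length=70):
    """Find runs of bases covered at least min_depth times that are at least min_length long.

    coverage: list of per-base read depths for one region
    Returns: list of (start, end) half-open intervals, 0-based
    """
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth:
            if start is None:
                start = pos  # a covered run begins here
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the region.
    if start is not None and len(coverage) - start >= min_length:
        regions.append((start, len(coverage)))
    return regions


# Hypothetical coverage: a 120 bp covered stretch and a 40 bp covered stretch.
coverage = [0] * 50 + [5] * 120 + [0] * 30 + [3] * 40 + [0] * 10
found = candidate_regions(coverage)
print(found)
```

Only the 120 bp stretch is reported; the 40 bp stretch is covered deeply enough but is shorter than the minimum length.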

SNP recognition, allele-specific expression, RNA editing

There are no "typical results". By default, this analysis is not performed.

Question: what is required to decide that some substitution is a SNP? A number of reads?

Limited results for "whole-transcriptome" SNP characterization, predominantly for highly expressed genes.

Longer reads are better for analysis of short InDels.

Recognition of viruses and microorganisms, control of sample contamination

  • virus- or microorganism-specific reads
  • about control?

Notes

Strand-specificity of RNA-Seq

Several strand-unspecific RNA-Seq protocols were published early on. They should not be used now: there are enough new protocols which are strand-specific and no more complex than the old ones. Strand information matters for:

  • prokaryotes and yeasts: there are a lot of overlapping genes. It is much more difficult to understand the transcription picture without information about the direction of transcription
  • all organisms:
» antisense small RNAs are transcribed in the promoter and terminator areas of genes (especially in eukaryotes)
» annotation of new transcripts is much easier when the direction of transcription is known

Absolute level of gene expression

Most RNA profiling experiments need to compare expression levels of the same gene in different conditions (relative expression level). The accuracy of comparison between different genes in RNA-Seq is much lower. But there are quite few biological questions which need comparison of expression of different genes or the absolute level of gene expression. Even the rough estimate obtained from microarray hybridization is normally enough.

  • the number of hits (seq. reads) should be normalized to the target size (the length of the gene and, maybe, (i) minus the length of repeats, (ii) taking into account the 3' bias of the method)

Literature

wide but not deep review of RNA-Seq

  1. Vivancos AP, Güell M, Dohm JC, Serrano L, Himmelbauer H. Strand-specific deep sequencing of the transcriptome. Genome Res. 2010 Jul;20(7):989-99.
    protocol for ligation-based strand-specific RNA-Seq library preparation
  2. dNGS SAGE analysis
    Some people say that SAGE analysis produces more hits compared with "conventional" RNA-Seq. This is not true: "one sequencing read" is "one hit" for both technologies. There is a special kit from Illumina for this analysis. But the technology
    • is completely dead for model organisms with a reference genome, because it has a lot of disadvantages and only one questionable advantage compared with "conventional" RNA-Seq.
    • makes some sense for organisms without a reference genome. Compared with "RNA-Seq", grouping of reads is easy, but the result of the analysis is "differentially expressed SAGE tags", without any functional hypothesis.
    • for organisms without a reference genome grouping of reads is much easier
    • it is possible to analyse the data on a weak computer without any specialized programs
    • the library preparation protocol is more complex
    • depending on the selected restriction enzyme, some genes will be excluded from the analysis
    • no data about the structure of transcripts
    • no data about SNPs
  3. RNA-Seq dynamic range: suppose that among 10^7 sequencing reads one gene is hit 10^5 times (1%) in one sample and only once in another. In this case the expression level difference is about 10^5 / 1 ~ 10^5
  4. RT-PCR dynamic range: suppose that the cycle number difference for RT-PCR amplification of the same gene between two samples is ~20 cycles. In this case the expression level difference is about 2^20 ~ 10^6
Source: «http://molbiol.ru/wiki/RNA-Seq»
