
<p><strong>A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS</strong></p>
<h2 id="idm139969854568656title">Abstract</h2>
<p id="P2">PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20 kb), in contrast to the shorter reads produced by first- and second-generation sequencing technologies. Because the platform is new, it is important to assess its sequencing error rate as well as the quality control (QC) parameters associated with PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons was sequenced on the PacBio RS platform. After aligning the Circular Consensus Sequence (CCS) reads from this experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, a <em>de novo</em> assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that, even though CCS reads are already error-corrected, appropriate QC of CCS reads is still necessary for successful downstream bioinformatics analyses.</p>
<p><strong>Keywords: </strong>PacBio, CCS read, quality control (QC), pass number, quality value (QV), SVM regression, assembly</p>
<h2 id="S1title">Introduction</h2>
<p id="P3">The PacBio RS platform, a newly emerging third-generation DNA sequencer produced by Pacific Biosciences, Inc., is based on a real-time, single-molecule, nano-nitch technology [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R1" id="__tag_337753756">1</a>–<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R3" id="__tag_337753747">3</a>]. Besides several advantages over earlier-generation sequencers, such as no PCR amplification, single-molecule sequencing, and shorter turnaround time, the most distinctive feature of PacBio is its very long reads, which range up to ~10 kb for raw reads and ~2.5 kb for the error-corrected Circular Consensus Sequence reads (defined in the next paragraph) [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R4" id="__tag_337753750">4</a>]. In contrast, earlier-generation sequencers typically generate much shorter reads, with median lengths of ~100–200 bp for Illumina and ~500 bp for 454 [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R1" id="__tag_337753754">1</a>,<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R3" id="__tag_337753757">3</a>,<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R5" id="__tag_337753762">5</a>–<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R8" id="__tag_337753758">8</a>]. The longer reads produced by the PacBio platform are a key advance in the high-throughput sequencing field and are expected to benefit many aspects of genomic projects in the near future.
For example, assembling a genome with highly repetitive DNA, closing gaps in genome assemblies, phasing DNA polymorphisms, discovering rare isoforms of a highly conserved gene family, and identifying rare alternative splicing events, all of which remain challenging with the shorter reads derived from earlier-generation sequencers, would benefit from longer reads [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R9" id="__tag_337753752">9</a>–<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R11" id="__tag_337753764">11</a>].</p>
<p id="P4">Although PacBio’s longer reads provide new power to researchers, careful error assessment and quality control (QC) of the reads are essential to use that power effectively. Despite the ~15% error rate reported for the raw subreads of PacBio [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R1" id="__tag_337753763">1</a>,<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R10" id="__tag_337753765">10</a>], one of the standard outputs from the platform is the Circular Consensus Sequence (CCS) read (with a throughput of ~10–20 k per SMRT cell), which is an error-corrected consensus read derived from the multiple alignment consensus of subreads belonging to the same single-molecule circular sequencing [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R1" id="__tag_337753749">1</a>–<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R3" id="__tag_337753761">3</a>,<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R5" id="__tag_337753766">5</a>]. The pass number is a feature unique to the PacBio platform that arises when CCS reads are formed [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R4" id="__tag_337753746">4</a>]. It represents how many times the same single molecule is sequenced in a hairpin structure during the PacBio circular sequencing procedure [<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R1" id="__tag_337753760">1</a>,<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R2" id="__tag_337753745">2</a>,<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3811116/#R4" id="__tag_337753753">4</a>]. Because CCS reads are already error-corrected, users often optimistically treat them as high