PHA4GE Bioinformatics Pipelines & Visualization Working Group
Libuit KG, Lunn S, Carleton H, Khan W, Kanwar S, van Heusden P, Amrosio F, Lemmer D, Mboowa G, Macori G, Southgate J
Overview
Next-generation sequencing (NGS) has broadened the role of genomic analysis in pathogen surveillance systems. Demand for NGS continues to grow, along with the need for higher throughput, lower costs, and better data quality.
However, the quality of NGS data can be affected by library preparation and sequencing processes, systematic variation in quality scores across sequence reads, biases in sequencing due to base composition, and less-than-optimal library fragment sizes and indexes. Such factors can negatively impact the quality of raw sequencing data for downstream analyses.
In an attempt to assist with quality control (QC) measures, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the QC challenges for SARS-CoV-2 (SC2) genomic analysis and to suggest QC solutions to address them.
Please note that the QC guidelines in this document are simply an attempt to highlight the most accessible solutions as per the opinions of our working group and in no way represent a comprehensive system for QC guidance and bioinformatic solutions. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull-requests and/or raised GitHub issues.
Contents
- Process Control For Bioinformatics QC Checkpoints
- QC Acceptance Criteria
- QC Metric Definitions
- Additional QC Resources and Materials
Process Control For Bioinformatics QC Checkpoints
The focus of this document is on the quality control (QC) of tiled amplicon sequencing (through the Artic V3 protocol, for example), a common method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample. As discussed in this working group’s Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis Guidance Document, assembling a contiguous SC2 genome from these amplicon read data is a critical step in providing insight from sequenced samples.
Throughout this process, quality control checkpoints should be conducted at different stages of bioinformatics analysis, including QC of raw read data, pre-processing stages (trimming and filtering), and alignment/assembly.
In this context, raw read data refers to the fastq read files generated by the NGS platform, and processed reads are read files that have had adapter sequences removed, have been trimmed based on size and quality, and have been dehosted (host reads removed). Alignment QC refers to the examination of the BAM or VCF files generated during the consensus genome assembly process, and Consensus Assembly QC refers to an assessment of the fasta assembly file itself.
Future updates to this document will include QC guidance for SARS-CoV-2 genomic epidemiology analysis and wastewater sequencing data.
QC Acceptance Criteria
When performing QC checks on SARS-CoV-2 genomic data, it can be helpful to establish acceptance thresholds to determine how and when data will be reported and used to inform public health decision-making. Below are this working group’s suggested QC thresholds for SARS-CoV-2 genomic data, as well as various resources and metric definitions to assist public health laboratories in implementing SARS-CoV-2 sequencing and analysis protocols.
PHA4GE Suggested Thresholds
Read QC Metrics | |
---|---|
Number of Reads | Protocol dependent (e.g., 100,000 reads from Artic amplicons sequenced on Illumina MiSeq) |
Percent Human Reads | <20% |
Alignment QC Metrics | |
Average Read Depth | ≥100x |
Percent mapped reads to Wuhan reference genome | ≥65% |
Coverage at a Single Base to Make a Base Call | ≥50x |
Percent Agreement | 80% |
Average base quality of aligned reads | >15 |
Assembly QC Metrics | |
Percent reference coverage | >83% |
Number of Ns | <5,000bp |
Assembly length unambiguous | >24,000bp |
NTC percent coverage | <10% |
Lineage defining mutations | ≥60% |
S-gene coverage | ≥99% |
S-gene frameshifts sequence | 0 |
S-gene ambiguous bases | <10% |
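As a rough illustration of how the suggested thresholds above might be applied in practice, the following Python sketch checks a dictionary of per-sample metrics against a subset of the table. The threshold values are copied from the table; the metric key names (such as `average_read_depth`) and the `failed_checks` function are hypothetical and would need to be mapped onto whatever a given pipeline actually reports.

```python
import operator

# Threshold values are taken from the PHA4GE table above.
# The metric key names are hypothetical, chosen only for this sketch.
THRESHOLDS = {
    "percent_human_reads": (operator.lt, 20.0),
    "average_read_depth": (operator.ge, 100.0),
    "percent_mapped_reads": (operator.ge, 65.0),
    "average_base_quality": (operator.gt, 15.0),
    "percent_reference_coverage": (operator.gt, 83.0),
    "number_of_ns": (operator.lt, 5000),
    "assembly_length_unambiguous": (operator.gt, 24000),
    "ntc_percent_coverage": (operator.lt, 10.0),
    "s_gene_coverage": (operator.ge, 99.0),
    "s_gene_frameshifts": (operator.eq, 0),
}


def failed_checks(metrics):
    """Return the sorted names of reported metrics that miss their threshold.

    Metrics absent from `metrics` are skipped rather than treated as failures.
    """
    return sorted(
        name
        for name, (compare, limit) in THRESHOLDS.items()
        if name in metrics and not compare(metrics[name], limit)
    )


# Made-up example values for a single sample.
sample = {
    "average_read_depth": 250.0,
    "percent_mapped_reads": 91.2,
    "number_of_ns": 6100,
    "s_gene_frameshifts": 0,
}
print(failed_checks(sample))
```

Skipping missing metrics is a design choice for this sketch; a laboratory may prefer to treat an unreported metric as a failure.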
QC Metric Definitions
Read QC Metrics
Different sequencing platforms use different technologies to determine the nucleotide sequence of the genetic material that they are processing, but all of these technologies converge on the fastq file format. For example, Illumina uses a sequencing-by-synthesis approach which involves assembling copies of each read using fluorescently tagged nucleotides and taking high resolution pictures of each read as each nucleotide is added to the read. The base calls derived from these images are stored in binary base call (BCL) files, and BCL files are converted into fastq files using the bcl2fastq program. On the other hand, Oxford Nanopore Technologies sequencing platforms run single strands of nucleic acids through nano-scale protein pores. An electric current is run across the pore, and the changes in current are detected as each nucleotide passes through the pore. The raw electric signal is captured in the fast5 file format and converted into fastq file format using the basecalling program guppy. Due to the nature of these sequencing platforms, there are different considerations when assessing the quality of the raw sequence data (the fastq files).
Term | Definition |
---|---|
Reads | Fragments of sequenced DNA base pairs that are generated during sequencing; also referred to as the raw data generated from a sequencing platform |
Number of Reads | Count of reads generated in an NGS run |
BCL Files | Binary base call files produced by Illumina instruments from flowcell images, converted to fastq via the bcl2fastq program |
FAST5 Files | Raw electrical signal files produced by Oxford Nanopore Technologies sequencing equipment, converted to fastq via basecalling software (guppy is the current industry standard) |
Basecalling | The computational process of translating raw electrical signal (FAST5, Oxford Nanopore) or flowcell images (Illumina) to nucleotide sequence (see: Performance of neural network basecalling tools for Oxford Nanopore sequencing) |
FASTQ Files | The common “raw” sequence files containing nucleotide sequences and their associated quality scores • The quality scores contained within a fastq file are encoded as ASCII characters, one character per score, so the string of nucleotide sequences and the string of quality scores are equal in length • The quality score (Q Score) expresses, on a logarithmic scale, the estimated probability that the base assignment at the associated nucleotide position is incorrect; a higher Q Score indicates a more accurate call • Q scores typically range from 0 to 40 and are defined as Q = -10 log10(P), where P is the probability of an incorrect base call • Quality Scores for Next-Generation Sequencing – illumina • Measuring sequencing accuracy – illumina • Q Scores for Illumina and ONT sequencing will differ dramatically • An excellent Illumina run will have an average Q Score of 27-30 • An excellent Nanopore run will have an average Q Score of 12-15 • Low Q Scores indicate poor sequencing quality which will impact all downstream analyses |
Ambiguity / Mixed Sites | The percent of each read where the base called is ambiguous (see: IUPAC Codes) |
Sequence GC Content | The GC content of reads should be normally distributed |
Raw vs Processed Reads | It is typical for some reads to be removed during quality filtering. Based on the known characteristics of the sample, one should be able to predict a reasonable proportion of the reads to be removed. |
Percent Human Reads | Percentage of human read data sequenced in an NGS run. |
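To make the Q Score definition in the table above concrete, here is a minimal Python sketch that converts between Q Scores and error probabilities and decodes a fastq quality string. It assumes the Phred+33 ASCII offset used by current Illumina and Oxford Nanopore fastq files; the function names are hypothetical and the mean shown is a simple arithmetic mean of Q Scores, which is only one of several ways tools summarize read quality.

```python
import math


def error_probability(q):
    """Probability that a base call with Q Score `q` is incorrect."""
    return 10 ** (-q / 10)


def q_score(p_error):
    """Q Score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)


def decode_qualities(quality_string, offset=33):
    """Decode a fastq quality string into integer Q Scores.

    Assumes Phred+33 encoding: each ASCII character encodes one score.
    """
    return [ord(char) - offset for char in quality_string]


def mean_q(quality_string, offset=33):
    """Arithmetic mean of the Q Scores in one quality string."""
    scores = decode_qualities(quality_string, offset)
    return sum(scores) / len(scores)


# Q30 corresponds to roughly a 1 in 1,000 chance of an incorrect call,
# Q12 to roughly a 1 in 16 chance.
for q in (12, 20, 30):
    print(q, error_probability(q))

print(decode_qualities("I5!"))
```

Because the scale is logarithmic, the gap between an average Q Score of 12-15 and one of 27-30 corresponds to a difference in error rate of more than an order of magnitude.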
Alignment QC Metrics
Consensus-genome assembly approaches have been widely adopted for SARS-CoV-2 genomic analysis. In this approach, read data are aligned to a reference genome, usually [Wuhan-1 (MN908947.3)](https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3), and each position in the alignment is assessed to determine the consensus basecall supported by the read data at that position. The alignments are captured in a BAM file that can be used to assess critical quality control metrics. Additionally, VCF files can be produced from an alignment to call variant positions relative to the reference genome; these VCF files can also be inspected to assess the quality of identified variant positions.
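The per-position consensus decision described above can be sketched in a few lines of Python. This is an illustrative simplification, not the method of any particular pipeline: it takes the bases observed at one reference position as a plain string (real pipelines derive these from the BAM pileup), and it assumes that the "Coverage at a Single Base to Make a Base Call" (≥50x) and "Percent Agreement" (80%) thresholds from the table act as a minimum depth and a minimum fraction of reads supporting the most common base. The name `call_base` is hypothetical.

```python
from collections import Counter

MIN_DEPTH = 50        # suggested coverage needed to make a base call
MIN_AGREEMENT = 0.80  # assumed meaning of "Percent Agreement"


def call_base(observed_bases, min_depth=MIN_DEPTH, min_agreement=MIN_AGREEMENT):
    """Return the consensus base at one position, or "N" if unsupported.

    `observed_bases` holds one character per read covering the position.
    """
    depth = len(observed_bases)
    if depth < min_depth:
        return "N"  # too few reads to make a call
    base, count = Counter(observed_bases).most_common(1)[0]
    if count / depth >= min_agreement:
        return base
    return "N"  # reads disagree too much (possible mixed site)


print(call_base("A" * 45 + "G" * 5))   # well supported position
print(call_base("A" * 49))             # below the depth threshold
print(call_base("A" * 30 + "G" * 30))  # mixed site
```

Positions reported as "N" here are what later contribute to the "Number of Ns" assembly metric.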
Consensus Assembly QC Metrics
An examination of the resulting assembly quality is also critical, as these assemblies often inform critical downstream analyses such as lineage and clade assignments and genomic epidemiology investigations.
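Several of the assembly metrics in the thresholds table can be computed directly from the consensus sequence. The Python sketch below is a minimal example under stated assumptions: the function name and output keys are hypothetical, percent reference coverage is approximated as unambiguous bases divided by the reference length, and the default reference length of 29,903 bp is that of MN908947.3. Tools may define these metrics differently, for example by counting covered reference positions from the alignment instead.

```python
REFERENCE_LENGTH = 29903  # length of MN908947.3 in base pairs


def assembly_metrics(sequence, reference_length=REFERENCE_LENGTH):
    """Summarize a consensus assembly given as a plain nucleotide string."""
    seq = sequence.upper()
    n_count = seq.count("N")
    # Only A, C, G and T count as unambiguous; IUPAC codes such as R or Y do not.
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return {
        "number_of_ns": n_count,
        "assembly_length_unambiguous": unambiguous,
        "percent_reference_coverage": 100 * unambiguous / reference_length,
    }


# Toy example with a made-up 8 bp "reference" so the numbers are easy to follow.
print(assembly_metrics("ACGTNNRY", reference_length=8))
```

In practice the sequence would be read from the fasta assembly file, with header lines removed and sequence lines joined before calling the function.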
Additional QC Resources and Materials
- ncov-tools – Tools and plots for performing quality control on coronavirus sequencing results.
- Quality Management Systems Tools & Resources – Process Management – US CDC Quality Management Systems for SARS-CoV-2 NGS Data
- TheiaCoV QC output Video – Video tutorial for assessing SARS-CoV-2 genomic characterization with Theiagen’s TheiaCoV workflows
- StaPH-B Glossary – US State Public Health Bioinformatics (StaPH-B) working group’s bioinformatics glossary of terms
- PHA4GE Bioinformatics Solutions – This working group’s list of bioinformatics solutions for SARS-CoV-2 genomic analysis
- ECDC: Guidance for representative and targeted genomic SARS-CoV-2 monitoring – European CDC Guidance Document for SARS-CoV-2 genomic analysis