Quality Control Solutions for SARS-CoV-2 Genomic Analysis

PHA4GE Bioinformatics Pipelines & Visualization Working Group
Libuit KG, Lunn S, Carleton H, Khan W, Kanwar S, van Heusden P, Amrosio F, Lemmer D, Mboowa G, Macori G, Southgate J

Overview

Next-generation sequencing (NGS) has expanded the role of genomic analysis in pathogen surveillance systems. Demand for NGS continues to grow, along with the need for higher throughput, lower costs, and better data quality.

However, the quality of NGS data can be affected by library preparation and sequencing processes, systematic variation in quality scores across sequence reads, biases in sequencing due to base composition, and less-than-optimal library fragment sizes and indexes. Such factors can negatively impact the quality of raw sequencing data for downstream analyses.

To assist with quality control (QC) measures, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the QC challenges for SARS-CoV-2 (SC2) genomic analysis and to suggest QC solutions that address them.

Please note that the QC guidelines in this document are an attempt to highlight the most accessible solutions in the opinion of our working group and in no way represent a comprehensive system for QC guidance and bioinformatic solutions. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull requests and/or GitHub issues.

Process Control For Bioinformatics QC Checkpoints

The focus of this document is the quality control (QC) of tiled amplicon sequencing (for example, through the Artic V3 protocol), a common method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample. As discussed in this working group's Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis Guidance Document, assembling a contiguous SC2 genome from these amplicon read data is a critical step in providing insight from sequenced samples.

Throughout this process, QC checkpoints should be applied at different stages of bioinformatics analysis: raw read data, pre-processing (trimming and filtering), and alignment/assembly.

In this context, raw read data refers to the fastq read files generated by the NGS platform, and processed reads are read files that have had adapter sequences removed, have been trimmed and filtered based on length and quality, and have been dehosted (had human reads removed). Alignment QC refers to the examination of the BAM or VCF files generated during the consensus genome assembly process, and Consensus Assembly QC refers to an assessment of the fasta assembly file itself.

Future updates to this document will include QC guidance for SARS-CoV-2 genomic epidemiology analysis and wastewater sequencing data.

QC Acceptance Criteria

When performing QC checks on SARS-CoV-2 genomic data, it can be helpful to establish acceptance thresholds that determine how and when data will be reported and used to inform public health decision-making. Below are this working group's suggested QC thresholds for SARS-CoV-2 genomic data, along with resources and metric definitions to assist public health laboratories in implementing SARS-CoV-2 sequencing and analysis protocols.

PHA4GE Suggested Thresholds

Read QC Metrics
Number of Reads: Protocol dependent (e.g. 100,000 reads from Artic amplicons sequenced on Illumina MiSeq)
Percent Human Reads: <20%

Alignment QC Metrics
Average Read Depth: ≥100x
Percent Mapped Reads to Wuhan Reference Genome: ≥65%
Coverage at a Single Base to Make a Base Call: ≥50x
Percent Agreement: 80%
Average Base Quality of Aligned Reads: >15

Assembly QC Metrics
Percent Reference Coverage: >83%
Number of Ns: <5,000bp
Assembly Length Unambiguous: >24,000bp
NTC Percent Coverage: <10%
Lineage Defining Mutations: ≥60%
S-gene Coverage: ≥99%
S-gene Frameshifts: 0
S-gene Ambiguous Bases: <10%
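
As an illustration of how the suggested thresholds might be applied in a pipeline, the following is a minimal Python sketch. The dictionary keys and function names are hypothetical and chosen only for this example. The Percent Agreement and per-base coverage thresholds are left out because they apply per position rather than per sample.

```python
import operator

# Thresholds copied from the suggested table above. The metric names used
# as keys are hypothetical; adapt them to whatever a given pipeline reports.
THRESHOLDS = {
    "percent_human_reads": (operator.lt, 20.0),
    "average_read_depth": (operator.ge, 100.0),
    "percent_mapped_reads": (operator.ge, 65.0),
    "average_base_quality": (operator.gt, 15.0),
    "percent_reference_coverage": (operator.gt, 83.0),
    "number_of_ns": (operator.lt, 5000),
    "assembly_length_unambiguous": (operator.gt, 24000),
    "ntc_percent_coverage": (operator.lt, 10.0),
    "lineage_defining_mutations": (operator.ge, 60.0),
    "s_gene_coverage": (operator.ge, 99.0),
    "s_gene_frameshifts": (operator.eq, 0),
    "s_gene_ambiguous_bases": (operator.lt, 10.0),
}


def check_sample(metrics):
    """Return {metric_name: passed} for each reported metric with a threshold."""
    results = {}
    for name, (compare, limit) in THRESHOLDS.items():
        if name in metrics:
            results[name] = compare(metrics[name], limit)
    return results


def failed_metrics(metrics):
    """Return the sorted names of the metrics that miss their threshold."""
    return sorted(name for name, ok in check_sample(metrics).items() if not ok)


# Made-up values for a single sample, for illustration only.
example = {
    "percent_human_reads": 3.2,
    "average_read_depth": 850.0,
    "number_of_ns": 6100,
}
print(failed_metrics(example))  # prints ['number_of_ns']
```

Metrics that a pipeline does not report are skipped rather than treated as failures; a stricter implementation might flag missing metrics instead.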

QC Metric Definitions

Read QC Metrics

Different sequencing platforms use different technologies to determine the nucleotide sequence of the genetic material that they are processing, but all of these technologies converge on the fastq file format.

For example, Illumina uses a sequencing-by-synthesis approach, which involves assembling copies of each read using fluorescently tagged nucleotides and taking high-resolution pictures as each nucleotide is added to the read. The resulting base calls are then captured in binary base call (BCL) files, and BCL files are converted into fastq files using the bcl2fastq program.

Oxford Nanopore Technologies sequencing platforms, on the other hand, run single strands of nucleic acids through nano-scale protein pores. An electric current is run across the pore, and the changes in current are detected as each nucleotide passes through the pore. The raw electric signal is captured in the fast5 file format and converted into the fastq file format using the basecalling program guppy.

Due to the nature of these sequencing platforms, there are different considerations when assessing the quality of the raw sequence data (the fastq files).

Term: Definition
Reads: Fragments of sequenced DNA base pairs that are generated during sequencing; also referred to as the raw data generated from a sequencing platform
Number of Reads: Count of reads generated in an NGS run
BCL Files: Binary base call files produced by Illumina instruments from flowcell images, converted to fastq via the bcl2fastq program
FAST5 Files: Raw electrical signal files produced by Oxford Nanopore Technologies sequencing equipment, converted to fastq via basecalling software (guppy is the current industry standard)
Basecalling: The computational process of translating raw electrical signal (FAST5) or flowcell images (Illumina) into nucleotide sequence
• See: Performance of neural network basecalling tools for Oxford Nanopore sequencing
FASTQ Files: The common "raw" sequence files containing nucleotide sequences and their associated quality scores
• The quality scores contained within a fastq file are encoded as ASCII characters, one character per score, so the string of nucleotides and the string of quality scores are equal in length
• The quality score (Q Score) expresses, on a logarithmic scale, the estimated probability that the base call at the associated nucleotide position is incorrect
• Q Scores typically range from 0 to 40 and are defined as:
     Q = -10 log10(P), where P is the estimated probability of an incorrect base call (e.g. Q30 corresponds to P = 0.001, or 1 error in 1,000 base calls)
• See: Quality Scores for Next-Generation Sequencing (Illumina)
• See: Measuring sequencing accuracy (Illumina)
• Q Scores for Illumina and ONT sequencing will differ dramatically
     • An excellent Illumina run will have an average Q Score of 27-30
     • An excellent Nanopore run will have an average Q Score of 12-15
• Low Q Scores indicate poor sequencing quality, which will impact all downstream analyses
Ambiguity / Mixed Sites: The percent of each read where the base called is ambiguous
• See: IUPAC Codes
Sequence GC Content: The GC content of reads should be normally distributed
Raw vs Processed Reads: It is typical for some reads to be removed during quality filtering. Based on the known characteristics of the sample, one should be able to predict a reasonable proportion of the reads to be removed.
Percent Human Reads: Percentage of human read data sequenced in an NGS run
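
To make the quality score definition above concrete, the following is a minimal Python sketch that decodes the quality string of a fastq record and computes the mean Q Score and GC content of a read. It assumes the Phred+33 ASCII offset and a simple four-lines-per-record fastq layout. All function names are hypothetical, and the read is made up for illustration.

```python
import io
import math

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def decode_qualities(quality_string, offset=PHRED_OFFSET):
    """Convert each ASCII quality character to its integer Q Score."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Invert Q = -10 log10(P): probability that a base call is wrong."""
    return 10 ** (-q / 10)


def phred_from_probability(p):
    """Q = -10 log10(P) for an error probability P > 0."""
    return -10 * math.log10(p)


def mean_quality(quality_string):
    """Arithmetic mean of the Q Scores in one read."""
    scores = decode_qualities(quality_string)
    return sum(scores) / len(scores)


def gc_percent(sequence):
    """Percentage of G and C bases in a read (0.0 for an empty read)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record fastq."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


# A made-up single-read fastq held in memory.
example = io.StringIO("@read1\nACGT\n+\nII5+\n")
for name, sequence, quality in read_fastq(example):
    print(name, mean_quality(quality), gc_percent(sequence))
```

The arithmetic mean of Q Scores is not the same as the Q Score of the mean error probability, so it is worth checking which definition a given tool uses before comparing its output against the averages quoted above.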

Alignment QC Metrics

Consensus genome assembly approaches have been widely adopted for SARS-CoV-2 genomic analysis. In this approach, read data are aligned to a reference genome, usually Wuhan-1 (MN908947.3; https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3), and each position in the alignment is assessed to determine the consensus basecall supported by the read data. The alignments are captured in a BAM file that can be used to assess critical quality control metrics. Additionally, VCF files that call variant positions relative to the reference genome can be produced from an alignment and inspected to assess the quality of the identified variant positions.

Term: Definition
Sequence Alignment: A method of arranging nucleic acid (DNA/RNA) or protein sequences to identify regions of similarity or conservation that may reflect functional, structural, or evolutionary relationships. Pairwise sequence alignment involves two sequences, whereas multiple sequence alignment involves three or more sequences
Sequencing Depth: The number of reads that cover a particular nucleotide or section/amplicon of the genome, or the average across the reference sequence
• Ideally a minimum depth of 10X for Illumina or 20X for Nanopore would be reached
• Uniform depth of coverage is better
• Nonuniform depth may be indicative of differential amplification of amplicons, or amplicon dropout
    • This can be assessed using bedtools
Percent Agreement: Percentage of base call concordance in reads mapped at a designated position in the reference genome
Coverage: The percentage of the reference sequence that is covered by the reads that have been produced
• This metric is typically used in conjunction with depth
Percent Mapped Reads: Percentage of read data mapped to a specified reference genome
Average Base Quality of Aligned Reads: Mean Phred score of read data mapped to a reference genome
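
The depth, coverage, and agreement metrics above can be sketched in Python as follows. The sketch assumes that per-position depth values and per-position base counts have already been extracted from the BAM file by an upstream tool. All function names are hypothetical. Treating percent agreement as the share of reads that support the most common base at a position is an assumption about how that metric is computed.

```python
def average_depth(depths):
    """Mean read depth across all reference positions."""
    return sum(depths) / len(depths)


def percent_positions_at_depth(depths, min_depth):
    """Percentage of reference positions with depth >= min_depth."""
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


def percent_agreement(base_counts):
    """Share of reads supporting the most common base at one position.

    base_counts maps a base (e.g. 'A') to the number of reads calling it.
    """
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    return 100.0 * max(base_counts.values()) / total


def low_depth_regions(depths, min_depth):
    """Return 1-based inclusive (start, end) runs where depth < min_depth."""
    regions = []
    start = None
    for pos, d in enumerate(depths, start=1):
        if d < min_depth:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions


# Made-up depths for a six-base reference, for illustration only.
depths = [120, 110, 8, 5, 95, 130]
print(average_depth(depths))                 # prints 78.0
print(low_depth_regions(depths, 50))         # prints [(3, 4)]
print(percent_agreement({"A": 90, "G": 10})) # prints 90.0
```

Runs of low depth returned by `low_depth_regions` can be compared against amplicon coordinates to spot possible amplicon dropout, as described in the Sequencing Depth entry.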

Consensus Assembly QC Metrics

An examination of the resulting assembly quality is also critical, as these assemblies often inform downstream analyses such as lineage and clade assignments and genomic epidemiology investigations.

Term: Definition
Length of the Assembly: Should be similar to that of the reference. If it is not, investigate why: have there been large insertions/deletions, gene duplications, etc.?
Total Number of N's: The total number of ambiguous basecalls in the assembly
Length of Strings of N's: While the total number of N's is important, the length of the strings of N's can indicate issues with upstream laboratory workflows. If a string of N's is consistently reported over a specific region of the genome, one can cross-reference the primer binding loci in the BED file to see if one amplicon is dropping out or amplifying at a lower rate than the other amplicons. This could be due to amplification bias resulting from a large difference in GC content between the amplicons. It may also indicate a mixed population, with a subpopulation that has a different sequence in the ambiguous region.
Percent Reference Coverage: Percentage of the Wuhan-1 reference genome represented in the consensus assembly
Number of Ns: Number of ambiguous base calls (Ns) incorporated into the consensus assembly
Assembly Length Unambiguous: Number of unambiguous base calls (ATCGs) incorporated into the consensus assembly
NTC Percent Coverage: Percentage of the Wuhan-1 reference genome represented in the consensus assembly of a non-template control (NTC; i.e. negative control)
Lineage Defining Mutations: Percentage of lineage-specific mutations represented in the consensus assembly
S-gene Coverage: Percentage of the SARS-CoV-2 S-gene represented in the consensus assembly
S-gene Frameshifts: S-gene insertion or deletion events represented in the consensus assembly
S-gene Ambiguous Bases: Number of ambiguous base calls (Ns) incorporated into the S-gene of the consensus assembly
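
Several of the assembly metrics above can be derived directly from the consensus sequence. The following is a minimal Python sketch with hypothetical function and key names. It assumes that percent reference coverage can be approximated as the number of unambiguous bases divided by the reference length (29,903 bp for Wuhan-1, MN908947.3); individual pipelines may define this metric differently.

```python
import re

REFERENCE_LENGTH = 29903  # length of Wuhan-1 (MN908947.3) in base pairs


def assembly_metrics(consensus, reference_length=REFERENCE_LENGTH):
    """Summarize a consensus sequence given as a plain string of bases."""
    seq = consensus.upper()
    number_of_ns = seq.count("N")
    # Count only A, C, G and T; other IUPAC codes are treated as ambiguous.
    unambiguous = sum(seq.count(base) for base in "ACGT")
    # 1-based inclusive coordinates of each run of consecutive Ns.
    n_runs = [(m.start() + 1, m.end()) for m in re.finditer(r"N+", seq)]
    longest = max((end - start + 1 for start, end in n_runs), default=0)
    return {
        "assembly_length": len(seq),
        "number_of_ns": number_of_ns,
        "assembly_length_unambiguous": unambiguous,
        # Assumption: coverage approximated as unambiguous bases / reference length.
        "percent_reference_coverage": 100.0 * unambiguous / reference_length,
        "n_runs": n_runs,
        "longest_n_run": longest,
    }


# A made-up 14-base consensus checked against a pretend 16-base reference.
metrics = assembly_metrics("ACGTNNNNACGTRN", reference_length=16)
print(metrics["number_of_ns"], metrics["assembly_length_unambiguous"])
print(metrics["n_runs"])
```

The coordinates in `n_runs` are positions in the consensus sequence, which may be shifted relative to the reference when the assembly contains insertions or deletions. Keep this in mind when comparing them with primer coordinates in a BED file.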

Additional QC Resources and Materials
