PHA4GE Bioinformatics Pipelines & Visualization Working Group
Libuit KG, Southgate J, Ünal G, Maguire F, Smith E, Kapsak S, van Heusden P, Wright S, Neher R, Diallo A
Overview
Genomic analysis of Mpox virus (MPXV) samples by public health laboratories is a critical component in understanding the global outbreak. Identifying appropriate bioinformatics tools and integrating them into laboratory practice to support these efforts can be challenging.
In an attempt to assist this integration process, the Bioinformatics Pipelines and Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the major bioinformatics challenges for MPXV genomic analysis and suggest various open-source and freely available bioinformatics resources to address them.
Please note that the bioinformatics resources listed in this document are simply an attempt to highlight the most accessible solutions as per the opinions of our working group and in no way represent a comprehensive list of all available MPXV bioinformatics resources. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull-requests and/or raised GitHub issues.
Background
Mpox is a viral zoonosis caused by monkeypox virus (MPXV), a member of the genus Orthopoxvirus in the family Poxviridae. The virus can be transmitted to humans from animals. After the eradication of smallpox in 1980, mpox emerged as the most important orthopoxvirus infection for public health. MPXV is an enveloped double-stranded DNA virus with two distinct genetic clades: the central African (Congo Basin) clade and the west African clade. The Congo Basin clade has historically caused more severe disease and is thought to be more transmissible (WHO). The clinical presentation is similar to that of smallpox, and prior smallpox vaccination can provide some cross-immunity. The case fatality rate varies from 1% to 10%, and transmission between humans occurs mainly through direct contact, body fluids, and respiratory droplets (Berthet, N. et al.).
MPXV has a linear double-stranded DNA genome of ≈197 kb. As in other orthopoxviruses, the central coding region sequence (CRS), located at nucleotide positions ≈56000–120000, is highly conserved. The genes at the terminal ends of the genome are responsible for immunomodulation, host range, and pathogenicity, and the inverted terminal repeat (ITR) regions contain at least 4 ORFs (Kugelman, JR et al.).
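The genome layout described above can be turned into a small helper for triaging variant positions. This is a minimal sketch only: the constants are the approximate figures quoted in the text, not exact annotation boundaries, and the function names `classify_position` and `summarise_positions` are hypothetical.

```python
# Approximate figures from the Background text; real boundaries depend on
# the reference genome and its annotation.
GENOME_LENGTH = 197_000
CENTRAL_START = 56_000
CENTRAL_END = 120_000


def classify_position(pos: int) -> str:
    """Label a 1-based genome position as 'central' or 'terminal'."""
    if not 1 <= pos <= GENOME_LENGTH:
        raise ValueError(f"position {pos} is outside 1..{GENOME_LENGTH}")
    if CENTRAL_START <= pos <= CENTRAL_END:
        return "central"
    return "terminal"


def summarise_positions(positions):
    """Count how many positions fall in each region."""
    counts = {"central": 0, "terminal": 0}
    for pos in positions:
        counts[classify_position(pos)] += 1
    return counts
```

A summary like this can help spot whether observed differences concentrate in the conserved central region or in the more variable terminal regions.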
Public Mpox Case Databases
This repository contains dated records of curated Mpox cases from the 2022 outbreak (April – ), a data dictionary, and a script used to pull contents from a spreadsheet into JSON and CSV files.
The downloadable data file contains information on the number of mpox cases reported by EU/EEA countries or collected through epidemiologic intelligence at ECDC. The data are in long format: each row contains the country, day of reporting, number of cases, and source of information. The file is updated twice a week. You may use the data in line with ECDC’s copyright and data usage policy.
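As a rough illustration of working with long-format case data like this, the sketch below sums reported cases per country. The column names (`country`, `date`, `cases`, `source`) and the sample rows are assumptions made up for this example. Check the header of the actual file, and whether its counts are per report or cumulative, before adapting it.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in long format: one row per country, reporting day, and source.
SAMPLE = """country,date,cases,source
Austria,2022-05-23,1,TESSy
Austria,2022-05-24,2,TESSy
Belgium,2022-05-23,4,EI
"""


def total_cases_by_country(csv_text: str) -> dict:
    """Sum the 'cases' column per country (assumes per-report counts)."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["country"]] += int(row["cases"])
    return dict(totals)


print(total_cases_by_country(SAMPLE))
```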
This report provides an overview of the total number of cases of mpox identified by ECDC and the WHO Regional Office for Europe through IHR mechanisms and official public resources, and of case-based data reported through The European Surveillance System (TESSy), up to 9 August 2022. The first summary table and maps (first two tabs) describe the number of cases identified through the different platforms. The following figures and tables describe national case-based data for surveillance of mpox reported in TESSy from all the countries and areas of the WHO European Region, including the 27 countries of the European Union (EU) and the additional three countries of the European Economic Area (EEA).
Bioinformatics Challenges for Public Health
The PHA4GE Bioinformatics Pipelines and Visualization Working Group has defined four key public health bioinformatics challenges for genomic analysis of MPXV samples:
- Generating consensus assemblies
- Submission of sequence data to internationally accessible databases
- Screening for Variants of Concern
- Performing Phylogenetic analysis of MPXV datasets
Open-Access/Source Bioinformatics Solutions & Resources
Video resources
Sequencing resources
Generating consensus assemblies
- TheiaCoV workflows (for Illumina SE/PE, ONT, and fasta files) with MPXV input variables
- Supports amplicon and metagenomic data
- GalaxyProject MPXV analysis effort
- Only supports Illumina PE metagenomic data
- Nextflow workflow from the Utah PHL
- Supports amplicon and metagenomic data
- Epi2Me
- Only supports metagenomic data
- Viral-Recon
- Workflow for raw read quality control, de-hosting, assembly, variant calling, and consensus generation for Illumina and Nanopore mpox data. It currently does not include pre-built support for mpox (e.g. reference genome, reference annotations, Nextclade dataset, and amplicon schemes), but these can be supplied by the user on the command line and should be appropriate to the sequencing method. For amplicon sequencing, use the reference that was used to create the amplicon scheme. For metagenomic sequencing, to be consistent with Nextstrain, you can use NC_063383.1.fasta and NC_063383.1.gff with the Nextclade dataset nextclade_hMPXV_B1_pseudo_ON563414_XXXXXXX.
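To illustrate what supplying the reference on the command line might look like for Viral-Recon, the sketch below assembles a Nextflow invocation for Illumina metagenomic data. It only builds and prints the command and does not run it. The helper `build_viralrecon_command`, the samplesheet name, the output directory, and the `docker` profile are assumptions for this example. The option names reflect our understanding of the nf-core/viralrecon parameters and should be checked against the documentation of the pipeline version you use. Nextclade dataset and amplicon scheme options are left out here.

```python
import shlex


def build_viralrecon_command(samplesheet, fasta, gff, outdir):
    """Return an argument list for an Illumina metagenomic run (not executed)."""
    return [
        "nextflow", "run", "nf-core/viralrecon",
        "--input", samplesheet,        # CSV samplesheet listing the FASTQ files
        "--outdir", outdir,
        "--platform", "illumina",
        "--protocol", "metagenomic",
        "--fasta", fasta,              # user-supplied reference genome
        "--gff", gff,                  # user-supplied reference annotation
        "-profile", "docker",          # assumption: Docker is available
    ]


cmd = build_viralrecon_command(
    "samplesheet.csv", "NC_063383.1.fasta", "NC_063383.1.gff", "results"
)
print(shlex.join(cmd))
```

Building the command as a list keeps each option paired with its value, which makes it easier to swap in a different reference for amplicon data.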
Submission of sequence data to internationally accessible databases
- Sample Metadata Specifications
- Preparation and/or Submission of Samples
- Terra_2_NCBI workflow (only SRA/BioSample at the moment) for programmatic submission of raw read data analysed on Terra to SRA and BioSample
- NCBI guide to submit consensus sequences using BankIt
- Assess Data Quality Prior to Submission
Screening for Variants of Concern
- Nextclade
- Assignment of consensus sequences to Nextstrain clades, quality control, and mutation effect annotation. Pre-built references are available for inferred ancestral mpox, the human mpox clade, and the specific B.1 human mpox clade.
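Once Nextclade has been run, its tabular output can be summarised with a few lines of code. The sketch below counts sequences per clade and lists those whose overall QC status is not `good`. It assumes a tab-separated results file with the columns `seqName`, `clade`, and `qc.overallStatus`. The helper name `summarise_nextclade` is hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for a Nextclade TSV results file.
SAMPLE_TSV = "\n".join([
    "seqName\tclade\tqc.overallStatus",
    "sample_01\tIIb\tgood",
    "sample_02\tIIb\tmediocre",
    "sample_03\tI\tgood",
])


def summarise_nextclade(tsv_text: str):
    """Return (clade counts, names of sequences not passing QC as 'good')."""
    clades = Counter()
    flagged = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        clades[row["clade"]] += 1
        if row["qc.overallStatus"] != "good":
            flagged.append(row["seqName"])
    return clades, flagged


clade_counts, needs_review = summarise_nextclade(SAMPLE_TSV)
print(dict(clade_counts), needs_review)
```

Flagged sequences are worth reviewing before submission or inclusion in a phylogenetic analysis.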
Performing Phylogenetic analysis of MPXV datasets
- Augur
- A bioinformatics toolkit for phylogenetic analysis that constructs phylogenetic trees for visualization in Nextstrain
- Nextstrain Mpox build workflow
- Workflow to perform contextualized phylogenetic analysis of mpox consensus sequences (by default using the human mpox reference genome NC_063383.1)
- Taxonium
- Tool for exploring large phylogenetic trees
- Mpox sequences from GenBank
Publicly available data
To help you get started with phylogenetic analysis, Nextstrain provides the MPXV data available on NCBI in aggregated form:
Pairwise alignments with Nextclade against the reference sequence MPXV-M5312_HM12_Rivers, insertions relative to the reference, and translated ORFs are available: