Working Group

Data Structures

Building the backbone of genomic data
through shared standards.

The PHA4GE Data Structures Working Group develops standardized data models and interchange formats for microbial genomics, covering sequence data, contextual metadata, results, and workflow metrics. Our goal is to make public health sequencing more interoperable, transparent, and reproducible, so organizations everywhere can use and govern their own data.

What you’ll work on

We collaborate on practical standards and guidance that help data move reliably between tools, programs, and countries. Focus areas include:

  • Metadata standards and conventions (including contextual metadata: lab, clinical, epidemiologic, environmental/exposure)

  • Ontologies and shared terminology

  • Results reporting formats and “views” that improve comparability across tools

  • Workflow parameters and metrics for reproducible bioinformatics

  • APIs and interoperability for exchange with repositories and platforms

  • Data security and encryption considerations

  • Identity and access management (role/resource-based access)

Working Group deliverables include consensus recommendations, schemas/specifications, and (where useful) reference implementations that can be adopted by public health institutions and software teams.

Recent and ongoing work includes:

  • AMR detection output specification to harmonize AMR results across tools

  • SARS-CoV-2 contextual data specification to standardize sample metadata, lab results, epi/clinical fields, and sequencing/bioinformatics methods and metrics

Why join

If you work in public health genomics, bioinformatics, data standards, surveillance systems, repositories, or reporting, you can help shape standards that make genomic data more usable for real-world decisions.

By joining, you can:

  • Contribute expertise to globally relevant specifications

  • Help align tools and reporting across the ecosystem

  • Collaborate with an international community working on applied public health needs

Overview

Inception: April 2020

# of Members: 110+

Chairs

Emma Griffiths

Simon Fraser University

Finlay Maguire

Dalhousie University

Projects

Resources

The Minimal Pathogen Agnostic Contextual Data Specification defines an international, ontology-based minimal metadata standard for public health and One Health genomic surveillance. Designed for use across pathogens, sequencing types, and global initiatives, the framework supports interoperable, timely, and privacy-conscious data sharing for both single isolate and metagenomic sequencing. By standardizing essential fields such as sample identifiers, geographic location, collection date, organism, and sequencing purpose, this specification enhances traceability, comparability, and decision-making in public health emergencies while promoting FAIR data principles.
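As a rough illustration of how a minimal contextual data record of this kind might be checked before sharing, the sketch below validates one record against a short list of required fields. The field names (`sample_id`, `geo_loc_country`, `collection_date`, `organism`, `purpose_of_sequencing`) and the function `validate_record` are hypothetical stand-ins chosen for this example; they are not the identifiers defined by the specification, which should be taken from the specification itself.

```python
from datetime import date

# Hypothetical field names for illustration only; the real specification
# defines its own ontology-mapped identifiers.
REQUIRED_FIELDS = (
    "sample_id",
    "geo_loc_country",
    "collection_date",
    "organism",
    "purpose_of_sequencing",
)


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing required field: {field}")

    # Assumption: collection dates are written as ISO 8601 dates (YYYY-MM-DD).
    raw_date = record.get("collection_date")
    if raw_date:
        try:
            date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            problems.append(f"collection_date is not an ISO 8601 date: {raw_date}")
    return problems


example = {
    "sample_id": "SAMPLE-0001",
    "geo_loc_country": "Canada",
    "collection_date": "2024-03-15",
    "organism": "Salmonella enterica",
    "purpose_of_sequencing": "baseline surveillance",
}
print(validate_record(example))
```

Returning a list of problems rather than raising on the first one lets a submitter fix every issue in a record in a single pass.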

The PHA4GE QC Contextual Data Tags Specification provides standardized, ontology-based annotations for labeling public health pathogen sequence datasets with known quality control (QC) issues. Designed to improve transparency, discoverability, and responsible reuse of lower-quality genomic data, the specification defines five structured QC fields—including controlled vocabulary for quality determinations and issues—that can be included in public repository submissions such as NCBI SRA. Organism-agnostic and sequencing technique-agnostic, these FAIR-aligned tags support training, validation, and optimization of public health genomics workflows while enhancing communication between data submitters and users.

The PHA4GE SARS-CoV-2 Contextual Data Specification provides a standardized, open-source framework for collecting, structuring, and sharing high-quality metadata to support COVID-19 genomic surveillance. Developed by the Public Health Alliance for Genomic Epidemiology (PHA4GE), the package includes a color-coded Excel collection template, ontology-mapped controlled vocabularies, JSON schema, reference guides, and detailed submission protocols for GISAID, NCBI, and ENA. Designed to promote FAIR data principles, interoperability, and global data sharing, this specification enhances the consistency, usability, and public health impact of SARS-CoV-2 sequence metadata.

The hAMRonization Workflow is a proof-of-concept pipeline designed to harmonize antimicrobial resistance (AMR) detection outputs from multiple bioinformatics tools into a single, standardized report. By running leading AMR gene detection tools—including AMRFinderPlus, RGI, ResFinder, SRST2, DeepARG, and more—against genomic assemblies or sequencing reads, the workflow uses hAMRonization parsers to collate results into a unified, interoperable format. Built with Snakemake and available via Conda or containerized environments (Docker/Podman), this workflow streamlines comparative AMR analysis and supports reproducible, FAIR-aligned genomic surveillance.
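The idea of collating tool-specific outputs into one report can be sketched in a few lines. This is a simplified illustration, not the hAMRonization parsers themselves: the tool names `tool_a` and `tool_b`, their column names, and the unified field names are all assumptions made up for the example.

```python
# Hypothetical per-tool column names; real tool outputs differ and change
# between versions. The unified field names are also illustrative.
FIELD_MAPS = {
    "tool_a": {"Gene": "gene_symbol", "Contig": "contig_id", "Identity": "sequence_identity"},
    "tool_b": {"best_hit": "gene_symbol", "contig": "contig_id", "pct_id": "sequence_identity"},
}


def harmonize(tool: str, rows: list[dict]) -> list[dict]:
    """Rename tool-specific columns to shared names and tag each row with its source tool."""
    mapping = FIELD_MAPS[tool]
    unified = []
    for row in rows:
        record = {target: row[source] for source, target in mapping.items()}
        # Store identity as a number so values from different tools compare cleanly.
        record["sequence_identity"] = float(record["sequence_identity"])
        record["analysis_software_name"] = tool
        unified.append(record)
    return unified


tool_a_rows = [{"Gene": "blaTEM-1", "Contig": "contig_1", "Identity": "99.8"}]
tool_b_rows = [{"best_hit": "tet(A)", "contig": "contig_7", "pct_id": "100.0"}]

# One combined report in which every record has the same keys.
report = harmonize("tool_a", tool_a_rows) + harmonize("tool_b", tool_b_rows)
for record in report:
    print(record)
```

Keeping the per-tool mappings in a single table means that supporting another tool is a matter of adding one entry rather than writing new conversion code.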

The PHA4GE Genotyping Contextual Data Specification is a draft, ontology-based framework designed to standardize how microbial genotyping methods and results are reported and shared. Developed by the PHA4GE Data Structures Working Group, it provides machine-readable, FAIR-aligned attributes that improve the consistency, comparability, and interoperability of genotype data across laboratories, public repositories, and public health platforms. By harmonizing genotyping metadata—including methods, databases, software, and confidence values—the specification reduces duplication of effort and enhances searchable, reusable genomic surveillance data.

The Primer Schemes repository is a versioned, community-driven resource for standardized tiled amplicon primer scheme definitions used in pathogen sequencing. Designed to eliminate ambiguity in naming and versioning, it promotes FAIR data principles by improving the findability, accessibility, interoperability, and reusability of primer schemes and associated sequencing data. With machine-readable indexing, structured scheme specifications, and validation via the Primaschema tool, this repository supports consistent, transparent, and reproducible genomic surveillance workflows across pathogens including SARS-CoV-2, MPXV, and Nipah virus.

The HPAI Contextual Data Specification is a draft, ontology-based framework designed to standardize and harmonize contextual data for Highly Pathogenic Avian Influenza (HPAI) virus surveillance. Developed in collaboration with PHA4GE, it supports consistent, interoperable, and FAIR data collection through DataHarmonizer templates, field and term reference guides, and curation SOPs. This evolving specification enables improved data quality, integration, and global collaboration across public health, food, environmental, and host-specific surveillance efforts.

During the 2022 and 2024 global mpox outbreaks, a standardized contextual data specification was developed to support public health genomic surveillance of MPXV. The specification defines ontology-based fields and controlled vocabularies for harmonized capture of sample metadata together with epidemiological, clinical, laboratory, and methodological information, with emphasis on geo-temporal context, data provenance, and sampling strategy. Implemented within the open-source DataHarmonizer platform, the MPXV specification enables structured curation, validation, and transformation of surveillance data and is currently in use in Canada, with international applicability and extensibility to other pathogens.

This publication presents the PHA4GE wastewater contextual data specification, an ISO-compatible, ontology-based standard developed with global partners to support interoperable wastewater genomic surveillance. Implemented through open-source tools and shared frameworks, the specification enables harmonised data integration and serves as a model for broader environmental and metagenomic surveillance standards.

This publication describes the development of standardized contextual data quality-control (QC) tags by PHA4GE to support the responsible sharing and reuse of lower-quality or purpose-specific genomic datasets. Implemented using ontologies and adopted by public health networks such as FDA’s GenomeTrakr, the tags improve dataset discoverability, interpretation, and transparency across public repositories.

This publication describes the development of a standardized output specification and the hAMRonization tool to harmonise antimicrobial resistance (AMR) detection results across diverse bioinformatic tools. Developed with international public health laboratories, hAMRonization enables interoperable, unified AMR reporting and supports scalable integration into genomic surveillance workflows.

PHA4GE has developed an AMR gene detection output standard to address inconsistencies across existing tools and reference databases. Supporting parsers and automated pipelines enable harmonised outputs, benchmarking, and improved reuse of AMR surveillance data.

PHA4GE developed a SARS-CoV-2 contextual data specification package, implemented through a structured collection template and supporting protocols, to harmonise and support submission of genomic and contextual data to public repositories. Adoption of the standard improves data interoperability, reuse, and integration, and is supported by NCBI’s BioSample database.

This publication describes the development of an open, harmonised SARS-CoV-2 contextual data specification extending the INSDC pathogen package. The specification supports interoperable metadata collection, data submission to public repositories, and improved reuse and integration of genomic data for COVID-19 surveillance and research.

This publication describes the development of an open, harmonised SARS-CoV-2 contextual data specification created by PHA4GE to support interoperable metadata collection and submission to public repositories. Implemented through standardised templates, protocols, and tools, the specification improves data consistency, reuse, and integration, and is now supported by NCBI’s BioSample database to enhance global COVID-19 genomic surveillance.

Members

David Aanensen | University of Oxford, Wellcome Sanger Institute | United Kingdom

Kolawole Ojo Adekunle | Landmark University Omu-Aran Kwara State | Nigeria

Olusola Afuwape | University of Lagos | Nigeria

Mohammed Alarawi | Livestock and Fisheries development program | Saudi Arabia

Brian Alcock | McMaster University | Canada

Amjad Ali | National University of Sciences and Technology | Pakistan

Ridhuan Ali | Institute for Medical Research (IMR), Ministry of Health Malaysia | Malaysia

Ben Amos | Consultant to SEDRI-LIMS and Fleming Fund projects | United Kingdom

Dominique Anderson | SANBI, UWC | South Africa

Saadia Andleeb | National University of Sciences and Technology | Pakistan

Zohaib Anwar | Simon Fraser University | Canada

Mahjoub Aouni | Faculty of Pharmacy | Tunisia

Jaisy Arikkatt | Public Health Virology, Queensland Health | Australia

Bryce Asay | Centers for Disease Control and Prevention | United States

Adeepta Banerjee | Gujarat Biotechnology University | India

Charlotte Barclay | Simon Fraser University | Canada

Arindam Basu | University of Canterbury | New Zealand

Pankaj Bhatt | Michigan State University | United States

Allison Black | Department of Epidemiology, School of Public Health, University of Washington | United States

Emily Bordeleau | Simon Fraser University | Canada

Daniel Bridges | National Malaria Elimination Centre, Lusaka; Malaria Control & Elimination Partnership for Africa | Zambia

Rhiannon Cameron | Centre for Infectious Disease Genomics and One Health, Simon Fraser University | Canada

Josefina Campos | INEI-ANLIS “Dr Carlos G. Malbrán”, Buenos Aires | Argentina

Li Chin Chai | Sarawak Infectious Disease Centre | Malaysia

Leonid Chindelevitch | Imperial College London | United Kingdom

Evan Christensen | University of Utah Department of Biomedical Informatics | United States

Bede Constantinides | University of Oxford | United Kingdom

Natacha Couto | University of Oxford, Pandemics Science Institute, CGPS | Portugal

Carla Cummins | EMBL-EBI | United Kingdom

Tim Dallman | World Health Organization – International Pathogen | Netherlands

Miguel de Diego Fuertes | Universiteit Antwerpen | Belgium

Sabrina Di Gregorio | Instituto de Investigaciones en Bacteriología y Virología Molecular, Facultad de Farmacia y Bioquímica, Universidad de Buenos Aires | Argentina

Amadou Diallo | Institut Pasteur Dakar | Senegal

Delaney Ding | University of Florida | United States

Brody Duncan | McMaster University | Canada

Jolene Farrell | The University of Melbourne | Australia

Michael Feldgarden | NIH | United States

Daniel Fornika | BCCDC | Canada

Bastiaan Franssen | Robert Koch Institute | Germany

Linda Frisse | NIH/NLM/NCBI | United States

Praneeth Gangavarapu | Scripps Research | United States

Barbara Ghiglione | Universidad de Buenos Aires (UBA) | Argentina

Emma Griffiths | BCCDC | Canada

Jennifer Guthrie | Public Health Ontario | Canada

Mike Hamilton | Thermo Fisher Scientific | United States

Simon Harris | Bill and Melinda Gates Foundation | United Kingdom

Felix Hartkopf | Robert Koch Institute | Germany

Jane Hawkey | Monash University | Australia

Erik Hjerde | The Arctic University of Norway | Norway

Emma Hodcroft | Institute for Social and Preventive Medicine at the University of Bern | Switzerland

Ayiagnigni Njoya Thomas Hugues Yannick | Private | Cameroon

Lee Katz | CDC | United States

Chris Kent | University of Birmingham | United Kingdom

Kuhle Kitsili | South African National Bioinformatics Institute | South Africa

Terje Klemetsen | UiT The Arctic University of Norway | Norway

Rintu Kutum | Ashoka University | India

Danai Kwenda | University of Pretoria | South Africa

John Lees | Imperial College London | United Kingdom

Phoenix Logan | Chan Zuckerberg Initiative | United States

Tshikala Eddie Lulamba | SANBI | South Africa

Duncan MacCannell | CDC | United States

Sade Monica Magabotha | National Institute of Communicable Diseases, Centre for Enteric Diseases (Bacteriology) | South Africa

Finlay Maguire | Dalhousie University | Canada

Godwin Marokutimi | Ambrose Alli University, Edo State Nigeria | Nigeria

Andrew McArthur | McMaster University | Canada

DJOLIEU Medine | University of Yaounde | Cameroon

Catarina Inês Mendes | University of Lisbon | Portugal

Noutin Michodigni | SRL Cotonou | Benin

Jagadish Midthala | AI & Robotics Technology Park, Indian Institute of Sciences | India

Rania Milleron | Public Health | United States

Ilene Mizrachi | National Center for Biotechnology Information, National | United States

Catrin Moore | St George’s, University of London | United Kingdom

Karim Morey | Oregon State Public Health Laboratory | United States

Muhammad Ibtisam Nasar | United Arab Emirates University-Al Ain (UAEU) | United Arab Emirates

Aitana Neves | SIB Swiss Institute of Bioinformatics | Switzerland

Henry Njoku | University of Ibadan | Nigeria

Ashley Norberg | APHL, Global Health | United States

Judith Oguzie | University of Texas Medical Branch | United States

Idowu Olawoye | African Centre of Excellence for Genomics of Infectious Diseases (ACEGID), Redeemer’s University | Nigeria

Paul Oluniyi | African Centre of Excellence for Genomics of Infectious Diseases (ACEGID), Redeemer’s University Ede | Nigeria

Bonface Onyango | Pwani University | Kenya

Karen Osman | UKHSA | United Kingdom

Ngozi Otuonye | Nigerian Institute of Medical Research Yaba | Nigeria

Andrew Page | Quadram Institute | United Kingdom

Wangun Parfait Pascal | Centre Pasteur du Cameroun | Cameroon

Jillian Paull | Harvard University, Broad Institute, Gates Foundation | United States

Arjun Prasad | National Center for Biotechnology Information, National | United States

Alyss Pyke | Public Health Virology, Queensland Health | Australia

Amos Raphenya | McMaster University | Canada

Ana Tereza Ribeiro de Vasconcelos | Laboratório Nacional de Computação Científica, MCT | Brazil

Paula Roydhouse | The Peter Doherty Institute for Infection and Immunity | Australia

Nicole Ruiz-Schultz | Centers for Disease Control and Prevention | United States

Sneha S | AI & Robotics Technology Park, Indian Institute of Sciences | India

Rohit Satish | AI & Robotics Technology Park, Indian Institute of Sciences | India

Sarah Schmedes | Centers for Disease Control and Prevention | United States

Torsten Semmler | Robert Koch Institute | Germany

Joel Studebaker | New Jersey Department of Health, Public Health and Environmental Labs | United States

Enyo Sule | Nigerian Institute of Medical Research | Nigeria

Simon Tausch | German Federal Institute for Risk Assessment, Unit 4 SZ: Study Centre for Genome Sequencing and Analysis, Department Biological Safety | Germany

Suchitra Thapa | Tribhuvan University Nepal | Nepal

Ruth Timme | FDA | United States

Andrea Tyler | Government | Canada

Gregory Tyson | U.S. Food and Drug Administration | United States

Annelies Van Rie | University of Antwerp | Belgium

Nicholas Waglechner | Shared Hospital Laboratory | Canada

Maryem Wardi | Cell Biology and Molecular Genetics Lab, Faculty of Sciences, Ibnou Zohr University, Agadir | Morocco

Andrew Warren | University of Virginia | United States

Adam Witney | St George’s, University of London | United Kingdom

Eugene Yeboah | Association of Public Health Laboratories | United States

Shing Zang | Big Data Institute, University of Oxford | United Kingdom

Related Research

convAST is a command-line tool designed to convert antibiotic susceptibility test (AST) results from laboratory instruments, EMRs, and LIMS into an INSDC-compatible standardized format. Supporting major platforms such as Vitek, Microscan, Phoenix, and Sensititre, convAST applies structured mappings to transform tabular AST outputs into harmonized, submission-ready data. Built using LinkML schemas and a modular object model, convAST streamlines interoperability, improves data consistency, and facilitates integration of antimicrobial resistance results into public sequence repositories.
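To make the kind of transformation described here more concrete, the sketch below converts a small tabular AST export into records with standardized fields. It does not use convAST or its LinkML schemas: the input layout, the output field names, the `mg/L` unit, and the helper names `split_mic` and `convert` are all assumptions for illustration.

```python
import csv
import io
import re

# Hypothetical instrument export; real Vitek, Microscan, Phoenix, or Sensititre
# layouts differ, and this is not convAST's own mapping logic.
RAW_EXPORT = """isolate,drug,mic,interp
ISO-1,ampicillin,>=32,R
ISO-1,ciprofloxacin,<=0.25,S
ISO-1,gentamicin,2,S
"""

PHENOTYPES = {"S": "susceptible", "I": "intermediate", "R": "resistant"}
MIC_PATTERN = re.compile(r"^(<=|>=|<|>|=)?\s*(\d+(?:\.\d+)?)$")


def split_mic(raw: str) -> tuple[str, float]:
    """Split a MIC string such as '<=0.25' into a comparison sign and a number."""
    match = MIC_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized MIC value: {raw!r}")
    sign, value = match.groups()
    # A bare number carries no sign, so it is treated as an exact value.
    return (sign or "=", float(value))


def convert(export_text: str) -> list[dict]:
    """Turn each export row into a record with illustrative standardized fields."""
    records = []
    for row in csv.DictReader(io.StringIO(export_text)):
        sign, value = split_mic(row["mic"])
        records.append({
            "sample_name": row["isolate"],
            "antibiotic": row["drug"],
            "resistance_phenotype": PHENOTYPES[row["interp"]],
            "measurement_sign": sign,
            "measurement": value,
            "measurement_units": "mg/L",  # assumed unit for this example
        })
    return records


for record in convert(RAW_EXPORT):
    print(record)
```

Separating the comparison sign from the numeric value keeps censored results such as `>=32` distinguishable from exact measurements after conversion.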

PHA4GE kicked off work on malaria genomics data standards at ASTMH 2024 in New Orleans, where Alan Christoffels and Tracey Calvert-Joshua presented a funding proposal on metadata collection.

We're excited to announce episode 12 of the PHA4GE Genomic Horizons webinar series, featuring a talk by Dr. Su Datt Lam from the National University of Malaysia!