Working Group

Data Structures

Focus on the development, adaptation and standardization of data models for microbial sequence data, contextual metadata, results and workflow metrics to improve the transparency, interoperability and reproducibility of public health sequencing workflows.

Metadata standards, ontologies and conventions

Contextual data harmonization and sharing

Data inputs/outputs, APIs and interoperability

Result reporting and views

Data Security and Encryption

Identity management for role/resource based access

Overview

Inception: April 2020

# of Members: 30+

Chair: Emma Griffiths
University of British Columbia, Canada

Vice-Chair: Finlay Maguire

Contact: [email protected]

Working Group Description

How data is structured, organized, managed and stored greatly affects how it can be used and integrated with existing data. Standardized data structures and interchange formats are critical to the development of an open software ecosystem that will empower the microbial genomics community to analyze and govern their own data. The working group comprises researchers, bioinformaticians and domain experts from around the globe, representing public health agencies, research institutions and several large public databases. As one of PHA4GE’s Technical Working Groups, the Data Structures team focuses on the development, adaptation and standardization of data models for microbial sequence data, contextual metadata, analytical results, and workflow metrics. Through the adoption of shared data models, we hope to improve the transparency, interoperability, and reproducibility of public health sequencing workflows.

Projects

Antimicrobial resistance (AMR) is a global health problem that contributes to tens of thousands of deaths per year around the globe. A number of widely used AMR gene detection tools are currently available, which differ in their inputs, functionality (including parameters and reference databases), and outputs. Because the outputs of these tools differ in the meaning, structure and range of their values, comparing and interpreting results can be difficult for public health practitioners and researchers. To address these issues, we are developing a standardized AMR gene detection output specification to harmonize AMR detection results across tools and resources and improve interoperability. To support the specification, we have mapped the outputs of different tools to the standard and are developing Biopython-compatible parsers that will transform the variable outputs to the PHA4GE standard. We have also created a fully automated pipeline that will run arbitrary microbial genome datasets through almost all currently available species-agnostic AMR gene detection tools. This pipeline can be used in tandem with the parsers and the standard to enable data comparisons and benchmarking.
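
To make the idea of mapping tool outputs to a shared standard concrete, here is a minimal sketch in Python. It is not the working group's parser code. The tool names (`tool_a`, `tool_b`), their column headers, and the harmonized field names are all hypothetical and chosen only for illustration. The sketch reads two reports that describe the same hit with different column names and delimiters, and emits records with one common set of fields.

```python
import csv
import io

# Hypothetical harmonized field names, for illustration only;
# they are not taken from the PHA4GE specification.
HARMONIZED_FIELDS = [
    "sample_id",
    "gene_symbol",
    "percent_identity",
    "analysis_software_name",
]

# Hypothetical column mappings for two made-up tools.
COLUMN_MAPS = {
    "tool_a": {"SAMPLE": "sample_id", "GENE": "gene_symbol", "%IDENTITY": "percent_identity"},
    "tool_b": {"isolate": "sample_id", "best_hit": "gene_symbol", "identity": "percent_identity"},
}


def harmonize(report_text, tool_name, delimiter="\t"):
    """Convert one tool's delimited report into records with shared field names."""
    mapping = COLUMN_MAPS[tool_name]
    reader = csv.DictReader(io.StringIO(report_text), delimiter=delimiter)
    records = []
    for row in reader:
        # Start from an empty record so every output has the same keys.
        record = {field: None for field in HARMONIZED_FIELDS}
        for source_column, target_field in mapping.items():
            if source_column in row:
                record[target_field] = row[source_column]
        # Normalize the value type as well as the field name.
        if record["percent_identity"] is not None:
            record["percent_identity"] = float(record["percent_identity"])
        # Record which tool produced the result.
        record["analysis_software_name"] = tool_name
        records.append(record)
    return records


# Two made-up reports describing the same hit in different layouts.
tool_a_report = "SAMPLE\tGENE\t%IDENTITY\nisolate1\tblaTEM-1\t99.5\n"
tool_b_report = "isolate,best_hit,identity\nisolate1,blaTEM-1,99.5\n"

records_a = harmonize(tool_a_report, "tool_a")
records_b = harmonize(tool_b_report, "tool_b", delimiter=",")
```

Once both reports are in the same shape, records from different tools can be compared field by field, which is the kind of comparison and benchmarking the paragraph above describes.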

Genome sequencing of the SARS-CoV-2 virus has been a key tool for understanding the epidemiological spread of the disease at global, national and local scales. In the face of the current pandemic, we identified a clear and present need for a fit-for-purpose, open-source SARS-CoV-2 contextual data (metadata) standard. As such, we have developed a SARS-CoV-2 contextual data specification that incorporates publicly available community standards, as well as additional fields and guidance appropriate for public health surveillance and analyses. The specification is implemented via a collection template, as well as an array of protocols and tools to support the harmonization and submission of sequence data and contextual information to public repositories. Well-structured, rich contextual data adds value, promotes reuse, and enables aggregation and integration of disparate data sets. Adoption of the proposed standard and practices will better enable interoperability between datasets and systems, improve the consistency and utility of generated data, and ultimately facilitate novel insights and discoveries in SARS-CoV-2 and COVID-19.
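
As a rough illustration of how a contextual data specification can be enforced in practice, the following Python sketch checks a single record against a small set of rules. The field names, the list of required fields, and the pick-list values are assumptions made up for this example and do not reproduce the PHA4GE SARS-CoV-2 specification or its collection template.

```python
from datetime import date

# Hypothetical required fields and pick-list values, for illustration only.
REQUIRED_FIELDS = ["specimen_id", "collection_date", "country", "host"]
ALLOWED_VALUES = {"host": {"Human", "Animal", "Environment"}}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []

    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")

    # Fields with a controlled vocabulary must use one of its terms.
    for field, allowed in ALLOWED_VALUES.items():
        value = record.get(field)
        if value and value not in allowed:
            problems.append(f"{field}: '{value}' is not in the pick list")

    # Dates must be ISO 8601 so that records from different labs sort and merge cleanly.
    collection_date = record.get("collection_date")
    if collection_date:
        try:
            date.fromisoformat(collection_date)
        except ValueError:
            problems.append("collection_date: expected ISO 8601 (YYYY-MM-DD)")

    return problems


good = {"specimen_id": "S1", "collection_date": "2020-04-01", "country": "Canada", "host": "Human"}
bad = {"specimen_id": "S2", "collection_date": "04/01/2020", "host": "human"}

good_problems = validate_record(good)
bad_problems = validate_record(bad)
```

Checks of this kind are one way a shared standard supports the aggregation the paragraph above mentions: records that pass the same rules can be combined without per-dataset cleanup.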

Members

Adam Witney

St George's, University of London

Aitana Neves

SIB Swiss Institute of Bioinformatics

Allison Black

Washington State Department of Health

Amadou Diallo

Institut Pasteur Dakar

Amjad Ali

National University of Sciences and Technology

Amos Raphenya

McMaster University

Ana Tereza Ribeiro de Vasconcelos

Laboratório Nacional de Computação Científica

Andrea Tyler

Government

Andrew Page

Theiagen Genomics

Andrew McArthur

McMaster University

Annelies Van Rie

University of Antwerp

Arindam Basu

University of Canterbury

Arjun Prasad

National Center for Biotechnology Information

Ashley Norberg

APHL, Global Health

Bede Constantinides

University of Oxford

Brian Alcock

McMaster University

Brody Duncan

McMaster University

Bryce Asay

US CDC

Carla Cummins

EMBL-EBI

Catarina Inês Mendes

Theiagen Genomics

Charlotte Barclay

Simon Fraser University

Daniel Fornika

BCCDC

Daniel Bridges

National Malaria Elimination Centre, Lusaka, Zambia; Malaria Control & Elimination Partnership for Africa

David Aanensen

University of Oxford / Wellcome Sanger Institute

Duncan MacCannell

US CDC

Emma Griffiths

BCCDC

Emma Hodcroft

Swiss TPH

Erik Hjerde

The Arctic University of Norway

Eugene Yeboah

Association of Public Health Laboratories

Finlay Maguire

Dalhousie University

Gregory Tyson

U.S. Food and Drug Administration

Idowu Olawoye

University of Western Ontario

Ilene Mizrachi

National Center for Biotechnology Information

Jane Hawkey

Monash University

Jennifer Guthrie

Public Health Ontario

Jillian Paull

Harvard University, Broad Institute / Gates Foundation

John Lees

Imperial College London

Josefina Campos

WHO - IPSN

Karen Osman

UKHSA

Karim Morey

Oregon State Public Health Laboratory

Kolawole Ojo Adekunle

Landmark University, Omu-Aran Kwara State

Lee Katz

US CDC

Leonid Chindelevitch

Imperial College London

Mahjoub Aouni

University of Monastir

Michael Feldgarden

NIH

Miguel de Diego Fuertes

Universiteit Antwerpen

Mike Hamilton

Thermo Fisher Scientific

Muhammad Ibtisam Nasar

United Arab Emirates University-Al Ain (UAEU)

Ngozi Otuonye

Nigerian Institute of Medical Research, Yaba

Olusola Afuwape

University of Lagos

Sabrina Di Gregorio

Instituto de Investigaciones en Bacteriología y Virología Molecular, Universidad de Buenos Aires

Sarah Schmedes

US CDC

Shing Zang

Big Data Institute, University of Oxford

Simon Tausch

German Federal Institute for Risk Assessment

Simon Harris

Bill and Melinda Gates Foundation

Paul Oluniyi

Chan Zuckerberg Biohub

Phoenix Logan

Chan Zuckerberg Initiative

Rhiannon Cameron

Centre for Infectious Disease Genomics and One Health, Simon Fraser University

Ridhuan Ali

Institute for Medical Research (IMR), Ministry of Health Malaysia

Rintu Kutum

Ashoka University

Ruth Timme

FDA

Suchitra Thapa

Tribhuvan University

Terje Klemetsen

The Arctic University of Norway

Tim Dallman

Utrecht University / WHO - Integrated Pathogen Surveillance Network

Torsten Semmler

Robert Koch Institute

Tshikala Eddie Lulamba

SANBI

Wangun Parfait Pascal

Centre Pasteur du Cameroun

Joel Studebaker

New Jersey Department of Health

Bonface Onyango

Pwani University

Medine Djolieu

University of Yaoundé I

Related Research

In collaboration with 17 public health laboratories across 10 countries, the PHA4GE Data Structures working group has developed and piloted a standardized output specification for the bioinformatic detection of AMR from microbial genomes.

Amidst an atmosphere charged with anticipation and excitement, the first day of the PHA4GE Conference burst onto the scene, igniting minds and sparking conversations that promised to shape the future of pathogen genomics, globally.

With a line-up curated to address pressing challenges and explore new frontiers, attendees had the opportunity to engage deeply with a variety of workshop topics.