Working Group

Data Structures

Focus on the development, adaptation and standardization of data models for microbial sequence data, contextual metadata, results and workflow metrics to improve the transparency, interoperability and reproducibility of public health sequencing workflows.

Focus Areas

Metadata standards, ontologies and conventions

Contextual data harmonization and sharing

Data inputs/outputs, APIs and interoperability

Result reporting and views

Data security and encryption

Identity management for role/resource based access

Overview

Inception: April 2020

# of Members: 30+

Chair: Emma Griffiths
University of British Columbia, Canada

Vice-Chair: Finlay Maguire

Contact: [email protected]

Working Group Description

How data is structured, organized, managed and stored greatly impacts how it can be used and integrated with existing data. Standardized data structures and interchange formats are critical to the development of an open software ecosystem that will empower the microbial genomics community to analyze and govern their own data. The working group comprises researchers, bioinformaticians and domain experts from around the globe, representing public health agencies, research institutions and large public databases. As one of PHA4GE’s Technical Working Groups, the Data Structures team focuses on the development, adaptation and standardization of data models for microbial sequence data, contextual metadata, analytical results, and workflow metrics. Through the adoption of these data models, we hope to improve the transparency, interoperability, and reproducibility of public health sequencing workflows.

Projects

Gene Detection/AMR Output Specification

Antimicrobial resistance (AMR) is a global health problem that contributes to tens of thousands of deaths per year around the globe. A number of widely used AMR gene detection tools are currently available, which differ in their inputs, functionality (including parameters and reference databases), and outputs. Differences in the meaning, structure and range of values in the outputs of these tools can make comparing and interpreting results difficult for public health practitioners and researchers. To address these issues, we are developing a standardized AMR gene detection output specification to harmonize AMR detection results across tools and resources and improve interoperability. To support the specification, we have mapped the outputs of different tools to the standard and are developing Biopython-compatible parsers that will transform the variable outputs to the PHA4GE standard. We have also created a fully automated pipeline that will run arbitrary microbial genome datasets through almost all currently available species-agnostic AMR gene detection tools. This pipeline can be used in tandem with the parsers and the standard to better enable comparisons of data and for benchmarking.
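As a rough illustration of what mapping tool outputs to a common standard involves, the sketch below renames tool-specific columns to a shared set of field names. It is not the PHA4GE specification or its parsers: the tool names (`tool_a`, `tool_b`), their column headers, and the harmonized field names are all hypothetical and chosen only to show the idea.

```python
import csv
import io

# Hypothetical mappings: tool-specific column header -> harmonized field name.
# Neither the tools nor the field names come from the actual specification.
FIELD_MAPS = {
    "tool_a": {
        "SEQUENCE": "contig_id",
        "GENE": "gene_symbol",
        "%IDENTITY": "sequence_identity",
    },
    "tool_b": {
        "Contig": "contig_id",
        "Gene symbol": "gene_symbol",
        "Percent identity": "sequence_identity",
    },
}


def harmonize(tsv_text, tool):
    """Convert one tool's tab-separated report into harmonized records."""
    mapping = FIELD_MAPS[tool]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        # Record which tool produced the result so provenance is not lost.
        record = {"analysis_software_name": tool}
        for source_column, target_field in mapping.items():
            record[target_field] = row[source_column]
        # Normalize the value type so results are comparable across tools.
        record["sequence_identity"] = float(record["sequence_identity"])
        records.append(record)
    return records


tool_a_report = "SEQUENCE\tGENE\t%IDENTITY\ncontig_1\tblaTEM-1\t99.5\n"
tool_b_report = "Contig\tGene symbol\tPercent identity\ncontig_1\tblaTEM-1\t99.50\n"

records_a = harmonize(tool_a_report, "tool_a")
records_b = harmonize(tool_b_report, "tool_b")
```

Once both reports share field names and value types, the same gene call from two tools can be compared directly, which is the kind of comparison the benchmarking pipeline is meant to support.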

SARS-CoV-2 Contextual Data Specification

Genome sequencing of the SARS-CoV-2 virus has been a key tool for understanding the epidemiological spread of the disease at global, national and local scales. In the face of the current pandemic, we identified a clear and present need for a fit-for-purpose, open-source SARS-CoV-2 contextual data (metadata) standard. As such, we have developed a SARS-CoV-2 contextual data specification that incorporates publicly available community standards, as well as additional fields and guidance appropriate for public health surveillance and analyses. The specification is implemented via a collection template, as well as an array of protocols and tools to support the harmonization and submission of sequence data and contextual information to public repositories. Well-structured, rich contextual data adds value, promotes reuse, and enables aggregation and integration of disparate data sets. Adoption of the proposed standard and practices will better enable interoperability between datasets and systems, improve the consistency and utility of generated data, and ultimately facilitate novel insights and discoveries in SARS-CoV-2 and COVID-19.
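To make the idea of a structured contextual data record more concrete, the following minimal sketch checks a record for required fields and a consistently formatted collection date. The field names and the list of required fields are assumptions made for illustration; they are not taken from the specification or its collection template.

```python
from datetime import date

# Hypothetical required fields; the real specification defines its own.
REQUIRED_FIELDS = ("sample_id", "collection_date", "country", "organism")


def validate_record(record):
    """Return a list of problems found in one contextual data record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field, "")).strip():
            problems.append(f"missing required field: {field}")

    # A consistent date format makes records from different labs comparable.
    collection_date = str(record.get("collection_date", "")).strip()
    if collection_date:
        try:
            date.fromisoformat(collection_date)
        except ValueError:
            problems.append("collection_date is not in YYYY-MM-DD format")
    return problems


good_record = {
    "sample_id": "S1",
    "collection_date": "2020-04-01",
    "country": "Canada",
    "organism": "SARS-CoV-2",
}
bad_record = {"sample_id": "S2", "collection_date": "04/01/2020", "country": ""}

good_problems = validate_record(good_record)
bad_problems = validate_record(bad_record)
```

Checks of this kind, applied before submission to a public repository, are one way a collection template can help keep datasets consistent enough to aggregate.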