PHA4GE’s Data Structures Working Group, led by Dr Emma Griffiths, has been working on a paper focusing on the challenges of low-quality public health sequence datasets. We chat to them about how they successfully put together the Quality Control Contextual Data Tags paper.
What are PHA4GE Quality Control (QC) Contextual Data Tags?
Data quality is important because it can impact analyses that are used to inform public health decision-making. Currently, when data providers share data in public repositories like the INSDC (i.e. NCBI, EBI-ENA, DDBJ), there are no required or standardized fields for including information about quality control processes or results. Without an easy way to communicate about quality control, the burden of investigating data quality falls on data consumers. Without such a mechanism, there is also no easy way to identify lower-quality datasets, which are useful for training and teaching personnel as well as for testing software. The PHA4GE QC Contextual Data Tags are attributes (standardized fields and terms) that enable submitters to “tag” their datasets with high-level information about the quality control processes that have been performed on shared sequence data, to better enable searching and triaging of datasets for public health use.
What made you focus on this project?
PHA4GE aims to be responsive to community needs. In many discussions among our networks, quality control of publicly available data was flagged as a key issue. A common problem that can hinder the use and reuse of microbial and metagenomic sequences is making sure that the quality of the sequence sets can be readily understood. Furthermore, our colleagues described how the most useful datasets for benchmarking and quality control assessment often contained both high- and lower-quality data that highlight common errors or limitations of sequencing or analysis. These datasets were often hard to find and were usually sourced from friends or by word of mouth, because insufficient annotation in databases hindered findability and searchability. Many members of PHA4GE had had similar experiences. Improved QC annotations would help make sequences and datasets more FAIR (Findable, Accessible, Interoperable, Reusable). We also had community requests for standardized QC attributes to enhance data sharing. As a result, the Data Structures Working Group felt that developing a set of standardized attributes for communicating quality control assessment and results should be a priority.
How will this project be beneficial to the public health space?
We hope that the QC Contextual Data Tags will encourage the public health bioinformatics community to include more information about the quality control processes they perform, which would have a number of benefits for using publicly available data for public health analyses (such as enabling better comparisons across platforms and instruments for method improvements, and identifying methodological issues when troubleshooting). We also hope that the tags will enable public health labs to find the data they need more quickly, and to have more confidence in the data they access.
What role has the International Nucleotide Sequence Database Collaboration (INSDC) played in the development of the tags, and in what ways will the tags be useful to the INSDC?
The INSDC were a key group that we consulted to ensure that the tags could be supported in public repository databases, and in data submissions – without INSDC support, it would be difficult to implement the tags. INSDC representatives provided important input during the development process including decision-making about where the tags should be stored and how they could be made available to users. As such, partnership with the INSDC was critical.
What have been the benefits of working with different researchers on this project?
Working with a variety of researchers was important for understanding different use cases, different quality control issues and scenarios, and the usability and applicability of the tags. Having a wide range of input helped to highlight when the tags, or the way to use them, were unclear or confusing, and when additional guidance would be required. Testing out the tags with the help of GenomeTrakr was incredibly helpful, as we got to observe how scientists could implement the data structures in real-world settings across a variety of different labs.
What was your biggest challenge in doing this project? How did you overcome it?
There is a wide variety of quality control issues that could be flagged, so a primary concern was choosing the right tags: the list should not be overwhelming, and the tags should be easy to use. To help us prioritize the right ones, we conducted consultations with the community as well as with representatives from the INSDC. We also created a New Term Request (NTR) system in our GitHub repository so that community members could put in requests for new fields and terms if the list became outdated, if improvements were needed, or if we left out anything important. The NTR system is a way to help PHA4GE match data specifications with community needs.
Based on your experience, what would your advice be to others who would like to conduct a similar project?
Our advice for developing data structures/specifications would be to:
- Do your homework first – consult as widely as possible to help ensure the specification you are developing is broadly useful
- Keep the scope of the specification focused on a particular challenge or set of challenges
- Implement and/or reuse data standards and ontologies as often as possible (and consult a specialist)
- Try out the specification in the real world via pilot projects before publishing to ensure feasibility and practicality
- Create mechanisms for updating and versioning the specification so that it can evolve with changing data needs.
Compiled by Zenande Mgijima, PHA4GE intern