An Ocean Networks Canada (ONC) project will put powerful new tools in the hands of researchers to make better use of datasets for original research, access data to reproduce and advance existing research, and give proper credit for their colleagues’ work.
Supported by funding from the Digital Research Alliance of Canada, the DynaCITE pilot project will promote culture change through training in data citation for ONC’s data partners and researchers, including both established and early-career scientists.
The movement toward making scientific data more freely available (open source) and the dynamic nature of big data present challenges to the fast-changing field of research data management. DynaCITE training and targeted workshops to be held over the coming months will ensure the ONC research community is conversant with best practices and with citation tools such as Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs).
A PID is a unique, permanent code used to identify an institution, dataset, research study or researcher, and PIDs are widely considered a mandatory element of good data management practice. Like citations for published research, data citations based on PIDs point the way to the exact source material used in a piece of research, along with its location.
“We're really trying to facilitate better science, more reproducible science, more traceable science,” said ONC data stewardship manager Reyna Jenkyns.
In practical terms, this means publications, datasets, researchers, organizations and funders are allocated persistent identifiers with descriptive metadata that includes their relationships to other entities, so that users can establish these interconnections in the research landscape.
“We are able to connect the dots between decisions and evidence in a reliable and traceable way that supports reproducibility and gives credit to individuals, organizations and funders,” she said.
This diagram shows the digital infrastructure that enables proper citation of data for the ONC Oceans 3.0 system (in blue), and its relationships to third-party sources and applications (in orange).
Key to reproducing research is the ability to access the same datasets used in the original research. Complicating that task is the sheer volume of new data being added, minute by minute: in ONC’s case, well over 140 terabytes a year flow into its 1.2-petabyte data archive.
“We are continually adding to the time series, but it may also be dynamic in terms of versioning,” said Jenkyns. “Sometimes information is derived from a formula applied to raw data and those formulas may be changed or improved over time.”
Data repositories must account for corrections and improvements and in some cases recreate an earlier version of the data. PIDs help data managers to track changes and explain them.
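As a rough illustration of how identifiers can help track and explain changes, the sketch below records each dataset version under its own identifier, along with the formula used and the reason for the change. It is a minimal sketch under stated assumptions, not a description of how Oceans 3.0 works: the class names, field names and identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetVersion:
    pid: str                   # hypothetical identifier, not a real DOI
    formula: str               # label of the formula used to derive the data
    supersedes: Optional[str]  # PID of the version this one replaces, if any
    reason: str                # human-readable explanation of the change


class VersionRegistry:
    """Hypothetical in-memory registry of dataset versions keyed by PID."""

    def __init__(self) -> None:
        self._versions: dict[str, DatasetVersion] = {}

    def register(self, version: DatasetVersion) -> None:
        # A persistent identifier must never be reassigned to new content.
        if version.pid in self._versions:
            raise ValueError(f"{version.pid} is already registered")
        self._versions[version.pid] = version

    def history(self, pid: str) -> list[DatasetVersion]:
        """Walk back from a version to the original, newest first."""
        chain = []
        current = self._versions.get(pid)
        while current is not None:
            chain.append(current)
            if current.supersedes is None:
                break
            current = self._versions.get(current.supersedes)
        return chain


registry = VersionRegistry()
registry.register(
    DatasetVersion("example/salinity-v1", "formula-A", None, "initial release")
)
registry.register(
    DatasetVersion(
        "example/salinity-v2", "formula-B", "example/salinity-v1", "improved formula"
    )
)

# Explain how the cited version came about, newest first.
for version in registry.history("example/salinity-v2"):
    print(version.pid, "-", version.reason)
```

Because every version keeps its own identifier, a citation to the earlier version still resolves to exactly the data a study used, even after the formula has been improved.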
“PIDs are basically just a string of characters that we attach to so-called objects in the ecosystem,” she said. “For example, who made that object, who has the rights to that object and any licensing associated to it, and what is its relationship to the other objects out there.”
“All those links support queries about that constellation of relationships, that in practice look like magic.”
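To make the idea of querying a "constellation of relationships" concrete, here is a minimal sketch that stores links between identifiers as simple triples and follows them from a decision back to the people and funders behind the underlying data. The identifiers, relation names and helper functions are hypothetical, chosen only for illustration; real PID metadata schemas define their own relation vocabularies.

```python
from collections import defaultdict

# Hypothetical identifiers and relation names, for illustration only.
LINKS = [
    ("dataset:example-001", "createdBy", "person:example-researcher"),
    ("dataset:example-001", "fundedBy", "org:example-funder"),
    ("paper:example-2023", "cites", "dataset:example-001"),
    ("decision:example-policy", "basedOn", "paper:example-2023"),
]


def build_index(links):
    """Index links by subject (outgoing) and by object (incoming)."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for subject, relation, obj in links:
        outgoing[subject].append((relation, obj))
        incoming[obj].append((relation, subject))
    return outgoing, incoming


def trace_upstream(pid, outgoing):
    """Return every identifier reachable from `pid` via outgoing links."""
    seen = []
    stack = [pid]
    while stack:
        node = stack.pop()
        for _relation, target in outgoing.get(node, []):
            if target not in seen:
                seen.append(target)
                stack.append(target)
    return seen


outgoing, incoming = build_index(LINKS)

# Everything a decision ultimately rests on: paper, dataset, people, funders.
print(trace_upstream("decision:example-policy", outgoing))

# Which works cite a given dataset, so its creators can be credited.
print(incoming["dataset:example-001"])
```

The same index answers questions in both directions: what evidence a decision rests on, and where a given dataset has been reused.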
That metadata also attaches credit for collected data to the researchers who collect it. Data collection yields a valuable product that has historically been overshadowed by the studies that interpret it. With proper attribution, researchers can be cited for the datasets they create each time those datasets are reused.
“What we end up with is traceability, connections between the datasets, who made them, what research was produced, and the decisions derived from that,” she said.
ONC will host a DynaCITE workshop for its data partners this summer, with a second workshop for researchers planned for the fall.
The 12-month DynaCITE project runs until March 31, 2023 and is supported by the Digital Research Alliance of Canada, a non-profit organization funded by Innovation, Science and Economic Development Canada (ISED), Government of Canada.