Managing your research data

With the quantity of digital data produced by research projects increasing exponentially, data management has become a real challenge for research organizations.
Managing data effectively is vital to ensure that the information can be retrieved, secured, used and shared.

The basics

According to the OECD, research data are factual records (numerical scores, textual records, images and sounds) that are used as primary sources for scientific research and are commonly accepted in the scientific community as necessary to validate research findings. Documents (laboratory notebooks, preliminary analyses, draft scientific documents, personal correspondence, etc.) and physical objects (bacterial strains, lab animals, etc.) are therefore not considered research data.

Data are generally grouped together into a dataset which has a degree of unity and forms a coherent whole.

 


The data lifecycle

The data lifecycle has six broad stages: creation or collection, processing, analysis, preservation, access and reuse. Each stage of the cycle involves data management measures, which are shown on the data lifecycle diagram.

 

Image adapted from the data lifecycle published by the UK Data Service: https://www.ukdataservice.ac.uk/manage-data/lifecycle


Why manage your data?

Good data management is vital! Everyone says so, but why? Because well-managed data can be retrieved and reused by the scientific community. Because good data management is beneficial for scientists themselves and for their institution. And finally, because it is compulsory in some cases, especially for EU-funded projects.

The FAIR principles: Findable, Accessible, Interoperable, Reusable

One of the aims of data management is to facilitate the discovery and reuse of academic knowledge by both individuals and computer systems. The FAIR principles serve as a guideline to help achieve this objective. The four FAIR principles are as follows:

 

Findable

Data must be easy to find by both humans and computer systems.

Accessible

Data should be stored on a long-term basis so that they can be easily accessed and/or downloaded.

Interoperable

Data must be readable and usable by different IT systems so that they can be shared and reused.

Reusable

Data should be reusable in future research and processable by computational methods.
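As a rough illustration of how the four principles can translate into checks on a dataset description, here is a minimal Python sketch. The `DatasetRecord` fields, the `OPEN_FORMATS` list and the rules in `fair_checklist` are hypothetical simplifications chosen for this example; they are not an official FAIR assessment method.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical minimal description of a dataset (field names are illustrative)."""
    title: str
    identifier: str = ""    # persistent identifier, e.g. a DOI
    access_url: str = ""    # where the data can be accessed or downloaded
    file_format: str = ""   # e.g. "csv"
    license: str = ""       # conditions of reuse
    keywords: list[str] = field(default_factory=list)


# Illustrative list of widely readable formats; an assumption, not exhaustive.
OPEN_FORMATS = {"csv", "tsv", "json", "txt", "xml"}


def fair_checklist(record: DatasetRecord) -> dict[str, bool]:
    """Return a very simplified yes/no check for each FAIR principle."""
    return {
        # Findable: has a persistent identifier and descriptive keywords
        "findable": bool(record.identifier and record.keywords),
        # Accessible: states where the data can be retrieved
        "accessible": bool(record.access_url),
        # Interoperable: stored in a format many systems can read
        "interoperable": record.file_format.lower() in OPEN_FORMATS,
        # Reusable: reuse conditions are stated
        "reusable": bool(record.license),
    }


if __name__ == "__main__":
    record = DatasetRecord(
        title="Example dataset",
        identifier="10.1234/example",          # made-up DOI for illustration
        access_url="https://example.org/data",  # placeholder URL
        file_format="csv",
        license="CC-BY-4.0",
        keywords=["example"],
    )
    for principle, ok in fair_checklist(record).items():
        print(f"{principle}: {'yes' if ok else 'no'}")
```

A real assessment would look at much more than these fields (metadata standards, vocabularies, provenance), so the checklist is best read as a reminder of what each principle asks for.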

 

Advantages for scientists and the institution

 


Managing data effectively in compliance with the FAIR principles has several advantages:

  • It ensures that research data are accurate, complete and reliable

  • It improves data security and minimizes the risk of data loss

  • It guarantees the integrity and reproducibility of research

  • It avoids data duplication, saving time and resources

  • It boosts the visibility and impact of a scientist's work

  • It encourages reuse and innovation through sharing

  • It facilitates the establishment of scientific partnerships

 

A policy of the Institut Pasteur

 

This Policy sets out the Institut Pasteur's guidelines on the management and sharing of research data and software code. It aims to facilitate the sharing and reuse of data and software code according to the FAIR (Findable, Accessible, Interoperable, Reusable) principles.

It summarizes the best practices to be implemented throughout the research process and refers to fact sheets that give scientists the operational resources they need to implement these best practices.

This Policy was developed as part of a collaborative, cross-departmental project led by the CeRIS and the Data Management Core Facility.

Contact: rdm-policy@pasteur.fr

 

 

A requirement for funding bodies

French National Research Agency (ANR)

Since 2019, the ANR has implemented Open Science requirements. These obligations are part of the French national strategy for Open Science, initiated with the National Plan for Open Science in 2018. The ANR draws coordinators’ attention to the importance of considering data management and sharing at the project development phase, following the principle “as open as possible, as closed as necessary”. The ANR requires all projects funded from 2019 onwards to produce a Data Management Plan (DMP).

The European Commission

Since January 1, 2017, all grant beneficiaries of European Commission calls are encouraged to take measures concerning the data needed to validate the results presented in publications:

[Figure: Requirements]

In the H2020 program, it was possible to avoid complying with these obligations (via an opt-out) without this having a negative impact on the evaluation of the project.

In the Horizon Europe program (which covers the period from 2021 to 2027), drawing up a data management plan has become compulsory for all European projects. In addition, it is possible that effective evidence of data dissemination will be progressively requested.

For more information on funders' requirements, consult the CeRIS fact sheet.

 

Drawing up a data management plan

What is a data management plan?

A data management plan (DMP) is a document drawn up at the beginning of a research project which defines how the data will be managed during and after the project, from creation or collection to sharing and archiving. The DMP needs to be updated regularly over the course of the research project.

It covers the following aspects:

[Figure: Drawing up a data management plan]

How to draw up a DMP?

The CeRIS provides support for Institut Pasteur scientists in drawing up their data management plan by proposing a template composed of a series of questions that all scientists need to consider at the beginning of a research project. Each question is accompanied by sample answers and advice provided by the relevant departments at the Institut Pasteur. The structure of the DMP template is based on the template proposed by the European Commission, with contributions from several departments at the Institut Pasteur, including the CeRIS library and archives, the Information Systems Department, the Legal Affairs Department, the Ethics Unit, the Patents and Inventions Department, the Quality Unit and the Center for Translational Science.

The CeRIS provides scientists with the DMP template and several documents to help them with the process of drawing up their DMP.

Using and sharing data

Publishing data in a data repository

Data repositories are online services for the deposition, description, storage, retrieval and dissemination of datasets. The datasets are described by metadata in such a way as to be retrievable.
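To make the role of metadata in retrieval more concrete, the following Python sketch searches a tiny in-memory catalog by title and keyword. The catalog entries, the field names and the `search_catalog` function are hypothetical; real repositories use richer metadata schemas and dedicated search indexes.

```python
# Hypothetical catalog: each dataset is described by a small metadata record.
CATALOG = [
    {
        "title": "Growth curves of bacterial cultures",
        "keywords": ["microbiology", "growth"],
        "creator": "Example Lab A",
    },
    {
        "title": "Survey of laboratory data practices",
        "keywords": ["survey", "data management"],
        "creator": "Example Lab B",
    },
]


def search_catalog(catalog: list[dict], term: str) -> list[dict]:
    """Return the records whose title or keywords contain the search term."""
    term = term.lower()
    return [
        record
        for record in catalog
        if term in record["title"].lower()
        or any(term in keyword.lower() for keyword in record["keywords"])
    ]


if __name__ == "__main__":
    for hit in search_catalog(CATALOG, "growth"):
        print(hit["title"])
```

A dataset deposited without a title or keywords would never be returned by such a search, which is why careful description at deposit time matters.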

When choosing a repository, several factors need to be taken into account. Firstly, it must meet the requirements of the funding body or publisher. Secondly, it must have all the characteristics needed to store FAIR data (findable, accessible, interoperable, reusable). We would also recommend opting for a certified "trusted repository".

To find a suitable repository for your research field, you can consult several directories:

 

  • The list of repositories for the biomedical field proposed by the CeRIS contains repositories that are either certified or recommended by a publisher or a funding body.

  • Re3data is a multidisciplinary directory of data repositories (social sciences, life sciences, medicine, etc.) which enables users to filter results and only display certified repositories.

  • FAIRsharing is a directory of data repositories in life sciences which enables users to filter results and only display repositories recommended by publishers or funding bodies.

 

Publishing a data paper

A data paper (or data article) is a peer-reviewed scientific publication whose main aim is to describe one or more datasets rather than the results of scientific research. The data described must be accessible, either as annexed files or more generally via a permalink (URL or DOI) to the data repository where they are stored. Data papers can be published in a data journal (a journal that only contains data papers) or in a traditional scientific journal (which publishes a range of articles including data papers).
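As a small illustration of the permalink idea, the sketch below turns a DOI into a resolvable URL through the public `https://doi.org/` resolver. The function name `doi_to_url`, its validation rule and the DOI used in the example are assumptions made for this sketch; the DOI shown is made up and does not point to a real dataset.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Build a permalink from a DOI, accepting a few common ways of writing it."""
    doi = doi.strip()
    # Drop a leading resolver address or "doi:" label if one is present.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Simplified sanity check: DOIs start with "10." and contain a "/".
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"Not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL, keeping "/" as is.
    return "https://doi.org/" + quote(doi, safe="/")


if __name__ == "__main__":
    # Made-up DOI, for illustration only.
    print(doi_to_url("doi:10.1234/example.5678"))
```

Citing the dataset with this kind of link in the data paper lets readers reach the repository record even if the repository later reorganizes its own web addresses.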

Publishing a data paper is a way of informing the scientific community of the existence of a dataset that has been deposited in a data repository. This makes the data more visible and easier to cite. It also enables the data to be described precisely, thereby opening up the potential for reuse.

Some examples of data journals:

 

Questions & Answers

Where can I learn about research data management online?

The ELIXIR Research Data Management Kit (RDMkit) is an online guide containing good data management practices applicable to life sciences research projects. Developed and managed by people who work every day with life science data, the RDMkit has guidelines, information, and pointers to help you with problems throughout the data's life cycle.

What is the point of drawing up a data management plan if it is not compulsory?

Drawing up a data management plan before beginning your project is a way of asking yourself the right questions and adopting best data management practices. Well-managed data are easy to retrieve and reuse, described precisely by metadata, secure and durable. If the journal in which you are publishing an article asks you to deposit the accompanying data in a repository, you can rest assured that your metadata are already prepared and all you have to do is transfer them to the various fields. You can also easily make your data accessible and visible by publishing them in a data paper.

Is there a search engine that I can use to search for data in different repositories?

There are several data search engines:

  • DataMed provides access to various types of data in the biomedical field. It currently covers 76 repositories and offers a powerful advanced search.

  • Omics Discovery Index allows you to search for datasets in the fields of genomics, proteomics, transcriptomics and metabolomics. It also offers advanced search functions (by organism, by disease, etc.).

  • Elsevier DataSearch covers more diverse scientific fields. It can be used to access datasets from a more limited number of repositories but also some supplementary data.

  • Google Dataset Search is the least efficient. It offers a basic search and very few features.

 


 

 

Open Science newsletter

Every two weeks, the Open Science newsletter provides information and sheds light on developments, challenges and new practices in three key areas of Open Science: scientific publishing in the age of Open Access, data and software management and sharing, and research evaluation and planning.

 
