The wonderful world of OpenScience

Scientific and Technical Seminar of the PROSE Research Unit

Cédric Midoux

PROSE

June 30, 2023

The fourth paradigm of science

[1]

[2]

The data deluge

[3]

Innovation’s march

In the past ...

Science today


The ravages of time

[4]

[5]

Reproducibility?

[6]

[7]

Reproducibility?

[8]

Threats to reproducible science

[9]

A data management horror story

State of play …

  • Data deluge
  • Reproducibility crisis
  • Ethics crisis
    • P-hacking
    • Publish or Perish
  • Scientific-political crisis
    • Research funding
    • Private research
    • Academic publishing company

[10]

According to the UNESCO Recommendation, open science is a set of principles and practices that aim to make scientific research from all fields accessible to everyone for the benefit of scientists and society as a whole. The Recommendation aims to ensure not only that scientific knowledge is accessible but also that the production of that knowledge itself is inclusive, equitable and sustainable.

By promoting science that is more accessible, inclusive and transparent, open science furthers the right of everyone to share in scientific advancement and its benefits, as stated in Article 27.1 of the Universal Declaration of Human Rights.

FAIR Guiding Principles

Cost of not having FAIR research data

Following this approach, we found that the annual cost of not having FAIR research data costs the European economy at least €10.2bn every year

[13]

Research data

[14]

Factual records

Primary sources for scientific research

Necessary to validate research findings

This Recommendation principally concerns research data in a digital, computer-readable format.

Open Data 5★

  • OL★ : Open License
  • RE★ : machine REadable
  • OF★ : Open Format
  • URI★ : Uniform Resource Identifier
  • LD★ : Linked Data

Metadata

Data on data.

  • Date experiment was done
  • Time a measurement was made
  • Number of repeated measurements
  • Who conducted the experiment
  • Dosage of treatment
  • How many subjects were in study
  • How many subjects dropped out of study
  • Experimental design

WHO? WHAT? WHEN? WHERE? HOW? WHY?
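
As a minimal sketch of how the six questions above can travel with a dataset, the example below stores the answers in a JSON file next to the data and reports any question left unanswered. The key names, the `.meta.json` suffix and all values are hypothetical and do not follow any particular metadata standard.

```python
import json
import tempfile
from pathlib import Path

# The six questions from the slide, used here as required keys (an assumption).
REQUIRED_KEYS = ("who", "what", "when", "where", "how", "why")


def missing_keys(record):
    """Return the required questions that the record leaves unanswered."""
    return [key for key in REQUIRED_KEYS if not record.get(key)]


def write_sidecar(data_file, record):
    """Store the record as JSON next to the data file and return the new path."""
    data_file = Path(data_file)
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return sidecar


# Hypothetical record: every value below is made up for illustration.
record = {
    "who": "J. Doe",
    "what": "Repeated measurements on treated samples",
    "when": "2023-06-30T14:00",
    "where": "Building A, room 12",
    "how": "3 repeated measurements per sample",
    "why": "",  # left empty on purpose to show the check
}

print(missing_keys(record))  # ['why']

with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "measurements.csv"
    data_file.write_text("sample,value\n", encoding="utf-8")
    sidecar = write_sidecar(data_file, record)
    print(sidecar.name)  # measurements.csv.meta.json
```

Keeping the record in a plain, open format beside the data means the two are copied, archived and shared together.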

Electronic Lab Notebook

Keep track of your experiments and collaborate with your team easily!

  • Lab notebook for experiments
  • Use templates for your experiments
  • Add steps to your protocols
  • Draw doodles & attach documents
  • Management of schedules and reservations
  • Database for lab equipment, storage, …
  • Todolist
  • Legally timestamp your experiments

Live Demo

Controlled vocabulary

Communities need standards

In essence, a standard is an agreed way of doing something. A standard provides the requirements, specifications, guidelines or characteristics that can be used for the description, interoperability, citation, sharing, publication, or preservation of all kinds of digital objects such as data, code, algorithms, workflows, software, or papers.

Create your own metadata standards

Metadata standards - FAIRsharing

[16]

Metadata standards - MIxS

Open file format

Open File Formats are file formats that are published and freely available for anyone to use. A file format is a standard way of encoding storage of computer information. Open file formats can be contrasted with proprietary, protected file formats. Open file formats are often recommended for preservation purposes because they typically do not require special software to open.

Type          Open                 Closed
Text          txt, odf, rtf        doc, pages
Images        png, jpg, gif, svg   tiff
Spreadsheets  csv, ods             xls
Archives      tar, zip             rar

Not all proprietary formats are closed. For example, Adobe’s .pdf format has become an ISO standard. Anyone can open a PDF file.

Open file format - The game

Personal data

Personal data is “any information relating to an identified or identifiable person”.

  • directly / indirectly
  • from data alone / from metadata / from cross-referencing data
  • patients / agents / customers / …

Take particular care with sensitive data!

Compliance procedures

Tidydata

Artwork by @allison_horst

Tidydata

[18]

Tidydata - Example

[19]
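
A minimal sketch of the tidy data idea from Wickham's paper [18] (each variable is a column, each observation is a row), using only the Python standard library. The table, the column names and the helper `to_tidy` are hypothetical and only illustrate the reshaping step.

```python
# Wide ("messy") table: one column per year, so the variable "year"
# is hidden in the column headers instead of being a column itself.
wide = [
    {"site": "A", "2021": 10, "2022": 12},
    {"site": "B", "2021": 7, "2022": 9},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows so that each output row holds one observation."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy_rows


tidy = to_tidy(wide, id_column="site", variable_name="year", value_name="count")
for row in tidy:
    print(row)
```

In the tidy form, filtering by year or adding a new year does not change the structure of the table, only its number of rows.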

Excel error plague

The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating-point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.

[20]
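
One way to screen a gene list for this problem is to read it as plain text and flag names that look like spreadsheet dates, such as `2-Sep` (converted from SEPT2) or `1-Mar` (converted from MARCH1). The sketch below is only an illustration: the column name `gene`, the sample rows and the regular expression are assumptions, and the pattern covers only the day-month form, not the floating-point conversions.

```python
import csv
import io
import re

# Names such as "2-Sep" or "1-Mar" are typical results of date conversion.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def suspicious_gene_names(csv_text, column="gene"):
    """Return the values of `column` that look like converted dates."""
    # The csv module keeps every cell as a string, so nothing is converted here.
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader if DATE_LIKE.match(row[column].strip())]


# Hypothetical supplementary table, already damaged by a spreadsheet.
example = "gene,score\nTP53,0.9\n2-Sep,0.4\n1-Mar,0.7\nBRCA1,0.2\n"

flagged = suspicious_gene_names(example)
print(flagged)  # ['2-Sep', '1-Mar']
```

Reading and writing such tables with a script, rather than opening them in a spreadsheet with default settings, avoids the conversion in the first place.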

Storing data - Challenges - the 6 V

  • Large and growing Volumes of data
  • Wide Variety of information
  • Velocity in data acquisition frequency
  • Guarantee of the Value and Veracity of the information
  • Take advantage of the Valorization (intellectual, scientific, social, economic, …) of data

Storing data - Recommendations

  1. Plan how the data will be described, structured and organised
  2. Always store data with metadata
  3. As soon as possible, use a persistent identifier
  4. Include the costs of data storage in the funding plan
  5. Identify the person(s) responsible for the data.

Storing data

Organised working environment

  • Structuring folders and files in a tree structure
  • Use clear, coherent and shared naming conventions
  • Explicitly define and track versions of tools and databases
  • Control file-system permissions
  • Ensure file integrity (md5sum)
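
The integrity check in the last point can be sketched in Python with the standard `hashlib` module, as an alternative to the `md5sum` command line tool. The function names and the demo file below are hypothetical. MD5 is enough to detect accidental corruption; if deliberate tampering is a concern, a stronger algorithm such as `sha256` can be passed instead.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected, algorithm="md5"):
    """Compare the current checksum of a file with a previously stored one."""
    return file_checksum(path, algorithm) == expected.lower()


with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "measurements.csv"
    data_file.write_bytes(b"sample,value\nA,1\n")

    stored = file_checksum(data_file)      # record this next to the data
    print(verify(data_file, stored))       # True: the file is unchanged

    data_file.write_bytes(b"sample,value\nA,2\n")
    print(verify(data_file, stored))       # False: the content has changed
```

Storing the checksum when the file is created, and checking it after each transfer or restore, shows whether the copy is still identical to the original.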

Backup

Cybersecurity

Cybersecurity - Passwords

Cybersecurity - Examples


Open Source Software

  • Access to code
  • Right to use, study, change and distribute it


  • Free ≠ Open source
  • Ensures the longevity of software
  • Avoids being captive to a development company

Beyond reproducibility: transparency in research

Explaining to justify and understand

Redo to check, correct and reuse

  • Obliges you to check your work (share data + code)
  • Your future self will thank you
  • And your colleagues too
  • By being reproducible, you strengthen your credibility and reputation
  • Reproducibility fosters confidence in the scientific process

You’re contributing to faster scientific progress

You don’t lose time …

[21]

For that you need to code a little …

What do we need to make research reproducible?

  • Data in some coherent format
  • Programming language (R, Python)
  • Text, figures and code in the same environment (literate programming)
  • Continuous and transparent editions and updates (version control)

Artwork by @allison_horst

Notebook

  • Unify in a single document:
    • Context details
    • Code
    • Computations and results
    • Interpretations
  • Ensures the consistency of analyses and improves traceability
  • Generates an exportable document (e.g. html) for improved portability and readability

[22]

Notebook - RMarkdown

[23]

Notebook - Jupyter Notebook

Tidyverse

Distributed version control (git)

  • Record changes made in set of files
  • Track history and review any changes
  • Go back to earlier versions
  • Collaborative work on parallel features
It works with scripts and code, protocols and documentation, reports, any document!

What is a commit?

[24]

With a visual interface

Git branching

Online remote repositories (e.g. GitLab)

  • Sharing code with others
  • Contribute
  • Online backup server
  • Monitoring the project’s progress (Issues)
  • Run code (CI/CD)
  • Host website (Pages)

Why not keep it all?

  • Data storage has a human, financial and ecological cost.
  • What are the legal obligations for which data?
  • Technological obsolescence (media, formats, documentation)

What should we keep?

Why should I keep it?

For how long?

Where should I keep it?

And how?

The two major use cases and drivers for what to keep are Research Integrity and Reproducibility (availability of the data supporting the findings in research); and the Potential for Reuse (availability of data for sharing with other users).

[25]

Long-term archiving

Preservation of research unit archives

[26]

LPRN 2016 & Décret 2021-1572

LPRN 2016 & Décret 2021-1572

[I.] When a scientific work results from a research activity funded at least half by State grants, (…) its author has, (…) the right to make available free of charge, in an open format and by digital means, subject to the agreement of any co-authors, the final version of the manuscript accepted for publication, (…) once a period running from the date of first publication has expired. This period is at most six months for a publication in the fields of science, technology and medicine (…).

[II.] Where data resulting from a research activity funded at least half by State grants, (…) are not protected by a specific right or a particular regulation and have been made public (…) their reuse is free.

[Art. 1] Scientific integrity is defined as the set of rules and values that must govern research activities in order to guarantee that they are honest and scientifically rigorous.

[Art. 2] Public institutions and foundations recognised as being of public utility promote the dissemination of publications in open access and the provision of the methods and protocols, data and source code associated with research results, in order to guarantee their traceability and reproducibility.

[Art. 6] They ensure that their staff implement data management plans and contribute to the infrastructures that enable the preservation, communication and reuse of data and source code.

[27]

Data repositories

Why use a repository?

  • Submit, share, re-use and archive data with FAIR principles
  • Link metadata
  • Provides a PID
  • Increases the visibility of your research
  • Obligations of funders / publishers

Disciplinary repository

Institutional repository

Recherche Data Gouv - Organization

Recherche Data Gouv - Content

Persistent identifier

  • Permanent identification
  • Identification and referencing
  • Interoperability
  • Aggregating scientific production and improving visibility
  • Distributed by a trusted organisation

For Object

  • Data and papers
  • DOI, ISBN, SWHID, …

For Contributors

  • People and organisations
  • ORCiD, idHAL, PID, …
  • The French MESR will include ORCiD in agents’ records.

Persistent identifier

License

Without a licence, data is not truly open.

  • Allows users to be granted specific rights of use in advance
  • May include restrictions on use
  • It is necessary to use one in all cases to clearly display the associated rights

LPRN Guidelines

DataPaper

HAL

HAL INRAE is the open access repository, visible to everyone, for depositing and consulting scientific output.

  • Help promote open access to scientific and technical information
  • Make INRAE researchers’ results as accessible as possible
  • Increase the visibility of INRAE research

Reuse data

Find Datasets

  • By publications
  • By repositories
  • By DataPaper
  • By social networks
  • By visualization

Forging new collaborations

Citations

  • Essential for linking data to the scientific publications that use them
  • Always cite the datasets used and their version
  • DOI Citation Formatter
  • DataCite
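
Citing a dataset reliably starts with a clean identifier. As a minimal sketch, the helper below strips common prefixes from a DOI string and builds the resolvable `https://doi.org/` link. The function names are hypothetical and the pattern is a simplified check, not the full DOI syntax; no network request is made.

```python
import re

# Simplified shape of a DOI: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def normalize_doi(raw):
    """Return the bare DOI, removing prefixes even when they are repeated."""
    doi = raw.strip()
    changed = True
    while changed:
        changed = False
        for prefix in PREFIXES:
            if doi.lower().startswith(prefix):
                doi = doi[len(prefix):]
                changed = True
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognisable DOI: {raw!r}")
    return doi


def doi_url(raw):
    """Return the persistent link for a DOI given in any common form."""
    return "https://doi.org/" + normalize_doi(raw)


print(doi_url("doi:10.1038/sdata.2016.18"))
# https://doi.org/10.1038/sdata.2016.18
print(normalize_doi("doi:https://doi.org/https://doi.org/10.1787/947717bc-en"))
# 10.1787/947717bc-en
```

Storing the bare DOI and generating the link on demand keeps citations consistent across papers, repositories and data management plans.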

Research Data Lifecycle & DMP

[28]

What is a DMP?

A Data Management Plan (DMP) is a formal document setting out how the data produced during and at the end of a research process or project will be obtained, documented, analysed, disseminated and archived.

It is a tool for managing data throughout the project, built around the notion of the data life cycle.

Data management is not an end in itself, but the means of leading to the discovery of knowledge and innovation through the integration and reuse of the knowledge produced.

[29]

PGD

Plan de Gestion de Données (Data Management Plan)

PGD

Pour Générer du Dialogue (To Generate Dialogue)

DMP - Why, Who and When?

Why?

  • Plan the management of project data (obviously)
  • Describe how the data is obtained
  • Ensure that the data is understandable
  • Clarify the legal and ethical framework
  • Provide appropriate data storage
  • Define everyone’s responsibilities

Who?

  • Project Coordination Team (and associated members)

When?

  • Generally three releases (6 months, mid-project, end of project)

DMP - How?

  • Many templates
    • ANR, INRAE, European Research Council, …
  • Many tools
    • OPIDoR, DSW, ARGOS, Word/Nextcloud, …
    • Machine-actionable DMP
    • Comments & Guidance

DMP - Project / Structure

Project DMP

  • Defined in terms of the duration and scope of the project

Structure DMP

  • Defined for the scope of the structure to harmonise and document practices, in a more modular way

INRAE Project template

  1. Information concerning the management plan
  2. Information on the research project
  3. Brief presentation of project data
  4. Description and organisation of data
  5. Intellectual property rights
  6. Data Sensitivity
  7. Data storage and backup during the project
  8. Access and sharing of data at the end of the project
  9. Data archiving and conservation after the end of the project

1. Information concerning the management plan

  • Author of the DMP
  • Affiliation of the author of the DMP
  • Date of creation of DMP
  • Current version: (n°, date)

2. Information on the research project

  • Identifier of the call for proposal
  • Project funder(s)
  • Name of research programme
  • Reference of funding agreement
  • Project acronym
  • Name of research project
  • Project leader institution, coordinator & beneficiary (name, country)
  • Other partners
  • Unit to which project leader belongs
  • Project dates and duration

3. Brief presentation of project data

  • Type, scope, scale
  • Origin
  • Associated publications

4. Description and organisation of data

  • What methods and tools are used to acquire and process data?
  • Documentation associated with the data
  • What types of metadata will be produced to accompany the data?
  • What standards or taxonomies will be used to describe the data?
  • How will the metadata be produced?
  • How will the data files be managed and organised during the project: control of versions, conventions for naming files, organisation of files, …
  • What is the quality control procedure of the data?
  • Enclose the quality assurance plan if possible

5. Intellectual property rights

  • Who owns the rights on data and other information created during the project?
  • Will material protected by specific rights be used during the project? In this case, who will deal with the formalities required, obtain the authorisations for use and possible dissemination?

6. Data Sensitivity

  • Identification of the data sensitivity level
  • What are the measures taken and the norms that must be met to guarantee the security of sensitive data?
  • If there is personal data, what measures are envisaged to protect it during the project or in the context of re-use?

7. Data storage and backup during the project

  • Have the information systems used been subjected to a risk analysis or certification?
  • What types of physical media are used to store data during the project?
  • What security measures are in place during the data transfer stages of the project?
  • What is the estimated amount of data?
  • Where will the data be located geographically?
  • Does the entity physically hosting the data have a security policy for its information system and security assurance plan?
  • Security – Confidentiality: will the data be exchanged or shared with third parties?
  • How are rights of access to data determined during the research project?
  • Security – Integrity – Traceability: what measures of protection will be taken to monitor data production and analysis during the project?

8. Access and sharing of data at the end of the project

  • Is there an obligation to share data (or, on the contrary, a prohibition or restriction)?
  • What data will be shared at the end of the project? If all the data are not available in the same way, or at the same time, please specify
  • What are the potential reuses for these data?
  • Does reading the data require specific software or tools? If so, which ones?
  • How will the data be shared?
  • With whom? With what licence?
  • From when?
  • For how long?
  • Will the data be identified by a permanent identifier (DOI or other)?
  • Which organisation will be responsible for requesting the identifier in the case of multi-partner projects?

9. Data archiving and conservation after the end of the project

  • What data will be conserved in the medium and long term and what data will be destroyed?
  • On what permanent archive platform will the data that are to be conserved long-term be archived?
  • What procedures will be set up for long-term conservation?
  • What is the duration of data conservation?
  • Who will be responsible for long-term conservation?
  • Name an individual contact
  • What will be the volume of these data?
  • What funding guarantees will cover the costs of long-term conservation?

Is it all good?

Ready to go?

Check-list

To read

1. Hey T, Tansley S, Tolle K, Gray J. The fourth paradigm: Data-intensive scientific discovery. Microsoft Research; 2009. https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/.
2. Schleder GR, Padilha ACM, Acosta CM, Costa M, Fazzio A. From DFT to machine learning: Recent approaches to materials science–a review. Journal of Physics: Materials. 2019;2:032001. doi:10.1088/2515-7639/ab084b.
3. Murphy DJ. Using modern plant breeding to improve the nutritional and technological qualities of oil crops. OCL. 2014;21:D607. doi:10.1051/ocl/2014038.
4. Michener WK, Brunt JW, Helly JJ, Kirchner TB, Stafford SG. Nongeospatial metadata for the ecological sciences. Ecological Applications. 1997;7:330–42. doi:10.1890/1051-0761(1997)007[0330:nmftes]2.0.co;2.
5. Gibney E, Van Noorden R. Scientists losing data at a rapid rate. Nature. 2013. doi:10.1038/nature.2013.14416.
6. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533:452–4. doi:10.1038/533452a.
7. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349:aac4716. doi:10.1126/science.aac4716.
8. Collberg C, Proebsting TA. Repeatability in computer systems research. Communications of the ACM. 2016;59:62–9. doi:10.1145/2812803.
9. Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, Percie du Sert N, et al. A manifesto for reproducible science. Nature Human Behaviour. 2017;1. doi:10.1038/s41562-016-0021.
10. UNESCO recommendation on open science. UNESCO; 2021. doi:10.54677/mnmh8546.
11. The Turing Way Community. The Turing Way: A handbook for reproducible, ethical and collaborative research. 2022. doi:10.5281/zenodo.3233853.
12. Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data. 2016;3. doi:10.1038/sdata.2016.18.
13. European Commission and Directorate-General for Research and Innovation. Cost-benefit analysis for FAIR research data : Cost of not having FAIR research data. Publications Office; 2019.
14. OECD. Enhanced access to publicly funded data for science, technology and innovation. 2020. doi:10.1787/947717bc-en.
15. Réseau Qualinous, Batifol V, Burnel L, Cardona A, Johany F. Affiche "Cycle de vie des données : un outil pour améliorer la gestion, la mise en qualité et l’ouverture des données". 2021. doi:10.15454/hsc3-b796.
16. Lister A, Sansone S-A. FAIRsharing in a nutshell. 2023. doi:10.5281/zenodo.7737367.
17. Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology. 2011;29:415–20. doi:10.1038/nbt.1823.
18. Wickham H. Tidy data. Journal of Statistical Software. 2014;59:1–23. doi:10.18637/jss.v059.i10.
19. Hart EM, Barmby P, LeBauer D, Michonneau F, Mount S, Mulrooney P, et al. Ten Simple Rules for Digital Data Storage. PLOS Computational Biology. 2016;12:e1005097. doi:10.1371/journal.pcbi.1005097.
20. Ziemann M, Eren Y, El-Osta A. Gene name errors are widespread in the scientific literature. Genome Biology. 2016;17. doi:10.1186/s13059-016-1044-7.
21. Quintana D. Five things about open and reproducible science that every early career researcher should know. Open Science Framework. 2022. doi:10.17605/OSF.IO/DZTVQ.
22. Xie Y, Allaire JJ, Grolemund G. R Markdown: The definitive guide. Boca Raton, Florida: Chapman and Hall/CRC; 2018. https://bookdown.org/yihui/rmarkdown.
23. Russo F, Righelli D, Angelini C. Advantages and limits in the adoption of reproducible research and R-tools for the analysis of omic data. Cham: Springer International Publishing; 2016.
24. Wickham H, Bryan J. R packages. 2nd edition. O’Reilly Media; 2023. https://r-pkgs.org/.
25. Beagrie N. What to keep: A jisc research data study. Jisc; 2019. https://repository.jisc.ac.uk/id/eprint/7262.
26. Deuxième Plan national pour la science ouverte. Ministère de l’Enseignement supérieur, de la Recherche et de l’Innovation; 2021. https://www.ouvrirlascience.fr/wp-content/uploads/2021/06/Deuxieme-Plan-National-Science-Ouverte_2021-2024.pdf.
27. Olivier P, Rennes S, Szabo D, Martel A-S. Ouverture des données : … aussi ouvert que possible ... aussi fermé que nécessaire. 2022. doi:10.17180/991x-t610.
28. CIRAD-DGDRS-DIST-FRA, editor. Le cycle de vie des données. Intégrer la gestion de données scientifiques aux activités de recherche. 2017. https://agritrop.cirad.fr/594579/.
29. Reymonet N, Moysan M, Cartier A, Délémontez R. Réaliser un plan de gestion de données "FAIR" : modèle. 2018. https://archivesic.ccsd.cnrs.fr/sic_01690547.
30. Sébire F. Check-list de l’Institut Pasteur pour des bonnes pratiques de gestion des données de recherche. 2023. https://hal.science/hal-04123336.
31. Arnould P-Y, Jacquemot-Perbal M-C. Guide de bonnes pratiques. Gestion et valorisation des données de la recherche. Research Report. OTELo ; INIST-CNRS; 2016. https://hal.science/hal-01275841.