Introduction to Open Science

These materials are based on the Mentoring365 circles held in November 2023 and March 2024 through AGU’s Mentoring365 program.

The content here will provide an introduction to the basics of Open Science and its benefits for individual researchers, particularly early-career researchers and students. We will share tips and skills enabling researchers to immediately make their digital presence, data, and software more transparent, reproducible, and reusable. Participants will learn how to manage their digital presence, get started with data and software, and comply with our AGU Publishing policy on sharing and citing data and software.

Circle Leads:
Kristina Vrouwenvelder (AGU Open Science Leadership); Brian Sedora (AGU Publications/Open Science Leadership); and Sophie Hanson (AGU Publications)

Circle Agenda:

Week 1: Creating your Digital Presence

Week 2: Getting Started with Data in Your Research

Week 3: Getting Started with Software in Your Research

Week 4: Get Credit for Your Work: Sharing Your Data and Software Alongside Your Publications

Week 1: Why Open Science?; Managing Your Digital Presence

Why Open Science?

To get us oriented for this Circle, we want to introduce the concept of Open Science and AGU’s commitment to upholding these principles. Open Science seeks to broaden participation, increase access to scientific research, and overall, make science more inclusive. The free exchange of scientific data and information is necessary to accomplish these goals. AGU’s Position Statement on Data asserts that:

“Earth and space science data are a world heritage, and an essential part of the science ecosystem. All players in the science ecosystem—researchers, repositories, publishers, funders, institutions, etc.—should work to ensure that relevant scientific evidence is processed, shared, and used ethically, and is available, preserved, documented, and fairly credited.”

For further information, feel free to read more about AGU’s Position Statement on Free and Open Science.

Why is Your Digital Presence Important?

This week we will be discussing how to create and manage your digital presence. Your digital fingerprint is how the global scientific community finds, perceives, and interacts with your research and work. While anyone on the internet can find you, this is especially important for other researchers, funders, societies, associations, and potential collaborators both inside and outside of academia. Ensuring that you and your research are discoverable makes it more likely that your work will be cited and provides more opportunities for potential collaborators to connect with you.


Activity: To get you thinking more about your digital presence, try Googling yourself! What do you find? What matches your professional profile? Is there anything that does not match that profile? Can other members of the scientific community easily find your research products (papers, datasets, software uploads, etc.) and connect them to you?


Using an ORCID to Curate Your Digital Presence

Managing your digital presence is vital to your research being seen. One powerful way to distinguish yourself online is to create and maintain an ORCID. An ORCID is a unique persistent digital identifier that links your professional information to your research and research products.


Activity: Create or Update Your ORCID!

Go to https://orcid.org and select “For Researchers.” Your digital ID - your ORCID - can be included on everything you do: papers, datasets, presentations, posters, software uploads. Anything and everything you can think of related to your professional presence. This is an incredibly helpful resource for increasing the discoverability of your work and ensuring you are credited when others use your work. Check out this Digital Presence Checklist and YouTube tutorial (slides for reference linked here) for more information about establishing an ORCID and connecting your research work using your ORCID.

If you already have an ORCID, this would be a good opportunity to optimize your usage by turning on Auto-Updates and verifying that your information and research products are up to date. These updates come from two trusted DOI registration agencies: Crossref (for published manuscripts) and DataCite (for datasets and software). See this blog post for further details about how to set up automatic updates.
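
Because your ORCID will be typed or pasted into many profiles and submission systems, it can be worth checking that it was entered correctly. The sketch below assumes the ISO 7064 MOD 11-2 check-character scheme that ORCID documents for its 16-character identifiers. The function names are our own, and the sample identifier is the widely used example ID for a fictitious researcher, so swap in your own.

```python
def orcid_check_character(base_digits: str) -> str:
    """Return the check character for the first 15 digits of an ORCID."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as the letter X.
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, the digits, and the final check character."""
    compact = orcid.replace("-", "").upper()
    base = compact[:15]
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == compact[15]


if __name__ == "__main__":
    print(is_valid_orcid("0000-0002-1825-0097"))
```

A check like this only catches typing mistakes; it does not confirm that the identifier belongs to you, so the profile page on orcid.org remains the place to verify that.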


Week 2: Getting Started with Data

Last week, you all learned about practicing open science and why it’s important and had a chance to build your own digital profiles through an ORCID.

This week, we’ll focus on your data, why sharing is important, and some best practices for sharing!

Why share your data?

  • Papers that link to their data in a repository can receive up to 25% more citations!
  • Sharing your data makes your work more reproducible and transparent.
  • Your data is valuable – and could enable scientific discovery beyond your own work! (One AGU Publications example: we’ve seen papers published in 2016 citing a dataset from the 1980s!)

For more reading, check out: Colavizza et al. (2020), PLoS ONE 15(4): e0230416; McKiernan et al. (2016), eLife 5: e16800.

How to share your data:

  • Try to include all the data someone would need to reanalyze your work! This means you should include not just the raw data – but information about any data processing you did, your experimental method, and other descriptions needed to understand your data.
  • Choose a good place to store your data. Ideally, your data storage location should be easily findable and accessible and permanent, so your data can be part of the scientific record.
    • Domain repositories specific to your scientific field are easily findable by your colleagues and often tell you exactly what kind of data description you need to enable reuse of your data.
    • Generalist repositories can also be good solutions, but make sure you include detailed descriptions of your data.
  • Cite your data in your research publication – so that your analysis and your data are linked AND you get credit for sharing your data! This means it’s best to store your data in a place where it can receive a DOI, just like your papers. (We’ll talk more about data citation in your publications in week 4!)
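
One way to keep those descriptions with your data is a small machine-readable description file saved next to the data files. The sketch below is only an illustration: the field names, the example values, and the file name dataset_description.json are all assumptions, and your chosen repository may ask for a different metadata format.

```python
import json
import tempfile
from pathlib import Path


def build_description(title, raw_files, processing_steps, method, license_name):
    """Collect the context someone would need to reanalyze the data."""
    return {
        "title": title,
        "raw_files": list(raw_files),
        "processing_steps": list(processing_steps),
        "method": method,
        "license": license_name,
    }


def write_description(description, folder):
    """Save the description as JSON inside `folder` and return the file path."""
    path = Path(folder) / "dataset_description.json"
    path.write_text(json.dumps(description, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    description = build_description(
        title="Example stream temperature record",  # placeholder values
        raw_files=["site_a_2023.csv"],
        processing_steps=["removed sensor spikes", "averaged to hourly values"],
        method="Logger deployed at a fixed depth; see the paper's Methods.",
        license_name="CC BY 4.0",
    )
    # A temporary folder keeps the demo away from real data directories.
    with tempfile.TemporaryDirectory() as folder:
        print(write_description(description, folder).read_text(encoding="utf-8"))
```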

For more tips on best practices on sharing your data, check out the following resources:

Postcard: Cite and Manage Your Data

Good Data Practices - Dryad

Checklist for Managing your Digital Objects

List of Useful Domain Repositories by AGU Journal

Data and Software Sharing Guidance for AGU Authors

What kind of data do you use? Does your funder or publisher require you to share your data? Start planning early for data sharing!

Week 3: Getting Started with Software

The Basics

What do we mean by software? This could include…

  • software or code that you used for analysis and visualization of the data
  • software or code used to produce a model output
  • software or code that someone else created and you used in your research

Which software should I be sharing?

Of course, not all software can be shared alongside your paper. If you’re using proprietary programs for analysis, you likely won’t be able to share them, but you should mention and include a link to these proprietary programs in your Methods section or Availability Statement. However, if your work depends on scripting in Python, R, or another scientific programming language, and/or creates or builds off an already existing model or analysis package, you should…

  • If it’s your code: publish it in a repository like Zenodo so it can be used by others
  • If it’s someone else’s code: cite their code, including the exact version!
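
To cite the exact version of someone else’s code, you first need to know which version you ran. If you work in Python, a minimal sketch like the one below can record that using the standard library’s importlib.metadata module. The function name report_versions and the package names in the demo are placeholders for whatever your analysis actually imports.

```python
import platform
from importlib import metadata


def report_versions(package_names):
    """Map each package name to its installed version, or None if absent."""
    versions = {"python": platform.python_version()}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Replace these names with the packages your own analysis depends on.
    for name, version in report_versions(["numpy", "scipy"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Saving this output alongside your results gives you the version numbers to put in your citations later.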

Developing and Documenting Your Code

A lot has been written about good practices for developing software and code, and we won’t repeat them all here. However, there are some resources we’d like to share that can help make sure your code and software are ready to share with others:

  • Version control: When developing code, you’ll quickly find out that it’s an iterative process – sometimes you’ll break something and need to roll back your changes, or maybe you’ll add on something for one project that’s not relevant for another project. In these situations, and in particular if you’re working on code collaboratively with your team, you’ll find it very useful to employ a system for tracking changes in your code. Many researchers use GitHub as a platform for this; other options include GitLab and Open Science Framework (OSF). We recommend the Software Carpentry lesson on getting started with Git for an introduction to version control.
  • Documentation: Writing code that works is one thing – but you’ll quickly find that to share your work effectively with others, and even to go back to your own old code and understand it, you’ll need to add textual documentation. This often takes the form of headers, in-text comments, and README files that explain important information about how your code works. Conventions for documenting code can depend on the programming language, but a good rule of thumb is to imagine you’re explaining how your code works to a colleague. Get started with more resources in this guide.

NOTE: if you’re working to develop scientific software and analysis packages for wide reuse by others, stay tuned for a second post this week on advanced practices!

  • Licensing: A license allows you to determine what others can do with your own work, whether they’re reusing it or just sharing it. To encourage reuse, use as open a license as possible! Many scientists releasing software and code use an MIT License, which is compliant with Open Source Software principles and encourages reuse. You can also use a tool like https://choosealicense.com/ to find a license that suits your needs.
  • Sharing: Documenting your code explains the technicalities of your programming, but to share your code with other scientists, you may need to include other information alongside your code, like scenarios when your model can be applied or conditions for initial parameters. Once you’re ready to publish your code – likely because you’re ready to publish the scientific article that code was for – we recommend sharing your code in a repository such as Zenodo. If you’re using GitHub for code development, you can obtain a DOI (persistent identifier) for your repository directly through Zenodo. Check out this tutorial for more information. Make sure to include a CITATION.cff file in your repository so others know how to cite your work!
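
A CITATION.cff file is a short plain-text file in YAML format. The sketch below builds a minimal one using keys from version 1.2.0 of the Citation File Format as we understand it (cff-version, message, title, authors, version, date-released). The helper name and all values are placeholders, so check the result against the Citation File Format documentation before relying on it.

```python
def build_citation_cff(title, authors, version, date_released):
    """Return the text of a minimal CITATION.cff file.

    `authors` is a list of (family_name, given_names) tuples.
    Values are wrapped in double quotes so characters such as ':' stay
    valid YAML; this sketch assumes no value itself contains a double quote.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        "authors:",
    ]
    for family, given in authors:
        lines.append(f'  - family-names: "{family}"')
        lines.append(f'    given-names: "{given}"')
    lines.append(f'version: "{version}"')
    lines.append(f"date-released: {date_released}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder metadata; replace with your own before saving the file.
    print(build_citation_cff("Example Model", [("Doe", "Jane")], "1.0.0", "2024-01-01"))
```

Save the returned text as CITATION.cff in the top folder of your repository.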

For more tips on sharing your software, check out the following resources:

AGU Resources:

Guidance for AGU Authors - R Scripts and Markdown

Guidance for AGU Authors - Jupyter Notebooks

More Advanced Practices for Software

So now you’ve built an analysis package or a model and you want to make it as reusable and useful to other scientists as possible. First off, good for you! Open source, community-developed and shared software underlies many major scientific projects and discoveries. As just one example, the community-owned, open source library NumPy for Python underpins packages like eht-imaging, used by the Event Horizon Telescope Collaboration in creating the first-ever image of a black hole.

Do you use any community-developed software or analysis packages in your work? Make sure you cite them in your publications and – if you can – contribute to their ongoing maintenance by tracking and fixing bugs or even suggesting new features!

Advanced Software Sharing and Reuse

  • Version control: In the preceding lesson, we gave tips for getting started with version control, a vital element of developing robust software. If you’re aiming to maximize reuse of your software, good version control practices are even more important. Start by making sure everyone on your team – and those who might reuse your package – understands how they can contribute to the software; you can define this in the README file for your project. Common ways to contribute include tracking project issues or bugs and submitting, discussing, and approving pull requests that propose changes, whether to code or to documentation.

NOTE: When working collaboratively in GitHub or other version control platforms, you’ll want to make sure team members and external users are working on forks of the published repository, which contains the current version of the package. This ensures there’s always a centralized, working copy of the software package to share. Within each fork, it’s best practice for project members to use branches to add specific features or fix bugs. When you’re ready to merge your branch back into main, you’ll use pull requests to review and approve changes.

  • Documentation: You’ll want to step up your documentation practices when developing and sharing scientific software packages for reuse. Best practices for your code itself still apply: use meaningful variable names, structures, headers, and comments throughout. Then, you’ll want to consider a few advanced strategies for documentation. Many groups build independent web platforms with detailed information about different elements of their code and tips on reuse, including test cases for new users to try (some great examples include the numerical modelling package Landlab and the solar data analysis package SunPy). For advanced documentation, you’ll generally include Markdown documentation files and docstrings (documentation strings, or specially formatted comments and headers in your code) in your GitHub repository alongside your code. Then, you’ll choose a tool to build your documentation automatically from your code and a tool to host your documentation externally (e.g., on a public website).

NOTE: Documentation and docstring formatting and tools depend on the language you’re using. If you’re working in Python, one common method is to use Sphinx to generate documentation and Read the Docs to build and host it online.
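
To make the idea of a docstring concrete, here is a small sketch in Python. It uses the NumPy docstring layout, which is one of several conventions Sphinx can render (through its napoleon extension); the module, constant, and function are hypothetical examples rather than part of any real package.

```python
"""Unit conversions for temperature records (hypothetical example module)."""

ZERO_CELSIUS_IN_KELVIN = 273.15


def celsius_to_kelvin(temperature_c):
    """Convert a temperature from degrees Celsius to kelvin.

    Parameters
    ----------
    temperature_c : float
        Temperature in degrees Celsius.

    Returns
    -------
    float
        Temperature in kelvin.

    Raises
    ------
    ValueError
        If the input is below absolute zero.
    """
    if temperature_c < -ZERO_CELSIUS_IN_KELVIN:
        raise ValueError("temperature is below absolute zero")
    # Shift the zero point; one degree is the same size on both scales.
    return temperature_c + ZERO_CELSIUS_IN_KELVIN
```

The same text serves two audiences: a colleague reading the source file, and a documentation tool that turns the docstring into a web page.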

  • Tutorials and testing: Sharing tutorials and test suites alongside your code encourages reuse in a number of ways.

    • Test suites are collections of functions that test your package to make sure everything is still functioning as intended, even after you or someone else makes major changes to the code to fix bugs or add features. For example, a test suite might test whether your package’s results match a known value from an analytical calculation.
    • Tutorials serve as demonstrations for other users to help them understand how to run your code; they’re often deployed in the form of interactive Python or R Markdown notebooks and accompanied by test datasets. You can also publish and share your tutorial notebooks using a tool like BinderHub!

NOTE: For an example of analysis published in a Jupyter notebook, check out this AGU paper and associated notebook.
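
As an illustration of the test suites described above, the sketch below checks a numerical result against a known analytical value. The integration function stands in for your own package code, and the tests are written as plain functions with assert statements, a style that test runners such as pytest can discover. All names here are hypothetical.

```python
import math


def trapezoid_integral(func, start, stop, steps=1000):
    """Approximate the integral of func from start to stop (trapezoid rule)."""
    width = (stop - start) / steps
    total = 0.5 * (func(start) + func(stop))
    for i in range(1, steps):
        total += func(start + i * width)
    return total * width


def test_matches_analytical_value():
    # The integral of sin(x) from 0 to pi is exactly 2.
    result = trapezoid_integral(math.sin, 0.0, math.pi)
    assert math.isclose(result, 2.0, rel_tol=1e-5)


def test_zero_width_interval_is_zero():
    assert trapezoid_integral(math.sin, 1.0, 1.0) == 0.0
```

If a later change to trapezoid_integral breaks the calculation, the first test fails and flags the problem before anyone reuses the code.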

  • Software Packaging: New users can find it challenging to use your code if their environment, libraries, and other background elements don’t align with those needed to run your software. Packaging your software allows you to publish a distribution of your work alongside needed libraries and provide cross-platform support and environment management. Conda is one language-agnostic, commonly used tool for package and environment management for scientific software.

This is an advanced topic and we can’t cover everything there is to know about scientific software development. If you’re interested in this route, make sure to check if your institution, library, or professional community has any guidance and community standards that can help you learn best practices. Research software engineers in your field or communities in the Earth and space sciences working to publish scientific packages (e.g., CSDMS, astropy, CIG, and many more) can offer expertise, resources, and community support. Finally, we’ll share a few resources for further learning below!

More Resources:

General:

Better Scientific Software

Python code packaging for scientific software

Scientific Computing in Practice

Good Documentation:

General: Documentation – Better Scientific Software

Python-specific: Documentation

R-specific: Documentation manual

Software Packaging:

Python Packaging User Guide

Week 4: Getting Credit for your Data and Software: AGU Journals Data and Software Guidance for Authors

Over the past two weeks, we discussed sharing your data and software and why it’s important. This week, we’ll focus on getting credit for your data and software in your paper. We’ll be using AGU’s requirements for data and software sharing, but similar requirements apply at many publishers, and thinking about them early in the writing and submission process will help you plan how to share your data and software.

Did you know? Citing your data and software gives you similar benefits to citing your published papers and allows you to get credit for reuse of your data and software.

AGU’s Policy on Data and Software

AGU requires that the underlying data and/or software needed to understand, evaluate, and build upon the reported research be available at the time of peer review and publication. Additionally, authors should make available software that has a significant impact on the research.

This is achieved by following three requirements:

  • Depositing the data and software in a community accepted, trusted repository, as appropriate, and preferably with a DOI

  • Including an Availability Statement as a separate paragraph in the Open Research section explaining to the reader where and how to access the data and software

  • Including citation(s) to the deposited data and software in the References section

In the next activity, you’ll craft a sample Data Availability Statement for your own dataset or a paper in your field using AGU guidance.

Writing a Data Availability Statement

An Availability Statement, located in the Open Research section of a journal article, or at the end of a book chapter, contains information about your data, software, and other research objects (e.g. notebook) and how readers can access these. A good Data Availability Statement contains the following elements:

  • A brief description of the type(s) of data or software
  • Repository Name(s) where they are deposited
  • DOI (Persistent Identifier) [required]; or, if no DOI is available, Link to Data or Software
  • In-text citation in References [required for all data and software with DOIs]
  • For Software: Version and Link to publicly accessible development platform (e.g., GitHub)
  • Access Conditions (e.g. if Registration is Required)
  • Licensing/Permissions (e.g. Creative Commons Attribution)

Example Availability Statements:

Data:

The [type of data] data used for [brief context, description] in the study are available at [repository, source name] via [DOI, persistent identifier link, OR URL if no persistent identifier is available] with [license, access conditions] [in-text citation in References, required for each DOI]

Software:

[Version number] of the [software name] used for [brief context, description of what the software was used for] is preserved at [DOI, persistent identifier link, OR URL if no persistent identifier is available], available via [license type, access conditions] and developed openly at [software development platform link]. [in-text citation in References, required for each DOI]
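
If you prepare statements for several datasets, the data template above can be filled in programmatically. The sketch below is one hedged way to do that in Python: the function name, field names, and every example value (including the DOI) are placeholders, and the wording should still be checked against AGU guidance.

```python
DATA_TEMPLATE = (
    "The {data_type} data used for {context} in the study are available at "
    "{repository} via {identifier} with {access} {citation}."
)


def data_availability_statement(data_type, context, repository, identifier,
                                access, citation):
    """Fill the data template; raise ValueError if any field is left empty."""
    fields = {
        "data_type": data_type,
        "context": context,
        "repository": repository,
        "identifier": identifier,
        "access": access,
        "citation": citation,
    }
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return DATA_TEMPLATE.format(**fields)


if __name__ == "__main__":
    # Every value below is a placeholder, not a real dataset.
    print(data_availability_statement(
        data_type="stream gauge",
        context="calibrating the runoff model",
        repository="Example Repository",
        identifier="https://doi.org/10.1234/example",
        access="a CC BY 4.0 license and no registration required",
        citation="(Author, 2024)",
    ))
```

Raising an error on empty fields is a deliberate choice here: a statement with a missing repository or identifier is the most common way these paragraphs fall short.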


Activity: Availability Statements!
Write a sample Data Availability Statement for your own dataset or a paper in your field (using AGU guidance), or if you have code you plan to share, write a Software Availability Statement!


Example Data and Software Citations:

Data:

Edmunds, P. J., Didden, C., & Frank, K. (2021). Mean percentage cover of corals and Porites astreoides at each site by year at St. John, VI from 1992 to 2019 (Version 1) [Dataset]. Biological and Chemical Oceanography Data Management Office (BCO-DMO). https://doi.org/10.26008/1912/BCO-DMO.843284.1

Software:

Shobe, C. (2023). Code and data for “The uncertain future of mountaintop-removal-mined landscapes 1: How mining changes erosion processes and variables” (v1.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.10059514
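
Both example citations follow the same pattern: authors, year, title, version, object type in square brackets, repository, and DOI link. The sketch below assembles that pattern in Python and reproduces the dataset example above. The function name is our own, and restricting the type to "Dataset" or "Software" is an assumption based only on these two examples.

```python
ALLOWED_TYPES = ("Dataset", "Software")  # assumption drawn from the examples


def format_citation(authors, year, title, version, object_type, repository, doi):
    """Assemble a data or software citation in the pattern shown above."""
    if object_type not in ALLOWED_TYPES:
        raise ValueError(f"object_type must be one of {ALLOWED_TYPES}")
    return (
        f"{authors} ({year}). {title} ({version}) [{object_type}]. "
        f"{repository}. https://doi.org/{doi}"
    )


if __name__ == "__main__":
    print(format_citation(
        authors="Edmunds, P. J., Didden, C., & Frank, K.",
        year=2021,
        title=("Mean percentage cover of corals and Porites astreoides at "
               "each site by year at St. John, VI from 1992 to 2019"),
        version="Version 1",
        object_type="Dataset",
        repository=("Biological and Chemical Oceanography Data Management "
                    "Office (BCO-DMO)"),
        doi="10.26008/1912/BCO-DMO.843284.1",
    ))
```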

Comments, Suggestions and Contact:

Thanks for visiting our Open Science course!

If you have comments, suggestions, or questions, email Kristina Vrouwenvelder.

Cite these materials:

Vrouwenvelder, K., Hanson, S., & Sedora, B. (2024, April 10). Lesson Materials: Introduction to Open Science. Zenodo. https://doi.org/10.5281/zenodo.10957625