Preserving very large data is a challenge. Spoiler: there are no easy answers!

When it comes to large datasets, authors and editors often ask us how they should preserve the data. These questions come via datahelp@agu.org and our data and software guidance discussions. Spoiler: there are no easy answers yet! Here we share our experience, the current limitations, and the approaches we recommend given what is possible right now.

AGU requires that primary and processed data used for your research be preserved and made available. This can range from observational data to the data used to generate your figures. The raw data may be needed, but usually it is the processed or refined data that should be preserved: the data that support the described results and allow readers to assess your conclusions and build on your work.

For large data, over 1 terabyte (TB), authors run into the challenge of finding a suitable repository. Many repositories limit file sizes and also charge for deposits over certain limits. This generalist repository comparison chart provides an overview of the limitations. Discipline-specific and institutional repositories are often a place to turn to for help with preserving large data, but they also have limitations and potential costs. This underscores the importance of avoiding surprises at the time of publication by:

planning ahead (e.g. type of data, size of files),
finding a suitable repository (e.g. discipline, institution, general),
knowing the limitations of the repository (e.g. upload size), and
determining costs ahead of time (e.g. size, years of preservation).
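
As a starting point for the planning steps above, the size of a dataset can be measured before contacting a repository. The following is a minimal sketch, not an AGU tool: the function name `summarize_dataset` is hypothetical, and it assumes a decimal terabyte (10^12 bytes) for the 1 TB threshold, whereas a given repository may count in binary units or set a different limit.

```python
from pathlib import Path

# Assumption: decimal terabyte. Some repositories count in TiB (2**40 bytes).
TB = 10**12


def summarize_dataset(root, threshold_bytes=TB):
    """Walk a directory tree and report its total size and largest file."""
    total = 0
    count = 0
    largest_path, largest_size = None, 0
    for path in Path(root).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            total += size
            count += 1
            if size > largest_size:
                largest_path, largest_size = path, size
    return {
        "file_count": count,
        "total_bytes": total,
        "largest_file": str(largest_path) if largest_path else None,
        "largest_bytes": largest_size,
        # True when the whole dataset is larger than the chosen threshold.
        "exceeds_threshold": total > threshold_bytes,
    }
```

Calling `summarize_dataset("path/to/data")` (a placeholder path) gives the total size to compare against a repository's deposit limit and the largest single file to compare against its per-file upload limit.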

In other cases, the data is too large and complex to move and deposit, for instance simulation data from models and associated workflows running on computing clusters. The discussion then turns to what data should be preserved. In AGU’s Data and Software Guidance for Authors we outline the decisions an author needs to make (see Models and Simulations), and we go into further detail in our journal-specific guidance. There is also a group working on a Rubric for Models and Model Data: Best Practices for Preservation and Replicability.

Some research computing facilities and institutions provide sharing and access mechanisms for such scenarios, but there are questions about their long-term persistence and preservation capabilities.

We encourage authors to provide as much access and contextual information as they can while also addressing what can be preserved.

Recommendations for Preserving Large Data:

Attempt to identify a trusted, community-accepted preservation repository (i.e., discipline-specific, institutional, or generalist).
- Contact the repository to plan the deposit and establish a budget. This option supports automated attribution and credit and is preferred.
If no preservation repository is adequate, then identify the most persistent location possible.
- Ensure the data will be managed and made available for at least 5 years to support the related research publication. Check with your funder and institution for additional time requirements.
- Prefer a sharing platform managed by an institution or entity with long-term funding over a website set up for a specific grant effort that will not have long-term management.
- The platform needs to support confidential access for the paper peer-review process.
- Contact the selected platform to plan the deposit and establish a budget.
- Include with your data the necessary documentation (e.g. README), licensing (e.g. CC-BY), recommended citation, and other information that will be helpful for understanding it.
- This option allows your data to be cited using a URL, but will not result in automated attribution or credit.
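
The documentation item in the list above can be checked before the data is shared. The sketch below is illustrative only: the function name `missing_companion_files` and the file names in `EXPECTED` are assumptions based on common conventions, not AGU or platform requirements, and it only looks at the top level of the dataset directory.

```python
from pathlib import Path

# Hypothetical file-name conventions; adjust to what your platform expects.
EXPECTED = {
    "documentation": ("README", "README.md", "README.txt"),
    "license": ("LICENSE", "LICENSE.md", "LICENSE.txt"),
    "citation": ("CITATION", "CITATION.cff", "CITATION.txt"),
}


def missing_companion_files(root):
    """Return the categories for which no expected file exists in root."""
    present = {p.name for p in Path(root).iterdir() if p.is_file()}
    return [
        kind
        for kind, names in EXPECTED.items()
        if not any(name in present for name in names)
    ]
```

An empty list means each category has at least one matching file; it does not say whether the contents are adequate, which still needs a human read.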

Possible institutional sharing platforms for very large datasets include FTP sites, Dropbox, Box, and OneDrive. It is important to repeat that sharing your data on these platforms is only allowed when no preservation repository can be used. The author is expected to plan and budget for preserving their data and to make every attempt to identify a preservation repository.

AGU requires that data supporting your research be made available and cited in your paper. We are available to provide additional guidance through datahelp@agu.org.

We welcome your feedback and suggestions; please open an issue here.