By Morgan Littlewood.
Much has been written about the hunger for data storage in university research, whether in the humanities, arts, social sciences, genomic studies, astronomy, or seismic research. An increasingly popular option has been open-source storage – or “Open Storage” – which is software that is developed in a public, collaborative manner under a license that permits the free use, distribution, and modification of the source code. Open Storage is also used as a descriptor for commercially supported versions that use open-source software as the fundamental underlying technology. A handful of storage system vendors add enhanced functionality in order to offer enterprise capabilities, providing a compelling option for many organizations, including world-class research universities.
Why Not the Cloud?
Most cloud storage providers offer universities a stable and secure computing environment in which to build and run applications without having to manage a data center, invest in hardware, or install and update software. While many cloud providers offer similar storage services, they differ in how they price and deliver them. Cloud pricing structures, however, rarely fit university budgets: pricing does not always scale linearly as usage grows, bulk storage discounts often fail to deliver the promised savings, data movement charges add up quickly on large volumes moved to and from the cloud, and the services typically lack the low latency and high bandwidth needed for rapid access to data. These limitations become more pronounced once the amount of data being stored exceeds a few terabytes.
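To make these pricing dynamics concrete, the sketch below estimates a monthly cloud bill from storage and data-egress charges. The per-gigabyte rates are illustrative placeholders, not any particular provider's actual prices; the point is how quickly egress fees compound on research-scale data movement.

```python
# Illustrative only: the default rates below are hypothetical placeholders,
# not any cloud provider's published pricing.
def monthly_cloud_cost(stored_tb, egress_tb,
                       storage_per_gb=0.023, egress_per_gb=0.09):
    """Estimate a monthly bill: at-rest storage plus data-egress charges."""
    gb = 1024  # TB -> GB
    storage_cost = stored_tb * gb * storage_per_gb
    egress_cost = egress_tb * gb * egress_per_gb
    return storage_cost + egress_cost

# A lab keeping 100 TB in the cloud and pulling 20 TB back out each month:
cost = monthly_cloud_cost(stored_tb=100, egress_tb=20)
print(f"${cost:,.0f}/month")  # → $4,198/month under these assumed rates
```

Note that under these assumptions, nearly half the bill comes from moving data out of the cloud rather than storing it, which is why egress-heavy research workflows are hit hardest.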
In contrast, Open Storage has evolved over the past decade to embrace next-generation software-defined storage (SDS) and hyper-converged infrastructure (HCI), becoming a viable option for taming the “data explosion” challenges facing university IT professionals and academics. The capabilities and maturity of Open Storage have advanced quickly in recent years as many code contributors have devoted resources to platforms such as OpenZFS. Features previously available only at a premium with proprietary storage solutions – such as snapshots, replication, and API access – are now standard in Open Storage.
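As a minimal sketch of those standard features, the OpenZFS commands below take a point-in-time snapshot, roll back to it, and replicate it to a second server. The pool and dataset names (`tank/microscopy`, `backup-host`) are hypothetical; adjust them to your own system.

```shell
# Hypothetical pool/dataset names; requires an OpenZFS system.
# Take a point-in-time snapshot of a research dataset:
zfs snapshot tank/microscopy@2024-06-01

# List snapshots, and roll back if an analysis run corrupts files:
zfs list -t snapshot tank/microscopy
zfs rollback tank/microscopy@2024-06-01

# Replicate the snapshot to a second server over SSH:
zfs send tank/microscopy@2024-06-01 | ssh backup-host zfs recv backup/microscopy
```

Because snapshots are copy-on-write, they are near-instant and initially consume almost no extra space, which is what makes frequent snapshotting of large research datasets practical.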
There is a range of Open Storage options, including Red Hat Ceph, MinIO, Apache Hadoop’s HDFS, and my own organization’s platform. These platforms are flexible in that they are not limited to a single storage infrastructure: they can serve structured data, unstructured network-attached storage (NAS) environments, or object storage at a fraction of the price of proprietary storage without giving up major features – especially important in a university research setting.
An example of one university finding success with Open Storage is Cambridge University, a world-renowned research institution in the United Kingdom. The university needed a better way to store and manage its research data, whose growing costs were cutting into research budgets. As a result, administrators opted to implement Open Storage throughout. In one department, the university installed our platform to support a large volume of microscopy data. Traditional or cloud-based storage would have limited how this valuable data could be collected, stored, and retrieved, slowing data analysis and research progress.
According to a recent paper by Cambridge University, “Data acquisition rates in fluorescence microscopy are exploding due to the increasing size and sensitivity of detectors, brightness and variety of available fluorophores, and complexity of the experiments and imaging equipment. An increasing number of laboratories are now performing complex imaging experiments that rapidly generate gigabytes and even terabytes of data, but the practice of data storage continues to lag behind data acquisition capabilities.”
The university collects large amounts of this data while simultaneously developing new experiments and data-processing pipelines, which further accelerates storage growth. Much of the research performed at the university remains in constant flux, making it nearly impossible to implement a single, reliable data acquisition process that can be relied on over an extended timeframe. Data storage flexible enough to accommodate multiple requirements allows research tools to keep developing and maturing, relieving pressure on the data processing and storage infrastructure.
As highlighted by the university, network connectivity for the Open Storage infrastructure uses 10 GbE cards that accept either copper or optical-fiber connectors. The storage server is connected to two networks: a local 10 Gbps link between the storage system and a light-sheet microscope, and a general-use 1 GbE connection to the rest of the university’s network. This dual connection ensures that even if the university network is unavailable due to failure or maintenance, administrators can still use the platform to store large datasets from the light-sheet microscope.
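The difference between the two links matters for acquisition-scale datasets. A quick back-of-the-envelope calculation shows why the dedicated 10 Gbps path is essential; the 80% efficiency factor is an assumption standing in for protocol and disk overhead.

```python
def transfer_hours(dataset_tb, link_gbps, efficiency=0.8):
    """Hours to move a dataset over a link, assuming ~80% usable throughput."""
    bits = dataset_tb * 8 * 1e12           # TB -> bits (decimal units)
    usable = link_gbps * 1e9 * efficiency  # bits/second actually achieved
    return bits / usable / 3600

# A 10 TB light-sheet acquisition:
print(f"10 GbE: {transfer_hours(10, 10):.1f} h")  # → 2.8 h
print(f"1 GbE:  {transfer_hours(10, 1):.1f} h")   # → 27.8 h
```

On the shared 1 GbE link the same transfer would occupy more than a full day, so routing microscope output over the dedicated 10 Gbps network keeps acquisition from stalling behind general campus traffic.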
As the popularity of Open Storage accelerates, similar university use cases abound. Another U.K. university uses our platform to support a backup service for all of its departmental clients. The University of Florida, the University of California San Diego, Johns Hopkins, University College London, Caltech, and many other forward-thinking institutions are actively leveraging Open Storage to manage data growth, and the robustness of ZFS is key to long-term, reliable data retention.
Open Storage offers the flexibility, reliability, and stability to take full control of both the architecture and the destiny of data storage, with the agility to handle a large number of research workloads simultaneously. More sophisticated storage requirements are easily met, as open-source community support enables compelling new features to be added at low or no cost. This flexibility reduces the risk of wasting funding on proprietary or cloud-based storage options.
This class of open-source storage makes all of the above possible without sacrificing features or compromising resiliency. It also allows organizations to continue increasing the variety and volume of data kept under management at a very low price point. Computing costs have benefited enormously from open source, and storage costs are doing the same.