
Cancer Data Science Initiatives at the National Cancer Institute, part of the National Institutes of Health (NIH)

Paul Thuman, Director of DOD and IC Sales at Cloudian

The Problem: Searching for Research Files on Tape Is Like Looking for a Needle in a Haystack

National Cancer Institute (NCI) scientists have long been frustrated when trying to find complementary data from other scientists that could help with current research. The lack of searchable infrastructure and the proliferation of storage locations made any kind of search challenging. Data was first stored on disk-based NAS systems and later migrated to tape, where files were especially difficult to find and retrieve. If a retrieved file turned out not to be applicable to the current research, the cumbersome search process had to be repeated, costing valuable time.

The Solution: Inexpensive Object Storage with Rich Metadata Tags for Quick and Accurate Search

To increase their scientists’ productivity, NCI needed an approach that would decrease the time their researchers spent searching for relevant files. NCI decided to leverage object storage, a data storage technology that offers two compelling features: 

  • Rich metadata that makes data easily searchable
  • Scalability that allows data anywhere to be located with a single search query 

Object storage provides both capabilities. It scales out without a fixed capacity ceiling, and a single namespace can span multiple sites and storage systems, which consolidates the multiple storage silos that frustrate data searches. Object storage also includes embedded rich metadata, or “data about the data,” which enables rapid search based on descriptive text.

For their project, NCI implemented a 25-petabyte object storage system. In addition to being seamlessly scalable, the new system automatically replicates data to a second active data center and provides over six nines of data reliability, all at less than half the cost of public cloud storage. When growth requires added capacity, the scale-out system can be expanded indefinitely with no system downtime.

Because the new system is based on the de facto cloud standard AWS S3 API, it is compatible with a wide range of data management software, including iRODS, an open-source software solution selected by NCI. 
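To make the S3 compatibility point concrete, here is a minimal Python sketch of how descriptive metadata travels with an object in an S3-style upload. The S3 API carries user-defined metadata as x-amz-meta-* request headers. The endpoint, bucket, key, and tag names below are hypothetical and are not taken from NCI's deployment, and request signing is omitted, so the sketch only builds the request rather than sending it.

```python
from urllib.parse import quote


def build_put_request(endpoint, bucket, key, metadata):
    """Build the URL and metadata headers for a path-style S3 PUT Object call.

    Illustration only: a real request would also need authentication
    (for example AWS Signature Version 4), which is left out here.
    """
    # Path-style addressing: <endpoint>/<bucket>/<key>
    url = f"{endpoint.rstrip('/')}/{quote(bucket)}/{quote(key)}"
    # User-defined metadata is sent as x-amz-meta-<name> headers.
    headers = {
        f"x-amz-meta-{name.lower()}": str(value)
        for name, value in metadata.items()
    }
    return url, headers


# Hypothetical endpoint, bucket, key, and tags (assumptions for illustration).
url, headers = build_put_request(
    "https://objects.example.org",
    "research-data",
    "runs/2021/sample-001.bam",
    {"Study": "example-study", "Assay": "rna-seq"},
)
print(url)
print(headers)
```

Because the request shape is defined by the API rather than by the storage vendor, the same call could in principle be pointed at any S3-compatible endpoint by changing only the endpoint value.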

Storage Virtualization Drove High User Satisfaction Amongst Data Scientists at the National Cancer Institute

Storage virtualization was a key benefit driving NCI’s move to object storage. It was also a benefit that significantly improved user satisfaction because it simplified information access. To understand why, consider an example of storage virtualization we use every day: the internet. A web search is both simple and powerful because the physical location of the information we seek is irrelevant. The search engine finds information wherever it is, and we can access it without knowing (or caring about) the location. If the information changes location, our search result still works. The location is abstracted from the address (the URL). 

Object storage does the same thing. It employs a storage pool: multiple storage devices combined into what appears to be a single storage device. Data is accessed via a URL, just as on the internet. We can find and access the data regardless of its location. If the data location changes, the original URL still works. This is a vast simplification compared to legacy storage technologies, where data resides on specific devices that must be addressed individually.
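The idea that a URL stays valid while the data moves can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Cloudian or any real object store is implemented: the StoragePool class, the device names, and the base URL are all hypothetical.

```python
class StoragePool:
    """Toy model of several devices presented behind one namespace."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self._location = {}  # object key -> device currently holding it
        self._devices = {}   # device name -> {object key: bytes}

    def put(self, key, data, device):
        """Store data on a device and return the object's stable URL."""
        previous = self._location.get(key)
        if previous is not None:
            # Overwriting a key: drop the stale copy on the old device.
            self._devices[previous].pop(key, None)
        self._devices.setdefault(device, {})[key] = data
        self._location[key] = device
        return f"{self.base_url}/{key}"

    def migrate(self, key, new_device):
        """Move an object to another device; its URL does not change."""
        data = self._devices[self._location[key]].pop(key)
        self._devices.setdefault(new_device, {})[key] = data
        self._location[key] = new_device

    def locate(self, key):
        """Return the device that currently holds the object."""
        return self._location[key]

    def get(self, url):
        """Resolve a URL to its data, wherever the object now lives."""
        prefix = self.base_url + "/"
        if not url.startswith(prefix):
            raise ValueError("URL is outside this namespace")
        key = url[len(prefix):]
        return self._devices[self._location[key]][key]


# Hypothetical namespace and device names.
pool = StoragePool("https://objects.example.org/research-data")
url = pool.put("scan-42.tiff", b"pixels", device="site-a/node-1")
pool.migrate("scan-42.tiff", "site-b/node-7")
print(pool.locate("scan-42.tiff"), pool.get(url))
```

The caller only ever holds the URL. The mapping from key to device is private to the pool, which is what lets data be rebalanced or moved between sites without breaking existing references.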

To learn more about storage virtualization, watch this 2-minute clip. Sunita Menon, who manages the Cancer Data Science Initiatives at the Frederick National Laboratory for Cancer Research in Frederick, MD, explains the simplicity of the database configuration when switching storage platforms from Cleversafe to Cloudian. 

Key Solution Factor: Reduced Time to Insight

Cancer research is a field where time to insight is one of the most crucial variables in saving lives. Object storage helps save time with rapid data search via embedded metadata. NCI developed an application that helps scientists augment system-generated metadata with descriptive “tags.” These tags are then stored in an Oracle database along with URL pointers to the actual file on the object storage system. With these tags, scientists are now able to find relevant complementary files very quickly, resulting in better, more thorough cancer research conclusions in less time. 
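The tag catalog described above can be sketched as follows. The source says the tags are stored in an Oracle database along with URL pointers; this sketch substitutes Python's built-in sqlite3 module so that it runs anywhere, and the table layout, tag names, and URLs are hypothetical rather than NCI's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for the Oracle database (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL);
    CREATE TABLE tags (
        object_id INTEGER NOT NULL REFERENCES objects(id),
        name  TEXT NOT NULL,
        value TEXT NOT NULL
    );
    CREATE INDEX idx_tags ON tags (name, value);
""")


def register(url, tags):
    """Record an object's URL pointer together with its descriptive tags."""
    cur = conn.execute("INSERT INTO objects (url) VALUES (?)", (url,))
    conn.executemany(
        "INSERT INTO tags (object_id, name, value) VALUES (?, ?, ?)",
        [(cur.lastrowid, name, value) for name, value in tags.items()],
    )


def find(**wanted):
    """Return the URLs of objects that carry every requested tag."""
    if not wanted:
        return []
    clauses = " OR ".join("(t.name = ? AND t.value = ?)" for _ in wanted)
    params = [part for pair in wanted.items() for part in pair]
    rows = conn.execute(
        "SELECT o.url FROM objects o JOIN tags t ON t.object_id = o.id "
        f"WHERE {clauses} "
        "GROUP BY o.id HAVING COUNT(DISTINCT t.name) = ? ORDER BY o.url",
        params + [len(wanted)],
    )
    return [row[0] for row in rows]


# Hypothetical URLs and tags.
base = "https://objects.example.org/research-data"
register(f"{base}/a.bam", {"tissue": "lung", "assay": "rna-seq"})
register(f"{base}/b.bam", {"tissue": "lung", "assay": "wgs"})
register(f"{base}/c.bam", {"tissue": "breast", "assay": "rna-seq"})

print(find(tissue="lung"))
print(find(tissue="lung", assay="rna-seq"))
```

Keeping only URL pointers in the database means the catalog stays small and searchable while the files themselves remain in the object store, which matches the division of labor the paragraph describes.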

Lower Total Cost

Counting both the capital and the operational expenses of tape, NCI lowered its total cost of ownership (TCO) relative to the tape system. Going forward, NCI will build on its investment by incorporating other big data AI applications that use the same S3 API. This will allow its scientists to analyze large datasets for even more comprehensive results, and it helps future-proof the Cloudian object storage system NCI chose.

The transition to object storage was a big success for NCI. The end result is faster time to insight, greater user satisfaction, and lower operating costs. Furthermore, the solution is future-proof. It can grow indefinitely, across multiple locations, while retaining cloud-like simplicity of operation. And because object storage employs a cloud-native API (the S3 API), it is compatible with the full range of software written to employ cloud storage. Overall, the move to Cloudian has been a win for the storage team, a win for the researchers, and a win for the public who will benefit from the research results. 

Watch the full iRODS Presentation by Frederick National Lab’s Sunita Menon here
