Distributed Storage: What’s Inside Amazon S3?

The exponential growth of data volumes across industries demands new storage technology. Distributed storage spreads files, blocks, or objects across multiple physical servers for high availability, data backup, and disaster recovery. Learn about the distributed storage technology that powers massively scalable storage services like Amazon S3, as well as huge data pools in on-premises data centers.

This article is part of a series on Data Backup.

What is Distributed Storage?

A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.

Distributed storage is the basis for massively scalable cloud storage systems like Amazon S3 and Microsoft Azure Blob Storage, as well as on-premises distributed storage systems like Cloudian HyperStore.

Distributed storage systems can store several types of data:

  • Files—a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
  • Block storage—a block storage system stores data in fixed-size chunks called blocks, organized into volumes. This is an alternative to file-based structures that can deliver higher performance. A common distributed block storage architecture is a Storage Area Network (SAN).
  • Objects—a distributed object storage system wraps data into objects, identified by a unique ID or hash.

Distributed storage systems have several advantages:

  • Scalability—the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
  • Redundancy—distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.
  • Cost—distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at low cost.
  • Performance—distributed storage can offer better performance than a single server in some scenarios, for example, it can store data closer to its consumers, or enable massively parallel access to large files.

Features and Limitations

Most distributed storage systems have some or all of the following features:

  • Partitioning—the ability to distribute data between cluster nodes and enable clients to seamlessly retrieve data from multiple nodes (see the sketch after this list).
  • Replication—the ability to replicate the same data item across multiple cluster nodes and maintain consistency of the data as clients update it.
  • Fault tolerance—the ability to keep data available even when one or more nodes in the distributed storage cluster goes down.
  • Elastic scalability—enabling data users to receive more storage space if needed, and enabling storage system operators to scale the storage system up and down by adding storage units to the cluster or removing them from it.
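
To make the partitioning and replication ideas above concrete, here is a toy Python sketch of hash-based data placement. The node names and replication factor are illustrative only; production systems typically use consistent hashing with virtual nodes, but the core idea is the same: each key deterministically maps to a set of nodes, so any client can locate data without a central lookup.

```python
import hashlib

# Illustrative cluster layout; real systems discover nodes dynamically.
NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICATION_FACTOR = 2  # number of copies kept of each data item

def nodes_for_key(key: str) -> list:
    """Hash the key to pick a primary node, then place replicas on the
    next nodes in the ring."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    primary = digest % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

# The same key always maps to the same nodes.
print(nodes_for_key("reports/2021/summary.csv"))  # e.g. ['node-c', 'node-d']
```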

An inherent limitation of distributed storage systems is described by the CAP theorem. The theorem states that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance (the ability to keep operating when network failures split the cluster into disconnected groups of nodes). Since network partitions are unavoidable in practice, when one occurs the system must sacrifice either consistency or availability. Many distributed storage systems relax consistency while guaranteeing availability and partition tolerance.

Amazon S3: An Example

Amazon S3 is a distributed object storage system. In S3, objects consist of data and metadata. The metadata is a set of name-value pairs that provides information about the object, such as the date it was last modified. S3 supports both standard metadata fields and custom metadata defined by the user.
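
For example, a minimal sketch using the AWS SDK for Python (boto3) can attach and read back custom metadata; the bucket name, key, and metadata values here are hypothetical placeholders, and credentials are assumed to be configured in your environment:

```python
import boto3

s3 = boto3.client("s3")

# Store an object with custom, user-defined metadata.
s3.put_object(
    Bucket="example-bucket",
    Key="reports/summary.csv",
    Body=b"col1,col2\n1,2\n",
    Metadata={"department": "finance", "owner": "jdoe"},
)

# Read the metadata back without downloading the object body.
head = s3.head_object(Bucket="example-bucket", Key="reports/summary.csv")
print(head["Metadata"])      # custom name-value pairs
print(head["LastModified"])  # a standard metadata field
```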

Objects are organized into buckets. Amazon S3 users create buckets and specify which bucket to store objects in, or retrieve objects from. Buckets are logical structures that let users organize their data. Behind the scenes, the actual data may be distributed across a large number of storage nodes in multiple Amazon Availability Zones (AZs) within the same region. An Amazon S3 bucket is always tied to a specific geographical region, for example US East (N. Virginia), known as us-east-1, and objects do not leave that region unless you explicitly transfer or replicate them.
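
A bucket's region is fixed at creation time. The boto3 sketch below uses us-east-2 as an arbitrary example region and a hypothetical bucket name (for us-east-1, the CreateBucketConfiguration argument is omitted):

```python
import boto3

# Create a bucket pinned to a specific region (us-east-2 as an example).
s3 = boto3.client("s3", region_name="us-east-2")
s3.create_bucket(
    Bucket="example-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-east-2"},
)
```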

Each object in S3 is identified by a bucket, a key, and a version ID. The key is a unique identifier for each object within its bucket. When versioning is enabled on a bucket, S3 retains multiple versions of each object, distinguished by their version IDs.
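
Once versioning is turned on for a bucket, every write to the same key yields a new version ID, and older versions remain retrievable. A brief boto3 sketch with hypothetical names:

```python
import boto3

s3 = boto3.client("s3")

# Turn on versioning so S3 keeps old versions instead of overwriting.
s3.put_bucket_versioning(
    Bucket="example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Each PUT to the same key now produces a distinct version ID.
v1 = s3.put_object(Bucket="example-bucket", Key="config.json", Body=b"{}")
v2 = s3.put_object(Bucket="example-bucket", Key="config.json", Body=b'{"a": 1}')

# Retrieve an older version explicitly by its version ID.
old = s3.get_object(
    Bucket="example-bucket",
    Key="config.json",
    VersionId=v1["VersionId"],
)
print(old["Body"].read())  # b"{}"
```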

The CAP theorem shaped S3's design: the service prioritizes high availability and partition tolerance. For most of its history, S3 offered an eventual consistency model (since December 2020, S3 has provided strong read-after-write consistency). Under eventual consistency:

  • When you PUT or DELETE data in S3, data is safely stored, but it may take time for the change to replicate across Amazon S3.
  • When a change occurs, clients reading the data immediately afterward might still see the old version until the change has propagated.
  • S3 guarantees atomicity—when a client reads the object, they might view the old version of the object, or the new version, but never a corrupted or partial version.
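
A common client-side pattern for eventually consistent object stores is to retry a read briefly until a newly written object becomes visible. The boto3 sketch below illustrates the pattern; the bucket and key names are hypothetical, and against today's strongly consistent S3 the loop would simply succeed on the first attempt:

```python
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def read_with_retry(bucket: str, key: str, attempts: int = 5, delay: float = 0.5) -> bytes:
    """Retry a read until the object is visible, tolerating the window
    where a recent PUT has not yet propagated to all replicas."""
    for _ in range(attempts):
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except ClientError as err:
            if err.response["Error"]["Code"] != "NoSuchKey":
                raise  # a real error, not a propagation delay
            time.sleep(delay)
    raise TimeoutError(f"{key} not visible after {attempts} attempts")
```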

Learn more about storage archive and storage tiering in our guides.

Hybrid Cloud Distributed Storage with Cloudian

Cloudian HyperStore is an on-premises enterprise storage solution that uses a fully distributed architecture to eliminate single points of failure and enable easy scalability from hundreds of terabytes to exabytes. It is fully compatible with the S3 API.

The HyperStore software implementation builds on three or more distributed nodes, allowing you to replicate your objects for high availability. It lets you add as many storage devices as needed, and the additional devices automatically join an elastic storage pool.
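
Because HyperStore exposes the S3 API, standard S3 clients can target it simply by overriding the endpoint. A minimal boto3 sketch, where the endpoint URL and credentials are hypothetical placeholders for your own deployment:

```python
import boto3

# Point a standard S3 client at an S3-compatible on-premises endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://hyperstore.example.internal",  # hypothetical endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",                 # placeholder credentials
    aws_secret_access_key="YOUR_SECRET_KEY",
)
print(s3.list_buckets()["Buckets"])
```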

Sign up for a free trial.

Get Started With Cloudian Today