Architecture of object-based storage and S3 standard specifications

Object storage has grown in popularity among data storage architectures. Compared with file and block storage, object storage does not hit the same limitations when handling petabytes of data. By design, the boundaryless nature of object storage makes it suitable for Big Data and Cloud contexts.

In addition, object storage is simple and efficient. It provides straightforward data replication and scalability, and is well suited to "write once, read many" contexts such as data analysis. These characteristics, combined with its ease of implementation and programmability, account for its widespread use.

What exactly is an object? How does object storage work and what makes it scalable? We aim to clarify this.

Object storage is not exclusive to cloud services such as AWS Simple Storage Service (S3): several on-premises object storage solutions are also available. Because AWS S3 sets the standard for object storage APIs, storage solutions and the applications that consume from them are federated under "S3 compatibility". Any S3-compatible app works with a large number of S3-compatible object storage solutions and vice versa, fueling the growth of both.

This article is the first in a series of three.

Object storage: how it works, why it scales

As the name suggests, object storage stores data in the form of objects. The core paradigm of object storage is to optimize common data and metadata operations while keeping the two connected. What is an object made of?

An object is the combination of a key (granting access), a value (the actual data) and associated metadata: both the object's own metadata and extra metadata added by the object store for large-scale management. Unlike in file systems, this metadata is stored in the same location as the data. The key used to access the object combines the object's name, its path and a unique object identifier (OID) generated by the object store.

Metadata plays a key role in object storage, making it possible to abstract away the hierarchy found in file systems. With object-based storage, everything is stored in a flat namespace with no hierarchy. Indexing and further management are achieved through metadata properties alone.

Objects can be enriched with custom metadata, which enables more flexible data analysis. Metadata also helps control data replication.
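To make the paradigm concrete, here is a minimal Python sketch of a flat, metadata-indexed object store. It is a toy illustration, not any real product's implementation; the class and field names are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class StoredObject:
    oid: str                                      # unique identifier generated by the store
    value: bytes                                  # the actual data
    metadata: dict = field(default_factory=dict)  # system and custom metadata together

class FlatObjectStore:
    """Toy flat-namespace store: no directory tree, keys map straight to objects."""

    def __init__(self):
        self._objects = {}                        # key -> StoredObject

    def put(self, key, value, **custom_metadata):
        oid = f"oid-{len(self._objects)}"         # naive OID generation, for the sketch only
        self._objects[key] = StoredObject(oid, value, dict(custom_metadata))
        return oid

    def get(self, key):
        return self._objects[key].value

    def find_by_metadata(self, **query):
        # Lookup relies on metadata alone, not on any directory hierarchy
        return [key for key, obj in self._objects.items()
                if all(obj.metadata.get(k) == v for k, v in query.items())]
```

Even though a key like `2023/logs/app.log` looks like a path, the store treats it as one opaque name in a flat namespace; only the metadata drives indexing.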

Object storage devices (OSDs) are the physical devices that support the actual storage; they are either dedicated disks or dedicated partitions within disks. OSDs can be of different types and belong to one or more storage pools. These pools are logical partitions of the data: they own objects and are replicated across multiple OSDs, as shown below.




Illustration of replication of storage pool objects across multiple OSDs

Thanks to this data replication across multiple locations, object storage achieves:

  • High availability, ensuring low query latency and no bottleneck on a single busy device;
  • Elasticity and failover against device failures;
  • Scalability, since OSDs can be added without limit.

With object storage, it is easy to start small and grow big: the available storage and the number of drives can be expanded without compromising existing data. It is as simple as adding a new node with raw disks to the cluster; the disks are automatically integrated into storage pools. Removing a storage device is also handled gracefully: the data it previously held is copied to other devices. And the combination of object name, path and ID helps eliminate name collisions.

This ability to scale the storage is virtually unlimited. In terms of performance, there is no difference between handling terabytes or petabytes of data, thanks to object storage's flat structure and its use of extra object metadata for indexing and efficient management of the store.

Overall, object storage is suitable for large volumes of unstructured data, and never exposes its underlying storage infrastructure to its consumers. It is a suitable architecture for distributed, scalable storage. Now let's dive further into the interface that provides access to this data.

Object storage data access: S3 API Standard

There are various implementations of object storage, with a common modern interface: the S3 API.

In object storage, it is common to transport data using an HTTP REST API. Several proprietary implementations of such APIs previously existed for object storage, and few developers programmed against them. In 2006, AWS Simple Storage Service (S3) specified the generally accepted fundamentals for this API.

In other words: S3 is used here to denote the open standard, not the AWS service.

The S3 REST API is easy to learn and use. It allows users to write, list, retrieve and delete objects from a single endpoint using HTTP verbs such as PUT and GET. In object storage, data is logically divided into buckets: protected partitions of data that can only be accessed by their associated S3 users. The bucket name is usually a prefix of the S3 request URI.
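As an illustration, the two common ways a bucket name appears in an S3 request URI can be sketched in Python. The endpoint, bucket and key below are hypothetical.

```python
from urllib.parse import quote

def path_style_url(endpoint, bucket, key):
    """Path-style addressing: the bucket name prefixes the object key in the URI."""
    return f"{endpoint}/{bucket}/{quote(key)}"

def virtual_hosted_url(host, bucket, key):
    """Virtual-hosted-style addressing: the bucket name becomes a subdomain."""
    return f"https://{bucket}.{host}/{quote(key)}"

# Hypothetical endpoint, bucket and object key
url = path_style_url("https://s3.example.com", "my-bucket", "reports/2023.csv")
```

Both forms identify the same object; which one a given storage solution accepts depends on its configuration.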

S3 users can own one or more buckets, and their S3 credentials grant them this access. S3 credentials are a pair of keys: an access key and a secret key. These two keys are confidential and provide write, read and delete access to everything the user owns in the object store, so they should be distributed with care.

As a whole, the S3 API provides a number of benefits:

  • Security, because all actions require S3 credentials;
  • Privacy and isolation of data between multiple users, each user receiving an isolated portion of the storage;
  • Atomicity, as writes and updates are performed in a single transaction.

Having both storage providers and user applications converge on this standard is a hugely beneficial factor for the growth of object storage, for both providers and users. S3-compatible apps have a large market of different possible storage solutions, and object storage providers themselves are compatible with many different S3 apps.

Using object storage through S3 clients

Object access is done programmatically through S3 clients, which S3-enabled apps use to interact with the storage. There are two types of clients:

  • Command line clients, e.g. the AWS CLI or s5cmd. s5cmd is open source, one of the fastest clients, and the recommended way to interact with S3 object storage solutions via the CLI. It is written in Go and can be used from a prebuilt binary, built from source, or run in a Docker container;
  • AWS SDKs, which are development tools that allow applications to query S3-compliant object storage. SDKs exist for a number of different programming languages, including Java, C++, Python, JavaScript, and more.

S3 URI schemes

Accessing an object through the API requires the object name, the bucket name and, when using AWS S3, the region name. These are combined into a REST URI, which acts as a unique identifier for the object. This URI uses one of the s3:// family of schemes:

  • s3://: Deprecated, used to create a block-based overlay on top of S3 storage; it will not be used in this context;
  • s3n://: S3 Native protocol, supports individual objects up to a size of 5 GB;
  • s3a://: Successor to s3n, built with the AWS SDK; more performant, less limited and the recommended scheme for object storage access.
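A small Python helper, using only the standard library, can illustrate how these URIs break down into a scheme, a bucket and an object key. The bucket and key below are made up for the example.

```python
from urllib.parse import urlparse

def parse_s3_uri(uri):
    """Split an s3/s3n/s3a URI into (scheme, bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3n", "s3a"):
        raise ValueError(f"not an S3 URI scheme: {parsed.scheme}")
    # The netloc is the bucket; the path (minus its leading slash) is the object key
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")

# Hypothetical bucket and object key
scheme, bucket, key = parse_s3_uri("s3a://my-bucket/path/to/object.csv")
```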

Additionally, we need to specify the S3 endpoint. By default, when S3 clients query these schemes, they query Amazon AWS S3 object storage. This must be changed when using other object storage solutions, which expose their own endpoints. The endpoint can be changed in the configuration settings of S3-compatible applications or passed as an option to S3 CLI tools.
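The following Python sketch mimics, in much simplified form, how a client might resolve its endpoint: default to AWS unless an override is configured. The local MinIO address shown is just an example of such an override.

```python
def resolve_endpoint(region="us-east-1", endpoint_override=None):
    """Default to the AWS S3 endpoint for the region, unless an alternative
    endpoint is configured (e.g. a self-hosted object storage instance)."""
    if endpoint_override:
        return endpoint_override
    return f"https://s3.{region}.amazonaws.com"

# Hypothetical override pointing at a local MinIO instance
local = resolve_endpoint(endpoint_override="http://localhost:9000")
```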

Using S3 credentials

Credentials must be provided to the client in order to connect to a given bucket in the object storage. Most S3 clients can retrieve the credentials in different ways; the three most common are:

  • As environment variables: AWS_ACCESS_KEY_ID for the access key and AWS_SECRET_ACCESS_KEY for the secret key;
  • As a credentials file, under ~/.aws/credentials;
  • In the config file, under ~/.aws/config.

When any of these three is provided, the S3 client can retrieve it to connect to the object storage instance. Note that the authentication settings have a priority order, with environment variables having the highest priority.
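The lookup order can be sketched in Python with the standard library: environment variables are checked first, then the [default] profile of the INI-style credentials file. This is a simplified model of what real clients do, and the key values in the test are placeholders.

```python
import configparser

def load_credentials(credentials_text, environ):
    """Resolve S3 credentials: environment variables take priority,
    then the [default] profile of the credentials file."""
    access = environ.get("AWS_ACCESS_KEY_ID")
    secret = environ.get("AWS_SECRET_ACCESS_KEY")
    if access and secret:
        return access, secret
    # Fall back to the INI-style credentials file (~/.aws/credentials)
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    profile = parser["default"]
    return profile["aws_access_key_id"], profile["aws_secret_access_key"]
```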

Since these credentials provide access to all operations on a user's S3 buckets, it is important to secure them. Passing them as options on the shell command line is not recommended, as they would be logged in plain text. It is therefore preferable to handle them as files or environment variables. In Kubernetes environments, Kubernetes Secrets help manage these credentials and safely pass them to containers as environment variables, using env or envFrom together with secretRef.
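As a sketch, a Kubernetes Secret holding the two keys could be exposed to a container through envFrom and secretRef as follows. The Secret name, image and key values are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials                     # hypothetical Secret name
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: AKIAEXAMPLE           # placeholder values
  AWS_SECRET_ACCESS_KEY: wJalrEXAMPLEKEY
---
# Fragment of a Pod spec: every key of the Secret becomes an environment variable
containers:
  - name: s3-app
    image: my-s3-app:latest                # hypothetical image
    envFrom:
      - secretRef:
          name: s3-credentials
```

This way the keys never appear in the Pod manifest or in command-line arguments, only in the Secret object.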

Conclusion

Object-based storage is popular for its ease of use, as each data operation is handled with HTTP requests such as PUT and GET.

Several on-premises object storage solutions exist, even though the model is often associated with the cloud. The two most popular are open source and easy to deploy. They are full-fledged alternatives to cloud object storage providers and do not change how object storage consumers behave.

The next two articles in the series explain how to host object storage on a local cluster, with Rook and Ceph and with MinIO.


