Multi Codec Jukebox: Dynamo, Swift Objectstore, Cassandra

This is a 3 part series where I discuss the Swift Objectstore and Cassandra in the context of the original idea that inspired them both: Amazon's Dynamo.

This post is part 2 of the series that looks at the Openstack Swift Objectstore. If you haven't read part 1 of the series - which describes the Amazon Dynamo data store - then you may want to first read that.

OVERVIEW

The Openstack Swift Objectstore is a distributed file storage system. Swift open-source software creates a content-addressable storage cluster on top of multiple (usually off-the-shelf x86) boxes full of storage disks. The content-addressable storage provides a uniform namespace for all objects that are stored ib Swift cluster. Each data object is represented via a URL of the form

http://swift-cluster-lb.address/account/container/object

By clustering multiple off-the-shelve boxes and replicating data across them Swift can achieve almost limitless scalability and data durability by spreading data replicas across independent failure domains (different disks/servers/racks or even data centres). The account, container and object abstractions of Openstack Swift are analogous to volume, directory and file in conventional file systems. Clients can issue Create, Read, Update and Delete (CRUD) requests on a data object by passing different verbs in HTTP requests made against the URL.

The Swift objectstore uses the Dynamo consistent hashing idea to set up partitions on the multiple boxes. Usually many hundreds or a few thousand partitions, chosen from random ring hash ranges, are mapped to each storage node. Each partition is replicated multiple (N=3 by default) times. Swift guarantees that partition replicas are kept in as distinct failure domains as the hardware allows (e.g. different disks if the whole swift cluster is on one server, or different servers if the whole swift cluster is a rack of servers, or across racks for multi-rack clusters). This strategy ensures high probability of data availability when hardware fails because hardware faults are usually localized. For example, if a top-of-rack switch fails, then a multi-rack Openstack Swift cluster would still be able to serve any stored object because the replicas exist on different racks.

Openstack Swift consists of multiple types of processes usually running (multiple copies) on different physical servers. Openstack Swift clients usually send CRUD requests to a HTTP(S) load balancer, which in turn distributes these requests among a pool of proxy processes. All proxy processes have complete information about how partitions are physically mapped to the boxes and OS partitions on disks within these boxes. So they can direct each incoming request (based on its URL and HTTP verb) to the appropriate storage processes for processing. In this model all data is passed through the proxy and load balancer(s) and no storage node is directly accessible to clients. Openstack Swift uses N=3, W=2 and R=1 by default (see the earlier post on Dynamo for the meaning of these variables). Therefore writes are acknowledged after the incoming object has been successfully written on two separate partitions. Reads are returned via the storage server that is fastest to respond to the proxy server with the requested data.

Like Dynamo, the Swift Objectstore provides eventual consistency. Swift adopts a proactive algorithm to check consistency between replicas of partitions. Storage nodes periodically compare their partitions (which are directories on the files system) and rsync any differences between the partition replicas. The period between these comparisons user-configurable and set to 30 seconds by default. One of the notable characteristics of Openstack Swift is that the replication is at the file level, not the block level (such as traditional RAID systems). So the amount of time to "rebuild" a broken drive is proportional to the size of the data stored on the drive and not the total capacity of the drive.

Swift also keeps account and container information separately in sqllite databases for any metadata operations (such as listing objects in a container). These sqllite databases are also checked for consistency via periodic comparisons with their replicas and synchronized based on timestamps if there differences between replicas.

MECHANICS

Its worthwhile understanding how the CRUD operations and background replication and scrub translate to low level disk operations on the storage nodes. Understanding this aspect of Swift opens the door to understand performance in terms of ideal and less than ideal workloads, strengths and limitations of Swift, and the impact of hardware choices on System performance. Fortunately Openstack Swift code is very well documented, well factored, and professionally maintained, making it an excellent source of understanding its inner workings.

Create

A create request requires Swift to store an object (a binary blob of data) along with some metadata associated with the object. The proxy node determines the (N) partitions where the object is to be stored and forwards the request to those storage nodes. Each storage node first does a lookup within the target partition to confirm that an object with identical account/container/object hash does not already exist. If not, then a directory is created within the partition to store the object. The object is stored as a binary file with a .data extension. Metadata associated with the object is also stored within the inode or within .meta files in the same directory. For more details, refer to the Object server and diskfile source code files.

Read

A read request is forwarded by the proxy server to all N storage servers containing the partition in which the object is stored. Each of these storage nodes checks within the appropriate partition if the directory containing the object exists. If it does, then the object's directory is checked for .ts files (a .ts or tombstone file would indicate that the object is deleted and the a 404 not found response should be returned to the client). The directory is also tested for .meta files in case additional metadata files associated with the object are available. Finally the .data file containing the object and the corresponding metadata read from the XFS inode metadata and any .meta files is composed into the HTTP response sent back to the proxy node and the client. Recall that Openstack Swift returns the first successfully read object to the client from among the storage servers. For more details, refer to the Object server and diskfile source code files.

Delete

Deletes are asynchronous in Openstack Swift. A .ts (tombstone) file is created in the object's folder to indicate that the object has been deleted. The container sqllite database is also updated. A subsequent asynchronous background process (called the auditor) deletes the object at a later time. For more details, refer to the Object server and diskfile source code files.

Object Replication

Replicating objects is necessary to guarantee the eventual consistency guarantee of Openstack Swift. Swift's object replicators on each storage server compare their partitions with the other (N-1) replica partitions via the background replicator process. For each partition, the replicator process sends requests to the other storage nodes storing that partition to send Merkle tree hashes of their objects stored in their partition directory. This data structure allows for quick identification of differing objects in the partition replicas. Subsequently rsync is used to synchronize replicas. For more details refer to the object replicator source code.

Scrub

Objects are periodically scrubbed to check for bit rot. Swift implements a periodic disk scrub in the background by computing checksums of stored file objects and comparing them with stored checksums. This process identifies any data corruption (due to disk bit rot for example) and suitably addresses the errors by creating more copies from the other replicas of the data object. The metadata stored for each object contains the (MD5) hash of the object's data. The auditor process(es) running on each storage node cycle through all partition directories containing objects, compute the MD5 hashes of each object and compare them to the stored checksums. Mismatches indicate corrupted object data, this is quarantined and a subsequent replication run restores the object's data from the other replicas. For more details refer to the object auditor source code.

Other Goodies

There are several other convenience features built into Openstack Swift, most of which are beyond the basic Dynamo design. For example there is the ability to specify object versions in create requests (which essentially results in different objects being created for different versions). Time bound objects, which are automatically deleted after a certain interval are also provided (as described in expirer.py). Extremely large objects (over 5 GB) are internally divided and stored as smaller objects. The Openstack Swift implementation is based on modular WSGI pipelines, which allows pipeline-style addition and removal of custom components while processing Swift requests and responses. For example, the Amazon S3 object interface can be enabled by installing an additional component into the processing pipeline in proxy nodes.

Swift also provides automatic disaster recovery features by giving operators the ability to asynchronously replicate objects across remote data centres. Read and write affinity features (as described in the proxy server source code file server.py) ensure that data is accessed from/written to the nearest data centers from clients if possible.

Corner Cases

There are a few corner cases where Openstack Swift may not yield great performance. These are interesting to discuss here.

Very Small files

One of the side-effects of using the hash-based rings to store data in Openstack Swift clusters on any of the 100s or 1000s of partitions on a storage node is that consecutive write operations on storage nodes are neither spatially nor temporally correlated. This means that one write operation will most likely be in a different partition (directory) than the previous write. This poses a challenge when Swift is used to store many small files because caching the XFS inodes and dentrys becomes ineffective. To appreciate the issue here, consider this example of the directory layout on a storage node

Example: An Openstack swift object storage directory with an object directory containing the object's .data file

/srv/1/node/sdb1/objects/717/89f/b359508efc4d39b0d22efde69fc7f89f/1382112651.23154.data

Breaking down the directory paths below:

/srv/1/node/sdb1/objects: This is the object directory where all objects stored on sdb1 device are stored.
/717: This is the partition
/89f: This is the hash_prefix of the object
/b359508efc4d39b0d22efde69fc7f89f: This is the directory with name = name_hash of the object
1082112651.23154.data:This is the actual data file containing the object

Each time an object is written to this storage node the inode and dentry cache needs to access a random entry down from the partition level. The only practical method to ensure fast inode metadata operations is to make sure that the memory can fit the whole inode and dentry cache. Also consider that storage nodes usually contain multiple large capacity disks, each containing a XFS filesystem and the associated inodes and dentrys. All these caches should ideally fit in memory for fast metadata operations! Given a storage node with 10s of TB storage capacity can store 100s of millions of small objects (say of 10-100kB size) , the memory requirement of each storage node becomes quite large if the inode and dentry caches need to be fully stored in memory. There is a good discussion of this issue here.

It is to be noted that this issue of small files is not unique to Openstack Swift. For example, Ceph, another distributed file system that can be used as an objectstore, stores each individual object as a file on the underlying filesystem (which is usually XFS on production Ceph systems). Many small files stored as objects in Ceph may cause similar issues.

Very Large files

Reading or writing speeds for extremely large files (e.g. several GB) are limited by single spindle speeds in Openstack Swift because objects are not striped across disk spindles. These "elephants" may also slow down read and write operations for other objects being stored on the same partitions, (since all the spindles across all the storage nodes that store the elephants will be busy at the same time serving the elephant request). However, the randomization in partition-to-object mapping makes such situations rare, especially if adequate number of spindles and partitions are provisioned in the Openstack Swift deployment.

OUTLOOK

Hopefully this article has supplemented your knowledge about Openstack Swift and encouraged you to look at the (very accessible) Swift source code to find exact answers of any questions you may have about it. In addition, Openstack Swift is remarkably well documented. Together, the source code and documentation unambiguously answer almost any question about how Openstack Swift works.

In the next (and last) part of this series we'll look at Cassandra, another very popular data store based on the ideas of Dynamo.

Multi Codec Jukebox

Tuesday, November 19, 2013

Dynamo, Swift Objectstore, Cassandra - Part 2: Openstack Swift Objectstore