IEEE Internet Computing 2001

The Evolving Field of Distributed Storage

Peter N. Yianilos (guest editor) and Sumeet Sobti

Abstract: Memory is a fundamental commodity of computation. Storage systems provide memory that costs far less than RAM and has greater persistence, but suffers from much lower data transfer bandwidths and higher access latencies. These attributes -- cost, persistence, bandwidth, and latency -- are the traditional evaluation metrics for storage systems, but the remarkable growth of communications and networking over the past few decades has complicated this simple picture.

Today the network is an integral part of the computer. Most of us routinely access Web pages whose display requires fetching data from dozens of machines around the world. The Web is the first distributed storage system to have such an immediate global impact. It illustrates the technological, economic, and cultural power of a distributed approach. However, the Web's fragility and operational semantics prevent it from addressing the storage problems of mainstream data processing. For example, error messages or a suspended display are common when a network or system component fails somewhere, making a page or one of its elements inaccessible. Also, informal caching on the Web makes it hard to be certain that you're viewing current information.

A simple form of distributed file storage is widespread now, as many of us routinely and transparently access files stored somewhere else on a local area network. Between LANs and the World Wide Web lies the domain of distributed enterprise-wide storage, an area that industry is now actively developing.

In this increasingly complex and demanding world of distributed storage, we are forced to consider new metrics and issues beyond the traditional set. These include shared coherent access, availability, survivability, security, interoperability, search, caching, load balancing, and scale -- the need for storage systems of truly immense proportions. Indeed, our increased appetite for storage has also engendered another design issue: the need to largely automate the now human-intensive task of managing large storage systems. Finally, an undercurrent in the flow of ideas concerns the cultural issues of privacy and anonymity in the context of distributed storage.

The builders of distributed storage systems face many architectural decisions as they work toward their targets among these metrics and issues. The most basic of these is the question, "Who's doing the work of providing storage services?"

Hierarchical approaches, including the new trend toward storage virtualization, use layers of control and abstraction to stitch together distributed and disparate storage providers into a single virtual whole. In the peer-to-peer approach, the clients themselves provide storage for everyone. There is no need for any server in the traditional sense. In the ideal case, such systems are fully symmetrical with no fixed central leader. The server-to-server approach is related to peer-to-peer, but here many servers work together in a symmetrical way to provide storage services; clients need not install any new software to consume basic storage services. This broad categorization includes much work in the established field of distributed file systems.
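The symmetry at the heart of the peer-to-peer approach can be made concrete with a toy sketch. In the hypothetical code below (an illustration, not any particular system), every node is simultaneously a client and a storage server: a deterministic hash maps each key to exactly one peer, so any node can locate data without a central index or fixed leader. Real systems must also handle replication, membership changes, and failures, which this sketch omits.

```python
import hashlib

class PeerNode:
    """Toy peer-to-peer storage node (hypothetical illustration):
    every participant is both client and server, with no central
    coordinator."""

    def __init__(self, name, peers):
        self.name = name
        self.peers = peers   # shared list of all nodes, including self
        self.store = {}      # this node's fragment of the global store

    def _owner(self, key):
        # A deterministic hash maps each key to exactly one peer,
        # so any node can locate data without asking a central server.
        digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return self.peers[digest % len(self.peers)]

    def put(self, key, value):
        self._owner(key).store[key] = value

    def get(self, key):
        return self._owner(key).store.get(key)

# Three symmetric nodes sharing one membership list: a put issued
# through any node is visible through every other node.
peers = []
for name in ("a", "b", "c"):
    peers.append(PeerNode(name, peers))

peers[0].put("report.txt", b"quarterly numbers")
assert peers[2].get("report.txt") == b"quarterly numbers"
```

Note the contrast with the hierarchical approach: there, the mapping from keys to storage providers lives in a layer of control above the providers, whereas here it is recomputed identically by every peer.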