OVERVIEW
Panasas is a leading provider of
storage for large scale, high performance systems. Our customers
depend on high performance computing systems to solve demanding
problems in energy exploration, financial analysis, climate
modeling, computational fluid dynamics, manufacturing design,
digital animation, computational physics, higher education, and
many similar applications. A critical component of their high
performance computing systems is the Panasas storage system, which
lets thousands of compute nodes, organized into one or more clusters
that communicate via high speed networks, manipulate large datasets.
Without the right storage system, their investment in computational
power and network infrastructure will be underutilized, either
because of performance limitations or because of downtime due to
reliability issues. Our customers choose Panasas because they know
they can solve their large problems while relying on our equipment.
Many of our customers are preparing for a future that involves very
large scale computations that require high performance access to
petabytes of storage. This paper explains the elements of the
Panasas system that are designed to handle very large scales. Our
recent paper at the 2008 FAST conference provides a technical
overview of the Panasas system. Our workshop paper at SC07 provides
more background on our internal distributed system platform.
Earlier work presented at SC04 describes our approach to high
performance file-based RAID.
SCALABILITY IN THE PANASAS ARCHITECTURE
The elements of the system that provide scalability include:
• A distributed system platform that manages the rest of the
system.
• Distributed block management using the object storage
protocol.
• Distributed metadata management with a global namespace.
• Per-file RAID protection.
• Declustered RAID for scalable reconstruction performance.
• Fully redundant hardware and software with automatic fault
handling.
The Panasas system is based on a distributed system platform that
provides a scalable framework for managing a large collection of
software and hardware components. Part of this is a common platform
layer that includes the base operating system, a local process
monitor, a local hardware agent, and a message passing agent that
communicates with a global cluster manager. The cluster manager is a
replicated service that uses a quorum-based voting protocol to make
decisions and maintain a replicated copy of the overall system
state. The cluster manager keeps track of services and hardware
components, starting and stopping services as necessary, monitoring
their state and the state of the hardware, and reacting to faults
and changes in the environment.
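To illustrate the quorum idea behind a replicated cluster manager,
here is a minimal sketch of majority voting. This paper does not
describe the actual protocol, so the simple majority rule, the
function names (quorum_size, decide), and the replica names are
assumptions made for illustration only.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return replicas // 2 + 1


def decide(votes: dict[str, bool], replicas: int) -> bool:
    """Accept a proposed state change only if a majority of ALL replicas
    (not just of those that responded) voted yes."""
    yes = sum(1 for v in votes.values() if v)
    return yes >= quorum_size(replicas)


# Five hypothetical cluster manager replicas; two are unreachable.
votes = {"cm1": True, "cm2": True, "cm3": True}
print(decide(votes, replicas=5))  # True: 3 of 5 is a majority
```

Counting against the full replica set, and not only the responders,
is what prevents two disconnected halves of a cluster from both
accepting conflicting changes.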
The Panasas file system is an application hosted by the distributed
system platform. The separation of the file system from the overall
cluster management means that the file system protocols can be
optimized for performance while the management system is optimized
for robustness. The system architecture allows for clean integration
of other services such as backup agents, replication agents, and
more.
Block management is a fundamental aspect of any storage system. The
Panasas system delegates block management to StorageBlades that
export an Object Storage Device (OSD) interface. Higher levels of
the file system manage objects that are containers for data and
attributes, and the StorageBlades implement the object abstraction,
which involves traditional block management. Each StorageBlade is a
balanced component that has disks, a network interface, a
processor, and memory.
As storage capacity scales up, the computing resources needed to
manage the storage and provide high bandwidth access are
automatically scaled up at the same time. Files are striped across
objects on different StorageBlades, so that even a single I/O
stream benefits from distributed block allocation.
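The effect of striping on a single I/O stream can be sketched with a
simple round-robin mapping from file offset to component object. The
stripe unit size, the stripe width, and the function name locate are
assumed values for illustration; the paper does not state the actual
layout parameters, and parity placement is left out here.

```python
def locate(offset: int, stripe_unit: int, num_objects: int) -> tuple[int, int]:
    """Map a file byte offset to (object index, byte offset within that
    object) for simple round-robin striping."""
    stripe_no = offset // stripe_unit      # index of the stripe unit in the file
    obj = stripe_no % num_objects          # object that holds this unit
    row = stripe_no // num_objects         # full rows written before this unit
    return obj, row * stripe_unit + offset % stripe_unit


STRIPE_UNIT = 64 * 1024   # assumed value, not from the paper
NUM_OBJECTS = 8           # assumed stripe width, one object per StorageBlade

# A 1 MiB sequential read touches every object, so a single stream
# draws on all of the StorageBlades that hold the file.
touched = {locate(off, STRIPE_UNIT, NUM_OBJECTS)[0]
           for off in range(0, 1024 * 1024, STRIPE_UNIT)}
print(sorted(touched))
```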
Metadata management has two aspects of distribution. The first is
that multiple metadata management services control different parts
of the file system namespace. These run on different DirectorBlades
so the system can be scaled up to harness the power of many
DirectorBlades.
File system clients are responsible for
generating redundant data, and they transmit data and parity in
parallel to the StorageBlades. This provides a natural scaling in
RAID performance as the number of file system clients increases. A
unique property of the Panasas system is that clients can verify
the RAID equation during reads to provide true end-to-end data
integrity checking. In addition, write performance remains very
close to read performance as the system scales up, in contrast to
traditional RAID controllers that pay a substantial write
performance penalty in redundant configurations. Panasas data is
fully protected in highly available configurations without
compromising write performance.
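Client-side parity generation and read-time verification can be
sketched as follows. The paper does not spell out the RAID equation,
so this example assumes single XOR parity (as in RAID 5); the
function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_units: list[bytes]) -> list[bytes]:
    """Client side on write: compute parity and return every unit that
    would be sent out in parallel (data units followed by parity)."""
    return data_units + [xor_blocks(data_units)]


def verify_stripe(stripe: list[bytes]) -> bool:
    """Client side on read: data XOR parity must be all zero bytes."""
    return not any(xor_blocks(stripe))


stripe = write_stripe([b"abcd", b"efgh", b"ijkl"])
print(verify_stripe(stripe))        # True

corrupted = [b"abcX"] + stripe[1:]
print(verify_stripe(corrupted))     # False
```

Under the same assumption, a lost unit is recovered by XORing the
surviving units of the stripe, for example xor_blocks(stripe[1:])
yields the first data unit again.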
The system handles very large numbers of small files as well as it
handles smaller numbers of very large files. Small files start out
mirrored on two StorageBlades so they are cheap to create, have low
space overhead, are efficient to write with small I/Os, and are
quick to rebuild after failures. These files are automatically
converted to widely striped files as they grow in size to optimize
bandwidth and reduce parity overhead. The memory on each
StorageBlade is used to cache hundreds of thousands of object
descriptors, as well as data, in order to optimize access to a hot
working set of files. That working set could be a small number of
large files that are shared by a single computation and spread out
over many StorageBlades, or very large numbers of relatively small
files used by many concurrently running independent applications.
The system scales its resources naturally to handle either kind of
workload.
The per-file RAID approach is exploited to provide scalable RAID
rebuild. Parity groups are declustered (i.e., spread out) among the
StorageBlades, and DirectorBlades distribute the rebuild work on a
fine-grained, per-file basis. Thus the system naturally harnesses
the power of many disks, many network interfaces, and many computer
systems to tackle the critically important problem of RAID rebuild.
The result is a parallel RAID rebuild system whose rebuild
performance scales up in larger storage systems.
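A toy simulation can show why declustering spreads rebuild work. The
placement policy (pseudo-random blade selection per file), the
round-robin assignment of files to DirectorBlades, and all of the
sizes below are assumptions for illustration, not the Panasas
algorithms.

```python
import random
from collections import Counter

NUM_BLADES = 20       # assumed values for illustration
GROUP_WIDTH = 5
NUM_FILES = 1000
NUM_DIRECTORS = 4

rng = random.Random(0)
# Each file's parity group lives on its own pseudo-random set of blades.
placement = {f: rng.sample(range(NUM_BLADES), GROUP_WIDTH)
             for f in range(NUM_FILES)}

failed = 7
affected = [f for f, blades in placement.items() if failed in blades]

# Surviving blades that must be read to rebuild the affected files.
read_load = Counter(b for f in affected for b in placement[f] if b != failed)

# Hand out rebuild work file by file across the DirectorBlades.
work = {d: affected[d::NUM_DIRECTORS] for d in range(NUM_DIRECTORS)}

print(len(affected), "files to rebuild")
print(len(read_load), "surviving blades share the read load")
print([len(w) for w in work.values()], "files per DirectorBlade")
```

Because each file has its own parity group, the reads needed for
rebuild land on many surviving blades instead of on the few disks of
one fixed RAID set, and the per-file work items divide evenly among
the rebuilders.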
ROADMAP TO PETASCALE
Today our largest single system is a 2 petabyte system at Los
Alamos National Laboratory for the Roadrunner supercomputer. This
system is built from 1000 StorageBlades that each have two 1 TB
drives, a processor, memory, and a 1GE interface. There are also 100
DirectorBlades that provide metadata management and RAID rebuild.
The blades are housed in a 4U chassis that holds 11 blades. Each
chassis has redundant 10GE connections to the LANL scalable network
infrastructure. Each blade has two NICs routed through two different
switch modules. The chassis has dual redundant power supplies and a
battery that runs the system for several minutes, which is long
enough to gracefully flush data to disk in the event of AC power
loss.
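The figures quoted for this system can be cross-checked with a few
lines of arithmetic. The calculation assumes decimal units (1 PB =
1000 TB), counts raw capacity before any RAID overhead, and assumes
every chassis is fully populated; none of these assumptions is
stated in the text.

```python
STORAGE_BLADES = 1000
DIRECTOR_BLADES = 100
DRIVES_PER_BLADE = 2
TB_PER_DRIVE = 1
BLADES_PER_CHASSIS = 11

raw_tb = STORAGE_BLADES * DRIVES_PER_BLADE * TB_PER_DRIVE
raw_pb = raw_tb / 1000                  # decimal units: 1 PB = 1000 TB

total_blades = STORAGE_BLADES + DIRECTOR_BLADES
# Ceiling division: a partly filled chassis still counts as one chassis.
chassis = -(-total_blades // BLADES_PER_CHASSIS)

print(raw_pb, "PB raw across", STORAGE_BLADES * DRIVES_PER_BLADE, "drives")
print(total_blades, "blades in at least", chassis, "chassis")
```

The blade counts are consistent with the chassis size: 1100 blades
fill exactly 100 chassis of 11 blades each.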
The next largest system at LANL is about half that size and has
been in production for over two years. It is a shared storage
cluster accessed by three different compute clusters (TLCC,
Lightning, and Viewmaster). Commercial installations of our product
typically range in size from 100 StorageBlades to 200
StorageBlades, and we have one commercial customer that has 500
StorageBlades in one system. The commercial systems are all used in
demanding 24x7 environments where they are shared by hundreds or
thousands of compute servers that run a wide variety of
applications. Our smallest configuration is 10 StorageBlades and 1
DirectorBlade in a single chassis, and it is easy enough to manage
that it is deployed on boats that collect seismic data for oil
exploration.
Our blade chassis has a potential throughput of up to 2 GB/sec,
assuming both network switches and all blade NICs are fully
utilized. Our current blades can generate over 600 MB/sec from disk
out to file system clients. We plan to boost blade performance by
the end of 2009. By bonding the two network switches and making
further blade improvements, we plan to reach the 2 GB/sec mark by
2011.
We are introducing a multicore hardware platform in 2010 that
couples a high performance server with more drives. This will be a
larger building block that will allow us to scale the storage system
to many petabytes without having to scale the number of computer
systems we use to manage the storage. This platform gives us the
flexibility to provide very large pools of storage (many petabytes)
either with the same high level of performance as our blades or
with reduced performance in order to lower the cost of the system.
Both hardware platforms will use the same file system architecture
and can co-exist within the same file system. Our existing data
migration facilities will allow online migration of data between
different storage pool classes.
Our largest systems today harness over 1000 computer systems to
provide very high performance access to 2 petabytes of data. We will
be able to use the same distributed system architecture to harness
1000 computer systems that are much more powerful than our current
blades, and that manage one or two orders of magnitude more disks
than our current systems.
CONCLUSION
The road to reliable, high performance, petascale systems starts
with a system foundation designed to support large numbers of
hardware components and software services. The distributed system
platform within the Panasas system is that foundation. Success on
that road comes from experience gained through larger and larger
deployments. The Panasas system has been in production for several
years in a variety of demanding commercial and scientific
environments. We have been able to refine our approach and improve
our internal software architecture based on that experience. While
it is easy to focus on performance, the real key to customer loyalty
is an emphasis on stability and reliability at scale. Performance
will follow naturally from advances in hardware technology. Panasas
has proven its ability to organize large numbers of hardware and
software components to reliably support petabytes of storage in a
single, high performance system. We are ready to apply our
architecture and experience to support much larger deployments as
our customers tackle ever larger problems.