Introduction
Today, with the explosion of digital content, the Internet is seen
as the ultimate content repository. Consumers regularly use it to
upload and share picture files with friends and family. From digital
cameras to cameras on mobile phones, the number of devices available
to consumers for capturing images is growing tremendously. As a
result, more digital data is being created than ever before, and
numerous studies have documented and projected this growth.
In 2008 alone, several hundred billion images were captured by
digital cameras, accounting for an exabyte (1 billion gigabytes) of
data. Not only has the number of devices increased, but growing
image resolution on digital cameras has, over the past two years
alone, doubled or even tripled the size of the images captured,
enlarging today’s digital footprint.
Consumer cameras are not the only things generating large amounts
of digital content. Enterprises are dealing with back office
applications that generate data such as text files, emails,
spreadsheets, word processing documents, and PDF files, and medical
imaging systems produce high resolution images ranging from several
hundred megabytes to gigabytes in size. All of these contribute to
the growth of digital data.
This type of content is known as static media: data that does not
change over time. The volume of static media is exploding, creating
a pressing need for newer technologies that handle the data growth
efficiently. This whitepaper outlines the challenges seen with the
growth of static media and presents an alternative approach to
hosting and serving static images using the HP Scalable NAS solution.
Static Media—Challenges
Static Media
presents a unique challenge to storage providers because of the
sheer number of objects that are stored, the varying size of files,
and the inconsistent data access pattern from the underlying
storage.
Using the Web as a transport medium for accessing data at any time
has greatly raised what users expect of online service providers.
The increasing number of social networking sites, and the blurring
lines between pure video sites and social networking sites offering
similar services, have increasingly contributed to user generated
content, which is no longer confined to text blogs and shared
images. As the Web medium becomes more powerful, expectations grow
for easier data access and collaborative data sharing. Applications
called “Mashups” (for more information see Appendix), which combine
data from more than one source to provide a unique set of solutions,
are becoming popular and are extending their services through
various Web services with HTTP as the key content access protocol.
Applications that produce and consume static data fall into two
groups: those that provide access to static data over the Web, and
those that are traditional to enterprises.
It is important to note that not all applications that provide
access to data over the Web are Web 2.0 applications.
For example, an online photo sharing site that allows consumers to
upload and edit their pictures and share them with friends and
family is a classic Web 2.0 application. On the other hand, a
journalist working on articles for a news portal might access
content publishing data on servers over the Web and perform edits
before the content is moved to the online news Web servers. Even
though this operation might use Web-based data access protocols,
such as HTTP for content access, it is very different from a
typical Web 2.0 application.
Storage challenges
Purchasing storage hardware that is less powerful but sufficient for
today’s needs puts stress on the business whenever data access is
seasonal or unpredictable. Such access is very common for
Internet-based portals, where performance requirements can far
exceed the capabilities of the deployed system.
In order to provide an enhanced quality of service and to reduce
the churn rate, it is imperative to have an infrastructure that is
resilient, powerful, reliable, and scalable to grow with the
business.
In the next few sections we will examine the traditional
architecture for Static Media that exists today and its
shortcomings. We will discuss how the next generation scalable
storage architectures overcome the issues and provide an elegant
solution.
Business example
When the business model is
primarily based on user generated content, there is no control over
how much data flows into the environment. When the data access
pattern and the content flow are highly volatile, it makes it
extremely difficult to design a system to cater to the varying load
conditions.
For example, for an online photo sharing site with many millions of
picture uploads and views per day, the load on the system is very
unpredictable and volatile. At any given time, the number of user
image uploads could go from a few thousand to many millions, and
this type of activity appears to be very seasonal in nature. Here,
consumers upload photographs at varying picture
resolutions which are ultimately stored at the service provider’s
location. So as far as the consumer is concerned, the image is
stored somewhere in the cloud and is accessible when requested. But
from a service provider’s perspective, this is a bit more
challenging. For every image that is uploaded, there are several
other images that are synthetically created and stored in the
storage infrastructure. In the case of an online photo sharing
company, thumbnail images and low resolution pictures are often
created and stored along with the high resolution images uploaded
by the user. For every million images uploaded, the system stores
two million more images in the form of thumbnail and viewing
quality images.
The same is true for a service provider offering a music download
service, where a thumbnail or low resolution album image is
typically stored alongside the audio file.
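The arithmetic in the photo sharing example can be sketched as follows. This is a minimal illustration, not the sizing method of any particular provider: the two derived images per upload come from the example above, while the function names and the per-image sizes are hypothetical values chosen only to make the calculation concrete.

```python
# Sketch of the upload arithmetic: each uploaded image also produces
# derived images (a thumbnail and a viewing quality copy).
DERIVED_PER_UPLOAD = 2  # from the example: thumbnail + viewing quality image

# Hypothetical average sizes in bytes; assumptions, not measurements.
ASSUMED_SIZES = {
    "original": 3_000_000,
    "viewing": 200_000,
    "thumbnail": 10_000,
}


def stored_objects(uploads: int, derived_per_upload: int = DERIVED_PER_UPLOAD) -> int:
    """Return the number of objects stored for a given number of uploads."""
    return uploads * (1 + derived_per_upload)


def stored_bytes(uploads: int, sizes: dict = ASSUMED_SIZES) -> int:
    """Return total bytes stored if every upload yields one file of each kind."""
    return uploads * sum(sizes.values())


uploads = 1_000_000
print(stored_objects(uploads), "objects stored")
print(stored_bytes(uploads) / 1e12, "TB stored under the assumed sizes")
```

Under these assumed sizes the object count triples while capacity grows far less, which suggests that derived images add pressure mainly through the number of files rather than through raw capacity.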
Long Tail problem
One of the classic challenges of Web 2.0-based models is dealing
with “Long Tail Content”: images that may be accessed only rarely
but must still be stored reliably and served fast when requested.
None of the data can sit on an offline device; all of it must be on
online disk for rapid retrieval and serving. Yet storing all of the
data on expensive online storage makes for a costly and therefore
inefficient business model. Where economies of scale are a primary
criterion, technology such as HP Scalable NAS is at the core of the
solution. The benefits of Scalable NAS are addressed later in the
paper.
Depending on the business model and the type of service offered by
the online service provider, the content popularity distribution
curve varies. For a typical social networking site or a
collaborative data sharing site, some of the data is more popular
than the rest. Popular data is accessed more often than the
remaining content, which is termed “Long Tail Content”. The result
is a non-uniform data access pattern that places a very different
stress on the system.
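To make the non-uniform access pattern concrete, the sketch below models content popularity with a Zipf-like curve, in which the object at popularity rank r is requested in proportion to 1/r. The paper does not say which distribution applies; the Zipf shape, the exponent, the object count, and the function names are all assumptions made for illustration.

```python
# Sketch: share of requests that go to the most popular objects,
# ASSUMING a Zipf-like popularity curve (weight of rank r is 1 / r**exponent).


def zipf_weights(n_objects: int, exponent: float = 1.0) -> list:
    """Return normalized request probabilities, most popular object first."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n_objects + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def head_share(weights: list, head_fraction: float) -> float:
    """Return the fraction of requests served by the top head_fraction of objects."""
    head_count = max(1, int(len(weights) * head_fraction))
    return sum(weights[:head_count])


weights = zipf_weights(100_000)      # hypothetical catalogue size
share = head_share(weights, 0.01)    # the most popular 1% of objects
print(f"Top 1% of objects receive {share:.0%} of requests")
print(f"The remaining 99% (the long tail) receive {1 - share:.0%}")
```

Under this assumed curve a small head of the catalogue draws most of the requests, yet the tail still receives a substantial share in aggregate, which is why it cannot be moved to offline storage.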
Key challenges for an infrastructure provider
Ultimately, the challenges facing today’s system administrator at a
service provider of online data, or at a company that processes a
large amount of static data content, are:
Scalability: Implement an infrastructure that can dynamically meet
changing requirements and provision capacity for ever growing data.
Availability: Deliver the famous “five nines” of system reliability,
keeping the systems functioning even with multiple simultaneous
component failures and running while the infrastructure is being
upgraded.
Affordability: Build an infrastructure that is robust and resilient
yet requires a low investment.
Manageability: Manage petabytes of data with a limited number of IT
staff.
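The “five nines” target in the availability item corresponds to 99.999% uptime. As a rough illustration of what that allows, the sketch below converts a number of nines into a yearly downtime budget. The function name is hypothetical and a 365-day year is assumed.

```python
# Sketch: yearly downtime budget implied by an availability of N nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Return the minutes of downtime per year allowed by `nines` nines."""
    unavailability = 10.0 ** (-nines)  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for nines in (3, 4, 5):
    print(f"{nines} nines: {allowed_downtime_minutes(nines):.2f} minutes per year")
```

Five nines leaves a budget of a little over five minutes per year, which has to cover both failures and upgrades.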
Traditional architecture
With digital media assets ballooning in size and number, the only
way to accommodate growth with traditional NAS systems is to add
more storage space to the system. This might seem like a reasonable
approach at first, but conventional single headed NAS systems suffer
when the extra capacity taxes their processors: the same set of
processors on the NAS head must drive the extra capacity, which
ultimately degrades performance.
Hence single headed monolithic NAS systems come with a certain
capacity limit. Beyond it, there are two reliable ways to expand.
One is to replace the smaller NAS systems with bigger systems that
have larger capacity and compute power, which means a forklift
upgrade and data migration. The other is to add more of the same NAS
systems, which creates islands of storage that need to be managed
separately. In addition, the NAS filers need to be paired for High
Availability, and this results in high priced, poorly utilized, and
highly complex systems.
Conventional NAS systems were simply not designed for environments
that deal with exponential growth in unstructured data.
There are several drawbacks to this traditional NAS solution:
Scalability limits: Each NAS filer forms its own island of storage
and file systems, which makes the environment complex to deal with
and leaves systems unbalanced and underutilized.
Namespace overhead: Multiple file systems result in multiple
namespaces, which are extremely complex to manage. Again referring
to the picture above, the application servers that use the
underlying NAS systems need to mount a new file system every time a
new NAS system is added.
Management complexity: Multiple, disconnected namespaces complicate
the management of file systems and shares. Managing the NFS/CIFS
mounts, and diagnosing and fixing the performance problems caused by
the resulting mount storms (if applications are designed to
dynamically mount the file shares), can be a very painful exercise
for the system administrator. This problem is magnified if such a
solution is deployed, for example, at online photo sharing sites
where end users constantly upload and view images. For a classical
Web 2.0 solution deployment it is very inefficient if the system
underutilizes server processing power, capacity, or both. Also,
managing data spread over several islands of storage is a formidable
task.
High-cost solution: One of the traits of a non-scalable system is
the “scalability through copy” mechanism: multiple copies of the
same information are made in order to provide the performance needed
to meet the Service Level Agreements (SLAs). More storage is then
needed to store the same amount of information, resulting in a very
high cost solution.
In addition, scaling out the NAS filer introduces new manageability
and availability issues. By adding more NAS filers, an
administrator can reduce the performance bottleneck. However, the
administrator must partition and redistribute the data among the
NAS filers. If this data is growing and changing
regularly, as is the case for many websites and Web server log
files, the administrator must continually partition the data and
ensure that there is ample amount of space on each NAS filer and
that no one NAS filer is bearing an overwhelming amount of the
load. Managing these data partitioning and load issues can be
complex and cumbersome.
Moreover, as NAS filers are added, the overall availability of the
system decreases. Because data is partitioned among the NAS filers
rather than distributed across them, a failure of one NAS device can
bring the entire system down. The probability that at least one
filer fails increases as more NAS filers are added to the system.
Hence, the Mean Time Between Failures (MTBF) of the combined NAS
system becomes analogous to that of striping without mirroring.
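The effect of adding partitioned filers on availability can be sketched with a simple series model, in which the system is up only when every filer is up. The model assumes that filers fail independently, which the paper does not state, and the 99.9% per-filer availability and the function names are hypothetical.

```python
# Sketch: availability of N partitioned filers, ASSUMING independent failures
# and that the loss of any one filer takes the whole system down.


def system_availability(filer_availability: float, n_filers: int) -> float:
    """Return the probability that all n_filers are up at the same time."""
    return filer_availability ** n_filers


def prob_any_failure(filer_availability: float, n_filers: int) -> float:
    """Return the probability that at least one of n_filers is down."""
    return 1.0 - system_availability(filer_availability, n_filers)


PER_FILER = 0.999  # hypothetical availability of a single filer
for n in (1, 2, 5, 10, 20):
    print(f"{n:2d} filers: system availability {system_availability(PER_FILER, n):.4f}")
```

Each added filer multiplies in another chance of failure, so the system as a whole is always less available than its least available member, much as a stripe set without mirroring is less reliable than a single disk.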
The bottom line is that managing multiple petabytes of information
cost effectively requires a new breed of systems that departs
dramatically from the conventional architecture. This is the idea
behind the HP Scalable NAS architecture, which solves the
performance and capacity scaling issues that seem insurmountable for
conventional NAS systems.
Scale-Out Architecture for Static Media
The HP Scalable NAS solution, with its multi-headed symmetrical data
access architecture, addresses the common symptoms found in
environments with enormous data growth. Limited scalability of both
capacity and performance is one of the most common of these
symptoms, hence the need for a scalable architecture that provides a
zero-downtime platform on which the business can grow.
With a Scale-Out and Clustered file system approach there is a
single pool of storage which is accessed in parallel by various
nodes in a cluster. The nodes work together to form a cohesive unit
to provide concurrent access to data. Because all of the resources
are stored in a single repository, no one node is taxed while a
particular JPEG image file or an audio file is accessed from the
system.
Each node is equipped with a cache that is kept coherent and
consistent across the cluster nodes.
The benefits of this architecture are:
Shared storage: The HP Scalable NAS solution can be attached to
storage with different levels of performance to minimize the
overall cost of the solution. For instance, for an online photo
sharing site, the high resolution images uploaded by the user are
used only for value-added services such as image printing, calendar
services, and others, whereas the low resolution images and
thumbnails are used whenever a user requests images for viewing. It
is therefore important to keep the low resolution images and
thumbnails on faster disk storage, which is critical in serving
online customer requests. Hence, it makes sense to store the high
resolution images on a capacity optimized system, such as an HP
StorageWorks Modular Smart Array (MSA) fronted by an HP StorageWorks
EFS Clustered Gateway or an HP StorageWorks 9100 Extreme Data
Storage System (ExDS9100), and the low resolution and thumbnail
images on a faster, performance optimized system, such as an HP
StorageWorks Enterprise Virtual Array (EVA) paired with the HP
StorageWorks EFS Clustered Gateway. HP Scalable NAS solutions such
as those presented above allow customers to build a cluster made of
heterogeneous storage with different classes of service.
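The placement rule described for the photo sharing example can be sketched as a small routing function. This is an application-level illustration only, not an HP Scalable NAS interface: the tier names, the directory roots, and the function names are hypothetical.

```python
# Sketch: route each image kind to a storage tier, following the rule that
# originals go to capacity optimized storage and viewing copies and
# thumbnails go to performance optimized storage.
from enum import Enum


class ImageKind(Enum):
    ORIGINAL = "original"    # high resolution upload, used for printing etc.
    VIEWING = "viewing"      # low resolution copy served for viewing
    THUMBNAIL = "thumbnail"  # thumbnail served for viewing


# Hypothetical directory roots, one per class of storage.
TIER_ROOTS = {
    "capacity": "/cluster/capacity",
    "performance": "/cluster/performance",
}


def choose_tier(kind: ImageKind) -> str:
    """Return the tier name for an image kind."""
    return "capacity" if kind is ImageKind.ORIGINAL else "performance"


def storage_path(user_id: str, image_id: str, kind: ImageKind) -> str:
    """Return the path at which an image of the given kind would be stored."""
    root = TIER_ROOTS[choose_tier(kind)]
    return f"{root}/{user_id}/{image_id}_{kind.value}.jpg"


for kind in ImageKind:
    print(kind.value, "->", storage_path("user17", "img0042", kind))
```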
Global namespace: The file system storing the static data is
mounted on every node in the cluster. Data access is symmetric and
parallel. The application server or Web server can issue a store or
retrieve request to any of the cluster nodes, as they are all peer
nodes. Management is greatly simplified because the nodes are
managed through a single dashboard as one unit rather than as
individual units.
Load balanced data access: An external load balancer such as DNS
round robin or a hardware load balancer switch can be deployed to
balance the Application Server connections to the cluster nodes to
ensure that no one node is overloaded.
A new node can be added to improve the overall performance of the
cluster. This is a critical aspect of the architecture given the
spiky data access patterns that are typical of online content
repository Web models.
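The round robin idea behind this kind of load balancing can be sketched in a few lines. The sketch is an in-process toy, not DNS round robin or a hardware load balancer, and the class name and node names are hypothetical. It also shows a node being added while requests are being distributed.

```python
# Sketch: round robin selection over peer cluster nodes, with the ability
# to add a node at run time.
from collections import Counter


class RoundRobinBalancer:
    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._counter = 0  # number of selections made so far

    def add_node(self, node: str) -> None:
        """Add a node; it joins the rotation on subsequent picks."""
        self._nodes.append(node)

    def pick(self) -> str:
        """Return the next node in rotation."""
        node = self._nodes[self._counter % len(self._nodes)]
        self._counter += 1
        return node


balancer = RoundRobinBalancer(["node1", "node2"])
before = Counter(balancer.pick() for _ in range(6))   # two nodes share the load
balancer.add_node("node3")
after = Counter(balancer.pick() for _ in range(6))    # three nodes share the load
print("before adding a node:", dict(before))
print("after adding a node: ", dict(after))
```

Because every node mounts the same file system, any node the balancer selects can serve any request, which is what makes such a simple policy workable here.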
Data access protocols: Most Web applications work on a whole file
instead of manipulating portions of the file while other
applications such as geo-mapping applications might need to work on
certain portions within a file.
In most online Web applications, the end user uploads a file, which
could be an image, audio, or video file, and the file remains
unaltered throughout its lifecycle. It is read many times but never
changed. For such data access patterns, standard NAS protocols such
as NFS and CIFS are overkill, and HTTP, which suits whole file
access and provides a simple interface for data access, is the
dominant protocol.
The HP Scalable NAS platform supports multiple protocols, namely
NFS, CIFS, and HTTP, for applications that need to manipulate static
data.
Self healing and self managing: There is No Single Point of Failure
(NSPOF) within the Scalable NAS cluster. The system is highly
available and resilient and can sustain several component failures,
including failures of nodes, network connectivity, the software
stack, and disks. The built-in monitors watch the components and
initiate a failover whenever there is a failure. For example, if a
Web application server issues an HTTP GET request for a user image
file to a particular node in the cluster and that node fails at the
time of the request, the cluster automatically serves the request
from a different, designated backup node in the cluster.
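The failover behavior described above can be illustrated with a small in-memory simulation. It is a toy model rather than the product's actual mechanism: the classes, the rule for choosing a backup node (the next healthy node in list order), and the file path are all hypothetical. The point it shows is that, because every node sees the same shared storage, any healthy node can answer a request addressed to a failed one.

```python
# Sketch: a request addressed to a failed node is served by another node,
# which is possible because all nodes read the same shared storage.


class Node:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True


class Cluster:
    def __init__(self, node_names, shared_storage: dict):
        self.nodes = [Node(name) for name in node_names]
        self.shared_storage = shared_storage  # single copy, visible to every node

    def fail(self, name: str) -> None:
        """Mark a node as failed, as a monitor would after detecting a fault."""
        for node in self.nodes:
            if node.name == name:
                node.healthy = False

    def get(self, path: str, requested_node: str):
        """Serve path from the requested node, or from the next healthy node."""
        # Stable sort: the requested node first, the others in original order.
        candidates = sorted(self.nodes, key=lambda n: n.name != requested_node)
        for node in candidates:
            if node.healthy:
                return node.name, self.shared_storage[path]
        raise RuntimeError("no healthy node available")


cluster = Cluster(["node1", "node2", "node3"], {"/images/42.jpg": b"jpeg-bytes"})
cluster.fail("node1")
served_by, body = cluster.get("/images/42.jpg", requested_node="node1")
print(f"requested node1, served by {served_by}, {len(body)} bytes")
```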
Co-hosting applications: One of the unique features of the HP
Scalable NAS solution is the ability to host applications directly
on the cluster nodes. For instance, to provide highly available Web
serving, the Web server can run directly on the cluster nodes, which
gives it block access to shared data, simplifies overall management,
and reduces the cost of the infrastructure.
Conclusion
The high availability architecture of the HP Scalable NAS solutions
ensures reliable data access to applications at all times. The added
benefit of an open application platform is the ability to co-host
applications on the Scalable NAS servers, thereby eliminating the
need for a separate application tier. With every cluster node
accessing the same shared storage content, a high degree of
scalability is achieved with a single copy of content. Also, the
clustered file system architecture eliminates the need to replicate
content to keep up with demand, and it does so without affecting the
quality of service. This scalable clustered architecture offers a
simpler alternative to over provisioning resources in order to keep
up with spiky data access patterns and the continuing growth of
digital static media.