How to Increase Data Storage System Reliability


Click to learn more about author Grahame Morrison.

An interesting conundrum has perplexed many IT and DevOps departments. When it comes to disk drives, what’s behind the gap in reliability when all of the drives come from only a handful of disk manufacturers worldwide, whose products are very comparable to one another? There’s a clear answer when you dig a little deeper. The variation in disk drive and data storage system reliability doesn’t hinge on the disk manufacturers. It’s influenced much more by storage vendors and their approach to three key areas:

  • Product design
  • Manufacturing processes
  • QA testing

Storage systems commonly differ in each of these functions, and those differences affect reliability. Outdated designs and processes in these areas can lower quality, add unnecessary storage management costs, increase management time, and disrupt the business, which no enterprise can afford.

Many data storage systems rely on the storage virtualization technology known as RAID (Redundant Array of Independent Disks, originally Inexpensive Disks). While RAID can be effective at data protection, it’s not suited for every environment, and it has an inherent risk. A failed system component or disk drive can not only delay the business but also monopolize IT’s limited time.

Reasons Behind Failure

What causes disk drives to underperform or fail outright? The top culprits are:

  • Vibration: Think about the extremely high level of accuracy that’s required to position the heads that read and write data on a disk. These heads have very little room in which to work. They hover only a few nanometers above the disk, a gap far smaller than the width of a human hair. In environments with vibration, it can be difficult, if not impossible, to position the heads correctly, which can cause the drive to abort the operation. Challenging environments that often produce high-frequency vibration include city subways, oceanic mapping and research ships, NASA projects, and cruise ships that record security camera video.
  • Temperature: The same types of less-than-ideal environments also often have extreme temperatures. Like any electro-mechanical device, a disk drive can have an increased rate of failure when operated in extreme temperatures.
  • Service Interruptions: Any time that service is disrupted – for example, by routine maintenance or system component replacement – disk drive malfunctions can result.
  • Interconnection Failures: A number of physical interconnect failures – from physical damage or failure of signal path components to connection-path contamination – can result in the appearance of a missing disk.
  • Protocol Failures: Protocol failures can lead to potential data loss due to problems with input/output requests. A wide range of such protocol errors can happen, whether from data center switches or protocol incompatibility between different manufacturers.
  • Defects: Occasionally, a disk drive will be found to contain defects from the manufacturing process itself.
  • Performance Failures: Sector re-mapping is one example of an activity that can overload multiple disks with recovery activity at the disk level, thus leading to performance issues.
  • Damage: Delicate disk drives can be easily damaged during shipping and handling or assembly if protection is not provided from knocks and drops.

The list above shows the wide range of variables that must be baked into a data storage system’s product design, manufacturing, and QA testing to create system reliability. The following key features and best practices help to ensure highly efficient disk drives:

Designing to Prevent Vibration

Disk drive reliability begins at the engineering and design stage. A data storage system needs an anti-vibration design for optimal performance. When hard drives are placed near each other, each drive’s vibration can disrupt its neighbors’ reads and writes. One solution is to position drives back-to-back, which helps control high-frequency vibration. You can also design for anti-vibration by introducing greater rigidity than the standard steel construction of many storage systems. Beefing up stiffness and mass through the use of aluminum makes vibration easier to absorb.

Creating a Cooling System

Since high temperatures increase the chance of drive failure, it’s critical for data storage systems companies to design and test an advanced cooling system that allows optimum airflow from front to back. A drive’s electronics are its “hot spot,” so when cooling mechanisms are paired with back-to-back positioning of the drives, the result is a cooling channel right where lower temperatures are needed. Effective cooling systems also monitor the temperature of the drives and other system components on an ongoing basis and adjust fan speed automatically. The next step should involve rigorous testing of both temperature and airflow.
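
As a rough illustration of the auto-adjustment idea, the sketch below maps the hottest sensor reading to a fan duty cycle. The function name, the sensor labels, and every threshold (30 °C, 50 °C, 30% minimum duty) are assumptions made for this example, not values from any particular storage system.

```python
def fan_duty_percent(temps_c, low_c=30.0, high_c=50.0, min_duty=30, max_duty=100):
    """Map the hottest sensor reading to a fan duty cycle in percent.

    temps_c is a dict of sensor name -> temperature in Celsius.
    All thresholds are illustrative assumptions.
    """
    if not temps_c:
        # No readings available: fail safe by running the fans at full speed.
        return max_duty

    hottest = max(temps_c.values())
    if hottest <= low_c:
        return min_duty
    if hottest >= high_c:
        return max_duty

    # Scale linearly between the low and high thresholds.
    fraction = (hottest - low_c) / (high_c - low_c)
    return round(min_duty + fraction * (max_duty - min_duty))


if __name__ == "__main__":
    readings = {"drive0": 35.0, "drive1": 40.0, "controller": 38.5}
    print(fan_duty_percent(readings))
```

Driving the fans from the hottest reading, rather than the average, is a deliberately conservative choice here: a single overheating drive is enough to raise fan speed.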

Enabling Active Updates

Downtime while system components are being replaced can bring business to a halt. To circumvent this problem, a storage system should be designed to allow “active updates,” meaning the IT administrator can replace components while the system keeps running.

Qualifying Software and Controlling Production Processes

To avoid failures in software protocol, it’s important to stick with a carefully managed revision process. This can be achieved by validating each drive and refusing to accept any software level that has not gone through a rigorous qualification process for that disk.
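
One simple way to picture such a gate is an allow-list of drive model and software-level pairs that have passed qualification. The sketch below is only an illustration: the model names, revision strings, and the `is_qualified` helper are all hypothetical.

```python
# Hypothetical allow-list: drive model -> software levels that passed qualification.
QUALIFIED_LEVELS = {
    "MODEL-A": {"1.0.3", "1.0.4"},
    "MODEL-B": {"2.1.0"},
}


def is_qualified(model, software_level, qualified=QUALIFIED_LEVELS):
    """Return True only if this exact model/software pair has been qualified."""
    return software_level in qualified.get(model, set())


if __name__ == "__main__":
    print(is_qualified("MODEL-A", "1.0.4"))  # a qualified pair
    print(is_qualified("MODEL-A", "1.0.5"))  # a newer, unqualified revision
```

An unknown model or an unlisted revision is rejected by default, which mirrors the idea of refusing software levels until they have been qualified.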

Testing the Drives

The best way to root out problematic drives from the get-go is to require testing as part and parcel of each storage system’s production process. If the test cycle flags too many sectors that need correction, that disk does not get qualified.
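
The pass/fail rule described above can be sketched in a few lines. The record type, the field names, and the threshold of 10 corrected sectors are assumptions chosen for illustration; a real production test would set its own limit.

```python
from dataclasses import dataclass


@dataclass
class DriveTestResult:
    serial: str             # hypothetical drive identifier
    corrected_sectors: int  # sectors the test cycle flagged as needing correction


def qualify_drives(results, max_corrected_sectors=10):
    """Split tested drives into qualified and rejected serial numbers.

    The default threshold is an illustrative assumption.
    """
    qualified, rejected = [], []
    for result in results:
        if result.corrected_sectors <= max_corrected_sectors:
            qualified.append(result.serial)
        else:
            rejected.append(result.serial)
    return qualified, rejected


if __name__ == "__main__":
    batch = [
        DriveTestResult("SN-001", 0),
        DriveTestResult("SN-002", 11),
        DriveTestResult("SN-003", 10),
    ]
    print(qualify_drives(batch))
```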

Protecting Disks During Shipping/Handling

Intensive packing and handling procedures should be deployed to avoid damage that occurs during shipping and goes undetected once the system is received. A best practice is to use special shipping containers to ship hard drives outside the chassis.

Disk drive reliability doesn’t just happen – it’s the result of intentional planning at the design, manufacturing, and testing stages. When a storage system encapsulates best practices in this trio of critical areas, the result will be efficiency, quality, and reliability.
