Sherlock in the Data Center: The Value of Correlation When Investigating and Troubleshooting Problems

Click to learn more about author Kong Yang.

Technology constructs are getting more complex and varied. This means data center performance problems—whether related to applications, compute, network, storage, virtualization, web, the cloud, or, most likely, a combination of these—are more difficult to resolve than ever. They’re true mysteries. As such, you need to become a veritable IT Sherlock Holmes. This all comes back to troubleshooting, one of the most important skills for any IT professional.

Ratiocination = Troubleshooting

Holmes is famous for his ratiocination, or process of reasoning and logical thinking. In IT, this is akin to the troubleshooting process. IT troubleshooting is a foundational skill and a key element of monitoring with discipline. It enables you to drill down to discover the root cause of an issue. Without this skill, it’s nearly impossible to understand the underlying cause and effect of any incident. Today, however, the multi-stack issues IT professionals often encounter transcend functional silos within the larger organization. Technologies like cloud, virtualization, hybrid IT, and hyper-converged infrastructure have fundamentally transformed IT and made troubleshooting across these distributed systems both more critical and more complex than ever.

It’s important at this point to review the eight fundamental steps of troubleshooting, which are applicable to any IT professional, any organization, and any IT environment:

1. Define the problem
2. Gather and analyze relevant information
3. Construct a hypothesis or probable cause
4. Devise a plan to remediate
5. Implement the plan
6. Observe the results and recreate the plan to reproduce or reverse-engineer the results
7. Repeat steps 2-6 as necessary
8. Determine the root cause and document it

Although these steps remain consistent regardless of new technology constructs, the volume and velocity of technology and services have changed, which affects the rules of engagement for IT professionals. We are consistently short on time: there are never enough hours in the day, and we need to fix issues as fast as possible.

It’s Correlation, My Dear Watson

Causation and correlation are important concepts associated with effective troubleshooting. However, as you likely know, correlation does not necessarily equal causation.

Causation is the ideal outcome in troubleshooting in any environment; it’s about finding the exact cause and its effect, so that it can be remediated. In other words, the eight troubleshooting steps outlined above are designed to reach causation.

Correlation, on the other hand, means exploring how multiple variables move together over time to see whether they point toward, although perhaps not definitively prove, the cause of a performance issue or incident. The main point is to associate and compare a multitude of key metrics, such as network performance counters and application performance counters, track the situation over a period of time and, bolstered by experience and expertise, pinpoint the cause and remediate. For example, you might correlate network latency and bandwidth data with virtual machine compute and application-specific data to find the root cause of a distributed application performance issue.
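To make the idea concrete, here is a minimal sketch of comparing metrics over the same time window. It assumes three hypothetical series sampled at identical intervals (the names and values are invented for illustration) and uses the Pearson correlation coefficient, which is only one of several ways a monitoring tool might score a relationship. A high score flags a suspect worth investigating; it does not prove causation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

def rank_suspects(target, candidates):
    """Sort candidate metrics by how strongly they track the target metric."""
    scored = [(name, pearson(series, target)) for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical samples, all taken at the same one-minute intervals.
app_response_ms = [210, 220, 215, 480, 610, 590, 230, 212]
candidates = {
    "network_latency_ms": [12, 14, 13, 40, 55, 52, 15, 13],
    "vm_cpu_percent": [35, 35, 36, 34, 35, 36, 35, 34],
}

ranking = rank_suspects(app_response_ms, candidates)
for name, score in ranking:
    print(f"{name}: r = {score:.2f}")
```

In this made-up data, network latency rises and falls with application response time while VM CPU stays flat, so latency lands at the top of the list as the first thing to examine.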

So, when it comes to IT troubleshooting, while correlation may not equal causation, correlation should be a part of steps one through seven above, helping you arrive at step eight.

Finding Your Inner Sherlock

Correlating performance metrics and data requires you to have a certain level of expertise and familiarity with your environment beyond up/down or green/yellow/red status. It also presents challenges in terms of soft skills, meaning workplace skills like collaboration and communication. These soft skills, especially collaboration, are becoming more important for properly troubleshooting performance issues across highly distributed systems, because those issues are increasingly likely to involve root causes spanning multiple technology silos, regions, and service providers. Furthermore, correlation and collaboration, while certainly two different concepts, are related: good correlation often requires collaboration, and collaboration can likewise improve correlation.

Here are several suggestions to help you find your inner Holmes and overcome these challenges:

Implement monitoring with discipline: As I stated, using correlation to troubleshoot performance issues requires you to have a certain level of expertise and familiarity with your environment. The best way to accomplish this is by properly monitoring your data center across the entire stack. This will require an investment in resources, such as IT monitoring and management software.
Use your monitoring toolset to help with correlation: A good monitoring toolset should be able to help you visualize and correlate IT monitoring data to improve troubleshooting of performance issues across the IT environment, from infrastructure to networking to applications, and from on-premises to cloud service providers. Seek the ability to simply combine and correlate time-series metrics as well as historical performance metrics from multiple hybrid IT data sources, including applications, compute, network, storage, virtualization, web, and the cloud, into a single shareable dashboard that visualizes relationships between suspect elements. Then, collaborate with subject matter experts that span the silos of that dashboard.
Work on your soft skills: Soft skills are how silos are broken down; however, they’re not always strengths for those of us in IT who perhaps gravitated to the field because of our proclivity towards technology and the sciences. As such, it’s up to us to develop these skills. Communicating and collaborating effectively are two of the most important soft skills. There’s no better way to start refining them than by putting them into practice.
Remember and follow the eight steps of troubleshooting: Although simple and basic, the eight steps of troubleshooting I’ve outlined here are nearly universally applicable. While the tools we use to aid us in taking these steps are evolving to address the challenge, never forget that the foundational principles of troubleshooting remain applicable.
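As a rough illustration of what combining time-series metrics from multiple data sources involves, the sketch below aligns samples from different sources onto shared time buckets so they can be compared side by side. The source names, the integer-second timestamps, and the 60-second bucket size are all assumptions made for this example; a real monitoring toolset would handle this step for you.

```python
from collections import defaultdict

def align_by_bucket(sources, bucket_seconds=60):
    """Average each source's (timestamp, value) samples per time bucket.

    Returns a list of (bucket_start, {source_name: average}) pairs, keeping
    only the buckets in which every source reported at least one sample.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for name, samples in sources.items():
        for timestamp, value in samples:
            bucket_start = timestamp - (timestamp % bucket_seconds)
            buckets[bucket_start][name].append(value)

    aligned = []
    for bucket_start in sorted(buckets):
        per_source = buckets[bucket_start]
        if len(per_source) == len(sources):  # skip buckets with missing sources
            averages = {n: sum(v) / len(v) for n, v in per_source.items()}
            aligned.append((bucket_start, averages))
    return aligned

# Hypothetical samples: timestamps are seconds since the start of the window.
sources = {
    "network_latency_ms": [(0, 12), (30, 14), (60, 40), (125, 55)],
    "storage_iops": [(5, 900), (65, 400), (70, 420)],
}

aligned = align_by_bucket(sources)
for bucket_start, averages in aligned:
    print(bucket_start, averages)
```

Dropping buckets where a source is missing keeps the comparison honest, at the cost of discarding partial data. Whether to drop, interpolate, or carry forward the last value is a design choice that depends on the metrics involved.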

In Conclusion

The troubleshooting process can be more convoluted than ever before, often requiring collaboration among many different functional silos within IT and beyond, such as cloud service providers. However, with the proper process and the right tools, using correlation to determine causation can make troubleshooting more efficient and effective. It’s elementary, my dear IT professionals.
