Cybersecurity Data Science: Minding the Growing Gap

By Scott Mongeau

Following cybersecurity Data Science best practices can help beleaguered and resource-strapped security teams transform Big Data into smart data for better anomaly detection and enterprise protection.

Future Shock: Growing Vulnerabilities and Liabilities

The consequences of ignoring security challenges are rising. According to the Cisco 2018 Annual Cybersecurity Report, over half of cyberattacks resulted in damages of greater than $500K, with nearly 20 percent costing more than $2.5M. Meanwhile, regulators seeking to spur heightened oversight have become more aggressive in levying fines and holding corporate boards accountable.

A rapidly developing field, Cybersecurity Data Science (CSDS) brings hope to organizations challenged by evolving cyber threats. CSDS utilizes advanced analytics to address common security challenges – data overload, limited resources, overabundant false alerts, and more – in an increasingly data-driven, interconnected world.

Cybersecurity Data Science in a Nutshell

CSDS offers a practical path forward for organizations besieged by unknown-unknowns. The discipline unites a range of analytical methods to achieve cybersecurity monitoring, detection, and prevention goals. When operationalized, the result is an end-to-end organizational process orchestrating people, methods, and technologies.

Cybersecurity Data Science (CSDS) drives value through:

  • Aligning data engineering objectives.
  • Refining fast and big data into “smart data.”
  • Orchestrating a cyclical process of discovery and detection.
  • Facilitating the development of analytical models for pattern extraction and event detection.
  • Leveraging data analytics tools and methods to produce targeted, evidence-based alerts.
  • Routing focused incidents to the right resources at the right time for rapid review and remediation.

Chasing Phantoms: Unknown-Unknowns

Cybersecurity monitoring personnel are increasingly overwhelmed by false alerts, a challenge exacerbated by limited monitoring and remediation resources. A recent report by Cybersecurity Ventures predicts 3.5 million unfilled cybersecurity jobs by 2021. Status quo rule-based approaches to surfacing security events are deluged by data overload and data fragmentation from distributed sources. The lack of an integrated view results in confusion concerning unknown-unknowns – phantom patterns issuing from increasingly complex environments and disconnected data. The resulting confusion and complexity cloak threat indicators lurking in the shadows. The consequences are dire and growing: security teams struggle to do more with less, while exposure to persistent and sophisticated threats grows. Cybersecurity professionals struggle to find elusive signs in exponentially expanding data volumes.

A New Hope: Cybersecurity Data Science (CSDS)

Data Science drives the application of advanced analytics methods to yield value-creating insights from data. A practitioner-driven discipline, Data Science combines methods from a variety of fields, including computer science, data engineering, statistics, machine learning, and operations research. Combining domain challenges with Data Science methods results in hybrid areas, CSDS being a significant and growing example.      

Cybersecurity is a broad, established professional domain addressing a range of topics associated with safeguarding network and computer infrastructure and devices. Subdomains focus on solutions engineering, data protection, safeguarding access, network and device monitoring, incident response and handling, forensics, penetration testing/ethical hacking, and rapidly emerging focus areas such as wireless, mobile, and Internet of Things (IoT) security.

Data, Data Everywhere and Not a Drop to Drink! 

As false alerts thwart effective monitoring efforts, cybersecurity professionals are challenged to separate signal from noise. Similarly, while organizations are overwhelmed by growing volumes of cyber data, security monitoring solutions struggle to extract focused alerts. Data Science addresses these twin gaps by bringing to bear a range of techniques to refine data into focused and effective alerts.

Lacking properly prepared data and context, security analytics efforts are highly constrained from the outset. Before trends can be extrapolated and predictions made, data engineering and selection must be undertaken. This involves exploring data to determine key features, preprocessing to impose structure, integrating sources, and establishing pipes – jobs or routines to streamline the movement of data from raw-and-distributed to structured-and-integrated.
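As a rough illustration of the preprocessing step described above, the sketch below imposes structure on raw authentication log lines and aggregates them into per-user features. The log format, the field names (`user`, `src`, `result`), and the helper names `parse_line` and `user_features` are hypothetical, chosen only for this example; real sources will differ and typically need a parser per source.

```python
import re
from collections import defaultdict

# Hypothetical log format, assumed for illustration only.
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) user=(?P<user>\S+) src=(?P<src>\S+) result=(?P<result>\w+)"
)


def parse_line(line):
    """Impose structure on one raw line; return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def user_features(lines):
    """Aggregate parsed records into a small set of per-user features."""
    stats = defaultdict(lambda: {"attempts": 0, "failures": 0, "sources": set()})
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip lines that do not fit the assumed format
        entry = stats[record["user"]]
        entry["attempts"] += 1
        if record["result"] == "FAIL":
            entry["failures"] += 1
        entry["sources"].add(record["src"])
    return {
        user: {
            "attempts": s["attempts"],
            "failure_rate": s["failures"] / s["attempts"],
            "distinct_sources": len(s["sources"]),
        }
        for user, s in stats.items()
    }


# Made-up sample lines, not real data.
raw_lines = [
    "2018-06-01T09:00:00 user=alice src=10.0.0.5 result=OK",
    "2018-06-01T09:01:10 user=bob src=10.0.0.9 result=FAIL",
    "2018-06-01T09:01:12 user=bob src=10.0.0.9 result=FAIL",
    "2018-06-01T09:01:15 user=bob src=172.16.4.2 result=OK",
    "malformed line without fields",
]
features = user_features(raw_lines)
```

The point of the sketch is the shape of the pipe rather than the specific features: raw, distributed lines go in, and a compact, structured table of measures per entity comes out.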

Data sources for security analytics are prolific and voluminous, including log files, network traffic and packets, authentication and proxy records, device configuration, security information and event management (SIEM) and monitoring data, device telemetry, threat feeds, domain lookup, and user and device metadata.

The variety and diversity of cyber data sources, often unstructured, require focused data engineering and feature selection to aggregate and transform sources into an integrated picture. CSDS directly supports the refinement, linking, and selection of effectual master datasets, turning big data into smart data.

Advanced Analytics: Walk Before You Run

Advanced analytics methods applied in CSDS range from exploratory methods (e.g., unsupervised machine learning, diagnostic statistics, and time-series analysis) to predictive techniques (e.g., forecasting and supervised machine learning). Specialized techniques such as text analytics, network graph analysis, and probabilistic process benchmarking are applied for advanced challenges.

A fundamental CSDS best practice is to use basic exploratory techniques to develop a clearer picture of what is “normal” in the environment – i.e., to better understand natural groups and dynamics so it is easier to spot anomalies. In practice, this means applying straightforward descriptive and diagnostic techniques before jumping to predictive machine learning and advanced methods.

Utilizing combinations of fundamental analytical methods, CSDS provides focused value by improving organizational understandings of security infrastructure and dynamics. Analytical techniques support the identification of statistical baselines – an understanding of what is expected for a given environment and set of entities, be they users or devices. Pattern detection algorithms and diagnostics bring clarity and definition to complex environments.
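One simple way to picture a statistical baseline is a per-entity mean and standard deviation, with new observations flagged when they fall too far from that entity's own history. The sketch below is a minimal example under that assumption; the entity names, the daily-count values, the three-standard-deviation threshold, and the function names are all illustrative choices, not a prescribed CSDS method.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize each entity's history as (mean, sample standard deviation)."""
    return {
        entity: (mean(values), stdev(values))
        for entity, values in history.items()
    }


def is_anomalous(baseline, entity, value, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline[entity]
    if sigma == 0:
        return value != mu  # no historical variation: any change stands out
    return abs(value - mu) / sigma > threshold


# Made-up daily connection counts per entity, for illustration only.
history = {
    "alice": [4, 5, 6, 5, 4, 6, 5],
    "backup-svc": [200, 210, 190, 205, 195, 200, 198],
}
baseline = build_baseline(history)

alice_flag = is_anomalous(baseline, "alice", 205)        # far above her norm
service_flag = is_anomalous(baseline, "backup-svc", 205)  # typical for the service
```

The same raw value is flagged for one entity and not for the other, which is the practical benefit of baselining per user or device instead of applying a single global rule.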

CSDS-as-a-Process: The Explore and Detect Cycle

A frequent misstep in security analytics initiatives is failing to distinguish processes for exploration from processes for detection. The former process, exploration and discovery, is used to identify new detection methods, and thus involves larger and richer datasets in which to monitor trends and spot emerging, unknown exploits and threats. The latter process, detection automation, focuses on operational detection. By nature, it utilizes a highly refined, efficient set of data and analytical methods for monitoring.

The processes taken together, exploring for new insights and detection monitoring, operate as a virtuous lifecycle. However, from an operational standpoint, it is a costly misstep to attempt to “boil the ocean” by storing all the data, all the time. Operational effectiveness and cost control depend on knowing which data can be forgotten and which must be operationalized, while allowing for the model to change based upon new insights.

Many companies now struggle with immense and costly cyber data lakes filled with unstructured, untreated data that yields neither analytics insights nor value. Explicitly distinguishing the data and methods associated with the exploration and detection processes implies distinct data and models for each challenge.

A unified approach combines iterative analytical methods to refine big data into smart data, and from there delivers targeted, efficacious alerts to the right resources at the right time.

Together, Better: Take Away CSDS Best Practices

With an understanding of why CSDS offers new hope for cybersecurity, and of the approaches involved, organizations can start with some best practices:

  1. The New Normal: Using analytics methods to build a picture of “normal” for your environment is the first step towards focused detection.
  2. Garbage In, Garbage Out: Data Quality is often overlooked, but small investments in refining data using analytics have outsized benefits in operational efficiency and effectiveness.
  3. Walk Before You Run: Focus on basic descriptive and statistical diagnostic techniques before jumping to fancier predictive machine learning and artificial intelligence (AI) approaches.
  4. From Big Data to Smart Data: Feature engineering supports the refinement and reduction of exhaustive cybersecurity datasets into highly refined measures for monitoring.
  5. Insight as a Process: There is a range of CSDS methods available. Focusing on implementing an end-to-end process from raw data to insights will help to structure engineering efforts.
  6. Segment Your Goals: Formally distinguish data and methods used for continuing exploration versus for operational detection.
  7. Knowing the Right Questions: Use a continual exploration approach to refine the questions being asked; operationalize the cycle in a refined exploration-detection process.

Do You Know the Way? Next Steps for Your Organization

Organizations should carefully plan objectives for their security analytics initiatives, with a focus on defining a process for applying CSDS to systematically refine big data into smart data. The goal is an operationalized process that reduces data into an effective set of features that result in focused alerts. Emerging threats are monitored continuously, but separately, in a parallel exploratory process.

Dumping security data in a massive cyber data lake and attaching machine learning algorithms on top is not only insufficient, it is guaranteed to be a costly misstep. It is important to keep one’s bearings by focusing on the operational goals of each refined step in the CSDS process. At a high level, the operating process should facilitate the role-based interactions of data engineers, data scientists, cyber investigators, and incident response professionals.

While most large organizations have focused experts in the areas of big data, Data Science, and cybersecurity, it is beneficial to speak with experienced, focused CSDS professionals who have implemented cutting-edge solutions at the junction of these three domains. Going it alone, or attempting to reinvent the wheel altogether, can be a costly and time-consuming misstep. Moreover, adhering to CSDS best practices will help organizations stay on track in terms of goals, costs, and value realized.
