Case Study: Using Mongo and MapReduce to Analyze a Difficult Research Problem

February 23, 2012

About the Presentation

This presentation demonstrates how NoSQL technologies were used to solve a difficult analytical problem that traditional SQL databases could not. PDF malware has been on the rise for the past few years and has become one of the most successful methods for attackers to gain unauthorized access to a network. Malware analysis of PDF documents has typically been done in isolation. Commercial entities do not share their data, so researchers must fend for themselves, and more often than not they analyze a PDF file independently of other malicious PDF files. I found this static approach to be highly inefficient, but storing multiple PDF documents in a database was a problem in itself. Traditional SQL databases didn’t seem like the right fit given their enforced constraints and strictly relational models.

PDF files also contain a lot of dynamic data that makes them a tough fit for a traditional SQL model. PDF A could contain 40 objects, whereas PDF B could contain 3,000 objects. Scaling this out becomes quite difficult and messy. When looking at this problem, I ideally wanted to solve a number of issues at once. I needed a good way to share PDF samples, an easy way to query a corpus of documents, and the ability to efficiently get my data back out so I could display it elsewhere.

With all my samples in JSON format, MongoDB just made sense, as it could take in these objects and allow me to query them as a whole or independently. MongoDB also provided me with a rich toolset to answer questions that had never been posed before. By using single and multi-step map/reduce jobs, I was able to aggregate PDF characteristics and apply simple averaging to identify commonalities between malicious documents.
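To make the averaging idea concrete, here is a minimal sketch of a map/reduce job written in plain Python rather than against MongoDB itself. The sample documents, the field names (`sha`, `objects`, `type`, `length`), and the choice of "average object length per object type" are all illustrative assumptions, not the schema or metrics used in the talk.

```python
from collections import defaultdict

# Hypothetical parsed PDF samples; the field names are assumptions for illustration.
samples = [
    {"sha": "a1", "objects": [{"type": "JavaScript", "length": 1200},
                              {"type": "Font", "length": 300}]},
    {"sha": "b2", "objects": [{"type": "JavaScript", "length": 800}]},
    {"sha": "c3", "objects": [{"type": "Font", "length": 500},
                              {"type": "Page", "length": 100}]},
]


def map_sample(doc):
    """Map step: emit (object type, partial aggregate) for each object in one document."""
    for obj in doc["objects"]:
        yield obj["type"], {"count": 1, "total_length": obj["length"]}


def reduce_values(values):
    """Reduce step: merge partial aggregates for one key.

    The output has the same shape as the inputs, so it can be reduced again.
    """
    merged = {"count": 0, "total_length": 0}
    for value in values:
        merged["count"] += value["count"]
        merged["total_length"] += value["total_length"]
    return merged


def finalize(value):
    """Finalize step: turn the merged aggregate into a simple average."""
    return value["total_length"] / value["count"]


def map_reduce(docs):
    """Group mapped values by key, then reduce and finalize each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_sample(doc):
            grouped[key].append(value)
    return {key: finalize(reduce_values(values)) for key, values in grouped.items()}


averages = map_reduce(samples)
print(averages)
```

Keeping the reduce output in the same shape as the map output is what lets the step be applied repeatedly, which is the property a multi-step job relies on. The average is computed only in the finalize step, because averaging partial averages would give the wrong answer.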

Though I have had great outcomes and successes with MongoDB, there have also been annoyances and unexplained details. These issues had me pinned against a wall for days and sometimes had me wondering if I had picked the wrong model for tackling this problem. At the end of the day, though, I was able to overcome these issues and account for them without hassle.

This talk covers new research methods and tools created using NoSQL technologies to analyze PDF documents in a more efficient manner that promotes collaboration among the community. It also serves as a step forward in detecting malicious PDFs by looking at them from a statistical standpoint. When I look back on the choice to use MongoDB, I think I made a great decision. I can’t begin to think how I would have handled processing thousands of uniquely named function calls with multiple attributes for each PDF, or running multi-step map/reduce jobs against a blob in a SQL database instead of a Mongo document. MongoDB provided me with rich functionality that made it much easier to succeed on my project.

In its current state, my PDF malware collection contains over 10,000 documents with over 100,000 objects. By using map/reduce, complex queries, and a bit of math, I am now able to compare uploaded PDF documents with the thousands of malicious files in my collection in less than 20 seconds. Using this tool and MongoDB, I have been able to accurately identify shared resources among malware that can help identify future variants that get introduced online.
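The abstract does not spell out the math used for the comparison, so the following is only a sketch of one plausible approach: compute how common each feature is across the malicious collection, then score an uploaded document by the average prevalence of its own features. The function names, the tiny corpus, and the scoring rule are assumptions for illustration, not the tool's actual method.

```python
def feature_prevalence(corpus):
    """Return the fraction of corpus documents that contain each feature."""
    counts = {}
    for features in corpus:
        for feature in set(features):  # count each feature once per document
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: count / len(corpus) for feature, count in counts.items()}


def similarity_score(uploaded, prevalence):
    """Average prevalence of the uploaded document's features.

    Features never seen in the corpus contribute 0, which pulls the score down.
    """
    features = set(uploaded)
    if not features:
        return 0.0
    return sum(prevalence.get(f, 0.0) for f in features) / len(features)


# Hypothetical corpus: each entry lists the PDF name keys found in one malicious sample.
malicious_corpus = [
    ["/JavaScript", "/OpenAction", "/JS"],
    ["/JavaScript", "/OpenAction"],
    ["/JavaScript", "/Launch"],
    ["/OpenAction", "/JS"],
]

prevalence = feature_prevalence(malicious_corpus)
score = similarity_score(["/JavaScript", "/OpenAction", "/Font"], prevalence)
print(prevalence)
print(score)
```

In a sketch like this the prevalence table could be precomputed by a map/reduce job over the collection, so scoring an upload only needs a dictionary lookup per feature rather than a scan of every stored sample.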

I believe my approach is truly unique and I have yet to find anything else like it. Though it was unorthodox, it has produced a new way of sharing malicious PDF samples and identifying future attacks. Parts of my tool have been released as open source, allowing other users to build their own corpus of malicious PDF documents or perform their own research. A web version of the tool also exists in beta form, where users can upload samples and have them compared against the malicious collection to identify suspicious components in their uploaded document.

About the Speaker

Brandon is a security researcher for 9b+ where he spends his time identifying malicious attacks and thinking of better ways to detect/stop them. His research on various security topics has gotten him attention from companies such as Adobe, Verizon, Sprint, and Cisco. He has discovered several exploits and flaws based on vulnerabilities found in commercial products, web applications and messaging technologies.
