Developing AI-Based Solution for Web Scraping: Lessons Learned

By Aleksandras Šulženko

As the buzz around artificial intelligence and machine learning keeps increasing, few tech companies have yet to try their hand at it. We decided to take the plunge into AI and ML last year.

We started small. Instead of attempting to implement machine learning models in all stages of our solutions for data acquisition, we split everything into manageable pieces. We applied the Pareto principle – what would take the least effort but provide the greatest benefit? For us, the answer was AI-based fingerprinting.

Thus, late last year we unveiled a new type of proxy: Next-Gen Residential Proxies. They are our machine learning-based innovation in the proxy industry. Since we had fairly little experience with AI and ML beforehand, we gathered machine learning experts and Ph.D. researchers as an advisory board.

We succeeded in developing AI-powered dynamic fingerprinting. This addition to our arsenal of web scraping solutions makes it easy for anyone to reach 100% data acquisition success rates.

Additionally, we are in the beta testing phase of our Adaptive HTML parser. In the foreseeable future, Next-Gen Residential Proxies will allow you to gather structured data in JSON from any e-commerce product page.
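To make "structured data in JSON" concrete, here is a minimal sketch of the kind of output an adaptive parser might return for a product page. The field names and values are illustrative assumptions, not the actual API schema.

```python
import json

# Hypothetical JSON payload an adaptive parser might emit for one
# e-commerce product page; every field name here is an assumption.
raw = """
{
  "title": "Wireless Mouse",
  "price": 24.99,
  "currency": "USD",
  "availability": "in_stock",
  "rating": 4.5
}
"""

product = json.loads(raw)
print(product["title"], product["price"])
```

The point of structured output is exactly this: downstream code reads named fields instead of re-parsing raw HTML for every target site.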

These solutions arose from necessity. Getting IP blocked by a server is an unfortunate daily reality of web scraping. As websites are making strides in the enhancement of flagging algorithms with behavioral and fingerprinting-based detection, our hope lies in using machine learning to maximize data acquisition success rate.

Note: The tips and details outlined below are a theoretical exploration for those who want to try out web scraping. Before engaging in web scraping, consult your lawyers or legal team.

Understanding Bot Detection: Search Engines

It shouldn’t be surprising that search engine scraping is rife with opportunity. However, search engines know full well about all the advantages gained by scraping them. Thus, they often use possibly the most sophisticated anti-bot measures available.

Across our clients’ experience, we notice the same issue with search engines: low scraping success rates. Search engines are quite trigger-happy when labeling activity as bot-like.

The newest developments in anti-bot technology include two important improvements: behavioral detection and fingerprinting-based detection. Simply changing user agents and IP addresses to avoid blocks is a thing of the past. Web scraping needs to be more advanced.

Search engines will lean more heavily towards fingerprinting-based detection. In this case, fingerprinting is simply acquiring as much information as possible about a particular device, OS, and browser through cookies, JavaScript, and other avenues.

Once collected, the data is matched against known bot fingerprints. Avoiding fingerprinting is not easy. According to research, an average browser can be tracked for 54 days, and a quarter of browsers can be easily tracked for over 100 days.
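The matching step described above can be sketched very simply: collapse the collected attributes into a stable identifier and compare it against a set of identifiers previously flagged as bots. Real fingerprinting uses far more signals (canvas rendering, fonts, plugins, TLS parameters); the attributes and blocklist below are toy assumptions.

```python
import hashlib

def fingerprint(attrs: dict) -> str:
    """Collapse device/OS/browser attributes into a stable hash.
    A real fingerprint draws on many more signals; this is just the idea."""
    canonical = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Hypothetical blocklist of fingerprints previously flagged as bots.
known_bots = {
    fingerprint({"os": "Linux", "browser": "HeadlessChrome/118", "screen": "800x600"}),
}

visitor = {"os": "Windows 10", "browser": "Chrome/118", "screen": "1920x1080"}
is_flagged = fingerprint(visitor) in known_bots
print(is_flagged)
```

Because the hash is deterministic, the same device configuration maps to the same identifier across sessions, which is exactly what makes long-term tracking possible.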

Thus, there is no surprise that the process of maintaining a high success rate in search engine scraping is challenging. It’s a cat-and-mouse game where the cat has been enhanced by cybernetics. Upgrading the mouse with AI would go a long way for most businesses in this area.

E-Commerce Platforms

E-commerce platforms are a completely different beast. Paths to data in search engine scraping are very short. Generally, they consist of sending queries directly to the engine in question and downloading the entire page. At that point, the valuable data has already been captured.

However, for e-commerce platforms, a vast amount of constantly changing product pages will need to be scraped to acquire usable data. A lot more daily browsing will need to be done to extract all the requisite data points.

Therefore, e-commerce platforms have more opportunities to detect bot-like behavior. Often they will add more extensive behavioral and fingerprinting-based detection into the mix to flag bots more quickly and accurately.

Intuitively, there’s an understanding that bots will browse in a different manner from humans. Often, speed and accuracy will be the standout features of a bot. After all, just slowing down a bot to human browsing speeds and mannerisms (or even slower!) would be a considerable victory.

Machine learning is almost always used in behavioral detection as a comparison model is required. Data on human browsing patterns is collected and fed to a machine learning model. After enough data has been digested by the machine learning algorithm, it can begin making reasonably accurate predictions.

Human and bot behavior tracking can take numerous routes through both client and server-side features. These will often include:

  • Pages per session
  • Total number of requests
  • User journey (humans will behave in a less orderly fashion)
  • Average time between two pages
  • Resources loaded and suspicious use (for example, browsers sending JS requests but being unable to draw GUI, disabling CSS, etc.)
  • Average scroll depth
  • Mouse movements and clicks
  • Keys pressed

Someone attempting to circumvent sophisticated fingerprinting-based tracking techniques will have to do two things: perform a lot of trial-and-error testing to reverse-engineer the machine learning model and its triggers, and create a crawling pattern that is both evasive and effective.

Trial-and-error testing will be extremely costly. Lots of proxies will receive temporary or even permanent bans before enough data is acquired to get a decent understanding of the model at play.

The e-commerce-platform-specific crawling pattern, of course, will be developed out of an understanding of the model used to flag bot-like activity. Unfortunately, the process will be an endless effort of tinkering around with settings, receiving bans, and changing parameters to get the most out of each crawl.

AI-Powered Dynamic Fingerprinting

What is the best way to combat AI- and ML-based anti-bot algorithms? Creating an AI- and ML-based crawling algorithm. Good data is not hard to come by, as the success and failure points are very cut-and-dried.

Anyone who has done web scraping in the past should already have a decent collection of fingerprints that might be considered valuable. These fingerprints can be stored in a database, labeled, and provided as training data.

However, testing and validation is going to be a little more difficult. Not all fingerprints are created equal and some might receive blocks more frequently than others. Collecting data on success rates per fingerprint and creating a feedback loop will greatly enhance the AI over time.

That’s exactly what you can do using Next-Gen Residential Proxies. They find the most effective fingerprints that result in the least number of blocks without supervision. Our version of AI-powered dynamic fingerprinting involves a feedback loop that can use trial-and-error results to discover combinations of user agents, browsers, and OS that will have a better chance at bypassing detection.
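The feedback loop described above is essentially a multi-armed bandit: mostly reuse the fingerprint with the best observed success rate, occasionally try another to keep learning. The epsilon-greedy sketch below illustrates the general idea; it is an assumption on my part, not Oxylabs' actual algorithm, and the fingerprint names are placeholders.

```python
import random
from collections import defaultdict

class FingerprintSelector:
    """Epsilon-greedy feedback loop over candidate fingerprints.
    A sketch of the general technique, not a specific product's algorithm."""

    def __init__(self, fingerprints, epsilon=0.1):
        self.fingerprints = list(fingerprints)
        self.epsilon = epsilon          # fraction of requests spent exploring
        self.successes = defaultdict(int)
        self.attempts = defaultdict(int)

    def rate(self, fp):
        a = self.attempts[fp]
        return self.successes[fp] / a if a else 0.0

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.fingerprints)   # explore
        return max(self.fingerprints, key=self.rate)  # exploit the best so far

    def record(self, fp, blocked: bool):
        """Feed back the trial-and-error result for one request."""
        self.attempts[fp] += 1
        if not blocked:
            self.successes[fp] += 1

sel = FingerprintSelector(["chrome_win", "firefox_mac", "safari_ios"])
sel.record("chrome_win", blocked=False)
sel.record("firefox_mac", blocked=True)
```

Over many requests, fingerprints that attract blocks are used less and less, which matches the article's "least number of blocks without supervision" behavior.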

AI-powered dynamic fingerprinting solves the primary problem of e-commerce platform scraping: enhanced bot activity detection algorithms. With some injection of AI and machine learning, our proverbial mouse might just stand a chance against the cat.

These technologies, of course, would be helpful for anyone who does web scraping at scale. Understanding how fingerprinting impacts block rates is one of the most important considerations for those who do web scraping.

Adaptive Parsing

Our next step is fixing a different web scraping pain point: parsing. Developing and maintaining a specialized parser for every different target takes a substantial amount of effort. We’d rather leave that to AI.

What we found is that building an adaptive parser requires a ton of labeled data but doesn’t require feedback loops as complex as with AI-powered dynamic fingerprinting. Getting the training data was simple, if a little boring.

We contracted help to manually label all the fields in an e-commerce product page and fed the training data to our parsing machine learning model. Eventually, after some data validation and testing, it reached a stage of allowing its users to deliver data from e-commerce product pages with reasonable accuracy.
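For a sense of what that manual labeling produces, here is a hypothetical shape for one training example: the raw page HTML paired with the values annotators marked for each field. The structure, field names, and sanity check are all illustrative assumptions.

```python
# Hypothetical shape of one labeled training example for an adaptive
# parser. Field names and the HTML snippet are illustrative assumptions.
example = {
    "html": '<div class="p"><h1>Wireless Mouse</h1>'
            '<span class="prc">$24.99</span></div>',
    "labels": {
        "title": "Wireless Mouse",
        "price": "$24.99",
    },
}

# A trivial validation step an annotation pipeline might run:
# every labeled value must actually occur in the page source.
for field, value in example["labels"].items():
    assert value in example["html"], f"label {field!r} not found in HTML"
print("labels consistent")
```

With enough examples like this across many page layouts, a model can learn to locate each field from surrounding markup rather than from site-specific selectors.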

Conclusion

Introducing AI and machine learning to proxy management is becoming inevitable. Real-time bot protection has already begun creating ML models that attempt to separate bots from humans. Manually managing all aspects of proxies (e.g., user agents, rotation, etc.) might become simply too inefficient and costly.

To some, building AI and machine learning models might seem like a daunting task. However, web crawling is a game with countless moving parts. You don’t have to create one overarching overlord-like ML model that would do everything. Start by tending to the smaller tasks (such as dynamic user agent creation). Eventually, you will be able to build the entire web crawling system out of small ML-based models.
