The Art of Data Extraction


To be a data-driven business, companies need data. A lot of data.

They have plenty of data in-house to work with, of course — their customer and sales data, data they get from their subscriptions to third-party data providers, and so on. But what about all that data on the web?

Sourcing data from the web to help businesses with analytics has its challenges. Making sure that the data they can access is always up-to-date is even more challenging.

Companies often attempt to tackle the problem in-house by creating their own web scrapers. But websites are constantly changing, and many sites use blocking technologies to discourage bots. Enterprises can wind up spending a huge amount of money and still fail to get the data they need reliably, consistently, and at high quality.

The number of websites is growing rapidly and is expected to exceed 2 billion by next year, according to the Opimas report Web Data Integration: Leveraging the Ultimate Dataset. “The total amount of data lurking on the web will continue to mushroom, and firms who do not successfully harness this source will quickly be left behind by savvier competitors,” the report said.

“That’s where we come in,” said Gary Read, CEO of Import.io, which provides automated data extraction, web data harvesting, data preparation, and data integration services. “Almost all our customers have tried to do this and failed.” Soon after they start writing data scrapers, they realize it’s really hard to do.

“You’re constantly having to maintain and update the code you build to get it to work,” Read explained. That isn’t practical for companies that understandably want to spend their time using data. “There are answers to so many questions on the web, but it’s very difficult to pull that data off and make it useful.”

The stakes are getting higher. Fewer than half of the Fortune 1000 companies or industry-leading firms that responded to a recent NewVantage survey said they are competing on data and analytics, and not even one-third have created a data-driven organization or forged a data culture.

Freedom of the Data

Companies identify the data they need via URLs, or they can use Import.io’s auto suggestion feature to discover appropriate sites for their data needs. Import.io’s web integration process starts with extracting data — whether displayed or hidden, accessible behind a login, existing across multiple pages on the site, or requiring interactions for entry.

An interesting question about which web data may and may not be used involves LinkedIn. A couple of years ago, LinkedIn would send cease-and-desist letters to any company it deemed to be scraping data from its website, on the premise that the company was accessing data in an unauthorized manner. hiQ Labs, a talent management startup, was one of the companies that used automated bots to scrape LinkedIn user information from public profiles. LinkedIn blocked it and the case went to court. A judge ruled in hiQ’s favor in 2017.

Recently the Ninth Circuit Court of Appeals let the ruling stand, noting that there was little evidence that LinkedIn users who make their profiles public have an expectation of privacy with respect to that information.

“It’s really important. It speaks to the bigger issue whether large companies like LinkedIn can build walled gardens on the web and tell people you are not allowed to use this data in any way,” Read said. “But we’re the ones putting the data out there, so why should LinkedIn get to decide that only they can use it for their own apps? This all speaks to openness and freedom on the web.” 

Finding the Trouble Spots

Another issue Import.io and other companies face is that some sites deliberately publish false data to keep competitors from using it successfully. “Everyone gets data off of each other’s websites, for uses such as understanding competitive pricing,” Read said.

Import.io has built-in capabilities to ensure Data Quality and to make sure it’s providing trustworthy data in the face of such tricks. “We’re constantly looking at data, finding anomalies or things that seem to be wrong in data,” he said.

The company over time has built a knowledge base of what particular types of data should look like in order to catch those oddities. For example, if extracting data from a supermarket website exposes a price of $2,000 for a can of tuna, its algorithms will recognize the incongruity. “We use knowledge and data we have collected from millions of websites and feed that back into the product to do data validation and data checking.”
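As a rough illustration of the tuna example, the sketch below flags prices that sit far from the median of comparable items. It is not Import.io’s algorithm, which the article does not describe; the function name, the sample prices, and the 3.5 cutoff (a common rule of thumb for the modified z-score) are all assumptions made for the example.

```python
from statistics import median

def flag_price_anomalies(prices, threshold=3.5):
    """Return the indices of prices that look like outliers.

    Uses the modified z-score, which is built on the median and the
    median absolute deviation (MAD), so one wild value cannot hide itself
    by dragging the average upward. Hypothetical helper for illustration.
    """
    if not prices:
        return []
    med = median(prices)
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        # All typical values are identical; flag anything that differs.
        return [i for i, p in enumerate(prices) if p != med]
    return [
        i for i, p in enumerate(prices)
        if 0.6745 * abs(p - med) / mad > threshold
    ]

# Made-up prices for a can of tuna scraped from several listings.
tuna_prices = [1.79, 1.99, 2.09, 2.00, 1.89, 2000.00]
print(flag_price_anomalies(tuna_prices))  # [5], the $2,000 listing
```

A flagged index would be the point at which a person could be asked to review the record instead of the value being dropped automatically.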

A human may be introduced into the loop if necessary.

“Machines show us the anomalies and the human is there to interpret that information,” he said. “Companies are paying us to get them quality data from the web; data they can use to make critical business decisions.”

Import.io delivers the data in the specific format and schema each customer requests.

“You can take oil out of the ground but until it’s refined, it’s not usable,” Read said. “It’s the same thing here. You can take data off the website but you have to refine it and transform it to make it the right fit for the customer.”
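The “refining” step can be pictured as mapping raw scraped fields onto the schema a customer asked for. The sketch below is only a minimal example of that idea; the field names, the mapping format, and the helper functions are hypothetical and do not come from Import.io.

```python
import json

def parse_price(text):
    """Turn a display price such as '$2,000.00' into a float."""
    return float(text.replace("$", "").replace(",", ""))

def to_customer_schema(raw, field_map, converters=None):
    """Rename raw fields to the customer's field names and clean values.

    field_map maps target field -> source field in the raw record.
    converters maps target field -> function applied to the raw value.
    Missing source fields come through as None.
    """
    converters = converters or {}
    record = {}
    for target, source in field_map.items():
        value = raw.get(source)
        if value is not None and target in converters:
            value = converters[target](value)
        record[target] = value
    return record

# A made-up raw record as it might come off a product page.
raw = {"title": "  Chunk Light Tuna ", "price_text": "$1.99"}
record = to_customer_schema(
    raw,
    field_map={"name": "title", "price_usd": "price_text"},
    converters={"name": str.strip, "price_usd": parse_price},
)
print(json.dumps(record))
```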

Focus on Data Privacy

The company has to be careful, for its customers’ sakes, that privacy regulations are followed when extracting data. That’s particularly true for companies collecting data about EU citizens, which is subject to the GDPR.

The product can automatically scan all the data it collects for personally identifiable information and mask that data out for legal and ethical reasons. It’s never stored or shown to anyone.
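To make the masking idea concrete, here is a deliberately simple sketch that replaces email addresses and phone-like numbers with placeholders. The patterns and the placeholder labels are assumptions for illustration; real detection of personally identifiable information covers far more cases (names, addresses, ID numbers) than two regular expressions can.

```python
import re

# Simplified patterns, chosen for readability, not completeness.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text):
    """Replace emails and phone-like numbers with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(mask_pii("Contact jane.doe@example.com or +1 (555) 010-9999."))
# Contact [EMAIL] or [PHONE].
```

Masking before the data is written anywhere is what keeps the original values from being stored.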

Import.io also honors robots.txt files, which websites use to tell search engines and other crawlers which of their pages may be crawled and indexed. If the website owner doesn’t want those pages crawled, Import.io won’t fetch them at all.
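A crawler can check robots.txt rules with Python’s standard library before requesting a page. The sketch below parses a sample file from a string so it runs without network access; the bot name, the example.com URLs, and the rules themselves are made up, and nothing here is claimed to be how Import.io implements the check.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that closes off one directory to all crawlers.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# "example-bot" is a hypothetical user agent name.
print(parser.can_fetch("example-bot", "https://example.com/products/tuna"))   # True
print(parser.can_fetch("example-bot", "https://example.com/private/report"))  # False
```

In a live crawler, `set_url()` followed by `read()` would fetch the site’s actual robots.txt instead of a hard-coded string.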

Additionally, it has features in place that keep it from overloading different websites. Some websites — say, a small business site — may only be able to sustain 10 or so concurrent users.

“We have to make sure we only get data in a way that doesn’t negatively affect the performance of the website,” Read said. “The engine in our product understands sites and their performance and only collects data at the rate that it won’t negatively affect that. We want to be good web citizens.”
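One simple way to avoid overloading a small site is to enforce a minimum pause between requests to the same host. The sketch below shows that idea only; the class name and the fixed interval are assumptions, and the adaptive behavior Read describes (an engine that understands each site’s performance) is not modeled here.

```python
import time

class PoliteThrottle:
    """Enforce a minimum interval between requests to the same host.

    Hypothetical helper. The clock and sleep functions are injectable
    so the behavior can be checked without actually waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the last request

    def wait(self, host):
        """Block until the host may be contacted again; return the delay."""
        delay = 0.0
        last = self._last_request.get(host)
        if last is not None:
            elapsed = self._clock() - last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = self._clock()
        return delay

# Usage: call wait() right before each request to a given host.
throttle = PoliteThrottle(min_interval=0.05)
for path in ["/a", "/b", "/c"]:
    waited = throttle.wait("small-shop.example")
    print(f"fetching {path} after waiting {waited:.3f}s")
```

A per-host interval is the bluntest possible control; limiting concurrent connections or slowing down when response times rise would be natural refinements.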

Use Cases for Web Data

Import.io’s solutions cover equity research, ecommerce and retail, online travel, sales and marketing, and risk management.

Most of the time customers will ask for a set of data to be delivered on a regular basis — daily, weekly, or monthly. A minority ask for even more rapid notices. For example, if the customer is monitoring a news site, it may want to know whenever something changes on it, and to have that piece of news delivered to it within 20 minutes of being published on that site.
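Detecting that “something changed” on a monitored page can be as simple as comparing a hash of the current content with the hash from the previous visit. The sketch below assumes the page text has already been fetched; the class name, the URL, and the headlines are hypothetical, and polling schedules and delivery are left out.

```python
import hashlib

def fingerprint(content):
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

class ChangeDetector:
    """Remember the last fingerprint seen for each URL."""

    def __init__(self):
        self._seen = {}  # url -> digest from the previous check

    def has_changed(self, url, content):
        """Return True if the content differs from the last check.

        The first time a URL is seen counts as a change, since there is
        nothing to compare against.
        """
        digest = fingerprint(content)
        changed = self._seen.get(url) != digest
        self._seen[url] = digest
        return changed

detector = ChangeDetector()
url = "https://news.example/front-page"  # made-up URL
print(detector.has_changed(url, "Headline A"))  # True, first visit
print(detector.has_changed(url, "Headline A"))  # False, nothing new
print(detector.has_changed(url, "Headline B"))  # True, new story
```

In practice the hash would be taken over the extracted article text only, since ads and timestamps would otherwise change the digest on every visit.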

“It’s really fascinating to see all the different use cases web data can be put toward,” Read said. “Dynamic pricing engines. Building training models. Sentiment analysis. Almost every day we discover a new use case.”

