Bringing structured data to users from unstructured web content: that’s what Webhose.io is offering with a new API service that works with millions of blogs, forums, review and news sites, and comment posts. The company is applying its technology roots as a message board search engine, Omgili, to the work of delivering posts’ clean text content, dates, authors, links, language and so on in JSON, XML and RSS formats rather than as unstructured content in an HTML file.
Doing its job in the message board space, a world untouched by microformats and other semantic web structures, required Omgili to create a crawler that used heuristic techniques to extract text, titles and other details from those posts, Geva says. “Once we created that we were able to extract data from less complex sources like blogs, news sites and so on,” he says.
“The output is similar to what import.io does, but on a much larger scale,” says founder Ran Geva. (See The Semantic Web Blog post on that technology here.) The import.io service, he says, is great when users want specific data from one or two sites, but heavy lifting across a large volume of data requires more. “We save and download millions of posts per day,” he says. “When it comes to getting structured content out of complicated sources on a mass scale, that’s a unique technology challenge we conquered a while ago.”
Geva’s company was acquired by Buzzilla Ltd., and Omgili’s crawling technology is currently licensed by big brand-monitoring companies and prominent IT vendors like Salesforce, among others. Webhose was launched about a month and a half ago to bring its services to a wider community, including smaller companies, he says. Structuring real-time data feeds so that they can be integrated into databases, data warehouses, social media analytics and media monitoring services is a must, he says, if the information gained is to be analyzed effectively. “Otherwise you just have a lot of text mingled together and can’t separate anything from anything,” he says.
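To make the integration point concrete, here is a minimal sketch of loading structured post records into a database and querying them. The field names (url, author, language, published, text) and the sample records are assumptions made for illustration, not Webhose.io’s actual output schema, and SQLite stands in for whatever database or warehouse a user might run.

```python
import sqlite3

# Hypothetical records shaped like a structured feed.
# The field names and values are illustrative assumptions.
SAMPLE_POSTS = [
    {"url": "http://example.com/a", "author": "Jane Doe", "language": "english",
     "published": "2014-09-12", "text": "First post."},
    {"url": "http://example.com/b", "author": "Jane Doe", "language": "english",
     "published": "2014-09-13", "text": "Second post."},
    {"url": "http://example.com/c", "author": "John Roe", "language": "english",
     "published": "2014-09-13", "text": "A reply."},
]


def load_posts(conn, posts):
    """Create the table if needed and upsert each post, keyed by URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "url TEXT PRIMARY KEY, author TEXT, language TEXT, "
        "published TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO posts "
        "VALUES (:url, :author, :language, :published, :text)",
        posts,
    )
    conn.commit()


def posts_per_author(conn):
    """Return (author, post_count) pairs, sorted by author name."""
    query = "SELECT author, COUNT(*) FROM posts GROUP BY author ORDER BY author"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_posts(connection, SAMPLE_POSTS)
    print(posts_per_author(connection))
```

Because each field arrives separately, a question such as “how many posts per author?” becomes a one-line query, which is not possible when author, date and body text are mingled in raw HTML.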
The basic idea behind what Webhose.io does, he says, revolves around learning and analyzing platforms. “Many times sites are built on platforms that are being reused – modified but reused,” he says, noting as an example that the HTML structure of one blog is very similar to that of another. “So you learn how the site looks behind the scenes, and create pattern recognition that is robust to changes.” Its parser looks at the flat HTML and uses regular expressions to detect where content starts and ends, and heuristics to find the date, author and other data belonging to a specific post. “So it’s more going over text and analyzing the semantic structure of HTML to understand where data resides,” he says. Its parsing approach, which avoids relying on the Document Object Model (DOM) structure and on DOM parsing, helps with speed when dealing with millions of pages and doesn’t break if the site changes, he says.
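As a rough illustration of that kind of DOM-free extraction, the sketch below runs regular expressions over flat HTML to find where a post body starts and ends, then applies simple heuristics for author and date. It is not Webhose.io’s parser: the class names (post-body, author), the ISO date format and the sample page are all assumptions invented for one imaginary blog platform.

```python
import re
from html import unescape

# Hypothetical patterns for one imagined blog platform. In practice the
# patterns would have to be learned for each platform.
CONTENT_RE = re.compile(
    r'<div[^>]*class="[^"]*\bpost-body\b[^"]*"[^>]*>(.*?)</div>',
    re.S | re.I,
)
AUTHOR_RE = re.compile(
    r'class="[^"]*\bauthor\b[^"]*"[^>]*>\s*([^<]+?)\s*<',
    re.I,
)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumes ISO-style dates
TAG_RE = re.compile(r"<[^>]+>")

SAMPLE_HTML = """<html><body>
<div class="entry">
<span class="meta author">Jane Doe</span> <span class="date">2014-09-12</span>
<div class="post-body"><p>Hello &amp; welcome.</p><p>Second line.</p></div>
</div></body></html>"""


def extract_post(html):
    """Pull text, author and date out of raw HTML without building a DOM."""
    content = CONTENT_RE.search(html)
    author = AUTHOR_RE.search(html)
    date = DATE_RE.search(html)

    text = None
    if content:
        # Drop the tags, decode entities, then collapse runs of whitespace.
        stripped = TAG_RE.sub(" ", content.group(1))
        text = " ".join(unescape(stripped).split())

    return {
        "text": text,
        "author": author.group(1) if author else None,
        "published": date.group(0) if date else None,
    }


if __name__ == "__main__":
    print(extract_post(SAMPLE_HTML))
```

A pattern this simple stops at the first closing div, so nested markup inside the post body would truncate the text. A production crawler would need sturdier boundary detection and many more heuristics, but the sketch shows why scanning text can be faster than parsing a full DOM and why it can tolerate layout changes elsewhere on the page.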
Webhose has a variety of pricing plans, from $39 a month for light usage to a $500-a-month premium service with 300,000 monthly requests and 100 results per page.