(Photo: ann j p/Flickr)
Here’s the technology feeding cycle of the web startup space at work: A recent joint venture between newcomer 80legs and NLP vendor Language Computer Corp. – the source behind the Swingly semantic search service (still pending an alpha release) that is itself leveraging 80legs’ web content crawling and processing service – has created technology that will figure in the service platform provider’s new Crawl Packages. The pre-configured Crawl Packages will extract interesting, publicly available data from specific websites.
To be precise, the alpha-stage joint venture is called Extractive, and what it does is run semantic crawls across the web. “You point it at any part of the web and it converts any unstructured text into structured semantic data,” says 80legs CEO Shion Deysarkar. “So, rather than having to do manual markup, you can use Extractive to do that, as well as to aggregate semantic data from the entire web.” The combination of LCC’s semantic technologies with 80legs’ web crawling expertise adds up to pre-packaged semantic annotation of blogs and comments, provided as a steady data stream, and feeds of sentiment data taken from product reviews.
“Right now a lot of the bigger companies are doing sentiment analysis with semantic technology driving that on blogs and comments and Twitter,” says Deysarkar. “By providing a crawl package that does that in a pre-determined way, we can enable other, smaller players to use that data as well.” They can get data that is already marked up for them to build their own monitoring platforms, for example – feeding the rich opinion and scoring data generated by authors and their commentators into their own applications.
That semantic Crawl Package will debut a few weeks from now, following in the footsteps of the initial set that goes live starting today. These include Crawl Packages around publicly available profiles on social networks such as LinkedIn, product listings on retail and shopping websites, property listings and geographic information from real estate websites, and business information from company directories. The company says that future packages will also aggregate data from multiple sites, like top 100 blogs, link graphs of the Web, and more. “Any of our users could create these Crawl Packages themselves with time and effort, but a lot of them have asked for this,” says Deysarkar.
80legs has created a separate download app for the data, which is delivered in XML format. The separate app was necessary because the files are pretty large and a lot of files can be generated. Crawls are ongoing in each case, and depending on the site being crawled, postings can occur every 3 to 5 hours. The custom data extractor that determines which data is downloaded is a mix of what the company thinks is interesting and what users have asked for, the idea being to put in as much as seems usable and let customers filter out what they don’t need. The more data – and the more accessible the data – the better, Deysarkar says.
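As a rough illustration of that "deliver everything, let the customer filter" idea, here is a minimal Python sketch that keeps only a chosen set of fields from an XML result file. The article does not describe the actual Crawl Package schema, so the element names (`results`, `result`, `url`, `title`, `price`, `description`), the sample data, and the `keep_fields` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real Crawl Package XML schema is not described in
# the article, so every element name and value here is an assumption.
SAMPLE_XML = """
<results>
  <result>
    <url>http://example.com/item/1</url>
    <title>Widget</title>
    <price>19.99</price>
    <description>A basic widget.</description>
  </result>
  <result>
    <url>http://example.com/item/2</url>
    <title>Gadget</title>
    <price>34.50</price>
    <description>A fancier gadget.</description>
  </result>
</results>
"""


def keep_fields(xml_text, wanted):
    """Return one dict per <result>, keeping only the wanted child elements."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for result in root.iter("result"):
        record = {}
        for child in result:
            if child.tag in wanted:
                record[child.tag] = (child.text or "").strip()
        records.append(record)
    return records


if __name__ == "__main__":
    # Keep only the fields a pricing application might care about.
    for row in keep_fields(SAMPLE_XML, {"title", "price"}):
        print(row)
```

Because the delivered files are described as large, a streaming parser such as `xml.etree.ElementTree.iterparse` would probably be a better fit in practice than loading a whole file into memory as this sketch does.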
For instance, “on social networks, the interesting thing is the profile or what people are saying; on product listing sites, of course, it’s the pricing,” he says. With data from a social network like LinkedIn, customers might build applications that map people’s connections between groups, then go further to build some semantic understanding of how individuals move between organizations and chart that for further insights. Using the Crawl Packages, Deysarkar says, customers can apply their own technology to the data in order to add value. The company notes, for example, that data from retail and shopping websites provides excellent information on how products are priced and sold, and can be leveraged to help customers price their own inventory.
Most of the crawls will produce between 10 and 20 million results per month, he says, and pricing for a crawl package of that size is about $350 per month. Deysarkar says he’s seen other services that provide data of this type for thousands of dollars per month, but 80legs’ grid model gives it the ability to produce data at lower costs and then aggregate it among different customers to further drive down the price.
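To put those figures in per-unit terms, the short Python sketch below works out the implied cost per million results at each end of the quoted range. It takes the "about $350" price and the 10 to 20 million result range at face value; the function name `cost_per_million` is just an illustrative helper, and actual pricing may differ.

```python
# Approximate figures quoted in the article; treat them as rough estimates.
MONTHLY_PRICE_USD = 350
LOW_RESULTS = 10_000_000
HIGH_RESULTS = 20_000_000


def cost_per_million(price_usd, results):
    """Return the price in dollars per one million results."""
    return price_usd / (results / 1_000_000)


if __name__ == "__main__":
    high = cost_per_million(MONTHLY_PRICE_USD, LOW_RESULTS)   # fewer results, higher unit cost
    low = cost_per_million(MONTHLY_PRICE_USD, HIGH_RESULTS)   # more results, lower unit cost
    print(f"${low:.2f} to ${high:.2f} per million results")
    # prints "$17.50 to $35.00 per million results"
```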
Since its launch in September, the company has stayed on track with its timeline and has been growing between 5 and 10 percent per month, according to Deysarkar. Its customer base mirrors the Crawl Packages: it includes social media monitoring services and retail and shopping sites that use 80legs to build out their own aggregation services. One of the challenges in generating new customers, the CEO notes, is that there isn’t a very well-defined distribution strategy for web crawling itself.
“But if we define it in terms of ‘we crawl this site and it produces these results for you,’ that’s more meaningful and compelling to someone,” he says. “So hopefully that will make it easier to grow the business and make it easier for customers to do less work to get what they want.”