If you’ve redeemed points at the Chase Ultimate Rewards web site or booked business travel through American Express’ AXIOM service, you’ve already had some experience with Rearden Commerce’s Deem e-commerce services platform. In the not-too-distant future, such marketplace experiences may be further informed by semantic technology to deliver even greater personalization.
VP of analytics Steve Bernstein’s duties include leading traditional quantitative analysis and predictive modeling – the kind of work that last week resulted in the company providing an analysis of ten years of big data on domestic flight performance to discover everything from the best day of the year to fly (Oct. 3, thanks to only 7 in 1,000 arrivals or departures being late, cancelled or diverted) to the worst arrival performance for a major airport (EWR, in Newark, N.J.). But he spends the rest of his time working with other parts of his team to structure unstructured data with the help of semantics and natural language processing technology so that it can serve as input for predictive modeling.
Today, structured information for traveler services that use the Deem platform comes from third-party feeds, such as hotel information from providers like Orbitz that consists of a couple of hundred pieces of information on over 100,000 hotels. That’s helpful for pointing users to guest accommodations at their preferred location or price, for instance, but not so much for helping them book a quiet hotel room. That’s where his team’s work crawling and capturing over 5 million user-generated hotel reviews, loaded up into a Hadoop file system, can come into play to better serve users’ needs.
“The secret sauce is using hotel reviews and pulling from that large body of reviews the concepts that come up very frequently that aren’t in the feeds,” Bernstein says – concepts like noisiness – and then using NLP and semantic wizardry to normalize them. “People use many different ways to say a hotel is noisy, sometimes without even using the word ‘noise.’ So the concept in its raw form is very fragmented, and you have to pull terms and words that mean a hotel is noisy into one structured representation that it’s noisy. Then you can use sentiment scoring to say we have an attribute of noise level, let’s score these with respect to how they do there.”
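To make the idea concrete, here is a minimal sketch of that normalization step: many different phrasings are mapped onto a single "noise level" attribute and averaged into a score. The phrase table, the polarity values and the function name are all hypothetical illustrations; nothing here reflects Rearden's actual engine, which presumably relies on far richer NLP than keyword patterns.

```python
import re

# Hypothetical phrase-to-polarity table for one concept ("noise level").
# A production system would derive such terms from the review corpus itself.
NOISE_PHRASES = {
    r"\bnois(?:y|e)\b": -1.0,
    r"\bthin walls\b": -1.0,
    r"\bcould(?:n't| not) sleep\b": -1.0,
    r"\bquiet\b": 1.0,
    r"\bpeaceful\b": 1.0,
}

def noise_level_score(reviews):
    """Average the polarity of every noise-related phrase found in the reviews.

    Returns None when no review mentions the concept at all.
    """
    hits = []
    for text in reviews:
        lowered = text.lower()
        for pattern, polarity in NOISE_PHRASES.items():
            # One hit per occurrence of the phrase in this review.
            hits.extend([polarity] * len(re.findall(pattern, lowered)))
    return sum(hits) / len(hits) if hits else None

# Made-up reviews for two imaginary hotels.
hotel_a = ["Thin walls and we could not sleep.", "Very noisy at night."]
hotel_b = ["Quiet, peaceful room.", "Great pool."]

print(noise_level_score(hotel_a), noise_level_score(hotel_b))
```

Note that the first hotel is scored as noisy even though one of its reviews never uses the word "noise." A keyword table like this would still stumble on negation ("not noisy at all"), which is one reason real sentiment engines need more than pattern matching.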
Rearden is in the process of tuning what it’s internally describing as its sentiment scoring engine to understand negative, positive or neutral opinions. End users may never detect that such an engine is in place per se, as its output would instead be used as input into a predictive model that works on their behalf. So, if a user has transacted with a service a few times, the model would learn from his choices that he tends to like hotels with certain characteristics – those that are distinguished by a really great pool, for instance.
“Once semantic-derived data is injected into the process, new patterns will emerge that might make it easier for us to sort a list of hotels for that user even better suited to his tastes,” Bernstein says. “Structured data feeds tend to be binary in the sense that this hotel has wireless or not. But when we inject semantically-derived data, then we could say this hotel has a great or a lousy pool. Without that semantic data there may be no pattern in your choices we can detect, but when we add in new characteristics of a hotel that we can infer, a pattern might emerge – that you pick hotels with really good pools, so now we are sorting with a stronger weighting to the quality of the pool.”
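A rough sketch of how such re-sorting might work, under simplifying assumptions: each hotel is a dictionary of attribute scores, a user's preference weights are just the average attributes of hotels he chose before, and candidates are ranked by a weighted sum. The attribute names, the numbers and the averaging scheme are invented for illustration and are not Rearden's predictive model.

```python
def learn_weights(chosen_hotels):
    """Average each attribute over the hotels the user picked in the past."""
    attributes = {name for hotel in chosen_hotels for name in hotel}
    return {
        name: sum(hotel.get(name, 0.0) for hotel in chosen_hotels) / len(chosen_hotels)
        for name in attributes
    }

def rank(candidates, weights):
    """Sort hotel names by the weighted sum of their attribute scores, best first."""
    def score(item):
        _, attrs = item
        return sum(w * attrs.get(name, 0.0) for name, w in weights.items())
    return [name for name, _ in sorted(candidates.items(), key=score, reverse=True)]

# Past choices: "wifi" is the binary feed attribute, "pool_quality" is the
# hypothetical review-derived score in the range [-1, 1].
past_choices = [
    {"wifi": 1.0, "pool_quality": 0.9},
    {"wifi": 1.0, "pool_quality": 0.8},
]
candidates = {
    "Hotel A": {"wifi": 1.0, "pool_quality": -0.5},
    "Hotel B": {"wifi": 1.0, "pool_quality": 0.9},
}

weights = learn_weights(past_choices)
ranking = rank(candidates, weights)
print(ranking)
```

With only the binary wifi attribute, the two candidates would be indistinguishable. Once the review-derived pool score is included, the hotel with the better pool moves to the top for this user.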
Getting to this point has involved creating training sets to understand sentiment language for each of the domains in which the company deals – hotels, travel, and so on. It’s tricky work, because it has to avoid a familiar problem in data and text mining: over-fitting a model. “You could take a training set and tweak the software to totally replicate what the set does, but then that may not generalize well because it’s over-fit to that training set. One way to avoid that, if there’s more than one training set for more than one domain, is that you tweak it for hotel reviews and then run, say, the restaurant reviews through it, and see if that’s damaged the software’s ability to match results there,” he explains. “Because if you improved hotels and damaged restaurants, that means you have overfit to hotels at the expense of the general model.” The core semantic scoring engine has to perform reasonably well on everything and have on top of that a pluggable set of additional rules for each domain to suit it to that task.
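The cross-domain check Bernstein describes can be sketched as a small regression test: score the old and the tweaked classifier against every domain's labeled set and flag any domain where accuracy dropped. The two toy classifiers, the labels and the sample reviews below are hypothetical stand-ins, not the company's software or data.

```python
def accuracy(classify, labeled):
    """Fraction of (text, label) pairs the classifier gets right."""
    correct = sum(1 for text, label in labeled if classify(text) == label)
    return correct / len(labeled)

def regressions(old, new, training_sets, tolerance=0.0):
    """Return the domains where the tweaked classifier does worse than the old one."""
    damaged = {}
    for domain, labeled in training_sets.items():
        before, after = accuracy(old, labeled), accuracy(new, labeled)
        if after < before - tolerance:
            damaged[domain] = (before, after)
    return damaged

def base(text):
    """Toy general-purpose sentiment rule set."""
    if "great" in text:
        return "pos"
    if "terrible" in text:
        return "neg"
    return "neutral"

def tweaked(text):
    """Base rules plus a hotel-motivated tweak: 'small' is treated as negative."""
    if "small" in text:
        return "neg"
    return base(text)

training_sets = {
    "hotels": [("great pool", "pos"), ("small room", "neg"), ("terrible service", "neg")],
    "restaurants": [("great pasta", "pos"), ("small plates menu", "neutral"), ("terrible wine", "neg")],
}

damaged = regressions(base, tweaked, training_sets)
print(sorted(damaged))
```

Here the tweak lifts hotel accuracy but mislabels a neutral restaurant review, so restaurants are reported as damaged. In the arrangement the article describes, a rule like this would belong in the pluggable hotel-specific layer and not in the core engine.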
Bernstein expects that these capabilities will become public in the first half of this calendar year in the hotels domain, through all its channel partners and every direct service it offers in the space.