
Lexalytics Chows Down on Wikipedia To Improve Text and Sentiment Analytics

By Jennifer Zaino  /  April 18, 2011

Lexalytics, whose text and sentiment analytics engine underpins media monitoring solutions from vendors such as Radian6 (the recent acquisition target of Salesforce.com) and Scout Labs, has a new version of its Salience software. Salience 5.0 digested every bite of knowledge inside Wikipedia and built an understanding of the relationships between words and meaning. The result is its new Concept Matrix and dependent capabilities such as Facets and Collections.

“The common complaint from customers is, ‘There’s my tag cloud and it sucks,’” says Jeff Catlin, CEO of Lexalytics. A typical tag cloud separates concepts that are contextually similar rather than grouping them together and linking them to the broader categories to which they are implicitly related. A 9-iron, a golf club and a driver all belong to the common concept of golf club, and that concept relates to other concepts like recreation and outdoors. “Knowing that they are semantically related takes the tag cloud from this unwieldy thing to a directed view of what’s going on.”

The automatic roll-up that Salience 5.0 performs with its Concept Matrix is the foundational layer for other capabilities such as Collections, which uses the metadata across many documents to produce enhanced results. As Catlin explains it, this gives users a way to get on-the-fly insight out of a large set of documents, with no need to pump information into a database first to get to the stats. “That lets us ask questions that were almost impossible to ask before,” he says. Users can find out from a processed data set the ten individuals associated with the most negative sentiment across the documents, or the ten percent of documents that have the highest sentiment, or ask any other statistically oriented question about the collection.
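The two example questions above can be sketched in a few lines of Python. This is only an illustration of collection-level statistics computed in memory, not the Salience API: the document records, the entity names, the scores and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-document results; ids, names and scores are made up.
docs = [
    {"id": "d1", "sentiment": 0.8, "entities": {"Alice": -0.6, "Bob": 0.4}},
    {"id": "d2", "sentiment": -0.3, "entities": {"Alice": -0.2, "Carol": -0.9}},
    {"id": "d3", "sentiment": 0.1, "entities": {"Bob": -0.1, "Carol": -0.5}},
    {"id": "d4", "sentiment": 0.5, "entities": {"Dave": 0.7}},
]

def most_negative_entities(docs, n=10):
    """Sum each entity's sentiment across all documents; return the n lowest totals."""
    totals = defaultdict(float)
    for doc in docs:
        for name, score in doc["entities"].items():
            totals[name] += score
    return sorted(totals.items(), key=lambda item: item[1])[:n]

def top_fraction_by_sentiment(docs, fraction=0.10):
    """Return ids of the top `fraction` of documents by sentiment (at least one)."""
    ranked = sorted(docs, key=lambda d: d["sentiment"], reverse=True)
    count = max(1, int(len(ranked) * fraction))
    return [d["id"] for d in ranked[:count]]

if __name__ == "__main__":
    print(most_negative_entities(docs, n=2))
    print(top_fraction_by_sentiment(docs, fraction=0.5))
```

Because everything is aggregated directly from the processed results, no database load is needed before asking the question, which is the point Catlin makes about Collections.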

Facets is the engine’s approach to picking out the handful of most important ideas or subjects within the data, along with their pertinent attributes, with the help of the Concept Matrix. Take a collection of cruise review comments: ship would be identified as the most important subject or concept, and the ways people talk about the ship are its attributes. “That’s important because this becomes actionable information,” Catlin says. Users can roll up different views of the data to discover, for instance, which attribute really is the most important for a particular concept. It may appear that the most important thing to reviewers about a ship is its restaurants, but the reality may be that large cabins matter more. The software can discover this because it automatically accounts for the fact that ‘cabins’ and ‘stateroom’ are the same concept, and that attributes around them such as ‘roomy’ and ‘spacious’ roll up into the same idea.

“The difference from what you can do today is that today this takes a little more work. We want to make it pop out on its own for you, instead of your having to pull out all the themes, roll them up somewhere and hope that some users didn’t say ‘cabin’ and others say ‘stateroom’ because those wouldn’t get grouped together,” he says. “You know they go together but most software hasn’t known that. Once you have that knowledge you can do this neat stuff.”
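A minimal sketch of the cabin/stateroom roll-up described above. The two lookup tables are hand-written stand-ins for the knowledge the article says the engine derives from Wikipedia, and the mentions are invented; none of this reflects how Salience is actually implemented.

```python
from collections import Counter, defaultdict

# Hypothetical mappings from surface words to a shared concept.
SUBJECT_CONCEPT = {"cabin": "cabin", "cabins": "cabin", "stateroom": "cabin",
                   "restaurant": "restaurant"}
ATTRIBUTE_CONCEPT = {"roomy": "spacious", "spacious": "spacious", "large": "spacious"}

# Invented (subject, attribute) pairs as they might be pulled from reviews.
mentions = [("restaurant", "tasty"), ("restaurant", "tasty"),
            ("cabin", "roomy"), ("stateroom", "spacious"), ("cabins", "large")]

def raw_top_subject(mentions):
    """Most frequent subject when surface words are counted as written."""
    return Counter(subject for subject, _ in mentions).most_common(1)[0][0]

def roll_up(mentions):
    """Group mentions by concept, counting attributes after mapping them too."""
    facets = defaultdict(Counter)
    for subject, attribute in mentions:
        concept = SUBJECT_CONCEPT.get(subject, subject)
        facets[concept][ATTRIBUTE_CONCEPT.get(attribute, attribute)] += 1
    return facets

def rolled_up_top_subject(mentions):
    """Most frequent subject once synonyms share a concept."""
    facets = roll_up(mentions)
    return max(facets, key=lambda concept: sum(facets[concept].values()))

if __name__ == "__main__":
    print(raw_top_subject(mentions))        # counts words as written
    print(rolled_up_top_subject(mentions))  # counts shared concepts
    print(dict(roll_up(mentions)["cabin"]))
```

Counted as written, 'restaurant' leads because the cabin mentions are split across three different words; once they share a concept, cabins come out ahead, with 'spacious' as the dominant attribute.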


What Catlin calls the candy feature is Conceptual Topics, which lets you define a classification subject with only a few simple concept keywords. Rather than having to define all the concepts that would go into a bucket called ‘food’ (either manually, where you’re likely to miss a bunch of terms, or by trying to build a model from examples of a host of food-related documents), this feature sweeps in phrases ranging from gourmet pizza to cool new bar for a food bucket that is defined as nothing more than food, restaurant and dining, without training. “Just give it a label and off it goes,” he says, adding that it also understands from ingesting Wikipedia that the word ‘clams’ in ‘clams casino’ is a food but the word ‘clams’ in ‘digging clams on the beach’ is an animal and so wouldn’t wind up in a Conceptual Topic for food.

Such capabilities will simplify applications such as measuring whether the messages a company wants to get across really are getting across. Say marketers want to track whether the points they are calling out for a new cell phone campaign (long battery life, lightweight, and inexpensive) actually get picked up in coverage. A keyword-only approach might fail to pick up the reviews that use terms like cheap and low-cost rather than inexpensive. With Salience 5.0, you can just give it the message labels, and it knows enough from ingesting Wikipedia that cheap and low-cost have a relationship with inexpensive.
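The gap between keyword matching and concept matching in this example can be shown with a short Python sketch. The table of related terms is hand-written for illustration (the article says Salience derives such relationships from Wikipedia), the reviews are invented, and the substring matching is deliberately crude.

```python
# Hypothetical table of related terms per message label. Only the
# 'inexpensive' synonyms come from the article's example.
MESSAGE_TERMS = {
    "long battery life": {"long battery life"},
    "lightweight": {"lightweight"},
    "inexpensive": {"inexpensive", "cheap", "low-cost"},
}

def keyword_hits(review, labels):
    """Labels that appear literally in the review text."""
    text = review.lower()
    return {label for label in labels if label in text}

def concept_hits(review, message_terms):
    """Labels for which any related term appears in the review text."""
    text = review.lower()
    return {label for label, terms in message_terms.items()
            if any(term in text for term in terms)}

if __name__ == "__main__":
    reviews = [
        "A cheap, lightweight phone.",
        "This low-cost handset has long battery life.",
    ]
    for review in reviews:
        print(sorted(keyword_hits(review, MESSAGE_TERMS)),
              sorted(concept_hits(review, MESSAGE_TERMS)))
```

In both invented reviews the literal keyword search misses the 'inexpensive' message, while the concept lookup credits it through 'cheap' and 'low-cost'.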

The update also offers improved sentiment scoring to help distinguish “buzz” comments from actual opinion. For instance, the comments of someone who says they think the new Batman movie will be really great will carry less weight than those of someone who says the new Batman movie is really great. “The latter is opinion based on having done it, so it should have higher credibility than the other one,” Catlin says.

Plans for the engine’s technology include verticalizing it for specific domains such as pharmaceuticals and patent data. Salience 5.0 will be in beta early this summer and enter general availability later in the season.

About the author

Jennifer Zaino is a New York-based freelance writer specializing in business and technology journalism. She has been an executive editor at leading technology publications, including InformationWeek, where she spearheaded an award-winning news section, and Network Computing, where she helped develop online content strategies including review exclusives and analyst reports. Her freelance credentials include being a regular contributor of original content to The Semantic Web Blog; acting as a contributing writer to RFID Journal; and serving as executive editor at the Smart Architect Smart Enterprise Exchange group. Her work also has appeared in publications and on web sites including EdTech (K-12 and Higher Ed), Ingram Micro Channel Advisor, The CMO Site, and Federal Computer Week.
