[Editor's Note: This interview, conducted by guest Sean Golliher, is our first in the new series entitled "Innovation Spotlight." It's part of our initiative to introduce the semantic web community to innovative companies working on important problems using Semantic Technologies.
If you would like your company to be considered for an interview please email editor[ at ]semanticweb[ dot ]com.]
Alyona Medelyan (@zelandiya) joined Pingar (@PingarHQ) in 2010 and is the company's chief research officer. She has a PhD in Natural Language Processing, completed at the University of Waikato and funded by Google. Her areas of expertise are keyword and entity extraction, as well as Wikipedia mining.
In this interview we find out more about Pingar's research, their products, and the clients they work with.
Sean: Hi Alyona. Thanks for speaking with us today. When was Pingar founded and can you explain a little bit about what Pingar does?
Alyona: Pingar was founded in 2007 and in the past 5 years we have developed innovative software for document management and text analytics. I joined the company in 2010 and have been focusing more specifically on automated metadata assignment by adding keyword extraction, named entity recognition and taxonomy mapping capabilities.
Sean: What techniques do you use for keyword extraction and named entity recognition? Are you using any existing databases to aid with entity recognition?
Alyona: For keyword extraction, we are using techniques that are generic enough to handle documents from various sources. This means that we look at typical properties of keywords in any document and any vertical: position, frequency and statistically derived importance of a phrase in a language in general. For named entity recognition we use a Machine Learning classifier that combines many different properties of typical words that appear within a name and takes into account the word order in a sentence. We use state-of-the-art techniques, but enhance them with unique knowledge derived automatically from existing databases such as Wikipedia and Freebase. We also have demos of our API on our website at http://apidemo.pingar.com/ .
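[Editor's illustration: The three keyword properties Alyona mentions (position, frequency, and statistical importance relative to the language in general) can be combined into a simple score. The sketch below is a toy example and not Pingar's algorithm: the background frequency table, the default rate for unseen words, the scoring formula, and the function name `score_keywords` are all assumptions made for illustration, and it ranks single words rather than phrases.

```python
import math
import re
from collections import Counter

# Hypothetical background rates (occurrences per million words of general text).
# These numbers are made up for illustration, not taken from a real corpus.
BACKGROUND_PER_MILLION = {
    "the": 50000.0, "of": 30000.0, "and": 28000.0, "a": 23000.0,
    "in": 20000.0, "is": 10000.0, "for": 9000.0, "from": 4000.0,
}
DEFAULT_PER_MILLION = 1.0  # assumed rate for any word missing from the table


def score_keywords(text, top_n=3):
    """Rank words by frequency, early position, and rarity in general language."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return []
    total = len(words)
    counts = Counter(words)

    # Index of the first occurrence of each word.
    first_pos = {}
    for i, word in enumerate(words):
        first_pos.setdefault(word, i)

    scores = {}
    for word, count in counts.items():
        tf = count / total  # frequency within this document
        expected = BACKGROUND_PER_MILLION.get(word, DEFAULT_PER_MILLION) / 1_000_000
        # How much more frequent the word is here than in general text.
        rarity = max(0.0, math.log(tf / expected))
        # Words first mentioned early in the document get a boost (0..1).
        position = 1.0 - first_pos[word] / total
        scores[word] = tf * rarity * (1.0 + position)

    return sorted(scores, key=scores.get, reverse=True)[:top_n]


sample = ("Taxonomy generation builds a taxonomy from documents. "
          "The taxonomy groups the topics in the documents.")
top_words = score_keywords(sample)
```

In this sample, frequent function words such as "the" rank low because they are just as common in general text, while "taxonomy" ranks first because it is frequent, appears at the very start, and is rare in the assumed background table.]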
Sean: You have done some interesting work with Twitter data recently. What are you doing with social data?
Alyona: The goal of our experiment was to find out how well the Pingar API works on social data. There are many open-source and commercial products out there that attempt to determine sentiment in tweets, but what is interesting to find out is which entity that sentiment is attached to. We applied Pingar’s keyword extraction to determine trending topics in positive and negative tweets. While achieving acceptable sentiment accuracy was challenging, some results were insightful. For example, when looking at tweets from June 16th containing the word “Google”, the prevalent topics in the positive tweets are “browser” and “chrome”, whereas “search engine” is trending among the negative tweets.
Sean: Social media networks are obviously creating large amounts of unstructured data. Where else are you seeing a high need for entity extraction on the web?
Alyona: The news continues to be the main area in need of entity extraction on the web. Interestingly, there are several companies out there that utilize named entity extraction to mine news and present it meaningfully on social networks, e.g. Prismatic and Wavii. News providers themselves rarely take advantage of text analytics tools, but they should. Other areas of the web that produce a lot of unstructured data are forums, online shops and auctions, and product review sites. Currently, searching in these areas is still limited to phrase matching on Google and the results are only really good for popular forum questions and products.
Sean: What is your business model and how do you charge for your services?
Alyona: Last year we released the Pingar API that provides access to our algorithms via a web service. We work closely with partners who integrate our API into existing business solutions for their customers. Other companies use our cloud service and pay per usage.
Sean: What are companies doing with your API?
Alyona: Our customers are organizations that have large volumes of unstructured data. It could be masses of documents that have been accumulated over the years and must be stored and retrieved efficiently. It could also be a constant stream of unstructured text that needs to be analyzed and understood. Customers such as Deloitte NZ and LegiNation use Pingar to transform large volumes of continuously updated unstructured data in a wide variety of formats into structured data that is useful for analyzing trends, tracking information and reporting.
Sean: Currently you are working on algorithms that can dynamically generate enterprise taxonomies. What is your goal with this?
Alyona: In the past 2 years I have been presenting Pingar’s technology at several document management and big data conferences and have always been asked the same question: Can you help us generate custom taxonomies automatically? So, last year our research team took up this challenge. The algorithm that we have developed takes as an input a collection of documents and creates a tailored multi-level hierarchical taxonomy that groups the most prominent topics mentioned in these documents. We feed into this algorithm Linked Data such as Wikipedia, DBpedia and Freebase, as well as existing taxonomies crafted for specific verticals by experts. Additional data comes from Pingar’s named entity extraction and statistical phrase extraction algorithms. The goal for the generated taxonomy is to use it as a navigation tool for that document collection, and also for other tasks, for example, content audit. Our early research and testing are quite promising, and we expect to release the Pingar Taxonomy Generator later this year.
Sean: How does your company fit in with the semantic web and/or big data?
Alyona: The Semantic Web is about defining the meaning of entities in terms of unique identifiers accessible to algorithms from any point on the Web, which will allow solving complex tasks that require reasoning. We have come a long way towards this goal. Linked Data such as Freebase or GeoNames already provide such identifiers for millions of entities. What Pingar does is allow unstructured data to be annotated with these identifiers automatically, which is the basis of our dynamic taxonomies research. By adding meaning to unstructured data we make it more useful for further analysis, for example through data mining techniques and content visualization.
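[Editor's illustration: Annotating unstructured text with entity identifiers, as described above, can be pictured with a small dictionary lookup. This is a minimal sketch and not how Pingar's API works: the `GAZETTEER` table, the `example:` identifiers (placeholders, not real Freebase or GeoNames IDs), and the `annotate` function are all hypothetical. A production system would also need to disambiguate names that can refer to more than one entity.

```python
import re

# Hypothetical lookup table mapping entity names to placeholder identifiers.
# Real Linked Data sources would supply their own identifiers instead.
GAZETTEER = {
    "new zealand": "example:place/new-zealand",
    "las vegas": "example:place/las-vegas",
    "google": "example:org/google",
}


def annotate(text):
    """Return (surface form, identifier) pairs for known names found in text."""
    # Try longer names first so multi-word names win over shorter overlaps.
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (match.group(0), GAZETTEER[match.group(0).lower()])
        for match in pattern.finditer(text)
    ]


annotations = annotate("Google opened an office in New Zealand.")
```

Once each mention carries an identifier, downstream tools can group documents by entity rather than by exact wording, which is the kind of structure a taxonomy or data mining step can build on.]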
Sean: Will you be speaking or attending any upcoming conferences?
Alyona: Yes, I will be attending Strata NYC in October 2012 and the SharePoint Conference in Las Vegas in November 2012.
Sean: Thanks so much for your time. We look forward to the new developments at Pingar.
Have more questions for Pingar? Please join the discussion by posting them below!
About the Author:
Sean Golliher (@seangolliher) is an adjunct professor of search engines and social networks at MSU and is a member of their computer science advisory board. He is also the founder and publisher of SEMJ.org. Sean holds four engineering patents, has a B.S. in physics from the University of Washington in Seattle, and a master’s in electrical engineering from Washington State University. He is also president and director of search marketing at Future Farm, Inc., Bozeman MT, where he focuses on search marketing and internet research and consults for large companies. He has been interviewed on well-known blogs and radio stations such as Clickz.com, Webmasterradio.com, and SEM Synergy. To maintain a competitive edge, he regularly reads search patents and papers and attends search marketing conferences.