Editor's Note: Here at the Semantic Web Blog we've done a lot of coverage of the personalized news mag app space. That includes some in-depth looks into Zite, acquired by CNN in August, such as this article. Most recently, we brought you news of Zite's iPhone app.
Today, over at Zite's blog, the company will run a piece entitled Zite: Under the Hood. It should be of interest to anyone who wants more details about how its technology operates. It goes like this:
Zite: Under the Hood
If you’re already a Zite user, you’ve experienced the delivery of personalized content that is updated every time you open the app. Making that transparent and easy for you takes a lot of effort. The Zite team brings together decades of software development in artificial intelligence, machine learning and natural language technologies, and more than six years of product development, to blend and tune the experience for you. In short, Zite works by:
- mining content from your social web
- modeling that content
- modeling the community that interacts with it
- modeling your interests
- matching your interests to the content and your community, to help you discover content you’ll want to see.
Here’s a technical description, a look under the covers for those of you who are interested in the complex technology behind Zite.
Finding “What’s interesting”
There are tens of billions of web pages out there and more than two million terabytes of text, images and more are created every hour. So, where in this deluge does Zite start looking for what’s interesting to you? Zite observes what’s happening around the social web, because the community, in aggregate, creates a strong signal for what’s interesting. User-generated content, sharing, commenting and bookmarking have overtaken email and web pages in sheer volume of data created and total time spent online – eMarketer expects 115 million people in the U.S. to be creating content by 2013. What’s important is either happening on, or reported through, social media. What’s more, mining the social web makes it possible to personalize content at the moment you start using Zite for the first time.
To take advantage of the social web in order to find and choose great content for you, Zite:
- Monitors URLs that are shared through a wide range of social streams that you choose to connect to Zite, such as Twitter and delicious, to begin to tell Zite about your interests and focus.
- Throws out spam using adaptive pattern matching heuristics and other techniques.
- Associates each URL with the users who share it and calculates the credibility of each of those users, because a URL from someone who has a lot of followers or is often re-tweeted, for example, is usually more credible.
- Combines the credibility scores of all the users who share a particular URL to calculate an overall quality score for that URL.
- Carries forward URLs with scores above a certain threshold as potential content to show, depending on later calculations.
The result is millions of new and vetted URLs put into the Zite pipeline every day.
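For readers who like to see ideas as code, here is a minimal sketch of the scoring steps above. It is an illustration only, not Zite's actual implementation: the `credibility` formula, the sample users, the URLs and the threshold value are all assumptions made up for this example.

```python
from collections import defaultdict
from math import log1p

# Hypothetical sharer statistics: (followers, times_retweeted).
SHARERS = {
    "alice": (12000, 340),
    "bob": (150, 2),
    "spam_bot": (3, 0),
}

# Hypothetical share events: (user, url).
SHARES = [
    ("alice", "http://example.com/a"),
    ("bob", "http://example.com/a"),
    ("spam_bot", "http://example.com/b"),
]


def credibility(followers, retweets):
    """Assumed formula: log-damped so very large accounts do not dominate."""
    return log1p(followers) + 2.0 * log1p(retweets)


def quality_scores(shares, sharers):
    """Combine the credibility of every user who shared each URL (here, a sum)."""
    scores = defaultdict(float)
    for user, url in shares:
        scores[url] += credibility(*sharers[user])
    return dict(scores)


def vetted(scores, threshold):
    """Carry forward only URLs whose combined score is above the threshold."""
    return {url for url, score in scores.items() if score > threshold}


scores = quality_scores(SHARES, SHARERS)
kept = vetted(scores, threshold=5.0)  # the threshold is an arbitrary example value
```

In this toy data, the URL shared by two reasonably credible users clears the threshold, while the one shared only by a low-credibility account does not.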
Each vetted URL points to text and graphics that Zite could potentially show you, but it takes a lot more processing to find out what’s worth your time. So, Zite:
- Strips out all the extraneous, non-readable content at a URL. This includes HTML formatting, file “includes,” scripting code, whatever. That’s all removed via syntactic analysis, leaving a document that a machine can analyze for its content and one that you can read (if Zite figures it’s worthwhile).
- Analyzes each document via text mining and term extraction techniques, inferring the terms that succinctly capture and summarize what the content is about.
- Parses out the places, names and dates via entity extraction techniques.
- Characterizes the writing style, patterns of speech, and the length of sentences, phrases and words, all via semantic classifiers.
- Lastly, collects metadata such as the author’s name, modifiers from user-added tags and comments, Twitter hash-tags, etc.
All these features—terms, entities, styles, metadata—define a model of what’s in a document, and they are carried forward with the document itself.
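To make the first two steps concrete, here is a small sketch that strips markup and scripts from a page and then picks out its most frequent terms. It uses only Python's standard library and a simple word count, which is a much cruder stand-in for the text mining described above. The stopword list, the sample page and the helper names are assumptions for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# A tiny, assumed stopword list; a real system would use a far larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def readable_text(html):
    """Return the page's text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def top_terms(text, k=3):
    """Return the k most frequent non-stopword terms (plain frequency count)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(k)]


page = (
    "<html><script>var x = 1;</script><body><h1>Election news</h1>"
    "<p>The election is close. Election polls open.</p></body></html>"
)
text = readable_text(page)
terms = top_terms(text)
```

The extracted terms, together with entities, style features and metadata, would then travel with the document as its model.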
The aggregated habits and interests of a community of users can provide valuable recommendations for its members. You’ve likely experienced this via collaborative filtering from Amazon or Netflix. The heuristics correlate the habits of many users who are like you, in order to help derive what you will find relevant. Using a similar technique, Zite:
- Correlates relationships across millions of users and billions of documents, based on vetted data that Zite has captured from the social web. This creates a huge matrix of document-user relationships, derived from both Zite users and external data.
- Condenses these relationships into a few hundred features that characterize each user and each document. Later on, these features become the basis for matching each incoming document to your individual interests.
The process of condensing tends to “blur” the data a bit, and this is a good thing—it enables Zite to show you documents that are a little outside your direct interests, adding an element of serendipity and helping you to discover new things.
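One common way to condense a large user-document matrix into a handful of features is low-rank matrix factorization. The sketch below uses that technique as a stand-in, since the post does not say which method Zite uses. The toy interactions, the two-feature size, and the learning-rate and regularization values are all assumptions chosen to keep the example small.

```python
import random

# Hypothetical interactions: (user_index, doc_index, signal); 1.0 = read or shared.
INTERACTIONS = [
    (0, 0, 1.0), (0, 1, 1.0),
    (1, 0, 1.0), (1, 1, 1.0), (1, 2, 1.0),
    (2, 3, 1.0),
]
N_USERS, N_DOCS, N_FEATURES = 3, 4, 2  # a real system would use hundreds of features


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_error(users, docs, data):
    """Total squared error between observed signals and feature dot products."""
    return sum((r - dot(users[u], docs[d])) ** 2 for u, d, r in data)


def factorize(data, n_users, n_docs, k, epochs=200, lr=0.05, reg=0.01, seed=0):
    """Learn k features per user and per document by stochastic gradient descent."""
    rng = random.Random(seed)
    users = [[rng.uniform(0.01, 0.1) for _ in range(k)] for _ in range(n_users)]
    docs = [[rng.uniform(0.01, 0.1) for _ in range(k)] for _ in range(n_docs)]
    start = squared_error(users, docs, data)
    for _ in range(epochs):
        for u, d, r in data:
            err = r - dot(users[u], docs[d])
            for f in range(k):
                uf, df = users[u][f], docs[d][f]  # use pre-update values for both steps
                users[u][f] += lr * (err * df - reg * uf)
                docs[d][f] += lr * (err * uf - reg * df)
    return users, docs, start, squared_error(users, docs, data)


users, docs, start_error, end_error = factorize(
    INTERACTIONS, N_USERS, N_DOCS, N_FEATURES
)

# Score for a pair that was never observed (user 0, document 2).
unseen_score = dot(users[0], docs[2])
```

Because each user and document is squeezed into only a few numbers, the model produces a score even for pairs it never observed, such as `unseen_score` here. That is the "blur" that lets a document slightly outside your direct interests surface.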
The more your friends and colleagues learn about you, the more enjoyable your conversations become. Zite works the same way: the more you interact with it, the smarter it gets about you, and the better it becomes at bringing you “what’s interesting”. To do this, Zite:
- Tracks the specific topics you say you’re interested in and lets you create a Section in your Zite app for each one.
- Quietly watches what you read and don’t read, and uses machine learning to infer your degree of interest in each document.
- Asks for feedback in the form of thumbs-up / thumbs-down ratings as well as labeled click-boxes so you can ask for more stories from specific sources, specific authors, or on specific topics. These could be popular sites or lesser-known blogs, news items or editorials, and so on.
So, let’s say you “thumbs-up” multiple stories about upcoming political elections. Zite will show you more stories about that. Or, if you repeatedly “thumbs-down” certain stories on the same general topic, Zite will develop a rule to stop showing you similar ones. But how does Zite know what “similar” means? Why do you like or dislike a particular story? Is it because it’s about foreign policy, or written by a specific author, or about a fringe candidate? (You might not even realize why yourself.) Automatically figuring that out, without pestering you to answer a lot of questions, isn’t easy. Zite uses the hundreds of features in its models of content, community, and you, to find the fine-grained patterns in your ratings that represent your preferences. This way, it can correctly reflect your interest by what it shows you, without too much effort on your part.
In short, Zite gets better every time you use it, just by using it. And the more you tell Zite what you like and dislike, the more accurate its choices become.
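As a rough illustration of how thumbs-up and thumbs-down ratings can be turned into preferences, here is a sketch that fits a small logistic regression over story features. This is one standard machine learning approach, used here as an assumed stand-in rather than a description of Zite's model. The feature names, the sample ratings and the training settings are invented for the example.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train(ratings, epochs=50, lr=0.5):
    """Fit one weight per feature from (features, liked) pairs by gradient steps."""
    weights = {}
    for _ in range(epochs):
        for features, liked in ratings:
            z = sum(weights.get(name, 0.0) * value for name, value in features.items())
            err = (1.0 if liked else 0.0) - sigmoid(z)
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) + lr * err * value
    return weights


def interest(weights, features):
    """Estimated probability (0 to 1) that the user would like a story."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical ratings: the user likes election stories except fringe-candidate ones.
RATINGS = [
    ({"topic:elections": 1.0, "topic:foreign_policy": 1.0}, True),
    ({"topic:elections": 1.0, "author:jane_doe": 1.0}, True),
    ({"topic:elections": 1.0, "topic:fringe_candidate": 1.0}, False),
    ({"topic:sports": 1.0, "topic:fringe_candidate": 1.0}, False),
]

weights = train(RATINGS)
liked_score = interest(weights, {"topic:elections": 1.0, "topic:foreign_policy": 1.0})
disliked_score = interest(weights, {"topic:elections": 1.0, "topic:fringe_candidate": 1.0})
```

The learned weights separate the shared topic (elections) from the feature that actually drives the dislikes (the fringe candidate), without the user ever having to explain why.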
(Note: Although Zite builds a model of your interests, your name and email address are never shared or sold. Your usage data is used internally by Zite only to get you “what’s interesting” specifically for you. We do share some usage data with our partners, but only when aggregated with other users—no one ever sees your individual data on its own.)
Matching “What’s interesting” to your interests
Zite now has everything it needs to narrow down the daily deluge of content into focused, personalized, and up-to-date stories. To do this, Zite:
- Looks at the incoming stream of new documents since you last opened Zite, and keeps the ones that match your Zite Sections, sorting them by the quality score.
- Makes a fine-grained comparison of the highest-scored documents to you and your interests, using the hundreds of features calculated for each document. This yields a content-matching score for how closely a story fits your interests.
- Factors the age of a story into its score. As a story gets older, it often becomes less interesting and so Zite lowers its score proportionally.
- Applies your blocked-source settings to eliminate sources you don’t want to see.
- Sorts the stories according to their scores with the most relevant first.
- Lastly, Zite flows these stories onto the screen of your iPad or iPhone, populating each Section according to topic, and using the best of those to populate your Top Stories.
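The matching steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Story` fields, the way the three scores are multiplied together, and the half-life decay used for age are all choices made for this example, not details confirmed by Zite.

```python
from dataclasses import dataclass


@dataclass
class Story:
    title: str
    source: str
    section: str
    quality: float    # combined credibility of the users who shared it
    match: float      # content-matching score, assumed to be in [0, 1]
    age_hours: float


def rank(stories, sections, blocked_sources, half_life_hours=24.0):
    """Filter by Section and blocked sources, apply age decay, sort best first."""
    scored = []
    for story in stories:
        if story.section not in sections or story.source in blocked_sources:
            continue
        # Assumed decay: the score halves every half_life_hours.
        decay = 0.5 ** (story.age_hours / half_life_hours)
        scored.append((story.match * story.quality * decay, story))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [story for _, story in scored]


# Hypothetical incoming stories.
STORIES = [
    Story("Old poll", "newsB", "Politics", quality=10.0, match=0.9, age_hours=48.0),
    Story("Poll update", "newsA", "Politics", quality=10.0, match=0.9, age_hours=2.0),
    Story("Gossip", "tabloid", "Politics", quality=20.0, match=0.9, age_hours=1.0),
    Story("Recipe", "foodblog", "Cooking", quality=15.0, match=0.8, age_hours=1.0),
]

ranked = rank(STORIES, sections={"Politics"}, blocked_sources={"tabloid"})
titles = [story.title for story in ranked]
```

With this toy data, the blocked source and the off-topic story drop out, and the fresher of the two remaining stories comes first.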
Delivering your slice of the Zeitgeist
So that’s how Zite blends advanced technologies to create a unique and powerful experience on your iPad or iPhone. We’re planning to keep pushing the technology and user experience, so stay connected by signing up for our blog feed. And let us know what you think of Zite.