
Outliers, Charts and Data Visualizations

February 22, 2012

Each $ Spent on Space Exploration is spent here on Earth

by Karen Lopez @datachick

I posted that Tweet last week while attending the NASA 2013 Fiscal Budget Briefing at NASA Headquarters. It did well: it was retweeted 200+ times and had the prospect of reaching 1.9 million people. I say prospect because not everyone who follows someone reads all their Tweets. But 1.9 million isn’t such a bad reach when you have a message to get out. Most of this reach was due to various NASA-related accounts retweeting it, but it was helped by regular Twitter users doing their normal thing on Twitter: sharing information with their followers.

One of the trade-offs of having such a huge outlier in my data is that the charts in my Twitter analytics are nearly useless for all the other thousands of Tweets I sent last week (yes, I Tweet…a lot).

That red circle in the upper right represents the number of replies (the size of the circle) and the number of retweets and impressions (the X and Y axes). Looks good, until you see the blob of blue in the lower left. Because this outlier is so far out there, the other pieces of data look almost zero on both axes. I think I do pretty well with my social media outreach, but this chart fuddle duddles the data so much that it hides important information about my “normal” performance on Twitter. In fact, it almost makes it look like all my Tweets perform equally well…or poorly.
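To put rough numbers on that squashing effect, here is a minimal sketch. The impression counts are hypothetical stand-ins (only the 1.9 million figure comes from this post), and `linear_positions` is an illustrative helper, not part of any analytics tool.

```python
# Hypothetical impression counts: ordinary Tweets plus one huge outlier.
impressions = [300, 800, 1_200, 2_500, 3_100, 5_000, 7_500,
               12_000, 18_000, 40_000, 1_900_000]


def linear_positions(values):
    """Return each value's position on a linear axis as a fraction of the axis."""
    top = max(values)
    return [v / top for v in values]


positions = linear_positions(impressions)
for value, pos in zip(impressions, positions):
    print(f"{value:>9,} impressions sits at {pos:.2%} of the axis")
```

With these made-up numbers, every Tweet except the outlier lands in the bottom 3% of the axis, which is the blob in the lower left.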

So what could you do to make this chart more meaningful?

  • Remove the outlier from the chart and/or the data completely
  • Create two charts, one with and one without the outlier
  • Make the graph 1000 times taller and wider
  • “Break” the Y Axis so that there’s a gap between 100k and 1.8 million
  • Use other techniques such as a logarithmic scale to show the data ratios instead of quantities
  • Use statistical methods to massage the data even more
  • Make better data (in this case, send Tweets that fill the gap to make my outlier look more normal)
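As a rough illustration of the logarithmic-scale option in the list above, the sketch below compares where the same values would fall on a log10 axis. The impression counts are hypothetical, and `log_positions` is an illustrative helper name. Note that a log scale only works for values greater than zero, so Tweets with zero impressions would need separate handling.

```python
import math

# Hypothetical impression counts: ordinary Tweets plus one huge outlier.
impressions = [300, 800, 1_200, 2_500, 3_100, 5_000, 7_500,
               12_000, 18_000, 40_000, 1_900_000]


def log_positions(values):
    """Return each value's position on a log10 axis, scaled from 0 (min) to 1 (max)."""
    logs = [math.log10(v) for v in values]  # every value must be > 0
    low, high = min(logs), max(logs)
    return [(x - low) / (high - low) for x in logs]


log_pos = log_positions(impressions)
for value, pos in zip(impressions, log_pos):
    print(f"{value:>9,} impressions sits at {pos:.0%} of the log axis")
```

On the log axis the ordinary Tweets spread across more than half of the chart instead of piling up near zero. The cost is that equal distances now mean equal ratios, not equal quantities, so the axis needs clear labels.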

I could also try to include a longer time period, such as including all my Tweets, not just the ones from this past week.

That adds a few more Tweets that had more retweets, but the impressions still look almost zero. So that doesn’t really help show how the rest of my Tweets did.

In business data, I’ve seen people opt to remove the outlier in additional charts, but sometimes they mask or delete outliers with no indication that they’ve been removed. Sure, my almost 2 million impression Tweet is messing with the display of other data, but if my performance bonus were based on that sort of thing, I wouldn’t want the data to vanish like the $38 million that was cut from the NASA STEM outreach budget. Your data needs may be different, though. So it’s important to find out how business users want outlier data dealt with. The reference links below talk about other, more advanced methods for dealing with outliers. All I know is that my “all” chart isn’t going to help me much as long as that outlier is in the data set.

I’d recommend that whatever technique you use, you ensure that the reader understands what has been done. Remember that the goal of all charts should be to reveal more about the data than just looking at the raw data. If your chart doesn’t do that (Congrats again, Klout), maybe you need to rethink how you are presenting the data.
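One way to keep the reader informed is to set outliers aside rather than silently delete them, and to generate a note for the chart at the same time. The sketch below assumes the common 1.5 × IQR fence rule as the outlier test, which is only one of many possible choices. The data, `split_outliers`, and `chart_note` are all hypothetical names for illustration.

```python
import statistics

# Hypothetical impression counts: ordinary Tweets plus one huge outlier.
impressions = [300, 800, 1_200, 2_500, 3_100, 5_000, 7_500,
               12_000, 18_000, 40_000, 1_900_000]


def split_outliers(values, k=1.5):
    """Split values into (kept, flagged) using fences at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if v < low or v > high]
    return kept, flagged


def chart_note(flagged):
    """Build a caption telling the reader what was left off the chart."""
    if not flagged:
        return ""
    return (f"Note: {len(flagged)} outlier(s) omitted from this chart; "
            f"largest omitted value: {max(flagged):,}.")


kept, flagged = split_outliers(impressions)
print(chart_note(flagged))
```

The flagged values stay available for a second "with outlier" chart or a footnote, so nothing vanishes from the record.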

Other References:

About the author

Karen Lopez is Sr. Project Manager and Architect at InfoAdvisors. She has 20+ years of experience in project and data management on large, multi-project programs. Karen specializes in the practical application of data management principles. She is a frequent speaker, blogger and panelist on data quality, data governance, logical and physical modeling, data compliance, development methodologies and social issues in computing. Karen is an active user on social media and has been named one of the top 3 technology influencers by IBM Canada and one of the top 17 women in information management by Information Management Magazine. She is a Microsoft SQL Server MVP, specializing in data modeling and database design. She’s an advisor to the DAMA International Board and a member of the Advisory Board of Zachman International. She’s known for her slightly irreverent yet constructive opinions and rants on information technology topics. She wants you to love your data. Karen is also moderator of the InfoAdvisors Discussion Groups at www.infoadvisors.com and dm-discuss on Yahoo Groups. Follow Karen on Twitter (@datachick).
