“People are overlooked for a variety of biased reasons and perceived flaws. Age, appearance, personality. Bill James and mathematics cut straight through that. Billy, of the 20,000 notable players for us to consider, I believe that there is a championship team of twenty-five people that we can afford, because everyone else in baseball undervalues them.”
This was the thinking that Peter Brand (played by Jonah Hill in the movie Moneyball) brought to Billy Beane, General Manager of the Oakland Athletics in 2002. In the previous year, Beane’s team had made it to the postseason, but was defeated by the Yankees. The team then lost three star players to free agency, and Beane didn’t have the budget to replace them. But baseball analyst Brand showed Beane that he could do big things with his small budget, and as a result, the A’s won 103 games and their division the very next year, though they again fell short of the World Series.
Turning to data to find undervalued players didn’t stop with the A’s. Beane and Brand started a trend in baseball that changed the game forever, and the use of data has only gotten more complex and competitive as the types and amount of data have exploded over recent years.
This was the focus of Dean Allemang, Tim Harsch, and Amar Shan’s presentation at the recent SemTechBiz Conference, Big Data Analytics for Baseball. The three men from YarcData showed a roomful of baseball and semantic technology fans how in the current world of Big Data, RDF is not only a great solution for health care, government, and media organizations, but for America’s favorite pastime, as well.
Amar started things off by comparing the Moneyball approach of Beane and Brand to what is being done now. The Moneyball approach relied solely on outcome data: box scores and play-by-play data from games played between 1963 and 2002. This provided Beane and the teams that jumped on board in the years to come with quite a bit of data, but baseball didn’t really enter the era of Big Data until the last five years. Only 6% of all the Major League Baseball games ever played took place in the last half decade, but those games generated 95% of all baseball data.
Why the explosion? The short answer is that technology got better. Now, instead of just keeping track of whether a batter hit a certain pitch on a given day, teams can also keep track of the velocity of the pitch, its movement, where precisely it hit in the batter’s box, the speed of the batter’s swing, the point of contact, and a host of other data points that simply weren’t previously available.
Now the trick is getting all of that data together and finding ways to query it in order to identify players that are undervalued, and to do all of that better and faster than your rival team.
YarcData is doing this for their (undisclosed) baseball clients with their RDF Graph Analytics Appliance, Urika. The basic process is pretty simple:
They bring together all of the baseball data available from a wide variety of tracking sources. Then they build all of the associations that are implicit within the data by converting everything to RDF. Doing this allows YarcData to see how every piece of data relates to every other piece.
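To make the conversion step concrete, here is a minimal sketch in plain Python of what turning tracking records into RDF-style triples might look like. It is not YarcData’s pipeline: the record fields, the example.org namespace, and the helper names to_triples and match are all assumptions made up for illustration, and a real system would use an RDF store rather than a Python set.

```python
# Hypothetical pitch-tracking records; every field name and value is invented.
pitches = [
    {"id": "pitch1", "pitcher": "pitcherA", "batter": "batterX",
     "velocity_mph": 94.1, "result": "strike"},
    {"id": "pitch2", "pitcher": "pitcherA", "batter": "batterY",
     "velocity_mph": 88.3, "result": "ball"},
    {"id": "pitch3", "pitcher": "pitcherB", "batter": "batterX",
     "velocity_mph": 91.0, "result": "hit"},
]

EX = "http://example.org/baseball/"  # placeholder namespace (assumption)
ENTITY_FIELDS = ("pitcher", "batter")  # fields that refer to other resources


def to_triples(record):
    """Turn one flat record into (subject, predicate, object) triples."""
    subject = EX + record["id"]
    triples = []
    for key, value in record.items():
        if key == "id":
            continue
        # References to players become IRIs; measurements stay literal values.
        obj = EX + value if key in ENTITY_FIELDS else value
        triples.append((subject, EX + key, obj))
    return triples


# The "graph" is simply the set of all triples.
graph = {triple for record in pitches for triple in to_triples(record)}


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which pitches were thrown to batterX?
to_batter_x = [s for s, _, _ in match(graph, p=EX + "batter", o=EX + "batterX")]
```

Once every record is expressed this way, a question like “which pitches were thrown to batterX?” becomes a pattern over the triples, which is roughly the idea behind a SPARQL basic graph pattern.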
Next comes the analysis phase. In this step, YarcData takes a slice of the data that they have collected and presents it as a graph. Amar commented, “The power of graphs is that they show things that are correlated. We can use these to find similar players and start to form conclusions.” He pointed out that graphs also allow them to query the data into a matrix that can then be analyzed using statistical techniques. In their case, YarcData turns to R for this process. Ultimately, the process leaves their clients with the ability to visualize the data in Gephi, which shows GMs the clusters and, just as importantly, the anomalies that occur. This allows teams to see how different players match up against each other on many fronts.
The true power of the graph analytics approach is that it allows GMs to query the available data in any new way that they can think of. Every team has access to the same data; the teams that can figure out what questions to ask of it, and that have the power to get back the clearest answers, will have the greatest advantage. YarcData believes that they have created a solution that allows GMs to analyze pitchers and hitters from any angle they choose, giving them the power to quickly and effectively gain new insights from the same data by re-visualizing that data with simple SPARQL queries.
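As a rough illustration of the matrix step, the sketch below builds a small player-by-feature matrix and finds each pitcher’s nearest neighbor. YarcData does this kind of analysis in R; plain Python is used here only as a stand-in. The pitchers, features, and numbers are invented, and standardized Euclidean distance is just one simple similarity measure, not necessarily the one they use.

```python
import math

# Hypothetical matrix: one row per pitcher, one column per feature.
features = ["avg_velocity_mph", "avg_break_in", "strikeout_rate"]
matrix = {
    "pitcherA": [95.0, 4.0, 0.28],
    "pitcherB": [94.0, 4.5, 0.27],
    "pitcherC": [86.0, 9.0, 0.18],
    "pitcherD": [87.0, 8.5, 0.19],
}


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def distance(u, v):
    """Euclidean distance between two standardized feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


names = list(matrix)
z_scores = dict(zip(names, standardize([matrix[n] for n in names])))


def most_similar(name):
    """Return the other pitcher whose standardized features are closest."""
    others = (n for n in names if n != name)
    return min(others, key=lambda n: distance(z_scores[name], z_scores[n]))
```

Standardizing first keeps a feature measured in miles per hour from swamping one measured as a rate. With these made-up numbers, the two hard throwers pair up with each other, as do the two pitchers with more break.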
For example, below is a visualization of how right-handed pitchers perform against left-handed batters.
Players inevitably fall into clusters that allow GMs to assess them according to their common attributes. This also allows GMs to find players with unusual attributes that fall outside of the clusters, attributes that might make them more valuable.
Dean commented, “The business value is in being able to identify an undervalued player. If you get your clusters right, you can look at a batter’s performance against a pitcher in a cluster and perform induction to assume how they’d do against other pitchers in the cluster. More specifically, you can look at the numbers to assess batters against the pitchers of the teams in your division that you will play the most.”
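The induction Dean describes can be sketched in a few lines of Python. Everything here is hypothetical: the cluster labels, the hit and at-bat counts, and the function names are invented, and pooling a batter’s results across a cluster is only the simplest possible way to make the estimate.

```python
# Hypothetical cluster assignment for pitchers (labels are invented).
clusters = {
    "pitcherA": "power",
    "pitcherB": "power",
    "pitcherE": "power",
    "pitcherC": "finesse",
    "pitcherD": "finesse",
}

# Hypothetical observed results: (batter, pitcher) -> (hits, at_bats).
observed = {
    ("batterX", "pitcherA"): (6, 20),
    ("batterX", "pitcherB"): (4, 20),
    ("batterX", "pitcherC"): (3, 30),
}


def cluster_average(batter, cluster):
    """Pool the batter's results against every pitcher in the cluster."""
    hits = at_bats = 0
    for (b, pitcher), (h, ab) in observed.items():
        if b == batter and clusters[pitcher] == cluster:
            hits += h
            at_bats += ab
    return hits / at_bats if at_bats else None


def estimate(batter, pitcher):
    """Use direct history if it exists, otherwise fall back to the cluster."""
    direct = observed.get((batter, pitcher))
    if direct and direct[1]:
        return direct[0] / direct[1]
    return cluster_average(batter, clusters[pitcher])


# batterX has never faced pitcherE, so the estimate comes from the
# pooled results against the other pitchers in the same cluster.
projected = estimate("batterX", "pitcherE")
```

The estimate is only as good as the clusters behind it, which is the point of Dean’s caveat about getting the clusters right. A batter with no history against anyone in a cluster gets no estimate at all here (the function returns None).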
Dean added, “Everyone has the data, but not everyone knows how to use it.” Over the coming years, the data will get even more complex and the tools more advanced as teams strive to evaluate players not just every season but perhaps every month. And just wait until fielding data gets added to the mix…
If it ever does. As expansive as all of this data is, it ultimately boils down to one thing: will this player score runs? With their client list under wraps, we don’t get to know exactly how well YarcData’s picks do. But when the World Series comes around, I recommend betting on whatever team they decide to back.
Learn more about what YarcData is doing with Big Data in baseball and elsewhere on their website.
Images: Courtesy YarcData