Click to learn more about author Will Ochandarena.
Everyone knows that the next big frontier in the automotive industry is self-driving vehicles. While a handful of companies (Tesla, Waymo) grab the majority of the headlines, nearly every auto manufacturer today is investing in autonomous driving, either directly through R&D or by partnering with companies that are. The arrival of autonomous driving will have an enormous impact on vehicle safety, traffic management, and the economics of road travel, so any auto manufacturer that doesn’t adapt in time will be left behind.
Achieving full autonomy is (mostly) a technical challenge. Much has been written about the advancements that have to be made in sensors, computing power, and object detection algorithms. One challenge that isn’t often discussed is how difficult it can be to manage the firehose of data generated by self-driving vehicle prototypes.
When it comes to building object detection models, more data is always better, and models have to be continuously tested and iteratively improved. To gather data and test models fast enough to keep up with the industry, companies need to put thousands of prototype vehicles on test tracks and roads. These cars test the most recently built object detection models while also collecting new training data at a rate of more than 1TB per hour. To effectively manage these large fleets of vehicles and collect sufficiently diverse training data, auto companies tend to build multiple geographically distributed R&D centers. Putting all of this together, it isn’t hard to imagine 5 R&D centers, each with 1,000 cars on the road for 8 hours per day, with each center generating 8 petabytes of data daily.
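As a sanity check, the per-center and fleet-wide volumes work out as follows, taking the stated 1TB per hour as a lower bound:

```python
# Back-of-the-envelope fleet data volume, using the figures from the text.
# Assumptions: 1 TB/hour per car (the stated lower bound), 8 hours of
# driving per day, 1,000 cars per R&D center, 5 centers.
TB_PER_CAR_HOUR = 1.0
HOURS_PER_DAY = 8
CARS_PER_CENTER = 1_000
CENTERS = 5

per_center_tb = TB_PER_CAR_HOUR * HOURS_PER_DAY * CARS_PER_CENTER
total_tb = per_center_tb * CENTERS

print(f"Per center: {per_center_tb / 1_000:.0f} PB/day")   # Per center: 8 PB/day
print(f"Fleet-wide: {total_tb / 1_000:.0f} PB/day")        # Fleet-wide: 40 PB/day
```

So each R&D center produces roughly 8 petabytes every day, and the fleet as a whole produces 40.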
Data gravity is the phenomenon that occurs when a dataset grows so large that it becomes physically impractical to move, creating a “pull” of applications and analytics toward it. Eight petabytes of data per day is too much to transfer over even today’s fastest network connections, so it possesses significant gravity. As a result, companies need to find a way not only to perform analytics and machine learning at the location where the data sits, but also to coordinate that processing across each geographic site so that learning is based on a global view of the data rather than a geographic silo.
It is exactly this type of problem that the emerging category of “data fabric” products aims to address. These products first join all geographically distributed data together into a single view, or namespace. Next, they offer the ability to launch analytics or machine learning workloads across multiple locations in parallel, with only the results or output being transferred back to a central location. As the race to full autonomy intensifies, I predict that the companies that invest in this type of technology will have a competitive advantage and come to market faster.
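The “process locally, ship only results” pattern described above can be sketched in miniature. This is an illustrative toy, not any specific data-fabric product’s API; the site names and detection logs are invented stand-ins for petabyte-scale sensor data:

```python
# Sketch of the data-fabric pattern: run the heavy computation where the
# data lives, then merge only small summaries into one global view.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-site object-detection logs (label, confidence). In
# practice each of these would be petabytes of data in a regional center.
SITES = {
    "detroit":  [("pedestrian", 0.91), ("cyclist", 0.88), ("pedestrian", 0.97)],
    "munich":   [("pedestrian", 0.85), ("deer", 0.99)],
    "shanghai": [("cyclist", 0.93), ("pedestrian", 0.90)],
}

def local_summary(site, detections):
    """Compute a tiny summary at the site; only this crosses the WAN."""
    count = len(detections)
    mean_conf = sum(conf for _, conf in detections) / count
    return site, {"detections": count, "mean_confidence": round(mean_conf, 3)}

# Launch the per-site jobs in parallel and merge the results centrally.
with ThreadPoolExecutor() as pool:
    global_view = dict(pool.map(lambda kv: local_summary(*kv), SITES.items()))

print(global_view)
```

Only a few bytes of summary per site travel over the network, while the raw sensor data never leaves its R&D center.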