Benchmarking the Full AI Hardware/Software Stack


Click to learn more about author James Kobielus.

Artificial Intelligence (AI) is a resource hog. AI-powered programs will grind to a halt unless developers continue to seek out the fastest, most scalable, most power-efficient and lowest-cost hardware, software and Cloud platforms to run their workloads.

As the AI arena shifts toward workload-optimized architectures, there’s a growing need for standard benchmarking frameworks to help practitioners assess which target hardware/software stacks are best suited for training, inferencing, and other workloads.

In the past year, the AI industry has moved rapidly to develop open, transparent, and vendor-agnostic frameworks for benchmarking the comparative performance of different hardware/software stacks running diverse workloads. Here are the most important of these initiatives, as judged by the degree of industry participation, the breadth of their missions, the range of target hardware/software environments within their scope, and their progress in putting together useful frameworks for benchmarking today’s top AI challenges.


The MLPerf open-source benchmark group recently announced the launch of a standard suite for benchmarking the performance of Machine Learning (ML) software frameworks, hardware accelerators and Cloud platforms. The group — which includes Google, Baidu, Intel, AMD and other commercial vendors, as well as research universities such as Harvard and Stanford — is attempting to create an ML performance-comparison tool that is open, fair, reliable, comprehensive, flexible and affordable.

Available on GitHub and currently in preliminary release 0.5, MLPerf provides reference implementations for some bounded use cases that predominate in today’s AI deployments:

  • Image Classification: ResNet-50 v1 applied to ImageNet.
  • Object Detection: Mask R-CNN applied to COCO.
  • Speech Recognition: DeepSpeech2 applied to LibriSpeech.
  • Translation: Transformer applied to WMT English-German.
  • Recommendation: Neural Collaborative Filtering applied to MovieLens 20 Million (ml-20m).
  • Sentiment Analysis: Seq-CNN applied to IMDB dataset.
  • Reinforcement: Mini-go applied to predicting pro game moves.

The first MLPerf release focuses on benchmarks for ML training jobs. Currently, each MLPerf reference implementation addressing a particular AI use case provides the following:

  • Documentation on the dataset, model and machine setup, as well as a user guide.
  • Code that implements the model in at least one ML/DL framework and a Dockerfile for running the benchmark in a container.
  • Scripts that download the referenced dataset, train the model and measure its performance against a prespecified target value (aka “quality”).

The MLPerf group has published a repository of reference implementations for the benchmark. Reference implementations are valid as starting points for benchmark implementations but are not fully optimized and are not intended to be used for performance measurements on target production AI systems. Currently, MLPerf published benchmarks have been tested on the following reference configuration:

  • 16 central processing unit chips and one Nvidia P100 graphics processing unit;
  • Ubuntu 16.04, including Docker with Nvidia support;
  • 600 gigabytes of disk (though many benchmarks require less disk); and
  • Either CPython 2 or CPython 3, depending on benchmark.

The MLPerf group plans to release each benchmark (that is, a specific problem using specific AI models) in two modes:

  • Closed: In this mode, a benchmark — such as sentiment analysis via Seq-CNN applied to IMDB dataset — will specify a model and data set to be used and will restrict hyperparameters, batch size, learning rate and other implementation details.
  • Open: In this mode, that same benchmark will have fewer implementation restrictions so that users can experiment with benchmarking newer algorithms, models, software configurations and other AI approaches.

Each benchmark runs until the target metric is reached and then the tool records the result. The MLPerf group currently publishes benchmark metrics in terms of average “wall clock” time needed to train a model to a minimum quality. The tool takes into consideration the costs of jobs as long as price does not vary over the time of day that they are run. For each benchmark, the target metric is based on the original publication result, minus a small delta to allow for run-to-run variance.
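The time-to-quality measurement described above can be sketched in a few lines of Python. This is a minimal illustration and not MLPerf's actual harness: the function names, the `max_epochs` cutoff, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from statistics import mean
from typing import Callable, Iterable, Optional


def target_from_publication(published_quality: float, delta: float) -> float:
    """Target metric: the originally published result minus a small delta."""
    return published_quality - delta


def time_to_quality(
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    target_quality: float,
    max_epochs: int = 100,  # hypothetical safety cutoff, not from the article
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return wall-clock seconds until evaluate() reaches the target, else None."""
    start = clock()
    for _ in range(max_epochs):
        train_one_epoch()
        if evaluate() >= target_quality:
            return clock() - start
    return None  # the run never reached the target quality


def average_time(run_times: Iterable[Optional[float]]) -> float:
    """Average wall-clock time over the runs that reached the target."""
    finished = [t for t in run_times if t is not None]
    if not finished:
        raise ValueError("no run reached the target quality")
    return mean(finished)


if __name__ == "__main__":
    # Toy stand-in for a real model: quality rises by a fixed step per epoch.
    state = {"epoch": 0}

    def train_one_epoch() -> None:
        state["epoch"] += 1

    def evaluate() -> float:
        return min(1.0, state["epoch"] / 10)

    target = target_from_publication(published_quality=0.75, delta=0.05)
    print("seconds to target:", time_to_quality(train_one_epoch, evaluate, target))
```

Passing the clock in as a parameter keeps the measurement logic testable. A real benchmark would also need to decide which setup steps, such as data download and preprocessing, fall inside the timed region.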

The MLPerf group plans to update published benchmark results every three months. It will publish a score that summarizes performance across its entire set of closed and open benchmarks, calculated as the geometric mean of results for the full suite. It will also report the power consumed by mobile devices and on-premises systems in executing benchmark tasks and will report cost for cloud-based systems performing those tasks.
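To make the summary score concrete, here is a rough sketch of a geometric mean over per-benchmark results. The benchmark names and values are invented for illustration, and it assumes each result is a positive number such as a speedup relative to some reference system. The article does not say how MLPerf normalizes results before averaging.

```python
import math
from typing import Iterable, Mapping


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean, computed in log space to avoid overflow on long products."""
    values = list(values)
    if not values:
        raise ValueError("need at least one result")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean is only defined for positive results")
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))


def summary_score(results: Mapping[str, float]) -> float:
    """Collapse per-benchmark results into a single suite-wide score."""
    return geometric_mean(results.values())


if __name__ == "__main__":
    # Hypothetical speedups over a reference system; not real MLPerf data.
    speedups = {"image_classification": 2.0, "translation": 8.0}
    print("suite score:", summary_score(speedups))
```

A geometric mean is a common choice for this kind of roll-up because one very large result on a single benchmark moves it less than it would move an arithmetic mean.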

The next version of the MLPerf benchmarking suite, slated for August release, will run on a range of AI frameworks. Subsequent updates will include support for inferencing workloads, eventually extending to those that run on embedded client systems. The group plans to incorporate any benchmarking advances developed in “open” benchmarks into future versions of the “closed” benchmarks. It also plans to evolve reference implementations to incorporate more hardware capacity and optimized configurations for a range of workloads.


Established in 2017, DAWNBench supports benchmarking of end-to-end Deep Learning (DL) training and inferencing. Developed by MLPerf member Stanford University, DAWNBench provides a reference set of common DL workloads for quantifying training time, training cost, inference latency and inference cost across different optimization strategies, model architectures, software frameworks, clouds and hardware. It supports cross-algorithm benchmarking of image classification and question answering tasks.

DAWNBench recently announced the winners of its first benchmarking contest, which evaluated AI implementations’ performance on such tasks as object recognition and natural-language comprehension. Most of the entries to this DAWNBench contest were open-sourced, which means that the underlying code is readily available for examination, validation, and reuse by others on other AI challenges.

In the DAWNBench challenge, teams and individuals from universities, government departments, and industry competed to design the best algorithms, with Stanford’s researchers acting as adjudicators. Each entry had to meet basic accuracy standards and was judged on such metrics as training time and cost.

For example, one of the DAWNBench object-recognition challenges required training AI algorithms to accurately identify items in the CIFAR-10 picture database. A non-profit group won with a submission that used an innovative DL training technique known as “super convergence,” which had previously been invented by the US Naval Research Laboratory. This works by slowly increasing the learning rate used to train an algorithm and, in this competition, was able to optimize an algorithm to sort the CIFAR data set with the required accuracy in less than three minutes, as compared with more than half an hour for the next best submission.


ReQuEST (Reproducible Quality-Efficient Systems Tournaments) is an industry/academia consortium that has some membership overlap with MLPerf.

ReQuEST has developed an open framework for benchmarking full AI software/hardware stacks. To this end, the consortium has developed a standard tournament framework, workflow model, open repository of validated workflows, artifact evaluation methodology, and a real-time scoreboard of submissions for benchmarking of end-to-end AI software/hardware stacks. The consortium has designed its framework to be hardware agnostic, so that it can benchmark a full range of AI systems, from cloud instances, servers, and clusters down to mobile devices and IoT endpoints. The framework is designed to be agnostic to AI-optimized processors, such as GPUs, TPUs, DSPs, FPGAs, and neuromorphic chips, as well as to diverse AI, Deep Learning, and Machine Learning software frameworks.

ReQuEST’s framework supports comparative evaluation of heterogeneous AI stacks’ execution of inferencing and training workloads. The framework is designed to facilitate evaluation of trade-offs among diverse metrics of AI full-stack performance, such as predictive accuracy, execution time, power efficiency, software footprint, and cost. To this end, it provides a format for sharing complete AI software/hardware development workflows, which span such AI pipeline tasks as model development, compilation, training, deployment, and operationalization. Each submitted AI pipeline workflow to be benchmarked specifies the toolchains, frameworks, engines, models, libraries, code, metrics, and target platforms associated with a given full-stack AI implementation.

This approach allows other researchers to validate results, reuse workflows and run them on different hardware and software platforms with different or new AI models, data sets, libraries, compilers and tools. In ReQuEST competitions, the submissions are arranged according to the extent of their Pareto-efficient co-design of the whole software/hardware stack to continuously optimize submitted algorithms in terms of the relevant metrics. All benchmarking results and winning software/hardware stacks will be visualized on a public interactive dashboard and grouped according to such categories as whether they refer to embedded components or entire stand-alone server configurations. The winning artifacts are discoverable via the ACM Digital Library, thereby enabling AI researchers to reproduce, reuse, improve and compare them under the common benchmarking framework.
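As a rough illustration of what Pareto efficiency means when several metrics are in play, the sketch below keeps only those submissions that no other submission beats or ties on every metric while strictly beating on at least one. The `Submission` fields and the sample entries are invented for the example. ReQuEST's own scoring tooling is not described in this article, so treat this as an assumption-laden sketch and not as the consortium's method.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Submission:
    name: str
    accuracy: float     # higher is better
    exec_time_s: float  # lower is better
    cost_usd: float     # lower is better


def _as_costs(s: Submission) -> Tuple[float, float, float]:
    """Express every metric as 'lower is better' by negating accuracy."""
    return (-s.accuracy, s.exec_time_s, s.cost_usd)


def dominates(a: Submission, b: Submission) -> bool:
    """True if a is at least as good as b on all metrics and better on one."""
    ca, cb = _as_costs(a), _as_costs(b)
    no_worse = all(x <= y for x, y in zip(ca, cb))
    strictly_better = any(x < y for x, y in zip(ca, cb))
    return no_worse and strictly_better


def pareto_front(subs: Sequence[Submission]) -> List[Submission]:
    """Return the submissions that no other submission dominates."""
    return [s for s in subs if not any(dominates(o, s) for o in subs)]


if __name__ == "__main__":
    # Hypothetical entries; the numbers are made up for illustration.
    entries = [
        Submission("A", accuracy=0.76, exec_time_s=10.0, cost_usd=1.0),
        Submission("B", accuracy=0.74, exec_time_s=5.0, cost_usd=0.8),
        Submission("C", accuracy=0.70, exec_time_s=12.0, cost_usd=1.5),
        Submission("D", accuracy=0.74, exec_time_s=6.0, cost_usd=0.9),
    ]
    for s in pareto_front(entries):
        print(s.name)
```

In this toy data, A and B both survive because each wins on a different metric, while C and D are dropped because another entry is at least as good on every metric. That is the kind of trade-off a multi-metric scoreboard is meant to expose instead of collapsing everything into one number.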

For AI benchmarking, the ReQuEST consortium has initiated a competition to support comparative evaluation of Deep Learning implementations of the ImageNet image classification challenge. Results from this competition will help the consortium to validate its framework’s applicability across heterogeneous hardware platforms, DL frameworks, libraries, models and inputs. Lessons learned in this competition will be used to evolve the benchmarking framework so that it can be used for benchmarking other AI application domains in the future.

ReQuEST is also developing equivalent benchmarking frameworks for other application domains, including vision, robotics, quantum computing, and scientific computing. Going forward, the consortium plans to evolve its AI framework to support benchmarking microkernels, convolution layers, compilers, and other functional software and hardware subsystems of AI stacks.

It may take two to three years for MLPerf, DAWNBench, ReQuEST, and other industry initiatives to converge into a common architecture for benchmarking AI applications within complex Cloud and Edge Architectures. For a separate discussion of AI-benchmarking initiatives that are focused on mobile-, IoT-, and Edge Computing platforms, check out this recent article that was authored by yours truly.


