Beyond FLOPS: The co-evolving world of computer benchmarking

Benchmarks have been evolving along with the hardware they measure, and both are getting more complex.

It used to be simple: Multiply the microprocessor's clock rate by four, and you could measure a computer's computational power in megaFLOPS (millions of floating point operations per second) or gigaFLOPS (billions of FLOPS).

No more. Today they're talking about teraFLOPS (trillions) and petaFLOPS (quadrillions) -- which brings up an important question: How do you benchmark these much-more-powerful systems?

"The majority of modern processors are systems on a chip and that has completely muddied the water," says Gabe Gravning, director of product marketing at AMD. An x86 microprocessor may actually include multiple processor cores, multiple graphics co-processors, a video encoder and decoder, an audio co-processor and an ARM-based security co-processor, he explains.

"For the longest time we built single-core processors and pushed the frequency as hard as possible, as frequency was the clearest correlation to performance," agrees Rory McInerney, vice president of Intel's Platform Engineering Group and director of its Server Development Group. "Then came dual cores, and multiple cores, and suddenly 18 cores, and power consumption became more of a problem, and benchmarks had to catch up."

But at the same time, benchmarks are integral to the systems-design processes, McInerney explains. When a new chip is considered, a buyer will "provide snippets of applications that best model performance in their environment -- they may have a certain transaction or algorithm they want optimized," he says.

"From there we need a predictive way to say that if we take option A we will improve B by X percent," McInerney says. "For that we develop synthetic or internal benchmarks, 30 to 50 of them. These benchmarks tend to stay with the same CPU over the life of the product. Then we see how the [internal] benchmarks correlate to standard [third-party] benchmarks that we can quote."

Gravning adds, "There is no perfect benchmark that will measure everything, so we rely on a suite of benchmarks," including both internal and third-party benchmarks. This part of the process hasn't really changed over the years.

As for the nature of those benchmarks, "The internal ones are proprietary, and we don't let them out," McInerney notes. "But for marketing we also need ones that can be replicated by a third party. If you look bad on an external benchmark all the internal ones in the world won't make you look good. Third-party benchmarks are vital to the industry, and are vital to us."

For third-party benchmarks of desktop and consumer devices, sources regularly mention PCMark and 3DMark, both from Futuremark Corp. in Finland. The first is touted for assessing Windows-based desktops, and the second for benchmarking game performance on Windows, Android, iOS and Windows RT devices.

But for servers and high-performance machines, three names keep coming up: TPC, SPEC and Linpack.


TPC

Formed in 1988, the Transaction Processing Performance Council (TPC) is a non-profit group of IT vendors. It promotes benchmarks that simulate the performance of a system in an enterprise, especially a stock brokerage (the TPC-E benchmark) or a large warehouse (TPC-C). (The newest TPC benchmark measures Big Data systems.) The scores reflect results specific to that benchmark, such as "trade-result transactions per second" in the case of the TPC-E benchmark, rather than machine speed.

TPC benchmarks typically require significant amounts of hardware, require person-power to monitor, are expensive to set up and may take weeks to run, explains Michael Majdalany, TPC spokesman. Additionally, an independent auditor must certify the results. Consequently, these benchmarking tests are usually carried out by the system manufacturers, he adds.

After results are posted, any other TPC member can challenge the results within 60 days and a technical advisory board will respond, adds Wayne Smith, TPC's general chairman. Most controversies have involved pricing, since benchmarks are often run on machines before the systems -- and their prices -- are publicly announced, he adds. One that did get some press: In 2009 the TPC reprimanded and fined Oracle $10,000 for advertising benchmarking results that rival IBM complained were not based on audited tests.

The oldest TPC benchmark still in use is the TPC-C for warehouse simulation, going back to the year 2000. Among the more than 350 posted results, scores have varied from 9,112 transactions per minute (using a single-core Pentium-based server in 2001) to more than 30 million (using an Oracle SPARC T3 server with 1,728 cores in 2010). TPC literature says such differences reflect "a truly vast increase in computing power."

The TPC also maintains a list of obsolete benchmarks for reference purposes. Smith recalls that some were rendered obsolete almost overnight. For instance, query times for the TPC-D decision-support benchmark dropped from hours to seconds after various database languages began adopting a function called "materialized views" to create data objects out of frequently used queries.

Smith says that the TPC has decided to move away from massive benchmarks requiring live auditors and towards "express benchmarks" that are based on the results of running code that the vendor can simply download, especially for big data and for virtualization applications.

"But the process of writing and approving a benchmark is still lengthy, in terms of getting everyone to agree," Smith adds.


SPEC

Also founded in 1988, the Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation that promotes standardized benchmarks and publishes the results, selling whichever source code is needed for the tests. Currently, SPEC offers benchmarks for the performance of CPUs, graphics systems, Java environments, mail servers, network file servers, Web servers, power consumption, virtualized environments and various aspects of high-performance computing.

Its oldest benchmark still in use, and probably its best known, is the SPEC CPU2006, which, as its name implies, gauges CPUs and was published in 2006. ("Retired" versions of SPEC go back to 1992.)

The SPEC CPU2006 is actually a suite of applications that test integer and floating point performance in terms of both speed (the completion of single tasks) and throughput (the time needed to finish multiple tasks, also called "rate" by the benchmark). Each resulting score is a ratio: the reference machine's time-to-completion divided by that of the tested machine, so a higher score means a faster machine. In this case the reference was a 1997 Sun Ultra Enterprise 2 with a 296MHz UltraSPARC II processor. It originally took the reference machine 12 days to complete the entire benchmark, according to SPEC literature.

At this writing the highest CPU2006 score (among more than 5,000 posted) was 31,400, for integer throughput on a 1,024-core Fujitsu SPARC M10-4S machine, tested in March 2014. In other words, it was 31,400 times faster than the reference machine. At the other extreme, a single-core Lenovo Thinkpad T43, tested in December 2007, scored 11.4.
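To make the ratio scoring concrete, here is a minimal sketch in Python. The benchmark names and timings are invented for illustration and are not SPEC data. Combining the per-benchmark ratios with a geometric mean follows SPEC's published description of its overall CPU metrics, but this is a simplified illustration, not SPEC's tooling.

```python
import math

def spec_ratio(reference_seconds, measured_seconds):
    """Ratio-style score: how many times faster than the reference machine."""
    return reference_seconds / measured_seconds

def overall_score(ratios):
    """Geometric mean of the per-benchmark ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical timings in seconds: (reference machine, tested machine).
timings = {
    "bench_a": (9000.0, 450.0),
    "bench_b": (8000.0, 250.0),
    "bench_c": (12000.0, 1500.0),
}

ratios = {name: spec_ratio(ref, run) for name, (ref, run) in timings.items()}
score = overall_score(list(ratios.values()))

print(ratios)
print(round(score, 2))
```

A score of 31,400 on this scale therefore means the tested system got through the workload 31,400 times faster than the reference machine did.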

Results are submitted to SPEC and reviewed by the organization before posting, explains Bob Cramblitt, SPEC communications director. "The results are very detailed so we can see if there are any anomalies. Occasionally results are rejected, mostly for failure to fill out the forms properly," he notes.

"Anyone can come up with a benchmark," says Steve Realmuto, SPEC's director. "Ours have credibility, as they were produced by a consortium of competing vendors, and all interests have been represented. There's full disclosure, the results must be submitted in enough detail to be reproducible and before being published they must be reviewed by us."

The major trend is toward more diversity in what is being measured, he notes. SPEC has been measuring power consumption vs. performance since 2008, more recently produced a server efficiency-rating tool, and is now working on benchmarks for cloud services, he adds.

"We don't see a lot of benchmarks for the desktop," Realmuto adds. "Traditional desktop workloads are single-threaded, while we focus on the server space. The challenge is creating benchmarks that take advantage of multiple cores, and we have succeeded."


Linpack

FLOPS remains the main thing measured by the Linpack benchmark, which is the basis for the Top500 listing posted every six months since 1993. The list is managed by a trio of computer scientists: Jack Dongarra, director of the Innovative Computing Laboratory at the University of Tennessee; Erich Strohmaier, head of the Future Technologies Group at the Lawrence Berkeley National Laboratory; and Horst Simon, deputy director of Lawrence Berkeley National Laboratory.

The top machine in the latest listing (June 2014) was the Tianhe-2 (MilkyWay-2) at the National Super Computer Center in Guangzhou, China. A Linux machine based on Intel Xeon clusters, it used 3,120,000 cores to achieve 33,862,700 gigaFLOPS (33,862.7 teraFLOPS, or almost 34 petaFLOPS).

Number one in the first list, in June 1993, was a 1,024-core machine at the Los Alamos National Laboratory that achieved 59.7 gigaFLOPS, so the list reflects improvements approaching six orders of magnitude in 21 years.
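The arithmetic behind those figures is easy to check. The short Python sketch below uses only the two Top500 results quoted above; the variable names are ours.

```python
import math

GIGA, TERA, PETA = 1e9, 1e12, 1e15

first_list_gflops = 59.7           # June 1993 leader, as quoted above
latest_list_gflops = 33_862_700    # June 2014 leader (Tianhe-2), as quoted above

# Convert the 2014 figure to raw FLOPS, then to larger units.
flops = latest_list_gflops * GIGA
teraflops = flops / TERA
petaflops = flops / PETA

# Improvement between the two lists, as a factor and as orders of magnitude.
speedup = latest_list_gflops / first_list_gflops
orders = math.log10(speedup)

print(f"{teraflops:,.1f} teraFLOPS, {petaflops:.2f} petaFLOPS")
print(f"speedup {speedup:,.0f}x, {orders:.2f} orders of magnitude")
```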

Linpack was originally a library of Fortran subroutines for solving various systems of linear equations. The benchmark originated in the appendix of the Linpack Users Guide in 1979 as a way to estimate execution times. Now downloadable in Fortran, C and Java, it times the solution (intentionally using inefficient methods to maximize the number of operations used) of dense systems of linear equations, especially matrix multiplication.
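The general shape of such a measurement can be sketched in pure Python: build a dense system, time the solve, and divide an operation count by the elapsed time. This is an illustration only, not the Linpack code. The function names are ours, and the operation count of roughly 2/3·n³ + 2·n² is the figure conventionally quoted for solving a dense n-by-n system, taken here as an assumption.

```python
import random
import time

def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on copies so the caller's data is untouched.
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

def linpack_style_mflops(n, seed=0):
    """Time one dense solve; return (megaFLOPS estimate, max solution error)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]    # chosen so the exact solution is all ones
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2    # assumed conventional count
    error = max(abs(xi - 1.0) for xi in x)
    return ops / elapsed / 1e6, error

mflops, error = linpack_style_mflops(100)
print(f"{mflops:.1f} megaFLOPS, max error {error:.2e}")
```

Interpreted Python will report a tiny fraction of what the hardware can do. The point is only the structure: a known workload, a timer and an agreed operation count.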

Results are submitted to Dongarra, who reviews the claims before posting them. He explains that the Linpack benchmark has evolved over time; the list now relies on a high-performance version aimed at parallel processors, called the High-Performance Linpack (HPL) benchmark.

But Dongarra also notes that the Top500 list is planning to move beyond HPL to a new benchmark that is based on conjugate gradients, an iterative method of solving certain linear equations. To explain further, he cites a Sandia report that talks about how today's high-performance computers emphasize data access instead of calculation.

Thus, reliance on the old benchmarks "can actually lead to design changes that are wrong for the real application mix or add unnecessary components or complexity to the system," Dongarra says. The new benchmark will be called HPCG, for High Performance Conjugate Gradients.

"This will augment the Top500 list by having an alternate benchmark to compare," he says. "We do not intend to eliminate HPL. We expect that HPCG will take several years to both mature and emerge as a widely visible metric."
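For readers unfamiliar with the method, the following Python sketch shows a plain conjugate gradient solver applied to a small sparse system. It is a textbook version, not the HPCG benchmark code. The test matrix (a one-dimensional Laplacian), the function names and the tolerance are all our own choices. Note that each iteration is dominated by a sparse matrix-vector product and a few vector sweeps, which do little arithmetic per value fetched from memory.

```python
def laplacian_matvec(v):
    """Multiply by the tridiagonal matrix with 2 on the diagonal, -1 beside it."""
    n = len(v)
    out = [0.0] * n
    for i in range(n):
        out[i] = 2.0 * v[i]
        if i > 0:
            out[i] -= v[i - 1]
        if i < n - 1:
            out[i] -= v[i + 1]
    return out

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A given only as matvec."""
    x = [0.0] * len(b)
    r = b[:]              # residual b - A x, with the starting guess x = 0
    p = r[:]              # first search direction
    rs_old = dot(r, r)
    for iteration in range(max_iter):
        if rs_old ** 0.5 < tol:
            return x, iteration
        ap = matvec(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter

n = 50
x_true = [1.0] * n
b = laplacian_matvec(x_true)
x, iterations = conjugate_gradient(laplacian_matvec, b)
print(iterations, max(abs(xi - 1.0) for xi in x))
```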

The plea from IBM

Meanwhile, at IBM, researchers are proposing a new approach to computer architecture as a whole.

Costas Bekas, head of IBM Research's Foundations of Cognitive Computing Group in Zurich and winner of the ACM's Gordon Bell Prize in 2013, agrees with Dongarra that today's high-performance computers have moved from being compute-centric to being data-centric. "This changes everything," he says.

"We need to be designing machines for the problems they will be solving, but if we continue to use benchmarks that focus on one kind of application there will be pitfalls," he warns.

Bekas says that his team is therefore advocating the use of conjugate gradients benchmarking, because conjugate gradients involve moving data in large matrices, rather than performing dense calculations.

Beyond that, Bekas says his team is also pushing for a new computing design that combines both inexact and exact calculations -- the new conjugate gradients benchmarks having demonstrated enormous advantages in doing so.

Basically, double-precision calculations (i.e., FLOPS) are needed only in a tiny minority of cases, he explains. The rest of the time the computer is performing rough sorting or simple comparisons, and precise calculations are irrelevant.

IBM's prototypes "show that the results can be really game-changing," he says, because the energy required to reach a solution with a combination of exact and inexact computation is reduced by a factor of almost 300. With minimal use of full precision, the processors require much less energy and the overall solution is reached faster, further cutting energy consumption, he explains.
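The article does not describe IBM's method in detail. One well-known technique in the same spirit is mixed-precision iterative refinement: do the expensive solve in low precision, then use a few cheap full-precision residual checks to correct it. The Python sketch below illustrates that idea only and makes no claim about IBM's design or its energy figures. Single precision is emulated by rounding through `struct`, and the small test matrix and all function names are ours.

```python
import struct

def to_single(value):
    """Round a Python float (double precision) to single precision."""
    return struct.unpack("f", struct.pack("f", value))[0]

def solve_low_precision(a, b):
    """Gaussian elimination with each intermediate rounded to single precision."""
    n = len(b)
    a = [[to_single(v) for v in row] for row in a]
    b = [to_single(v) for v in b]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = to_single(a[i][k] / a[k][k])
            for j in range(k, n):
                a[i][j] = to_single(a[i][j] - to_single(m * a[k][j]))
            b[i] = to_single(b[i] - to_single(m * b[k]))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = 0.0
        for j in range(i + 1, n):
            s = to_single(s + to_single(a[i][j] * x[j]))
        x[i] = to_single(to_single(b[i] - s) / a[i][i])
    return x

def residual(a, x, b):
    """b - A x, computed in full double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x))
            for row, bi in zip(a, b)]

def refine(a, b, steps=5):
    """Low-precision solve followed by full-precision residual corrections."""
    x = solve_low_precision(a, b)
    history = []
    for _ in range(steps):
        r = residual(a, x, b)
        history.append(max(abs(ri) for ri in r))
        correction = solve_low_precision(a, r)
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x, history

a = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, history = refine(a, b)
final_residual = max(abs(ri) for ri in residual(a, x, b))
print(history[0], final_residual)
```

In this toy case the only full-precision work is the residual calculation. Every solve runs in the cheaper format, yet the refined answer ends up far closer to the exact solution than the first low-precision solve.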

Taking advantage of the new architecture will require action by application programmers. "But it will take only one command to do it," once system software modules are aware of the new computing methodology, Bekas adds.

If Bekas' suggestions catch on, with benchmarks pushing machine design and machine design pushing benchmarks, it will actually be a continuation of the age-old computing and benchmarking pattern, says Smith.

"I can't give you a formula saying 'This is the way to do a benchmark,'" Smith says. "But it must be complex enough to showcase the entire machine, it must be interesting on the technical side and it must have something marketing can use." When several firms use it for predictions "it feeds on itself, as you build new hardware or software based on the benchmark.

"A result gets published, it pushes the competitive market up a notch, other vendors must respond and the cycle continues," he explains.
