Testing: GeForce GTX 1080 Compute Performance

By Loyd Case


Can Nvidia's new flagship compute? Sure it can. But how well?

Out of idle curiosity, I ran a couple of OpenCL compute-oriented benchmarks on the GTX 1080 and three other GPUs. Bear in mind that this is quick-and-dirty benchmarking; I didn't rigorously repeat the runs to validate the results. The results look interesting, however, and the issue of compute on new GPUs bears further investigation.

The Setup

These tests ran on my existing production system, a Core i7-6700K with 32GB DDR4 running at the stock 2,133MHz effective. I used four different GPUs: GTX 1080, Titan X, GTX 980, and an AMD Radeon Fury Nano. The GTX 1080 used the early release drivers, while the other GPUs ran on the latest WHQL-certified drivers available from the GPU manufacturer's web site.

As you can see from the table below, all four GPUs ran at the reference frequencies, including memory. When I show the results, I don't speculate on the impact of compute versus memory bandwidth or quantity. As I said: quick and dirty.

GPU | GTX 1080 | Titan X | GTX 980 | Radeon Fury Nano
Base Clock | 1.6GHz | 1.0GHz | 1.126GHz | 1.0GHz
Boost Clock | 1.73GHz | 1.075GHz | 1.216GHz | 1.05GHz
Memory Type | GDDR5X | GDDR5 | GDDR5 | HBM
Memory Bandwidth | 320GB/s | 336GB/s | 224GB/s | 512GB/s

CompuBench CL

The first benchmark, CompuBench CL from Hungary-based Kishonti, actually consists of a series of benchmarks, each focusing on a different compute problem. Because the compute tasks differ substantially, CompuBench doesn't try to aggregate them into a single score. So I show separate charts for each test. CompuBench CL 1.5 desktop uses OpenCL 1.1.

Vision Processing: Face Detection and TV-L1 Optical Flow

According to Kishonti, "Face detector is based on the Viola-Jones algorithm. Face detection is extensively used in biometrics and digital image processing to determine locations and sizes of human faces".
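For readers unfamiliar with Viola-Jones, the algorithm leans on an "integral image" (summed-area table), which lets the sum of any rectangle of pixels be read with four lookups. The following is a minimal Python sketch of that one building block, not CompuBench's kernel; the function names and the tiny test image are illustrative assumptions.

```python
def integral_image(img):
    """Build a summed-area table padded with one extra row and column of zeros."""
    rows, cols = len(img), len(img[0])
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            # Sum of everything above this row, plus this row up to column c.
            sat[r + 1][c + 1] = sat[r][c + 1] + row_sum
    return sat


def rect_sum(sat, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]


def two_rect_feature(sat, top, left, height, width):
    """Haar-like edge feature: left half of the window minus the right half."""
    half = width // 2
    left_sum = rect_sum(sat, top, left, height, half)
    right_sum = rect_sum(sat, top, left + half, height, half)
    return left_sum - right_sum


# Hypothetical 4x4 grayscale patch, used only for illustration.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
table = integral_image(image)
```

A real detector evaluates thousands of such features per window across many window positions and scales, which is why the workload maps well onto a GPU.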

The second vision processing test, TV-L1 optical flow, is "based on dense motion vector calculation using variational method. Optical flow is widely used for video compression and enhancing video quality in vision-based use cases, such as driver assistance systems or motion detection".

So far, it's looking pretty linear, with the GTX 1080 leading the other cards by pretty wide margins. Can the latest consumer GPU from Nvidia stay the course?

Physics: Ocean Surface Simulation and Particle Simulation – 64K

Nvidia spends a lot of PR capital touting physics processing with its GPUs. CompuBench includes two physics-oriented OpenCL benchmarks. Let's first look at Ocean Simulation. Kishonti notes, "Test of the FFT algorithm based on ocean wave simulation. The Fast Fourier transform computes transformations of time or space to frequency and vice-versa. FFTs are widely used in engineering, science, and mathematics".

Well, it looks like a few cracks are showing up in Nvidia's compute performance capabilities. Let's look at particle simulation. The benchmark notes read, "Particle Simulation in a spatial grid using the discrete element method. The result of the simulation is visualized as shaded point sprite spheres with OpenGL".

Okay, the FFT-based ocean simulation test could just be an outlier. Clearly, though, there's something about AMD's GCN architecture that makes it an efficient FFT engine — at least, more efficient than Nvidia's consumer GPUs.
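As a reminder of what an FFT does, here is a minimal Python sketch of a recursive radix-2 Cooley-Tukey transform. It is a textbook illustration only and says nothing about how CompuBench or any GPU library implements its FFT; the 64-sample test wave is an assumption chosen for the example.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor rotates the odd-indexed half before recombining.
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_bin(samples):
    """Index of the strongest frequency bin in the first half of the spectrum."""
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(value) for value in spectrum[:half]]
    return max(range(half), key=magnitudes.__getitem__)


# Illustrative input: a cosine that completes 5 cycles over 64 samples.
n = 64
wave = [math.cos(2 * math.pi * 5 * t / n) for t in range(n)]
peak = dominant_bin(wave)
```

With this input, `peak` lands on bin 5, the frequency of the cosine. The butterfly loop is the part that parallelizes, and it is memory-access heavy, which is one reason FFT results can diverge from raw clock speed.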

Graphics: T-Rex Path Tracing

CompuBench CL provides a single test for graphics, based on the T-Rex benchmark the company developed for mobile GPU testing. This particular test, in Kishonti's words: "features dynamically updated acceleration structure and global illumination".

Once again, the Fury Nano surprises a bit, easily outperforming the GTX 980, and trailing the shiny new GTX 1080 by under 7%, while giving up 600MHz in clock frequency. On the other hand, I've never been one to test at identical clock frequencies. It's all well and good to talk about architectural efficiency, but when one processor can run 600MHz faster, marginally lower ISA efficiency doesn't really mean much.
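To put a rough number on that efficiency gap, the short Python sketch below divides the relative score by the relative clock, using the reference clocks from the table above. It assumes the Nano scores exactly 7% lower (the text only says "under 7%", so this is a floor) and deliberately ignores shader counts and memory bandwidth; the function name is hypothetical.

```python
def per_clock_ratio(relative_score, clock_ghz, reference_clock_ghz):
    """Score per GHz of one GPU, relative to a reference GPU."""
    return relative_score / (clock_ghz / reference_clock_ghz)


# Assumption: Nano at 93% of the GTX 1080's score (a 7% deficit, the worst case).
nano_relative_score = 0.93

# Reference base and boost clocks from the setup table (GHz).
base = per_clock_ratio(nano_relative_score, 1.0, 1.6)
boost = per_clock_ratio(nano_relative_score, 1.05, 1.73)

print(f"per-clock advantage at base clocks:  {base:.2f}x")
print(f"per-clock advantage at boost clocks: {boost:.2f}x")
```

Under those assumptions the Nano does roughly 1.5 times the path-tracing work per clock, which is exactly the kind of advantage a 600MHz clock lead erases.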

Video Processing: Video Composition

Kishonti describes this benchmark as "… replicating a typical video composition pipeline with effects such as pixelate, mask, mix, and blur".

Once again, it appears that the Radeon Fury Nano offers better execution efficiency, but the raw clock speed of the GTX 1080 makes up the difference. Even so, the Nano beats out the 12GB monster that is the Titan X.

Bitcoin Mining

CompuBench CL's bitcoin mining test offers a pretty straightforward integer hashing benchmark.

Well, this looks like a trend. The GTX 1080 wins out, but AMD beats the older Nvidia GPUs.
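For context on what an integer hashing workload looks like, here is a minimal Python sketch of the double SHA-256 loop at the heart of Bitcoin mining. It is not CompuBench's kernel; the all-zero header prefix and the very easy target are assumptions chosen so the search finishes quickly.

```python
import hashlib
import struct


def double_sha256(data: bytes) -> bytes:
    """Bitcoin hashes block headers with SHA-256 applied twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(header_prefix: bytes, target: int, max_nonce: int = 1_000_000):
    """Try nonces until the hash, read as a little-endian integer, is below target."""
    for nonce in range(max_nonce):
        # The nonce is the final 4 bytes of the 80-byte header, little-endian.
        digest = double_sha256(header_prefix + struct.pack("<I", nonce))
        if int.from_bytes(digest, "little") < target:
            # Reverse the bytes so the hex string reads with leading zeros.
            return nonce, digest[::-1].hex()
    return None


# Hypothetical inputs: a zeroed 76-byte header prefix and a target that
# only requires the top 12 bits of the hash to be zero.
prefix = b"\x00" * 76
easy_target = 1 << 244
result = mine(prefix, easy_target)
```

Every nonce is independent of every other, so the search is embarrassingly parallel and dominated by 32-bit integer rotates, shifts, and adds rather than floating-point math.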

Now let's look at something different.

LuxMark OpenCL Rendering

LuxMark uses the LuxRender physically-based rendering tool to run its benchmark. In the interest of time, I only ran the default LuxBall HDR test, a relatively low triangle-count scene incorporating 217K triangles. I might revisit the medium and high-end scenes later. LuxMark 3.1 uses LuxRender 1.5, which seems to be based on OpenCL 1.1, though documentation on API usage is sketchy.

Uh… wow?

I'm not quite sure what's going on with LuxMark, and it's clearly worth going back and checking other scenes. I did run these tests twice to double-check.

The emerging pattern we've seen suggests AMD's GCN architecture offers better efficiency, but the GTX 1080 is running on early release drivers focused on gaming performance. Even so, any new drivers need to cover a lot of ground to catch up with the Radeon Fury Nano.

Final Thoughts

There's no question GPUs have proven useful as general purpose compute engines, which means there's money to be made in selling dedicated GPU compute hardware. Nvidia's line item for data center revenue exceeded $100 million in the company's 2017 first fiscal quarter, hitting $143 million and accounting for nearly 11% of revenue. That's pretty serious money.

Nvidia began bifurcating its GPUs with Kepler, shipping GPUs with substantially different capabilities depending on the target market. The folks at ArrayFire wrote a pretty illuminating post about the differences in floating point performance among Kepler-based GPUs. Nvidia's segmentation has only gotten stricter since then.

To be fair, though, removing compute capabilities unneeded for games allows Nvidia to build a superb gaming GPU that's just 314mm². Incorporating additional features would increase the die size, adding cost. However, it also means users can't just go out and buy a bunch of consumer GPUs and expect near-parity compute performance with Nvidia's Tesla-class products.

Also, bear in mind we're looking at essentially two OpenCL 1.1-based benchmarks. The OpenCL 1.2 spec has been around since 2011, and 2.0 since 2013. It's possible the landscape for compute could change. Nvidia also has a lot of capital invested in its proprietary CUDA software architecture, though what impact that has on Nvidia's OpenCL development is unknown.

Consider other differences between AMD and Nvidia GPUs. Ignoring die size for a moment, Nvidia designed the GTX 1080 using 7.2 billion transistors, roughly 1.7 billion fewer than the 8.9 billion in the Radeon Fury GPU. That suggests the Fury may have additional capabilities that the GTX 1080 lacks, which may be another reason for the OpenCL performance disparities shown.

AMD seems to have little compunction in allowing relative parity between consumer and compute GPUs. This could partly be because of resources: building a bunch of different SKUs costs money, and the trade-off between die size and SKU complexity is simply another axis when considering overall cost. Re-examining GPU compute performance when AMD ships its Polaris and Vega products, which should run at much higher clock frequencies, will be interesting, indeed.

This post originally appeared on Uncertainty on May 23rd, 2016 and is republished here with permission.