We are in the future! It is time to continue our scintillating look at interfaces, and the bandwidth limitations thereof. This week, we cast our gazes on PCI Express and Thunderbolt. First, PCI Express: what exactly does it mean when you have a PCIe 2.0 x8 connection? And does it make a difference whether your connection is x8 or x16?
PCI Express is a little confusing. A PCIe connection consists of one or more data-transmission lanes, each of which is a serial link. Each lane consists of two pairs of wires, one for receiving and one for transmitting. You can have one, four, eight, or sixteen lanes in a single consumer PCIe slot--denoted as x1, x4, x8, or x16. Each lane is an independent connection between the PCIe controller and the expansion card, and bandwidth scales linearly, so an eight-lane connection will have twice the bandwidth of a four-lane connection. This helps avoid bottlenecks between, say, the CPU and the graphics card. If you need more bandwidth, just use more lanes.
There are several different physical slot sizes, and each can function electrically with fewer lanes than its size suggests and can accommodate a physically smaller card as well. A physical PCIe x16 slot can accommodate a x1, x4, x8, or x16 card, and can run a x16 card at x16, x8, x4, or x1. A PCIe x4 slot can accommodate a x1 or x4 card but cannot fit a x16 card. And finally, there are several different versions of the PCIe interface, each with different bandwidth limitations, and many modern motherboards mix PCIe slots of different physical sizes and different PCIe generations. Confused yet?
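The fit-and-negotiate rules above can be sketched in a few lines. This is only an illustrative model: the function names are made up for this example, and it ignores open-ended slots, which some boards use to let a longer card hang past the end of a shorter slot.

```python
# Illustrative model of the slot rules described above.
# Function names are hypothetical; open-ended slots are ignored.

VALID_WIDTHS = (1, 4, 8, 16)  # consumer lane counts mentioned in the article


def card_fits(slot_physical: int, card_width: int) -> bool:
    """A card fits when its connector is no longer than the physical slot."""
    return card_width <= slot_physical


def negotiated_width(slot_electrical: int, card_width: int) -> int:
    """The link runs at the widest lane count both ends support."""
    return min(slot_electrical, card_width)


# A x16 card in a physical x16 slot that is only wired for eight lanes:
print(card_fits(16, 16), negotiated_width(8, 16))
# A x16 card in a physical x4 slot:
print(card_fits(4, 16))
```

Keeping the physical size and the electrical width as separate arguments mirrors the point above: a slot's length tells you what fits, not how many lanes you actually get.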
Let's start with maximum theoretical bandwidth. A single PCIe 1.0 (or 1.1) lane can carry up to 2.5 Gigatransfers per second (GT/s) in each direction simultaneously. For PCIe 2.0, that increases to 5GT/s, and a single PCIe 3.0 lane can carry 8GT/s.
What's with this gigatransfers nonsense? Gigatransfers per second are the same thing (in this case) as gigabits per second, but they include the bits that are lost as a result of interface overhead. All PCI Express versions lose some of their theoretical maximum throughput to the physical overhead associated with electronic transmissions. PCIe 1.x and 2.0 use 8b/10b encoding (like SATA does), the upshot of which is that each 8 bits of data cost 10 bits to transmit, so they lose 20 percent of their theoretical bandwidth to overhead. It's just the cost of doing business.
After overhead, the maximum per-lane data rate of PCIe 1.0 is eighty percent of 2.5GT/s. That gives us two gigabits per second, or 250MB/s (remember, eight bits to a byte). The PCIe interface is bidirectional, so that's 250MB/s in each direction, per lane. PCIe 2.0 doubles the per-lane throughput to 5GT/s, which gives us 500MB/s of actual data transfer per lane.
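If you want to check that arithmetic yourself, here is a minimal sketch. The function name and its parameters are invented for this example; the only inputs are the transfer rates and the 8b/10b ratio given above.

```python
# Minimal sketch of the per-lane arithmetic above.
# lane_mb_per_s and its parameters are hypothetical names for illustration.

def lane_mb_per_s(gt_per_s: float, payload_bits: int = 8, line_bits: int = 10) -> float:
    """Usable megabytes per second on one lane, in one direction."""
    usable_gbit_per_s = gt_per_s * payload_bits / line_bits  # strip encoding overhead
    return usable_gbit_per_s * 1000 / 8  # gigabits -> megabits -> megabytes


print(lane_mb_per_s(2.5))  # PCIe 1.x lane
print(lane_mb_per_s(5.0))  # PCIe 2.0 lane
```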
You've probably heard that PCIe 3.0 is twice the speed of PCIe 2.0, but as we've seen above, its per-lane theoretical throughput is 8GT/s, which is only 60 percent more than PCIe 2.0's 5GT/s. The difference is made up by encoding: PCIe 3.0 and above use a more efficient scheme called 128b/130b, so the overhead is much less--only 1.54 percent. That means that a single PCIe 3.0 lane, at 8GT/s, can send 985MB/s. That's not quite twice 500MB/s, but it's close enough for marketing purposes.
What that means is that a PCIe 3.0 x4 connection (3.94GB/s) should have nearly the same bandwidth as PCIe 1.1 x16, or PCIe 2.0 x8 (both 4GB/s).
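Those equivalences fall straight out of the transfer rate, the encoding ratio, and the lane count. The sketch below uses only figures from this article; the table layout and the function name are assumptions made for illustration.

```python
# Sketch: usable one-direction bandwidth for a PCIe link.
# GENERATIONS and link_gb_per_s are hypothetical names; the numbers are the
# transfer rates and encoding schemes quoted in the article.

GENERATIONS = {
    # generation: (GT/s per lane, payload bits, bits on the wire)
    "1.x": (2.5, 8, 10),     # 8b/10b
    "2.0": (5.0, 8, 10),     # 8b/10b
    "3.0": (8.0, 128, 130),  # 128b/130b
}


def link_gb_per_s(generation: str, lanes: int) -> float:
    """Usable gigabytes per second in one direction for the whole link."""
    rate, payload_bits, line_bits = GENERATIONS[generation]
    per_lane_mb = rate * payload_bits / line_bits * 1000 / 8
    return per_lane_mb * lanes / 1000


for generation, lanes in [("1.x", 16), ("2.0", 8), ("3.0", 4)]:
    print(f"PCIe {generation} x{lanes}: {link_gb_per_s(generation, lanes):.2f} GB/s")
```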
Modern GPUs use a x16 PCIe 2.0 or 3.0 interface. That doesn't mean they always run at x16 speed, though. Many motherboards have multiple physical x16 slots, but a smaller number of actual PCIe lanes available. On a Z87 (Haswell) or Z77 (Ivy Bridge) desktop, the CPU has 16 PCIe 3.0 lanes. Intel chipsets have an additional eight PCIe 2.0 lanes, but those are typically used for sound cards, RAID cards, and so forth. (AMD's 990FX chipset includes 32 PCIe 2.0 lanes, plus four on the northbridge.) In the Asus board shown above, for example, the PCIe 3.0 slots are CPU lanes, while all the rest have to share the eight chipset PCIe 2.0 lanes. Using the PCIe 2.0 x16 slot in x4 mode disables three of the PCIe 2.0 x1 slots.
So a single x16 graphics card will use all 16 CPU PCIe lanes, but adding a GPU to the second x16 slot will drop both graphics cards' connections down to eight lanes each. Adding a third GPU will drop the first card's connection to x8, and the second and third cards' connections down to x4 each. This is why many people who run multi-GPU setups prefer Intel's enthusiast architectures, like Sandy Bridge-E and the upcoming Ivy Bridge-E. Ivy Bridge-E CPUs will have forty PCIe 3.0 lanes. That's enough to run two cards at x16 and one at x8, one card at x16 and three cards at x8, or one at x16, two at x8, and two more at x4. That's just ridiculous.
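The lane math is easy to sanity-check. The snippet below is plain bookkeeping, not a model of any real board: which splits a motherboard actually offers depends on how its slots are wired, and the names here are invented for the example.

```python
# Bookkeeping sketch: does a set of per-card link widths fit the CPU's lanes?
# All names are hypothetical; real boards only support the splits they are wired for.

MAINSTREAM_LANES = 16  # Ivy Bridge / Haswell desktop CPUs, per the article
ENTHUSIAST_LANES = 40  # Ivy Bridge-E, per the article


def fits_lane_budget(widths, total_lanes: int) -> bool:
    """True when the link widths add up to no more than the available lanes."""
    return sum(widths) <= total_lanes


for widths in [(16,), (8, 8), (8, 4, 4), (16, 16)]:
    print(widths, fits_lane_budget(widths, MAINSTREAM_LANES))

for widths in [(16, 16, 8), (16, 8, 8, 8), (16, 8, 8, 4, 4)]:
    print(widths, fits_lane_budget(widths, ENTHUSIAST_LANES))
```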
Does it matter for performance?
Two PCIe 3.0 GPUs running at x8 each on a PCIe 3.0 motherboard should have nearly the same bandwidth as two PCIe 2.0 GPUs running at x16--the first pair runs at 7.88GB/s each, while the second pair runs at 8GB/s each. If either your motherboard or graphics card is limited to a PCIe 2.0 connection, you'll be stuck using the slower interface.
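Put another way, the link runs at the older of the two generations. Here is a rough sketch of that rule, using the rounded per-lane figures from earlier in the article; the table and function names are assumptions for illustration only.

```python
# Sketch: a link is limited by the older generation of the board and the card.
# PER_LANE_MB holds the article's rounded per-lane figures; names are hypothetical.

PER_LANE_MB = {1: 250.0, 2: 500.0, 3: 985.0}  # MB/s per lane, one direction


def effective_gb_per_s(board_gen: int, card_gen: int, lanes: int) -> float:
    """Usable GB/s for a card, given the board's and the card's PCIe generation."""
    link_gen = min(board_gen, card_gen)  # the slower side wins
    return PER_LANE_MB[link_gen] * lanes / 1000


print(effective_gb_per_s(3, 3, 8))   # PCIe 3.0 card, PCIe 3.0 board, x8
print(effective_gb_per_s(2, 2, 16))  # PCIe 2.0 card, PCIe 2.0 board, x16
print(effective_gb_per_s(3, 2, 8))   # PCIe 2.0 card on a PCIe 3.0 board, x8
```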
TechPowerUp did an enormous roundup of PCIe performance last May. They tested the two most powerful single-GPU cards at the time--AMD's Radeon HD 7970 and Nvidia's GeForce GTX 680--at x4, x8, and x16 using PCIe 1.1, 2.0, and 3.0, all on the same motherboard. This is by far the best apples-to-apples test I've ever seen on PCIe bandwidth scaling. The entire article is worth a read, but the performance summary page collects the relative results at a glance.
As you'd expect, equivalent bandwidth configurations perform around the same. Most importantly, to quote the TechPowerUp authors, "Our testing confirms that modern graphics cards work just fine at slower bus speed, yet performance degrades the slower the bus speed is. Everything down to x16 1.1 and its equivalents (x8 2.0, x4 3.0) provides sufficient gaming performance even with the latest graphics hardware, losing only 5% average in worst-case. [emphasis added] Only at even lower speeds we see drastic framerate losses, which would warrant action."
The most interesting part in these results is the finding that last year's most powerful graphics cards perform just fine at PCIe 2.0 x8 or even PCIe 3.0 x4. That means that three-way SLI or CrossFireX should be viable, even in x8/x4/x4, on Ivy Bridge or Haswell desktops. But even if you don't have PCIe 3.0, you're not missing out on much performance running at x8 on a PCIe 2.0 connection.
The doubled bandwidth of PCIe 3.0 x16, compared to PCIe 2.0, doesn't seem to make much of a difference yet. AnandTech's Ryan Smith tested two Nvidia GeForce GTX Titans, the current fastest single-GPU cards, in SLI on both PCIe 3.0 and 2.0, and found, at best, a seven percent performance improvement at 5760 x 1200.
So that's good news for people with older motherboards or graphics cards. Provided you have at least PCI Express 2.0 x8, you're hardly leaving any performance on the table, even on the fastest cards.
Thunderbolt is a data transfer interface that can pass through both PCI Express and DisplayPort signals, depending on what it is plugged into. A Thunderbolt controller consists of two bidirectional data channels, with each channel containing an input and an output lane. The Thunderbolt chips on each end of the cable take in both DisplayPort 1.1a and a four-lane PCIe 2.0 bus. Each channel is independent, and can either carry DisplayPort or PCIe, but not both.
Each direction in each channel has a theoretical maximum throughput of 10Gbps--the same as two PCIe 2.0 lanes. As discussed above, due to 8b/10b encoding, 20 percent of the theoretical limit of PCI Express 2.0 is devoted to signal overhead, so the maximum theoretical throughput of a single Thunderbolt channel is 1GB/s in each direction. In first-gen Thunderbolt, that's as fast as you're going to get, since each device can only access one of the two channels, and you can't combine them. It's still pretty damn fast, since you can send high-def video to a DisplayPort monitor at 10Gbps down one channel while reading 1GB/s from an SSD RAID with the other, at the same time.
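Following the arithmetic above, the per-channel ceiling works out like this. It is a back-of-the-envelope sketch that takes the article's 10Gbps figure and 8b/10b overhead at face value; the constant and function names are invented for the example.

```python
# Back-of-the-envelope sketch of the per-channel ceiling described above.
# Names are hypothetical; the figures are the ones quoted in the article.

CHANNEL_GBPS = 10.0           # per direction, per channel
PCIE2_LANE_GT_PER_S = 5.0     # one PCIe 2.0 lane
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b: 8 data bits per 10 bits on the wire


def channel_gb_per_s(raw_gbps: float = CHANNEL_GBPS) -> float:
    """Usable gigabytes per second in one direction on one channel."""
    return raw_gbps * ENCODING_EFFICIENCY / 8


print(CHANNEL_GBPS / PCIE2_LANE_GT_PER_S)  # PCIe 2.0 lanes' worth of raw signal
print(channel_gb_per_s())                  # usable GB/s per direction
```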
So how much performance can you actually wring out of a Thunderbolt connection?
Gordon Ung at Maximum PC saw peak read transfer speeds of 931MB/s when reading from a RAID 0 of four 240GB SandForce SF-2281 SSDs in a Pegasus R4 chassis.
AnandTech got an SSD RAID in a Pegasus chassis up to 1002MB/s at its very peak, which seems to be right at the practical limit of a single Thunderbolt channel. That was using a RAID 0 of four 128GB 6Gbps SATA SSDs, running sustained 2MB reads at a queue depth of 10.
A four-way RAID 0 of SSDs is going to be too fast for a first-gen Thunderbolt connection. A two-drive RAID 0 can approach twice the speed of its individual drives. As we discussed in Part One, a good 6Gbps SATA SSD can hit 515MB/s. A RAID 0 of two 6Gbps SSDs can easily saturate the 10Gbps connection available in first-gen Thunderbolt. A four-way RAID 0 can go far faster, but not while attached via Thunderbolt.
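To see where the array hits the wall, here is a small sketch. It assumes ideal RAID 0 scaling (each added drive contributes its full speed), which real arrays only approach, and the function names are invented for the example.

```python
# Sketch: an idealized RAID 0 array behind a single first-gen Thunderbolt channel.
# Assumes perfect RAID 0 scaling, which real arrays only approach.
# Names are hypothetical; 515 MB/s and 1000 MB/s are the article's figures.

DRIVE_MB_PER_S = 515.0     # a good 6Gbps SATA SSD
CHANNEL_MB_PER_S = 1000.0  # one Thunderbolt channel, one direction


def raid0_mb_per_s(drives: int, drive_speed: float = DRIVE_MB_PER_S) -> float:
    """Ideal sequential throughput of the array on its own."""
    return drives * drive_speed


def delivered_mb_per_s(drives: int, link_cap: float = CHANNEL_MB_PER_S) -> float:
    """What reaches the host: the slower of the array and the link."""
    return min(raid0_mb_per_s(drives), link_cap)


for drives in (1, 2, 4):
    print(drives, raid0_mb_per_s(drives), delivered_mb_per_s(drives))
```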
A very brief note on PCIe SSD performance (compared to Thunderbolt)
The OCZ RevoDrive 3 x2, a PCIe-attached SSD, can hit 1.5GB/s peak in some sequential read tests on a PCIe 2.0 x4 connection. That drive uses a SAS-to-PCIe controller, rather than a SATA controller to a RAID card to a PCIe connection, but surely that can't explain the entire speed difference. After all, Thunderbolt is a PCIe 2.0 x4 connection too, right? Sort of. Each Thunderbolt storage drive is limited to one channel with a maximum bandwidth of 1GB/s. The RevoDrive 3 x2 can use the entire PCIe x4 connection, with its peak (post-overhead) bandwidth of nearly 2GB/s.
The Next Thunderbolt
The next version of Thunderbolt, cleverly named Thunderbolt 2, will let you combine both channels into one, with a theoretical maximum of 20Gbps (2GB/s, post-encoding), allowing devices to take advantage of all four PCIe 2.0 lanes in the Thunderbolt connection. It also brings increased bandwidth to the display side of things; you'll be able to stream 4K video to that fancy 4K monitor you've got lying around. So far, Thunderbolt 2 is only available on a couple of motherboards from Asus, but it'll ship on the new Mac Pro as well, if and when that beautiful, weird-ass cylinder ever emerges from One Infinite Loop.
Despite the limitations of first-gen Thunderbolt, it's still a far better external storage interface than USB 3.0, which at best is only half the speed of a first-gen Thunderbolt connection (5Gbps maximum) and in real life, as we saw last time, doesn't hit anywhere close to its maximum theoretical throughput.