Editing PCI Express (section)

== Hardware protocol summary ==
The PCIe link is built around dedicated unidirectional couples of serial (1-bit), point-to-point connections known as ''lanes''. This is in sharp contrast to the earlier PCI connection, which is a bus-based system where all the devices share the same bidirectional, 32-bit or 64-bit parallel bus.

PCI Express is a [[layered protocol]], consisting of a ''[[#Transaction layer|transaction layer]]'', a ''[[#Data link layer|data link layer]]'', and a ''[[#Physical layer|physical layer]]''. The Data Link Layer is subdivided to include a [[media access control]] (MAC) sublayer. The Physical Layer is subdivided into logical and electrical sublayers. The Physical logical-sublayer contains a physical coding sublayer (PCS). The terms are borrowed from the [[IEEE 802]] networking protocol model.

=== Physical layer <span class="anchor" id="PHYSICAL-LAYER"></span> ===
{| class="wikitable floatright" style="margin-left: 1.5em; margin-right: 0; margin-top: 0;"
|+ Connector pins and lengths
|-
! rowspan="2" | Lanes
! colspan="2" | Pins
! colspan="2" | Length
|-
! Total
! Variable
! Total
! Variable
|-
| {{0}}x1 || 2×18 = {{0}}36<ref name="9tQ3g" /> || 2×{{0}}7 = {{0}}14 || 25&nbsp;mm || {{0}}7.65&nbsp;mm
|-
| {{0}}x4 || 2×32 = {{0}}64 || 2×21 = {{0}}42 || 39&nbsp;mm || 21.65&nbsp;mm
|-
| {{0}}x8 || 2×49 = {{0}}98 || 2×38 = {{0}}76 || 56&nbsp;mm || 38.65&nbsp;mm
|-
| {{0}}x16 || 2×82 = 164 || 2×71 = 142 || 89&nbsp;mm || 71.65&nbsp;mm
|}

[[File:PCIe J1900 SoC ITX Mainboard IMG 1820.JPG|thumb|An open-end PCI Express x1 connector lets longer cards that use more lanes be plugged while operating at x1 speeds.]]

The PCIe Physical Layer (''PHY'', ''PCIEPHY'', ''PCI Express PHY'', or ''PCIe PHY'') specification is divided into two sub-layers, corresponding to electrical and logical specifications. The logical sublayer is sometimes further divided into a MAC sublayer and a PCS, although this division is not formally part of the PCIe specification. A specification published by Intel, the PHY Interface for PCI Express (PIPE),<ref name="pipe_spec" /> defines the MAC/PCS functional partitioning and the interface between these two sub-layers. The PIPE specification also identifies the ''physical media attachment'' (PMA) layer, which includes the [[SerDes|serializer/deserializer (SerDes)]] and other analog circuitry; however, since SerDes implementations vary greatly among [[Application-specific integrated circuit|ASIC]] vendors, PIPE does not specify an interface between the PCS and PMA.

At the electrical level, each lane consists of two unidirectional [[Differential signaling|differential pair]]s operating at 2.5, 5, 8, 16 or 32&nbsp;[[Gigabit|Gbit]]/s, depending on the negotiated capabilities. Transmit and receive are separate differential pairs, for a total of four data wires per lane.

A connection between any two PCIe devices is known as a ''link'', and is built up from a collection of one or more ''lanes''. All devices must minimally support single-lane (x1) link. Devices may optionally support wider links composed of up to 32 lanes.<ref name="PCIe-System-Architecture">{{Cite web|url=https://www.mindshare.com/files/ebooks/PCI%20Express%20System%20Architecture.pdf|title=PCI Express System Architecture}}</ref><ref name="Intel-PCIe">{{Cite web|url=https://www.intel.com/content/www/us/en/support/ru-banner.html|title=Communications|website=Intel}}</ref> This allows for very good compatibility in two ways:
* A PCIe card physically fits (and works correctly) in any slot that is at least as large as it is (e.g., a x1 sized card works in any sized slot);
* A slot of a large physical size (e.g., x16) can be wired electrically with fewer lanes (e.g., x1, x4, x8, or x12) as long as it provides the ground connections required by the larger physical slot size.
In both cases, PCIe negotiates the highest mutually supported number of lanes. Many graphics cards, motherboards and [[BIOS]] versions are verified to support x1, x4, x8 and x16 connectivity on the same connection.

The width of a PCIe connector is 8.8&nbsp;mm, while the height is 11.25&nbsp;mm, and the length is variable. The fixed section of the connector is 11.65&nbsp;mm in length and contains two rows of 11 pins each (22 pins total), while the length of the other section is variable depending on the number of lanes. The pins are spaced at 1&nbsp;mm intervals, and the thickness of the card going into the connector is 1.6&nbsp;mm.<ref name="pcie_schematics1" /><ref name="pcie_schematics2" />

==== Data transmission ====
PCIe sends all control messages, including interrupts, over the same links used for data. The serial protocol can never be blocked, so latency is still comparable to conventional PCI, which has dedicated interrupt lines. When the problem of IRQ sharing of pin based interrupts is taken into account and the fact that message signaled interrupts (MSI) can bypass an I/O APIC and be delivered to the CPU directly, MSI performance ends up being substantially better.<ref name="vV4Hv" />

Data transmitted on multiple-lane links is interleaved, meaning that each successive byte is sent down successive lanes. The PCIe specification refers to this interleaving as ''data striping''. While requiring significant hardware complexity to synchronize (or [[clock skew|deskew]]) the incoming striped data, striping can significantly reduce the latency of the ''n''th byte on a link. While the lanes are not tightly synchronized, there is a limit to the ''lane to lane skew'' of 20/8/6&nbsp;ns for 2.5/5/8&nbsp;GT/s so the hardware buffers can re-align the striped data.<ref name="iPAaS" /> Due to padding requirements, striping may not necessarily reduce the latency of small data packets on a link.

As with other high data rate serial transmission protocols, the clock is [[self-clocking signal|embedded]] in the signal. At the physical level, PCI Express 2.0 utilizes the [[8b/10b encoding]] scheme<ref name="faq3" /> (line code) to ensure that strings of consecutive identical digits (zeros or ones) are limited in length. This coding was used to prevent the receiver from losing track of where the bit edges are. In this coding scheme every eight (uncoded) payload bits of data are replaced with 10 (encoded) bits of transmit data, causing a 20% overhead in the electrical bandwidth. To improve the available bandwidth, PCI Express version 3.0 instead uses [[64b/66b encoding|128b/130b]] encoding (1.54% overhead). [[Line encoding]] limits the run length of identical-digit strings in data streams and ensures the receiver stays synchronised to the transmitter via [[clock recovery]].

A desirable balance (and therefore [[spectral density]]) of 0 and 1 bits in the data stream is achieved by [[XOR]]ing a known [[Linear-feedback shift register|binary polynomial]] as a "[[scrambler]]" to the data stream in a feedback topology. Because the scrambling polynomial is known, the data can be recovered by applying the XOR a second time. Both the scrambling and descrambling steps are carried out in hardware.

Dual simplex in PCIe means there are two simplex channels on every PCIe lane. Simplex means communication is only possible in one direction. By having two simplex channels, two-way communication is made possible. One differential pair is used for each channel.<ref>{{cite web |title=PCIe Data Transmission Overview  |website=[[Microchip Technology]] |url=https://ww1.microchip.com/downloads/aemDocuments/documents/TCG/ProductDocuments/Brochures/00003818.pdf }}</ref><ref name="auto"/><ref>{{cite book | url=https://books.google.com/books?id=k_HJCgAAQBAJ&dq=pcie+differential+pair&pg=PT128 | isbn=978-0-7686-9003-3 | title=CompTIA A+ Exam Cram (Exams 220-602, 220-603, 220-604) | date=19 July 2007 | publisher=Pearson Education }}</ref>

=== Data link layer ===
The data link layer performs three vital services for the PCIe link:
# sequence the transaction layer packets (TLPs) that are generated by the transaction layer,
# ensure reliable delivery of TLPs between two endpoints via an acknowledgement protocol ([[Acknowledge character|ACK]] and [[Negative-acknowledge character|NAK]] signaling) that explicitly requires replay of unacknowledged/bad TLPs,
# initialize and manage flow control credits

On the transmit side, the data link layer generates an incrementing sequence number for each outgoing TLP. It serves as a unique identification tag for each transmitted TLP, and is inserted into the header of the outgoing TLP. A 32-bit [[cyclic redundancy check]] code (known in this context as Link CRC or LCRC) is also appended to the end of each outgoing TLP.

On the receive side, the received TLP's LCRC and sequence number are both validated in the link layer. If either the LCRC check fails (indicating a data error), or the sequence-number is out of range (non-consecutive from the last valid received TLP), then the bad TLP, as well as any TLPs received after the bad TLP, are considered invalid and discarded. The receiver sends a negative acknowledgement message (NAK) with the sequence-number of the invalid TLP, requesting re-transmission of all TLPs forward of that sequence-number. If the received TLP passes the LCRC check and has the correct sequence number, it is treated as valid. The link receiver increments the sequence-number (which tracks the last received good TLP), and forwards the valid TLP to the receiver's transaction layer. An ACK message is sent to remote transmitter, indicating the TLP was successfully received (and by extension, all TLPs with past sequence-numbers.)

If the transmitter receives a NAK message, or no acknowledgement (NAK or ACK) is received until a timeout period expires, the transmitter must retransmit all TLPs that lack a positive acknowledgement (ACK). Barring a persistent malfunction of the device or transmission medium, the link-layer presents a reliable connection to the transaction layer, since the transmission protocol ensures delivery of TLPs over an unreliable medium.

In addition to sending and receiving TLPs generated by the transaction layer, the data-link layer also generates and consumes data link layer packets (DLLPs). ACK and NAK signals are communicated via DLLPs, as are some power management messages and flow control credit information (on behalf of the transaction layer).

In practice, the number of in-flight, unacknowledged TLPs on the link is limited by two factors: the size of the transmitter's replay buffer (which must store a copy of all transmitted TLPs until the remote receiver ACKs them), and the flow control credits issued by the receiver to a transmitter. PCI Express requires all receivers to issue a minimum number of credits, to guarantee a link allows sending PCIConfig TLPs and message TLPs.

=== Transaction layer ===
PCI Express implements split transactions (transactions with request and response separated by time), allowing the link to carry other traffic while the target device gathers data for the response.

PCI Express uses credit-based flow control. In this scheme, a device advertises an initial amount of credit for each received buffer in its transaction layer. The device at the
opposite end of the link, when sending transactions to this device, counts the number of credits each TLP consumes from its account. The sending device may only transmit a TLP when doing so does not make its consumed credit count exceed its credit limit. When the receiving device finishes processing the TLP from its buffer, it signals a return of credits to the sending device, which increases the credit limit by the restored amount. The credit counters are modular counters, and the comparison of consumed credits to credit limit requires [[modular arithmetic]]. The advantage of this scheme (compared to other methods such as wait states or handshake-based transfer protocols) is that the latency of credit return does not affect performance, provided that the credit limit is not encountered. This assumption is generally met if each device is designed with adequate buffer sizes.

PCIe 1.x is often quoted to support a data rate of 250&nbsp;MB/s in each direction, per lane. This figure is a calculation from the physical signaling rate (2.5&nbsp;[[gigabaud]]) divided by the encoding overhead (10 bits per byte). This means a sixteen lane (x16) PCIe card would then be theoretically capable of 16x250&nbsp;MB/s = 4&nbsp;GB/s in each direction. While this is correct in terms of data bytes, more meaningful calculations are based on the usable data payload rate, which depends on the profile of the traffic, which is a function of the high-level (software) application and intermediate protocol levels.

Like other high data rate serial interconnect systems, PCIe has a protocol and processing overhead due to the additional transfer robustness (CRC and acknowledgements). Long continuous unidirectional transfers (such as those typical in high-performance storage controllers) can approach >95% of PCIe's raw (lane) data rate. These transfers also benefit the most from increased number of lanes (x2, x4, etc.) But in more typical applications (such as a [[Universal Serial Bus|USB]] or [[Ethernet]] controller), the traffic profile is characterized as short data packets with frequent enforced acknowledgements.<ref name="traffic_profile" /> This type of traffic reduces the efficiency of the link, due to overhead from packet parsing and forced interrupts (either in the device's host interface or the PC's CPU). Being a protocol for devices connected to the same [[printed circuit board]], it does not require the same tolerance for transmission errors as a protocol for communication over longer distances, and thus, this loss of efficiency is not particular to PCIe.

=== Efficiency of the link ===
As for any network-like communication links, some of the raw bandwidth is consumed by protocol overhead:<ref name="Xilinx">{{cite web|title=Understanding Performance of PCI Express Systems|url=https://www.xilinx.com/support/documentation/white_papers/wp350.pdf|last=Lawley|first=Jason|publisher=Xilinx|version=1.2|date=2014-10-28}}</ref>

A PCIe 1.x lane for example offers a data rate on top of the physical layer of 250&nbsp;MB/s (simplex). This is not the payload bandwidth but the physical layer bandwidth – a PCIe lane has to carry additional information for full functionality.{{r|Xilinx}}

{| class="wikitable"
|+Gen 2 Transaction Layer Packet{{r|Xilinx|p=3}}
!scope="row" scope="col" style="width: 80px;" |Layer
!scope="col" style="width: 20px;" |PHY
!scope="col" style="width: 120px;" |Data Link Layer
!scope="col" style="width: 400px;" colspan="3" |Transaction
!scope="col" style="width: 120px;" |Data Link Layer
!scope="col" style="width: 20px;" |PHY
|-
!scope="row" |Data
|Start
|Sequence
|scope="col" style="width: 75px;" |Header
|scope="col" style="text-align:center; width: 250px;" |Payload
|scope="col" style="width: 75px;" |ECRC
|LCRC
|End
|-
!scope="row" |Size (Bytes)
|1
|2
|12 or 16
|scope="col" style="text-align:center;" |0 to 4096
|4 (optional)
|4
|1
|}
The Gen2 overhead is then 20, 24, or 28 bytes per transaction.{{Clarify |reason=Don't we also have a 8/10b encoding overhead that's not factored in to any of this?|date=September 2021}}{{Citation needed|reason=I fixed the bad math here, but it needs a source, not me and a calculator|date=September 2021}}

{| class="wikitable"
|+Gen 3 Transaction Layer Packet{{r|Xilinx|p=3}}
!scope="row" scope="col" style="width: 80px;" |Layer
!scope="col" style="width: 40px;" |PHY
!scope="col" style="width: 120px;" |Data Link Layer
!scope="col" colspan="3" style="width: 400px;" |Transaction Layer
!scope="col" style="width: 120px;" |Data Link Layer
|-
!scope="row" |Data
|Start
|Sequence
|scope="col" style="width: 75px;" |Header
|scope="col" style="width: 250px;text-align:center;" |Payload
|scope="col" style="width: 75px;" |ECRC
|LCRC
|-
!scope="row" |Size (Bytes)
|4
|2
|12 or 16
|scope="col" style="text-align:center; |0 to 4096
|4 (optional)
|4
|}
The Gen3 overhead is then 22, 26 or 30 bytes per transaction.<!-- Seriously how did somebody get odd numbers out of this? -->{{Clarify |reason=Don't we also have a 128/130b encoding overhead that's not factored in to any of this?|date=September 2021}}{{Citation needed|reason=I fixed the bad math here, but it needs a source, not me and a calculator|date=September 2021}}

The <math>\text{Packet Efficiency} = \frac{\text{Payload}}{\text{Payload} + \text{Overhead}}</math> for a 128 byte payload is 86%, and 98% for a 1024 byte payload. For small accesses like register settings (4 bytes), the efficiency drops as low as 16%.{{Citation needed|reason=Formula is in text, but it didn't state this anywhere or anything about register settings being 4 bytes or the like... and most of the PCIe config registers aren't on the devices and don't need a bus access, they're just sitting around in a DMA region mapped to the CPU's control registers|date=September 2021}}

The maximum payload size (MPS) is set on all devices based on smallest maximum on any device in the chain. If one device has an MPS of 128 bytes, ''all'' devices of the tree must set their MPS to 128 bytes. In this case the bus will have a peak efficiency of 86% for writes.{{r|Xilinx|p=3}}