The reference clock is multiplied up through a PLL to the line rate (2.5Gb/sec, 5Gb/sec and 8Gb/sec for versions 1.x, 2.x and 3.x respectively); this determines the data rate from a transmitter.
The clock is effectively embedded in the data stream by the line coding, which is 8 bit / 10 bit for 2.5Gb/sec and 5Gb/sec and 128 bit / 130 bit for gen 3 (8Gb/sec; see third paragraph). Note that the timing of the coded stream is derived from the reference clock (as multiplied up).
This allows the receiver to use standard clock recovery techniques.
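As a rough illustration of the numbers above, here is a small Python sketch that works out the PLL multiplication factor and the payload rate left after line coding for each generation. The 100 MHz reference frequency is my assumption (it is the usual PCIe value but is not stated above), and the names `REFCLK_HZ`, `GENERATIONS` and `lane_figures` are made up for the example.

```python
from fractions import Fraction

# Assumption: 100 MHz reference clock (not stated in the text above).
REFCLK_HZ = 100_000_000

# Per generation: (line rate in bits/s, payload bits, coded bits), as quoted above.
GENERATIONS = {
    "1.x": (2_500_000_000, 8, 10),
    "2.x": (5_000_000_000, 8, 10),
    "3.x": (8_000_000_000, 128, 130),
}

def lane_figures(line_rate, payload_bits, coded_bits, refclk_hz=REFCLK_HZ):
    """Return (PLL multiplication factor, payload bits/s) for one lane."""
    multiplier = Fraction(line_rate, refclk_hz)
    payload_rate = Fraction(line_rate) * payload_bits / coded_bits
    return multiplier, payload_rate

for gen, params in GENERATIONS.items():
    mult, payload = lane_figures(*params)
    print(f"PCIe {gen}: x{mult} from the reference, "
          f"{float(payload) / 1e9:.3f} Gb/s payload per lane")
```

Under that assumption the multipliers come out as 25, 50 and 80, and the coding overhead drops from 20% with 8 bit / 10 bit to about 1.5% with 128 bit / 130 bit.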
A common reference clock is not necessary (in any version); this is the reason the SKP (skip) ordered set exists. It allows the reference clocks at the two link partners to differ: the specification permits each reference clock to be within +/- 300ppm, so a relatively inexpensive device may be used. Receivers implement elastic buffers to cross the timing domains.
This clock domain crossing mechanism eliminates skew issues between clocks.
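To put a number on what the elastic buffer has to absorb, here is a back-of-envelope Python sketch. It takes the +/- 300ppm tolerance quoted above and assumes the worst case, where one partner runs at +300ppm and the other at -300ppm; the function names are made up for the example.

```python
# Per-partner reference clock tolerance, as quoted above.
REFCLK_TOLERANCE_PPM = 300

def worst_case_offset_ppm(tolerance_ppm=REFCLK_TOLERANCE_PPM):
    """Worst case: one partner fast and the other slow by the full tolerance."""
    return 2 * tolerance_ppm

def unit_intervals_per_slip(offset_ppm):
    """Number of bit times after which the two clocks differ by one whole bit."""
    return 1_000_000 / offset_ppm

offset = worst_case_offset_ppm()
print(f"worst-case offset: {offset} ppm")
print(f"one bit of drift roughly every {unit_intervals_per_slip(offset):.0f} bit times")
```

With a 600ppm worst-case offset the two sides drift apart by one bit about every 1667 bit times (the same ratio applies to symbols), which is why the receiver needs periodic SKP ordered sets it can lengthen or shorten to keep the elastic buffer from overflowing or underflowing.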
Note that even a common reference clock is almost guaranteed to arrive with a phase difference at the link partners, so it will still need a 1 bit FIFO (as was used in HyperTransport, which did require a common reference clock).
In one design, I had 8 potential PCIe link partners; here is where a shared reference clock makes sense.
I used one master reference clock ($20) and a single 8 channel clock buffer ($20), a lot cheaper than 8 reference clocks.
For designs where the links traverse cables and/or multiple connectors in multi-PCB designs, shared references are not really suitable as the reference clock at each link partner needs to be nice and clean.