19

According to https://en.wikipedia.org/wiki/Cray-1

The Cray-1 was built as a 64-bit system, a departure from the 7600/6600, which were 60-bit machines (a change was also planned for the 8600). Addressing was 24-bit, with a maximum of 1,048,576 64-bit words (1 megaword) of main memory, where each word also had 8 parity bits for a total of 72 bits per word.[10] There were 64 data bits and 8 check bits.

It seems to me by the nature of parity, it should suffice to have one bit of overhead per word, rather than eight. I can understand on something like an 8088/87, you might be stuck with 1/8 because the memory system deals in eight bits at a time, but why is it that way on a 64-bit machine?

rwallace
  • 60,953
  • 17
  • 229
  • 552
  • Every parity bit you add halves the error rate. Hence 8 bits divide it by 256. (Though as error correction was used as well, the improvement is not so good.) – Yves Daoust Mar 06 '19 at 17:24
  • 7
    8/64 = 1/8. Guess how many parity bits modern computers use for parity on bytes?? – RonJohn Mar 06 '19 at 20:29
  • 1
    Isn't this the most common configuration of ECC? Even today, ECC DIMMs are also 64+8. – user3528438 Mar 30 '20 at 03:42

4 Answers

29

There were 64 data bits and 8 check bits.

It seems to me by the nature of parity, it should suffice to have one bit of overhead per word, rather than eight. [...]

What you refer to here is simple single-bit parity: one extra bit records whether the number of ones in the word is even (even parity) or odd (odd parity). Such a mechanism can only detect an odd number of bit flips (1 or 3 or 5 or ... bits flipping). An even number of flips cancels out in the count and goes undetected, resulting in silent computing errors.
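
For illustration, here is a minimal Python sketch of single-bit even parity over a 64-bit word (purely illustrative, not Cray hardware behaviour), showing why a double flip slips through:

```python
def even_parity(word):
    """One check bit: the parity of the count of ones in a 64-bit word."""
    return bin(word & (2**64 - 1)).count("1") & 1

stored = 0x0123456789ABCDEF
check  = even_parity(stored)            # kept alongside the word

corrupted = stored ^ (1 << 5)           # one bit flips: detectable
assert even_parity(corrupted) != check

corrupted ^= 1 << 17                    # a second bit flips: the count
assert even_parity(corrupted) == check  # is even again - goes undetected
```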

What the Cray uses is a parity system based on Hamming encoding. Encoding parity this way allows detection of double-bit errors within a word and even on-the-fly correction of single-bit errors. The 8-bit code used was able to correct single-bit errors (SEC) and detect double-bit errors (DED).
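
To make the mechanism concrete, here is a toy Python sketch of the same idea at a smaller scale: a Hamming(7,4) code plus an overall parity bit, i.e. SEC-DED over 4 data bits. The Cray's actual code protects 64 data bits with 8 check bits, but the principle is identical; the function names here are made up for illustration:

```python
def secded_encode(nibble):
    """Encode 4 data bits as Hamming(7,4) plus one overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]          # data bits d1..d4
    p1 = d[0] ^ d[1] ^ d[3]                            # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                            # covers positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]                            # covers positions 4,5,6,7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]        # codeword positions 1..7
    overall = 0
    for b in bits:
        overall ^= b                                   # parity over whole codeword
    return bits + [overall]                            # 8 bits total

def secded_decode(bits):
    """Return (status, nibble): 'ok', 'corrected', or ('double-error', None)."""
    b = bits[:]
    s = ((b[0] ^ b[2] ^ b[4] ^ b[6])
         + 2 * (b[1] ^ b[2] ^ b[5] ^ b[6])
         + 4 * (b[3] ^ b[4] ^ b[5] ^ b[6]))            # syndrome = error position
    overall = 0
    for x in b:
        overall ^= x                                   # 0 if parity still even
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                                 # odd parity: a single flip
        if s:
            b[s - 1] ^= 1                              # flip the bad bit back
        status = "corrected"                           # (s == 0: the p-bit itself flipped)
    else:                                              # syndrome set, parity even:
        return "double-error", None                    # two flips, not correctable
    return status, b[2] | b[4] << 1 | b[5] << 2 | b[6] << 3
```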

So while a machine with single-bit parity can detect single bit flips, it will always fail on double flips. Further, even when an error is detected, the only remedy is to halt the program. With SEC-DED, a detected single-bit error is recovered on the fly (at the cost of maybe a few cycles) and only a multi-bit error halts the machine.
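
Continuing the toy sketch above, single and double flips behave exactly as described:

```python
word = secded_encode(0b1011)
word[5] ^= 1                        # one bit flips in storage
print(secded_decode(word))          # ('corrected', 11) - the job keeps running
word[2] ^= 1                        # a second flip in the same word
print(secded_decode(word))          # ('double-error', None) - halt and report
```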

I can understand on something like an 8088/87, you might be stuck with 1/8 because the memory system deals in eight bits at a time, but why is it that way on a 64-bit machine?

Because it's still just 1/8th, but now with improved flavour :))

Considering the quite important function of invisible error correction, the question is rather why only 8. Longer codes would allow detection of longer error bursts and even multi-bit correction. With the 1 Ki × 1 RAMs used (Fairchild 10415FC), any width could have been built. Then again, the Cray-1 architecture marks the switch to the 'new' standard of 8-bit units, so using 8 check bits comes naturally. Doesn't it?
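
In fact, 8 is also exactly what the textbook arithmetic demands: a Hamming SEC code over k data bits needs r check bits with 2^r >= k + r + 1, plus one extra bit for double-error detection. A quick sketch (this is the standard Hamming bound, nothing Cray-specific):

```python
def secded_check_bits(k):
    """Minimum check bits for SEC-DED over k data bits (Hamming bound + 1)."""
    r = 1
    while 2**r < k + r + 1:
        r += 1
    return r + 1

for k in (8, 16, 32, 64):
    print(k, "data bits ->", secded_check_bits(k), "check bits")
# 64 data bits -> 8 check bits: precisely the Cray-1's 64+8 word layout.
```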


Remark#1

Ultimately it's the same development the PC took, except that instead of going from 9-bit memory (SIMM) over 36-bit (PS/2) to today's 72-bit DIMMs, the Cray-1 leapfrogged all of this and started with 72 bits right away.


Remark#2

Seymour Cray is known to have said that 'Parity is for Farmers' when designing the 6600. While this quote famously inspired the reply 'Farmers buy Computers' when parity was introduced with the 7600, not many know what he was referring to on an implied level: the Doctrine of Parity, a US policy to make farming profitable again during and after the Great Depression, a policy that to some degree still results in higher food prices in the US than in most other countries.


Remark#3

The Cray Y-MP of 1990 even went a step further and added parity to (most) registers. Also the code was changed to enable double-bit correction and multi-bit detection.

Raffzahn
  • 222,541
  • 22
  • 631
  • 918
  • 4
    Cray certainly resisted parity and error checking hardware in the Cray-1, because it was a performance hit. AFAIK one (the first production?) Cray-1 was built without parity and delivered to a US government agency (can't remember exactly where), and it did have better benchmarked performance than any of the later production machines. – alephzero Mar 06 '19 at 12:16
  • 2
    @alephzero: Would parity have required a performance hit if its sole function was to sound an alarm in case of parity fault to notify the user that the output from the current job should not be trusted, as opposed to trying to prevent erroneous computations? Even if parity-validation logic wouldn't be able to indicate whether a fetch had received valid data until long after the data had already been used, it could still provide an extremely valuable pass-fail indication of whether the output from a job should be trusted. – supercat Mar 06 '19 at 19:09
  • @supercat: Per my CAL (Cray Assembler Language) reference card next to me, memory cycle time for scalar access is 11 clock periods but 10 clock periods for Serial 1 (which had parity rather than SECDED protection). There was in fact a performance hit. – Edward Barnard Mar 27 '19 at 00:17
  • 1
    @EdwardBarnard: You're saying the 10 cycle duration was for parity but not SECDED? If so, then unless there was some faster mode without any sort of parity protection, it sounds like you're saying there was only a performance hit if one needed to be able to recover from parity errors (as opposed to merely sounding an alarm). – supercat Mar 27 '19 at 03:13
  • 2
    @supercat: Memory access was either "vector mode" or "scalar mode", with access time a bit faster for vector mode - but still 1 clock period faster for Serial 1. There's a third mode, instruction fetch, not relevant here. This was literally wired into the hardware; no option to turn on or off. There WAS an option as to whether or not generate a hardware interrupt to report single, double, or both, but the single-bit-error-correction happened regardless of interrupt settings. I never worked with Serial 1 personally but did other CRAY-1's inside the operating system. – Edward Barnard Mar 27 '19 at 15:31
  • 1
    @EdwardBarnard: If parity were only needed for the purposes of sounding an alarm, I don't see why it should need to have any performance impact at all, given that one could have a "parity storage and monitoring" circuit which took the current address, data, and read/write status as inputs, and had an alarm output, but didn't influence main system behavior in any way whatsoever beyond minimal capacitive loading on the address and data bus lines. Even if computing parity from a word that was being read or written would take multiple cycles, so what? Use clocked registers to perform... – supercat Oct 19 '23 at 17:57
  • ...pipeline computation of the parity, and a matching number of clocked registers to delay the address bus value by the same amount. One might not find out about a parity error until a few cycles after the result had been used, but if the effect of the alarm was to turn on an error indicator that would be checked at the end of each job, the only effect of the delay would be to make it necessary for code to wait a few cycles after finishing each complete job to know whether anything bad might have happened. – supercat Oct 19 '23 at 18:00
  • @supercat There's a distinction between memory access time, memory cycle time, parity check time, error correction time, and double-error detection time. CRAY-1 serial 1 did not have error correction capability, which is why 1 clock cycle faster. In other words, the "correct a single-bit error on the fly, or detect and report a multiple-bit error on the fly" feature required one additional clock cycle. Each memory bank could produce a new value once every 50 ns, which is 4 clock cycles on the original CRAY-1. That's different from the value ARRIVAL time into a register... – Edward Barnard Oct 27 '23 at 18:25
  • Once the FIRST value arrives, additional values can arrive every clock cycle in lock step, for vector and instruction-fetch operations. And for scalar operations as well when the instructions are interleaved correctly. For example, "load content of address 1234 into register S1" followed by "load content of address 1235 into register S2", assuming no conflicts, means S1 contains value 11 clock periods later, and S2 contains its value exactly one clock period later. The "fetch from memory" operation is... – Edward Barnard Oct 27 '23 at 18:29
  • one clock period faster on serial 1. Or, put correctly, the "fetch from memory" operation is one clock period slower on all CRAY-1's from serial 3 onwards. (The status of serial number 2 is debatable/classified.) The added 1-clock-period delay was due to replacing "single error detection" with "single error correction" circuitry. Note that I've still not quite answered your question, but this IS part of the answer. That's because there's no "this is how long error correction takes" information in hardware reference. Need the Boolean for that, which I've not seen since 1981... – Edward Barnard Oct 27 '23 at 18:34
  • In other words the CRAY-1 hardware reference manual (a resource for SOFTWARE people not hardware people) does not tell us the time lapse due to error detection/correction. It only gives us the total time for memory access and how to bring multiple values from memory in parallel. I should also note that, from Seymour Cray's perspective, monitoring is not of particular interest. When a hardware failure is detected, knock the machine down NOW. Given his quarter-century experience at that time, I have to assume that's what his primary customers wanted. – Edward Barnard Oct 27 '23 at 18:39
  • @EdwardBarnard The division between accounting machines, built to never fail and to halt immediately when they do, and scientific machines trading that safety for more precision is even older, dating back into the 1950s. IBM's 7302 memory is a great example. It was organized as 16 KiWords of 72 bits each. When connected to a commercial machine like a 7030 it was used as 64 bits + 8 bits ECC, but 709x-type scientific CPUs used it as a full 72-bit word without parity, enabling them to store two 36-bit floats, using the additional bits for more precision. – Raffzahn Oct 27 '23 at 22:00
  • @EdwardBarnard The point is that an undetected failure during a scientific computation doesn't cause much harm, except for time lost (and a lot of head scratching). In contrast, an accounting system running through an error means possible loss of huge amounts of money, so the need for security is quite different. – Raffzahn Oct 27 '23 at 22:05
  • @Raffzahn I see what you mean. The Seymour Cray designs were not ONLY for scientific computing. They were first and foremost codebreaking machines for the NSA and its predecessors. From what little I know of 1970s-era cryptanalysis, a single-bit error could render results useless. The memory chip architecture commonly resulted in solid single-bit errors, easily thousands of corrupt results per minute. I saw that personally (as corrected errors were allowed to continue until a maintenance window). – Edward Barnard Oct 29 '23 at 20:37
  • ...here is a 2021 article from the NSA. https://www.nsa.gov/portals/75/documents/news-features/declassified-documents/history-today-articles/10%202018/05OCT2018%20SEYMOUR%20CRAY%20and%20NSA.pdf?ver=P3xsKeHprvcBBChHKi77Gw%3D%3D – Edward Barnard Oct 29 '23 at 20:38
12

After the first Cray-1 was built, some calculation determined that the time between failures would be greatly extended by a single-error-correction, double-error-detection (SECDED) code without much cost in speed. The point is that with a large memory, random single-bit errors occur every few hours; with SECDED, an uncorrectable error occurs only every few years or so.
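
A rough back-of-envelope version of that calculation, in Python with invented rates (the actual Cray-1 failure figures aren't given here), shows how SECDED turns hours into years:

```python
# All rates below are assumptions for illustration, not measured data.
WORDS = 2**20            # 1 Mi words of 72 bits each
BITS  = WORDS * 72
LAM   = 2e-9             # assumed bit flips per bit per hour
T     = 24.0             # assumed hours a word sits before being rewritten

flips_per_hour = BITS * LAM
print(f"single-bit flips: one every {1 / flips_per_hour:.1f} hours")

# With SEC-DED a word is only lost if a second flip lands in it while
# the first is still there: roughly C(72,2) * p^2 per word per interval.
p = LAM * T
word_fail = (72 * 71 / 2) * p**2
mtbf_hours = T / (WORDS * word_fail)
print(f"uncorrectable errors: one every {mtbf_hours / 8760:.0f} years")
```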

ttw
  • 221
  • 1
  • 3
  • Yes. Mean time between failures was a significant consideration. Multi-day runs for a single program were not uncommon. SECDED, allowing the machine to ride through flipped memory bits, was one of the factors enabling long runs without hardware failure. – Edward Barnard Mar 27 '19 at 00:21
  • 2
    Some time in 1977 or so the X1 register on the CDC6400 at Northwestern's Vogelback computer center failed in one bit. Gobs of files got corrupted. The on-site CDC engineer was able to repair the machine (don't know if he replaced the transistor or the module). Unfortunately the backup system had been misconfigured, so files couldn't be recovered. The center was shut down for a day while backups were rerun. If parity had been in place the hardware breakdown wouldn't have caused such an issue. – kd4ttc Oct 01 '20 at 20:50
  • I also recall a program run that produced a faulty result. Reran the job and it worked without change. Only time in my life I saw a single bit error affect a computer program. – kd4ttc Oct 01 '20 at 20:50
6

The extra bits are used to allow for error detection and correction (EDAC).

This scheme is described in detail in the Cray-1 Hardware Reference Manual at page 5-5 (~168).

The use of EDAC in the Cray-1 is rather ironic given that Seymour Cray is (in)famous for once saying

Parity is for farmers.

This is, I think, a reference to farm subsidies in Europe.

Peter Camilleri
  • 1,162
  • 6
  • 13
1

SECDED means single error correction, double error detection. There were enough memory picks or drops to really benefit from the single error correction, and if a module failed two bits in a word, it flagged an error. The performance hit was, say, 2 clock cycles, maybe 4, for the first operand set through the channel, but every clock after that out comes an answer. Or so it did then. We built it into the first RAID arrays we sold to NASA around '88 or '89. That was a big win there for big data to come.

Kevin
  • 11
  • 1