How can floating point addition be so slow on a BESM-6?

Question

In the BESM-6 technical manuals — for example, in the ALU description, page 4 — there is a table specifying min/max/average instruction latencies in clock cycles (the last 3 columns; the clock rate was 9 MHz):

The rows are:

1. Addition, {ACC, Y} := ACC + OP
1. Subtraction, {ACC, Y} := ACC - OP
1. Reverse subtraction, {ACC, Y} := OP - ACC
1. Subtraction of absolute values, {ACC, Y} := abs(ACC) - abs(OP)
1. Negation, ACC := OP < 0 ? -ACC : ACC
1. Division, ACC := ACC / OP, requires OP to be normalized, otherwise traps.
1. Multiplication, {ACC, Y} := ACC * OP

The ACCumulator and the OPerand are in the "simplistic" floating point format: 7 base-2 exponent bits, a sign bit, 40 mantissa bits in 2's complement format. All arithmetic instruction except division and negation produce a result with 80 bits of mantissa: the upper bits in the accumulator, the lower bits in the separate Y register.

The question is, how come the additive operations can take up to 280 clock ticks? Unlike multiplication and division, addition requires equalizing the exponents by shifting the mantissa of one of the operands to the right, 1 bit per cycle; if there was no circuit terminating the equalization early as soon as all significant mantissa bits of the operand with the smaller exponent were shifted away, the process might take up to 127 cycles; but then the actual addition, performed as parallel bitwise 3-to-2 reductions, until no carries are left to reduce, should not have anything left to do. Then, normalization to the left might take up to 128 cycles, for at most 256 plus a handful, when zero with the lowest exponent is added to zero with the highest exponent. Still, quite far from 280.

My experiments implementing the algorithm deduced from the documentation show the following statistics, out of 100 million trials each:

Integer addition (equal exponents in the operands, final normalization to the left is disabled):

min = 5, max = 31, avg = 9.66, median = 9

Obviously, the theoretical maximum would be a shade greater than 40, when adding -1 and 1.

Completely random floating point addition (arbitrary exponents, arbitrary potentially denormalized mantissas):

min = 5, max = 169, avg = 51.30, median = 46

Requiring mantissas of the operands to be normalized:

min = 5, max = 151, avg = 50.28, median = 45

Further clamping operands to be within approx. 6 decimal orders of magnitude (absolute difference of exponents no more than 21) of each other, quasi-normally distributed:

min = 5, max = 63, avg = 13.66, median = 13

Thus, the claim of the average (or, more likely, median) latency being 11 cycles, given an equal mix of integer and floating point additions, is substantiated, but the maximum of 280 cycles is a mystery.

And a shift might take two cycles for whatever reason, which would give 254 for the shifts + some small change for remaining stuff. Do you have circuit diagrams, microcode etc. where one could look it up instead of guessing? — dirkt, Jan 09 '19 at 09:25
It seems that the circuit diagrams are in the same document. 9 sheets... Quality of scans seems just good enough to follow the signals. — UncleBod, Jan 09 '19 at 09:45
@Wilson it means 11. Old typewriters didn't allways have a 1 key, so capital i (or lowercase L) were used instead. — UncleBod, Jan 09 '19 at 09:47
My bet its because you need to shift one operand to match the exponent of the other one for +,- operations... *,/ do not need it .... also how much bits the ALU use for computation? IIRC FPUs have more bits for accumulator/subresult than the internal representation of number to increase precision of division ... — Spektre, Jan 09 '19 at 14:03
maybe the extra cycles are to normalize the result, which requires shifting in the opposite direction. — Ken Gober, Jan 09 '19 at 15:04
@Raffzahn No, that's the time the instruction takes just in the ALU, after all fetches. — Leo B., Jan 09 '19 at 15:57
So I don't know. Min and Average looks reasonable, so Max must be rather some kind of exception handling. — Raffzahn, Jan 09 '19 at 16:08
@KenGober That would take up to 80 extra cycles. Still, 127+80 is way less than 280. — Leo B., Jan 09 '19 at 16:49
@Spektre I've accounted for shifting. In the worst case, including normalization to the left, it would add 80 more cycles. Still too little. The actual summing occurs on the 40 bits of the accumulator. If the 1 bits of the operands do not coincide, carry bits do not form and the summing phase terminates. — Leo B., Jan 09 '19 at 17:01
@Raffzahn All instructions may trap due to overflow; so it has to be accounted for in all latencies. — Leo B., Jan 09 '19 at 21:43
Further below, the document says that exponent addition/subtraction/etc takes a minimum of 3, an average of 5, and a maximum of 133(!) cycles. Do you know what it's doing in the maximum case there? Handling traps? I can't imagine it's doing the exponent addition for so many cycles, or else the average would've been much larger. Although thinking about, it could be doing optional normalization there, too. In which case, this doesn't help. (But I'm keeping the comment in case it helps after all) — secondperson, Jan 10 '19 at 01:26
@secondperson Indeed, the exponent addition/etc includes normalization if it is enabled. The only time the 133 cycles can happen then, is when the exponent is 127 after the operation, and the mantissa is 0. That proves that the normalization does run to the bitter end as opposed to detecting the underflow early and zeroing the exponent when mantissa is 0. — Leo B., Jan 10 '19 at 07:38
@Wilson Cyrillic typewriters used I for 1, so that it could be used for (small) Roman numerals as well. У was used instead of V, and П and Ш could be used for "kerning" of II and III resp. ; D and L were missing, but numbered lists don't usually go that far; D is not needed for years since 1900 (MCM), and L could be approximated with Ц and a touch of whiteout. — Leo B., Jan 10 '19 at 19:10
@LeoB. - Okay, in that case, can addition's 280 be explained by: (a) 127 due to exponent matching for the operands, (b) 127 due to normalization of a 0 mantissa, (c) 26 for the operation, which is more or less the same as negation's maximum, despite negation doing neither exponent matching nor normalization (else, its maximum would be > 127) — secondperson, Jan 10 '19 at 19:38
@secondperson The detailed description of the negation operation says that normalization happens if it's not disabled, so I have no idea how the 25 cycles maximum is possible. Even without normalization, negating a 2's complement value requires adding 1 to the mantissa which may need up to 40 cycles for carry reduction. — Leo B., Jan 10 '19 at 20:25
The manual can - of course - have an error somewhere if things don't add up, and the "max' column is a likely place for those errors due to how tricky it is to compute. It should also be noted that for an addition where both exponent-matching and normalization take 127 cycles (one of the operands is right-shifted 127 times & the result is 0), there won't be any carry unless BESM-6's rounding is really weird and a negative operand is right-shifted into an all-ones mantissa instead of into a positive zero. Of course, whether the "max" column takes this into account is anyone's guess. — secondperson, Jan 10 '19 at 21:19
@secondperson A curious case of adding 0 with max exponent 63 and -1 with min exponent (-12^-64 == -5.42e-20) takes 211 cycles and produces -12^63-80 == -7.63e-6. This is not a bug, this is a feature. ;) — Leo B., Jan 10 '19 at 22:52

score 4 · Accepted Answer · answered Jan 11 '19 at 02:36

Compiling details pointed to in the comments by @secondperson and my own findings:

It appears that the documentation is inaccurate. There are apparent typos and/or transcription errors. For example, the max. latency of the negation operation cannot be 25, as the operation involves negating a 40-bit 2's-complement value, which requires up to 40 clock cycles for carry propagation, and, as well as in all other arithmetic operations, there is an optional normalization of the result, which could take up to 127 clock cycles.

The greatest latency of approx. 259, is achieved by adding two numbers with zero mantissas, one with the exponent field set to 0177 (corresponding to 2⁶³), the other with the exponent field set to zero (corresponding to 2^-64). Equalizing the exponents and attempting normalization take ~127 clock cycles each.

It is notable that in all cases of non-trivial operands, requiring actual multi-cycle carry propagation, the latency of the addition instruction as a whole will be substantially less than 260.

Thus it is likely that the max. latency intended to be printed was 260, which is close enough graphically to 280, and the difference was inconsequential for no one to notice. A more glaring case of the max. latency of the negation instruction was not taken care about either.

A curious case had been observed: adding 0 with the max. exponent 63 and -1 with the min. exponent -64 (-1*2^-64 == -5.42⏨-20) takes 211 cycles, and produces, after replicating the sign bit to all 80 mantissa bits during the 127 steps of exponent equalization and subsequent normalization of these 80 bits, the value of -1*2^63-80 == -7.63⏨-6.

How can floating point addition be so slow on a BESM-6?

1 Answers1

Linked