14

Figuring out how many bits in a group of bits are set to 1, known as computing the "population count", the Hamming weight, or the "bit summation", among other names, has various applications.

It is also fairly cheap to implement in hardware, executing as fast as a fixed-point addition, if not faster.

Likely for these reasons, the CDC 6600 (1964) specifically included CXi Xk (opcode 47), described as "Count the number of 1's in Xk to Xi", in its quite short list of instructions, although the IBM STRETCH (1961) had that capability first, as a by-product of all logical operations.

Then there are large gaps in the list of processors with a "popcount" instruction. After the CDCs and Crays, and until the very late 1990s, only SPARC (designed in the mid-1980s) had the POPC instruction.

Not a single IBM processor (not counting the aforementioned IBM STRETCH, where there was no need for a separate instruction), not a single DEC processor (the instruction first appeared in an Alpha CPU when the company was already Compaq's), not a single Motorola processor, not a single Intel or AMD processor until the late 2000s, ...

As a result, countless software engineers had to implement the functionality of POPCOUNT by hand, with varying degrees of efficiency (even GCC introduced __builtin_popcount only in the mid-2000s), and to endure silly interview questions about its efficient implementation.
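
One of the classic hand-rolled implementations, and a common expected answer to those interview questions, is Kernighan's trick of repeatedly clearing the lowest set bit. A minimal editorial sketch in C (not part of the original question; popcount32 is a made-up name):

    #include <stdint.h>

    /* Count set bits by clearing the lowest set bit each iteration
       (Kernighan's trick); loops once per 1-bit, not once per bit position. */
    unsigned popcount32(uint32_t x)
    {
        unsigned count = 0;
        while (x != 0) {
            x &= x - 1;   /* clears the lowest set bit */
            count++;
        }
        return count;
    }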

What was going on? The wiki page doesn't discuss that aspect, but the decades-long omission is so glaring that there had to be some musings on the subject by people "in the know".

I'm not asking people to speculate here; I'm asking for references to existing material about the lack of POPCOUNT in most of the CPUs of the last quarter of the XX century.

Leo B.
  • C++ compiler writers have been free to optimize std::bitset::count() using popcnt or equivalent since bitset was invented (1994?). When they actually started doing so requires more research. See: https://stackoverflow.com/a/45189920/966071 – snips-n-snails Sep 10 '17 at 02:25
  • @traal Could be since 2003. My question is not about software implementations; it is about why there were no hardware implementations of popcount in CPUs introduced between 1970 and 1999, except Crays and SPARCs. – Leo B. Sep 10 '17 at 03:40
  • @Chenmunka With that understanding, I said there had to be some musings on the subject by people "in the know". I'm asking for references to existing treatises. – Leo B. Sep 10 '17 at 07:49
  • Probably because ADD and SUB have more applications? I can remember only a few occasions of wanting such an instruction in the last 20 years. – tofro Sep 10 '17 at 08:09
  • @tofro - I'm going to guess that you don't spend much time writing operating systems. This facility is extremely useful for memory management. If you have a word that contains one bit for whether each of a group of memory pages is currently allocated or not, POPCOUNT will tell you whether it's useful to examine that word in order to attempt to allocate a block of a given size. Since this is a very important operation, I'd say it's worth including in an ISA for this reason only. – Jules Sep 10 '17 at 11:25
  • @Jules Your guess is wrong. OSs (or rather, system libraries) are rarely written in assembler anyhow today - that went out of style 25 years ago. You wouldn't even be able to benefit from any such assembler instruction in a decently modern OS that's written in C or C++. – tofro Sep 10 '17 at 11:40
  • @Jules "any bit set" or "no bit set" are interesting checks. Bit count is nothing I commonly needed. If there were a need for such an operator, I'd guess any of the high-level languages would have come up with a concept for a Hamming-weight operator during the last 30 years, at least. They haven't (I'm not talking about builtin_popcount here - that's not in the high-level language). Note I'm not saying it's useless - it's just not as important as other instructions. And it's also, at least for low bit count CPUs, extremely easy and fast to implement using a LUT. – tofro Sep 10 '17 at 12:25
  • @Chenmunka Here are two recent "why" questions inviting opinion-based answers: https://retrocomputing.stackexchange.com/q/4706/4025 https://retrocomputing.stackexchange.com/q/1626/4025 – Leo B. Sep 10 '17 at 16:48
  • @tofro Any software implementation of popcount is at least an order of magnitude slower than the hardware instruction. If at least one application is known where this difference is critical (cryptanalysis? - unfortunately, the original link there is dead), it makes sense to include it in the instruction set. – Leo B. Sep 10 '17 at 16:55
  • @Leo it makes sense to include it, assuming you can afford to spare the transistors; that might have been a consideration at least until the late eighties. – Stephen Kitt Sep 10 '17 at 17:59
  • @LeoB. If POPCOUNT would have been a clear marketing distinction and unique selling point, CPU makers would certainly have used that. Apparently it wasn't. – tofro Sep 10 '17 at 18:35
  • @jules Why on earth should an OS on any somewhat modern CPU use single bits for memory management? For one, it adds serious handling overhead, as the smallest item average CPUs can handle without preparation is a byte. But more importantly, the pure 'used' information is not really useful by itself, as at least an additional table with information about properties, owner and the like is needed. A reasonable memory management will prefer data structures organized to speed up access and handling, not primarily to save space. – Raffzahn Sep 10 '17 at 19:19
  • @LeoB. That others have slipped past a certain rule doesn't make this one less opinion-based. Also, at least your first example is clearly about a historic context, not asking for opinion. – Raffzahn Sep 10 '17 at 19:21
  • Could the CISC versus RISC debate of the '80s and early-'90s be the interceding event? RISC that really is RISC wouldn't implement such a thing, and it took a while for thesis and antithesis to boil down into synthesis. – Tommy Sep 10 '17 at 21:15
  • @LeoB. No, it isn't. Hamming weight of a single byte is one single instruction when implemented through a LUT (see the C sketch after these comments). – tofro Sep 10 '17 at 22:32
  • Originally, it may have been a "spandrel" (in the sense of https://en.wikipedia.org/wiki/Spandrel_(biology)). For example in the IBM STRETCH (1960) a count of 1-bits was a by-product of all the logical machine code instructions. It's rumoured that the NSA requested the instruction (vectorised of course) in the early Cray machines specifically for use in cryptography. – alephzero Sep 11 '17 at 02:51
  • @alephzero Exactly; knowing that there is a consumer, adding the instruction to a new architecture would be a logical business decision, especially as it becomes relatively cheaper and cheaper; so why wasn't it added to a MIPS, PA-RISC or a Pentium in the early 1990s, given that it exists in the SPARC architecture? – Leo B. Sep 11 '17 at 05:47
  • Anyone who thinks that modern operating systems don't use bitmaps to track information is sorely mistaken: bitmap.h – Jonathon Reinhart Apr 03 '18 at 12:20
  • @tofro Not true, you'd use a compiler builtin that maps to the assembly instruction. I hear that clang is now even smart enough to take a hand-written popcount loop and convert it to the instruction, perhaps because enough people weren't using the builtin. – user253751 Oct 25 '18 at 04:32
  • @immibis I guess that there are just many ways to write popcount code: a bit loop, a byte+lookup-table loop, the thing that first adds even and odd bits and continues with 4-bit groups, then 8-bit groups, etc. And some of them definitely wouldn't be detected by clang. So it's more of a benchmark optimization. – lvd Jan 05 '19 at 14:14
  • I find it rather strange that nobody mentioned error correction. The Hamming weight of the difference of two codewords gives the number of error bits. – Natural Number Guy Feb 04 '20 at 17:35
  • @NaturalNumberGuy Please re-read the first paragraph of the question carefully. – Leo B. Feb 05 '20 at 17:52
  • Nothing much historical there currently, but for those following along at home, I did come across an article on WikiChip that explains it and has references to (modern) instruction sets that implement it: https://en.wikichip.org/wiki/population_count – natevw Feb 05 '20 at 22:00
  • This is such an insanely specific question. Love it! – Thorkil Værge Mar 15 '23 at 13:32
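
As a concrete illustration of the LUT approach tofro mentions above (an editorial sketch in C, not from any comment; pctab, init_pctab and popcount_buf are made-up names), using the same technique as the assembly examples in Raffzahn's answer below:

    #include <stdint.h>
    #include <stddef.h>

    static uint8_t pctab[256];   /* bit count for every possible byte value */

    static void init_pctab(void)
    {
        for (int i = 0; i < 256; i++)
            pctab[i] = pctab[i / 2] + (i & 1);   /* count(i) = count(i >> 1) + low bit */
    }

    /* Popcount over a buffer: one table lookup per byte. */
    static unsigned popcount_buf(const uint8_t *p, size_t n)
    {
        unsigned count = 0;
        while (n--)
            count += pctab[*p++];
        return count;
    }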

4 Answers

14

The closest thing I've found to an article explaining the history of the POPCOUNT instruction is a cryptography forum discussion. That discussion claims that population count was added to machines at the NSA's request for cryptanalysis. See also The NSA Instruction which discusses various systems with POPCOUNT.

The book "Hacker's Delight" claims that "According to computer folklore, the population count function is important to the National Security Agency" (page 96).

Several sources state that IBM STRETCH supported population count as a by-product of all logical operations. This is technically true but misleading - STRETCH implemented a population count register (All Ones Count / AOC) that would count the ones resulting from a logical operation. So while it didn't have an explicit population count instruction, it had hardware to compute the population count into a register. (See page 10 of the IBM 7030 manual.)

IBM built HARVEST as an extension of STRETCH for the NSA. This machine added a streaming bit count instruction (SCSI) (see HARVEST system manual page H2.40) among other features.

To answer the original question, there are lots of rumors that POPCOUNT appeared in instruction sets when the NSA wanted it, but unsurprisingly there is no authoritative article on this.

Ken Shirriff
  • That doesn't seem like an answer. The question is why there was no POPCOUNT instruction in most of the CPUs, despite its usefulness (whether the NSA wants it or not) and ease of implementation. – Leo B. Jan 05 '19 at 06:41
  • @LeoB - its usefulness does not seem to be a settled conclusion, and that's presumably why it has not shown up in "most CPUs". All that there seems to exist is the folklore that says that cryptanalysis finds it useful, but no substantiation of that. Anecdotes prove nothing, of course, but in a long programming career generally of a system-software implementation nature, including a fair amount of allocation-tracking bitmappery, I don't recall ever once needing to count bits beyond 'all zero or not all zero'. – dave Jan 05 '19 at 13:37
  • @another-dave Many kinds of discrete optimization algorithms may benefit from the POPCOUNT instruction, from chess to electronic design. – Leo B. Jan 05 '19 at 17:05
  • Just having some potential users of an instruction isn't enough to warrant adding it to a processor. For example, MIPS had the rule that a proposed new instruction had to provide a 1% performance gain over a range of applications or it would be rejected. – Ken Shirriff Jan 05 '19 at 17:43
  • @KenShirriff A similarly RISC-y SPARC had the POPC instruction. The fact that people have to implement the popcount function again and again apparently was enough to warrant adding it for the designers of the SPARC architecture. – Leo B. Jan 05 '19 at 22:17
  • During the 1980s and 1990s we were aware of the population count and leading-zero count instructions in the Seymour Cray designs - but, within Cray Research, never had confirmation as to WHY they were in the instruction set, and a separate functional unit in the hardware. We assumed NSA but that wasn't something that could be confirmed or denied. – Edward Barnard Mar 27 '19 at 00:37
10

I would guess the answer is very simple and obvious:

CPU makers don't just implement what they want, what "looks nice in a spec", or what's easily done. They are driven by the market and implement what programmers need and ask for. If a certain instruction would make a clear distinction in the market or create a unique selling point for a new CPU, they would certainly have implemented it. There is, and has definitely been, enough peer pressure in the market to justify that.

But apparently it wasn't and they didn't.

tofro
  • What, then, did the programmers need it for in the '60s, then stop needing for several decades, and now need again? – Leo B. Sep 10 '17 at 20:13
  • Cryptography definitely is an area today. The sixties, I don't know - maybe it turned out it wasn't needed at all and was thus abandoned. – tofro Sep 10 '17 at 20:27
  • @LeoB.: popcnt is very useful with SIMD. e.g. to count matches, do a packed-compare into a bitmask (e.g. x86 pcmpeqb xmm0, xmm1 / pmovmskb eax, xmm0), and popcnt the bitmask. In a SIMD loop that processes a variable amount of data each iteration, pointer += popcnt(something) is often useful to step through an input or output array. e.g. see my answer on filtering an array into a new array ("left packing") using AVX2 + BMI2 – Peter Cordes Dec 15 '17 at 04:24
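
To make the SIMD pattern from the last comment concrete (an editorial sketch, not Peter Cordes's exact code; count_matches is a made-up name, and a GCC/clang-style __builtin_popcount is assumed): a packed byte compare yields a 16-bit mask, and a popcount of that mask counts the matching bytes.

    #include <stdint.h>
    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Count how many bytes of p equal key; n must be a multiple of 16. */
    size_t count_matches(const uint8_t *p, size_t n, uint8_t key)
    {
        __m128i vkey = _mm_set1_epi8((char)key);
        size_t count = 0;
        for (size_t i = 0; i < n; i += 16) {
            __m128i v  = _mm_loadu_si128((const __m128i *)(p + i));
            __m128i eq = _mm_cmpeq_epi8(v, vkey);           /* 0xFF in each matching byte */
            unsigned m = (unsigned)_mm_movemask_epi8(eq);   /* one mask bit per byte */
            count += (size_t)__builtin_popcount(m);         /* popcnt of the 16-bit mask */
        }
        return count;
    }
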
7

I'm not asking people to speculate here; I'm asking for references to existing material about the lack of POPCOUNT in most of the CPUs of the last quarter of the XX century.

It's always a bit hard to find documentation about why things have not been done. But here are maybe a few points to think about:

  • Machines with a Popcount were already a minority early on.

  • Popcount only makes sense for a narrow class of problems.

  • Having a Popcount instruction only makes sense if the problem to be solved already needs the upper end of the available power.

  • At the absolute upper end, there were always machines having it (Cray).

  • Every new class (mini, micro) started out as a most simple implementation, only providing the bare minimum.

  • The vast majority of applications (for those) do not need Popcount.

  • Minis never really reached the upper end of performance that would have made such specialized instructions worthwhile.

  • It wasn't until the late 1980s that massively parallel microprocessor systems were considered for high-performance computing.

  • While some, like SPARC, defined it, only their upper end implementations really had it.

  • Likewise Intel implemented it for the (~2000) Itanium, dedicated to high end computing, while letting another decade pass before adding it to x86.

  • It wasn't until that time (late 2000s) that generic mainstream microprocessor architecture (x86) was really considered viable for high end computing.

  • By the 2000s, microprocessors were quite mature (read: bloated), making the addition of further instructions quite cheap, thus allowing the inclusion of rarely needed ones as well.

  • Even today, the majority of CPUs used (and designed) live quite well without it.

On a more basic, but equally important level:

  • Popcount is a fairly simple integer instruction

  • Popcount can (on most CPUs) easily be synthesised by a few fast instructions (a sketch follows below this list).

  • Performance gain is thus rather small

  • It's way more efficient to first add instructions with more potential gain

Bottom line: no real need, and at the same time easy to implement using basic instructions.
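
As an editorial sketch of that "few fast instructions" point (in C, not part of the original answer; this is the even/odd-bit summation lvd's comment under the question alludes to):

    #include <stdint.h>

    /* SWAR popcount: form 2-bit sums, then 4-bit and 8-bit sums,
       then add the four byte sums via a multiply; roughly a dozen
       simple ALU operations, no branches, no memory accesses. */
    unsigned popcount32_swar(uint32_t x)
    {
        x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit sums */
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit sums */
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
        return (x * 0x01010101u) >> 24;                    /* add the bytes */
    }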

At that point it's also worth remembering that many things look very different in hindsight.


To avoid lengthy arguments about how complicated a Popcount implementation is, let's look at a most primitive 6502 (*1). Here, two basic instructions of 7 cycles total are needed per byte whose bits are to be counted.

        CLC                 ; clear carry for the additions
        LDY  FACC           ; first byte of the value to be counted
        LDA  PCTAB,Y        ; fetch its bit count from the table
        LDY  FACC+1         ; next byte
        ADC  PCTAB,Y        ; add its bit count
        ...

PCTAB   DB   0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,...   ; 256-byte table of per-byte bit counts

On an 8086, it can easily be done with string instructions, handling even lengthy sequences of up to 65,535 bytes at close to full bus speed:

        ; assumes DS:SI -> data, BX -> PCTAB, CX = number of bytes
        MOV   AH,0          ; clear AH so AX holds the table value after XLAT
        MOV   DX,0          ; running bit count
CNTLP:
        LODSB               ; fetch next byte into AL
        XLAT  PCTAB         ; AL = number of 1-bits in that byte
        ADD   DX,AX         ; accumulate
        LOOP  CNTLP         ; repeat for all CX bytes

On a /360, the basic counting can be done with a single TR instruction, while 'reuse' of packed decimal instructions and another TR will speed up the adding substantially.


*1 - Of course, only in the sense of complex instructions - otherwise it will always be the second-best CPU ever, right after the /360.

Raffzahn
  • Thank you for digging that up; to summarize, the early "upper end" computers were also word-based, thus an efficient implementation would be elusive; whereas for computers with byte-addressable data it can easily be done with simple table lookups in a tight loop, until the steadily growing disparity between the CPU speed and the memory latency makes the potential performance gain worth implementing POPCOUNT as an atomic instruction. – Leo B. Feb 04 '20 at 16:11
  • @LeoB. Yes, you're certainly right about the word-based nature of the very early machines. I should have added that as well. Performance-wise, a lookup table and a tight loop are still a good choice, as both will be held in on-chip cache, thus keeping it mostly independent of memory latency - except for the data to be counted, but that would hit a memory-based Popcount as well, wouldn't it? – Raffzahn Feb 04 '20 at 16:23
  • True, but even with L1 cache latencies, for a 64-bit value a loop would take tens of clock cycles; in particular because at least one branch out of 8 iterations of the loop would be mispredicted. – Leo B. Feb 04 '20 at 16:38
  • @LeoB. Oh, no doubt. A specific instruction is (or at least should be) always faster, especially if no additional data is needed (that is, besides the primary data it works on). It's just that the gain isn't as big as it seems (also, the number of clock cycles depends much on how well the loop can be turned into parallel micro-ops), so it again comes down to how prevalent the usage is within an application. If it's the one in a core loop, then it'll make a huge difference, but if it's one of several hundred, it might not show up at all. But I guess at that point we're knee-deep in the opinion swamp. – Raffzahn Feb 04 '20 at 16:47
6

This paper by Simon Lavington on Manchester University computers says that the Ferranti Mark I, the commercialization of the Manchester Mark I, incorporated a "sideways add" instruction at the request of Manchester mathematicians for "their number-theory problems".

dave