20

Google tells me that John McCarthy invented garbage collection, for Lisp in 1959. However, a video on C that I was watching (‘Learn C Programming with Dr. Chuck’, c. 6:40) mentions the lack of a garbage collector and says that when Dennis Ritchie invented C in 1972, computers were decades away from a garbage collector.

Why would a C video say this if Lisp had one in '59?

Toby Speight
Neil Meyer
  • Comments are not for extended discussion; this conversation has been moved to chat. – Chenmunka Jan 21 '23 at 09:01
  • I have reverted the edit. I don’t like it either that the question was based on a misreading of what the source said, but distorting the content of the question is even worse. Better just close it. – user3840170 Jan 21 '23 at 12:32
  • I don't know why you suppressed this edit, @user3840170, because it was a transcription of the video and helped understand the problem without having to watch it. It didn't distort the question at all. By the way, it wasn't a misreading: the text is clear. There are two problems caused by dynamic memory allocation, and cleaning up both can be placed under the term "garbage collection". – Camion Feb 01 '23 at 11:27

4 Answers

34

TL;DR:

  • Garbage collection is not a standalone entity, but part of a storage management system.
  • Only languages having dynamic storage objects need management to handle them.
  • C neither provides a storage management system nor has dynamic storage objects.
  • Languages like LISP or BASIC do need it.
  • Both of those had it years before C was developed.

Garbage collection is never a standalone 'invention'. It is always tightly coupled with the structures it optimizes and is part of the software that manages those structures. Thus it cannot be 'invented' as an entity in itself, but only developed as part of the management of dynamic storage structures. This means that a language which does not provide such memory management has no need for such a provision.

  • LISP, as a language that is constantly creating, manipulating and destroying (lists of) atoms, does need such management and garbage collection as part of it. Thus it had one almost from the start in 1959.
  • BASIC got dynamic strings (and GC) in 1967 with Dartmouth Fourth Edition.
  • C in turn was intended as a very basic, minimalist language, geared toward handling data close to its physical representation. Dynamic storage objects would have been way out of scope. Without them, there was no need for dynamic storage management, and thus of course no need for garbage collection as part of that management either.
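
To make the contrast concrete, here is a minimal sketch (mine, not part of the original answer) of the manual model C offers instead of a collector: the program asks for heap storage explicitly and is itself responsible for returning it, and nothing in the language tracks whether an object is still reachable.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* The programmer requests storage explicitly... */
    char *copy = malloc(32);
    if (copy == NULL)
        return 1;
    strcpy(copy, "no collector involved");
    printf("%s\n", copy);

    /* ...and is responsible for giving it back.  The language does not
       track whether 'copy' is still reachable; omitting this call is
       simply a leak. */
    free(copy);
    return 0;
}
```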
Luke Sawczak
Raffzahn
  • Comments are not for extended discussion; this conversation has been moved to chat. – Chenmunka Jan 21 '23 at 08:59
  • "Only languages having dynamic storage objects management to handle them" << Probably a typo or missing verb in that sentence, I can't make sense of it. – Stef Jan 21 '23 at 10:17
  • BBC BASIC had dynamic strings without garbage collection - reassignment used the original memory block if possible, and good practice was to allocate the maximum anticipated length to each variable at program startup. – grahamj42 Jan 21 '23 at 15:08
  • @grahamj42 Interesting, as this site suggests otherwise, at least starting with V2. In any case, BBC BASIC would be neither the first nor the only one, as there were several BASICs that worked with predefined string allocation, requiring strings to be DIMed. – Raffzahn Jan 21 '23 at 16:59
  • A clarification - dynamic memory does not ipso facto require garbage collection. In simple cases, reference-counting will suffice (a minimal sketch follows after this comment thread). This seems especially easy in languages that don't have user-defined aggregate data types. – dave Jan 21 '23 at 18:00
  • @Raffzahn your link exaggerates a little: if the last string variable on the heap is freed, its memory is also freed - not really garbage collection. – grahamj42 Jan 21 '23 at 21:48
  • @grahamj42: Yeah that's right - it handles the "easy" case where you're freeing the topmost thing on an upward-growing heap. Anything anywhere else in the heap is just discarded (permanently lost). It's a useful optimisation, though, because it handles some cases like doing A$ = A$ + "x" in a loop. – psmears Jan 21 '23 at 22:48
  • Can you elaborate why C doesn't need a GC? There are GC implementations for C, such as: https://www.hboehm.info/gc/ – HolyBlackCat Jan 22 '23 at 07:01
  • @another-dave Reference counting is a form of garbage collection - http://wiki.c2.com/?ReferenceCounting – Aster Jan 22 '23 at 08:54
  • I disagree. I must be a purist. It is 'manual' memory management in the sense that, at the point at which a reference is overwritten, it can be determined instantly whether the previous referent is still referenced. – dave Jan 22 '23 at 14:00
  • Wut? Reference counting definitely is GC - there's nothing "manual" about it in the sense that the programmer does anything. It's got a chapter in the GC handbook. Etc. – davidbak Jan 22 '23 at 17:18
  • BTW, this thread is a good example of why the SE "chat" feature is useless and why it is annoying that moderators here on Retrocomputing - where we use comments a lot, with a local culture that allows discussion in comments - move comments to chat. Nobody looks in chat to see what was discussed already; nobody goes to chat to add additional discussion. Some of the points in this thread were already made in the comments that were moved to chat - some just a few hours after the earlier ones were moved to chat. – davidbak Jan 22 '23 at 17:20
  • w.r.t. ref-counting: the programmer, during design, must be aware of ref-counting and ensure there are no cycles - either by never creating them, using weak ptrs to avoid them, or by breaking them when necessary. But that's the same with all GC: the programmer must be aware he's in a "managed environment". Just ask any Java programmer working on a high-perf low-latency app. You're always aware of the necessity to reduce memory pressure, even using otherwise inappropriate techniques, e.g., raw arrays (which in Java are second-class citizens) instead of classes, ... 1/2 – davidbak Jan 22 '23 at 17:27
  • ... and custom data structures - including collections - designed specifically to reduce memory pressure in your particular case (which is something you rarely bother with in other languages including C++), and even modifying your design or architecture (sometimes drastically) just to reduce memory pressure as a priority above all other considerations. Yes, ref counting is definitely recognized as a valid form of GC. 2/2 – davidbak Jan 22 '23 at 17:28
  • @HolyBlackCat - C/C++ don't need GC because they fully support (and expect!) manual memory management - there's library support (from the beginning) for it - malloc/realloc/free - and also tool support - e.g., debug versions of malloc/free, etc. You can bolt GC on to the side - BDW GC is the most successful and long-lived example (but there were others, e.g., Ravenbrook, now open source) but the best that can be said of them is that they work (well) if you work within their limitations. – davidbak Jan 22 '23 at 17:39
  • @davidbak: A compiler that knows how to interact with a Java-style GC will be able to efficiently guarantee that race conditions cannot affect memory safety in ways that wouldn't be possible in C++. If one takes the attitude that race conditions should trigger "anything can happen" UB rather than merely yielding on each thread an Unspecified choice between older and newer data which need not be consistent with what other threads see, then such guarantees would be meaningless, but for many tasks such guarantees are useful. They cannot efficiently be "bolted on", however. – supercat Jan 22 '23 at 20:34
  • @supercat - ok, but I don't exactly get what point you're arguing in favor of?? (though to be honest I'm not sure how bad a problem it is - presumably in multithreaded code the author is already providing those memory fences that are required by the language semantics (i.e., memory model) - and then what the GC library needs to provide is hard fences whenever it is about to start screwing around - certainly easier for a stop-the-world collector (maybe only possible for a stop-the-world collector, I dunno) and GC is already expensive relative to manual mem mgmt ...) – davidbak Jan 22 '23 at 22:25
  • @davidbak: Sometimes tasks require the ability to run untrusted code in a sandbox. If one wants to be able to sandbox multi-threaded code, a JVM-style GC can do so at a cost which would often be lower than the memory fences that would otherwise be required on every operation. Further, Java-style memory semantics make it possible for a task running on one thread to provide progress reports to other threads without requiring synchronization in cases where reports don't need to be perfectly up to date. – supercat Jan 22 '23 at 23:12
  • @supercat - ok, but ... I thought we were talking about C (and C++) in this question? and there are by the way a bunch of reservations about the current java memory model and its cost ... I've been reading up on this lately though I don't have references to hand ... – davidbak Jan 22 '23 at 23:14
  • @davidbak: My point was that that there's no way for library code in C++ to implement something like the JVM memory model without the compiler understanding the memory model. – supercat Jan 22 '23 at 23:16
  • @supercat - Boehm wrote a paper quite a while ago explaining the problem with implementing threads in a library. It was as a result of that paper that the current C++ memory model was created - and he did a lot of work to get it to happen! So now the situation is different. Now, assuming the programmer assiduously follows the C++ memory model - using C++ library fences and atomics and such properly - now it is possible to implement GC as a library. To the best of my knowledge. (But the big issue is: It's nearly impossible ... 1/2 – davidbak Jan 22 '23 at 23:22
  • ... for a programmer to program correctly in the face of concurrency without extensive tool help and/or being extremely careful to avoid all problems!) (Really the best way to avoid problems is to totally limit your concurrency to message passing via properly written concurrent queues.) (Plus GC as a library requires the programmer to also not hide pointers, add extra tag bits to unused bits in pointers, point to the middle of a memory allocation w/o pointing also to its head, etc. etc.) 2/2 – davidbak Jan 22 '23 at 23:25
  • @davidbak: Many platforms would allow languages to, at no cost in straightforwardly processed code, offer memory semantics stronger than would be cheaply available in C++. Code written to exploit platform semantics could be more efficient than would be possible in portable C++, at the expense of being difficult to handle efficiently on other platforms. The JVM semantic model is a closer fit than the C++ model for the semantics that most platforms would be able to offer. C was designed to be more efficient than other languages in cases where a pointer into the middle of an allocation... – supercat Jan 22 '23 at 23:34
  • ...could serve the combined purposes of a base pointer and an offset. Situations where using pointers in such fashion would usefully improve efficiency are far less common than in the 1970s and 1980s, especially in programs which would be required to behave in tolerably useless fashion even when given malicious inputs. – supercat Jan 22 '23 at 23:38
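
To make the reference-counting part of the discussion above concrete, here is a minimal hand-rolled sketch in C. It is purely illustrative: the names refcounted_t, rc_new, rc_retain and rc_release are invented for this example and do not come from any real library. The property under debate is visible in rc_release: an object is reclaimed the instant its last reference is dropped, with no separate collection pass.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted buffer, invented for this sketch. */
typedef struct {
    size_t refs;      /* how many owners currently hold this object */
    char   data[64];
} refcounted_t;

static refcounted_t *rc_new(const char *text)
{
    refcounted_t *obj = malloc(sizeof *obj);
    if (obj) {
        obj->refs = 1;                            /* the creator holds one reference */
        strncpy(obj->data, text, sizeof obj->data - 1);
        obj->data[sizeof obj->data - 1] = '\0';
    }
    return obj;
}

static refcounted_t *rc_retain(refcounted_t *obj)
{
    obj->refs++;                                  /* a new owner appears */
    return obj;
}

static void rc_release(refcounted_t *obj)
{
    if (--obj->refs == 0)                         /* last owner gone: reclaim at once, */
        free(obj);                                /* no separate collection pass       */
}
```

Cycles are the classic weakness mentioned in the comments above: two objects that retain each other never reach a count of zero, which is why weak pointers or explicit cycle breaking are needed in such designs.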
24

Part of the problem here is the definition of "garbage collection". This is what the video says:

The more difficult problem [than forgetting to free memory] is that after a series of calls to malloc() and free() the heap space becomes fragmented and some cleanup is needed. This cleanup is called "garbage collection". Efficient memory allocation and garbage collection have been the subject of decades of computer science research. The Java language has built a number of increasingly effective garbage collection approaches over the years.

Kernighan and Ritchie in one simple paragraph define most of the problem as out of scope for the C language. Which makes it a bit challenging for us to make good use of dynamic memory allocation in C - but when we do it properly - it performs very well.

If you are using a language like Java, Python, or PHP, every time you create a new string through concatenation without thinking about memory allocation, remember to appreciate the decades of work by computer scientists that made it easy for you. Kernighan and Ritchie knew "garbage collection" was difficult. So they left it out of the C language and put it into a run-time library.

I think this is a mischaracterisation of what garbage collection is. It's not just the problem of defragmenting the heap; it's also the problem of making sure that memory that is no longer used is returned to the heap. It's true that there have been decades of research into making both of these problems more tractable and making garbage collection faster, but it is not as if there weren't adequate solutions at the time C was invented, and it's not as if the defragmentation issue was a new problem even then.

tl;dr I think the video is wrong.
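
To separate the two problems the quoted passage runs together, here is a small illustrative C fragment of my own (error checking omitted); it is not from the video or from K&R. The first function produces garbage in the strict sense: memory that is still allocated but no longer reachable, which is what a garbage collector reclaims. The second leaves plenty of free memory but chops it into scattered holes; that is fragmentation, a layout problem inside the allocator, and no reachability analysis is involved.

```c
#include <stdlib.h>

void leak_example(void)
{
    for (int i = 0; i < 1000; i++) {
        char *p = malloc(256);
        (void)p;
        /* The pointer goes out of scope without free(): each block is now
           unreachable garbage.  A collector would reclaim it; in plain C
           it stays lost until the process exits. */
    }
}

void fragmentation_example(void)
{
    enum { N = 1000 };
    char *blocks[N];

    /* Fill the heap with equal-sized blocks... */
    for (int i = 0; i < N; i++)
        blocks[i] = malloc(128);

    /* ...then free every other one.  Half the memory is free again, but it
       is split into many small holes, so a single large malloc() may still
       fail.  None of the surviving blocks is garbage; this is purely a
       layout problem, distinct from reclaiming unused memory. */
    for (int i = 0; i < N; i += 2)
        free(blocks[i]);
}
```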

JeremyP
  • Thanks for putting in the effort to actually listen to the video and transcribe it. You're right: the video is on the wrong track - possibly the guy who wrote it was confused. In fact there are plenty of heap algorithms that "defragment" the heap by coalescing returned items (using various techniques). (Up to a point, since they can't move allocated things around.) "Buddy allocation" was definitely known when C was invented, probably others as well. – davidbak Jan 20 '23 at 15:27
  • The video seems to be just a recording of someone reading from https://www.cc4e.com/book/chap00.md (the password is 42; yay for gratuitously breaking the Wayback Machine!) – user3840170 Jan 20 '23 at 15:27
  • Certainly for nearly 30 years the gnu libc implementation of malloc and free have been defragmenting by coalescing and allocating smartly. The difficulty presented by C is at the sweep stage: C doesn't mandate how references are stored or flag cells as containing pointers or type except very loosely. There is also a very common possibility of uninitialised junk (especially historically). This means you can never really know in C if something is a reference or if you have all the references. With well-behaved programmers you can use Boehm, etc, but it's not perfect. – Dan Jan 21 '23 at 02:09
  • Come to that, coalescing blocks in the free-memory pool is not rocket science (a toy sketch follows after these comments). RSX-11M managed its kernel-mode free-memory pool like that (coalesce on deallocate) in the mid-1970s. But without compaction, it does not actually prevent fragmentation, though it helps a lot. – dave Jan 21 '23 at 14:20
  • I think garbage collection is not about defragmenting the heap at all. Garbage collection is what it says on the tin: collecting (freeing) garbage (unreachable objects). Many garbage collection methods also defragment the surviving objects, but many don't (e.g. refcounting GC, the Boehm GC, and CPython's cycle collector) and they still count as garbage collection. – benrg Jan 21 '23 at 16:49
  • @benrg - are you implying the 'expert' in the video isn't? ;-) – dave Jan 21 '23 at 18:01
  • @benrg: The term still embodies multiple distinct concepts, some of which involve determining for each object whether it is still needed, and some of which involve identifying areas of memory that don't contain any live objects, without knowing or caring what was there previously. Defragmentation allows the latter kind to be greatly simplified. If all live objects that had been in a region of memory can be consolidated, then any part of the region outside that consolidated chunk can be blindly recycled without regard for any live objects that might be inside, because there won't be any. – supercat Jan 21 '23 at 19:08
  • @davidbak It didn't take much effort. The narrator was just reading the document that was displayed on the screen. I simply stopped the video when it was showing the text above and copy-pasted the text. Recent versions of macOS can pull text from pretty much any image displayed on the screen - even a paused youtube video. – JeremyP Jan 23 '23 at 10:25
  • @benrg There's always a defrag phase. It may be hidden, e.g. if a reference counting algorithm used malloc and free, the underlying C library defrag algorithm is used. – JeremyP Jan 23 '23 at 10:43
  • @JeremyP I think the quoted text is talking about moving the live objects (as I was). Free-list maintenance is definitely not part of garbage collection. If the GC compacts/copies the live objects then the issue never arises, and if it doesn't then the background heap maintenance is unrelated to the GC process and no one calls it GC. – benrg Jan 23 '23 at 20:04
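
As the comments above note, coalescing adjacent free blocks is old, well-understood allocator technology rather than garbage collection. Here is a toy sketch of the idea (all names invented; this is not how glibc malloc is actually written): blocks live back to back in one arena, each preceded by a small header, and releasing a block merges it with the physically following block if that one is also free.

```c
#include <stddef.h>

/* Toy arena layout, purely illustrative.  Each block starts with this
   header and is followed by 'size' bytes of payload. */
typedef struct block {
    size_t size;   /* payload bytes in this block        */
    int    free;   /* nonzero if the block is not in use */
} block_t;

#define ARENA_SIZE 4096
static unsigned char arena[ARENA_SIZE];

/* Header of the block physically following 'b', or NULL at the arena end. */
static block_t *next_block(block_t *b)
{
    unsigned char *p = (unsigned char *)(b + 1) + b->size;
    return (p < arena + ARENA_SIZE) ? (block_t *)p : NULL;
}

/* Release a block and, if its physical neighbour is also free, merge the
   two into one larger hole.  This fights fragmentation without moving any
   live data, so it is neither compaction nor garbage collection. */
static void toy_free(block_t *b)
{
    b->free = 1;
    block_t *n = next_block(b);
    if (n && n->free)
        b->size += sizeof(block_t) + n->size;
}
```

A real allocator also merges with the preceding block (e.g. via boundary tags) and keeps holes on free lists, but as noted above, none of this moves live objects, so the worst cases of fragmentation remain.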
0

It is possible, and in fact not particularly hard, to have a fully conformant C implementation that has a defragmentable heap. Many implementations are possible, and some were even supported by widely available consumer hardware.

On x86, 16-bit C code built with the large memory model (multiple code and data segments, objects limited to 64 KB in length) could be made to run in 286 protected mode directly.

With changes only to the C runtime library, but not to the compiler itself, the segments could be used as identifiers either for memory arenas (for small objects) or for the objects themselves (for larger objects that each had their own "arena"). The allocator maintained a couple of arenas for "small" objects, and anything that didn't fit in those got its own segment descriptor.

Windows 3.x couldn't quite do it that way, since there was no segment descriptor indirection on the 8086. Thus, the Windows heap required explicit handles and object pinning in place of indirection. In any case, segment descriptor cache misses and churn were expensive enough that it was better that Windows did it this way back then, as the performance hit would in some cases have been substantial.

If you could trade off some performance for not running out of heap due to fragmentation, then using segment descriptors as an indirection layer to allow moving objects around in the heap was a reasonable idea. It also facilitated paging infrequently used objects to disk, even on the 286. There was a small software shop somewhere that made a substitute runtime for Borland products (Pascal and C) that did memory allocation this way on the 286, paging included. It was a niche product as far as I remember, and I have no recollection of what it was called, who made it, etc. :(
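
To show the indirection idea in portable terms, here is a minimal sketch of my own in plain C; it is not the 286 selector mechanism or the real Windows 3.x API, and all names (handle_t, h_alloc, h_deref, h_resize) are invented. The program only ever holds handles, i.e. indices into a table of master pointers owned by the allocator, so the allocator can relocate a block and merely update its table entry while every handle stays valid - the same trade of a little speed for freedom from fragmentation described above.

```c
#include <stdlib.h>

#define MAX_HANDLES 64
static void *master[MAX_HANDLES];   /* allocator-owned master pointers */

typedef int handle_t;               /* programs hold these, never raw addresses */

static handle_t h_alloc(size_t size)
{
    for (handle_t h = 0; h < MAX_HANDLES; h++) {
        if (master[h] == NULL) {
            master[h] = malloc(size);
            return master[h] ? h : -1;
        }
    }
    return -1;                      /* handle table full */
}

/* Dereference just in time; the raw pointer must not be cached across any
   call that might move blocks around (the rule that handle-based systems
   enforced with pinning/locking APIs). */
static void *h_deref(handle_t h)
{
    return master[h];
}

/* The underlying block may move (here realloc is allowed to relocate it);
   the handle stays valid because it names the table slot, not the address. */
static int h_resize(handle_t h, size_t new_size)
{
    void *p = realloc(master[h], new_size);
    if (p == NULL)
        return -1;
    master[h] = p;
    return 0;
}
```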

  • This is true, but I don't think it answers the question, since everything you mentioned is from the 80s. – benrg Jan 21 '23 at 16:50
0

This is not an answer, but complementary information:

There are two different concepts which get called "garbage collection":

  • The one we are used to in languages like Java, meaning the automatic release of unreferenced dynamically allocated memory. This was invented because people tend to forget to release memory manually, and because in very complex programs it can be hard to figure out what is no longer used.
  • The defragmentation of the heap, which was already needed in languages like BASIC that had no explicit dynamic allocation but used resizable objects.

The point is that fragmentation occurred with manual dynamic allocation too, which led the designers of the classic Mac OS to use what they called handles instead of pointers: what an allocation returned was not the address of a block of memory, but the address of a system-managed master pointer. That way, the system was able to defragment memory by moving the data without the program having to know about it.

Camion
  • Does anyone apart from the author of that book refer to heap defragmentation alone as ‘garbage collection’? – user3840170 Jan 25 '23 at 19:16
  • I don't know what book you're talking about, but for reference: in the old "Locomotive Basic" used in the Amstrad CPC464/664/6128, heavy use of strings could trigger a process which was called garbage collection in the documentation (it was the first time in my life I had met the term), and it could freeze the machine for a very long time (more than 10-20 minutes). – Camion Feb 01 '23 at 10:56