Writing a simple pool allocator in C

8dcc.github.io

298 points

8dcc

5 days ago


143 comments

jll29 2 days ago

This code is very beautifully written, thanks for sharing.

However, you should consult the book "C: Interfaces and Implementations" by Dave Hanson, which has a library called "Arena" that is very similar to your "Pool"s, and it shows a few more tricks to make the code safer and easier to read (Chapters 5+6, 6 in particular).

D. Hanson's book (written in the 'literary programming' style invented by Donald Knuth) also demonstrates having debugging and production implementations for the same C API to catch memory errors, and his book is from 1997:

  > Copyright © 1997 by David R. Hanson
(He used to be an SE professor at Princeton before getting hired by Microsoft, I believe.)
  • cb321 2 days ago

    Kernighan & Ritchie's "The C Programming Language" also has an example implementation of malloc() & free() { for any-size requests as @kragen notes in sibling } in it in like 3 pages (though, sure, it is far from the most efficient implementation). Since it's the last section of the main text, I'd guess that most readers don't get that far (even though it's a short book).

    • IgorPartola a day ago

      Even if you don’t program in C, I think K&R should be required reading for anyone who writes software for a living. You can finish it in an afternoon, or two days if you actually write and run the code outlined in the book.

      My bucket list includes an item to write a novel efficient implementation of malloc/free. I have taken several stabs at this in the past but have never been happy with it. I have no real need of a custom allocator but it is something that I think would be very cool to have under my belt. This was inspired by K&R.

  • kragen 2 days ago

    I think you mean "literate programming".

    The arena allocator in Hanson's chapter 6 is an arena allocator, not a "pool allocator" in the sense that 8dcc means, which is to say, a constant-time fixed-size-block allocator. It seems that you had the same reading comprehension problem that I did.

    The difference is that Hanson's arena allocator can make allocations of any size (his Arena_alloc takes an nbytes argument) and cannot deallocate individual allocations, only the entire arena (his Arena_free takes only an arena pointer as a parameter). A fixed-size-block allocator like the one we are discussing here can only make allocations of a fixed size, but can recycle them individually, not only all at once.
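
    To put the two interfaces side by side, a signature-level sketch (abbreviated from the descriptions above; Hanson's real Arena_alloc also takes file/line arguments):

      #include <stddef.h>

      typedef struct Arena Arena; /* Hanson: any size, deallocation all-or-nothing */
      void *Arena_alloc(Arena *a, size_t nbytes);
      void  Arena_free(Arena *a);            /* frees every allocation at once */

      typedef struct Pool Pool;   /* 8dcc: fixed chunk size, chunks recycled individually */
      void *pool_alloc(Pool *p);
      void  pool_free(Pool *p, void *chunk);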

    (If Hanson's name sounds familiar, it might be because you've read the lcc book.)

    It may be worth noting that Donald Knuth also invented the font used in 8dcc's article.

    • actionfromafar 2 days ago

      Honestly, "literary programming" better explains how Knuth writes code.

      • kragen 2 days ago

        Yes, but that name lacks the benefit of implicitly describing other approaches to programming as "illiterate".

        • kevin_thibedeau 2 days ago

          Reading the Metafont source is pretty challenging because of all the transclusions. There's a reason why programs aren't written like journal papers.

          • kragen 2 days ago

            I think we can probably do better than METAFONT with the benefit of hindsight. Unit tests would help with comprehensibility. Including example outputs, too, like Darius Bacon's "halp", or R-Markdown, Jupyter, doctest, and org-mode can do. Higher-level languages than Pascal would probably be an improvement. Maybe building up the program as a series of versions, like Kartik Agaram's idea of "layers"; I did something similar but less complete in my literate-programming system "handaxeweb". We can do decent typography on a computer monitor now, so in the worst case, we can attempt to leverage interactivity, as so-called "explorable explanations" do.

            Both programs and journal papers are mostly written the way they are because of tradition, not because it's the best way we've found to write them. Two examples:

            Nobody would intentionally design a programming language to make it impossible to include the kind of diagrams that illustrate this article, but most programming languages are in fact designed in such a way.

            Buckminster-Fuller-style "ventilated prose" turns out to be more readable than traditional paragraph formatting in psychology experiments, which is why we use it universally in our programs. But you can't publish a paper formatted that way unless you call it a poem.

        • actionfromafar 2 days ago

          Haha, never thought about that!

          • kragen 2 days ago

            Knuth did.

  • 8dcc 2 days ago

    Thank you so much! I will definitely check it out. I also planned on writing an article about arena allocators soon.

  • johnisgood 2 days ago

    My minor nitpick regarding the code is the choice of variable names, but other than that, it is indeed easy to read and it is how I implement data structures and whatnot.

    I like that there is a more or less exhaustive blog post about it which happens to look nice as well.

  • msla a day ago

    Aside from the fact it persistently confuses linear time and constant time, it is good.

    • 8dcc a day ago

      You are right, sorry about that. I fixed it on the most recent commit (73526e0).

kazinator 2 days ago

If you write your own allocator in C, do yourself a favor and use the valgrind API inside it. Its use can be conditional so you can "compile it out". Or should I say it has to be, so you can build the code in an environment where there's no valgrind API.
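
For instance, a minimal sketch using memcheck's mempool client requests (the POOL_VALGRIND guard macro is my invention):

  #ifdef POOL_VALGRIND
  #include <valgrind/valgrind.h>
  #define POOL_CREATE(pool)          VALGRIND_CREATE_MEMPOOL(pool, 0, 0)
  #define POOL_ALLOC(pool, addr, sz) VALGRIND_MEMPOOL_ALLOC(pool, addr, sz)
  #define POOL_FREE(pool, addr)      VALGRIND_MEMPOOL_FREE(pool, addr)
  #else /* no valgrind API in this environment: requests compile out */
  #define POOL_CREATE(pool)          ((void)0)
  #define POOL_ALLOC(pool, addr, sz) ((void)0)
  #define POOL_FREE(pool, addr)      ((void)0)
  #endif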

cb321 2 days ago

One aspect people don't tend to mention (and TFA does not because it has "general purpose" as a scope) about doing your own allocation arenas is that because a "pointer" is just the name (really a number) of an allocated region, this also affords flexibility in the width of those numbers aka "pointer size" that decides your "address" space.

So, for example, if you are willing to limit yourself to 64 KiNodes of equal sized objects, a pool like https://github.com/c-blake/bst/blob/main/POOL.c can use an unsigned short 2-byte pointer. This small address space specialization can give a full 4x space overhead optimization over modern 64-bit pointers.

You could even use 8x less with 1-byte pointers if 256 nodes is enough or use bit fields or equivalents for even more fine-grained address space size selection. Admittedly, linked structures are usually less interesting for very small address space sizes, but this is sometimes useful. People may complain about arbitrary limits that relate to this kind of optimization, but the linked part could be "small by design" without compromising overall scale { such as for structure within a B-tree node or other multi-level indexing setups }.

In any event, I think it is useful to highlight pointer size arbitrariness to junior engineers. It is often learned once & immediately forgotten behind an onslaught of general purpose-ness (& pointer type systems/syntax). Relocation ideas also relate. E.g., if all the pointers are implicitly indices into a (small?) array, and all the code needs a "base address" of that array to get a VMem address, then you can relocate the whole bundle of objects pretty easily changing only the base address.
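
For instance, a minimal sketch of such a 64 KiNode pool with 2-byte pointers (my names, not the linked POOL.c):

  #include <stdint.h>

  #define NIL 0xFFFFu                 /* the "null pointer" of this address space */

  typedef struct { int32_t key; uint16_t left, right; } Node;

  static Node     nodes[65535];       /* the whole address space: 0..65534 */
  static uint16_t free_head = NIL;

  static void pool_init(void) {       /* thread every node onto the free list */
      uint16_t i;
      for (i = 0; i + 1 < 65535; i++)
          nodes[i].left = i + 1;
      nodes[65534].left = NIL;
      free_head = 0;
  }

  static uint16_t node_alloc(void) {  /* O(1) pop of the free-list head */
      uint16_t n = free_head;
      if (n != NIL)
          free_head = nodes[n].left;
      return n;
  }

  static void node_free(uint16_t n) { /* O(1) push back onto the free list */
      nodes[n].left = free_head;
      free_head = n;
  }

Client code then links nodes via uint16_t "pointers" (left/right above), a 4x saving per link versus 64-bit pointers.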

  • kps a day ago

    ‘Handles are the better pointers’¹ is a related writeup that has been discussed here before².

    ¹ https://floooh.github.io/2018/06/17/handles-vs-pointers.html

    ² https://news.ycombinator.com/item?id=36419739

    • IgorPartola a day ago

      I wish I had caught the original discussions. The moral of the story is that raw pointers carry too little metadata and offer no place to do the bounds checking and type checking that would make them safer to handle (pun intended). Managing things as “objects” is more robust.

      Do notice that this means runtime bounds checking. The writeup also mentions the problem of a handle referencing an object that’s no longer there or that has changed, and that the probability of this happening rises with time. This pattern works great in single-thread/single-process code but starts decaying hard when you have preemptive concurrency. There you really cannot safely do much of anything without some sort of locking or copy-on-write semantics (which is also the case with a lot of other systems; I'm just pointing out that this isn’t solved magically by handles, though the additional metadata can help detect the use-after-free and handle-now-refers-to-a-different-object problems).

      So the age-old objection that bounds checking slows you down but is necessary to detect problems at runtime applies here. You won’t get branch-free code this way compared to straight pointer references, but on the other hand you get a lot closer to the concepts of owners and borrowers in straight C, which is cool.
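
      For flavor, a tiny sketch of the generation-counter scheme from the writeup (all names made up; 10-bit index, 22-bit generation):

        #include <stdint.h>

        #define SLOTS 1000u            /* fewer than the 1024 the index field allows */

        typedef uint32_t Handle;
        typedef struct { uint32_t gen; /* ...payload fields... */ } Slot;
        static Slot slots[SLOTS];

        static Handle handle_of(uint32_t idx) {
            return ((slots[idx].gen & 0x3FFFFFu) << 10) | idx;
        }

        /* every dereference gets a runtime bounds check and staleness check */
        static Slot *lookup(Handle h) {
            uint32_t idx = h & 0x3FFu;
            if (idx >= SLOTS)
                return NULL;                               /* out of bounds */
            if ((h >> 10) != (slots[idx].gen & 0x3FFFFFu))
                return NULL;                               /* stale handle  */
            return &slots[idx];
        }

        static void slot_free(Handle h) {
            Slot *s = lookup(h);
            if (s) s->gen++;   /* bump generation: old handles now fail lookup */
        }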

      • cb321 a day ago

        Those are good insights and I've probably written way too many words here already, but I wanted to amplify that your "age old" applies to the whole related "BUNDLE OF PRIMORDIAL QUESTIONS:", i.e. "What is a pointer (or name), anyway? and what do we want from it? and how do we want it? at compile/run time?" It's honestly a sub-part of Martin Fowler's famous Two Hard Things in CS joke, although I think that is more commonly (and shallowly) interpreted to be about simple identifier bike-shedding these days. There is probably some good blog post "What's in a Reference?" already written somewhere.

        Different answers have different trade-offs in different settings and are variously (un)popular (e.g., x86 memory segments, LBA disk addressing, static/dynamic name scoping in PLs, "virtual" memory itself, ...). Besides the semantic/run-time/safety aspects of "referencing" you mention there are the syntactic ones I mentioned over in https://news.ycombinator.com/item?id=42644009 . Because it is so hard to get one-size-fits-all answers, I think a lot of user-defined layering makes sense, but because of all the hazards you mention, systems support for safety & perf is often really ambitious which makes the user-definition harder (for some/all programmers in the collaboration).

        As another attempted witticism in this direction: "Some Assembly Required" is the general context for almost all engineering, but "how much 'some' is off-putting" and "Wait, assembly by WHO?" are big open questions that crystallize around various axes in various human communities. Lately, my one-liner is "Delegation affords so much, but trust sure is tricky!".

  • 8dcc 2 days ago

    Yes, this is true. I didn't use that in the code because I thought it would make it less readable, though. I might mention it somewhere, thank you.

    • cb321 2 days ago

      In many prog.lang's, it absolutely makes things less readable. This is not an accident, but because the PL itself specifically tries to syntactically support pointer types (i.e. to literally aid readability). Most PLs do so only for general purpose VMem pointers with user-defined pointers being afterthoughts (at best). You probably need (at least!) operator overloading to get the same level of readability as "PL-native" pointers { though "readability" is a bit subjective and powerful PL features can be divisive; Scare "quotes" for interpretation hazard :-) }.

      Anyway, you probably knew all that & I don't mean to suggest otherwise, but it made sense to write since HN loves PL talk.

8dcc 2 days ago

Hello, I am the author. Thank you all for the instructive comments; I made some changes to the article since I first posted it:

- Added a link to this HN post.

- Renamed some variables and types in the code for readability.

- Mention (in a footnote) that we could allocate the 'Pool' structure and the chunk arrays with a single call, or even return the 'Pool' by value on the stack. Suggested by 'liontwist' and 'unwind'.

- Mention (in a footnote) that we don't always need to store the full address when building the linked list of free chunks. Suggested by 'cb321'.

- Added valgrind support (also to my 'libpool' project). Suggested by 'kazinator'.

  • cb321 2 days ago

    FWIW, while it is true that a smaller pointer lets you have smaller minimum fixed size objects in your pool, as your new footnote suggests, the space concern I was mostly alluding to in [1] is all the references in your client code { such as the BST code I linked to where each binary search tree (BST) node has 2 pointers (or 3 if they have parent pointers to give BST nodes "doubly linked list-like autonomy") }.

    This can really add up. For example, like 20 years ago, I did a `print sizeof` in gdb for some g++ STL map node with a data payload of like 4 bytes and it was some obscene size, approximately >= 1 order of magnitude bigger than necessary (like 80 bytes instead of, say, 8 with 2-byte ptrs). While main memory is cheap, CPU cache memory is not, and for "linky" structures it often pays to be more cache-friendly. It's pretty easy to imagine a 10X smaller thing fitting into an L2 CPU cache where the larger thing would not even fit in L3, netting you a big latency boost (though the more root-ward levels of the tree likely always stay in caches in a hot loop).

    EDIT: Anyway, you don't have to believe me. Others did an x32 ABI for x86_64 mostly to have "merely 2X" smaller pointers: https://en.wikipedia.org/wiki/X32_ABI

    [1] https://news.ycombinator.com/item?id=42643410

  • tandr a day ago

    As a last code-snippet readability suggestion: maybe extract the "malloc" and "free" calls into your own simple wrappers (like my_malloc and my_free), and do the "valgrinding" inside them? You can also pass additional params to these functions to specify the type of allocation if needed, or make separate functions.

    • 8dcc a day ago

      I am not sure what you mean, exactly. There are some other valgrind macros that need to be called for pointers other than the ones returned by 'malloc'. More importantly, the memory regions need to be marked as 'NOACCESS' after we are done using them from the pool allocator, or valgrind will complain; that's why I set them back to 'DEFINED' before writing to them in other functions.

      This is my first time using valgrind, however, so I might just be confused about what you mean, and there is a simpler way of doing all this.

  • veltas 2 days ago

    Also, the font isn't incredibly easy to read on Windows. I'm assuming Georgia was selected? It has been on Windows for years but renders so badly that it looks like the font color isn't black, even though I think it is. This isn't really your fault, but I would have stuck to Times New Roman for Windows' sake.

teo_zero 5 hours ago

What I find weird is that the chunk's size is fixed, while the caller must specify how many chunks a pool contains at pool creation. I would do exactly the dual: the chunk's size should be defined at pool creation, so that you can create multiple pools, each dedicated to one specific kind of object. (You need a third field in struct Pool, though.) Instead, the number of chunks per pool should not be something the user needs to care about: if they knew in advance how many objects are required, they would simply allocate an array big enough! The library implementation should define a sensible number and stick to it.

Similarly, I don't like exposing pool_expand(): too much burden on the user. Expansion should be automatically triggered by pool_alloc() whenever the current pool is exhausted.

This would also allow a much simpler pool_new() that just initializes its pointers to NULL, leaving it to the first invocation of pool_alloc() to actually do the first allocation. This would avoid the duplication of code between pool_new() and pool_expand().

Another benefit of fixing the number of chunks is a possible simplification of ArrayStart: this could include a proper array instead of a pointer, avoiding a malloc(). Something like this:

  struct ArrayStart {
    struct ArrayStart *next;
    Chunk arr[CHUNKS_PER_ARRAY];
  };
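
An auto-expanding pool_alloc() could then be a sketch like this (pool_expand() is assumed to allocate one new ArrayStart and thread its chunks onto the free list):

  void *pool_alloc(struct Pool *pool) {
      Chunk *c;
      if (pool->free_chunk == NULL && !pool_expand(pool))
          return NULL;               /* exhausted and expansion failed */
      c = pool->free_chunk;          /* O(1): pop the free-list head   */
      pool->free_chunk = c->next;
      return c;
  }
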
exDM69 2 days ago

Here's another interesting O(1) memory allocator, but with arbitrary-sized allocations and low fragmentation. The negative side is a relatively high memory overhead (a few dozen bytes per allocation).

This kind of allocator is often used to suballocate GPU memory in game and graphics applications.

I'm using a variant of this algorithm with added support for shrinkable and aligned allocations and flexible bin sizing.

You can also extend this idea to two dimensions to create texture atlases, which is possible in O(1) for power of two sized allocations.

Original: https://github.com/sebbbi/OffsetAllocator Rust port: https://crates.io/crates/offset-allocator

jhallenworld a day ago

A sometimes useful thing is to treat the pool as a stack and provide a call to free all recently allocated items up to a certain index. So you make a bunch of deep subroutine calls that use the pool allocator, and then on return free them all. It's like a secondary automatic storage class.

Also sometimes useful to provide another call to promote an automatic item to a more permanent one so that it is excluded from being freed like this.
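
A minimal sketch of that mark/release discipline (names mine; the free-list half of the pool is omitted):

  typedef struct { int data[4]; } Obj;   /* some fixed-size object */

  typedef struct {
      Obj    objs[4096];
      size_t top;                        /* the "certain index" */
  } StackPool;

  static size_t pool_mark(StackPool *p) { return p->top; }

  static Obj *pool_alloc(StackPool *p) { /* plain bump allocation */
      return p->top < 4096 ? &p->objs[p->top++] : NULL;
  }

  /* on return from the deep calls: free everything allocated since the mark */
  static void pool_release(StackPool *p, size_t mark) { p->top = mark; }

Promoting an item under this scheme would mean copying it below the mark (or into a separate pool) and handing back its new address.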

liontwist 2 days ago

The pool struct is two pointers, why are you allocating it with malloc?

  • unwind 2 days ago

    Next question: why are there two calls to `malloc()`, one for the Pool structure and one for the Chunks?

    It is trivial to co-allocate them which removes the risk of memory fragmentation and is just generally a good idea if you're chasing performance.

    The answer might be "for clarity/simplicity" which I guess is fine in an informative article, but it could at least be highlighted in the text which I didn't see.
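
    For illustration, a co-allocating pool_new() might look like this sketch (C99 flexible array member plus C11 max_align_t; names are guesses at TFA's API):

      #include <stddef.h>
      #include <stdlib.h>

      typedef union Chunk { union Chunk *next; max_align_t align; } Chunk;

      typedef struct Pool {
          Chunk *free_chunk;  /* head of the free list            */
          Chunk  chunks[];    /* lives in the same malloc() block */
      } Pool;

      static Pool *pool_new(size_t nchunks) {
          Pool *p = malloc(sizeof *p + nchunks * sizeof(Chunk));
          size_t i;
          if (p == NULL || nchunks == 0) {
              free(p);        /* free(NULL) is a no-op */
              return NULL;
          }
          for (i = 0; i + 1 < nchunks; i++)   /* thread the free list */
              p->chunks[i].next = &p->chunks[i + 1];
          p->chunks[nchunks - 1].next = NULL;
          p->free_chunk = p->chunks;
          return p;
      }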

    • kevin_thibedeau 2 days ago

      He's targeting ANSI C, C89, and flexible array members are not officially supported until C99. But 26 years later, it's time to stop clinging to outdated standards and crusty tooling. Even Linux had to read the room and adopt C11.

      A C11 implementation could go one step further and use _Alignas(max_align_t) to keep the pool array aligned with no manual effort. The double allocation does this implicitly.
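
      For example, a static pool buffer could be declared like this sketch (POOL_BYTES is an arbitrary name):

        #include <stddef.h> /* max_align_t */

        #define POOL_BYTES 4096
        static _Alignas(max_align_t) unsigned char pool_mem[POOL_BYTES];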

      • listeria a day ago

        On the topic of alignment, the library (libpool) fails to align chunk_sz to allow storing a pointer when the chunk is in the free_chunk list.

        This issue is sidestepped in TFA by using a union, which ensures appropriate alignment for all its members, but in the library there's nothing which prevents me from asking for a chunk size of, say, 9 bytes, which would store pointers at misaligned addresses when creating the free list.

      • liontwist a day ago

        The pool needs to align every allocation anyway.

    • jdnxbdn 2 days ago

      It is certainly not trivial and would add little to the topic.

    • 8dcc 2 days ago

      I will also take note of this, you are right.

  • o11c 2 days ago

    Some people are allergic to `main() { Foo foo; foo_init(&foo); }`

    Admittedly, sometimes for good reason - it locks in the ABI.

    • kazinator 2 days ago

      Mainly just the size and alignment. If you make the structure oversized, and with strict alignment, it can be future proofed. Old binary clients will provide a structure big enough and aligned enough for the new version.

      The dynamic constructor API can allocate the structure exactly sized without the padding.

    • MathMonkeyMan 2 days ago

      Is that why some structs have "char reserved[16]" data members? I think I saw this in NGINX, though that might have been to allow module compatibility between the open source offering and the paid version.

      • o11c 2 days ago

        Yes, that slightly improves ABI flexibility, though the fact that callers can still access fields limits that.

        An alternative is to make the only member `char opaque[16]` (IIRC some locale-related thing in glibc does this, also I think libpng went through a transition of some sort related to this?), but calculating the size can be difficult outside of trivial cases since you can't use sizeof/offsetof.

    • liontwist 2 days ago

      ABI? Better make it an anonymous struct too.

      It’s a performance-oriented custom memory allocator.

  • lyu07282 2 days ago

    Couldn't this also just syscall mmap directly? I mean, malloc itself is a memory allocator; it feels a bit strange to use it in an allocator implementation, but perhaps I'm missing something.

accelbred 2 days ago

Note that using this will likely violate strict aliasing due to using the void pointer returned as one type, freeing it, and then getting that same chunk again and using it as another type. You'll probably want to compile with -fno-strict-aliasing to be safe.

A good reference on strict aliasing: https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...

  • alextingle 2 days ago

    There's no reason an allocator should ever trip over strict aliasing rules.

    malloc() returns an address. The user code writes a value, and then reads a value of the same type. No problem.

    Later, malloc() returns the same address. The user code writes a value of another type then reads a value of that same type. Again, no problem. There's no aliasing going on here.

    The act of writing a value of a different type tells the compiler that the lifetime of the previous object has ended. There's no special magic required.

    Strict aliasing is about situations where we write a value of one type, and then attempt to read a value of a different type from the same location. If you want to do that, then you have to be extremely cautious. But memory allocators don't do that, so it's not an issue.

    (Obviously just talking about C here, or POD in C++ terms. If you are dealing with C++ objects with destructors, then you have an extra layer of complexity to deal with, but again that's nothing to do with aliasing.)

    • clairebyte 2 days ago

      >The act of writing a value of a different type tells the compiler that the lifetime of the previous object has ended.

      afaik only memcpy has that magic property, so I think parent is almost correct.

        void *p = malloc(n);
        *(int *)p = 42; // ok, *p is now an int.
        //*(float *)p = 3.14f; // I think this is not allowed, p points to an int object, regular stores do not change effective type
        float x = 3.14f;
        memcpy(p, &x, sizeof(float)); // but this is fine, *p now has effective type float
      
      So in pool_new:

        pool->chunk_arr[i].next = &pool->chunk_arr[i + 1];
      
      This sets the effective type of the chunk block to 'Chunk'.

      Later in pool_alloc:

        Chunk* result    = pool->free_chunk;
        ...
        return result;
      
      result has effective type 'Chunk'

      In user code:

        int *x = pool_alloc();
        *x = 42; // aliasing violation, *x has effective type 'Chunk' but tried to access it as an int*
      
      User code would need to look like this:

        int *x = pool_alloc();
        memcpy(x, &(int){0}, sizeof(int)); // establish new effective type as 'int'
        // now we can do
        *x = 42;
      • bigpingo 2 days ago

        And this is why type-based alias analysis (TBAA) is insane and why projects like Linux compile with -fno-strict-aliasing.

        C should issue a defect report and get rid of that nonsense from the standard.

        • astrange 2 days ago

          C doesn't have "alias analysis" in the standard. It has an (informally specified) memory model which has "memory objects" which have a single type, which means treating them as a different type is undefined behavior.

          This enables security analysis like valgrind/ASan and secure hardware like MTE/CHERI so it's very important and you can't get rid of it.

          However, it's not possible to implement malloc() in C because malloc() is defined as returning new "memory objects" and there is no C operation which creates "memory objects" except malloc() itself. So it only works as long as you can't see into the implementation, or if the compiler gives you special forgiveness somehow.

          C++ has such an operation called "placement new", so you want something like that.

          • kevin_thibedeau 2 days ago

            You can definitely implement malloc in C. It does nothing special in its most basic form but cough up void pointers into its own arena.

            It gets complicated when you have virtual memory and an OS involved but even then you can override the system malloc with a simple implementation that allocates from a large static array.

            • astrange a day ago

              No, returning parts of an array does not implement malloc as described in the standard. That's not a new memory object, it's a part of an existing one.

              • kevin_thibedeau a day ago

                The standard is written to accommodate obsolete tagged memory architectures that require special support. They aren't relevant today and data pointers are fungible regardless of where they originate.

                • astrange 18 hours ago

                  > data pointers are fungible regardless of where they originate.

                  This was never true because of something called provenance: https://www.ralfj.de/blog/2020/12/14/provenance.html. Though it usually doesn't matter and I think it annoys anyone who finds out about it.

                  But in practice it's not always true on Apple A12 or later because they support PAC (so pointers of different type to the same address can be not equal bit-wise) and is even less true on very latest Android because it supports the really big gun MTE. And MTE is great; you don't want to miss out on it. No explainer here because there's no Wikipedia article for it(!).

                  Also becomes not true on any system if you use -fbounds-safety or some of the sanitizers.

                • kragen 18 hours ago

                  Morello is.

      • astrange 2 days ago

        There are other issues besides changing the memory type. For instance, C has those rules about out of bounds pointers being undefined, but you can't implement that - if you return part of the pool and someone calculates an out of bounds address they're getting a valid address to the rest of the pool. That's why you can't implement malloc() in C.

        (The difference here is that system malloc() works with valgrind, -fbounds-safety, theoretical secure hardware with bounds checking etc., and this one doesn't.)

        • kragen 18 hours ago

          Undefined behavior is behavior you can't avoid implementing, because no matter what your compiler and runtime do, it complies with the spec. In particular getting valid addresses to other objects from out-of-bounds address arithmetic is not just conformant with the C standard but by far the most common conforming behavior.

          • astrange 18 hours ago

            Meant to say you can't implement it as an invalid/trap state. This is possible in some implementations but they have to cooperate with you to do it.

            > In particular getting valid addresses to other objects from out-of-bounds address arithmetic is not just conformant with the C standard but by far the most common conforming behavior.

            One reason calculating out of bounds addresses might not work out is the calculation might cause the pointer to overflow, and then surprising things might happen like comparisons failing or tag bits in the high bytes getting corrupted.

            • kragen 16 hours ago

              Oh, then I agree. My apologies for interpreting you as saying something so obviously incorrect. Yes, in particular CHERI has a mechanism to shrink the bounds of a pointer, but just returning a pointer into an array won't do it.

      • uecker 2 days ago

        Any type-changing store to allocated storage has this property in C.

      • oguz-ismail 2 days ago

        > aliasing violation, *x has effective type 'Chunk'

        This doesn't make any sense. How do you know its effective type if you don't have access to the definition of `pool_alloc()'?

        • clairebyte 2 days ago

          If you can guarantee it's always compiled in a separate TU and never inlined, sure, that might be a practical way to 'solve' this issue, but if you then do some LTO (or a unity build or something) the compiler might suddenly break your code. Another way is to add an inline asm block with a "memory" clobber to escape the pointer so the optimizer can't destroy your code.

          It's really quite ridiculous that compiler implementers have managed to overzealously nitpick the C standard so that you can't implement a memory allocator in C.

          • astrange 2 days ago

            > It's really quite ridiculous that compiler implementers have managed to overzealously nitpick the C standard so that you can't implement a memory allocator in C.

            This is good because it's also what gives you valgrind and CHERI. Take one away and you can't have the other.

            (Because if you define all undefined behavior, then programs will rely on it and you can't assume it's an error anymore.)

    • gpderetta 2 days ago

      > The act of writing a value of a different type tells the compiler that the lifetime of the previous object has ended. There's no special magic required.

      I think strictly speaking in C this is only true for anonymous memory, i.e. memory you got from some allocator-like function. So if your custom allocator gets its memory from malloc (or sbrk, or mmap), everything is fine, but if, for example, you are allocating from a static char array, it is formally UB.

      In C++ you can use placement new to change the type of anything.

      • uecker 2 days ago

        Yes, type-changing stores are guaranteed by the C standard to work only for allocated storage. In practice no compiler makes a distinction between declared and allocated storage. A bigger problem is that Clang does not implement the C rules as specified.

  • leni536 2 days ago

    That's what allocators do. If C's object model doesn't allow users of the language to write their own allocators then that object model is broken.

    C++ has relatively recent fixes to allow allocators to work; they require calls to std::launder.

    • dwattttt 2 days ago

      I understand the C standard hides such actions behind a wall of "this is the allocator", and expects the compiler authors to also be the allocator authors, allowing them to know when/how they can break such rules (in the context of their own compiler).

      • unwind 2 days ago

        No, allocators are not magic (in that regard) in C. There is nothing out of the ordinary going on with an allocator; the parent comment is simply mistaken (as pointed out by another answer).

        • dwattttt 2 days ago

          Ah, I see that it's because the char type never violates Strict Aliasing. I was wondering how you could define a type such as Chunk, yet hand out a pointer to it to a user who will cast it to some other type.

          • SleepyMyroslav 2 days ago

            Well, the pool code fails to use the char* type to write inside the block of memory.

            If you look at the 'pool_free' code you can see that it receives the used block from the user as 'ptr', then casts it to 'Chunk pointer' and writes into a Chunk member of type 'Chunk pointer'. I had to change that line to memcpy when I did it last time, around 15 years ago, in C++ with GCC 4.something. In short, if you are writing your own allocators you need either to disable strict aliasing like Linux does or do the 'memcpy' dance when you access memory as 'internal allocator structures' right after it was used as a user type. When it happened to me, the write was reordered: I observed the code that writes into the Chunk* 'next' executing first and the write of zeros into the user type second.

            • uecker 2 days ago

              Note that C++ has different rules than C. C has type changing stores that do not require memcpy for allocated storage (i.e. no declared type). Older compilers have plenty of bugs related to TBAA though. Newer versions of GCC should be ok.

    • pjmlp 2 days ago

      Which is why pointer provenance is an issue as well.

ch33zer 2 days ago

Some things that would be interesting to add to this:

- thread safety (not too hard to add on top of your linked list)

- variable-sized allocations, ideally something more akin to malloc. Could be done wastefully by rounding up to the nearest chunk size

- (as mentioned in the article) variable chunk sizes

- zeroing of memory before handing it back to users

- double free detection, or even better, full asan support

  • 8dcc 2 days ago

    I don't currently have much experience with thread safety, but it's something I should look into. Thank you.

9029 2 days ago

Another (admittedly less portable) way to solve the resizing problem is to reserve a large virtual memory space using Linux mmap with the PROT_NONE flag and then commit pages as needed using mprotect with PROT_READ|PROT_WRITE flags. I believe Windows' VirtualAlloc/VirtualProtect also allows this.
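
Something like this sketch (Linux; names mine, error handling minimal):

  #include <stddef.h>
  #include <sys/mman.h>

  #define RESERVE ((size_t)1 << 30)   /* claim 1 GiB of address space up front */

  /* reserve: the range is claimed but no pages are committed yet */
  static void *pool_reserve(void) {
      void *base = mmap(NULL, RESERVE, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return base == MAP_FAILED ? NULL : base;
  }

  /* commit the first `used` bytes as the pool grows; addresses never move */
  static int pool_commit(void *base, size_t used) {
      return mprotect(base, used, PROT_READ | PROT_WRITE);
  }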

  • jeffbee 2 days ago

    You don't need to do anything with the flags. You can just mmap a huge virtual area and the pages will be realized whenever you first touch them.

    • 9029 2 days ago

      Today I learned, thanks

2rsf 2 days ago

This brings back old memories: I implemented something similar on a Motorola 68360 ages ago. Since the size of the buffer I used was not huge, I skipped the pointers and simply enumerated each chunk; it was remarkably efficient and stable.

harrison_clarke a day ago

you can avoid resizing and moving the pools if you allocate massive pools up front. most OSs let you overcommit memory, and most programs using memory pools only have a handful of them

(on windows, you'd need to handle this with VirtualAlloc. osx and linux let you overcommit with malloc, and handle it with a copy-on-write page fault handler)

guardian5x 2 days ago

The Latin Modern font on this website does not look so good in Chrome. Is it because of the font size, or is it just me?

  • dmonitor 2 days ago

    Sounds like a Windows issue based on other people in the thread. I'm on Firefox with the same issue. Removing Latin-modern from the CSS file made it much more readable for me.

  • johnisgood 2 days ago

    I was going to say this. Looks awful to me as well. It is the font-family.

sylware 2 days ago

It all depends on the load profile using the allocator. You never know; that said, in theory you cannot beat a semantically specialized allocator for a specific load... in other words, the right way(TM). This means applications should bundle their own allocators for the various load types they have. The "generic" allocator is sort of a heresy for the lazy, the short-termists, or those who are in a hurry. Don't worry, I still use such a generic allocator sometimes, but often I mmap the memory myself and deal with it directly.

  • astrange 2 days ago

    The system allocator is the one that comes with all the system's memory debugging tools and security (or lack of it), so you'd better not mind losing those if you want to ditch it.

matheusmoreira 2 days ago

This is really informative. I wonder if there is a better solution for the reallocation problem, one that preserves pointers.

  • kragen 18 hours ago

    You can keep the raw pointers inside your manager object for a particular object type and only hand out some kind of handle that requires an indirection step, such as an array index.

    • matheusmoreira 12 hours ago

      I've seen that design before. Base and offset easily allow relocation of data in memory.

      https://www.man7.org/linux/man-pages/man2/mremap.2.html

        If the mapping is relocated, then absolute
        pointers into the old mapping location become invalid
        (offsets relative to the starting address of the mapping
        should be employed).
      
      I've always assumed that forcing an indirection causes a noticeable performance impact. I mean, everyone uses absolute pointers by default, right?
      • kragen 11 hours ago

        There are a lot of variations on it:

        - Unix file descriptors are small-integer array indices into, normally, arrays in kernel space.

        - Many Smalltalk implementations, especially on older machines, named objects in memory with small-integer indices into a global object table (typically shifted in some way to accommodate tagged SmallIntegers) which was implemented as an array of pointers to the real object locations.

        - Functions exported from DLLs on Windows can be identified by "ordinals" that index a table of all exported functions instead of by name.

        - PalmOS, Win16, and Pre-X MacOS identified allocations of memory from the operating system by "handles" which were, again, indices into a table of pointers, the master pointer block. To access the actual memory you would call HLock() (MacOS name) and index the table (yourself, on MacOS, or using the return value from the lock function on PalmOS or Win16) to get the physical address, which could change if the handle was unlocked.

        - Forth maintained an in-memory buffer cache of 1024-byte disk blocks, which it identified by ordinal number rather than, say, cylinder/head/sector tuples or inode+offset tuples. To get the memory address of block 53, you would say 53 block (Forth for "block(53)") and freely index into those 1024 bytes, which were guaranteed to remain resident until the next call to block or until you yielded the CPU with another I/O operation. If you modified the data there, you would call update to tell Forth the buffer was modified. If block 53 was not currently in memory, Forth would pause your task until it had loaded it, running other tasks meanwhile. If there was no free block buffer, Forth would free one up, by writing out a modified buffer to disk if necessary. This provided a sort of virtual memory in about 100 lines of code without requiring memory-mapping hardware. (The "BLOCK I/O" part of F83's KERNEL86.FS is 8 16-line screens long, but a fair bit of that is actually dealing with MS-DOS's file I/O functions.)

        - Because standard Pascal had no string type and no way to handle variable-length arrays of any kind, TeX uses indices into a string table as its string type. It inserts newly encountered strings in the string table.

        - Symbols in some Lisps.

        - Pretty much anything in any Fortran program. See http://www.the-adam.com/adam/rantrave/st02.pdf on how to program Fortran in any language.

        How noticeable the performance impact is depends a lot on how often you have to traverse the extra level of indirection. (Doing a base+offset addition before each memory access could be a significant performance issue on things like a Commodore 64 or a TRS-80 but basically doesn't matter on 8086 or better hardware.)

kragen 2 days ago

This is a very nice article! The diagrams in particular are very clear. (They seem to have been done in draw.io: https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al... which seems to be a pretty decent Apache-2.0 free software licensed diagram editor: https://github.com/jgraph/drawio/)

I think it would be improved further if it began with the calling interface callers use to invoke the allocator, because it's easier to understand the implementation of some service such as pool allocation when you begin by knowing what service, exactly, is required. I began with the assumption that the author was describing an arena allocator under another name, rather than a fixed-size block allocator, both because the APR's influential arena allocator is called "apr_pool_t" and because one of the main headings in the table of contents (a greatly appreciated feature, like all the typography of the article) was "Closing the pool". It took me quite a while to figure out that I was mistaken. It's plausible that I only had this problem because I am an unusually stupid, ignorant, or pigheaded person (some would say all three), but, if not, many other people will have the same difficulty I had of taking several minutes to figure out what the topic of the article is. (I could have saved myself by clicking either of the first two links in the Introduction, but I didn't because I thought I already knew what it was.)

The current top-voted comment on this thread is by someone who had the same reading comprehension problem I did, but never realized their error: https://news.ycombinator.com/item?id=42643951 and nobody had pointed out their mistake until I did just now. So I think we can say with some confidence that many HN readers will have the same difficulty.

I think that the article would also be improved by a few other small additions:

- It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.

- Probably when one benchmarks the implementation given, one will find that it is only slightly faster than jemalloc or even gnumalloc, because they both dispatch small allocations to a fixed-block allocator similar to this one, and the dispatch isn't very expensive. This can be fixed by inlining the allocation fast path and the deallocation function, which will then compile down to a few instructions each. A simple demonstration of how to do this is in my arena allocator kmregion in http://canonical.org/~kragen/sw/dev3/kmregion.h and http://canonical.org/~kragen/sw/dev3/kmregion.c (GPLv3+). The current implementation of 8dcc libpool is guaranteed to be unable to do this unless you're using link-time optimization or a so-called "unity" or "amalgamation" build process.

- It's worth distinguishing between average-case performance (in which this allocator is slightly better than malloc) and worst-case performance (in which case malloc, generally speaking, isn't even in the running).

- It may be worth mentioning that often such pool allocators are used in conjunction with initializing the objects in the pool to a valid state when the pool is created; that way you don't have to reinitialize objects to a valid state every time you deallocate and reallocate them, which is usually more work than malloc does to find the right free list. This does require not overwriting the user data with your freelist pointer, so it's a space/time tradeoff.

- Maybe you should mention alignment, especially now that even on amd64 GCC has taken to using new instructions that require alignment.

Even as it is, though, it's highly educational, visually pleasant, and very understandable (once I got over my initial misconception of having a totally wrong idea of what the article was about).

  • 8dcc 2 days ago

    > The diagrams in particular are very clear. (They seem to have been done in draw.io: https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al... which seems to be a pretty decent Apache-2.0 free software licensed diagram editor: https://github.com/jgraph/drawio/)

    Yes, this is true. Furthermore, the diagram information for draw.io is included in the SVG images themselves, so one could edit them there.

    > It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.

    You are right, I was referring to "malloc-like" allocators (i.e. for variable-size blocks). It would be a good idea to benchmark them.

    • kragen 2 days ago

      In general when you're working on an optimization (and this is an optimization) benchmarking it is essential. About a third of the optimizations I try end up making my code slower instead of faster, often because of performance considerations I hadn't thought of. You can't just say "it's much faster" without measuring.

      • cb321 2 days ago

        Modern CPUs (with all their buffers & caches & predictor warm-ups & spin-ups) can make benchmarking really tricky unless you have a very specific workload in mind and even then it requires care (like looking at both the edge of[1] and full timing distributions[2] for best-case, average/median/typical case, worst case analysis, etc.). Briefly, the number of ways to get (un)lucky has proliferated over the decades, as has the variety of deployment platforms.

        Given the depth of that rabbit hole, as writing advice, I might suggest instead motivating via "more space efficient" and refer to exactly fitting fixed-size objects where more general purpose allocators will (usually)¹ waste space to the next power of two. Measuring less dynamic space is also easier (very dynamic space like %cache used can be tricky, though). Controlling scope and pedagogy in general is also tricky, but I think this rhetorical advice is common practice and might also double as a way to address your initial misread of the topic. EDIT: he can always do refinements / more analysis in follow up posts/something if this library has any legs for him.

        [1] https://github.com/c-blake/bu/blob/main/doc/tim.md

        [2] https://github.com/c-blake/bu/blob/main/doc/edplot.md

        ¹ Sure, fancy any-size allocators can tack on a histogram of maybe somewhat rounded lg sizes (to economize histo space) and spawn fixed-size object sub-arenas if there are enough of them, say >= enough to fill a VMem page or 2, but this is rare in my experience and also complicates the dispatch to subarenas that you mention above to slightly fancier search. This kind of topic rapidly spirals into allocator-complete discussion and even systems-complete - e.g. a filesystem is largely an allocator with good old MS-DOS FAT being close to a "file is a pool of disk blocks" model.

        • kragen 2 days ago

          This is an excellent comment. Thank you.

      • 8dcc 2 days ago

        Yes, you are right. Do you have any suggestions on how I should do the benchmarking? I am not sure how this is usually done, but if the output is some kind of graph, I would also like to know how these graphs are usually produced.

        • kragen 2 days ago

          I'm no expert on this, so probably someone else can give you better advice, but I'm happy to share what I do know. Here are the most important points.

          Benchmarking is a full-fledged scientific experiment. You have a hypothesis that a particular change that you can make, such as using a different allocator, will have a particular effect, such as making your program run in less time. So you make just that one single change and measure the results, while carefully controlling all other conditions that could also affect your program's run time. You have to do the test more than once in order to find out whether you've succeeded in controlling them. If you do a good enough job controlling the conditions of the experiment, it becomes reproducible and therefore falsifiable. Programming usually isn't science, but benchmarking is. A great thing about it is that each run of the experiment can take a fraction of a second, so you can do the whole scientific method from beginning to end in a few minutes and get a reproducible, falsifiable result.

          Benchmarking can get arbitrarily sophisticated, but the most important thing is to dodge the bullets that can turn benchmarks into nonsense, rather than produce very sophisticated graphs or get very high precision.

          The simplest thing on Linux or macOS is to compile a command-line executable you can run in a terminal and run

              time ./testprogram
          
          at least three times to get an idea of how long the wallclock time is and how variable it is. (The user and system times shown are not very reliable anymore.) Presumably there is also some way to do the same thing on Microsoft Windows. If you're on a laptop, check to see if it makes a difference if you're plugged in. Also, check whether the result from

              time ./testprogram; time ./testprogram
          
          is consistent; sometimes lags in CPU speed scaling can produce unpredictable results.

          Linux is not very consistent, so you should expect variations in the range of 1–10% from run to run when your program takes on the order of a second to run.

          If you get consistently shorter wallclock times after recompiling the program with a different allocator and the same compiler options (probably make sure optimization isn't completely turned off), you can probably attribute that difference to the different allocator. It's a good idea to switch back to the original allocator and verify that you can reproduce your original results to make sure you didn't accidentally change something else at the same time (maybe your laptop went into low-power mode because the battery is getting low, or maybe you changed the number of iterations in a loop and forgot you changed it, or maybe Firefox loaded a YouTube video in the background and is making the machine swap, that kind of thing).

          It's useful to ensure that you aren't swapping and that your CPU load other than the benchmark is minimal. `top` (or, better, `htop`) and `vmstat 5` (or, better, `dstat 5`) can be helpful here. Unless your program is heavily multithreaded, the CPU load is not as crucial as it used to be now that we all have multicore processors.

          It's helpful to check the particular version of the program you're benchmarking into Git, including the Makefile or whatever specifies the compiler options. That way when you write down your results you can refer to the specific Git commit hash. Also, record your compiler version, your operating system version, and your CPU. Ideally someone who isn't you should be able to read your notes on the benchmark, run it again, and get the same results, if they have access to the same hardware.

          The actual benchmark program itself can be as simple as allocating and deallocating in a loop, ideally inside another loop to run the whole benchmark multiple times; but on processors with cache (which includes every PC since the early 90s) the working set size can be very important, and you may get significantly different results for allocating 20 kilobytes 64 bytes at a time and 20 gigabytes 64 bytes at a time. If you pass in the number of iterations for one of the loops on the command line, instead of hardcoding it in the program code, it's easy to start with a small iteration count and work your way up a factor of 10 at a time until the program takes a few seconds to run. In this case, also be sure to record the specific command line you were measuring the performance of in your notes; knowing that a program took 2.205–2.216 seconds to run is of no value if you don't know if it was running a thousand iterations or a million.

          It's a good idea to actually compute something that you output using the memory you're allocating and deallocating, just in case GCC's optimizer decides that your benchmark loop has no effect and therefore can be safely removed from your program. (In http://canonical.org/~kragen/sw/dev3/kmregion_example.c I computed the sum of the data value stored in each node of the 24th of all the 5000-node linked lists I allocated and deallocated. That's because, when I read the disassembly of my test code with `objdump -d kmregion_example`, I found out that GCC had removed most of the code inside the test loop because it could prove that nobody ever read the struct fields I was initializing. Reading disassembly is neither necessary nor sufficient for dodging all bullets that invalidate benchmarks.)
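
          For concreteness, the whole benchmark can be as small as this sketch (the pool calls to swap in are hypothetical; iteration counts come from the command line as suggested above):

              #include <stdio.h>
              #include <stdlib.h>

              int main(int argc, char **argv)
              {
                  long outer, inner, sum = 0, i, j;
                  if (argc < 3)
                      return 1;
                  outer = atol(argv[1]);
                  inner = atol(argv[2]);
                  for (i = 0; i < outer; i++)
                      for (j = 0; j < inner; j++) {
                          int *p = malloc(sizeof *p); /* swap in pool_alloc() here */
                          *p = (int)j;
                          sum += *p;                  /* actually use the memory */
                          free(p);                    /* ...and pool_free() here */
                      }
                  printf("%ld\n", sum); /* observable output defeats dead-code elimination */
                  return 0;
              }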

          Varying the number of iterations in both the innermost and outermost loop should produce roughly proportional increases and decreases in wallclock time. (A helpful technique is to subtract the time for 1000 iterations from the time for 1001000 iterations to get the time for the last million iterations.) If it doesn't, you're not measuring what you think you're measuring. For example, especially with JIT compilers, you may be measuring startup overhead. That bit me this week with http://canonical.org/~kragen/sw/dev3/hadamardsin.js, which I erroneously reported was slower to run than the LuaJIT version. It turns out that it's slightly faster to run, just much slower to compile.

          Running the benchmark on more than one computer is helpful because sometimes changes that speed things up on one system slow them down on another. I remember a recursive backtracking search program for logic circuit optimization I wrote in C long ago which used longjmp() to terminate the search when it found a solution; it ran reasonably fast on Linux but very slowly on my friend's Macintosh because MacOS implemented setjmp() as sigsetjmp(), transforming it from a very cheap operation into a very expensive one.

          That's all you need to know to reliably do optimizations. In fact, that's probably more than you need to know. Everything below this point is extra.

          — ⁂ —

          Allocators are especially tricky to benchmark because often, by changing where data is stored in memory, they profoundly affect the speed of the rest of your program, in ways that vary from program to program. More memory fragmentation, for example, can increase how many TLB misses your program runs into. I don't know what to do about that except try to benchmark some real programs as well as simple nested loops.

          In cases where you're trying to measure an effect that's roughly the size of your measurement precision, statistical tests can be useful. https://github.com/c-blake/bu/blob/main/doc/edplot.md linked by cb321 above discusses using the 2-sample Kolmogorov-Smirnov test, which is not nearly as terrifying as it sounds. But most effects will be either much larger than your measurement error, in which case the result is obvious enough that you don't need the statistics, or much smaller, in which case all they'll tell you is that you can't be sure about the direction of the effect, much less its magnitude. In medicine, where your experiments take years and people die in them, using sophisticated statistics to make the most of your precious data is very important. In performance benchmarking, where your experiments take milliseconds, it's usually better to improve your measurement precision instead. Sometimes just running the same benchmark more times is enough, but there are more powerful techniques available.

          A way to get higher precision than `time` can give you is to run the program under `valgrind` (on Linux). Even raw valgrind will tell you exactly how many instructions it ran, and in most cases that number is the same from run to run, so instead of a measurement with ±3% precision like `time` usually gives you when you've controlled everything properly, you get a measurement typically with ±0.00000001% precision. This is fantastic, far beyond most things you can do in a physics, electronics, or chemistry lab. It means that even with just two runs you can tell if your optimization affected that number by even 0.0000001%, or, more interestingly, 0.1%. And it isn't affected by things like which CPU you run it on or whether your laptop is unplugged, and it only slows the program down about 10×.

          Unfortunately, it measures the wrong thing, because a program that runs more instructions can still be faster. `valgrind --tool=cachegrind` tries to simulate your CPU's cache hierarchy to give you a better idea of how long the program will actually take; the `kcachegrind` GUI can give you a "cycle estimation" number based on the numbers it spits out. But it can still be unrelated to how fast the program actually runs for a variety of reasons; system calls are the biggest one. `kcachegrind` is also a "profiler": it shows you how much time (simulated instructions, simulated cache misses, cycle estimation, whatever) is used by each subroutine in your program.

          You might wonder why an optimization that speeds your program up by 0.1% matters. The answer is that if you do 106 such optimizations you have sped your program up by 10%. SQLite has improved its performance by more than a factor of 3 over a 10-year period by using this approach: https://www.sqlite.org/cpu.html

          There's a more basic profiler called gprof built into GCC; building your program with `-pg` will add profiling instrumentation in to it, so that it writes profiling information into a file called `gmon.out`. After running a program built with gprof, you can run `gprof testprogram` to analyze `gmon.out`; it will tell you how much time each subroutine took (including and excluding its transitive callees) and how many times it was called.

          Linux also has a built-in profiler called `perf_events` https://perfwiki.github.io/main/ which uses hardware performance counters to get numbers like cachegrind's, but using the real hardware instead of simulation, and including a lot more detail; it can also profile the time your program spends in system calls. I've only used it in very basic ways, so I don't know very much about it. Intel also has a proprietary thing called VTune which can do things perf can't and runs on OSes that aren't Linux, but doesn't support AMD's processors.

          — ⁂ —

          I hope this is of some value to you! I'd be happy to answer any other questions. Hopefully correctly.

          • 8dcc 2 days ago
            2 more

            Wow. This is probably the best comment I have ever seen on HN. Thank you so much for the information, I will add benchmarking to my 'libpool' project and mention it in the article. I can't think of any questions right now, but I will let you know if anything comes up. Thank you again. :)

            • kragen 2 days ago

              Aww, thanks! *blush*

          • 8dcc 2 days ago
            6 more

            Actually, I have one question. You said:

            > The actual benchmark program itself can be as simple as allocating and deallocating in a loop [...]

            What do you mean, exactly? Deallocating immediately after allocating? I feel like there would be a big difference between that and allocating multiple blocks before freeing them.

            • cb321 2 days ago
              4 more

              You are right - there will be a big difference between those two workloads. As I am 100% sure @kragen already knows, this is arguably the first, smallest step toward realizing just how many possible workloads you might want to model :-) and why senior engineers will say pithy things like "the best benchmark is your actual application".

              While that is true (and I've said it myself), one reason I like the design of a little "workload interpreter shell" as in https://github.com/c-blake/bst/blob/main/test/bst-interact.c (or maybe better https://github.com/c-blake/adix/blob/master/tests/btshell.ni... which has a preprocessor) is that you can use it to run "isolated tests" as a kind of "half step" between a pure hot-loop dumb thing and a full-on application with all its attendant noise and self-interaction/self-disturbance.

              So, for example, in this allocation problem setting you could write a similar 30-line "shell/interpreter" that interprets a binary stream of "commands" or allocation requests. The origin of said binary stream could be a recording/log from some actual app of yours doing something actually useful. Then you could replay that one recording in the "shell" under different configurations, as @kragen mentions, to study the allocator's performance (both space & speed) in isolation. At that point you could start to say pretty meaningful things, from replaying logs many times, about this or that tradeoff on this|that workload.
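
              As a sketch, such a replayer can be tiny. The `Cmd` record layout, `MAX_LIVE`, and the use of plain malloc/free as stand-ins are all made up for illustration here; you'd substitute the pool allocator's own calls:

                  #include <stdint.h>
                  #include <stdio.h>
                  #include <stdlib.h>

                  /* One record of a hypothetical binary trace: op 0 = alloc, op 1 = free;
                     'id' names the allocation so that frees can refer back to it. */
                  typedef struct { uint32_t op, id; } Cmd;

                  enum { MAX_LIVE = 1 << 20 };

                  int main(void) {
                      void **live = calloc(MAX_LIVE, sizeof *live);
                      if (!live) return 1;
                      Cmd c;
                      while (fread(&c, sizeof c, 1, stdin) == 1) {
                          if (c.id >= MAX_LIVE) continue;   /* skip malformed records */
                          if (c.op == 0)
                              live[c.id] = malloc(64);      /* stand-in: your pool's alloc */
                          else {
                              free(live[c.id]);             /* stand-in: your pool's free */
                              live[c.id] = NULL;
                          }
                      }
                      free(live);
                      return 0;
                  }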

              EDIT: Of course, at that point you start to get into debates about "how representative" different workloads are, but at least you could potentially share your log with someone who had a different computer and have them run benchmarks, too, and at least the workload is not synthetic but relates to some actual possible program. Moving from synthetic benchmarks to log-based ones was a big thing in OS research on filesystems in the 1970s & 80s and it helped the "generalizability of results" (I think, anyway..I'm sure many still do only synthetic benchmarking due to its convenience).

              EDIT2: I just thought of a cute name for the log replay half-step between synthetic microbenchmarks and a full application - "The millibenchmark". ;-) Maybe someone else has thought of this before, though.

              • kragen 2 days ago
                3 more

                The "interpreter shell" is an interesting idea!

                Recording traces of allocation requests from actual applications for allocator-benchmarking purposes revolutionized allocator performance research in the late 90s, if I recall correctly, in addition to the earlier filesystem research you mentioned. Benchmarks based on those traces were the bedrock of Zorn's case that generational GCs were faster in practice than malloc().

                But I think (though possibly this is motivated reasoning) that benchmarking some workload is good enough for many optimizations. A fixed-size-block allocator is already pretty restricted in what workloads you can run on it, after all, and its performance characteristics are a lot simpler than something like first-fit. Showing that an optimization makes some workload faster is infinitely better than not knowing if it makes any workload faster. Showing that it generalizes across many workloads is better, of course, but it's not nearly the same step in utility.

                • cb321 2 days ago
                  2 more

                  Well, you can use it for everything. I've used it for in-memory hash tables, trees, etc. To be clear, for those who may not understand, the "interpretation" in these "millibenchmarks" for something as simple as an allocator is not like Python or Bash or something slow. It could literally be "mmap a binary file of integers and either alloc/free the binary numbers". So, yeah, there might be some CPU pipeline overhead in unpredictable branches and you might still want to try to measure that, but beyond that there is basically no work, and any real workload might have one-ish hard-to-predict branch leading to allocator operations. So, even unadjusted for overhead, it's not so bad.

                  And it keeps your measurements fast, so you can get statistically many of them even if there is an "inner min loop" to try to isolate the best case or something. (Well, at least if what you are measuring is fast.) The downside/tradeoff is just lack of fidelity to interaction effects in the full application: resource competition, branch predictor disruption, etc.

                  You could probably set up a https://github.com/c-blake/nio file for it (or some other format) if you wanted to be systematic. A problem with the "text-newline" part of the "unix philosophy" is that it incurs a lot of unneeded binary->ascii->binary cycles, which is actually more programming work as well as more CPU work, and it's really fairly orthogonal to the rest of the Unix ideas.

                  EDIT reply to parent EDIT

                  > benchmarking some workload is good enough

                  Fair enough, except that "good enough" is sort of "famous last words" if anyone is reviewing your work. :-) :-) Some will complain you only did Intel not AMD or not ARM or not xyz or Apple Silicon or whatever other zillion variables. Something is always better than nothing, though. It should be thought of more like a Bayesian thing - "no surprise for untested situations" or maybe "confidence in proportion to evidence/experience".

                  FWIW, I thought your big advice to 8dcc was pretty decent.

                  • kragen a day ago

                    Thanks! I agree with all of that except for the last statement, which I am not competent to judge.

            • kragen 2 days ago

              You sure are right about that! The example I linked allocates 5000 blocks and makes a linked list out of them, then deallocates them.
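
              In sketch form (with hypothetical pool_alloc/pool_free names standing in for whatever API the allocator exposes), that benchmark is roughly:

                  #include <stddef.h>

                  /* hypothetical fixed-size-block API; substitute the real names */
                  void *pool_alloc(void);
                  void  pool_free(void *p);

                  struct node { struct node *next; };

                  void bench_once(void) {
                      struct node *head = NULL;
                      for (int i = 0; i < 5000; i++) {   /* allocate 5000 blocks... */
                          struct node *n = pool_alloc();
                          n->next = head;                /* ...linking each one in */
                          head = n;
                      }
                      while (head) {                     /* ...then free them all */
                          struct node *next = head->next;
                          pool_free(head);
                          head = next;
                      }
                  }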

lelandbatey a day ago

It seems like almost all of the complexity of this allocator comes from having to manage the fact that it's implemented on top of the system allocator. I thought the whole point of using a custom allocator was to avoid the system allocator for exactly the problems they mention, e.g. the system-provided realloc() might move your stuff around, and you have to track separate pool starts so you can call the system-provided free() on them.

Like, yeah you have to do that, but I thought the whole point of writing a custom allocator was to not use the system allocator beyond statically allocating a huge block of memory upfront and then managing that yourself, is it not?

  • 8dcc a day ago

    > I thought the whole point of writing a custom allocator was to not use the system allocator beyond statically allocating a huge block of memory upfront and then managing that yourself, is it not?

    Yes, that's what the article is basically about. The other problems you mention only happen when trying to expand an existing pool, which might not be necessary for some people.

openmarkand 2 days ago

[flagged]

  • andrewmcwatters 2 days ago

    Almost no one should be using the MIT or equivalent licenses anymore now that the industry has been overrun by commercial freeloaders. AGPL or GPLv2 and above.

  • oguz-ismail 2 days ago

    > using GPL as the license for a library is kind of limiting

    That's the point of using GPL

  • matheusmoreira 2 days ago

    He should have used AGPLv3. Anyone who doesn't like it should contact him with a business proposal for a proprietary license.

10000truths 2 days ago

I recommend against using linked lists to keep track of the free blocks. It seems to be the data structure that every malloc/free implementation reaches for, and I don't know why - the slowness of pointer-chasing makes it terrible for almost any real-world use case. A balanced tree would be a much better idea, given that all the essential operations would take O(log n) time instead of O(n). Even if one insists on a linear search, a bitset is much more cache-friendly than pointer-chasing, and it can trivially benefit from SIMD optimizations.
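
As an illustration of the bitset approach (a sketch, not any particular allocator's code): one bit per fixed-size slot, a word-at-a-time scan, and a count-trailing-zeros to pick the free slot within a word.

    #include <stdint.h>

    #define NSLOTS 4096
    static uint64_t freemap[NSLOTS / 64];   /* bit set = slot free; set every
                                               bit to 1 at pool creation */

    /* Find and claim a free slot. Returns its index, or -1 if full. */
    static int take_slot(void) {
        for (int w = 0; w < NSLOTS / 64; w++) {
            if (freemap[w]) {
                int bit = __builtin_ctzll(freemap[w]);  /* GCC/Clang builtin */
                freemap[w] &= freemap[w] - 1;           /* clear lowest set bit */
                return w * 64 + bit;
            }
        }
        return -1;
    }

    static void release_slot(int slot) {
        freemap[slot / 64] |= 1ull << (slot % 64);
    }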

  • o11c 2 days ago

    For the `Chunk` list, this isn't one of the cases where linked lists are harmful. Each use only touches the top of the stack, never iterates. Also, linked lists are much easier to make thread-safe.

    For the `LinkedPtr` list, the bad case is only hit during the destruction of the pool, and then only if you allocate a lot of memory. And given the overhead of deallocation I'm not sure the array approach would measurably help.

    I don't see anywhere a binary search tree would be useful here, since there are no lookup loops (on allocation, chunks are pushed in order, but freed chunks are pushed back in arbitrary order; this does mean there's no double-free protection).

    • harrison_clarke a day ago

      why would you need to iterate on destruction? you free the whole pool at once, since you allocate it all at once

  • elchananHaas 2 days ago

    The reason linked lists are used is that, for large enough allocations, there is no overhead: you use the space the application isn't using. In addition, if all allocations are the same size, allocation is O(1); you just look at the head of the list.

    More sophisticated strategies bucket allocations by size, this has fixed overhead. You can also use balanced trees for more memory efficiency, but this is slower.

    For small allocations (say, 8 bytes) that are too small to contain pointers, allocators will allocate a block and use bitsets.
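
    For example, the size-class mapping can be as simple as rounding each request up to the next power of two, one free list per class. A sketch (the class count and granularity here are arbitrary, not from any particular allocator):

        #include <stddef.h>

        #define NCLASSES 16                 /* classes of 8, 16, 32, ... bytes */
        static void *freelist[NCLASSES];    /* one free-list head per class */

        /* Map a request to its class: smallest power of two >= size, min 8. */
        static int class_for(size_t size) {
            size_t s = 8;
            int c = 0;
            while (s < size && c < NCLASSES - 1) {
                s <<= 1;
                c++;
            }
            return c;
        }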

  • JJOmeo 2 days ago

    Pool allocators don't walk the list or search anything though. All interactions are only at the list head and O(1), as all free nodes are just that, free and equal.
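
    That is, both operations touch only the head pointer, something like this sketch of the classic intrusive free list (names illustrative):

        /* Each free block stores its 'next' pointer inside itself. */
        typedef struct FreeNode { struct FreeNode *next; } FreeNode;

        static FreeNode *head;             /* the only state either op touches */

        static void *pool_alloc(void) {    /* pop the head: O(1), no walking */
            FreeNode *n = head;
            if (n) head = n->next;
            return n;
        }

        static void pool_free(void *p) {   /* push onto the head: O(1) */
            FreeNode *n = p;
            n->next = head;
            head = n;
        }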

    • ceving 2 days ago

      Then it might be a misnomer. Calling it "stack" instead of "list" might be better.

      • alextingle 2 days ago
        2 more

        That is a fair comment.

        • cb321 2 days ago

          While the most precise naming might be "a singly linked list representation of a stack", "free list" (https://en.wikipedia.org/wiki/Free_list) is an ancient term of art for this problem.. so much so that wikipedia (at least the version today) even suggests "freelist" all one word as a spelling.

          The initial linking together of all free space (also cache friendly, btw) is often called "threading the free list", although this terminology is less standard than "free list" itself.

  • kazinator 2 days ago

    Suppose the implementation did happen to search through the free blocks. Suppose you put them into an array instead of a linked list. They can't actually be in the array, you see? The blocks are in the pool heap, wherever they happen to be. So the array has to point to them. As you walk through the array, you have to dereference the pointers. The only way it's better is that the loads are not dependent. But now you have this array to manage.

  • MathMonkeyMan 2 days ago

    Which pool operation is made more costly by the linked list? Both allocate and free are constant time, and trivial too.

    The only thing that I can imagine being faster is a bump allocator, but that precludes freeing individual blocks.

  • ben-schaaf 2 days ago

    I don't think you understand how the allocator in the article works. Allocating and freeing are already O(1), creating and closing the allocator are necessarily O(n). There is no search being done here.