| 1 | + | |
| 2 | + | <img align="left" width="100" height="100" src="doc/mimalloc-logo.png"/> |
| 3 | + | |
| 4 | + | [<img align="right" src="https://dev.azure.com/Daan0324/mimalloc/_apis/build/status/microsoft.mimalloc?branchName=dev"/>](https://dev.azure.com/Daan0324/mimalloc/_build?definitionId=1&_a=summary) |
| 5 | + | |
| 6 | + | # mimalloc |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | mimalloc (pronounced "me-malloc") |
| 11 | + | is a general purpose allocator with excellent [performance](#performance) characteristics. |
| 12 | + | It was initially developed by Daan Leijen for the runtime systems of the |
| 13 | + | [Koka](https://koka-lang.github.io) and [Lean](https://github.com/leanprover/lean) languages. |
| 14 | + | |
| 15 | + | Latest release tag: `v2.1.1` (2023-04-03). |
| 16 | + | Latest stable tag: `v1.8.1` (2023-04-03). |
| 17 | + | |
| 18 | + | mimalloc is a drop-in replacement for `malloc` and can be used in other programs |
| 19 | + | without code changes. For example, on dynamically linked ELF-based systems (Linux, BSD, etc.) you can use it as: |
| 20 | + | ``` |
| 21 | + | > LD_PRELOAD=/usr/lib/libmimalloc.so myprogram |
| 22 | + | ``` |
| 23 | + | It also includes a robust way to override the default allocator in [Windows](#override_on_windows). Notable aspects of the design include: |
| 24 | + | |
| 25 | + | - __small and consistent__: the library is about 8k LOC using simple and |
| 26 | + | consistent data structures. This makes it very suitable |
| 27 | + | to integrate and adapt in other projects. For runtime systems it |
| 28 | + | provides hooks for a monotonic _heartbeat_ and deferred freeing (for |
| 29 | + | bounded worst-case times with reference counting). |
| 30 | + | Partly due to its simplicity, mimalloc has been ported to many systems (Windows, macOS, |
| 31 | + | Linux, WASM, various BSDs, Haiku, MUSL, etc.) and has excellent support for dynamic overriding. |
| 32 | + | - __free list sharding__: instead of one big free list (per size class) we have |
| 33 | + | many smaller lists per "mimalloc page" which reduces fragmentation and |
| 34 | + | increases locality -- |
| 35 | + | things that are allocated close in time get allocated close in memory. |
| 36 | + | (A mimalloc page contains blocks of one size class and is usually 64KiB on a 64-bit system). |
| 37 | + | - __free list multi-sharding__: the big idea! Not only do we shard the free list |
| 38 | + | per mimalloc page, but for each page we have multiple free lists. In particular, there |
| 39 | + | is one list for thread-local `free` operations, and another one for concurrent `free` |
| 40 | + | operations. Freeing from another thread can now be a single CAS without needing |
| 41 | + | sophisticated coordination between threads. Since there will be |
| 42 | + | thousands of separate free lists, contention is naturally distributed over the heap, |
| 43 | + | and the chance of contending on a single location will be low -- this is quite |
| 44 | + | similar to randomized algorithms like skip lists where adding |
| 45 | + | a random oracle removes the need for a more complex algorithm. |
| 46 | + | - __eager page reset__: when a "page" becomes empty (with increased chance |
| 47 | + | due to free list sharding) the memory is marked to the OS as unused (reset or decommitted) |
| 48 | + | reducing (real) memory pressure and fragmentation, especially in long running |
| 49 | + | programs. |
| 50 | + | - __secure__: _mimalloc_ can be built in secure mode, adding guard pages, |
| 51 | + | randomized allocation, encrypted free lists, etc. to protect against various |
| 52 | + | heap vulnerabilities. The performance penalty is around 10% on average |
| 53 | + | over our benchmarks. |
| 54 | + | - __first-class heaps__: efficiently create and use multiple heaps to allocate across different regions. |
| 55 | + | A heap can be destroyed at once instead of deallocating each object separately. |
| 56 | + | - __bounded__: it does not suffer from _blowup_ \[1\], has bounded worst-case allocation |
| 57 | + | times (_wcat_) (up to OS primitives), bounded space overhead (~0.2% meta-data, with low |
| 58 | + | internal fragmentation), and has no internal points of contention using only atomic operations. |
| 59 | + | - __fast__: In our benchmarks (see [below](#performance)), |
| 60 | + | _mimalloc_ outperforms other leading allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc.), |
| 61 | + | and often uses less memory. A nice property is that it does consistently well over a wide range |
| 62 | + | of benchmarks. There is also good huge OS page support for larger server programs. |
| 63 | + | |
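The multi-sharding idea above can be sketched in a few lines of C11. This is a minimal illustration, not mimalloc's actual code: the names `page_t`, `page_free_local`, `page_free_remote`, and `page_collect` are hypothetical, and the walk-through runs on a single thread to stay deterministic. It shows one free list for the owning thread, one atomic list for other threads (pushed with a single CAS loop), and an owner-side collection step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A free block stores the link to the next free block in its own memory. */
typedef struct block_s {
  struct block_s* next;
} block_t;

/* Hypothetical per-page state (names are illustrative, not mimalloc's). */
typedef struct page_s {
  block_t* free;                  /* blocks ready for allocation (owner only) */
  block_t* local_free;            /* blocks freed by the owning thread */
  _Atomic(block_t*) thread_free;  /* blocks freed by other threads */
} page_t;

static void page_init(page_t* page) {
  page->free = NULL;
  page->local_free = NULL;
  atomic_init(&page->thread_free, NULL);
}

/* Free from the owning thread: a plain list push, no atomics. */
static void page_free_local(page_t* page, block_t* b) {
  assert(b != NULL);
  b->next = page->local_free;
  page->local_free = b;
}

/* Free from another thread: push onto the shared list with a CAS loop. */
static void page_free_remote(page_t* page, block_t* b) {
  assert(b != NULL);
  block_t* head = atomic_load_explicit(&page->thread_free, memory_order_relaxed);
  do {
    b->next = head;  /* on CAS failure, head is refreshed and we retry */
  } while (!atomic_compare_exchange_weak_explicit(
               &page->thread_free, &head, b,
               memory_order_release, memory_order_relaxed));
}

/* Move every block of `list` onto page->free; returns how many were moved. */
static size_t page_absorb(page_t* page, block_t* list) {
  size_t n = 0;
  while (list != NULL) {
    block_t* next = list->next;
    list->next = page->free;
    page->free = list;
    list = next;
    n++;
  }
  return n;
}

/* Owner-side collection: take the whole remote list with one atomic
   exchange, then merge it and the local list into the allocation list. */
static size_t page_collect(page_t* page) {
  block_t* remote = atomic_exchange_explicit(&page->thread_free, NULL,
                                             memory_order_acquire);
  size_t n = page_absorb(page, remote);
  block_t* local = page->local_free;
  page->local_free = NULL;
  return n + page_absorb(page, local);
}

/* Single-threaded walk-through: two local frees, two "remote" frees. */
size_t sharding_demo(void) {
  static block_t blocks[4];
  page_t page;
  page_init(&page);
  page_free_local(&page, &blocks[0]);
  page_free_local(&page, &blocks[1]);
  page_free_remote(&page, &blocks[2]);
  page_free_remote(&page, &blocks[3]);
  return page_collect(&page);  /* all four blocks are now on page.free */
}
```

Because each page has its own `thread_free` list, two threads only contend when they free into the same page at the same moment, which is the distribution effect described in the list above.
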
| 64 | + | The [documentation](https://microsoft.github.io/mimalloc) gives a full overview of the API. |
| 65 | + | You can read more on the design of _mimalloc_ in the [technical report](https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action) which also has detailed benchmark results. |
| 66 | + | |
| 67 | + | Enjoy! |
| 68 | + | |
| 69 | + | ### Branches |
| 70 | + | |
| 71 | + | * `master`: latest stable release (based on `dev-slice`). |
| 72 | + | * `dev`: development branch for mimalloc v1. Use this branch for submitting PRs. |
| 73 | + | * `dev-slice`: development branch for mimalloc v2. This branch is downstream of `dev`. |
| 74 | + | |
| 75 | + | ### Releases |
| 76 | + | |
| 77 | + | Note: the `v2.x` version has a new algorithm for managing internal mimalloc pages that tends to reduce memory usage |
| 78 | + | and fragmentation compared to mimalloc `v1.x` (especially for large workloads). It should otherwise have similar performance |
| 79 | + | (see [below](#performance)); please report if you observe any significant performance regression. |
| 80 | + | |
| 81 | + | * 2023-04-03, `v1.8.1`, `v2.1.1`: Fixes build issues on some platforms. |
| 82 | + | |
| 83 | + | * 2023-03-29, `v1.8.0`, `v2.1.0`: Improved support for dynamic overriding on Windows 11. Improved tracing precision |
| 84 | + | with [asan](#asan) and [Valgrind](#valgrind), and added Windows event tracing [ETW](#ETW) (contributed by Xinglong He). Created an OS |
| 85 | + | abstraction layer to make it easier to port and separate platform dependent code (in `src/prim`). Fixed C++ STL compilation on older Microsoft C++ compilers, and various small bug fixes. |
| 86 | + | |
| 87 | + | * 2022-12-23, `v1.7.9`, `v2.0.9`: Supports building with [asan](#asan) and improved [Valgrind](#valgrind) support. |
| 88 | + | Support arbitrarily large alignments (in particular for `std::pmr` pools). |
| 89 | + | Added C++ STL allocators attached to a specific heap (thanks @vmarkovtsev). |
| 90 | + | Heap walks now visit all objects (including huge objects). Support Windows nano server containers (by Johannes Schindelin, @dscho). |
| 91 | + | Various small bug fixes. |
| 92 | + | |
| 93 | + | * 2022-11-03, `v1.7.7`, `v2.0.7`: Initial support for [Valgrind](#valgrind) for leak testing and heap block overflow |
| 94 | + | detection. Initial |
| 95 | + | support for attaching heaps to a specific memory area (only in v2). Fix `realloc` behavior for zero size blocks, remove restriction to integral multiple of the alignment in `alloc_align`, improved aligned allocation performance, reduced contention with many threads on few processors (thank you @dposluns!), vs2022 support, support `pkg-config`. |
| 96 | + | |
| 97 | + | * 2022-04-14, `v1.7.6`, `v2.0.6`: fix fallback path for aligned OS allocation on Windows, improve Windows aligned allocation |
| 98 | + | even when compiling with older SDKs, fix dynamic overriding on macOS Monterey, fix MSVC C++ dynamic overriding, fix |
| 99 | + | warnings under Clang 14, improve performance if many OS threads are created and destroyed, fix statistics for large object |
| 100 | + | allocations, using MIMALLOC_VERBOSE=1 has no maximum on the number of error messages, various small fixes. |
| 101 | + | |
| 102 | + | * 2022-02-14, `v1.7.5`, `v2.0.5` (alpha): fix malloc override on |
| 103 | + | Windows 11, fix compilation with musl, potentially reduced |
| 104 | + | committed memory, add `bin/minject` for Windows, |
| 105 | + | improved wasm support, faster aligned allocation, |
| 106 | + | various small fixes. |
| 107 | + | |
| 108 | + | * 2021-11-14, `v1.7.3`, `v2.0.3` (beta): improved WASM support, improved macOS support and performance (including |
| 109 | + | M1), improved performance for v2 for large objects, Python integration improvements, more standard |
| 110 | + | installation directories, various small fixes. |
| 111 | + | |
| 112 | + | * 2021-06-17, `v1.7.2`, `v2.0.2` (beta): support M1, better installation layout on Linux, fix |
| 113 | + | thread_id on Android, prefer 2-6TiB area for aligned allocation to work better on pre-Windows 8, various small fixes. |
| 114 | + | |
| 115 | + | * 2021-04-06, `v1.7.1`, `v2.0.1` (beta): fix bug in arena allocation for huge pages, improved aslr on large allocations, initial M1 support (still experimental). |
| 116 | + | |
| 117 | + | * 2021-01-31, `v2.0.0`: beta release 2.0: new slice algorithm for managing internal mimalloc pages. |
| 118 | + | |
| 119 | + | * 2021-01-31, `v1.7.0`: stable release 1.7: support explicit user provided memory regions, more precise statistics, |
| 120 | + | improve macOS overriding, initial support for Apple M1, improved DragonFly support, faster memcpy on Windows, various small fixes. |
| 121 | + | |
| 122 | + | * [Older release notes](#older-release-notes) |
| 123 | + | |
| 124 | + | Special thanks to: |
| 125 | + | |
| 126 | + | * [David Carlier](https://devnexen.blogspot.com/) (@devnexen) for his many contributions, and making |
| 127 | + | mimalloc work better on many less common operating systems, like Haiku, Dragonfly, etc. |
| 128 | + | * Mary Feofanova (@mary3000), Evgeniy Moiseenko, and Manuel Pöter (@mpoeter) for making mimalloc TSAN checkable, and finding |
| 129 | + | memory model bugs using the [genMC] model checker. |
| 130 | + | * Weipeng Liu (@pongba), Zhuowei Li, Junhua Wang, and Jakub Szymanski, for their early support of mimalloc and deployment |
| 131 | + | at large scale services, leading to many improvements in the mimalloc algorithms for large workloads. |
| 132 | + | * Jason Gibson (@jasongibson) for exhaustive testing on large scale workloads and server environments, and finding complex bugs |
| 133 | + | in (early versions of) `mimalloc`. |
| 134 | + | * Manuel Pöter (@mpoeter) and Sam Gross (@colesbury) for finding an ABA concurrency issue in abandoned segment reclamation. Sam also created the [no GIL](https://github.com/colesbury/nogil) Python fork which |
| 135 | + | uses mimalloc internally. |
| 136 | + | |
| 137 | + | |
| 138 | + | [genMC]: https://plv.mpi-sws.org/genmc/ |
| 139 | + | |
| 140 | + | ### Usage |
| 141 | + | |
| 142 | + | mimalloc is used in various large scale low-latency services and programs, for example: |
| 143 | + | |
| 144 | + | <a href="https://www.bing.com"><img height="50" align="left" src="https://upload.wikimedia.org/wikipedia/commons/e/e9/Bing_logo.svg"></a> |
| 145 | + | <a href="https://azure.microsoft.com/"><img height="50" align="left" src="https://upload.wikimedia.org/wikipedia/commons/a/a8/Microsoft_Azure_Logo.svg"></a> |
| 146 | + | <a href="https://deathstrandingpc.505games.com"><img height="100" src="doc/ds-logo.png"></a> |
| 147 | + | <a href="https://docs.unrealengine.com/4.26/en-US/WhatsNew/Builds/ReleaseNotes/4_25/"><img height="100" src="doc/unreal-logo.svg"></a> |
| 148 | + | <a href="https://cab.spbu.ru/software/spades/"><img height="100" src="doc/spades-logo.png"></a> |
| 149 | + | |
| 150 | + | |
| 151 | + | # Building |
| 152 | + | |
| 153 | + | ## Windows |
| 154 | + | |
| 155 | + | Open `ide/vs2019/mimalloc.sln` in Visual Studio 2019 and build. |
| 156 | + | The `mimalloc` project builds a static library (in `out/msvc-x64`), while the |
| 157 | + | `mimalloc-override` project builds a DLL for overriding malloc |
| 158 | + | in the entire program. |
| 159 | + | |
| 160 | + | ## macOS, Linux, BSD, etc. |
| 161 | + | |
| 162 | + | We use [`cmake`](https://cmake.org)<sup>1</sup> as the build system: |
| 163 | + | |
| 164 | + | ``` |
| 165 | + | > mkdir -p out/release |
| 166 | + | > cd out/release |
| 167 | + | > cmake ../.. |
| 168 | + | > make |
| 169 | + | ``` |
| 170 | + | This builds the library as a shared (dynamic) |
| 171 | + | library (`.so` or `.dylib`), a static library (`.a`), and |
| 172 | + | as a single object file (`.o`). |
| 173 | + | |
| 174 | + | `> sudo make install` (install the library and header files in `/usr/local/lib` and `/usr/local/include`) |
| 175 | + | |
| 176 | + | You can build the debug version which does many internal checks and |
| 177 | + | maintains detailed statistics as: |
| 178 | + | |
| 179 | + | ``` |
| 180 | + | > mkdir -p out/debug |
| 181 | + | > cd out/debug |
| 182 | + | > cmake -DCMAKE_BUILD_TYPE=Debug ../.. |
| 183 | + | > make |
| 184 | + | ``` |
| 185 | + | This will name the shared library as `libmimalloc-debug.so`. |
| 186 | + | |
| 187 | + | Finally, you can build a _secure_ version that uses guard pages, encrypted |
| 188 | + | free lists, etc., as: |
| 189 | + | ``` |
| 190 | + | > mkdir -p out/secure |
| 191 | + | > cd out/secure |
| 192 | + | > cmake -DMI_SECURE=ON ../.. |
| 193 | + | > make |
| 194 | + | ``` |
| 195 | + | This will name the shared library as `libmimalloc-secure.so`. |
| 196 | + | Use `ccmake`<sup>2</sup> instead of `cmake` |
| 197 | + | to see and customize all the available build options. |
| 198 | + | |
| 199 | + | Notes: |
| 200 | + | 1. Install CMake: `sudo apt-get install cmake` |
| 201 | + | 2. Install CCMake: `sudo apt-get install cmake-curses-gui` |
| 202 | + | |
| 203 | + | |
| 204 | + | ## Single source |
| 205 | + | |
| 206 | + | You can also directly build the single `src/static.c` file as part of your project without |
| 207 | + | needing `cmake` at all. Make sure to also add the mimalloc `include` directory to the include path. |
| 208 | + | |
| 209 | + | |
| 210 | + | # Using the library |
| 211 | + | |
| 212 | + | The preferred usage is including `<mimalloc.h>`, linking with |
| 213 | + | the shared or static library, and using the `mi_malloc` API exclusively for allocation. For example, |
| 214 | + | ``` |
| 215 | + | > gcc -o myprogram myfile.c -lmimalloc |
| 216 | + | ``` |
| 217 | + | |
| 218 | + | mimalloc uses only safe OS calls (`mmap` and `VirtualAlloc`) and can co-exist |
| 219 | + | with other allocators linked to the same program. |
| 220 | + | If you use `cmake`, you can simply use: |
| 221 | + | ``` |
| 222 | + | find_package(mimalloc 1.4 REQUIRED) |
| 223 | + | ``` |
| 224 | + | in your `CMakeLists.txt` to find a locally installed mimalloc. Then use either: |
| 225 | + | ``` |
| 226 | + | target_link_libraries(myapp PUBLIC mimalloc) |
| 227 | + | ``` |
| 228 | + | to link with the shared (dynamic) library, or: |
| 229 | + | ``` |
| 230 | + | target_link_libraries(myapp PUBLIC mimalloc-static) |
| 231 | + | ``` |
| 232 | + | to link with the static library. See `test/CMakeLists.txt` for an example. |
| 233 | + | |
| 234 | + | For best performance in C++ programs, it is also recommended to override the |
| 235 | + | global `new` and `delete` operators. For convenience, mimalloc provides |
| 236 | + | [`mimalloc-new-delete.h`](https://github.com/microsoft/mimalloc/blob/master/include/mimalloc-new-delete.h) which does this for you -- just include it in a single(!) source file in your project. |
| 237 | + | In C++, mimalloc also provides the `mi_stl_allocator` struct which implements the `std::allocator` |
| 238 | + | interface. |
| 239 | + | |
| 240 | + | You can pass environment variables to print verbose messages (`MIMALLOC_VERBOSE=1`) |
| 241 | + | and statistics (`MIMALLOC_SHOW_STATS=1`) (in the debug version): |
| 242 | + | ``` |
| 243 | + | > env MIMALLOC_SHOW_STATS=1 ./cfrac 175451865205073170563711388363 |
| 244 | + | |
| 245 | + | 175451865205073170563711388363 = 374456281610909315237213 * 468551 |
| 246 | + | |
| 247 | + | heap stats: peak total freed unit |
| 248 | + | normal 2: 16.4 kb 17.5 mb 17.5 mb 16 b ok |
| 249 | + | normal 3: 16.3 kb 15.2 mb 15.2 mb 24 b ok |
| 250 | + | normal 4: 64 b 4.6 kb 4.6 kb 32 b ok |
| 251 | + | normal 5: 80 b 118.4 kb 118.4 kb 40 b ok |
| 252 | + | normal 6: 48 b 48 b 48 b 48 b ok |
| 253 | + | normal 17: 960 b 960 b 960 b 320 b ok |
| 254 | + | |
| 255 | + | heap stats: peak total freed unit |
| 256 | + | normal: 33.9 kb 32.8 mb 32.8 mb 1 b ok |
| 257 | + | huge: 0 b 0 b 0 b 1 b ok |
| 258 | + | total: 33.9 kb 32.8 mb 32.8 mb 1 b ok |
| 259 | + | malloc requested: 32.8 mb |
| 260 | + | |
| 261 | + | committed: 58.2 kb 58.2 kb 58.2 kb 1 b ok |
| 262 | + | reserved: 2.0 mb 2.0 mb 2.0 mb 1 b ok |
| 263 | + | reset: 0 b 0 b 0 b 1 b ok |
| 264 | + | segments: 1 1 1 |
| 265 | + | -abandoned: 0 |
| 266 | + | pages: 6 6 6 |
| 267 | + | -abandoned: 0 |
| 268 | + | mmaps: 3 |
| 269 | + | mmap fast: 0 |
| 270 | + | mmap slow: 1 |
| 271 | + | threads: 0 |
| 272 | + | elapsed: 2.022s |
| 273 | + | process: user: 1.781s, system: 0.016s, faults: 756, reclaims: 0, rss: 2.7 mb |
| 274 | + | ``` |
| 275 | + | |
| 276 | + | Using the `mi_`-prefixed API as shown above is not always possible |
| 277 | + | in existing programs that already use the standard malloc interface. |
| 278 | + | Another option is to override the standard malloc interface |
| 279 | + | completely and redirect all calls to the _mimalloc_ library instead. |
| 280 | + | |
| 281 | + | ## Environment Options |
| 282 | + | |
| 283 | + | You can set further options either programmatically (using [`mi_option_set`](https://microsoft.github.io/mimalloc/group__options.html)), |
| 284 | + | or via environment variables: |
| 285 | + | |
| 286 | + | - `MIMALLOC_SHOW_STATS=1`: show statistics when the program terminates. |
| 287 | + | - `MIMALLOC_VERBOSE=1`: show verbose messages. |
| 288 | + | - `MIMALLOC_SHOW_ERRORS=1`: show error and warning messages. |
| 289 | + | - `MIMALLOC_PAGE_RESET=0`: by default, mimalloc will reset (or purge) OS pages that are not in use, to signal to the OS |
| 290 | + | that the underlying physical memory can be reused. This can reduce memory fragmentation in long running (server) |
| 291 | + | programs. By setting it to `0` this will no longer be done which can improve performance for batch-like programs. |
| 292 | + | As an alternative, the `MIMALLOC_RESET_DELAY=<msecs>` can be set higher (100ms by default) to make the page |
| 293 | + | reset occur less frequently instead of turning it off completely. |
| 294 | + | - `MIMALLOC_USE_NUMA_NODES=N`: pretend there are at most `N` NUMA nodes. If not set, the actual NUMA nodes are detected |
| 295 | + | at runtime. Setting `N` to 1 may avoid problems in some virtual environments. Also, setting it to a lower number than |
| 296 | + | the actual NUMA nodes is fine and will only cause threads to potentially allocate more memory across actual NUMA |
| 297 | + | nodes (but this can happen in any case as NUMA local allocation is always a best effort but not guaranteed). |
| 298 | + | - `MIMALLOC_LARGE_OS_PAGES=1`: use large OS pages (2MiB) when available; for some workloads this can significantly |
| 299 | + | improve performance. Use `MIMALLOC_VERBOSE` to check if the large OS pages are enabled -- usually one needs |
| 300 | + | to explicitly allow large OS pages (as on [Windows][windows-huge] and [Linux][linux-huge]). However, sometimes |
| 301 | + | the OS is very slow to reserve contiguous physical memory for large OS pages so use with care on systems that |
| 302 | + | can have fragmented memory (for that reason, we generally recommend using `MIMALLOC_RESERVE_HUGE_OS_PAGES` instead whenever possible). |
| 303 | + | <!-- |
| 304 | + | - `MIMALLOC_EAGER_REGION_COMMIT=1`: on Windows, commit large (256MiB) regions eagerly. On Windows, these regions |
| 305 | + | show in the working set even though usually just a small part is committed to physical memory. This is why it |
| 306 | + | turned off by default on Windows as it looks not good in the task manager. However, turning it on has no |
| 307 | + | real drawbacks and may improve performance by a little. |
| 308 | + | --> |
| 309 | + | - `MIMALLOC_RESERVE_HUGE_OS_PAGES=N`: where N is the number of 1GiB _huge_ OS pages. This reserves the huge pages at |
| 310 | + | startup and sometimes this can give a large (latency) performance improvement on big workloads. |
| 311 | + | Usually it is better to not use |
| 312 | + | `MIMALLOC_LARGE_OS_PAGES` in combination with this setting. Just like large OS pages, use with care as reserving |
| 313 | + | contiguous physical memory can take a long time when memory is fragmented (but reserving the huge pages is done at |
| 314 | + | startup only once). |
| 315 | + | Note that we usually need to explicitly enable huge OS pages (as on [Windows][windows-huge] and [Linux][linux-huge]). |
| 316 | + | With huge OS pages, it may be beneficial to set the setting |
| 317 | + | `MIMALLOC_EAGER_COMMIT_DELAY=N` (`N` is 1 by default) to delay the initial `N` segments (of 4MiB) |
| 318 | + | of a thread to not allocate in the huge OS pages; this prevents threads that are short lived |
| 319 | + | and allocate just a little to take up space in the huge OS page area (which cannot be reset). |
| 320 | + | The huge pages are usually allocated evenly among NUMA nodes. |
| 321 | + | We can use `MIMALLOC_RESERVE_HUGE_OS_PAGES_AT=N` where `N` is the numa node (starting at 0) to allocate all |
| 322 | + | the huge pages at a specific numa node instead. |
| 323 | + | |
| 324 | + | Use caution when using `fork` in combination with either large or huge OS pages: on a fork, the OS uses copy-on-write |
| 325 | + | for all pages in the original process including the huge OS pages. When any memory is now written in that area, the |
| 326 | + | OS will copy the entire 1GiB huge page (or 2MiB large page) which can cause the memory usage to grow in large increments. |
| 327 | + | |
| 328 | + | [linux-huge]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/tuning_and_optimizing_red_hat_enterprise_linux_for_oracle_9i_and_10g_databases/sect-oracle_9i_and_10g_tuning_guide-large_memory_optimization_big_pages_and_huge_pages-configuring_huge_pages_in_red_hat_enterprise_linux_4_or_5 |
| 329 | + | [windows-huge]: https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/enable-the-lock-pages-in-memory-option-windows?view=sql-server-2017 |
| 330 | + | |
| 331 | + | ## Secure Mode |
| 332 | + | |
| 333 | + | _mimalloc_ can be built in secure mode by using the `-DMI_SECURE=ON` flag in `cmake`. This build enables various mitigations |
| 334 | + | to make mimalloc more robust against exploits. In particular: |
| 335 | + | |
| 336 | + | - All internal mimalloc pages are surrounded by guard pages and the heap metadata is behind a guard page as well (so a buffer overflow |
| 337 | + | exploit cannot reach into the metadata). |
| 338 | + | - All free list pointers are |
| 339 | + | [encoded](https://github.com/microsoft/mimalloc/blob/783e3377f79ee82af43a0793910a9f2d01ac7863/include/mimalloc-internal.h#L396) |
| 340 | + | with per-page keys, which are used both to prevent overwrites with a known pointer and to detect heap corruption. |
| 341 | + | - Double frees are detected (and ignored). |
| 342 | + | - The free lists are initialized in a random order and allocation randomly chooses between extension and reuse within a page to |
| 343 | + | mitigate against attacks that rely on a predictable allocation order. Similarly, the larger heap blocks allocated by mimalloc |
| 344 | + | from the OS are also address randomized. |
| 345 | + | |
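To make the encoded free list idea concrete, here is a deliberately simplified sketch that XORs each stored link with a per-page key and rejects decoded links that fall outside the page. It is an assumption-laden illustration rather than mimalloc's real encoding (see the linked source for that); `secure_page_t`, `link_encode`, and `link_decode` are hypothetical names, and the key is a fixed constant only so the demo is reproducible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page descriptor: a block area plus a secret per-page key. */
typedef struct secure_page_s {
  uint8_t*  start;  /* first byte of the block area */
  size_t    size;   /* size of the block area in bytes */
  uintptr_t key;    /* per-page secret; a real key would be random */
} secure_page_t;

/* Store a `next` link as (pointer XOR key) so the raw address is never
   written into the free block itself. */
static uintptr_t link_encode(const secure_page_t* page, const void* next) {
  assert(page != NULL);
  return (uintptr_t)next ^ page->key;
}

/* Decode a link. Returns false when the result is neither NULL nor an
   address inside the page, which signals a corrupted or forged link. */
static bool link_decode(const secure_page_t* page, uintptr_t encoded, void** next) {
  assert(page != NULL && next != NULL);
  uintptr_t p  = encoded ^ page->key;
  uintptr_t lo = (uintptr_t)page->start;
  if (p != 0 && (p < lo || p - lo >= page->size)) return false;
  *next = (void*)p;
  return true;
}

/* Round-trip one link, then simulate an overwrite with a raw pointer. */
bool encoding_demo(void) {
  static _Alignas(16) uint8_t area[256];
  /* Fixed key for reproducibility only (assumption of this sketch). */
  secure_page_t page = { area, sizeof(area), (uintptr_t)0xA5A5A5A5u };
  void* next = NULL;

  /* Normal case: an encoded link decodes back to the original block. */
  uintptr_t stored = link_encode(&page, &area[64]);
  bool round_trip = link_decode(&page, stored, &next) && next == (void*)&area[64];

  /* Tampered case: a raw, unencoded pointer decodes to an address
     outside the page and is rejected. */
  uintptr_t forged = (uintptr_t)&area[64];
  bool detected = !link_decode(&page, forged, &next);

  return round_trip && detected;
}
```

An attacker who overwrites a link with a known address still has to guess the page key for the decoded value to land on a valid block, which is why the same mechanism doubles as a corruption check.
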
| 346 | + | As always, evaluate with care as part of an overall security strategy as all of the above are mitigations but not guarantees. |
| 347 | + | |
| 348 | + | ## Debug Mode |
| 349 | + | |
| 350 | + | When _mimalloc_ is built using debug mode, various checks are done at runtime to catch development errors. |
| 351 | + | |
| 352 | + | - Statistics are maintained in detail for each object size. They can be shown using `MIMALLOC_SHOW_STATS=1` at runtime. |
| 353 | + | - All objects have padding at the end to detect (byte precise) heap block overflows. |
| 354 | + | - Double frees and frees of invalid heap pointers are detected. |
| 355 | + | - Corrupted free-lists and some forms of use-after-free are detected. |
| 356 | + | |
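The padding check mentioned in the list above can be illustrated with a small stand-alone sketch. It assumes a hypothetical wrapper around the standard `malloc` (`padded_alloc`, `padding_intact`) with an arbitrary canary byte and padding size; mimalloc's debug build implements this internally and differently, so treat this only as a picture of the technique.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAD_BYTE 0xDE  /* arbitrary canary value chosen for this sketch */
#define PAD_SIZE 8     /* arbitrary number of padding bytes */

/* Hypothetical wrapper: `size` usable bytes followed by canary padding. */
static void* padded_alloc(size_t size) {
  unsigned char* p = malloc(size + PAD_SIZE);
  if (p != NULL) memset(p + size, PAD_BYTE, PAD_SIZE);
  return p;
}

/* Check the padding; any changed byte means a write went past `size`. */
static bool padding_intact(const void* block, size_t size) {
  assert(block != NULL);
  const unsigned char* pad = (const unsigned char*)block + size;
  for (size_t i = 0; i < PAD_SIZE; i++) {
    if (pad[i] != PAD_BYTE) return false;
  }
  return true;
}

/* Returns true when an in-bounds write passes the check and a
   one-byte overflow is caught. */
bool padding_demo(void) {
  const size_t size = 13;          /* deliberately not a multiple of 8 */
  char* buf = padded_alloc(size);
  if (buf == NULL) return false;

  memset(buf, 'a', size);          /* fills exactly the usable bytes */
  bool ok_before = padding_intact(buf, size);

  buf[size] = 'b';                 /* overflow by one byte, into the padding */
  bool caught = !padding_intact(buf, size);

  free(buf);
  return ok_before && caught;
}
```

Placing the canary directly after the requested size, rather than after the rounded-up block size, is what makes the detection byte precise.
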
| 357 | + | |
| 358 | + | # Overriding Standard Malloc |
| 359 | + | |
| 360 | + | Overriding the standard `malloc` (and `new`) can be done either _dynamically_ or _statically_. |
| 361 | + | |
| 362 | + | ## Dynamic override |
| 363 | + | |
| 364 | + | This is the recommended way to override the standard malloc interface. |
| 365 | + | |
| 366 | + | ### Dynamic Override on Linux, BSD |
| 367 | + | |
| 368 | + | On these ELF-based systems we preload the mimalloc shared |
| 369 | + | library so all calls to the standard `malloc` interface are |
| 370 | + | resolved to the _mimalloc_ library. |
| 371 | + | ``` |
| 372 | + | > env LD_PRELOAD=/usr/lib/libmimalloc.so myprogram |
| 373 | + | ``` |
| 374 | + | |
| 375 | + | You can set extra environment variables to check that mimalloc is running, |
| 376 | + | like: |
| 377 | + | ``` |
| 378 | + | > env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/libmimalloc.so myprogram |
| 379 | + | ``` |
| 380 | + | or run with the debug version to get detailed statistics: |
| 381 | + | ``` |
| 382 | + | > env MIMALLOC_SHOW_STATS=1 LD_PRELOAD=/usr/lib/libmimalloc-debug.so myprogram |
| 383 | + | ``` |
| 384 | + | |
| 385 | + | ### Dynamic Override on macOS |
| 386 | + | |
| 387 | + | On macOS we can also preload the mimalloc shared |
| 388 | + | library so all calls to the standard `malloc` interface are |
| 389 | + | resolved to the _mimalloc_ library. |
| 390 | + | ``` |
| 391 | + | > env DYLD_INSERT_LIBRARIES=/usr/lib/libmimalloc.dylib myprogram |
| 392 | + | ``` |
| 393 | + | |
| 394 | + | Note that certain security restrictions may apply when doing this from |
| 395 | + | the [shell](https://stackoverflow.com/questions/43941322/dyld-insert-libraries-ignored-when-calling-application-through-bash). |
| 396 | + | |
| 397 | + | |
| 398 | + | ### Dynamic Override on Windows |
| 399 | + | |
| 400 | + | <span id="override_on_windows">Overriding on Windows</span> is robust and has the |
| 401 | + | particular advantage of being able to redirect all malloc/free calls that go through |
| 402 | + | the (dynamic) C runtime allocator, including those from other DLLs or libraries. |
| 403 | + | |
| 404 | + | The overriding on Windows requires that you link your program explicitly with |
| 405 | + | the mimalloc DLL and use the C-runtime library as a DLL (using the `/MD` or `/MDd` switch). |
| 406 | + | Also, the `mimalloc-redirect.dll` (or `mimalloc-redirect32.dll`) must be put |
| 407 | + | in the same folder as the main `mimalloc-override.dll` at runtime (as it is a dependency). |
| 408 | + | The redirection DLL ensures that all calls to the C runtime malloc API get redirected to |
| 409 | + | mimalloc (in `mimalloc-override.dll`). |
| 410 | + | |
| 411 | + | To ensure the mimalloc DLL is loaded at run-time it is easiest to insert some |
| 412 | + | call to the mimalloc API in the `main` function, like `mi_version()` |
| 413 | + | (or use the `/INCLUDE:mi_version` switch on the linker). See the `mimalloc-override-test` project |
| 414 | + | for an example on how to use this. For best performance on Windows with C++, it |
| 415 | + | is also recommended to override the `new`/`delete` operations (by including |
| 416 | + | [`mimalloc-new-delete.h`](https://github.com/microsoft/mimalloc/blob/master/include/mimalloc-new-delete.h) in a single(!) source file in your project). |
| 417 | + | |
| 418 | + | The environment variable `MIMALLOC_DISABLE_REDIRECT=1` can be used to disable dynamic |
| 419 | + | overriding at run-time. Use `MIMALLOC_VERBOSE=1` to check if mimalloc was successfully redirected. |
| 420 | + | |
| 421 | + | (Note: in principle, it is possible to even patch existing executables without any recompilation |
| 422 | + | if they are linked with the dynamic C runtime (`ucrtbase.dll`) -- just put the `mimalloc-override.dll` |
| 423 | + | into the import table (and put `mimalloc-redirect.dll` in the same folder). |
| 424 | + | Such patching can be done, for example, with [CFF Explorer](https://ntcore.com/?page_id=388).) |
| 425 | + | |
| 426 | + | |
| 427 | + | ## Static override |
| 428 | + | |
| 429 | + | On Unix-like systems, you can also statically link with _mimalloc_ to override the standard |
| 430 | + | malloc interface. The recommended way is to link the final program with the |
| 431 | + | _mimalloc_ single object file (`mimalloc.o`). We use |
| 432 | + | an object file instead of a library file as linkers give preference to |
| 433 | + | that over archives to resolve symbols. To ensure that the standard |
| 434 | + | malloc interface resolves to the _mimalloc_ library, link it as the first |
| 435 | + | object file. For example: |
| 436 | + | ``` |
| 437 | + | > gcc -o myprogram mimalloc.o myfile1.c ... |
| 438 | + | ``` |
| 439 | + | |
| 440 | + | Another way to override statically that works on all platforms, is to |
| 441 | + | link statically to mimalloc (as shown in the introduction) and include a |
| 442 | + | header file in each source file that re-defines `malloc` etc. to `mi_malloc`. |
| 443 | + | This is provided by [`mimalloc-override.h`](https://github.com/microsoft/mimalloc/blob/master/include/mimalloc-override.h). This only works reliably though if all sources are |
| 444 | + | under your control or otherwise mixing of pointers from different heaps may occur! |
| 445 | + | |
| 446 | + | |
| 447 | + | ## Tools |
| 448 | + | |
| 449 | + | Generally, we recommend using the standard allocator with memory tracking tools, but mimalloc |
| 450 | + | can also be built to support the [address sanitizer][asan] or the excellent [Valgrind] tool. |
| 451 | + | Moreover, it can be built to support Windows event tracing ([ETW]). |
| 452 | + | This has a small performance overhead but does allow detecting memory leaks and byte-precise |
| 453 | + | buffer overflows directly on final executables. See also the `test/test-wrong.c` file to test with various tools. |
| 454 | + | |
| 455 | + | ### Valgrind |
| 456 | + | |
| 457 | + | To build with [valgrind] support, use the `-DMI_TRACK_VALGRIND=ON` cmake option: |
| 458 | + | |
| 459 | + | ``` |
| 460 | + | > cmake ../.. -DMI_TRACK_VALGRIND=ON |
| 461 | + | ``` |
| 462 | + | |
| 463 | + | This can also be combined with secure mode or debug mode. |
| 464 | + | You can then run your programs directly under valgrind: |
| 465 | + | |
| 466 | + | ``` |
| 467 | + | > valgrind <myprogram> |
| 468 | + | ``` |
| 469 | + | |
| 470 | + | If you rely on overriding `malloc`/`free` by mimalloc (instead of using the `mi_malloc`/`mi_free` API directly), |
| 471 | + | you also need to tell `valgrind` to not intercept those calls itself, and use: |
| 472 | + | |
| 473 | + | ``` |
| 474 | + | > MIMALLOC_SHOW_STATS=1 valgrind --soname-synonyms=somalloc=*mimalloc* -- <myprogram> |
| 475 | + | ``` |
| 476 | + | |
| 477 | + | By setting the `MIMALLOC_SHOW_STATS` environment variable you can check that mimalloc is indeed |
| 478 | + | used and not the standard allocator. Even though the [Valgrind option][valgrind-soname] |
| 479 | + | is called `--soname-synonyms`, this also |
| 480 | + | works when overriding with a static library or object file. Unfortunately, it is not possible to |
| 481 | + | dynamically override the standard allocator with mimalloc using `LD_PRELOAD` together with `valgrind`. |
| 482 | + | See also the `test/test-wrong.c` file to test with `valgrind`. |
| 483 | + | |
| 484 | + | Valgrind support is in its initial development -- please report any issues. |
| 485 | + | |
| 486 | + | [Valgrind]: https://valgrind.org/ |
| 487 | + | [valgrind-soname]: https://valgrind.org/docs/manual/manual-core.html#opt.soname-synonyms |
| 488 | + | |
| 489 | + | ### ASAN |
| 490 | + | |
| 491 | + | To build with the address sanitizer, use the `-DMI_TRACK_ASAN=ON` cmake option: |
| 492 | + | |
| 493 | + | ``` |
| 494 | + | > cmake ../.. -DMI_TRACK_ASAN=ON |
| 495 | + | ``` |
| 496 | + | |
| 497 | + | This can also be combined with secure mode or debug mode. |
| 498 | + | You can then run your programs as: |
| 499 | + | |
| 500 | + | ``` |
| 501 | + | > ASAN_OPTIONS=verbosity=1 <myprogram> |
| 502 | + | ``` |
| 503 | + | |
| 504 | + | When you link a program with an address sanitizer build of mimalloc, you should |
| 505 | + | generally compile that program too with the address sanitizer enabled. |
| 506 | + | For example, assuming you build mimalloc in `out/debug`: |
| 507 | + | |
| 508 | + | ``` |
| 509 | + | clang -g -o test-wrong -Iinclude test/test-wrong.c out/debug/libmimalloc-asan-debug.a -lpthread -fsanitize=address -fsanitize-recover=address |
| 510 | + | ``` |
| 511 | + | |
| 512 | + | Since the address sanitizer redirects the standard allocation functions, on some platforms (macOS for example) |
| 513 | + | it is required to compile mimalloc with `-DMI_OVERRIDE=OFF`. |
| 514 | + | Address sanitizer support is in its initial development -- please report any issues. |
| 515 | + | |
| 516 | + | [asan]: https://github.com/google/sanitizers/wiki/AddressSanitizer |
| 517 | + | |
| 518 | + | ### ETW |
| 519 | + | |
| 520 | + | Event tracing for Windows ([ETW]) provides a high performance way to capture all allocations through |
| 521 | + | mimalloc and analyze them later. To build with ETW support, use the `-DMI_TRACE_ETW=ON` cmake option. |
| 522 | + | |
| 523 | + | You can then capture an allocation trace using the Windows performance recorder (WPR), using the |
| 524 | + | `src/prim/windows/etw-mimalloc.wprp` profile. In an admin prompt, you can use: |
| 525 | + | ``` |
| 526 | + | > wpr -start src\prim\windows\etw-mimalloc.wprp -filemode |
| 527 | + | > <my_mimalloc_program> |
| 528 | + | > wpr -stop <my_mimalloc_program>.etl |
| 529 | + | ``` |
| 530 | + | and then open `<my_mimalloc_program>.etl` in the Windows Performance Analyzer (WPA), or |
| 531 | + | use a tool like [TraceControl] that is specialized for analyzing mimalloc traces. |
| 532 | + | |
| 533 | + | [ETW]: https://learn.microsoft.com/en-us/windows-hardware/test/wpt/event-tracing-for-windows |
| 534 | + | [TraceControl]: https://github.com/xinglonghe/TraceControl |
| 535 | + | |
| 536 | + | |
| 537 | + | # Performance |
| 538 | + | |
| 539 | + | Last update: 2021-01-30 |
| 540 | + | |
| 541 | + | We tested _mimalloc_ against many other top allocators over a wide |
| 542 | + | range of benchmarks, ranging from various real world programs to |
| 543 | + | synthetic benchmarks that see how the allocator behaves under more |
| 544 | + | extreme circumstances. In our benchmark suite, _mimalloc_ outperforms other leading |
| 545 | + | allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc), and has a similar memory footprint. A nice property is that it |
| 546 | + | does consistently well over the wide range of benchmarks. |
| 547 | + | |
| 548 | + | General memory allocators are interesting as there exists no algorithm that is |
| 549 | + | optimal -- for a given allocator one can usually construct a workload |
| 550 | + | where it does not do so well. The goal is thus to find an allocation |
| 551 | + | strategy that performs well over a wide range of benchmarks without |
| 552 | + | suffering from (too much) underperformance in less common situations. |
| 553 | + | |
| 554 | + | As always, interpret these results with care since some benchmarks test synthetic |
| 555 | + | or uncommon situations that may never apply to your workloads. For example, most |
| 556 | + | allocators do not do well on `xmalloc-testN` but that includes even the best |
| 557 | + | industrial allocators like _jemalloc_ and _tcmalloc_ that are used in some of |
| 558 | + | the world's largest systems (like Chrome or FreeBSD). |
| 559 | + | |
| 560 | + | Also, the benchmarks here do not measure the behaviour on very large and long-running server workloads, |
| 561 | + | or worst-case latencies of allocation. Much work has gone into `mimalloc` to work well on such |
| 562 | + | workloads (for example, to reduce virtual memory fragmentation on long-running services) |
| 563 | + | but such optimizations are not always reflected in the current benchmark suite. |
| 564 | + | |
| 565 | + | We show here only an overview -- for |
| 566 | + | more specific details and further benchmarks we refer to the |
| 567 | + | [technical report](https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action). |
| 568 | + | The benchmark suite is automated and available separately |
| 569 | + | as [mimalloc-bench](https://github.com/daanx/mimalloc-bench). |
| 570 | + | |
| 571 | + | |
| 572 | + | ## Benchmark Results on a 16-core AMD 5950x (Zen3) |
| 573 | + | |
| 574 | + | Testing on the 16-core AMD 5950x processor at 3.4GHz (4.9GHz boost), |
| 575 | + | with 32GiB memory at 3600MHz, running Ubuntu 20.04 with glibc 2.31 and GCC 9.3.0. |
| 576 | + | |
| 577 | + | We measure three versions of _mimalloc_: the main version `mi` (tag:v1.7.0), |
| 578 | + | the new v2.0 beta version as `xmi` (tag:v2.0.0), and the main version in secure mode as `smi` (tag:v1.7.0). |
| 579 | + | |
| 580 | + | The other allocators are |
| 581 | + | Google's [_tcmalloc_](https://github.com/gperftools/gperftools) (`tc`, tag:gperftools-2.8.1) used in Chrome, |
| 582 | + | Facebook's [_jemalloc_](https://github.com/jemalloc/jemalloc) (`je`, tag:5.2.1) by Jason Evans used in Firefox and FreeBSD, |
| 583 | + | the Intel thread building blocks [allocator](https://github.com/intel/tbb) (`tbb`, tag:v2020.3), |
| 584 | + | [rpmalloc](https://github.com/mjansson/rpmalloc) (`rp`,tag:1.4.1) by Mattias Jansson, |
| 585 | + | the original scalable [_Hoard_](https://github.com/emeryberger/Hoard) (git:d880f72) allocator by Emery Berger \[1], |
| 586 | + | the memory compacting [_Mesh_](https://github.com/plasma-umass/Mesh) (git:67ff31a) allocator by |
| 587 | + | Bobby Powers _et al_ \[8], |
| 588 | + | and finally the default system allocator (`glibc`, 2.31) (based on _PtMalloc2_). |
| 589 | + | |
| 590 | + | <img width="90%" src="doc/bench-2021/bench-amd5950x-2021-01-30-a.svg"/> |
| 591 | + | <img width="90%" src="doc/bench-2021/bench-amd5950x-2021-01-30-b.svg"/> |
| 592 | + | |
| 593 | + | Any benchmarks ending in `N` run on all 32 logical cores in parallel. |
| 594 | + | Results are averaged over 10 runs and reported relative |
| 595 | + | to mimalloc (where 1.2 means it took 1.2× longer to run). |
| 596 | + | The legend also contains the _overall relative score_ between the |
| 597 | + | allocators where 100 points is the maximum if an allocator is fastest on |
| 598 | + | all benchmarks. |
| 599 | + | |
| 600 | + | The single threaded _cfrac_ benchmark by Dave Barrett is an implementation of |
| 601 | + | continued fraction factorization which uses many small short-lived allocations. |
| 602 | + | All allocators do well on such common usage, where _mimalloc_ is just a tad |
| 603 | + | faster than _tcmalloc_ and |
| 604 | + | _jemalloc_. |
| 605 | + | |
| 606 | + | The _leanN_ program is interesting as a large realistic and |
| 607 | + | concurrent workload of the [Lean](https://github.com/leanprover/lean) |
| 608 | + | theorem prover compiling its own standard library, and there is a 13% |
| 609 | + | speedup over _tcmalloc_. This is |
| 610 | + | quite significant: if Lean spends 20% of its time in the |
| 611 | + | allocator that means that _mimalloc_ is 1.6× faster than _tcmalloc_ |
| 612 | + | here. (This is surprising as that is not measured in a pure |
| 613 | + | allocation benchmark like _alloc-test_. We conjecture that we see this |
| 614 | + | outsized improvement here because _mimalloc_ has better locality in |
| 615 | + | the allocation which improves performance for the *other* computations |
| 616 | + | in a program as well). |
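| | + | |
| | + | As a rough sketch of where the 1.6× figure comes from (the symbols are illustrative and not part of the benchmark suite): normalize the _leanN_ run time with _mimalloc_ to 1, of which 0.2 is allocator time under the 20% assumption above and 0.8 is everything else. Reading the 13% speedup as _tcmalloc_ taking 1.13× as long, and assuming the non-allocator time stays the same, gives: |
| | + | |
| | + | ``` |
| | + | T_{\mathrm{tc}} = 1.13, \qquad T_{\mathrm{tc}}^{\mathrm{alloc}} = 1.13 - 0.8 = 0.33, \qquad \frac{T_{\mathrm{tc}}^{\mathrm{alloc}}}{T_{\mathrm{mi}}^{\mathrm{alloc}}} = \frac{0.33}{0.2} = 1.65 |
| | + | ``` |
| | + | |
| | + | This is consistent with the roughly 1.6× quoted above. |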
| 617 | + | |
| 618 | + | The single threaded _redis_ benchmark again shows that most allocators do well on such workloads. |
| 619 | + | |
| 620 | + | The _larsonN_ server benchmark by Larson and Krishnan \[2] allocates and frees between threads. They observed this |
| 621 | + | behavior (which they call _bleeding_) in actual server applications, and the benchmark simulates this. |
| 622 | + | Here, _mimalloc_ is quite a bit faster than _tcmalloc_ and _jemalloc_ probably due to the object migration between different threads. |
| 623 | + | |
| 624 | + | The _mstressN_ workload performs many allocations and re-allocations, |
| 625 | + | and migrates objects between threads (as in _larsonN_). However, it also |
| 626 | + | creates and destroys the _N_ worker threads a few times keeping some objects |
| 627 | + | alive beyond the lifetime of the allocating thread. We observed this |
| 628 | + | behavior in many larger server applications. |
| 629 | + | |
| 630 | + | The [_rptestN_](https://github.com/mjansson/rpmalloc-benchmark) benchmark |
| 631 | + | by Mattias Jansson is an allocator test originally designed |
| 632 | + | for _rpmalloc_, and tries to simulate realistic allocation patterns over |
| 633 | + | multiple threads. Here the differences between allocators become more apparent. |
| 634 | + | |
| 635 | + | The second benchmark set tests specific aspects of the allocators and |
| 636 | + | shows even more extreme differences between them. |
| 637 | + | |
| 638 | + | The _alloc-test_, by |
| 639 | + | [OLogN Technologies AG](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/), is a very allocation intensive benchmark doing millions of |
| 640 | + | allocations in various size classes. The test is scaled such that when an |
| 641 | + | allocator performs almost identically on _alloc-test1_ as _alloc-testN_ it |
| 642 | + | means that it scales linearly. |
| 643 | + | |
| 644 | + | The _sh6bench_ and _sh8bench_ benchmarks are |
| 645 | + | developed by [MicroQuill](http://www.microquill.com/) as part of SmartHeap. |
| 646 | + | In _sh6bench_ _mimalloc_ does much |
| 647 | + | better than the others (more than 2.5× faster than _jemalloc_). |
| 648 | + | We cannot explain this well but believe it is |
| 649 | + | caused in part by the "reverse" free-ing pattern in _sh6bench_. |
| 650 | + | The _sh8bench_ is a variation with object migration |
| 651 | + | between threads; whereas _tcmalloc_ did well on _sh6bench_, the addition of object migration causes it to be 10× slower than before. |
| 652 | + | |
| 653 | + | The _xmalloc-testN_ benchmark by Lever and Boreham \[5] and Christian Eder, simulates an asymmetric workload where |
| 654 | + | some threads only allocate, and others only free -- they observed this pattern in |
| 655 | + | larger server applications. Here we see that |
| 656 | + | the _mimalloc_ technique of having non-contended sharded thread free |
| 657 | + | lists pays off as it outperforms others by a very large margin. Only _rpmalloc_, _tbb_, and _glibc_ also scale well on this benchmark. |
| 658 | + | |
| 659 | + | The _cache-scratch_ benchmark by Emery Berger \[1] was introduced with |
| 660 | + | the Hoard allocator to test for _passive-false_ sharing of cache lines. |
| 661 | + | With a single thread they all |
| 662 | + | perform the same, but when running with multiple threads the potential allocator |
| 663 | + | induced false sharing of the cache lines can cause large run-time differences. |
| 664 | + | Crundal \[6] describes in detail why the false cache line sharing occurs in the _tcmalloc_ design, and also discusses how this |
| 665 | + | can be avoided with some small implementation changes. |
| 666 | + | Only the _tbb_, _rpmalloc_ and _mesh_ allocators also avoid the |
| 667 | + | cache line sharing completely, while _Hoard_ and _glibc_ seem to mitigate |
| 668 | + | the effects. Kukanov and Voss \[7] describe in detail |
| 669 | + | how the design of _tbb_ avoids the false cache line sharing. |
| 670 | + | |
| 671 | + | |
| 672 | + | ## On a 36-core Intel Xeon |
| 673 | + | |
| 674 | + | For completeness, here are the results on a big Amazon |
| 675 | + | [c5.18xlarge](https://aws.amazon.com/ec2/instance-types/#Compute_Optimized) instance |
| 676 | + | consisting of a 2×18-core Intel Xeon (Cascade Lake) at 3.4GHz (boost 3.5GHz) |
| 677 | + | with 144GiB ECC memory, running Ubuntu 20.04 with glibc 2.31, GCC 9.3.0, and |
| 678 | + | Clang 10.0.0. This time, the mimalloc allocators (mi, xmi, and smi) were |
| 679 | + | compiled with the Clang compiler instead of GCC. |
| 680 | + | The results are similar to the AMD results but it is interesting to |
| 681 | + | see the differences in the _larsonN_, _mstressN_, and _xmalloc-testN_ benchmarks. |
| 682 | + | |
| 683 | + | <img width="90%" src="doc/bench-2021/bench-c5-18xlarge-2021-01-30-a.svg"/> |
| 684 | + | <img width="90%" src="doc/bench-2021/bench-c5-18xlarge-2021-01-30-b.svg"/> |
| 685 | + | |
| 686 | + | |
| 687 | + | ## Peak Working Set |
| 688 | + | |
| 689 | + | The following figure shows the peak working set (rss) of the allocators |
| 690 | + | on the benchmarks (on the c5.18xlarge instance). |
| 691 | + | |
| 692 | + | <img width="90%" src="doc/bench-2021/bench-c5-18xlarge-2021-01-30-rss-a.svg"/> |
| 693 | + | <img width="90%" src="doc/bench-2021/bench-c5-18xlarge-2021-01-30-rss-b.svg"/> |
| 694 | + | |
| 695 | + | Note that the _xmalloc-testN_ memory usage should be disregarded as it |
| 696 | + | allocates more the faster the program runs. Similarly, memory usage of |
| 697 | + | _larsonN_, _mstressN_, _rptestN_ and _sh8bench_ can vary depending on scheduling and |
| 698 | + | speed. Nevertheless, we hope to improve the memory usage on _mstressN_ |
| 699 | + | and _rptestN_ (note that _cfrac_, _larsonN_ and _sh8bench_ have a small working set, which skews the results). |
| 700 | + | |
| 701 | + | <!-- |
| 702 | + | # Previous Benchmarks |
| 703 | + | |
| 704 | + | Todo: should we create a separate page for this? |
| 705 | + | |
| 706 | + | ## Benchmark Results on 36-core Intel: 2020-01-20 |
| 707 | + | |
| 708 | + | Testing on a big Amazon EC2 compute instance |
| 709 | + | ([c5.18xlarge](https://aws.amazon.com/ec2/instance-types/#Compute_Optimized)) |
| 710 | + | consisting of a 72 processor Intel Xeon at 3GHz |
| 711 | + | with 144GiB ECC memory, running Ubuntu 18.04.1 with glibc 2.27 and GCC 7.4.0. |
| 712 | + | The measured allocators are _mimalloc_ (xmi, tag:v1.4.0, page reset enabled) |
| 713 | + | and its secure build as _smi_, |
| 714 | + | Google's [_tcmalloc_](https://github.com/gperftools/gperftools) (tc, tag:gperftools-2.7) used in Chrome, |
| 715 | + | Facebook's [_jemalloc_](https://github.com/jemalloc/jemalloc) (je, tag:5.2.1) by Jason Evans used in Firefox and FreeBSD, |
| 716 | + | the Intel thread building blocks [allocator](https://github.com/intel/tbb) (tbb, tag:2020), |
| 717 | + | [rpmalloc](https://github.com/mjansson/rpmalloc) (rp,tag:1.4.0) by Mattias Jansson, |
| 718 | + | the original scalable [_Hoard_](https://github.com/emeryberger/Hoard) (tag:3.13) allocator by Emery Berger \[1], |
| 719 | + | the memory compacting [_Mesh_](https://github.com/plasma-umass/Mesh) (git:51222e7) allocator by |
| 720 | + | Bobby Powers _et al_ \[8], |
| 721 | + | and finally the default system allocator (glibc, 2.27) (based on _PtMalloc2_). |
| 722 | + | |
| 723 | + | <img width="90%" src="doc/bench-2020/bench-c5-18xlarge-2020-01-20-a.svg"/> |
| 724 | + | <img width="90%" src="doc/bench-2020/bench-c5-18xlarge-2020-01-20-b.svg"/> |
| 725 | + | |
| 726 | + | The following figure shows the peak working set (rss) of the allocators |
| 727 | + | on the benchmarks (on the c5.18xlarge instance). |
| 728 | + | |
| 729 | + | <img width="90%" src="doc/bench-2020/bench-c5-18xlarge-2020-01-20-rss-a.svg"/> |
| 730 | + | <img width="90%" src="doc/bench-2020/bench-c5-18xlarge-2020-01-20-rss-b.svg"/> |
| 731 | + | |
| 732 | + | |
| 733 | + | ## On 24-core AMD Epyc, 2020-01-16 |
| 734 | + | |
| 735 | + | For completeness, here are the results on a |
| 736 | + | [r5a.12xlarge](https://aws.amazon.com/ec2/instance-types/#Memory_Optimized) instance |
| 737 | + | having a 48 processor AMD Epyc 7000 at 2.5GHz with 384GiB of memory. |
| 738 | + | The results are similar to the Intel results but it is interesting to |
| 739 | + | see the differences in the _larsonN_, _mstressN_, and _xmalloc-testN_ benchmarks. |
| 740 | + | |
| 741 | + | <img width="90%" src="doc/bench-2020/bench-r5a-12xlarge-2020-01-16-a.svg"/> |
| 742 | + | <img width="90%" src="doc/bench-2020/bench-r5a-12xlarge-2020-01-16-b.svg"/> |
| 743 | + | |
| 744 | + | --> |
| 745 | + | |
| 746 | + | |
| 747 | + | # References |
| 748 | + | |
| 749 | + | - \[1] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. |
| 750 | + | _Hoard: A Scalable Memory Allocator for Multithreaded Applications_. |
| 751 | + | In the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November 2000. |
| 752 | + | [pdf](http://www.cs.utexas.edu/users/mckinley/papers/asplos-2000.pdf) |
| 753 | + | |
| 754 | + | - \[2] P. Larson and M. Krishnan. _Memory allocation for long-running server applications_. |
| 755 | + | In ISMM, Vancouver, B.C., Canada, 1998. [pdf](http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.45.1947&rep=rep1&type=pdf) |
| 756 | + | |
| 757 | + | - \[3] D. Grunwald, B. Zorn, and R. Henderson. |
| 758 | + | _Improving the cache locality of memory allocation_. In R. Cartwright, editor, |
| 759 | + | Proceedings of the Conference on Programming Language Design and Implementation, pages 177–186, New York, NY, USA, June 1993. [pdf](http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.43.6621&rep=rep1&type=pdf) |
| 760 | + | |
| 761 | + | - \[4] J. Barnes and P. Hut. _A hierarchical O(n*log(n)) force-calculation algorithm_. Nature, 324:446-449, 1986. |
| 762 | + | |
| 763 | + | - \[5] C. Lever, and D. Boreham. _Malloc() Performance in a Multithreaded Linux Environment._ |
| 764 | + | In USENIX Annual Technical Conference, Freenix Session. San Diego, CA. Jun. 2000. |
| 765 | + | Available at <https://github.com/kuszmaul/SuperMalloc/tree/master/tests> |
| 766 | + | |
| 767 | + | - \[6] Timothy Crundal. _Reducing Active-False Sharing in TCMalloc_. 2016. CS16S1 project at the Australian National University. [pdf](http://courses.cecs.anu.edu.au/courses/CSPROJECTS/16S1/Reports/Timothy_Crundal_Report.pdf) |
| 768 | + | |
| 769 | + | - \[7] Alexey Kukanov, and Michael J Voss. |
| 770 | + | _The Foundations for Scalable Multi-Core Software in Intel Threading Building Blocks._ |
| 771 | + | Intel Technology Journal 11 (4). 2007 |
| 772 | + | |
| 773 | + | - \[8] Bobby Powers, David Tench, Emery D. Berger, and Andrew McGregor. |
| 774 | + | _Mesh: Compacting Memory Management for C/C++_ |
| 775 | + | In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'19), June 2019, pages 333–346. |
| 776 | + | |
| 777 | + | <!-- |
| 778 | + | - \[9] Paul Liétar, Theodore Butler, Sylvan Clebsch, Sophia Drossopoulou, Juliana Franco, Matthew J Parkinson, |
| 779 | + | Alex Shamis, Christoph M Wintersteiger, and David Chisnall. |
| 780 | + | _Snmalloc: A Message Passing Allocator._ |
| 781 | + | In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, 122–135. ACM. 2019. |
| 782 | + | --> |
| 783 | + | |
| 784 | + | # Contributing |
| 785 | + | |
| 786 | + | This project welcomes contributions and suggestions. Most contributions require you to agree to a |
| 787 | + | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us |
| 788 | + | the rights to use your contribution. For details, visit https://cla.microsoft.com. |
| 789 | + | |
| 790 | + | When you submit a pull request, a CLA-bot will automatically determine whether you need to provide |
| 791 | + | a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions |
| 792 | + | provided by the bot. You will only need to do this once across all repos using our CLA. |
| 793 | + | |
| 794 | + | |
| 795 | + | # Older Release Notes |
| 796 | + | |
| 797 | + | * 2020-09-24, `v1.6.7`: stable release 1.6: using standard C atomics, passing tsan testing, improved |
| 798 | + | handling of failing to commit on Windows, add [`mi_process_info`](https://github.com/microsoft/mimalloc/blob/master/include/mimalloc.h#L156) api call. |
| 799 | + | * 2020-08-06, `v1.6.4`: stable release 1.6: improved error recovery in low-memory situations, |
| 800 | + | support for IllumOS and Haiku, NUMA support for Vista/XP, improved NUMA detection for AMD Ryzen, ubsan support. |
| 801 | + | * 2020-05-05, `v1.6.3`: stable release 1.6: improved behavior in out-of-memory situations, improved malloc zones on macOS, |
| 802 | + | build PIC static libraries by default, add option to abort on out-of-memory, line buffered statistics. |
| 803 | + | * 2020-04-20, `v1.6.2`: stable release 1.6: fix compilation on Android, MingW, Raspberry, and Conda, |
| 804 | + | stability fix for Windows 7, fix multiple mimalloc instances in one executable, fix `strnlen` overload, |
| 805 | + | fix aligned debug padding. |
| 806 | + | * 2020-02-17, `v1.6.1`: stable release 1.6: minor updates (build with clang-cl, fix alignment issue for small objects). |
| 807 | + | * 2020-02-09, `v1.6.0`: stable release 1.6: fixed potential memory leak, improved overriding |
| 808 | + | and thread local support on FreeBSD, NetBSD, DragonFly, and macOSX. New byte-precise |
| 809 | + | heap block overflow detection in debug mode (besides the double-free detection and free-list |
| 810 | + | corruption detection). Add `nodiscard` attribute to most allocation functions. |
| 811 | + | Enable `MIMALLOC_PAGE_RESET` by default. New reclamation strategy for abandoned heap pages |
| 812 | + | for better memory footprint. |
| 813 | + | * 2020-02-09, `v1.5.0`: stable release 1.5: improved free performance, small bug fixes. |
| 814 | + | * 2020-01-22, `v1.4.0`: stable release 1.4: improved performance for delayed OS page reset, |
| 815 | + | more eager concurrent free, addition of STL allocator, fixed potential memory leak. |
| 816 | + | * 2020-01-15, `v1.3.0`: stable release 1.3: bug fixes, improved randomness and [stronger |
| 817 | + | free list encoding](https://github.com/microsoft/mimalloc/blob/783e3377f79ee82af43a0793910a9f2d01ac7863/include/mimalloc-internal.h#L396) in secure mode. |
| 818 | + | * 2019-12-22, `v1.2.2`: stable release 1.2: minor updates. |
| 819 | + | * 2019-11-22, `v1.2.0`: stable release 1.2: bug fixes, improved secure mode (free list corruption checks, double free mitigation). Improved dynamic overriding on Windows. |
| 820 | + | * 2019-10-07, `v1.1.0`: stable release 1.1. |
| 821 | + | * 2019-09-01, `v1.0.8`: pre-release 8: more robust Windows dynamic overriding, initial huge page support. |
| 822 | + | * 2019-08-10, `v1.0.6`: pre-release 6: various performance improvements. |
| 823 | + | |