🇫🇷 French version available here
I propose to talk about memory size optimization, fragmentation, mmap, system, C and data structures through the example of a library that implements the Aho-Corasick search algorithm.
For a security need, I must detect character strings in a text within a limited time, for example known attack patterns. The most basic algorithm is very simple: given a list of words to search for, I take the first word and compare it against each position in the text; when I reach the end of the text, I start again with the second word, and so on. Nothing simpler, but security researchers have asked me to find about 2,400 words in my text, and the search complexity takes a hit. If "n" is the number of words, "m" the average size of a word and "t" the size of the text, we get this complexity:
O\left(n \cdot m \cdot t\right)
For each character in the text, we must test each character of each word. This way of doing things is not compatible with the time constraints imposed to do a search. Furthermore, this is not satisfactory from an algorithmic point of view.
The solution is well known, you must use the Aho-Corasick algorithm which allows you to have a complexity in O(t). Aho-Corasick allows you to build a graph in which you move as you analyze the characters of the text in which you are searching. I don't want to go into the details of the algorithm, here's a good description: Aho-Corasick Algorithm
I take the opportunity to play with ChatGPT :-) which approximately creates the C code that implements the algorithm in a basic way. After fixing bugs and adapting the interface, I get satisfactory behavior.
The source code is versioned on github: thierry-f-78/aho-corasick
I mainly test on my Mac ("2.3 GHz Quad-Core Intel Core i7"). This is a 64-bit processor, so all pointers are 8 bytes in size. The test procedure consists of these steps:
In parallel, I measure the RSS memory consumed which is the memory actually consumed by the process.
After starting the program and before loading the data, the RSS is 552KB. I don't try to reduce this size.
The reference word list contains 2358 words with an average size of 22 bytes. The test text is 74 bytes and two words match. Once the tree is generated, the word list is represented by a tree of 23,701 nodes.
The basic algorithm has no optimization in terms of memory consumption and is reasonably fast. It consumes 60MB of RSS memory and does 4.2 million matches per second. The actual size of the useful data that makes up the tree is 46.6MB.
A priori, using 60MB of memory to store a tree of 2358 words is excessive. The complete dataset has 23,701 nodes, this makes an approximate average overhead of 2650 bytes per node. This is a high value, but it's approximately what is expected. Here is the declaration of a structure that represents a node:
struct ac_node {
    struct ac_node *children[256]; /* one slot per possible byte value */
    struct ac_node *fail;          /* Aho-Corasick failure link */
    char *word;                    /* matched word, or NULL if not final */
};
By measuring the data size, we easily justify the 46.6MB of memory. On the other hand, we are far from the 60MB of RSS measured: 13.4MB remain unaccounted for. This is probably due to memory fragmentation, caused by the size of the pages allocated by the kernel relative to the size of a node, as well as by whether those pages can be contiguous.
The memory allocation system in C is based on distributing small blocks of memory taken from large blocks delivered to the program by the OS. When the program wants to store 5 bytes, it asks the allocator for some memory, which in turn asks the system for it. The system only provides complete memory pages. These pages are generally 4KB, and they may or may not be contiguous.
The memory allocator then distributes the memory pages allocated by the kernel to the various functions that need them. So that the memory allocation system (malloc/calloc/realloc/free) can manage the allocated sub-blocks and the available sub-blocks, it also uses part of this allocated memory to store its own data. This data is not directly useful to the program, but is essential for the allocator.
The kernel therefore provides large contiguous or non-contiguous memory blocks, and the software's memory allocation library (malloc/calloc/realloc/free) manages their granular subdivision for final use.
Let's assume a kernel block of 10 bytes and neglect the memory used by the allocator to structure itself. The diagrams below represent the allocated bytes in blue, the free bytes in white. The address of the memory space starts at 0 for better readability.
The software needs 4 bytes, it takes the first 4.
It then needs 2 bytes, it takes the second ones.
It no longer needs the first 4 bytes, it returns them. There are therefore 8 bytes available.
The software needs 5 bytes, there are indeed 5 bytes available, but not contiguous, malloc must therefore request a new block from the system. In the first case, the process memory space allows the addition of a contiguous block. Only bytes 0 to 3 cannot be used and form a hole.
In the second case, the process memory space does not allow it and the new block is not contiguous. Bytes 0 to 3 and 6 to 8 cannot be used and form two holes:
This is where fragmentation appears, it manifests itself through the 4-byte holes. Two in the first case, and only one in the second case. Memory defragmentation is not a simple thing in C or in any language based on memory allocation. Also, although it is not easy, it can be interesting not to create holes, or to limit their appearance.
The holes created will still be used when the program needs a memory block of a size corresponding to the contiguous size available.
In the case of the library, the size of a node is 2064 bytes. The size of a memory page is 4096 bytes. We can only put one node per 4kb page. Each time the kernel cannot provide contiguous pages in memory, we create a hole of 4096 - 2064 bytes, or 2032 bytes. If we look at how many holes of 2032 bytes are needed to waste 13.4MB of memory, we get 6914. Compared to the 23701 nodes that make up the tree, this represents 29% of allocations that would not be contiguous. This is a ratio that is far too high. The real reason must be a mix between fragmentation and the allocator's control structures.
The idea is to reduce the size of the necessary memory. We can note an array of pointers 256 values that serves as an index to quickly indicate which child node to go to depending on the character entered. Each character is a byte, or 256 switching possibilities. We will call this array the children index or the index. It is a matter of seeing if we can reduce the size of this index.
It may be interesting to see how many children each node has and make a statistic of it. This will give an order of magnitude on the use of the index. For this, you have to create the tree and code a recursive algorithm that traverses the tree and updates counters. We add the number of valid links that each node has to a global counter that we divide at the end by the total number of nodes (in short, we calculate the average number of valid children per node). The result is:
We conclude that there are on average 255 extra values (out of the 256 possible and planned values) to describe the children index of each node. Also theoretically, the structure of a node could on average be limited to a size of 3 pointers of 8 bytes (one child link, one fail link and the searched word), which gives a theoretical memory size used of:
\left( 3 \cdot 8 + 22 \right) \cdot 23701 = 1.1MB
To reduce the size of the children index, we can use a bounded index. In the example below, the first 4 and the last 4 pointers are NULL, so they are not useful for the children index. In other words, we can compress the index by indicating the number of NULL elements at the beginning of the index and the number of NULL elements at the end of the index.
It is therefore a matter of modifying the structure of a node in order to store the first and last significant pointer. The index having only 256 possible values, from 0 to 255, one byte per bound is enough to store the information. These bytes are named "first" and "last". The storage structure of a node becomes:
struct ac_node {
    unsigned char first;       /* first byte value present in the index */
    unsigned char last;        /* last byte value present in the index */
    struct ac_node **children; /* bounded index of last - first + 1 slots */
    struct ac_node *fail;
    char *word;
};
The result is satisfactory. The RSS memory has dropped to 1.9MB, however the execution speed has dropped to 3.7 million matches per second.
The execution speed has decreased because access to the index is no longer immediate. In the previous version, we could search directly for a byte in the index, in this new version, we must first check that the byte is contained in the index, then apply an offset and choose the right index in the tree. These 3 operations (two comparisons and one subtraction) justify the drop in performance.
The test program indicates a useful memory consumption of 1.01MB, but the RSS measures 1.9MB, almost double. We can no longer neglect the memory needed by the process to function, namely the 552KB of RSS observed at program startup. We therefore have roughly 1.4MB used instead of the expected 1MB. The excess space seems to be occupied by memory fragmentation due to the allocator, as well as by the node layout, which is not as ideal as the one computed above. Indeed, the index is now designated by a pointer that uses 8 bytes, and the index bounds are stored as two 8-bit integers, which take 2 bytes.
Here is the calculation of the size of a node: 2 + 8 + 8 + 8 bytes for the node, plus 22 for the text and 8 for the index, or 56 bytes.
We multiply by the number of nodes (23,701), and this gives 1.26MB, which approaches the measured 1.4MB.
Looking at the structure that carries each node, we can wonder whether the pointer to the word serves any purpose. In practice, it is used to know whether we have found a word, by checking that this pointer is not NULL; a single bit could replace that. Its other utility is to return the corresponding word to the caller.
Since the word corresponds to a portion of the text in which we are searching, it is not useful to store the found word, only the position in the text as well as the number of corresponding characters is sufficient to reconstruct the information.
Let's try to remove the "char *word" pointer. We simply replace it with an integer that contains the number of matching characters, or 0 if the node is not final. A 16-bit integer is sufficient: on the one hand it meets my needs, since it allows words up to 64KB long, and on the other hand it only takes two bytes in memory. We therefore save 6 bytes per node.
struct ac_node {
    unsigned char first;
    unsigned char last;
    struct ac_node **children;
    struct ac_node *fail;
    unsigned short match; /* length of the matched word, 0 if not final */
};
After compiling and testing the program, we see that the amount of memory drops considerably, but not as much as hoped. Here is therefore an opportunity to talk about memory alignment.
An x86 or x86_64 processor is much more efficient when it accesses aligned memory. Aligned means that the data sits at an address that is a multiple of its own size; thus a 64-bit pointer will be positioned at a memory address that is a multiple of 8 bytes. Without this, some instructions simply don't work and others are emulated by the generated assembly code. The compiler therefore aligns each member of a structure on a multiple of its type's alignment, and pads the structure to a multiple of the largest alignment it contains, here 8 bytes. This layout is not optimal, because shorter types could be packed more tightly.
So by applying the concept of alignment, the structure below which seems to contain only 20 bytes will actually contain 32, of which 12 are not used.
struct ac_node {
    unsigned char first;       /* offset 0 */
    unsigned char last;        /* offset 1, then 6 bytes of padding */
    struct ac_node **children; /* offset 8 */
    struct ac_node *fail;      /* offset 16 */
    unsigned short match;      /* offset 24, then 6 bytes of trailing padding */
};
To minimize these holes, we should order the members of a structure according to the alignment of their type, so that there are no gaps between two successive members. By declaring the structure as below, most of the holes are filled. Instead of being 32 bytes, it is 24. There are 4 bytes left free between "match" and "children".
struct ac_node {
    unsigned char first;       /* offset 0 */
    unsigned char last;        /* offset 1 */
    unsigned short match;      /* offsets 2-3, then 4 bytes of padding */
    struct ac_node **children; /* offset 8 */
    struct ac_node *fail;      /* offset 16, total size 24 */
};
We note that there remains a hole of 4 bytes due to the 64-bit alignment of the "children" pointer. We can force the compiler to compact the structure to the strict minimum by using the __attribute__((packed)) directive at the end of the structure declaration. In this specific case, the 64-bit pointers will no longer be aligned on multiples of their size but packed one against the other.
When a structure is not aligned, the program still works, but when the processor wants to retrieve unaligned data, it will often have to make two memory accesses to read it. Applying __attribute__((packed)) to structures can therefore create big performance problems.
struct ac_node {
    unsigned char first;
    unsigned char last;
    unsigned short match;
    struct ac_node **children;
    struct ac_node *fail;
} __attribute__((packed));
The measurements show that the amount of memory used does not decrease. Indeed, the allocator returns blocks aligned on 64-bit multiples; the structures no longer contain internal holes, but the holes are now created between them.
In the example above, if the first structure is allocated at memory address 0 and ends at 19, the second should be allocated at address 20, but 20 is not a multiple of 8, so the allocator will position the requested memory block at address 24, which leaves a gap between 20 and 23. Let's keep this optimization for later.
This modification (without __attribute__((packed))) reduces the amount of memory used by 100KB, but it also impacts performance. The memory occupied is now 1.8MB and the execution speed is 3.6 million matches per second. Performance is slightly down because for each match you have to compute the returned pointer by doing a subtraction, and you also have to push two values onto the stack instead of just one.
The current node structure points to an index. Since each node has an index, the index is necessarily present, this pointer cannot be NULL. The pointer to the index doesn't provide anything from a functional point of view, it's just convenient to be able to grow and relocate the index as the tree is created. If we can position the index at a relatively fixed location in memory relative to the structure address, the pointer would no longer be useful.
The most obvious fixed place relative to the structure is the end of the structure. This way the index will always be at the structure address to which we add the structure size. We save 8 bytes per node structure. Here's how the index works and is positioned:
And here's how it works and is positioned after this optimization. Note the disappearance of bytes 4 to 11 which are no longer useful.
The new structure is declared as below. The array of size [0] is not counted in the structure size (sizeof); it refers to the memory located immediately after the structure and indicates the nature of the index. It is up to the developer to request the right allocation size for the structure based on what they want to do.
struct ac_node {
    short match;
    unsigned char first;
    unsigned char last;
    struct ac_node *fail;
    struct ac_node *children[0]; /* index stored right after the structure */
};
The difficulty of this evolution is that each time the index changes size via a realloc(), the structure can change address. So, you have to traverse the entire tree to correct the pointers to this structure. This implies:
That said, since statistically the number of children is 1, there should be little memory reallocation.
After implementing this evolution, we save 120KB of RAM. The RSS is now 1.68MB and we observe better performance on match searching with 3.7 million matches per second. This is explained because the index is closer in memory to the structure and can therefore be found in the CPU's memory cache.
This time, we're going to partially remove memory fragmentation by making the structures necessary for tree creation contiguous. The algorithm's operation consists of creating a tree at process startup, then using it. So when the tree is created the memory structures will no longer be modified. Also, using malloc() for each node, which provides flexibility to easily delete or add a node, is not justified beyond the construction phase.
The example below shows how realloc() works, to illustrate the creation of a memory hole due to fragmentation. It also shows the memory used by the allocator itself (noted as "metadata"). This is just an illustration; real metadata handling is more complex than that.
To remove a large part of these elements, we can allocate one large memory block in which the Aho-Corasick tree creation program makes its own subdivisions without using the allocator. Note that free() no longer needs to be used. However, each time an index grows, all the tree data after it must be shifted. This evolution therefore requires memory-moving work and offset calculations.
After this modification, memory consumption decreases by 180KB for a total of 1.50MB RSS, and execution speed is slightly down at 3.69 million matches per second. I can't explain this speed decrease; since it's minimal, I won't delve deeper into the question. With this operating mode, word indexing is much slower. Although I haven't measured it, it seems to be about twice as slow.
There's still one dynamic allocation left: that of the large memory block, which is often grown via realloc(). Each time it grows, it can potentially create fragmentation that will increase the process's memory consumption. The ideal would be to allocate a single block of the final size, but it's not possible to compute this size without building the tree, and therefore using memory for that.
When a process requests memory from the kernel, the heap increase is definitive. Even when memory is freed with free(), it is not actually returned to the kernel so that it can be used by another process.
Linux offers a type of memory that can actually be returned to the kernel: areas allocated with mmap(). mmap() is designed to expose file content as a memory block; used with an anonymous mapping (no backing file), it amounts to allocating a memory block. When this memory area is no longer used, it is actually freed. However, such an area does not grow easily. More precisely, under Linux you can grow it with mremap(), but apparently not under macOS, and I use macOS for my tests.
Now that we know how to allocate a large memory block and work inside it, it's easy to make the tree creation algorithm work with an mmap() memory provider rather than a malloc() one. To avoid reallocating memory too often, we allocate 512KB blocks: each time the memory needed for the tree exceeds the current capacity, a new mmap() of a size that is a multiple of 512KB is made, the data is moved, and the old block is returned to the kernel with munmap().
At the end of tree creation, we know the total size of the structures, but since memory allocation via mmap() is done in blocks, there is excess memory. We therefore get rid of this excess by allocating a block of exactly the necessary size with malloc(), transferring the data, and freeing the original block with munmap().
This optimization further reduces memory consumption by 200KB to reach 1.30MB RSS. Execution speed has increased to 4 million matches per second.
If we take into account the 552KB RSS consumed idle by the process, we can deduce that the tree fits in 744KB, a measurement in the code gives 684,192 bytes. We are therefore close to the minimum required memory.
__attribute__((packed))
Now that all the memory is compact, we can test the node structure declaration with __attribute__((packed)). This should remove the remaining holes and further reduce the amount of memory used.
This optimization brings the RSS to 1152KB, saving 144KB. If we take into account the 552KB of RSS consumed idle by the process, we can deduce that the tree fits in 600KB; a measurement in the code gives a useful data size of 576KB. The gap is negligible, probably due to memory consumed during the second phase, the failure-link computation. The impact on match speed is really minimal: we go from 4.03 to 4.00 million matches per second.
However, exposing an API with __attribute__((packed)) constrains library usage to compilers that understand this directive, because it is not standard C.
| ID | Evolution | Size of data used (KB) | RSS size (KB) | Speed (M matches/s) |
|---|---|---|---|---|
| start | Startup measurement | 0 | 552 | |
| raw | First version | 47,760 | 60,220 | 4.28 |
| trim | Index compression by removing beginning and end | 1,038 | 1,904 | 3.75 |
| reduce1 | Removal of matching word | 853 | 1,804 | 3.60 |
| reduce2 | Removal of index pointer | 668 | 1,680 | 3.73 |
| nofrag | Fragmentation removal (malloc()/realloc() per node) | 668 | 1,496 | 3.69 |
| mmap | Last fragmentation removal (global realloc()) | 668 | 1,296 | 4.03 |
| packed | Compaction / 32-bit alignment of node structure | 576 | 1,152 | 4.00 |
The initial measurement for the "raw" evolution doesn't appear on the chart because it is so much higher that it would hide the memory-consumption evolution of the other improvements. In practice, it's the only really important improvement.
We can note that the "nofrag" and "mmap" evolutions don't reduce the memory used by the data, but only the RSS. Indeed these evolutions improve memory allocation by reducing fragmentation. The measurement of used memory doesn't include fragmentation, so it doesn't change.
Execution speed is difficult to explain because it depends on factors that are difficult to measure, notably the content of the CPU cache. We can note that execution speed varies little as code improvements progress. The worst variation reaches a 15.8% performance decrease. Although this is significant, the absolute performance remains acceptable.
Optimizing memory consumption is laborious work, but necessary in many cases. Depending on the constraints, we can't always afford to waste RAM just to spare developer effort. Typically, the embedded domain often imposes very limited amounts of memory. Furthermore, at a time when ecology is a major societal issue, can we still accept wasting resources?