
🇫🇷 French version available here

Associative arrays and their performance in Go

For Go developers, understanding the nuances of the language's built-in data structures is essential for writing efficient code. This article focuses on an analysis of maps in Go, a fundamental data structure that implements associative arrays.

This analysis covers two points:

  • Exploring the duality between maps and lists in terms of performance.
  • Discovering the internal mechanisms of maps.

The objective is to make informed choices when designing data structures, particularly for performance-sensitive applications.

Associative arrays

Very roughly speaking, a processor can only perform a few binary and arithmetic operations, conditional jumps, and memory manipulation. From these operations, it's easy to see how to build a list (just place fixed-size elements one after another in memory), but a key/value array, by contrast, is not at all trivial for a processor.

Paradoxically, most high-level languages (mainly scripting languages) expose key/value array structures directly in their native syntax.

Python:

fruits = {
    "apple": "red",
    "banana": "yellow",
    "kiwi": "green"
}

PHP:

$fruits = array(
    "apple" => "red",
    "banana" => "yellow",
    "kiwi" => "green"
);

JavaScript:

let fruits = {
    apple: "red",
    banana: "yellow",
    kiwi: "green"
};

Go:

fruits := map[string]string{
    "apple":  "red",
    "banana": "yellow",
    "kiwi":   "green",
}

These associative arrays are necessarily implemented by a relatively complex algorithm, and their use is not free: it impacts a program's performance.

An associative array is a data structure that allows storing key/value pairs and accessing them quickly. It uses a hash function to convert keys into array indices. The hash function is the heart of the associative array. It takes a key as input and returns an index in the array. The array size is known, so the result of the hash function will be a value between 0 and the array size - 1.

When inserting a key/value pair into the array, we compute the index with the hash function, then place the pair at the location corresponding to this index. Sometimes a collision occurs, meaning two different keys for which the hash function returns the same index. In that case, a list is used at that index to store the colliding pairs one after another.
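To make this concrete, here's a minimal sketch of a hash table with collision lists (chaining). The hash function and the fixed bucket count are deliberately simplistic; this illustrates the principle, not Go's actual implementation:

type pair struct {
    key, value string
}

type hashTable struct {
    buckets [][]pair // each bucket holds a collision list
}

// hash converts a key into a bucket index between 0 and len(buckets)-1.
func (t *hashTable) hash(key string) int {
    var h uint
    for _, c := range key {
        h = 31*h + uint(c) // toy hash function, not Go's algorithm
    }
    return int(h % uint(len(t.buckets)))
}

// insert computes the index and appends the pair to that bucket's list.
func (t *hashTable) insert(key, value string) {
    i := t.hash(key)
    t.buckets[i] = append(t.buckets[i], pair{key, value})
}

// lookup scans the collision list of the key's bucket:
// O(1) on average, O(n) if every key lands in the same bucket.
func (t *hashTable) lookup(key string) (string, bool) {
    for _, p := range t.buckets[t.hash(key)] {
        if p.key == key {
            return p.value, true
        }
    }
    return "", false
}

With t := &hashTable{buckets: make([][]pair, 8)}, inserting and looking up the fruits above behaves exactly as described.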

So in terms of performance, if we ignore the complexity of the hash function, we will have:

  • Insertion: O(1) on average, O(n) in the worst case
  • Search: O(1) on average, O(n) in the worst case
  • Deletion: O(1) on average, O(n) in the worst case

The performance of an associative array based on a hash table will therefore depend on these parameters:

  • The array size
  • The speed of the hashing algorithm
  • The probability of collisions given the keys (and therefore the length of the collision lists)

We can see, then, that what lies behind these seemingly harmless notations is far from simple.

Lower-level languages such as C, C++, Java, and Rust do not offer associative arrays in their native syntax. C requires implementing one suited to your needs or using an external library; C++, Java, and Rust provide implementations in their standard libraries.

Go sits between these two categories of languages. It's a compiled language operating at a fairly low level, yet it offers a native implementation of associative arrays through maps. I propose to study the cost of key hashing and determine when using a map is the right choice.

The case of Go

In Go, maps are implemented with a hash table, so we get a complexity of O(h) + O(1), where O(h) is the unknown complexity of the hashing algorithm and O(1) is the access cost once the hash is known (ignoring the worst case). The O(1) term can be neglected, so we focus on O(h). By comparison, finding an element among n elements in a list has a complexity of O(n).

Since O(h) is unknown, we might think that for small values of n, the complexity O(n) would give better response times than O(h).

Benchmark presentation

The benchmark creates a list and a map of n elements, then measures 10 million accesses with random keys. The measurement is repeated for each value of n between 1 and 40.

The first test, named Raw, generates fixed-size keys as four-character hexadecimal strings ranging from "0000" to "0028". The lookup keys are pseudo-random keys in the same "0000" to "0028" range.

The second test, named Long, generates distinct eight-character keys ranging from "00000000" to "00989680". The lookup keys are also pseudo-random keys in the same range.

Raw and Long are not very representative names; I had a different idea in mind when I started these tests.

The processor for this test is an Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz.

Note the use of Go build tags for selective compilation in the sources.
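The essence of the measurement looks roughly like this (a simplified sketch of the idea; the actual sources are in the archive at the end of the article):

package main

import (
    "fmt"
    "math/rand"
    "time"
)

type kv struct{ key, value string }

func main() {
    const accesses = 10_000_000

    for n := 1; n <= 40; n++ {
        list := make([]kv, n)
        m := make(map[string]string, n)
        keys := make([]string, n)
        for i := 0; i < n; i++ {
            k := fmt.Sprintf("%04x", i) // four-character hexadecimal keys, as in the Raw test
            list[i] = kv{k, "v"}
            m[k] = "v"
            keys[i] = k
        }

        // Linear scan of the list: O(n).
        now := time.Now()
        for i := 0; i < accesses; i++ {
            k := keys[rand.Intn(n)]
            for _, e := range list {
                if e.key == k {
                    break
                }
            }
        }
        listTime := float64(time.Since(now)) / accesses

        // Map lookup: O(h).
        now = time.Now()
        for i := 0; i < accesses; i++ {
            _ = m[keys[rand.Intn(n)]]
        }
        mapTime := float64(time.Since(now)) / accesses

        fmt.Printf("n=%2d list=%6.2f ns/op map=%6.2f ns/op\n", n, listTime, mapTime)
    }
}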

Result presentation

First test:

First test graph

Second test:

Second test graph

The results show a behavior that doesn't match expectations at all. We expected a constant response time for maps (~25 ns/op here), but below 9 elements this doesn't hold: map response times stay below ~25 ns/op and are proportional to the response times of a list traversal. We can therefore assume that:

  • The hashing algorithm of complexity O(h) is evaluated at ~25ns/op
  • Below 9 elements, this algorithm is not used.

Maps in Go

Looking at the Go source code (the compiler, including the runtime and the standard library), I find four map implementations, each exposing a map access (lookup) function:

  • map_fast32_noswiss.go: mapaccess1_fast32(..., key uint32)
  • map_fast64_noswiss.go: mapaccess1_fast64(..., key uint64)
  • map_faststr_noswiss.go: mapaccess1_faststr(..., key string)
  • map_noswiss.go: mapaccess1(..., key unsafe.Pointer)

The files are suffixed noswiss because Go ships an experimental map implementation based on an algorithm called Swiss Tables. The corresponding *_swiss.go files do exist, but only the generic version is implemented, and you must opt in to this algorithm with the environment variable GOEXPERIMENT=swissmap. We ignore these implementations here.
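For reference, opting in when building looks like this:

GOEXPERIMENT=swissmap go build a.go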

Two of the implementations are optimized for keys in the form of 32-bit and 64-bit integers, one is optimized for string keys, and the last one is generic.

We will focus on the mapaccess1_faststr implementation found in src/runtime/map_faststr_noswiss.go, simply because decompiling the program shows that this is the function being used.
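A minimal program along these lines is enough to observe the call (my reconstruction; the article's a.go isn't shown in full):

package main

import "fmt"

func main() {
    m := map[string]int{"apple": 1} // string keys select the faststr variant
    fmt.Println(m["apple"])         // this access compiles to a runtime.mapaccess1_faststr call
}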

go build -gcflags -S a.go

...
0x004f (a.go:51)   LEAQ  type:map[string]int(SB), AX
0x0056 (a.go:51)   MOVQ  main..autotmp_16+72(SP), BX
0x005b (a.go:51)   PCDATA   $1, $2
0x005b (a.go:51)   NOP
0x0060 (a.go:51)   CALL  runtime.mapaccess1_faststr(SB)
...

We discover the real structure hidden behind a map declaration in the file src/runtime/map_noswiss.go. Here's a copy of the code:

type hmap struct {
    count     int // # live cells == size of map.  Must be first (used by len() builtin)
    flags     uint8
    B         uint8  // log_2 of # of buckets (can hold up to loadFactor * 2^B items)
    noverflow uint16 // approximate number of overflow buckets; see incrnoverflow for details
    hash0     uint32 // hash seed

    buckets    unsafe.Pointer // array of 2^B Buckets. may be nil if count==0.
    oldbuckets unsafe.Pointer // previous bucket array of half the size, non-nil only when growing
    nevacuate  uintptr        // progress counter for evacuation (buckets less than this have been evacuated)

    extra *mapextra // optional fields
}

This structure and its comments make it clear that the hashing machinery won't be trivial. Note the B member, whose comment indicates that the table indexed by the hash function's result has a variable size; each entry in this table is called a bucket, and a bucket can hold up to 8 key/value pairs. The buckets and oldbuckets members, with their comments, suggest a migration system between two tables of different sizes while the map grows.

Let's look at the source code of the lookup function, mapaccess1_faststr:

At line 131 we see the condition if h.B == 0. This condition triggers special treatment when B equals 0: the hash of the key is not computed at all, because there's only one bucket. That single bucket is selected directly, and all that remains is to scan it for the requested value.

This explains the behavior of the curve, where we can clearly see the map behaving exactly like a list below 9 values.

Beyond 8 values, 2 buckets are needed, so it's from 9 values onward that the function's response time becomes constant.
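Here's a toy illustration of that shortcut (my own sketch, not the runtime code): with a single bucket, the lookup degenerates into a hash-free linear scan.

type toyEntry struct {
    key   string
    value int
}

type toyMap struct {
    b       uint8        // log2 of the number of buckets, like hmap.B
    buckets [][]toyEntry // 2^b buckets
    hash    func(string) uint
}

func (m *toyMap) lookup(key string) (int, bool) {
    var bucket []toyEntry
    if m.b == 0 {
        bucket = m.buckets[0] // a single bucket: the hash is never computed
    } else {
        bucket = m.buckets[m.hash(key)&uint(len(m.buckets)-1)]
    }
    for _, e := range bucket { // scanning the bucket is a plain list traversal
        if e.key == key {
            return e.value, true
        }
    }
    return 0, false
}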

Let's look at the hash function. After some research, I found its implementation in assembly in the file src/runtime/asm_amd64.s. Here is the amd64 version, since that's the processor I'm using:

// func strhash(p unsafe.Pointer, h uintptr) uintptr
TEXT runtime·strhash<ABIInternal>(SB),NOSPLIT,$0-24
   // AX = ptr to string struct
   // BX = seed
   CMPB  runtime·useAeshash(SB), $0
   JEQ   noaes
   MOVQ  8(AX), CX   // length of string
   MOVQ  (AX), AX // string data
   JMP   aeshashbody<>(SB)
noaes:
   JMP   runtime·strhashFallback<ABIInternal>(SB)

// AX: data
// BX: hash seed
// CX: length
// At return: AX = return value
TEXT aeshashbody<>(SB),NOSPLIT,$0-0
   // Fill an SSE register with our seeds.
   MOVQ  BX, X0            // 64 bits of per-table hash seed
   PINSRW   $4, CX, X0        // 16 bits of length
   PSHUFHW $0, X0, X0         // repeat length 4 times total
   MOVO  X0, X1            // save unscrambled seed
   PXOR  runtime·aeskeysched(SB), X0   // xor in per-process seed
   AESENC   X0, X0            // scramble seed

   CMPQ  CX, $16
   JB aes0to15
   JE aes16
   CMPQ  CX, $32
   JBE   aes17to32
   CMPQ  CX, $64
   JBE   aes33to64
   CMPQ  CX, $128
   JBE   aes65to128
   JMP   aes129plus

aes0to15:
   TESTQ CX, CX
   JE aes0

   ADDQ  $16, AX
   TESTW $0xff0, AX
   JE endofpage

   // 16 bytes loaded at this address won't cross
   // a page boundary, so we can load it directly.
   MOVOU -16(AX), X1
   ADDQ  CX, CX
   MOVQ  $masks<>(SB), AX
   PAND  (AX)(CX*8), X1
final1:
   PXOR  X0, X1   // xor data with seed
   AESENC   X1, X1   // scramble combo 3 times
   AESENC   X1, X1
   AESENC   X1, X1
   MOVQ  X1, AX   // return X1
   RET

(...)

Note the use of two algorithms in this code, implemented by the strhashFallback and aeshashbody functions. aeshashbody uses the AESENC assembly instruction to compute the hash of a string. AES is a cryptographic function: curious, isn't it? This approach is very clever because it repurposes a cryptographic primitive to compute, in few CPU cycles, a hash that is well distributed. Indeed, a good encryption algorithm must produce maximum entropy, so the result shows few repetitions, good distribution, and therefore few collisions. Furthermore, and this is the most important part, it's a hardware instruction: the CPU itself applies the AES algorithm, making it difficult to go any faster.

I haven't found a reference in Intel's documentation, but several sources on the Internet agree that AESENC takes 4 cycles on recent architectures and 7 on older ones. So in 4 CPU cycles we can hash 128 bits, i.e. 16 bytes.

My CPU is an Intel(R) Core(TM) i7-1068NG7 running at 2.30 GHz. A cycle is one clock "tick". The clock runs at 2.3 GHz, i.e. 2.3 billion cycles per second, so one cycle lasts 1 s ÷ 2.3×10⁹ = 0.4348 ns. AESENC takes 4 cycles, i.e. 4 × 0.4348 ns = 1.7392 ns. A complete lookup, including the hash computation, averages 25 ns. Considering that AESENC is far from the only instruction involved in this computation, we're in the right order of magnitude. To convince ourselves: in 25 ns, with a cycle lasting 0.4348 ns, we can execute 57.5 cycles, or very roughly 57 basic assembly instructions.

The strhashFallback function is classic hashing code written in Go, available in src/runtime/hash64.go. As its name indicates, it's used when the AESENC instruction is not available on the CPU.
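These runtime functions can't be called directly from user code, but the standard hash/maphash package exposes what is, to my understanding, the same underlying runtime hash (the AES-based version when the CPU supports it). A quick way to experiment with it:

package main

import (
    "fmt"
    "hash/maphash"
)

func main() {
    seed := maphash.MakeSeed() // per-process random seed, analogous to hmap.hash0
    fmt.Printf("%#x\n", maphash.String(seed, "apple"))
    fmt.Printf("%#x\n", maphash.String(seed, "banana"))
}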

Development notes

The importance of map initialization

After analyzing how maps and buckets work, we understand the importance of specifying how many entries we plan to put in a map. In make(map[string]interface{}, 100), the 100 allows the map to be initialized with the right number of buckets, thus avoiding resizing.
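A quick sketch to observe the effect (my own example, not taken from the benchmark archive); the capacity hint avoids the bucket growth and migration work during filling:

package main

import (
    "fmt"
    "time"
)

func fill(m map[int]int, n int) {
    for i := 0; i < n; i++ {
        m[i] = i
    }
}

func main() {
    const n = 1_000_000

    now := time.Now()
    fill(make(map[int]int), n) // grows repeatedly: buckets are reallocated and migrated
    fmt.Println("no hint:  ", time.Since(now))

    now = time.Now()
    fill(make(map[int]int, n), n) // the right number of buckets from the start
    fmt.Println("with hint:", time.Since(now))
}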

Furthermore, a map initialized to a size close to the expected count becomes faster to access than searching a list from the second value onward. I won't dig into the reason, but here's a lead: the hashing algorithm followed by one comparison would be very slightly faster than two comparisons; since each bucket then contains only one piece of data (for fewer than 8 values), there's only one comparison to make after computing the hash.

We can also note that access time increases here, while it's roughly constant in the other implementations. This is probably due to buckets that don't fill up before the end of the bench, since they're sized for 40 values.

Fixed initialization graph

Limiting map access and choosing types

Even if accessing a map element is fast, it's far slower than accessing a variable. To illustrate this, here are three ways of using a map. Consider s, a structure definition containing 3 members. The benchmark updates the structure carried by the map 1,000,000 times.

  • map[string]s, 46.25 ns/op (38% slower): requires a lookup to retrieve the value, an update, then a map assignment to store it back.
  • map[string]*s, 50.81 ns/op (52% slower): performs a lookup for each member update.
  • map[string]*s, 33.35 ns/op: performs a single lookup to retrieve the pointer, then updates through it.

package main

import "fmt"
import "math/rand"
import "time"

type s struct {
    a int
    b int
    c int
}

func main() {
    var a map[string]s
    var b map[string]*s
    var c map[string]*s
    var i int
    var now time.Time
    var sa s
    var sb *s
    var keys []string
    var k string

    a = make(map[string]s, 100)
    b = make(map[string]*s, 100)
    c = make(map[string]*s, 100)
    keys = make([]string, 100)

    for i = 0; i < 100; i++ {
        k = fmt.Sprintf("%04x", i)
        a[k] = s{}
        b[k] = &s{}
        c[k] = &s{}
        keys[i] = k
    }

    // Method 1: copy the value out of the map, modify it, write it back (two map accesses).
    now = time.Now()
    for i = 0; i < 1000000; i++ {
        k = keys[rand.Intn(len(keys))]
        sa = a[k]
        sa.a++
        sa.b++
        sa.c++
        a[k] = sa
    }
    fmt.Printf("no-ptr: %.2f ns/op\n", float64(time.Since(now)) / 1000000)

    // Method 2: update each member through the map (three map accesses).
    now = time.Now()
    for i = 0; i < 1000000; i++ {
        k = keys[rand.Intn(len(keys))]
        c[k].a++
        c[k].b++
        c[k].c++
    }
    fmt.Printf("no-lookup: %.2f ns/op\n", float64(time.Since(now)) / 1000000)

    // Method 3: fetch the pointer once, then update through it (one map access).
    now = time.Now()
    for i = 0; i < 1000000; i++ {
        k = keys[rand.Intn(len(keys))]
        sb = b[k]
        sb.a++
        sb.b++
        sb.c++
    }
    fmt.Printf("ptr: %.2f ns/op\n", float64(time.Since(now)) / 1000000)
}

We can note that:

  • Method 2's notation is the most compact and elegant, but it's also the slowest because it requires three map accesses.
  • Method 1 uses a map of structure values. The structure cannot be modified in place inside the map; it must be retrieved, modified, and reassigned, so this method is slow because it requires two map accesses.
  • The last method retrieves a pointer and works through it. It makes only one map access, so it's the fastest.

To manipulate complex structures with maps, it's preferable to use pointers and make as few map accesses as possible.

Limitations

All this is only true for map[string]interface{}; other key types haven't been tested. However, reading the code suggests that 32-bit and 64-bit integer keys are well optimized.

We haven't studied map creation time, nor re-indexing time when a map changes size. These costs can be non-negligible when maps are created at high frequency or insertions are frequent.

Map usage

The in-depth analysis of map[string]interface{} in Go reveals several interesting points:

  1. Hash function optimization: using a hashing algorithm based on the AESENC CPU instruction for string keys is particularly ingenious, as it extracts the best performance the CPU can offer.

  2. Size adaptation: Go's map implementation adapts to the data size. For small collections (fewer than 9 elements), the map behaves like a simple list, thus avoiding the hashing overhead for small data sets. Furthermore, maps remain memory-efficient thanks to their variable-size bucket array.

  3. Constant performance: Beyond 8 elements, maps offer quasi-constant access time, making them particularly efficient for large data collections.

While maps are generally an excellent choice for associative arrays in Go, it's always important to consider the size and nature of the manipulated data to choose the most appropriate structure.

If the data set to be indexed holds at most 8 entries, it's preferable to keep it in a list rather than a map, because the time saved ranges from 5% to 50%. In other cases, the map performs better.

Percentages graph

Source files: map.tgz