🇫🇷 French version available here
For Go developers, understanding the nuances of the language's built-in data structures is essential for writing efficient code. This article focuses on an analysis of maps in Go, a fundamental data structure that implements associative arrays.
This analysis covers two points: the cost of the hash function that backs Go's map implementation, and the relevance of using a map rather than a list depending on the number of elements.
The objective is to make informed choices when designing data structures, particularly for performance-sensitive applications.
Very roughly speaking, a processor can only perform a few binary and arithmetic operations, conditional jumps, and memory manipulation. With these operations, it is easy to see how to build a list (just place fixed-size elements one after another in memory), but a key/value array, on the other hand, is not something trivial for a processor.
Paradoxically, most high-level languages (mainly scripting languages) offer key/value array structures directly in the language's native syntax.
fruits = {
"apple": "red",
"banana": "yellow",
"kiwi": "green"
}
$fruits = array(
"apple" => "red",
"banana" => "yellow",
"kiwi" => "green"
);
let fruits = {
apple: "red",
banana: "yellow",
kiwi: "green"
};
fruits := map[string]string{
"apple": "red",
"banana": "yellow",
"kiwi": "green",
}
These associative arrays are necessarily backed by a relatively complex algorithm, and using them is not free: it affects a program's performance.
An associative array is a data structure that stores key/value pairs and provides fast access to them. It uses a hash function to convert keys into array indices. The hash function is the heart of the associative array: it takes a key as input and returns an index into the array. The array size is known, so the result of the hash function is a value between 0 and the array size - 1.
When inserting a key/value pair into the array, we compute the index with the hash function, then place the pair at the location corresponding to this index. Sometimes a collision occurs: two different keys for which the hash function returns the same index. In that case, the colliding pairs are chained in a list, one after another, at that index.
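To make this concrete, here is a minimal sketch of an associative array with a deliberately naive hash function and collision handling by chaining. The names and the table size are illustrative; this is not Go's implementation.

package main

import "fmt"

const tableSize = 8 // deliberately small fixed size for the example

// entry is a key/value pair; next chains colliding entries.
type entry struct {
	key, value string
	next       *entry
}

type toyMap struct {
	buckets [tableSize]*entry
}

// hash converts a key into an index between 0 and tableSize-1.
func hash(key string) int {
	var h uint
	for i := 0; i < len(key); i++ {
		h = h*31 + uint(key[i])
	}
	return int(h % tableSize)
}

// set inserts or updates a pair; colliding pairs are chained.
func (m *toyMap) set(key, value string) {
	i := hash(key)
	for e := m.buckets[i]; e != nil; e = e.next {
		if e.key == key {
			e.value = value
			return
		}
	}
	m.buckets[i] = &entry{key: key, value: value, next: m.buckets[i]}
}

// get walks the chain at the key's index until it finds the key.
func (m *toyMap) get(key string) (string, bool) {
	for e := m.buckets[hash(key)]; e != nil; e = e.next {
		if e.key == key {
			return e.value, true
		}
	}
	return "", false
}

func main() {
	var m toyMap
	m.set("apple", "red")
	m.set("banana", "yellow")
	m.set("kiwi", "green")
	fmt.Println(m.get("kiwi")) // green true
}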
So in terms of performance, if we ignore the complexity of the hash function, we will have: constant access time when the computed index holds a single pair, and a short list traversal when several keys collide at the same index.
The performance of an associative array based on a hash table will therefore depend on these parameters: the cost of the hash function, the quality of its distribution, and the fill ratio of the array, which together determine the number of collisions.
We therefore understand that what lies behind these seemingly innocuous notations is far from simple.
Languages closer to the machine, such as C, C++, Java, and Rust, do not offer associative arrays natively in the language's syntax. In C, you must implement one suited to your needs or use an external library; C++, Java, and Rust provide implementations in their standard libraries.
Go sits between these two categories: it is a compiled, fairly low-level language, yet it offers a native implementation of associative arrays through maps. I propose to study the impact of the hash key and to determine when using a map is relevant.
In Go, maps are implemented with a hash table, so lookup complexity is O(h) + O(1), where O(h) is the unknown complexity of the hashing algorithm and O(1) is the access complexity once the hash is known (ignoring the worst case). O(1) can be neglected, so we focus on O(h). By comparison, searching for an element among n elements in a list has a complexity of O(n).
Since O(h) is unknown, we might think that for small values of n, the complexity O(n) would give better response times than O(h).
The protocol consists of creating a list and a map of n elements, then measuring 10 million accesses with random keys. The measurement is repeated for each value of n, with n between 1 and 40.
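Here is a minimal sketch of this protocol. The key format and variable names are mine; the actual test program, provided in the source files at the end, is more complete.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

const accesses = 10_000_000

func main() {
	for n := 1; n <= 40; n++ {
		keys := make([]string, n)
		m := make(map[string]int, n)
		for i := 0; i < n; i++ {
			keys[i] = fmt.Sprintf("%04x", i)
			m[keys[i]] = i
		}

		// Map access with pseudo-random keys.
		start := time.Now()
		for i := 0; i < accesses; i++ {
			_ = m[keys[rand.Intn(n)]]
		}
		mapTime := time.Since(start)

		// Linear search in a list with the same keys.
		start = time.Now()
		for i := 0; i < accesses; i++ {
			k := keys[rand.Intn(n)]
			for j := range keys {
				if keys[j] == k {
					break
				}
			}
		}
		listTime := time.Since(start)

		fmt.Printf("n=%2d map=%.2fns/op list=%.2fns/op\n",
			n, float64(mapTime)/accesses, float64(listTime)/accesses)
	}
}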
The first test, named Raw, generates fixed-size keys in the form of hexadecimal strings ranging from "0000" to "0028". The test keys are pseudo-random keys in the same range, from "0000" to "0028".
The second test, named Long, generates distinct keys ranging from "00000000" to "00989680". The test keys are also pseudo-random keys ranging from "00000000" to "00989680".
Raw and Long are not very representative names; I had another idea in mind when I started these tests.
The processor used for this test is an Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz.
We can note the use of Go build tags for selective compilation.
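For example, a file can be reserved for one variant of the test with a build constraint at its top; the tag name raw below is illustrative, not necessarily the one used in the source files.

//go:build raw

package main

import "fmt"

// genKeys builds the fixed-size hexadecimal keys of the Raw test.
// This file is only compiled when building with: go build -tags raw
func genKeys(n int) []string {
	keys := make([]string, n)
	for i := 0; i < n; i++ {
		keys[i] = fmt.Sprintf("%04x", i)
	}
	return keys
}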
First test:
Second test:
The results show a behavior that does not match what is expected at all. We expect a constant response time for maps (here ~25ns/op), but below 9 elements this is not true: map response times are below ~25ns/op and proportional to the response times of a list traversal. We therefore assume that small maps receive special treatment which avoids the cost of hashing.
Looking at the Go source code (compiler, including runtime and standard lib), I find four map implementations, each presenting a map access function (lookup):
map_fast32_noswiss.go: mapaccess1_fast32(..., key uint32)
map_fast64_noswiss.go: mapaccess1_fast64(..., key uint64)
map_faststr_noswiss.go: mapaccess1_faststr(..., key string)
map_noswiss.go: mapaccess1(..., key unsafe.Pointer)
The files are suffixed noswiss because Go has an experimental map implementation based on an algorithm called Swiss Tables. We therefore also find *_swiss.go files, but only the generic one is implemented, and you must opt in to this algorithm with the environment variable GOEXPERIMENT=swissmap (for example GOEXPERIMENT=swissmap go build a.go). We ignore these implementations here.
Two of the implementations are optimized for keys in the form of 32 and 64-bit integers, one is optimized for keys in the form of character strings, and the last is generalist.
We will focus on the mapaccess1_faststr implementation, found in the file src/runtime/map_faststr_noswiss.go, simply because disassembling the program shows that this is the function being called:
go build -gcflags -S a.go
...
0x004f (a.go:51) LEAQ type:map[string]int(SB), AX
0x0056 (a.go:51) MOVQ main..autotmp_16+72(SP), BX
0x005b (a.go:51) PCDATA $1, $2
0x005b (a.go:51) NOP
0x0060 (a.go:51) CALL runtime.mapaccess1_faststr(SB)
...
We discover the real structure hidden behind a map declaration in the file src/runtime/map_noswiss.go. Here is a copy of the code:
type hmap struct {
	count     int // # live cells == size of map. Must be first (used by len() builtin)
	flags     uint8
	B         uint8  // log_2 of # of buckets (can hold up to loadFactor * 2^B items)
	noverflow uint16 // approximate number of overflow buckets; see incrnoverflow for details
	hash0     uint32 // hash seed

	buckets    unsafe.Pointer // array of 2^B Buckets. may be nil if count==0.
	oldbuckets unsafe.Pointer // previous bucket array of half the size, non-nil only when growing
	nevacuate  uintptr        // progress counter for evacuation (buckets less than this have been evacuated)

	extra *mapextra // optional fields
}
We understand from this structure and its comments that the hashing will not really be trivial. Note the B member, whose comment indicates that the table indexed by the results of the hash function has a variable size. Each entry in this table is called a bucket. Note also the buckets and oldbuckets members, whose comments suggest a migration system between two tables of different sizes.
Let's look at the source code of the lookup function. At line 131 we find the condition if h.B == 0. This condition implies a special treatment when B equals 0: the hash computation is not performed because there is only one bucket. That single bucket is selected directly, and it only remains to traverse it to find the requested value.
This explains the behavior of the curve, where we can clearly see that the map behaves exactly like a list below 9 values.
Beyond 8 values, a single bucket (which holds at most 8 entries) is no longer sufficient and a second one is needed; it is therefore from 9 values onwards that the function's response time becomes constant.
Let's look at the hash function. After some research, I found its implementation in assembly in the file src/runtime/asm_amd64.s. Let's examine the amd64 assembly code, since that is my processor's architecture:
// func strhash(p unsafe.Pointer, h uintptr) uintptr
TEXT runtime·strhash<ABIInternal>(SB),NOSPLIT,$0-24
	// AX = ptr to string struct
	// BX = seed
	CMPB	runtime·useAeshash(SB), $0
	JEQ	noaes
	MOVQ	8(AX), CX	// length of string
	MOVQ	(AX), AX	// string data
	JMP	aeshashbody<>(SB)
noaes:
	JMP	runtime·strhashFallback<ABIInternal>(SB)

// AX: data
// BX: hash seed
// CX: length
// At return: AX = return value
TEXT aeshashbody<>(SB),NOSPLIT,$0-0
	// Fill an SSE register with our seeds.
	MOVQ	BX, X0			// 64 bits of per-table hash seed
	PINSRW	$4, CX, X0		// 16 bits of length
	PSHUFHW	$0, X0, X0		// repeat length 4 times total
	MOVO	X0, X1			// save unscrambled seed
	PXOR	runtime·aeskeysched(SB), X0	// xor in per-process seed
	AESENC	X0, X0			// scramble seed

	CMPQ	CX, $16
	JB	aes0to15
	JE	aes16
	CMPQ	CX, $32
	JBE	aes17to32
	CMPQ	CX, $64
	JBE	aes33to64
	CMPQ	CX, $128
	JBE	aes65to128
	JMP	aes129plus

aes0to15:
	TESTQ	CX, CX
	JE	aes0

	ADDQ	$16, AX
	TESTW	$0xff0, AX
	JE	endofpage

	// 16 bytes loaded at this address won't cross
	// a page boundary, so we can load it directly.
	MOVOU	-16(AX), X1
	ADDQ	CX, CX
	MOVQ	$masks<>(SB), AX
	PAND	(AX)(CX*8), X1
final1:
	PXOR	X0, X1	// xor data with seed
	AESENC	X1, X1	// scramble combo 3 times
	AESENC	X1, X1
	AESENC	X1, X1
	MOVQ	X1, AX	// return X1
	RET

(...)
Note the use of two algorithms in this code, implemented by the strhashFallback and aeshashbody functions. aeshashbody uses the AESENC assembly instruction to compute the hash of a string. AES is a cryptographic function; curious, isn't it? This approach is very clever because it repurposes a cryptographic primitive to compute, in few CPU cycles, a hash that is fairly well distributed. Indeed, a good encryption algorithm must produce maximum entropy, so we get few repetitions, a good distribution of the result, and few collisions. Furthermore, and this is the most important part, it is a hardware instruction: the CPU itself applies the AES round, making it difficult to be any faster.
I haven't found a reference in Intel documentation, but several sources on the Internet seem to agree that AESENC takes 4 cycles on recent architectures and 7 on older ones. So, in 4 CPU cycles we can hash 128 bits, or 16 bytes.
My CPU is an Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz. A cycle is a clock "tick". The clock runs at 2.3 GHz, i.e. 2.3 billion cycles per second, so one cycle lasts 1s ÷ 2.30GHz = 0.4348ns. AESENC takes 4 cycles, i.e. 4 × 0.4348ns = 1.7392ns. A complete lookup, which includes a hash computation, averages 25ns. Considering that AESENC is far from the only instruction involved in this computation, we are in the right order of magnitude. To convince ourselves: in 25ns, with a cycle lasting 0.4348ns, we can execute 57.5 cycles, i.e. very roughly 57 basic assembly instructions.
The strhashFallback function is classic hashing code written in Go, found in src/runtime/hash64.go. As its name suggests, it is used when the AESENC instruction is not available on the CPU.
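Incidentally, the standard library exposes hashing based on the same runtime machinery through the hash/maphash package, which makes it easy to observe the hash values (they vary between processes because of the random seed):

package main

import (
	"fmt"
	"hash/maphash"
)

func main() {
	// The seed is random per process, like a map's hash0 member.
	seed := maphash.MakeSeed()

	// Two distinct strings hash to well-distributed 64-bit values.
	fmt.Printf("%#x\n", maphash.String(seed, "apple"))
	fmt.Printf("%#x\n", maphash.String(seed, "banana"))
}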
After analyzing how maps and buckets work, we understand the importance of specifying the number of entries we are going to put in a map: make(map[string]interface{}, 100), where 100 initializes the map with the right number of buckets and thus avoids resizing.
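To see the effect, here is a minimal sketch comparing an unsized map with a presized one. Absolute figures will vary with the machine; integer keys are used here only to keep the example short.

package main

import (
	"fmt"
	"time"
)

const n = 1_000_000

// fill inserts n entries into m.
func fill(m map[int]int) {
	for i := 0; i < n; i++ {
		m[i] = i
	}
}

func main() {
	start := time.Now()
	fill(make(map[int]int)) // starts with one bucket, grows repeatedly
	fmt.Println("no hint:  ", time.Since(start))

	start = time.Now()
	fill(make(map[int]int, n)) // buckets allocated once, no resizing
	fmt.Println("with hint:", time.Since(start))
}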
Furthermore, a map initialized to a size close to the expected number of entries is faster to access than a list search from the second value onwards. I won't investigate the reason, but here is a lead: the hashing algorithm followed by a single comparison would be very slightly faster than two comparisons; since each bucket then contains only one entry (for fewer than 8 values), there is only one comparison to make after computing the hash.
We can also note the increasing access time, whereas it is roughly constant in the other implementations. This is probably due to buckets that are not filled before the end of the benchmark, since they are sized for 40 values.
Even if accessing a map element is fast, it is still far slower than accessing a variable. To illustrate this, here are three ways of using a map. Consider that s is a structure definition containing 3 members. The benchmark updates the structure carried by the map 1,000,000 times.
map[string]s, 46.25 ns/op (38% slower): requires a lookup to retrieve the value, an update, then a map write to store it back.
map[string]*s, 50.81 ns/op (52% slower): performs a lookup for each member update.
map[string]*s, 33.35 ns/op: performs a single lookup to retrieve the pointer, then updates the members through it.
package main
import "fmt"
import "math/rand"
import "time"
type s struct {
a int
b int
c int
}
func main() {
var a map[string]s
var b map[string]*s
var c map[string]*s
var i int
var now time.Time
var sa s
var sb *s
var keys []string
var k string
a = make(map[string]s, 100)
b = make(map[string]*s, 100)
c = make(map[string]*s, 100)
keys = make([]string, 100)
for i = 0; i < 100; i++ {
k = fmt.Sprintf("%04x", i)
a[k] = s{}
b[k] = &s{}
c[k] = &s{}
keys[i] = k
}
// Method 1
now = time.Now()
for i = 0; i < 1000000; i++ {
k = keys[rand.Intn(len(keys))]
sa = a[k]
sa.a++
sa.b++
sa.c++
a[k] = sa
}
fmt.Printf("no-ptr: %.2f ns/op\n", float64(time.Since(now)) / 1000000)
// Method 2
now = time.Now()
for i = 0; i < 1000000; i++ {
k = keys[rand.Intn(len(keys))]
c[k].a++
c[k].b++
c[k].c++
}
fmt.Printf("no-lookup: %.2f ns/op\n", float64(time.Since(now)) / 1000000)
// Method 3
now = time.Now()
for i = 0; i < 1000000; i++ {
k = keys[rand.Intn(len(keys))]
sb = b[k]
sb.a++
sb.b++
sb.c++
}
fmt.Printf("ptr: %.2f ns/op\n", float64(time.Since(now)) / 1000000)
}
We can note that copying the value and writing it back (method 1) costs about 38% more than the pointer version with a single lookup (method 3), and that performing one lookup per member update (method 2) costs about 52% more. To manipulate complex structures with maps, it is therefore preferable to use pointers and to make as few map accesses as possible.
All this is only true for map[string]interface{}; other key types were not tested. However, reading the code suggests that 32-bit and 64-bit integer keys are also well optimized.
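As a quick check, disassembling a small program with 64-bit integer keys, as we did earlier for strings, should show a call to the dedicated access function; the comment below states the expectation, which may vary with the Go version.

package main

import "fmt"

func main() {
	m := map[uint64]string{1: "one"}
	// Disassembling this file with "go build -gcflags -S" should show
	// a call to runtime.mapaccess1_fast64 for the lookup below.
	fmt.Println(m[1])
}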
We have not studied map creation time, nor re-indexing time when a map changes size. These times can be non-negligible when maps are created at high frequency or receive frequent insertions.
This in-depth analysis of map[string]interface{} in Go reveals several interesting points:
Hash function optimization: using a hashing algorithm based on the AESENC CPU instruction for character strings is particularly ingenious, as it extracts the best possible performance from the CPU.
Size adaptation: Go's map implementation adapts to the data size. For small collections (fewer than 9 elements), the map behaves like a simple list, thus avoiding the hashing overhead for small data sets. Furthermore, maps remain memory-efficient thanks to their variable-size bucket array.
Constant performance: beyond 8 elements, maps offer near-constant access time, making them particularly efficient for large data collections.
While maps are generally an excellent choice for associative arrays in Go, it's always important to consider the size and nature of the manipulated data to choose the most appropriate structure.
If there are fewer than 8 entries to index, it is preferable to store them in a list rather than a map: the time saved ranges from 5% to 50%. In all other cases, the map performs better.
Source files: map.tgz