# Benchmarking with sysbench

A getting started guide to benchmarking with sysbench

## Why, When, How?

Benchmark shows the difference between should and does. Coupled with monitoring it’s a great tool to identify bottlenecks.

As to when: whenever we can, as long as the system is production. A good benchmark will stress the server so we don’t want users affected. Right after provisioning a server, during a maintenance window, before and after any major changes, these are all good opportunities to take advantage of a few precious minutes of calm to run some tests.

We should try to mimic as best as possible the behaviour of the software we have. Is it single or multithreaded? Is it on more CPU bound or disk IO bound. Does it access data randomly or sequentially? Does it manage data on fixed-size blocks?

## Introducing sysbench

Sysbench is a popular benchmark tool that is available for linux, mac and can even run on windows (with it’s subsystem for linux). It ships with tests for the OS and databases. It is extensible and scriptable, so writing our own tests is possible.

If you are running linux, it’s most likely available in your distribution repository. Otherwise head to sysbench’s github for instructions.

## Getting started

The general syntax is:

sysbench <testname> <action> [test-specific options]

Sysbench has quite a few options, but help a few keystrokes away. To get to the general help:

sysbench --help

And for the specific test use:

sysbench <testname> help

Some tests have a prepare and cleanup action, some also have an optional warmup action, others can be run directly.

## The test loop

All tests have a common structure, the requested number of worker threads are created, each one will run its own test loop, each iteration of the loop is called an event and can have any number of operations. Sysbench keeps counters on all events across all threads.

When it’s done, we get the statistics for all events. Under latency we can see the average event duration. These values are common to all tests.

...
General statistics:
total time:                          10.0014s
total number of events:              32479

Latency (ms):
min:                                    1.11
avg:                                    2.46
max:                                   41.95
95th percentile:                       11.24
sum:                                79934.27

events (avg/stddev):           4059.8750/59.44
execution time (avg/stddev):   9.9918/0.01

On multithreaded tests we want that events are distributed fairly among all threads, i.e. all threads should execute approximately the same number of events. This is shown under fairness, the stddev (standard deviation) should be small in relation to the average.

Most of the benchmarks will run for a specified time or until a certain condition is met. The default time limit is 10 seconds, but we can change that using --time. Alternatively we can set a limit number of --events.

In this first test we’ll torture the CPUs by computing prime numbers. Each single event will check prime numbers up to the provided --cpu-max-prime. The higher this value is, the longer it takes an event to end.

I usually like to start with single threadetest. I try to find a --cpu-max-prime that will take at least a few seconds to complete.

# sysbench cpu test, single event
-> sysbench cpu run --threads=1 --events=1 --cpu-max-prime=8000000
sysbench 1.0.15 (using system LuaJIT 2.0.5)

Running the test with following options:
Initializing random number generator from current time

Prime numbers limit: 8000000

CPU speed:
events per second:     0.09

General statistics:
total time:                          11.2315s
total number of events:              1

Latency (ms):
min:                                11231.45
avg:                                11231.45
max:                                11231.45
95th percentile:                    11317.84
sum:                                11231.45

events (avg/stddev):           1.0000/0.00
execution time (avg/stddev):   11.2314/0.00

With a suitable max prime number found, the next step is to run a longer test.

# sysbench cpu, single thread
-> sysbench cpu run --threads=1 --time=60 --cpu-max-prime=8000000
sysbench 1.0.15 (using system LuaJIT 2.0.5)

Running the test with following options:
Initializing random number generator from current time

Prime numbers limit: 8000000

CPU speed:
events per second:     0.09

General statistics:
total time:                          67.5691s
total number of events:              6

Latency (ms):
min:                                11217.78
avg:                                11261.48
max:                                11302.15
95th percentile:                    11317.84
sum:                                67568.87

events (avg/stddev):           6.0000/0.00
execution time (avg/stddev):   67.5689/0.00

On average, it took 11.261 seconds to calculate primes up to our limit. A lower average is better, since it means the CPU takes less time to reach the same point. For the same reason, a higher value in events per second is better.

If we are going to run multithreaded applications we need to run another test with an appropiate number --threads. For example in a system with 4 cores with hyperthreading this is a good starting point.

# cpu test, 8 threads
-> sysbench cpu run --threads=8 --time=60 --cpu-max-prime=8000000
sysbench 1.0.15 (using system LuaJIT 2.0.5)

Running the test with following options:
Initializing random number generator from current time

Prime numbers limit: 8000000

CPU speed:
events per second:     0.29

General statistics:
total time:                          82.1154s
total number of events:              24

Latency (ms):
min:                                26514.20
avg:                                27212.58
max:                                28103.35
95th percentile:                    27846.48
sum:                               653101.88

events (avg/stddev):           3.0000/0.00
execution time (avg/stddev):   81.6377/0.36

For memory we can test either write or read speeds, the mode is selected with --memory-open. We can choose between sequential (seq) and random (rnd) access modes with --memory-access-mode.

This benchmark has a special stop condition: --memory-total-size. Bear in mind, this isn’t the total size allocated in RAM.

# memory, no time limit, stop on 100GB written
-> sysbench memory run --memory-total-size=100G --memory-open=write --memory-access-mode=rnd --time=0
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.15 (using system LuaJIT 2.0.5)

Running the test with following options:
Initializing random number generator from current time

Running memory speed test with the following options:
block size: 1KiB
total size: 102400MiB
operation: write
scope: global

Total operations: 104857600 (5010197.95 per second)

102400.00 MiB transferred (4892.77 MiB/sec)

General statistics:
total time:                          20.9272s
total number of events:              104857600

Latency (ms):
min:                                    0.00
avg:                                    0.00
max:                                    0.38
95th percentile:                        0.00
sum:                                10816.80

events (avg/stddev):           104857600.0000/0.00
execution time (avg/stddev):   10.8168/0.00

The latency is measured in milliseconds, it’s just to big of a unit for memory speeds so it’s not very useful here. But we can think of the MiB per seconds as the average speed.

If we have a service that work with fixed-size blocks, for example if plan to have a database, we should set --memory-block-size to match it.

# some memory tests for MySQL (page size is 16K)
sysbench memory run --threads=8 --memory-block-size=16K --memory-total-size=100G --memory-oper=write
sysbench memory run --threads=8 --memory-block-size=16K --memory-total-size=100G --memory-oper=read

## Testing your disk I/O speed

Before starting, we need to prepare a file or set of files to work on. We set the total space to allocate with --file-total-size (which must be at least 2GB) and how many files there will be in the set, the space will be split into even sized files.

We probably want to avoid caching the files in memory because we wouldn’t be measuring the true disk spee, so it might me a good idea to include --file-extra-flags=direct to bypass it.

It’s generally recommended, whenever possible, that allocated space is larger than the installed RAM, just in case caching is taking place in spite of our best efforts.

-> sysbench fileio prepare --file-num=10 --file-total-size=10G --file-extra-flags=direct

sysbench 1.0.15 (using system LuaJIT 2.0.5)

10 files, 1048576Kb each, 10240Mb total
Creating files for the test...
Extra file open flags: (none)
Creating file test_file.0
Creating file test_file.1
Creating file test_file.2
Creating file test_file.3
Creating file test_file.4
Creating file test_file.5
Creating file test_file.6
Creating file test_file.7
Creating file test_file.8
Creating file test_file.9
10737418240 bytes written in 45.15 seconds (226.78 MiB/sec).

prepare can reutilize previous test files, but it only can grow them, so we can only reuse test files if they are bigger or equal than before. Otherwise we must cleanup and prepare again.

Now we need to choose a --file-test-mode:

• seqwr: sequential write
• rndwr: random write
# file read-write test
-> sysbench fileio run --file-num=10 --file-total-size=10G --file-test-mode=rndrw --file-extra-flags=direct --time=60

sysbench 1.0.15 (using system LuaJIT 2.0.5)

Running the test with following options:
Initializing random number generator from current time

Extra file open flags: directio
10 files, 1GiB each
10GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test

File operations:
writes/s:                     1592.38
fsyncs/s:                     398.14

Throughput:
written, MiB/s:               24.88

General statistics:
total time:                          60.0027s
total number of events:              262756

Latency (ms):
min:                                    0.06
avg:                                    0.23
max:                                   77.63
95th percentile:                        0.39
sum:                                59796.75

events (avg/stddev):           262756.0000/0.00
execution time (avg/stddev):   59.7967/0.00

As always, it’s a good idea to try to match as best as possible the disk IO activity for our application. We have some flexibility here with additional options:

• --file-num number of files [128]
• --file-block-size block size [16384]
• --file-total-size total size [2G]
• --file-test-mode test mode: seqwr, seqrewr, seqrd, rndrd, rndwr, rndrw
• --file-io-mode file operations mode: sync,async,mmap [sync]
• --file-async-backlog number of asynchronous requests per thread [128]
• --file-extra-flags list of additional flags to use to open files: sync,dsync,direct
• --file-fsync-freq fsync() every N requests [100]
• --file-fsync-all fsync() after each write [off]
• --file-fsync-end fsync() at the end of test [on]
• --file-fsync-mode method for sync: fsync, fdatasync [fsync]
• --file-merged-requests max I/O request to merge [0]
• --file-rw-ratio reads/writes ratio for combined test [1.5]

As an example, for a system destined to run a MySQL server we could try something like this.

# a test simulating MySQL file activity
sysbench fileio run --threads=8 --file-extra-flags=direct --file-test-mode=rndrw --file-num=10 --file-block-size=1M --file-block-size=16K

After running our tests, we need only to remove the work files with cleanup

-> sysbench fileio cleanup --file-num=10

sysbench 1.0.15 (using system LuaJIT 2.0.5)

Removing test files...

## Testing mutexes

This benchmark is intended to test the speed for mutex locks (mutex stands for “mutual exclusion”). Which play an important role in multithreading. It simulates high concurrency, with threads acquiring and releasing locks all the time.

Sysbench creates an initial pool of --mutex-num mutexes, then it starts the threads, which iterate over an empty loop, every --mutex-loops the thread will grab one random mutex from the pool, modify a global variable, and release it. Each event will acquire and release the mutexes until the --mutex-locks value is reached.

Here --time or --events are completely ignored. Sysbench always runs one event per thread.

-> sysbench mutex run --threads=10
sysbench 1.0.15 (using system LuaJIT 2.0.5)

Running the test with following options:
Initializing random number generator from current time

General statistics:
total time:                          0.8845s
total number of events:              10

Latency (ms):
min:                                  802.84
avg:                                  855.55
max:                                  873.81
95th percentile:                      877.61
sum:                                 8555.51

events (avg/stddev):           1.0000/0.00
execution time (avg/stddev):   0.8556/0.02

Here we have a test for the CPU scheduler. It can simulate high concurrency conditions, where multiple threads compete for mutexes.

Each thread will take --thread-locks mutexes and yield the CPU. This causes the thread to stop running and is placed on the run queue by the scheduler. Once the thread is again scheduled to run, the mutexes are released.

A single event will perform these actions --thread-yields times, so the higher this number, the higher the concurrency placed on each mutex, and the lower the number of events per second.

-> sysbench threads run --threads=4 --thread-yields=1000 --thread-locks=10
sysbench 1.0.15 (using system LuaJIT 2.0.5)

Running the test with following options:
Initializing random number generator from current time

General statistics:
total time:                          10.0005s
total number of events:              50754

Latency (ms):
min:                                    0.50
avg:                                    0.79
max:                                   14.37
95th percentile:                        0.81
sum:                                39980.48

events (avg/stddev):           12688.5000/54.38
execution time (avg/stddev):   9.9951/0.00

## What’s next?

Whew, that was a longer post than I intended and to think I only covered a part of it all.

There is still more to say about sysbench, check the second part where I deal with databases, so the fun is just starting.

So long.

Tomas