How slow is the tracing interpreter of PyPy's meta-tracing JIT?

I wanted to investigate the warmup behavior of the PyPy interpreter, so I wrote a somewhat arbitrary microbenchmark:

import sys

all_results = set()
num = int(sys.argv[1])

class A(object):
    pass

def main():
    res = 0
    for i in range(num):
        a = A()
        a.x = i
        d = {"a": a.x}
        l = [0, 1, d["a"]]
        res += l[-1]
    all_results.add(res)

This function is the ideal case for PyPy's JIT compiler, as it has a tight loop with many object allocations that all have predictable lifetimes. There is no control flow inside the loop. The JIT compiler can optimize this loop body extremely well, to just an integer addition.

However, I wanted to investigate the warmup behavior of the JIT compiler. PyPy's tracing JIT will trace this loop at its 1041st iteration. Tracing a loop is rather expensive in PyPy, due to its meta-tracing architecture. When the loop is traced, there is actually a stack of two interpreters: the meta-tracing interpreter executes the Python interpreter, which in turn executes one iteration of the Python loop. While doing so, the meta-tracing interpreter keeps a log of all the operations that the Python interpreter executed. This log – the trace – is then optimized and compiled to machine code. From then on, that chunk of machine code is executed instead of interpreting the loop body.

What really interests me here is just how costly the double-interpretation overhead of the meta-tracing interpreter is. I had no intuition about its slowdown factor: is it 10x, 100x, 1,000x, 10,000x, 100,000x? My goal is to understand the order of magnitude of that slowdown.

All the numbers in this blog post were measured on PyPy 2.7.18 (because I had a nicely instrumented version of PyPy2 around); I expect the results to be similar (i.e. at least the same order of magnitude) on PyPy 3.11. Timing results are from my AMD Ryzen 7 PRO 7840U, running Ubuntu Linux 24.04.2.

Estimating the slowdown factor of the meta-tracing interpreter

To estimate the slowdown factor we can run this benchmark with num set to exactly 1041. Since the JIT traces the 1041st iteration, this is the maximally bad case for the JIT: the loop gets traced, but because no further iteration follows, the compiled machine code is never executed. What remains is exactly the overhead of the meta-tracing interpreter executing the Python interpreter for that one iteration, which is what we want to measure. If we set the environment variable PYPYLOG=jit-summary:outputfilename, we can see how much time the meta-tracing interpreter spent tracing the loops.

To make the numbers slightly larger and to reduce the noise of the measurement, I actually generate and run 100 copies of the above main function.

from __future__ import print_function
import sys

all_funcs = []
all_results = set()
s = """
class A%s(object):
    pass

def main%s():
    res = 0
    for i in range(num):
        a = A%s()
        a.x = i
        d = {"a": a.x}
        l = [0, 1, d["a"]]
        res += l[-1]
    all_results.add(res)
all_funcs.append(main%s)
"""
fullstr = "\n".join(s % (i, i, i, i) for i in range(100))
exec(fullstr)

import time

num = int(sys.argv[1])

t1 = time.time()
for fn in all_funcs:
    fn()
t2 = time.time()
print(num, t2 - t1)

Now we can run this benchmark a hundred times with bash, to get a good average across many process executions:

for i in {0..100..1};
    do PYPYLOG=jit-summary:out%d pypy -S x.py 1041 >> times;
done

Afterwards, we can look at the out%d files to see how much time the meta-tracing interpreter spent tracing the loops (%d is expanded to the process id by the PyPy logging system). The content of these files will look roughly like this:

[c8ca76f3db881] {jit-summary
Tracing:        100     0.033022
Backend:        100     0.009439
TOTAL:                  0.072409
ops:                    66200
heapcached ops:         20200
recorded ops:           11200
  calls:                2000
guards:                 2200
opt ops:                9700
opt guards:             1800
opt guards shared:      1300
forcings:               0
abort: trace too long:  0
abort: compiling:       0
abort: vable escape:    0
abort: bad loop:        0
abort: force quasi-immut:       0
abort: segmenting trace:        0
virtualizables forced:  0
nvirtuals:              2500
nvholes:                0
nvreused:               1100
vecopt tried:           0
vecopt success:         0
Total # of loops:       100
Total # of bridges:     0
Freed # of loops:       0
Freed # of bridges:     0
[c8ca76f3e9618] jit-summary}

Let's extract the tracing times with grep:

grep Tracing out* | grep -o "0[.].*" > tracingtimes

Now we can compute the slowdown factor like this:

>>> with open('tracingtimes') as f:
...     f = [float(l) for l in f]
...     tracingtime = sum(f) / len(f)
...     
>>> with open('times') as f:
...     f = [float(l.split()[1]) for l in f]
...     exectime = sum(f) / len(f)
...     
>>> regulartime = (exectime - tracingtime) / 1040
>>> print(regulartime)
3.550879183375572e-05
>>> print(tracingtime)
0.03153696039603962
>>> slowdown = tracingtime / regulartime
>>> print(slowdown)
888.1451259645404

(random side-note: I want to complain about the fact that the pygments lexer does not support the >>>> syntax of PyPy's interactive interpreter.)

This means that the meta-tracing interpreter is roughly 900x slower than the regular interpreter for this benchmark. (Note that the tracing time alone is not the full warmup time of the JIT, since it doesn't include the time spent optimizing the trace and producing machine code.)

How many instructions does the meta-tracing interpreter execute?

Let's understand the work the Python interpreter does in one iteration of the loop. We can disassemble one of the main functions to see what it does:

>>> import dis
>>> dis.dis(main0)
  6           0 LOAD_CONST               1 (0)
              3 STORE_FAST               0 (res)

  7           6 SETUP_LOOP              87 (to 96)
              9 LOAD_GLOBAL              0 (range)
             12 LOAD_GLOBAL              1 (num)
             15 CALL_FUNCTION            1
             18 GET_ITER            
        >>   19 FOR_ITER                73 (to 95)
             22 STORE_FAST               1 (i)

  8          25 LOAD_GLOBAL              2 (A0)
             28 CALL_FUNCTION            0
             31 STORE_FAST               2 (a)

  9          34 LOAD_FAST                1 (i)
             37 LOAD_FAST                2 (a)
             40 STORE_ATTR               3 (x)

 10          43 BUILD_MAP                0
             46 LOAD_FAST                2 (a)
             49 LOAD_ATTR                3 (x)
             52 LOAD_CONST               2 ('a')
             55 STORE_MAP           
             56 STORE_FAST               3 (d)

 11          59 LOAD_CONST               1 (0)
             62 LOAD_CONST               3 (1)
             65 LOAD_FAST                3 (d)
             68 LOAD_CONST               2 ('a')
             71 BINARY_SUBSCR       
             72 BUILD_LIST               3
             75 STORE_FAST               4 (l)

 12          78 LOAD_FAST                0 (res)
             81 LOAD_FAST                4 (l)
             84 LOAD_CONST               4 (-1)
             87 BINARY_SUBSCR       
             88 INPLACE_ADD         
             89 STORE_FAST               0 (res)
             92 JUMP_ABSOLUTE           19
        >>   95 POP_BLOCK           

 13     >>   96 LOAD_GLOBAL              4 (all_results)
             99 LOOKUP_METHOD            5 (add)
            102 LOAD_FAST                0 (res)
            105 CALL_METHOD              1
            108 POP_TOP             
            109 LOAD_CONST               0 (None)
            112 RETURN_VALUE  

The loop body starts at bytecode offset 19 (FOR_ITER) and ends at offset 92 (JUMP_ABSOLUTE), just before the POP_BLOCK at offset 95. It consists of 28 Python bytecode instructions.
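To double-check that count without counting by hand, here is a quick sketch for Python 2 bytecode (where an instruction is one byte, or three bytes if it takes an argument); given the code object of main0 and the offsets of FOR_ITER and JUMP_ABSOLUTE it should print 28:

import dis

def count_instructions(code, start, stop):
    # count the bytecode instructions whose offset lies in [start, stop]
    count = 0
    i = 0
    while i < len(code.co_code):
        if start <= i <= stop:
            count += 1
        op = ord(code.co_code[i])
        # opcodes >= HAVE_ARGUMENT carry a two-byte argument in Python 2
        i += 3 if op >= dis.HAVE_ARGUMENT else 1
    return count

print(count_instructions(main0.__code__, 19, 92))  # prints 28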

When the JIT compiler traces this loop, it will execute the Python interpreter for one iteration of the loop on top of the meta-tracing interpreter. I built a special version of PyPy that counts the number of opcodes executed by the meta-tracing interpreter while tracing. Here are the results:

function name in Python interpreter           operations executed  times called  operations traced
dispatch_bytecode                                             864            28                  0
PyFrame.dispatch                                              708             1                  0
W_TypeObject.lookup_where_with_method_cache                   142             6                  0
handle_bytecode__AccessDirect_None                            112            28                  0
list_BINARY_SUBSCR__AccessDirect_None                          79             2                  0
W_TypeObject.descr_call                                        77             1                  3
_init_from_list_w_helper                                       66             1                 12
W_TypeObject.issubtype                                         61             2                  0
popvalues__AccessDirect_None                                   54             1                  3
getattr                                                        45             1                  0
setattr                                                        43             1                  0
Function.call_obj_args                                         42             3                  1
int_INPLACE_ADD                                                41             1                  4
jump_absolute                                                  37             1                  2
BytesDictStrategy.setitem                                      35             1                  0
STORE_MAP                                                      35             1                  0
_reorder_and_add                                               34             1                  0
call_args                                                      34             1                  0
ll_dict_lookup                                                 33             2                  5
setitem                                                        32             2                  0
...                                                           ...           ...                ...
total                                                        3675           173                103

The names of these functions aren't too important; what matters is that the meta-tracing interpreter executes 3675 operations to trace one iteration of the loop in main, which is a lot. It also records 103 trace operations, most of which are then removed again by the trace optimizer, leaving a trace with just 22 operations:

label(p0, p1, p6, p7, i45, i32, p20, p31, p35, i50, i36)
debug_merge_point(0, 0, '<code object main0. file '<string>'. line 5> #19 FOR_ITER')
i52 = int_lt(i50, 0)
guard_false(i52)
i53 = int_ge(i50, i36)
guard_false(i53)
i55 = int_add(i50, 1)
debug_merge_point(0, 0, '<code object main0. file '<string>'. line 5> #25 LOAD_GLOBAL')
# update the iterator object
setfield_gc(p20, i55, descr=<FieldS pypy.objspace.std.iterobject.W_AbstractSeqIterObject.inst_index 8>)
guard_not_invalidated()
debug_merge_point(0, 0, '<code object main0. file '<string>'. line 5> #88 INPLACE_ADD')
# perform the addition, checking for machine integer overflow
i56 = int_add_ovf(i45, i50)
guard_no_overflow(descr=<Guard0x729830a52170>) [p0, p6, p7, p20, p1, i50, i32, i45]
debug_merge_point(0, 0, '<code object main0. file '<string>'. line 5> #92 JUMP_ABSOLUTE')
# check whether ctrl-c was pressed
i58 = getfield_raw_i(125998063394752, descr=<FieldS pypysig_long_struct.c_value 0>)
i60 = int_lt(i58, 0)
guard_false(i60)
debug_merge_point(0, 0, '<code object main0. file '<string>'. line 5> #19 FOR_ITER')
jump(p0, p1, p6, p7, i56, i50, p20, p31, p35, i55, i36)

Note in particular that no allocations are left in this trace. The optimizer recognized that neither the A instance, the dictionary, nor the list needs to be allocated.

How long does it take for the JIT compilation time to be amortized?

To find out after how many iterations the JIT compilation time is amortized, we can run the benchmark with varying num values and measure the execution time, both with and without the JIT. We do this by starting many processes using more bash for loops. We can then plot the execution time against the number of iterations to see when the time with the JIT becomes lower than the time without the JIT.

for i in {0..20000..1}; do pypy -S x.py "$i" >> datajit ; done; for i in {0..20000..1}; do pypy --jit off -S x.py "$i" >> datanojit ; done

And we can also run the same script on CPython, while we are at it:

for i in {1..10000..1}; do python3.13 -S x.py "$i" >> datacpy ; done;

(this is a bit apples to oranges because I am comparing PyPy2 against CPython 3.13, but oh well.)

Both will run for a while and then we can plot the data.

The gory matplotlib details:
# Co-created by Copilot
import matplotlib.pyplot as plt
import numpy as np

THRESHOLD = 1041  # JIT threshold

def plot_trace_costs(maximum, save_as, baseline_filename='datanojit', baseline_legend='PyPy No JIT'):
    with open('datajit', 'r') as f:
        datajit = [tuple(map(float, line.strip().split())) for line in f.readlines()]
    with open(baseline_filename, 'r') as f:
        datanojit = [tuple(map(float, line.strip().split())) for line in f.readlines()]
    # scale the y values by 1000 (seconds to milliseconds)
    datajit = [(x, y * 1000) for x, y in datajit]
    datanojit = [(x, y * 1000) for x, y in datanojit]
    origdatajit = datajit.copy()
    # sort the data by x value
    datajit.sort(key=lambda x: x[0])
    datanojit.sort(key=lambda x: x[0])
    # filter out lines with x value greater than maximum
    datajit = [x for x in datajit if x[0] <= maximum]
    datanojit = [x for x in datanojit if x[0] <= maximum]
    # create new figure
    plt.figure(figsize=(10, 6))
    plt.clf()

    # unpack data
    jit_x, jit_y = zip(*datajit) if datajit else ([], [])
    nojit_x, nojit_y = zip(*datanojit) if datanojit else ([], [])

    # plot the data points, using a scatter plot. use circle size of 2
    plt.scatter(jit_x, jit_y, label='PyPy w/ JIT', alpha=0.3, s=1)
    plt.scatter(nojit_x, nojit_y, label=baseline_legend, alpha=0.3, s=1)
    # add thin vertical line at x=THRESHOLD
    plt.axvline(x=THRESHOLD, color='gray', linestyle='--', linewidth=1)
    # add text to the left of the line, near the top saying "JIT threshold"
    all_y = list(jit_y) + list(nojit_y)
    if all_y:
        plt.text(THRESHOLD, max(all_y) * 0.9, 'JIT threshold', color='gray', fontsize=10, ha='left')

    # fit a line to the JIT data, starting from x=THRESHOLD
    jit_x_fit = [x for x, _ in origdatajit if x >= THRESHOLD]
    jit_y_fit = [y for x, y in origdatajit if x >= THRESHOLD]
    assert len(jit_x_fit) == len(jit_y_fit), "JIT x and y data lengths do not match"
    assert jit_x_fit and jit_y_fit
    jit_fit = np.polyfit(jit_x_fit, jit_y_fit, 1)
    jit_fit_line = np.polyval(jit_fit, jit_x)
    plt.plot(jit_x, jit_fit_line, color='black', linestyle='--', linewidth=1)
    assert nojit_x and nojit_y
    nojit_x_fit = [x for x in nojit_x if x >= THRESHOLD]
    nojit_y_fit = [y for x, y in zip(nojit_x, nojit_y) if x >= THRESHOLD]
    assert nojit_x_fit and nojit_y_fit
    nojit_fit = np.polyfit(nojit_x_fit, nojit_y_fit, 1)
    nojit_fit_line = np.polyval(nojit_fit, nojit_x)
    plt.plot(nojit_x, nojit_fit_line, color='black', linestyle='--', linewidth=1)

    # compute intersection point of the two lines
    intersection_x = (nojit_fit[1] - jit_fit[1]) / (jit_fit[0] - nojit_fit[0])
    intersection_y = jit_fit[0] * intersection_x + jit_fit[1]
    # also compute the speedup of the JIT and No JIT fits by dividing the slopes
    jit_speedup = 1 / (jit_fit[0] / nojit_fit[0])

    # add text box to the upper right corner with the fit parameters
    fit_text = f'PyPy w/ JIT fit: y = {jit_fit[0]:.6f}x + {jit_fit[1]:.6f}\n{baseline_legend} fit: y = {nojit_fit[0]:.6f}x + {nojit_fit[1]:.6f}'
    speedup_text = f'Speedup: {jit_speedup:.2f}x\nIntersection: ({intersection_x:.2f}, {intersection_y:.2f})'
    fit_text += "\n" + speedup_text
    plt.text(0.95, 0.95, fit_text, transform=plt.gca().transAxes, fontsize=10,
             verticalalignment='top', horizontalalignment='right',
             bbox=dict(facecolor='white', alpha=0.5, edgecolor='none'))
    # compute the y-values of both fits at the JIT threshold
    jit_y_at_threshold = jit_fit[0] * THRESHOLD + jit_fit[1]
    nojit_y_at_threshold = nojit_fit[0] * THRESHOLD + nojit_fit[1]
    diff = jit_y_at_threshold - nojit_y_at_threshold
    diff_text = f'Difference at threshold: {diff:.2f} ms'
    print(diff_text)
    plt.xlabel('Iterations')
    plt.ylabel('time taken (milliseconds)')
    plt.grid()
    plt.title(f'Warmup behaviour: PyPy w/ JIT vs {baseline_legend}')
    # legend on the top left corner
    plt.legend(loc='upper left')
    plt.savefig(save_as)

if __name__ == '__main__':
    plot_trace_costs(maximum=20000, save_as='trace_costs_start.svg')
    plot_trace_costs(maximum=2000, save_as='trace_costs_very_start.svg')
    plot_trace_costs(maximum=10000, save_as='trace_costs_cpy.svg', baseline_filename='datacpy', baseline_legend='CPython 3.13')

This is what the plot looks like: Trace Costs: PyPy w/ JIT vs PyPy No JIT

The plot shows that the JIT compilation time is amortized after about 2600 iterations. The JIT compilation time is about 39ms (which consists of the meta-tracing time plus code generation time). If the program runs for a very long time, the speedup of the JIT will approach >200x (but the example was picked to make this easy for the JIT, so it's not indicative of the performance of realistic Python code).
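As a rough cross-check, we can redo that break-even arithmetic with the numbers measured earlier (all of them are per 100 copies of the loop, so the factor cancels). The result doesn't match the plot exactly, partly because the interpreter runs measured above still carry the JIT's profiling overhead, but it lands in the same ballpark:

compile_time = 0.039        # tracing plus code generation, in seconds
interp_per_iter = 3.55e-05  # one loop iteration in the regular interpreter, in seconds
jit_per_iter = interp_per_iter / 200  # assuming the ~200x asymptotic speedup

extra_iterations = compile_time / (interp_per_iter - jit_per_iter)
print(1041 + extra_iterations)  # roughly 2150, the same order as the ~2600 read off the plot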

We can zoom in on the first 2000 data points:

Trace Costs: PyPy w/ JIT vs PyPy No JIT, first 2000 points

And here's the comparison with CPython 3.13:

Trace Costs: PyPy w/ JIT vs CPython 3.13

Compared to CPython the speedup is ~160x and PyPy catches up with CPython only at around 4800 iterations. I don't know why CPython's performance is so much more noisy across the 10,000 iterations.

Conclusion

After these quick measurements we conclude that the meta-tracing interpreter is three orders of magnitude slower than the regular interpreter. We (Yusuke Izawa and I) plan to improve the performance of the meta-tracing interpreter in the future, and I look forward to seeing how the numbers change then.

Acknowledgements

Thanks to Yusuke Izawa for his work on instrumenting the meta-tracing interpreter, and for our previous collaboration.

A Hint for Language-Specific Runtime Function Optimization in RPython's Meta-JIT

Meta Note: This post was in my draft folder for the PyPy blog for three years, and various people gave the feedback "wtf are you even talking about". Since I haven't worked on it since then I clearly am not going to fix anything about it, so I'll just post it almost unchanged to my personal blog now.

RPython's meta-JIT cannot reason about the properties of many RPython implementation-level functions, because those are often specific to the different interpreters written in RPython. Expressing language-specific optimizations on these functions would allow the JIT to more effectively reason about language-specific data structures and potentially remove costly calls into the runtime support code. In this blog post I will describe a new hint that makes it possible to express invariants about the functions of the data structures used in a specific language interpreter. These hints make it possible for the meta-JIT optimizer to do more language-specific optimizations across function calls without having to change the meta-JIT itself.

Introduction and Background

RPython has a meta-JIT that can be applied to a variety of languages, one of them being Python in the form of PyPy (which is really two implementations, one PyPy2, one PyPy3). Most modern languages have many built-in functions and data types that are written in the implementation language (RPython in our case, C in the case of CPython). We want the JIT optimizer to be able to reason about these functions and data types.

One way to do this would be to extend the meta-JIT by adding a new language-specific optimization pass that knows how to deal with these functions. This approach however goes against the idea of a meta-JIT, since we want the JIT to be language independent.

So in the past we developed various tools that make it possible for the authors of RPython interpreters to express various optimizations. These tools take the form of hints. Hints are small changes in the RPython source of the interpreter that communicate with the meta-JIT. They come in two major forms: either special functions that are called at runtime from the RPython code of the interpreter, or decorators that are applied to the RPython functions that make up the interpreter. Both the special functions and the decorators are part of the rpython.rlib.jit library.

One thing I want to stress here at the beginning is that all these hints, and everything I talk about in the rest of the blog post, are only needed by the people working on the RPython code that makes up the PyPy Python interpreter (or other languages implemented in RPython). "Regular" Python code that just runs on PyPy does not need these hints, and in fact cannot use them.

The @elidable Decorator

One very important decorator is called @elidable, which makes it possible to mark an RPython function as pure (actually a small generalization, but for the purpose of this post thinking of elidable functions as pure is fine). This decorator was described in a blog post in 2011, when it was still called @purefunction. The post later turned into a paper.

An elidable function will be constant-folded by the JIT if all its arguments are constant. If it can't be removed because some arguments aren't constant, it's still subject to common subexpression elimination (CSE). Subsequent calls to the same function with the same arguments can be removed and replaced by the previous result.

There are lots of examples of @elidable functions, both in RPython in general and in the PyPy interpreter. In the post linked above they are used for optimizing object operations in a tiny example interpreter.

But a lot of functionality that is more directly exposed to the Python programmer is also elidable. Most string methods are elidable, for example: str.lower will always return the same result for a given argument, and if you call it twice on the same string, it will return the same result both times. So adding the @elidable decorator to the implementation of these string methods allows the JIT to constant-fold them if they are applied to a constant string, and to remove a second identical call.

Another big class of examples are the implementation functions for our big integer operations, which underlie the Python int type once the values don't fit into a machine word anymore. On the implementation level we implement those in an rbigint class, and most methods of that class are also elidable. This enables the JIT to constant-fold big integer addition, and to do CSE on big integer arithmetic.

Limitations of @elidable

This is all very useful! But it's still only a limited amount of things that the interpreter author can express to the JIT with this one function decorator. Recently (i.e. in 2022) I merged a branch that adds two new hints that the interpreter author can use to communicate possible optimizations to the meta-JIT! The work has been an ongoing project for a while. So far, only the meta-JIT support has been merged; in future enhancements we plan to apply the hints to the PyPy Python interpreter where appropriate.

record_known_result

So, what are the new hints? One of them is called record_known_result, and that hint is what this blog post is about. The other is called record_exact_value; it's conceptually quite simple, but it's much harder to see how it can be used. It was implemented by Lin Cheng from Cornell, and it is described (together with a possible optimization to PyPy that uses the hint) in another paper.

What is record_known_result used for? One of the limitations of @elidable is that there are often properties that connect several function calls to each other. Sometimes functions are inverses of each other, so that f(g(x)) == x for all x (example: negation on numbers is its own inverse, --x == x). Sometimes functions are idempotent, which means that if you apply the function to its own result, nothing further changes, so all but the first call can be removed. An example would be abs on numbers: after the first call to abs the result is non-negative, so calling abs again on the result has no effect, i.e. abs(abs(x)) == abs(x). These properties could in theory be nicely used by a hypothetical optimizer that knows about them and the functions. However, as described above, we don't want to change the meta-JIT to add knowledge about interpreter-specific functionality. So we wanted to add a hint that can express them to the meta-JIT.

What could hints look like that make it possible to express these properties? At first I was experimenting with declarative decorators like @idempotent and @is_inverse_of(func), but I felt it wouldn't scale to add lots of such decorators and support for all of them in the meta-JIT. In the end I found a not fully obvious hint that is not a decorator, but that is powerful enough to express at least these two and many similar patterns. This hint piggy-backs on the existing CSE implementation for pure functions in the meta-JIT.

The hint works as follows: it is a new function called record_known_result(result, func, *args). Calling that function has no direct effect at runtime. Instead, it communicates to the meta-JIT that if it sees a call to func with the arguments *args further down the trace, it can replace the result of that call with result.

Since this is pretty abstract, it's best to look at an example. How would you use this to express the fact that rbigint_neg is its own inverse? That's the function in the runtime responsible for computing the negation of a big integer. Its implementation looks roughly like this (it's actually a method and a tiny bit more complicated, but not significantly so):

@elidable
def rbigint_neg(self):
    return rbigint(digits=self.digits, sign=-self.sign)

If we want to use the new hint to express that rbigint_neg(rbigint_neg(x)) == x, we need to rewrite the function somewhat, by introducing a pure helper function that does the actual computation, and turning the original function into a wrapper that calls the helper:

@elidable
def _rbigint_neg_helper(self):
    return rbigint(digits=self.digits, sign=-self.sign)

def rbigint_neg(self):
    res = _rbigint_neg_helper(self)
    record_known_result(self, _rbigint_neg_helper, res)
    return res

When we trace the rbigint_neg function, the record_known_result hint tells the JIT the following: if at any point in the future (meaning further down the trace) it sees another call to _rbigint_neg_helper with res as the argument, it can replace that call directly with self. That is exactly the property of _rbigint_neg_helper being its own inverse. As another example, let's express the idempotence of bytes.lower. We can imagine the implementation looking something like this (the "real" implementation is actually quite different, because we don't want the extra copy that bytes.join makes):

@elidable
def bytes_lower(b):
    # implementation looks very different in practice
    # just an illustration!
    res = ['\x00'] * len(b)
    for i, c in enumerate(b):
        if 'A' <= c <= 'Z':
            c = chr(ord(c) - ord('A') + ord('a'))
        res[i] = c
    return b"".join(res)

To express that the function is idempotent, we need to express that bytes_lower(bytes_lower(b)) == bytes_lower(b). We do this with the same approach as before: move the implementation into a helper function, call the helper from the original function, and call record_known_result too:

@elidable
def _bytes_lower_helper(b):
    ... # as above

def bytes_lower(b):
    res = _bytes_lower_helper(b)
    record_known_result(res, _bytes_lower_helper, res)
    return res

This tells the meta-JIT that if res is later itself passed to _bytes_lower_helper, it can remove that call and replace it immediately with res (because res is already all lower-cased, as it's the result of a call to lower), i.e. that _bytes_lower_helper is idempotent. (There are also other properties of lower and upper we could express in this way, for example that bytes.lower(bytes.upper(x)) == bytes.lower(x); let's leave it at that for now though.)

Both of these usage patterns of record_known_result could of course also be pulled out into general decorators again. For example a generic @idempotent decorator could be implemented like this:

def idempotent(func):
    # idempotent implies elidable
    func = elidable(func)
    def wrapper(arg):
        res = func(arg)
        record_known_result(res, func, res)
        return res
    return wrapper

Then the decorator could be used like this for bytes_lower:

@idempotent
def bytes_lower(b):
    # implementation as in the original code above
    ...
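In the same spirit, the inverse pattern from the rbigint_neg example could be wrapped up in a helper decorator. This is only a sketch of what such a decorator might look like, it does not exist in rpython.rlib.jit:

def inverse_of(inverse_func):
    # sketch: expresses that inverse_func(func(x)) == x for all x
    def decorator(func):
        func = elidable(func)
        def wrapper(arg):
            res = func(arg)
            # if the JIT later sees inverse_func(res), it can replace it with arg
            record_known_result(arg, inverse_func, res)
            return res
        return wrapper
    return decorator

For a self-inverse function like _rbigint_neg_helper you still need the helper/wrapper split from above, because the function named in the hint has to be the elidable helper that actually shows up in the trace.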

Implementing record_known_result

How is record_known_result implemented? As I wrote above, the implementation of that hint builds on the existing support for elidable functions in the optimizer of the meta-JIT. There are several optimizations that do something with elidable function calls: constant folding, CSE, dead code elimination. Let's look at how these work on elidable calls:

  • Constant folding removes calls to elidable functions with constant results (technically this is a bit complicated, but conceptually this is what happens).

  • CSE will replace calls to an elidable function by previous results, if they appear a second time further down the trace.

  • Dead code elimination will remove elidable function calls in the trace that have unused results.

So if there is a trace like this:

r1 = call_elidable((f), (1)) # constant-folded to 17
r2 = call_elidable((g), a, b)
r3 = call_elidable((g), a, b) # replaced by r2
r4 = call_elidable((h), c, d) # removed, result unused
print(r1, r2, r3)

It will be optimized to:

r2 = call_elidable((g), a, b)
print((17), r2, r2)

Some general notes about these traces: they are in static single assignment form (SSA), meaning that every variable is assigned to only once. ¹ They are also slightly simplified compared to "real" traces.

Let's look at how the CSE pass for elidable calls, which is part of the meta-JIT, works. In pseudocode it could look something like this:

def cse_elidable_calls(trace):
    seen_calls = {}
    output_trace = []
    for op in trace:
        if is_call_elidable(op):
            # op.args are the function,
            # followed by the arguments
            # which are variables or constants
            key = op.args
            previous_op = seen_calls.get(key)
            if previous_op is not None:
                replace_result_with(op, previous_op)
                # don't need to emit the op
                continue
            else:
                seen_calls[key] = op
        output_trace.append(op)
    return output_trace

There is quite a bit of hand-waving here, particularly around how replace_result_with can work. But this is conceptually what the real optimization does. ²
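To make the hand-waving a little more concrete, one way to think about replace_result_with in this simplified model is a replacement table that the arguments of later operations are sent through before they are used (as said, the real implementation uses a union-find structure instead; this is just a sketch):

replacements = {}

def replace_result_with(op, previous_op):
    # remember that every later use of op's result should use previous_op instead
    replacements[op] = previous_op

def resolve(value):
    # follow replacement chains; applied to the arguments of every operation
    # before they are used, e.g. as part of the keys of seen_calls
    while value in replacements:
        value = replacements[value]
    return value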

Making use of the information provided by record_known_result is done by extending this CSE pass. Let's say you trace something like this:

x = bytes_lower(s)
... some other code ...
y = bytes_lower(x)
print(x, y)

This should trigger the idempotence optimization. The resulting trace could look like this:

# bytes_lower itself is inlined into the trace:
r1 = call_elidable((_bytes_lower_helper), s1)
record_known_result(r1, (_bytes_lower_helper), r1)
... intermediate operations ...

# second call to bytes_lower inlined into the trace:
r2 = call_elidable((_bytes_lower_helper), r1)
record_known_result(r2, (_bytes_lower_helper), r2)
print(r1, r2)

The CSE pass on elidable functions will now optimize away the call that results in r2. It does this not by replacing r2 by a previous call to _bytes_lower_helper with the same arguments (such a call doesn't exist), but instead makes use of the information conveyed by the first record_known_result trace operation. That operation states that if you see a call like the second _bytes_lower_helper you can replace it with r1. The resulting optimized trace therefore looks like this:

r1 = call_elidable((_bytes_lower_helper), s1)
... intermediate operations, optimized ...
# call removed, r2 replaced with r1
print(r1, r1)

The record_known_result operations are also removed, because further optimization passes and the backends don't need them. To get this effect, we have to change the pseudocode above to teach the CSE pass about record_known_result operations in the following way:

def cse_elidable_calls(trace):
    seen_calls = {}
    output_trace = []
    for op in trace:
        # <---- start new code
        if is_record_known_result(op):
            # remove the first argument,
            # which is the result
            key = op.args[1:]
            # the remaining key is function called,
            # followed by arguments, like below
            seen_calls[key] = op.args[0]
            # don't emit the record_known_result op
            continue
        # end new code ---->
        if is_call_elidable(op):
            # op.args are the function,
            # followed by the arguments
            # which are variables or constants
            key = op.args
            previous_op = seen_calls.get(key)
            if previous_op is not None:
                replace_result_with(op, previous_op)
                # don't need to emit the op
                continue
            else:
                seen_calls[key] = op
        output_trace.append(op)
    return output_trace

That's all! So from the point of view of the implementation of CSE of elidable functions, the new hint is actually very natural.

In the case of function inverses, dead code elimination also plays an important role. Let's look at the trace of a double negation, maybe like this: x = -y; ...; print(-x):

r1 = call_elidable((_rbigint_neg_helper), a1)
record_known_result(a1, (_rbigint_neg_helper), r1)
... intermediate stuff
r2 = call_elidable((_rbigint_neg_helper), r1)
record_known_result(r1, (_rbigint_neg_helper), r2)
print(r2)

After CSE, the second call is removed and the trace looks like this, because r2 was found to be the same as a1:

r1 = call_elidable((_rbigint_neg_helper), a1) # dead
... intermediate stuff, CSEd
# call removed
print(a1)

Now dead code elimination notices that the first call is not needed any more either and removes it.

What is good about this design? It very neatly ties into the existing infrastructure and is basically only about 100 lines of changes in the meta-JIT. The amount of work the optimizer does stays essentially the same, as the new hints are basically directly usable by CSE which we already do anyway.

Performance effects

So far, we haven't actually used this new hint in PyPy much. At this point, the hint is only a new tool in the interpreter author's toolbox, and we still need to find the best places to use this tool. The only use of the hint so far is an annotation that tells the JIT that encoding and decoding to and from utf8 are inverses of each other, to be able to optimize this kind of code: x = someunicode.encode("utf-8").decode("utf-8") by replacing x with someunicode (of course in practice there is usually again some distance between the encode and decode calls). This happens in a bunch of places in real code that I saw, but I haven't done a careful study of the performance effects yet.
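The hint for that case could look roughly like the following sketch. The function names here are made up for illustration, the real helpers in PyPy's runtime are structured differently:

@elidable
def _utf8_encode_helper(u):
    ...  # the actual encoding logic

@elidable
def _utf8_decode_helper(b):
    ...  # the actual decoding logic

def utf8_encode(u):
    res = _utf8_encode_helper(u)
    # decoding the bytes we just produced gives back the original unicode string
    record_known_result(u, _utf8_decode_helper, res)
    return res

def utf8_decode(b):
    res = _utf8_decode_helper(b)
    # re-encoding the string we just decoded gives back the original bytes
    record_known_result(b, _utf8_encode_helper, res)
    return res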

Limitations

What are the problems and the limitations of the approach I described in this post?

Correctness remains tricky! If you write wrong hints, the meta-JIT will potentially miscompile your users' programs. To at least get some signal for that, when the program runs untranslated (e.g. while executing tests) record_known_result actually performs the hinted call and asserts that it really returns the claimed result. In combination with, for example, property-based testing this can find a lot of the bugs, but it is of course no guarantee.

Many things aren't expressible. The new hint is much less powerful than some of the recent pattern based optimization systems (e.g. metatheory.jl) that allow library authors to express rewrites. Instead, we designed the hint to minimally fit into the existing optimizers at the cost of power and ease of use. The most obvious limitation compared to pattern based approaches is that the record_known_result hint cannot quantify over unknown values. It can only use values that are available in the program. As an example, it's not really possible to express that bigint_sub(x, x) == bigint(0) for arbitrary big integers x.

Another limitation of the hint is that currently it is only applicable to pure/elidable functions. This makes it not really applicable to any kind of mutable data structure. As an example, in theory sorted(list) is idempotent, but only as long as the lists involved aren't mutated between the two calls to sorted. Reasoning about mutation doesn't really fit into the model easily. The meta-JIT itself is actually able to do a lot of tracking of what kinds of mutations occurred and what the heap must look like. But we haven't found a good way to combine this available information with user-provided information about function behaviour.

Conclusion

We added two new hints to RPython's meta-JIT that allow the interpreter author to express language-specific optimizations. We are still only getting used to these new hints and their possible applications and will need to collect more experience about how big the performance implications are in practice for real programs.

Footnotes

¹ In fact, there is not really a concept of "variable" at all, instead all variables are identical with the operations that produce them.

² Some details on the hand-waving: replacing ops with other ops is implemented using a union-find data-structure to efficiently allow doing arbitrary replacements. These replacements need to influence the lookup in the seen_calls dict, so in practice it's not even a dictionary at all. Another way that the pseudocode is simplified is that we don't in practice have tiny passes like this that go over the trace again and again. Instead, we have a single optimization pass that goes over the trace in forward direction once.

New Personal Blog

I mainly post on the PyPy blog, but there are sometimes things I write that don't really fit there. In 2018 and 2020 I wrote two essays about (computational) art which lived on glitch.com. Now that that site is going to shut down, I decided to migrate and backdate them here, together with some shorter backdated draft posts that never really made it anywhere (or only to social media).

PyPy Binary Sizes over Time

We don't do this regularly or anything, but every couple of years I look at how the PyPy binary sizes have developed in the meantime. Looks like an ok balance between occasionally cleaning something up and thereby shrinking the binary; and then slow growth or explicit time/binary-size-tradeoffs.

plot showing how the PyPy binary went from 68 MiB in early 2020 to 61 MiB
today, with a few ups and downs in between. various points in time are
annotated with reasons for growth and shrinkage. the biggest shrinkage is:
2022-05-10 compact jitcode liveness info

This is the size of PyPy 2.7, btw. I can't use the 3.x variant for this comparison, because its size also changes due to newer Python versions being implemented. Interestingly enough, PyPy 3.x is mostly smaller than 2.7: PyPy 3.11 is 58.8 MiB, despite having a bunch more features. I never quite figured out why. Just for comparison, CPython 3.11 is 25.2 MiB.

Here's a plot of CPython/PyPy 3.x binary sizes:

CPython and PyPy binary sizes over time. CPython 3.7 is 35 MiB, shrinks to
22.5 MiB in 3.8, then slowly grows again until 3.11 at 25.2 MiB. CPython 3.12
is much bigger, at 37.3 MiB.  CPy 3.13 shrinks a bit again to 36.1 MiB. PyPy
tracks the ups and downs in roughly the same shape but is generally much
bigger. 3.7 is at 60.5, 3.8 at 56.2, then it slowly increases to 58.8 at PyPy
3.11

Porting the "Worst Datalog" to Python

I really enjoyed the two posts by Christophe Grand about Datalog, Writing the Worst Datalog Ever in 26loc and Half Dumb Datalog in 30 loc (even more than the first one), so I ended up porting them to Python.

There's a bunch of differences to the Clojure versions:

  • they are roughly twice as long
  • use generators and thus have to do the environment threading somewhat differently
  • a custom class for variables (no symbols in Python)
  • some dictionary copying is going on because of mutability (and I didn't want to think too much about when mutation is happening).

The semi-naive implementation ended up being roughly 50 lines, which is pretty reasonable I thought.
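To give a flavour of the points above, here's an illustrative sketch of the variable class and the environment threading with generators (this is not the code from the Gist mentioned below):

class Var(object):
    # stands in for Clojure symbols as logic variables
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return "?" + self.name

def unify(pattern, fact, env):
    # match a fact against a query pattern, returning an extended copy of the
    # environment, or None on mismatch
    env = dict(env)  # copy because of mutability, as mentioned above
    for p, f in zip(pattern, fact):
        if isinstance(p, Var):
            if p in env and env[p] != f:
                return None
            env[p] = f
        elif p != f:
            return None
    return env

def match_pattern(pattern, facts, env):
    # generator threading the environment through all matching facts
    for fact in facts:
        newenv = unify(pattern, fact, env)
        if newenv is not None:
            yield newenv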

The code for them both is in this Gist:

DFA Minimization using Rational Trees in Prolog

I was inspired by Phil Zucker's blog post Naive Automata Minimization. On Twitter he wrote:

Maybe a way to actualize this is using knot tied rational trees instead of integer state ids. But this has its own confusions.

I really wanted to try to implement this in SWI-Prolog.

Rational Trees

Rational trees are Prolog terms that are circular in some way. They can be created by unification:

?- X = f(X).
X = f(X).

Now the term X represents the term f(f(f(f(f(f(...)))))):

?- X = f(X), X = f(f(f(f(f(f(f(A))))))).
X = A, A = f(A).

This is a bit like appending a list to itself in Python:

>>> l = []
>>> l.append(l)
>>> l
[[...]]

SWI-Prolog is good at dealing with these recursive terms; it can unify them, etc. Python is less good: while repr does the right thing, as we saw above, == produces a recursion error:

>>> import copy
>>> copy.deepcopy(l)
[[...]]
>>> copy.deepcopy(l) == l
Traceback (most recent call last):
  File "<python-input-9>", line 1, in <module>
    copy.deepcopy(l) == l
RecursionError: maximum recursion depth exceeded

Defining DFAs with Rational Trees

We can use these rational trees in Prolog to represent deterministic finite automata (DFAs) over the alphabet with two characters, a and b. A DFA is a Prolog term representing the starting state. A state is represented by a term state(Accept, NextStateA, NextStateB) where Accept is true or false. The two next states are again state terms. A next state can also be none. To achieve loops in the DFA, we use rational trees to refer to outer states.

To represent a DFA that matches a+ we can use the term state(false, state(true, ..., none), none), where the ... is the second state term. We can build it like this:

dfa_a_plus(S1) :-
    S1 = state(false, S2, none),
    S2 = state(true, S2, none).
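For comparison, the same kind of circular structure in Python, continuing the list analogy from above (just an illustration, the rest of the post stays in Prolog):

# the a+ automaton as a circular Python structure: [accept, next_a, next_b]
s2 = [True, None, None]
s2[1] = s2              # the second state loops back to itself on 'a'
s1 = [False, s2, None]  # start state: not accepting, goes to s2 on 'a'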

We can also build a non-minimal DFA for the same language, with extra states:

dfa_a_plus_inefficient(S1) :- % less efficient example automaton for a+ with extra states
    S1 = state(false, S2, none),
    S2 = state(true, S3, none),
    S3 = state(true, S4, none),
    S4 = state(true, S2, none).

Those two terms unify successfully and are thus equivalent:

?- dfa_a_plus(S1), dfa_a_plus_inefficient(S2), S1=S2.
 S1 = S2, S2 = state(false, _S1, none), % where
     _S1 = state(true, state(true, state(true, _S1, none), none), none).

They unify because the Prolog rational trees are "observationally equivalent" in the sense of Phil's blog post, which the Prolog unification algorithm can check.

The string representation that SWI-Prolog prints is not minimal though. This shows that the printer uses some internal knowledge about how the trees were constructed, because from pure Prolog it should not be observable where the bound logic variable "was" in the structure (the term is ground).

Matching

So far we only defined automata, but we can also write a predicate that matches one against a list of a or b.

match([], state(true, _, _)).
match([a | Rest], state(_, StateA, _)) :- match(Rest, StateA).
match([b | Rest], state(_, _, StateB)) :- match(Rest, StateB).

Some example calls:

?- dfa_a_plus(S), match([a, a, a, a], S). % succeeds

?- dfa_a_plus(S), match([], S).
false.

?- dfa_a_plus(S), match([a, a, a, b], S).
false.

?- dfa_a_plus(S), match([b], S).
false.

Let's look at a more complicated DFA example, the one from the blog post:

dfa_blog_post(S1) :-
    S1 = state(false, S2, S3),
    S2 = state(false, S4, S3), % states S2 and S3 are equivalent
    S3 = state(false, S5, S3),
    S4 = state(true, S5, S4), % states S4 and S5 are equivalent
    S5 = state(true, S4, S4).

(there's a nice diagram in Phil's post.)

If we want to minimize it, we can assign a number to each unique state reachable from the starting state. We can use an association list, i.e. a list where the entries are terms State/Number. States that are observationally equivalent unify, so we can just use Prolog unification (via the "member" predicate which checks whether a term is an element of a list).

number_states(D, L) :-
    % number the states, the first number is 0
    number_states_helper(D, 0, _, [], LRev),
    % reverse the result, because it is built in reverse order
    reverse(LRev, L).

number_states_helper(none, Num, Num, L, L). % none doesn't need a number
number_states_helper(State, Num, Num, L, L) :-
    % if we already gave a number to the state, we are done.
    member(State/_, L).
number_states_helper(State, NumIn, NumOut, LIn, LOut) :-
    % if we've never seen the state, we assign it NumIn
    not(member(State/_, LIn)),
    L1 = [State/NumIn | LIn],
    State = state(_, NextStateA, NextStateB),
    succ(NumIn, Num1),
    number_states_helper(NextStateA, Num1, Num2, L1, L2),
    number_states_helper(NextStateB, Num2, NumOut, L2, LOut).

We can apply this predicate to the non-minimal DFA:

:- dfa_blog_post(D), number_states(D, L), length(L, Length), write(Length), nl, write(L), nl.
3
@([state(false,state(false,S_1,S_2),S_2)/0,state(false,S_1,S_2)/1,S_1/2],[S_1=state(true,state(true,S_1,S_1),S_1),S_2=state(false,state(true,S_1,S_1),S_2)])

The list L of numbered states is kind of hard to grasp due to the messy way that rational tree terms get printed, but the important part is that we found out that three states is the minimal size.

After we numbered the states we can use that numbering to produce a graphviz graph for the DFA:

to_dot(D) :-
    number_states(D, L),
    write("digraph G {"), nl,
    write('start [label="", shape=none]'), nl,
    write("start -> 0"), nl,
    to_dot_list(L, L),
    write("}"), nl.

to_dot_list([], _).
to_dot_list([State/NumState | Rest], FullList) :-
    State = state(Accept, StateA, StateB),
    to_dot_node(Accept, NumState),
    to_dot_edge(StateA, NumState, FullList, a),
    to_dot_edge(StateB, NumState, FullList, b),
    to_dot_list(Rest, FullList).

to_dot_node(true, Num) :-
    write(Num), write(" [shape=doublecircle]"), nl.
to_dot_node(false, Num) :-
    write(Num), write(" [shape=circle]"), nl.

to_dot_edge(none, _, _, _).
to_dot_edge(State, NumPrev, FullList, Char) :-
    member(State/NumState, FullList),
    write(NumPrev), write(" -> "), write(NumState), write(" [label="), write(Char), write("]"), nl.

If we apply this to the example we get:

digraph G {
start [label="", shape=none]
start -> 0
0 [shape=circle]
0 -> 1 [label=a]
0 -> 1 [label=b]
1 [shape=circle]
1 -> 2 [label=a]
1 -> 1 [label=b]
2 [shape=doublecircle]
2 -> 2 [label=a]
2 -> 2 [label=b]
}

Which looks like this:

DFA vizualization

Optimizing Collatz in PyPy

Somebody was Wrong on the Internet about the performance of Python, using the computation of Collatz sequences as a benchmark. Therefore I had to try to measure how long this takes on PyPy, with and without adding a cache for already-seen numbers. It takes about 4 min on my laptop for 1 billion numbers, or 30 s with a cache. Code is here:
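For illustration, the benchmark is roughly of this shape (this sketch is not the actual code from the gist, which may well differ, for example in how it bounds the cache):

import sys

def collatz_length(n, cache=None):
    # number of steps to reach 1; optionally reuse already computed lengths
    path = []
    while n != 1 and (cache is None or n not in cache):
        path.append(n)
        n = 3 * n + 1 if n % 2 else n // 2
    if cache is None:
        return len(path)
    length = cache.get(n, 0)
    for m in reversed(path):
        length += 1
        cache[m] = length
    return length

def main(limit, use_cache=True):
    cache = {} if use_cache else None
    longest = 0
    for i in range(1, limit + 1):
        longest = max(longest, collatz_length(i, cache))
    print(longest)

if __name__ == "__main__":
    main(int(sys.argv[1]))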

A JIT for Deterministic Automata using PeachPy

For a lesson about JITs in a colleague's compiler course I wrote a small just-in-time compiler for deterministic finite automata in Python, using PeachPy. I planned to write a complete blog post about them, but it doesn't seem to quite be happening, so below is the code. It has a few comments at least.

The idea is to show a few of the typical things a JIT can do that a static compiler can't:

  • switching between interpreter and compiled code
  • profiling to find out which parts of the program are executed most commonly
  • deoptimization back to the interpreter
  • patching of existing code if new code is added
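Just to illustrate the first two points in plain Python (this is not the PeachPy code from the course material; closures stand in for the generated machine code):

THRESHOLD = 100

class DFA(object):
    def __init__(self, transitions, accepting):
        self.transitions = transitions  # dict mapping (state, char) -> next state
        self.accepting = accepting      # set of accepting states
        self.counters = {}              # profiling: how often each state was interpreted
        self.compiled = {}              # state -> "compiled" closure

    def run(self, string):
        state = 0
        for char in string:
            compiled = self.compiled.get(state)
            if compiled is not None:
                state = compiled(char)  # fast path: run the "compiled" code
            else:
                # slow path: interpret the transition table and profile the state
                self.counters[state] = self.counters.get(state, 0) + 1
                if self.counters[state] > THRESHOLD:
                    self.compiled[state] = self._compile(state)
                state = self.transitions.get((state, char), -1)
            if state == -1:
                return False
        return state in self.accepting

    def _compile(self, state):
        # "compile" one state into a closure over its outgoing transitions
        table = dict((c, t) for (s, c), t in self.transitions.items() if s == state)
        return lambda char: table.get(char, -1)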

Here's the code:

Understanding William Forsythe's "City of Abstracts"

(if you don't want to read anything just click this button to play around: TLDR)

Last year I had the chance to experience the choreographer William Forsythe's video installation City of Abstracts in the Folkwang museum in Essen. Here's a video of a few people in front of it:

The installation consists of an open space, a camera and a big screen. The camera films the people that are moving and standing in the space, the screen shows a distorted version of the filmed images. The distortion seems to have the following properties: people that are standing still appear only minimally distorted on the screen (for example the person on the left in the back, in the first part of the video). People that are moving across the field of view of the camera turn into diagonally stretched versions of themselves (for example the person moving from right to left at around 0:20). The head is ahead (hrm) in their direction of movement. If they then stand still, their heads stand still first, while the rest of their bodies slowly catch up, from top to bottom.

While I was standing in the museum playing around, I started to wonder how the installation worked. At first I thought it must be a fairly complicated effect, that somehow analyzed motion and did a complicated transformation. However, after some more playing, running back and forth etc. I became convinced that the program is actually quite simple. In the following I want to explain how it works and recreate the effect in Javascript.

A Miniature City

Because it is quite hard to reason about what happens in a big video, let's start by looking at a miniature version of the problem. The following is a 5x5 pixel input video where a stack of pixels first moves left, then right, and then back and forth for a bit:

We want to find out how this input video has to be transformed into an output video in a way that recreates the effect of Forsythe's installation.

Above we saw that the bottom of a moving object is somehow further back in time than the top. This could mean that the output video uses older and older lines of pixels, from top to bottom. The top row of pixels would be from the current camera input, the second row from one frame back, the third row from two frames back, and so on. Let's try this for the miniature input video:

This seems to recreate the effect! When the stack of pixels moves left, it is transformed into a bottom left to top right diagonal, when it moves back it slowly becomes a top left to bottom right diagonal. When it stands still, the top pixels stand still first in the output, the lower pixels slowly catch up. Below you can look at the output pixel video:

The Real Thing

Now that we know how the effect works, we can apply it to a real video, using a webcam. If you click Start and give permission, you can try it yourself (The webcam video is not transmitted anywhere, just shown below). The left video is the webcam output, the right video is the transformed version. If you are interested in the Javascript code, you can use the Glitch button in the top right corner to inspect (and edit) it.

The first video is directly the output of the webcam, the second video shows the "timeshifted" output as explained above. To see anything interesting happening, you need to move around in front of the webcam. Sideways motions are easiest to understand. Remember that the second video takes a while to catch up with your movements. Some fun things to try: walking across the room, slowly shaking your head, jumping, rotating your arms...

(Aside: I am quite impressed that it works just fine to implement this piece of video processing in Javascript and have it run relatively fluidly in a browser on a phone.)

One parameter we can play with in the above transformation is the block size: we can make the lower pixels catch up faster with the upper pixels by doing the transformation block-wise, instead of a single row of pixels at a time. If you increase the block size below, the output video becomes less laggy, but on the other hand more blocky.
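In code, the whole effect boils down to picking each output row from an older frame. Here is a rough Python/numpy sketch of the idea, including the block size parameter (the actual implementation on the page is in Javascript and works on webcam frames, so the names and the frame layout here are assumptions):

import numpy as np

def timeshift_frame(history, block_size=1):
    # history[0] is the newest webcam frame, history[k] is k frames old;
    # output row r is taken from the frame that is r // block_size frames old
    newest = history[0]
    out = np.empty_like(newest)
    for row in range(newest.shape[0]):
        age = min(row // block_size, len(history) - 1)
        out[row] = history[age][row]
    return out

Here history needs to hold the last height // block_size frames, newest first; with a block size of 1 that is one remembered frame per row of the output.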


Relationship to Rolling Shutter

An amusing sidenote is that we can use the transformation to understand the rolling shutter effect. Rolling shutter happens when taking a picture of a fast moving object with a smartphone camera. Those cameras don't record the whole image at once, but do it line by line. And if the object moves very fast relative to the time the phone takes to record the whole image, different parts of the final image show the object at different points in time (and thus space). What the effect above does is artificially slow down the creation of a picture. Therefore we can recreate the effects that rolling shutter gives with objects that move much more slowly, for example by rotating an arm.

Porting Jürgen LIT Fischer's "sinus" from Fortran to Javascript

LIT Fischer developed his light-graphics step by step, an oscillating rhythmically-flowing, undulating movements on the screen – "heterophonics" – as he calls them. The variety of the undulating lightwaves, often brought about by minute deviations from the original program, fascinate the artist. With his light-movement studies he actually succeeds in visualizing series of musical tones, sound sequences and meditative song. Dr. Eva Karcher

I've been fascinated by the light art installations of the artist Jürgen LIT Fischer for a while, e.g. "Fraktal" as part of the Tetraeder in Bottrop:

Tetraeder at Night

Photo CC-BY Mirek Claßen

Later I found out that LIT Fischer used to do computer art in the 1980s using Fortran and a plotter. Here are two examples:

Mit Zwischenraum, hindurchzuschaun

"Mit Zwischenraum, hindurchzuschaun" ("with space between to look through"), Jürgen LIT Fischer, 1986. Aside: the title is a part of a poem by Christian Morgenstern, "Der Lattenzaun" (click through for an English translation)

Intervalle

"Intervalle" ("intervals"), Jürgen LIT Fischer, 1987

Dreiklang

Dreiklang

"Dreiklang" ("triad"), Jürgen LIT Fischer, 1987

Dreiklang, top three panels

"Dreiklang", top three panels, 1987

Looking through some of the books about his exhibitions I found that one of them contained fragments of the Fortran code that he used as the basis of one of the installations:

Fortran sinus

Despite not knowing Fortran I decided to still try to port the code to Javascript (which I of course also don't really know, but at least it's easier to Google and more importantly, to run). I guessed that the Z* functions were responsible for communicating with the plotter. In particular that the ZDRAW function had to move the plotter pen somehow.

After some amount of fiddling I managed to produce the output below. I used different parameters than in the Fortran source, but reading some more about his working style, Fischer anyway played around with the parameters a lot (in fact, his final output is usually a layer of many different semi-transparent versions of the outputs of the same program, using different colors).

What we see is basically a series of sine waves, plotted at a fixed distance below each other. They all have the same frequency but the amplitude goes up from 0 to some value, and then down to 0 again. What I find cool is how simple the ingredients are: just lots of sine waves, but the result is still quite an interesting effect.

Compared to the Fortran code I changed things somewhat: I fixed the code duplication between the code that computes the upper and the lower half of the waves and I removed the extra level of loops responsible for increasing the stroke width by drawing the same line several times with a tiny offset (that's much easier in svg!). I also tried to translate the variable names from abbreviated German to English. If you want to read or modify the source code, use the Glitch button in the upper right corner!

The Discontinuity in the Middle

In the picture of the real artwork above, those waves are split up into three acrylic glass panels. In the top middle panel there is what looks like some strange discontinuity in the middle.

Discontinuity image

"Dreiklang", top middle panel, rotated

I suppose that was done intentionally, but I still wanted to try to figure out where it came from. And indeed, playing with the parameters some more, we can reproduce it:

To understand this effect, let's look at an even more exaggerated view:

I have added in red to the left the size of the amplitude of each sine wave. On the peak of every wave, the lines above bunch together, and the lines below are spread apart. The eye notices the sharp jump between the distances of the lines in the middle. The reason for that jump is the fact that the amplitudes of the stacked waves vary linearly up, and then down. But since that interpolation function is piecewise linear, it has a non-smooth part in the middle.

This is going beyond understanding the original code, but we can fix that! Instead of interpolating the amplitudes linearly, we can do it smoothly, e.g. again with a sine wave:
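In code, the construction is a stack of identical sine waves with a per-wave amplitude envelope, and the fix is simply to swap the piecewise linear envelope for a smooth one. A small Python sketch of the idea (the actual port is Javascript producing svg and uses different parameter names):

import math

def wave_points(num_waves, width, spacing, max_amp, cycles, smooth=False, steps=200):
    # returns one list of (x, y) points per stacked sine wave; all waves share
    # the same frequency, only the amplitude envelope varies from wave to wave
    waves = []
    for i in range(num_waves):
        t = i / (num_waves - 1.0)
        if smooth:
            amp = max_amp * math.sin(math.pi * t)   # smooth envelope, no kink
        else:
            amp = max_amp * (1 - abs(2 * t - 1))    # piecewise linear, kink in the middle
        baseline = i * spacing
        points = []
        for j in range(steps + 1):
            x = width * j / float(steps)
            y = baseline + amp * math.sin(2 * math.pi * cycles * j / float(steps))
            points.append((x, y))
        waves.append(points)
    return waves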

Interactive Version

If you want to play around with the parameter space too, here's an interactive version:

It has sliders for the number of cycles, the height of the drawing, the number of steps, the size of the amplitude, and the stroke width, plus controls for the color and for smooth interpolation.

right-click and save as to download svg