A Hint for Language-Specific Runtime Function Optimization in RPython's Meta-JIT

Meta Note: This post was in my draft folder for the PyPy blog for three years, and various people gave the feedback "wtf are you even talking about". Since I haven't worked on it since then I clearly am not going to fix anything about it, so I'll just post it almost unchanged to my personal blog now.

RPython's meta-JIT cannot reason about the properties of many RPython implementation-level functions, because those are often specific to the individual interpreters written in RPython. Expressing language-specific optimizations on these functions would allow the JIT to reason more effectively about language-specific data structures and potentially remove costly calls into the runtime support code. In this blog post I will describe a new hint that makes it possible to express invariants about the runtime functions that operate on the data structures of a specific language interpreter. These hints make it possible for the meta-JIT optimizer to do more language-specific optimizations across function calls, without having to change the meta-JIT itself.

Introduction and Background

RPython has a meta-JIT that can be applied to a variety of languages, one of them being Python in the form of PyPy (which is really two implementations, one PyPy2, one PyPy3). Most modern languages have many built-in functions and data types that are written in the implementation language (RPython in our case, C in the case of CPython). We want the JIT optimizer to be able to reason about these functions and data types.

One way to do this would be to extend the meta-JIT by adding a new language-specific optimization pass that knows how to deal with these functions. This approach however goes against the idea of a meta-JIT, since we want the JIT to be language independent.

So in the past we developed various tools that make it possible for the authors of RPython interpreters to express such optimizations. These tools take the form of hints: small changes in the RPython source of the interpreter that communicate with the meta-JIT. Hints come in two major forms: either they are special functions that will be called at runtime from the RPython code of the interpreter, or they are decorators that can be applied to the RPython functions that make up the interpreter. Both the special functions and the decorators are part of the rpython.rlib.jit library.

One thing I want to stress here at the beginning is that all these hints, and everything I talk about in the rest of the blog post, are only needed by the people who work on the RPython code that makes up the PyPy Python interpreter (or other languages implemented in RPython). "Regular" Python code that just runs on PyPy does not need these hints, and in fact cannot use them.

The @elidable Decorator

One very important decorator is called @elidable, which makes it possible to mark an RPython function as pure (actually a small generalization, but for the purpose of this post thinking of elidable functions as pure is fine). This decorator was described in a blog post in 2011, when it was still called @purefunction. The post later turned into a paper.

An elidable function will be constant-folded by the JIT if all its arguments are constant. If it can't be removed because some arguments aren't constant, it's still subject to common subexpression elimination (CSE). Subsequent calls to the same function with the same arguments can be removed and replaced by the previous result.

There are lots of examples of @elidable functions, both in RPython in general and in the PyPy interpreter. In the post linked above they are used for optimizing object operations in a tiny example interpreter.

But a lot of functionality that is more directly exposed to the Python programmer is elidable too. Most string methods are elidable, for example: str.lower will always return the same result for a given argument, and if you call it twice on the same string, it will return the same result both times. So adding the @elidable decorator to the implementation of these string methods allows the JIT to constant-fold them if they are applied to a constant string, and to remove a second identical call.

Another big class of examples are the implementation functions for our big integer operations, which underlie the Python int type once the values don't fit into a machine word anymore. On the implementation level we implement those in an rbigint class, and most of its methods are also elidable. This enables the JIT to constant-fold big integer addition, and to do CSE on big integer arithmetic.
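The effect of CSE on elidable calls can be modeled in plain Python as a cache keyed by function and arguments. This is a toy model of the semantics only, not of how the JIT is actually implemented (the real optimizer works on traces, not by wrapping calls):

```python
# Toy model: repeated calls to an elidable function with identical
# arguments can be served from a cache instead of being re-executed.
def make_cse_cache():
    seen = {}
    def call(func, *args):
        key = (func,) + args
        if key not in seen:      # first occurrence: really execute
            seen[key] = func(*args)
        return seen[key]         # later occurrences: reuse the result
    return call

call = make_cse_cache()
assert call(str.lower, "ABC") == "abc"   # executed
assert call(str.lower, "ABC") == "abc"   # served from the cache
```

Constant folding is the special case where all arguments are known at trace time, so the cached result can be baked into the trace as a constant.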

Limitations of @elidable

This is all very useful! But there is still only a limited set of properties that the interpreter author can express to the JIT with this one function decorator. Recently (i.e. in 2022) I merged a branch that adds two new hints that the interpreter author can use to communicate possible optimizations to the meta-JIT! The work had been an ongoing project for a while. So far, only the meta-JIT support has been merged; in future enhancements we plan to apply the hints to the PyPy Python interpreter where appropriate.

record_known_result

So, what are the new hints? One of them is called record_known_result, and that hint is what this blog post is about. The other is called record_exact_value; it's conceptually quite simple, but it's much harder to see how it can be used. It was implemented by Lin Cheng from Cornell, and it is described (together with a possible optimization to PyPy that uses the hint) in another paper.

What is record_known_result used for? One of the limitations of @elidable is that there are often properties that connect the results of several function calls to each other. Sometimes there are functions that are inverses of each other, so that f(g(x)) == x for all x (example: negation on numbers is its own inverse, --x == x). Sometimes functions are idempotent, which means that if you call the function several times you can remove all but the first call. An example would be abs on numbers: after the first call to abs the result is positive, so calling the function again on the result has no effect, i.e. abs(abs(x)) == abs(x). These properties could in theory be nicely used by a hypothetical optimizer that knows about them and the functions. However, as described above, we don't want to change the meta-JIT to add knowledge about interpreter-specific functionality. So we wanted to add a hint that can express such properties to the meta-JIT.

What could hints that make it possible to express these properties look like? At first I was experimenting with declarative decorators like @idempotent and @is_inverse_of(func), but I felt it wouldn't scale to add lots of these decorators and support for all of them in the meta-JIT. In the end I found a not fully obvious hint that is not a decorator, but that is powerful enough to express at least these two properties and many similar patterns. This hint piggy-backs on the existing CSE implementation for pure functions in the meta-JIT.

The hint works as follows: It is a new function called record_known_result(result, func, *args). Calling that function has no direct effect at runtime. Instead, it communicates to the meta-JIT that if it sees a function call to func with the arguments *args, it can replace the result of that function call with result.

Since this is pretty abstract, it's best to look at an example. How would you use this to express the fact that rbigint_neg, the function in the runtime that is responsible for computing the negation of a big integer, is its own inverse? The implementation of rbigint_neg looks roughly like this (it's actually a method and a tiny bit more complicated, but not really significantly so):

@elidable
def rbigint_neg(self):
    return rbigint(digits=self.digits, sign=-self.sign)

If we want to use the new hint to express that rbigint_neg(rbigint_neg(x)) == x, we need to rewrite the function somewhat, by introducing a pure helper function that does the actual computation, and turning the original function into a wrapper that calls the helper:

@elidable
def _rbigint_neg_helper(self):
    return rbigint(digits=self.digits, sign=-self.sign)

def rbigint_neg(self):
    res = _rbigint_neg_helper(self)
    record_known_result(self, _rbigint_neg_helper, res)
    return res

Now, when we trace the rbigint_neg function, the record_known_result hint tells the JIT the following information: if at any point in the future (meaning further down the trace) we see another call to _rbigint_neg_helper with res as the argument, we can replace that call directly with self, which is exactly the property that _rbigint_neg_helper is its own inverse. As another example, let's express the idempotence of bytes.lower. We can imagine the implementation looking something like this (the "real" implementation is actually quite different; among other things, we don't want the extra copy that bytes.join makes):

@elidable
def bytes_lower(b):
    # implementation looks very different in practice
    # just an illustration!
    res = ['\x00'] * len(b)
    for i, c in enumerate(b):
        if 'A' <= c <= 'Z':
            c = chr(ord(c) - ord('A') + ord('a'))
        res[i] = c
    return b"".join(res)

To express that the function is idempotent, we need to express that bytes_lower(bytes_lower(b)) == bytes_lower(b). We do this with the same approach: move the implementation into a helper function, call the helper from the original function, and call record_known_result too:

@elidable
def _bytes_lower_helper(b):
    ... # as above

def bytes_lower(b):
    res = _bytes_lower_helper(b)
    record_known_result(res, _bytes_lower_helper, res)
    return res

This tells the meta-JIT that if res is later itself passed to _bytes_lower_helper, it can remove that call and replace its result immediately with res (because res is already all lower-cased, as it's the result of a call to lower), i.e. that _bytes_lower_helper is idempotent. (There are also other properties of lower and upper that we could express in this way, for example that bytes.lower(bytes.upper(x)) == bytes.lower(x), but let's leave it at that for now.)

Both of these usage patterns of record_known_result could of course also be pulled out into general decorators again. For example a generic @idempotent decorator could be implemented like this:

def idempotent(func):
    # idempotent implies elidable
    func = elidable(func)
    def wrapper(arg):
        res = func(arg)
        record_known_result(res, func, res)
        return res
    return wrapper

Then the decorator could be used like this for bytes_lower:

@idempotent
def bytes_lower(b):
    # implementation as in the original code above
    ...
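The inverse pattern from rbigint_neg could similarly be turned into a hypothetical @self_inverse decorator. This is only a sketch: such a decorator is not part of rpython.rlib.jit, and the stand-in definitions of elidable and record_known_result below exist just to make the example self-contained and runnable on plain Python:

```python
def elidable(func):
    # stand-in for rpython.rlib.jit.elidable (a no-op at runtime)
    return func

def record_known_result(result, func, *args):
    # stand-in; the real hint also has no direct runtime effect
    pass

def self_inverse(func):
    # being a self-inverse implies being elidable
    func = elidable(func)
    def wrapper(arg):
        res = func(arg)
        # tell the JIT: applying func to res yields arg again
        record_known_result(arg, func, res)
        return res
    return wrapper

@self_inverse
def neg(x):
    return -x
```

With this decorator, tracing neg(neg(x)) would let the JIT replace the inner helper call's second application with x, just as in the hand-written rbigint_neg example.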

Implementing record_known_result

How is record_known_result implemented? As I wrote above, the implementation of the hint builds on the existing support for elidable functions in the optimizer of the meta-JIT. There are several optimizations that do something with elidable function calls: constant folding, CSE, dead code elimination. Let's look at how those work on elidable functions:

  • Constant folding removes calls to elidable functions with constant results (technically this is a bit complicated, but conceptually this is what happens).

  • CSE will replace calls to an elidable function by previous results, if they appear a second time further down the trace.

  • Dead code elimination will remove elidable function calls in the trace that have unused results.

So if there is a trace like this:

r1 = call_elidable((f), (1)) # constant-folded to 17
r2 = call_elidable((g), a, b)
r3 = call_elidable((g), a, b) # replaced by r2
r4 = call_elidable((h), c, d) # removed, result unused
print(r1, r2, r3)

It will be optimized to:

r2 = call_elidable((g), a, b)
print((17), r2, r2)

Some general notes about these traces: They are all in single-static-assignment form (SSA), meaning that every variable is assigned to only once. ¹ They are also slightly simplified compared to "real" traces.

Let's look at how the CSE pass that optimizes elidable calls, which is part of the meta-JIT, works. In pseudocode it could look something like this:

def cse_elidable_calls(trace):
    seen_calls = {}
    output_trace = []
    for op in trace:
        if is_call_elidable(op):
            # op.args are the function,
            # followed by the arguments
            # which are variables or constants
            key = op.args
            previous_op = seen_calls.get(key)
            if previous_op is not None:
                replace_result_with(op, previous_op)
                # don't need to emit the op
                continue
            else:
                seen_calls[key] = op
        output_trace.append(op)
    return output_trace

There is quite a bit of hand-waving here, particularly around how replace_result_with can work. But this is conceptually what the real optimization does. ²
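To make the hand-waving a little more concrete, here is a runnable miniature version of such a pass. It is my own toy, not the real optimizer: trace operations are stand-in tuples of the form ("call", result_var, func_name, arg, ...), and a plain dict substitutes for the union-find structure mentioned in the footnote:

```python
def cse_elidable_calls(trace):
    seen_calls = {}  # (func, args...) -> result var of earlier call
    replace = {}     # removed result var -> var that stands in for it
    output = []
    for op in trace:
        kind, res, func, args = op[0], op[1], op[2], op[3:]
        # apply earlier replacements to the arguments
        args = tuple(replace.get(a, a) for a in args)
        if kind == "call":
            key = (func,) + args
            if key in seen_calls:
                # same elidable call seen before: reuse its result
                replace[res] = seen_calls[key]
                continue  # don't emit the duplicate call
            seen_calls[key] = res
        output.append((kind, res, func) + args)
    return output, replace

# a tiny example trace with two identical elidable calls
trace = [
    ("call", "r2", "g", "a", "b"),
    ("call", "r3", "g", "a", "b"),  # duplicate, will be removed
]
optimized, replace = cse_elidable_calls(trace)
assert optimized == [("call", "r2", "g", "a", "b")]
assert replace == {"r3": "r2"}
```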

Making use of the information provided by record_known_result is done by changing this CSE pass. Let's say you trace something like this:

x = bytes_lower(s)
... some other code ...
y = bytes_lower(x)
print(x, y)

This should trigger the idempotence optimization. The resulting trace could look like this:

# bytes_lower itself is inlined into the trace:
r1 = call_elidable((_bytes_lower_helper), s1)
record_known_result(r1, (_bytes_lower_helper), r1)
... intermediate operations ...

# second call to bytes_lower inlined into the trace:
r2 = call_elidable((_bytes_lower_helper), r1)
record_known_result(r2, (_bytes_lower_helper), r2)
print(r1, r2)

The CSE pass on elidable functions will now optimize away the call that results in r2. It does this not by replacing r2 by a previous call to _bytes_lower_helper with the same arguments (such a call doesn't exist), but instead makes use of the information conveyed by the first record_known_result trace operation. That operation states that if you see a call like the second _bytes_lower_helper you can replace it with r1. The resulting optimized trace therefore looks like this:

r1 = call_elidable((_bytes_lower_helper), s1)
... intermediate operations, optimized ...
# call removed, r2 replaced with r1
print(r1, r1)

The record_known_result operations are also removed, because further optimization passes and the backends don't need them. To get this effect, we have to change the pseudocode above to teach the CSE pass about record_known_result operations in the following way:

def cse_elidable_calls(trace):
    seen_calls = {}
    output_trace = []
    for op in trace:
        # <---- start new code
        if is_record_known_result(op):
            # remove the first argument,
            # which is the result
            key = op.args[1:]
            # the remaining key is function called,
            # followed by arguments, like below
            seen_calls[key] = op.args[0]
            # don't emit the record_known_result op
            continue
        # end new code ---->
        if is_call_elidable(op):
            # op.args are the function,
            # followed by the arguments
            # which are variables or constants
            key = op.args
            previous_op = seen_calls.get(key)
            if previous_op is not None:
                replace_result_with(op, previous_op)
                # don't need to emit the op
                continue
            else:
                seen_calls[key] = op
        output_trace.append(op)
    return output_trace

That's all! So from the point of view of the implementation of CSE of elidable functions, the new hint is actually very natural.
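The same extension can be applied to the toy miniature from before (again my own stand-in, not the real optimizer), with ("record", result_var, func_name, arg, ...) tuples playing the role of record_known_result operations:

```python
def cse_elidable_calls(trace):
    seen_calls = {}
    replace = {}
    output = []
    for op in trace:
        kind, res, func, args = op[0], op[1], op[2], op[3:]
        args = tuple(replace.get(a, a) for a in args)
        if kind == "record":
            # remember: a call to func with these args has result res
            seen_calls[(func,) + args] = replace.get(res, res)
            continue  # the record op itself is never emitted
        if kind == "call":
            key = (func,) + args
            if key in seen_calls:
                replace[res] = seen_calls[key]
                continue
            seen_calls[key] = res
        output.append((kind, res, func) + args)
    return output, replace

# the idempotence example: lower applied to its own result
trace = [
    ("call", "r1", "lower", "s1"),
    ("record", "r1", "lower", "r1"),  # lower(r1) == r1
    ("call", "r2", "lower", "r1"),    # removed, r2 replaced by r1
]
optimized, replace = cse_elidable_calls(trace)
assert optimized == [("call", "r1", "lower", "s1")]
assert replace == {"r2": "r1"}
```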

In the case of function inverses, dead code elimination also plays an important role. Let's look at the trace of a double negation, maybe like this: x = -y; ...; print(-x):

r1 = call_elidable((_rbigint_neg_helper), a1)
record_known_result(a1, (_rbigint_neg_helper), r1)
... intermediate stuff
r2 = call_elidable((_rbigint_neg_helper), r1)
record_known_result(r1, (_rbigint_neg_helper), r2)
print(r2)

After CSE, the second call is removed and the trace looks like this, because r2 was found to be the same as a1:

r1 = call_elidable((_rbigint_neg_helper), a1) # dead
... intermediate stuff, CSEd
# call removed
print(a1)

Now dead code elimination notices that the first call is not needed any more either and removes it.
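Dead code elimination can be sketched in the same toy trace format (not the real optimizer, which is a single forward pass as footnote ² explains): walk the trace backwards, track which variables are still used, and drop elidable calls whose results nobody reads. The sketch assumes every "call" op is side-effect-free:

```python
def dead_code_elimination(trace):
    used = set()
    output = []
    for op in reversed(trace):
        kind, res = op[0], op[1]
        args = op[3:] if kind == "call" else op[2:]
        if kind == "call" and res not in used:
            continue  # elidable call with an unused result: drop it
        used.update(args)  # the op's arguments are now known to be used
        output.append(op)
    output.reverse()
    return output

# after CSE on the double negation, r2 was replaced by a1, so the
# remaining negation's result r1 is no longer used anywhere:
trace = [
    ("call", "r1", "neg", "a1"),
    ("print", None, "a1"),
]
assert dead_code_elimination(trace) == [("print", None, "a1")]
```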

What is good about this design? It very neatly ties into the existing infrastructure and required only about 100 lines of changes in the meta-JIT. The amount of work the optimizer does stays essentially the same, as the new hints are directly usable by the CSE pass that we run anyway.

Performance effects

So far, we haven't actually used this new hint in PyPy much. At this point, the hint is only a new tool in the interpreter author's toolbox, and we still need to find the best places to use it. The only use of the hint so far is an annotation that tells the JIT that encoding and decoding to and from utf8 are inverses of each other, to be able to optimize code like x = someunicode.encode("utf-8").decode("utf-8") by replacing x with someunicode (of course in practice there is usually again some distance between the encode and decode calls). This pattern happens in a bunch of places in real code that I saw, but I haven't done a careful study of the performance effect yet.

Limitations

What are the problems and the limitations of the approach I described in this post?

Correctness remains tricky! If you write the wrong hints, the meta-JIT will potentially miscompile your users' programs. To at least get some signal for that, record_known_result actually performs the hinted call and does an assert on the result if you run the program untranslated while executing tests. In combination with for example property-based testing this can find a lot of the bugs, but is of course no guarantee.
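Conceptually, the untranslated safety check behaves like this sketch (the real implementation in rpython.rlib.jit is more careful, e.g. about how results are compared):

```python
def record_known_result(result, func, *args):
    # when running untranslated (i.e. on top of CPython, during
    # testing): actually perform the hinted call and verify the
    # claimed result. After translation this check disappears and
    # the call only informs the meta-JIT.
    assert func(*args) == result

record_known_result(5, abs, -5)       # fine: abs(-5) == 5
try:
    record_known_result(4, abs, -5)   # wrong hint: caught by assert
except AssertionError:
    pass
```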

Many things aren't expressible. The new hint is much less powerful than some of the recent pattern-based optimization systems (e.g. Metatheory.jl) that allow library authors to express rewrites. Instead, we designed the hint to fit minimally into the existing optimizers, at the cost of power and ease of use. The most obvious limitation compared to pattern-based approaches is that the record_known_result hint cannot quantify over unknown values; it can only use values that are available in the program. As an example, it's not really possible to express that bigint_sub(x, x) == bigint(0) for arbitrary big integers x.

Another limitation of the hint is that currently it is only applicable to pure/elidable functions. This makes it not really applicable to any kind of mutable data structure. As an example, in theory sorted(list) is idempotent, but only as long as the lists involved aren't mutated between the two calls to sorted. Reasoning about mutation doesn't really fit into the model easily. The meta-JIT itself is actually able to do a lot of tracking of what kinds of mutations occurred and what the heap must look like. But we haven't found a good way to combine this available information with user-provided information about function behaviour.

Conclusion

We added two new hints to RPython's meta-JIT that allow the interpreter author to express language-specific optimizations. We are still only getting used to these new hints and their possible applications and will need to collect more experience about how big the performance implications are in practice for real programs.

Footnotes

¹ In fact, there is not really a concept of "variable" at all, instead all variables are identical with the operations that produce them.

² Some details on the hand-waving: replacing ops with other ops is implemented using a union-find data-structure to efficiently allow doing arbitrary replacements. These replacements need to influence the lookup in the seen_calls dict, so in practice it's not even a dictionary at all. Another way that the pseudocode is simplified is that we don't in practice have tiny passes like this that go over the trace again and again. Instead, we have a single optimization pass that goes over the trace in forward direction once.