gh-146073: Add fitness/exit quality mechanism for JIT trace frontend #148089
cocolato wants to merge 35 commits into python:main
Conversation
It appears that the current parameters do not yet guarantee runtime safety; I will continue to work on fixes and optimizations.
I've commented on the issue #146073 (comment)
@markshannon Thanks for the review! I'm holding off on changing the fitness parameters for now, but I can run some benchmarks if you think it's necessary.
Still seeing a big slowdown in richards on https://github.com/colesbury/fastmark. Main of this branch: This branch: I'm going to check if this is affecting the optimizer somehow.
…ER_EXECUTOR for RESUME
Treat back edges as an exit, not a penalty; this way traces are more likely to end at a back edge instead of at random spots.
Increasing the max trace length is only going to help if the trace is stopping too early.
I think we can safely reduce the max trace length; let me do that. There were two problems with the older code:
The new code has almost no slowdown on richards, and a huge speedup on the telco benchmark. Main:
markshannon left a comment:
A few more comments.
There are still a few places where we special-case situations that fitness should handle; those special cases can be removed.
```c
/* Exit quality thresholds: trace stops when fitness < exit_quality.
 * Higher = trace is more willing to stop here. */
#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL)
```
Suggested change:

```diff
-#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL)
+#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL - AVG_SLOTS_PER_INSTRUCTION*4)
```
FITNESS_INITIAL is too high a value for this, but not by much. We want to unroll tiny loops a bit and, more importantly, we don't want to special case the start instruction to avoid zero length traces.
```c
 * N_BACKWARD_SLACK more bytecodes before reaching EXIT_QUALITY_CLOSE_LOOP,
 * based on AVG_SLOTS_PER_INSTRUCTION. */
#define N_BACKWARD_SLACK 50
#define EXIT_QUALITY_BACKWARD_EDGE (EXIT_QUALITY_CLOSE_LOOP / 2 - N_BACKWARD_SLACK * AVG_SLOTS_PER_INSTRUCTION)
```
NOTE:
The problem here is that when tracing loops, we are treating the start of the loop as the closing point, but we want to stop at the end of the loop otherwise.
We probably need to make the back edge quality calculation a bit more complex.
- if the jump is to the loop closing point: exit_quality = 0 (to ensure loop is closed)
- otherwise: exit_quality = high, roughly (FITNESS - 10 * AVG_SLOTS_PER_INSTRUCTION)
(This can be fixed in a separate PR if it would complicate this PR too much.)
I'd prefer to do this in the next PR.
```c
/* Backward edge penalty for JUMP_BACKWARD_NO_INTERRUPT (coroutines/yield-from).
 * Smaller than FITNESS_BACKWARD_EDGE since we want to trace through them. */
#define EXIT_QUALITY_BACKWARD_EDGE_COROUTINE (EXIT_QUALITY_BACKWARD_EDGE / 8)
```
Why are we treating these backward edges differently? They may be in smaller loops, but N_BACKWARD_SLACK already handles that.
I think this will help the tracer trace through coroutines' short loops.
```c
    target_instr == tracer->initial_state.close_loop_instr) {
    return EXIT_QUALITY_CLOSE_LOOP;
}
else if (target_instr->op.code == ENTER_EXECUTOR && !_PyJit_EnterExecutorShouldStopTracing(opcode)) {
```
Suggested change:

```diff
-else if (target_instr->op.code == ENTER_EXECUTOR && !_PyJit_EnterExecutorShouldStopTracing(opcode)) {
+else if (target_instr->op.code == ENTER_EXECUTOR) {
```
The fitness should handle this. If fitness is high, we will continue tracing. If it is getting lower, then we want to stop at the ENTER_EXECUTOR to join up with an existing trace.
No, there's an exception to this: we don't want to treat ENTER_EXECUTORs caused by RESUME as EXIT_QUALITY_ENTER_EXECUTOR; instead we should treat them as the default. Stopping at RESUME forms small, fragmented loop traces, which I previously documented in my RESUME tracing PR, where I saw actual slowdowns from it.
This is what _PyJit_EnterExecutorShouldStopTracing says:
```c
// Continue tracing (skip over the executor). If it's a RESUME
// trace to form longer, more optimizeable traces.
// We want to trace over RESUME traces. Otherwise, functions with lots of RESUME
// end up with many fragmented traces which perform badly.
// See for example, the richards benchmark in pyperformance.
// For consideration: We may want to consider tracing over side traces
// inserted into bytecode as well in the future.
```
I disagree.
The whole point of fitness/quality is to remove these ad-hoc special cases.
The reason you were seeing many fragmented traces before was that we didn't have a principled way to do this.
We won't see lots of small fragmented traces because, as I said, fitness will be high for short traces and will exceed EXIT_QUALITY_ENTER_EXECUTOR.
I just did a benchmark: applying this change is 0.5% slower on fastmark.
Can we introduce EXIT_QUALITY_ENTER_EXECUTOR_RESUME? We need to differentiate the following ENTER_EXECUTORs (they have different qualities in reality):
- ENTER_EXECUTOR due to JUMP_BACKWARD (best).
- ENTER_EXECUTOR due to progress or is_control_flow (decent).
- ENTER_EXECUTOR due to RESUME (worst).
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
I ran some tests on macOS, and performance on the fitness branch appears to have dropped significantly. Using fastbench: Machine: main branch: fitness branch:
We seem to be going around in circles a bit here. @cocolato, can you try out this script https://github.com/python/cpython/pull/148840/changes#diff-7d8d989c9e02ccababda3709e44e2465010f9aa25843f4764e4e742adcfaf39b to see if it offers any insight? Perhaps you can extract some of the key features of the slower benchmarks to find out why.
Regarding performance: we also need to consider the interplay between trace fitness/length and warmup. Ideally we want to cover the hot part of the program fairly quickly, not trace any cold parts, and not cover the same piece of code with multiple traces unless there is genuine polymorphism. Easier said than done, though. I would prefer good traces, even if that appears a little slower on one or two benchmarks, as the performance is more likely to be consistent.
@markshannon I ran the new tests; this is the result:
So I think we should reduce
Can you tell why richards is so different? I don't see how reducing
Also, instead of reducing the fitness for every uop, could we only start decreasing it after the trace is getting long, but decrease it more rapidly in that case? We could add the fitness to the dumps for more information. Once again, thanks for doing this.
Reopen: #147966