gh-146073: Add fitness/exit quality mechanism for JIT trace frontend #148089
cocolato wants to merge 35 commits into python:main
Conversation
It appears that the current parameters do not yet guarantee runtime safety; I will continue to work on fixes and optimizations.
I've commented on the issue #146073 (comment)
@markshannon Thanks for the review! I'm holding off on changing the fitness parameters for now, but I can run some benchmarks if you think it's necessary.
Still seeing a big slowdown in richards on https://github.com/colesbury/fastmark. Main of this branch: This branch: I'm going to check if this is affecting the optimizer somehow.
…ER_EXECUTOR for RESUME
Treat back edges as an exit, not a penalty; this way traces are more likely to end at a back edge instead of at random spots.
Increasing the max trace length is only going to help if the trace is stopping too early.
I think we can safely reduce the max trace length; let me do that. There were two problems with the older code:
The new code has almost no slowdown on richards, and a huge speedup on the telco benchmark. Main:
markshannon left a comment:
A few more comments.
There are still a few places where we special-case situations that fitness should handle; those special cases can be removed.
```c
/* Exit quality thresholds: trace stops when fitness < exit_quality.
 * Higher = trace is more willing to stop here. */
#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL)
```
Suggested change:

```diff
-#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL)
+#define EXIT_QUALITY_CLOSE_LOOP (FITNESS_INITIAL - AVG_SLOTS_PER_INSTRUCTION*4)
```
FITNESS_INITIAL is too high a value for this, but not by much. We want to unroll tiny loops a bit and, more importantly, we don't want to special case the start instruction to avoid zero length traces.
```c
 * N_BACKWARD_SLACK more bytecodes before reaching EXIT_QUALITY_CLOSE_LOOP,
 * based on AVG_SLOTS_PER_INSTRUCTION. */
#define N_BACKWARD_SLACK 50
#define EXIT_QUALITY_BACKWARD_EDGE (EXIT_QUALITY_CLOSE_LOOP / 2 - N_BACKWARD_SLACK * AVG_SLOTS_PER_INSTRUCTION)
```
NOTE:
The problem here is that when tracing loops, we are treating the start of the loop as the closing point, but we want to stop at the end of the loop otherwise.
We probably need to make the back edge quality calculation a bit more complex.
- if the jump is to the loop closing point: exit_quality = 0 (to ensure loop is closed)
- otherwise: exit_quality = high, roughly (FITNESS - 10 * AVG_SLOTS_PER_INSTRUCTION)
(This can be fixed in a separate PR if it would complicate this PR too much.)
I'd prefer to do this in the next PR.
```c
/* Backward edge penalty for JUMP_BACKWARD_NO_INTERRUPT (coroutines/yield-from).
 * Smaller than FITNESS_BACKWARD_EDGE since we want to trace through them. */
#define EXIT_QUALITY_BACKWARD_EDGE_COROUTINE (EXIT_QUALITY_BACKWARD_EDGE / 8)
```
Why are we treating these backward edges differently? They may be in smaller loops, but N_BACKWARD_SLACK already handles that.
I think this will help the tracer trace through coroutines' short loops.
```c
    target_instr == tracer->initial_state.close_loop_instr) {
    return EXIT_QUALITY_CLOSE_LOOP;
}
else if (target_instr->op.code == ENTER_EXECUTOR && !_PyJit_EnterExecutorShouldStopTracing(opcode)) {
```
Suggested change:

```diff
-else if (target_instr->op.code == ENTER_EXECUTOR && !_PyJit_EnterExecutorShouldStopTracing(opcode)) {
+else if (target_instr->op.code == ENTER_EXECUTOR) {
```
The fitness should handle this. If fitness is high, we will continue tracing. If it is getting lower, then we want to stop at the ENTER_EXECUTOR to join up with an existing trace.
No, there's an exception to this: we don't want to treat ENTER_EXECUTORs caused by RESUME as EXIT_QUALITY_ENTER_EXECUTOR; instead we should treat them as the default. Stopping at RESUME forms small, fragmented loop traces, which I previously documented in my RESUME tracing PR, where I saw actual slowdowns from it.
This is what _PyJit_EnterExecutorShouldStopTracing says:
```c
// Continue tracing (skip over the executor). If it's a RESUME
// trace to form longer, more optimizeable traces.
// We want to trace over RESUME traces. Otherwise, functions with lots of RESUME
// end up with many fragmented traces which perform badly.
// See for example, the richards benchmark in pyperformance.
// For consideration: We may want to consider tracing over side traces
// inserted into bytecode as well in the future.
```
I disagree.
The whole point of fitness/quality is to remove these ad-hoc special cases.
The reason you were seeing many fragmented traces before was that we didn't have a principled way to do this.
We won't see lots of small fragmented traces because, as I said, fitness will be high for short traces and will exceed EXIT_QUALITY_ENTER_EXECUTOR.
I just did a benchmark: applying this change is 0.5% slower on fastmark.
Can we introduce EXIT_QUALITY_ENTER_EXECUTOR_RESUME? We need to differentiate the following ENTER_EXECUTORs (they have different qualities in reality):
- ENTER_EXECUTOR due to JUMP_BACKWARD (best).
- ENTER_EXECUTOR due to progress or is_control_flow (decent).
- ENTER_EXECUTOR due to RESUME (worst).
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
I ran some tests on macOS, and performance on the fitness branch appears to have dropped significantly. Using fastbench: Machine: main branch: fitness branch:
We seem to be going around in circles a bit here. @cocolato, can you try out this script https://github.com/python/cpython/pull/148840/changes#diff-7d8d989c9e02ccababda3709e44e2465010f9aa25843f4764e4e742adcfaf39b to see if it offers any insight? Perhaps you can extract some of the key features of the slower benchmarks to find out why.
Regarding performance: we also need to consider the interplay between trace fitness/length and warmup. Ideally we want to cover the hot part of the program fairly quickly, not trace any cold parts, and not cover the same piece of code with multiple traces unless there is genuine polymorphism. Easier said than done, though. I would prefer good traces, even if that appears a little slower on one or two benchmarks, as the performance is more likely to be consistent.
@markshannon I ran the new tests; this is the result:
So I think we should reduce
Can you tell why richards is so different? I don't see how reducing
Also, instead of reducing the fitness for every uop, could we only start decreasing it after the trace is getting long, but decrease it more rapidly in that case? We could add the fitness to the dumps for more information. Once again, thanks for doing this.
Reopen: #147966