add similar library to render diffs from regression test results#1296
add similar library to render diffs from regression test results#1296Aurashk wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds the similar crate to improve regression-test failure diagnostics by rendering structured diffs instead of line-by-line mismatch lists.
Changes:
- Adds
similaras a dev-dependency. - Updates regression comparison to detect mismatches first, then render a diff for differing CSV outputs.
- Introduces
render_diffto show inserted/deleted lines with source line numbers.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
Cargo.toml |
Adds similar to dev-dependencies for regression test diagnostics. |
Cargo.lock |
Locks the new similar dependency. |
tests/regression.rs |
Replaces per-line mismatch reporting with diff rendering for failed regression output comparisons. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| fn render_diff(lines1: &[String], lines2: &[String]) -> String { | ||
| let text1 = lines1.join("\n"); | ||
| let text2 = lines2.join("\n"); | ||
| let diff = TextDiff::from_lines(&text1, &text2); |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1296 +/- ##
==========================================
+ Coverage 89.46% 89.73% +0.27%
==========================================
Files 57 57
Lines 8361 8418 +57
Branches 8361 8418 +57
==========================================
+ Hits 7480 7554 +74
+ Misses 581 564 -17
Partials 300 300 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
alexdewar
left a comment
There was a problem hiding this comment.
Copilot's right. I tried this locally and when I changed one line of a commodity_flows.csv file, it showed a diff including every line where the text doesn't match exactly, even where the differences in floating-point values were below the threshold.
I think the right way to do this is to use the capture_diff_slices function from the crate. If it returns an empty Vec, the files are the same, otherwise they are different.
Unfortunately to do the comparisons correctly, you'll need to derive a custom type representing a field in the CSV file (an enum), which can be either a non-NaN floating-point number or a plain string. This type needs to derive Eq and Hash. Hash is, unfortunately, not defined for f64, because NaNs break everything. But! There is a NotNan type in the ordered_float crate, which does implement Hash, so you could use this. You'll define Eq to do the comparison logic we've currently got in compare_lines. Each CSV file will be represented as a Vec<Vec<MyCustomFieldType>> and you can pass these to capture_diff_slices.
Does this make sense?
For bonus points, you could use the colored crate to colourise the diff (we already have it as an indirect dependency), but we can probably get by without this too.
|
Actually, there are a couple of things I missed. You'll also need to make sure that similar floats have the same hash, which is annoying. Probably the easiest way to do this is to round to the nearest number divisible by the tolerance. You'll also need to store string representations of the lines somewhere so you can print them out for the diff. All in all, this is a little fiddly. |
…in floating precision add tests that check diffs rendered are as expected
alexdewar
left a comment
There was a problem hiding this comment.
This mostly seems to work -- and is definitely an improvement on what we had before! -- but there's something that's not working quite right.
What you're doing is:
- Compare files using the old comparison code
- If different, feed into
similar's algorithm to get diff - Print diff, excluding lines that
similarthinks are the same but are different according tocompare_line
What we should be able to do is:
- Feed files into
similar's algorithm directly - Discard any
DiffOps::Equals - If there are any diff ops remaining, print them all out
We shouldn't need the old code anymore -- if similar thinks lines are different, but the old code thinks they're the same or vice versa, we're not doing the comparison quite right. There are a bunch of output files with spurious replacements in them that we end up ignoring, so something's not working.
I'm not sure what to do, because your new code looks ok to me. It's just not giving the same results as the old. We don't want to let the perfect be the enemy of the good though and it is an improvement.
Maybe it would work better if we represented decimal numbers in base-10 instead (e.g. using this crate; though when I tried it, I was still getting some seemingly spurious mismatches).
If you'd like to fiddle with this more to see if we can yeet the old code, then that would be great, but otherwise I'm happy to merge this and open an issue about this problem. What are you feeling?
| for idx in 0..paired_len { | ||
| let old_idx = old_start + idx; | ||
| let new_idx = new_start + idx; | ||
| if !compare_line(&lines1[old_idx], &lines2[new_idx]) { |
There was a problem hiding this comment.
This extra comparison shouldn't be necessary 😞. The diff algo thinks they're different and if compare_line says otherwise, then it's not doing the same comparison that we're getting with the Eq implementation for CSVLine, which is a problem.
| let scaled = value / FLOAT_CMP_TOLERANCE; | ||
| let quantised = if scaled.is_finite() { | ||
| scaled.round() * FLOAT_CMP_TOLERANCE |
There was a problem hiding this comment.
@jamesturner246 As the resident floating-point guru, does this seem like an ok way to round a number to the nearest FLOAT_CMP_TOLERANCE? Numbers need to come out exactly the same if they are within FLOAT_CMP_TOLERANCE of one another, because they need to have the same hash.
There was a problem hiding this comment.
I know it's not be the answer you want, but I'd go with the approx equals routes.
Absolute tolerance: abs(a-b) <= abs_tol.
Relative tolerance: abs(a-b) <= rel_tol * max(abs(a), abs(b)).
The issue with this code is that
- there's the risk of overflow if
FLOAT_CMP_TOLERANCEis small - given
scaledwill be large, the rounding will be courser, distorting the calculation - It seems rust's
roundhas weird nonstandard behaviour: it rounds away from zero, rather than rounding to nearest, ties to even LSB required by IEEE-754
There was a problem hiding this comment.
@Aurashk It seems like this approach to quantisation isn't going to work because... floating-point numbers 😞.
Are you happy to have another go at this using the rust_decimal crate instead? That should give us proper, consistent rounding, so that we don't end up with two CsvLines within the tolerance giving different hashes (due to FP rounding errors), which is what we have currently. You may have to tweak the tolerance slightly or regenerate some test files to get things to pass because it is slightly different from the current tolerance checking we're doing, but that's ok.
NB: You can use different rounding strategies with this crate (see here). Dunno if this is something we'll want to change or whether the default MidpointNearestEven is fine.
Btw I have verified that we do see these rounding errors with our actual data, e.g. the missing_commodity test fails with:
-L608: 2040,10,BIOPRD,autumn.peak,25.852906323049822
CsvLine([Float(2040.0), Float(10.0), Text("BIOPRD"), Text("autumn.peak"), Float(25.852906323000003)])
+L608: 2040,10,BIOPRD,autumn.peak,25.852906323050078
CsvLine([Float(2040.0), Float(10.0), Text("BIOPRD"), Text("autumn.peak"), Float(25.8529063231)])
There was a problem hiding this comment.
I tried another thing too which seems to work and is much simpler, see if you are happy with this. If not I can have a go with the decimal crate.
- filter through each line with existing floating point tolerance first - if they meet the tolerance make the lines exactly equal
- feed everything to the diffing library
I appreciate it feels a bit wasteful going through twice, but there seems to be no palpable performance change. If you want to try it just put this in your regression.rs
There was a problem hiding this comment.
As @AdrianDAlessandro suggested, this would be a lot faster if we only do this extra work in the case that we already know the files are different, because currently it's comparing every line of one file with every line in another using our probably quite slow comparison method, so when comparing two files that are 1000 lines long, you'd get 1m comparisons, rather than 1000.
This still isn't quite correct though. If you have two identical lines in your CSV file and one of them is deleted, it won't appear in the diff, because it will find a match somewhere else in the file. For it to be totally right, the diffing algorithm has to be the single source of truth. That said, I suspect it will probably work fine in practice, at least most of the time.
You probably wouldn't need to change much to use the decimal type I mentioned though. It would mostly be a case of swapping out the NotNan for a Decimal and deleting the old comparison code.
There was a problem hiding this comment.
Sorry for the slow response, managed to have another look at this now.
I fiddled around with the rust_decimal crate for a bit, but it still seems to have similar problems to what we were facing before. My understanding now (could be wrong) is that however you convert/round/truncate the floating points you will always lose some information, so it's never going to be as good as comparing floating points directly. Even if you tweak the tolerance, the diff will still diverge from that tolerance in practice. I feel like we would need to use fixed point calculations throughout for that to work properly.
I applied the suggestion to only do floating point replacements if there is a mismatch between files and managed to improve on the other approach a bit by binning the 'equivalent' lines in maps before replacing instead of doing a brute force check-and-replace of each line. This seems to work okay. I did a stress test by turning down the tolerance to 1e-14 and it's still rendering all the diffs in a few seconds. Let me know what you think.
Note that, because of the binning, if there are two lines which are exact duplicates within the tolerance, it is possible for the reported diff to contain a slightly different line (within tolerance) to what is actually outputted to the file, although I'm not sure this is going to matter much in practice, I added a warning so it doesn't confuse people.
There was a problem hiding this comment.
Nw! I'm off in a mo though so won't have a chance to look at this again.
The thing about using a base-10 type is that while it does still have some of the other problems associated with binary floating-point numbers, you should be able to represent the rounded version exactly.
E.g. in a file, you have 1.23456789, but we only care about the first two DPs. If you represent it with the Decimal type you should be able to represent the original number exactly and also the rounded version (1.23). So every time you want to compare a number that rounds to 1.23, you can do so safely, because any number that rounds to 1.23 will always be represented in the same way -- byte for byte. This also means the hashes of any two of these values will also be equal, which is another property we need. You can't round to two decimal places exactly using a regular binary floating-point number in the same way.
There might be something I'm missing... anyway, shall we park this one until I get back?
There was a problem hiding this comment.
I agree with what you are saying - once you are in fixed precision land everything will work. It just doesn't seem to fit the problem when you try it though.
The rounding from floating point causes problems, for example:
0.100000005
0.1000000050601934845
rounded to 10 dp, come out to different numbers
0.1000000050
0.1000000051
even though they are only 6.0193475e-11 apart. We could round everything to (for instance) 10dp, make the claim that the regression tests enforce a similarity of 9dp, and turn down the dp until the regression tests pass, but I'm not really a fan of that. All our other tolerances are defined using floating point differences. I feel like if you're using fixed precision anywhere it should ideally be used everywhere.
In reference to what James says above too, it would be even better to use a relative floating point tolerance as well as an absolute tolerance, as 1e-10 could be insignificant between two values that are orders of magnitude apart
No problem, no rush to get back to me!
add more robust testing and more cases for csv diff
Description
Adds the similar diffing library to render the output from failing regression tests in a nicer way
Fixes #1186
Type of change
Key checklist
$ cargo test$ cargo docpresent in the previous release
Further checks