
add diff diagnostics#300

Open
avehtari wants to merge 57 commits into master from diff-diagnostics

Conversation

@avehtari
Member

@avehtari avehtari commented Aug 31, 2025

Description

The corresponding issue #299

In this PR we are making changes to the output structure returned by loo_compare(). The main changes are a transition from a matrix to a data frame and the addition of several new columns. Here is an example of the updated output structure:

> loo_compare(w1, w2)
  model elpd_diff se_diff p_worse diag_diff diag_elpd
 model1       0.0     0.0      NA                    
 model2      -4.1     0.1    1.00   N < 100          
 
Diagnostic flags present.
See ?`loo-glossary` (sections `diag_diff` and `diag_elpd`)
or https://mc-stan.org/loo/reference/loo-glossary.html.
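Since the result is now a data frame rather than a matrix, the new columns can be accessed by name. A minimal sketch using a hand-built stand-in for the result above (not an actual loo_compare() call, and the column names may still change before merging):

```r
# Hand-built stand-in mirroring the loo_compare() output shown above
comp <- data.frame(
  model     = c("model1", "model2"),
  elpd_diff = c(0.0, -4.1),
  se_diff   = c(0.0, 0.1),
  p_worse   = c(NA, 1.00),
  diag_diff = c("", "N < 100"),
  stringsAsFactors = FALSE
)

# Columns are accessible by name, which was awkward with a matrix
comp$p_worse[comp$model == "model2"]   # 1
subset(comp, diag_diff != "")          # rows with diagnostic flags
```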

More context on the motivation behind these changes can be found in the updated case study "Uncertainty in Bayesian LOO-CV Model Comparison".

TODOs

  • implement changes and corresponding tests
  • update NEWS.md
  • fix all reverse dependency issues (see Comment)

@avehtari avehtari requested review from jgabry and n-kall August 31, 2025 17:00
@avehtari
Member Author

@jgabry, we discussed this with @n-kall, and I added an option to the print function so that the new information is not shown by default. Would you have time to look at this?

Member

@jgabry jgabry left a comment


  • I think this is a good idea. It's possible that this doesn't actually break any packages that depend on loo. I can check by running reverse dependency checks once the errors in the code are fixed (see individual comments).

  • The tests need to be updated, but we can wait to do that until we check that it doesn't break any reverse dependencies

  • We should put some information about this in the main loo_compare documentation, not only in the print method

  • I actually think it would be good to print these probabilities by default. What was the reason you turned it off by default? (Right now the printing errors by default, but once fixed it will be off by default.) I don't think the printing should affect backwards compatibility (or am I missing something?). It would be the addition of more columns that could potentially affect something, right? Or was there a different reason you turned off printing the probabilities by default?

Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
avehtari and others added 2 commits October 17, 2025 20:22
Co-authored-by: Jonah Gabry <jgabry@gmail.com>
@avehtari
Member Author

Ok, let's change them to be printed by default.

I had the column name as p_worse so that it also tells the direction. Then I considered switching to pnorm, as it tells that it's based on the normal approximation, and got stuck between the two.
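For intuition, under the normal approximation mentioned above, p_worse can be read as a tail probability of the elpd difference. This is a sketch of one plausible definition (the exact formula used in the PR may differ):

```r
# Hedged sketch: one plausible definition of p_worse under the normal
# approximation -- the probability that the true elpd difference is
# negative, i.e. that this model predicts worse than the baseline.
elpd_diff <- -4.1
se_diff   <- 0.1
p_worse   <- pnorm(0, mean = elpd_diff, sd = se_diff)  # same as pnorm(-elpd_diff / se_diff)
round(p_worse, 2)  # 1, matching the example output in the PR description
```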

@jgabry
Member

jgabry commented Oct 17, 2025

Setting simplify=FALSE when calling the print method also errors, due to a weird interaction with the p_worse column. But I think we can get rid of the simplify argument. I've never seen anyone use it, and I doubt it will break any packages (we can find out by reverse dependency checks). I will push a commit fixing this.

@avehtari
Member Author

I would love to get rid of the simplify argument to make it difficult to print looic. I have seen people using it: a quick search shows e.g. Doing Bayesian Data Analysis in brms and the tidyverse, https://www.statistical-rethinking.robitalec.ca/week-8.html, https://www.andreashandel.com/posts/2022-02-24-longitudinal-multilevel-bayes-3/, and https://dirmeier.github.io/rstansequential/index.html. But if it's not used by packages, we can just drop it (and always simplify).

Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
diag_pnorm[khat_diff > 0.5] <- paste0("khat_diff > 0.5")
}
rownames(comp) <- rnms
comp <- cbind(data.frame(elpd_diff = elpd_diff, se_diff = se_diff,
Member


This also changes the returned object to a data frame when it used to be a matrix. That could potentially cause reverse dependency issues, but we'll see when I run the checks. This is a necessary change, though, since we're mixing numeric columns with text columns for the diagnostic.
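For downstream packages that assumed a numeric matrix, one possible compatibility shim is to drop the text columns and restore model names as rownames. A sketch (column names assumed from the example output in this PR; not code from loo itself):

```r
# Sketch of a shim recovering the old matrix shape from the new data frame.
# Column names follow the example output in this PR and may change.
as_old_matrix <- function(comp) {
  num_cols <- vapply(comp, is.numeric, logical(1))
  m <- as.matrix(comp[, num_cols, drop = FALSE])
  if ("model" %in% names(comp)) {
    rownames(m) <- comp$model  # restore model names as rownames
  }
  m
}
```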

@codecov-commenter

codecov-commenter commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 84.12698% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.70%. Comparing base (d0aa8b2) to head (e28085d).
⚠️ Report is 34 commits behind head on master.

Files with missing lines Patch % Lines
R/loo_compare.R 83.87% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #300      +/-   ##
==========================================
- Coverage   92.80%   92.70%   -0.11%     
==========================================
  Files          31       31              
  Lines        3004     3029      +25     
==========================================
+ Hits         2788     2808      +20     
- Misses        216      221       +5     

@jgabry
Member

jgabry commented Oct 17, 2025

I fixed the tests, so this is passing now. I will hopefully run reverse dependency checks over the weekend. I just put one question for you @avehtari in the review comments about how to label the diagnostic.

@jgabry
Member

jgabry commented Oct 17, 2025

Starting a comment to keep track of reverse dependency issues. The reverse dependency checks are still running (they will take a while). I will update here as I discover new issues.

  • BAMBI
    • Error in main function due to new return type being a data frame instead of a matrix
    • For this particular error I was able to put a fix in loo_compare (it had to do with row/col names) so no PR is needed.
  • BayesERtools
    • Test error due to new return type being a data frame instead of a matrix
    • I forked the repo and fixed the test. Will wait to make a PR until we merge this PR into loo.
  • flocker
    • Test error due to new return type being a data frame instead of a matrix (and with additional columns)
    • I made a fork and fixed the test. Will wait to make a PR.
  • hbamr
    • uses simplify=FALSE in a vignette but not in any package functions
    • I made a fork and removed the use of simplify. Will wait to make a PR.
  • rstanarm
    • has some custom print stuff that assumed a matrix
    • I was actually able to get around this in loo_compare itself so no PR is necessary

@avehtari
Member Author

  1. About the diagnostic messages
    Should we use
    N<100
    |elpd_diff|<4
    khat_diff > 0.5
    or
    N<100, small data
    |elpd_diff|<4, similar predictions
    khat_diff > 0.5, possible outliers
    ...
    Maybe the shorter ones are better?

  2. If one column name is p_worse, is diag_pnorm ok? Or should it be diag_p_worse, or just diag, diag_diff, or diagnostic?

  3. Since we are switching to using data.frame, can we have model names as column and not as row names? I have been using tinytable for pretty printing, but then I need the model name as a column.

  4. Illustration of how the output looks now (+ tinytable) at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html
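To make the shorter labels from point 1 concrete, here is a sketch of how the per-model flag strings could be assembled (thresholds are the ones quoted above; the names n_obs, elpd_diff, and khat_diff are illustrative, not the PR's actual code):

```r
# Illustrative assembly of the shorter diagnostic labels from point 1.
diag_labels <- function(n_obs, elpd_diff, khat_diff) {
  flags <- character(0)
  if (n_obs < 100)        flags <- c(flags, "N < 100")
  if (abs(elpd_diff) < 4) flags <- c(flags, "|elpd_diff| < 4")
  if (khat_diff > 0.5)    flags <- c(flags, "khat_diff > 0.5")
  paste(flags, collapse = ", ")
}

diag_labels(n_obs = 80, elpd_diff = -4.1, khat_diff = 0.2)  # "N < 100"
```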

@jgabry
Member

jgabry commented Oct 29, 2025

It's p because the equations use p(). I've not seen anyone propose before that we should give up using p() in equations. Instead of prob, Pr would be more logical, as it is sometimes used in equations that evaluate only to a probability (while p() is used for both densities and probabilities).

We could consider Pr_worse instead of p_worse, although all of the other columns in the data frame have lowercase names.

@ndevln

ndevln commented Nov 5, 2025

Hi,

thanks for entertaining my input. Just as context: I teach epidemiologists with differing degrees of statistical background, and because of this I am maybe a little too cautious. I think this package should also be usable by these people; I think your package is fantastic.

Diagnostics

For the diagnostics, I think an option to show all warnings would be nice, even if the value of the warnings is limited. In infectious disease surveillance you want your datasets to be small, because if you have large datasets you have failed to contain the disease. A small sample size warning is therefore often not helpful, since the data often still contains useful information. And this data has to be published to be usable in a meta-analysis at a later time.

If you consider adding such an option, maybe a warning() could be given when some of the diagnostic warnings are hidden, with an explanation of how to show them. I know this complicates the function interface a little, but maybe it's ok.

p_worse

When thinking about this a little longer, I realized my worries with p_worse are maybe more about the interpretation of the probability. While I think Pr_worse is nice and might add clarity, I understand that p is the mathematically correct naming and that the user should be able to differentiate this from a p-value.

My argument with the AUC was much simpler. When looking at the examples, I was unsure how to interpret p_worse, since it often takes values above 0.9 in the examples. For model selection, I have to consider multiple factors, like:

  • elpd
  • known causal and biological relationships and the study design
  • parsimony
  • other model diagnostics...

The main tradeoff is often between elpd and parsimony. And since p_worse is often above 0.9, I would want p_worse to be close to 1 before considering a more complex model superior (everything else being equal). This is what is unclear to me. Did I get the right idea?

Another thought in this regard:
There is no fixed decision threshold for p_worse, and it is more beneficial to interpret p_worse continuously. But since easy rules feel good, a p-value-like decision threshold of 0.95 could be applied. Would this be too low, and would it favor more complex models?

I hope you find my thoughts helpful, and sorry for the rambling.

Greetings,
Alex

@avehtari
Member Author

avehtari commented Nov 5, 2025

@ndevln, thanks for the additional comments! I've been busy with a grant proposal, but now that the deadline has passed, I will soon modify one of the case studies to provide more information about the diagnostics and the interpretation of the results. I'll comment here and tag you when it's ready; we'll see if it helps clarify some things, and your feedback will be useful for judging whether we manage to explain things better.

@avehtari
Member Author

avehtari commented Nov 5, 2025

@ndevln, and don't be sorry; it's great that you rambled and wrote what you think even while uncertain (not everyone has the courage), as that also helps us see where we need to improve the documentation to reduce such uncertainty.

@avehtari
Member Author

avehtari commented Nov 7, 2025

We discussed this PR in our group meeting, and I think at this point we should drop the k_diff diagnostic, as 1) it may produce too many false positives with typical data sizes, and 2) it's not directly connected to the LOO uncertainty paper's point about model misspecification. Although it is likely to be useful, we need more experiments before we can provide better advice on how to interpret it and what additional checks can be done. We can keep the other diagnostics at this time.

@avehtari
Member Author

I have updated a case study using this PR: https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html

There are now more explanations of how to interpret the diagnostics.

@florence-bockting

florence-bockting commented Apr 1, 2026

After a discussion with @avehtari, we agreed on the following next steps for this PR:

  • add an explanation of the diagnostics and their relationships into loo-glossary
  • include a link with reference to loo-glossary for further information to the print method
  • go over the case study and check whether it is self-contained (i.e., can be read without having read the paper first)

@florence-bockting

I added additional sections to the loo-glossary and added a user message to the loo_compare output with a reference to the loo-glossary, as follows:

  model elpd_diff se_diff p_worse diag_diff diag_elpd
 model1       0.0     0.0      NA                    
 model2      -4.1     0.1    1.00   N < 100          

Diagnostic flags present. See ?`loo-glossary` (sections `diag_diff` and `diag_elpd`)
or https://mc-stan.org/loo/reference/loo-glossary.html.

Member Author

@avehtari avehtari left a comment


Excellent loo-glossary text. Just one comment about the Bengio and Grandvalet reference.

Comment thread R/loo-glossary.R Outdated
Comment thread R/loo_compare.R Outdated
@jgabry
Member

jgabry commented Apr 16, 2026

@florence-bockting @avehtari Thank you in advance for doing the new round of reverse dependency checks.
@florence-bockting please also officially add yourself to the DESCRIPTION file as part of this PR! Or separately from this PR, whatever you prefer.

@florence-bockting

florence-bockting commented Apr 19, 2026

Re-running reverse dependency checks. Below is a list of packages where issues occur:

  • BAMBI 2.3.6
    • extracted model names from rownames, which have now changed due to the new "model" column
    • Public PR: PR#5
  • BayesERtools 0.2.4
    • extracted model names from rownames, which have now changed due to the new "model" column
    • Public PR: PR#20
  • clmstan 0.1.1
    • tests checked for is.matrix; changed to is.data.frame when the "model" column exists
    • Public PR: PR#1
  • disbayes 1.1.0
    • a test uses rownames to extract the model; changed to use the column value if a "model" column exists
    • Public PR: PR#3
  • flocker 1.0.0
    • tests use rownames; the tutorial shows loo_compare output as a comment -> changed the corresponding test and output comment
    • Public PR: PR#116
  • ubms 1.2.8
    • assumes a matrix type when extracting models; updated to switch between data.frame and matrix depending on whether the "model" column exists
    • Public PR: PR#88
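The recurring fix across these packages follows one pattern: branch on whether the new "model" column exists. A sketch of that pattern (not code from any of the PRs above):

```r
# Common pattern from the reverse-dependency fixes: take model names from
# the new "model" column when present, falling back to rownames for loo
# versions that still return a matrix.
get_model_names <- function(comp) {
  if (is.data.frame(comp) && "model" %in% names(comp)) {
    comp$model
  } else {
    rownames(comp)
  }
}
```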

@avehtari: I have now worked out fixes for all affected packages; they are currently still local. If I understand the process correctly, we can now merge this PR into master, and I will then make the PRs in the respective packages public. Or do you have other thoughts on this?


5 participants