
add diff diagnostics#300

Open
avehtari wants to merge 57 commits into master from diff-diagnostics

Conversation

@avehtari
Member

@avehtari avehtari commented Aug 31, 2025

Description

The corresponding issue #299

In this PR we are making changes to the output structure returned by loo_compare(). The main changes are a transition from a matrix to a data frame and the addition of several new columns. Here is an example of the updated output structure:

> loo_compare(w1, w2)
  model elpd_diff se_diff p_worse diag_diff diag_elpd
 model1       0.0     0.0      NA                    
 model2      -4.1     0.1    1.00   N < 100          
 
Diagnostic flags present.
See ?`loo-glossary` (sections `diag_diff` and `diag_elpd`)
or https://mc-stan.org/loo/reference/loo-glossary.html.
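Since the result is now a data frame rather than a matrix, the new columns can be accessed by name. A minimal sketch using a hand-built stand-in for the result above (not an actual loo_compare() call, and the column names may still change before merging):

```r
# Hand-built stand-in mirroring the loo_compare() output shown above
comp <- data.frame(
  model     = c("model1", "model2"),
  elpd_diff = c(0.0, -4.1),
  se_diff   = c(0.0, 0.1),
  p_worse   = c(NA, 1.00),
  diag_diff = c("", "N < 100"),
  stringsAsFactors = FALSE
)

# Columns are accessible by name, which was awkward with a matrix
comp$p_worse[comp$model == "model2"]   # 1
subset(comp, diag_diff != "")          # rows with diagnostic flags
```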

More context on the motivation behind these changes can be found in the updated case study "Uncertainty in Bayesian LOO-CV Model Comparison".

TODOs

  • implement changes and corresponding tests
  • update NEWS.md
  • fix all reverse dependency issues (see Comment)

@avehtari avehtari requested review from jgabry and n-kall August 31, 2025 17:00
@avehtari
Member Author

@jgabry, we discussed this with @n-kall, and I added an option to the print function so that the new information is not shown by default. Would you have time to look at this?

Member

@jgabry jgabry left a comment


  • I think this is a good idea. It's possible that this doesn't actually break any packages that depend on loo. I can check by running reverse dependency checks once the errors in the code are fixed (see individual comments).

  • The tests need to be updated, but we can wait to do that until we check that it doesn't break any reverse dependencies

  • We should put some information about this in the main loo_compare documentation, not only in the print method

  • I actually think it would be good to print these probabilities by default. What was the reason you turned it off by default? (Right now the printing errors by default, but once fixed it will be off by default.) I don't think the printing should affect backwards compatibility (or am I missing something?). It would be the addition of more columns that could potentially affect something, right? Or was there a different reason you turned off printing the probabilities by default?

Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
avehtari and others added 2 commits October 17, 2025 20:22
Co-authored-by: Jonah Gabry <jgabry@gmail.com>
@avehtari
Member Author

Ok, let's change them to be printed by default.

I had the column name as p_worse so that it also tells the direction. Then I considered switching to pnorm, as it tells that it's based on the normal approximation, and got stuck between the two.
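For intuition, under the normal approximation mentioned above, p_worse can be read as a tail probability of the elpd difference. This is a sketch of one plausible definition (the exact formula used in the PR may differ):

```r
# Hedged sketch: one plausible definition of p_worse under the normal
# approximation -- the probability that the true elpd difference is
# negative, i.e. that this model predicts worse than the baseline.
elpd_diff <- -4.1
se_diff   <- 0.1
p_worse   <- pnorm(0, mean = elpd_diff, sd = se_diff)  # same as pnorm(-elpd_diff / se_diff)
round(p_worse, 2)  # 1, matching the example output in the PR description
```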

@jgabry
Member

jgabry commented Oct 17, 2025

Setting simplify=FALSE when calling the print method also errors, due to a weird interaction with the p_worse column. But I think we can get rid of the simplify argument. I've never seen anyone use it, and I doubt it will break any packages (we can find out by reverse dependency checks). I will push a commit fixing this.

@avehtari
Member Author

I would love to get rid of the simplify argument to make it difficult to print looic. I have seen people using it: a quick search shows e.g. Doing Bayesian Data Analysis in brms and the tidyverse, https://www.statistical-rethinking.robitalec.ca/week-8.html, https://www.andreashandel.com/posts/2022-02-24-longitudinal-multilevel-bayes-3/, and https://dirmeier.github.io/rstansequential/index.html. But if it's not used by packages, we can just drop it (and always simplify).

Comment thread R/loo_compare.R Outdated
Comment thread R/loo_compare.R Outdated
diag_pnorm[khat_diff > 0.5] <- paste0("khat_diff > 0.5")
}
rownames(comp) <- rnms
comp <- cbind(data.frame(elpd_diff = elpd_diff, se_diff = se_diff,
Member


This also changes the returned object to a data frame when it used to be a matrix. That could potentially cause reverse dependency issues, but we'll see when I run the checks. This is a necessary change, though, since we're mixing numeric columns with text columns for the diagnostic.
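For downstream packages that assumed a numeric matrix, one possible compatibility shim is to drop the text columns and restore model names as rownames. A sketch (column names assumed from the example output in this PR; not code from loo itself):

```r
# Sketch of a shim recovering the old matrix shape from the new data frame.
# Column names follow the example output in this PR and may change.
as_old_matrix <- function(comp) {
  num_cols <- vapply(comp, is.numeric, logical(1))
  m <- as.matrix(comp[, num_cols, drop = FALSE])
  if ("model" %in% names(comp)) {
    rownames(m) <- comp$model  # restore model names as rownames
  }
  m
}
```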

@codecov-commenter

codecov-commenter commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 84.12698% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.70%. Comparing base (d0aa8b2) to head (e28085d).
⚠️ Report is 34 commits behind head on master.

Files with missing lines Patch % Lines
R/loo_compare.R 83.87% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #300      +/-   ##
==========================================
- Coverage   92.80%   92.70%   -0.11%     
==========================================
  Files          31       31              
  Lines        3004     3029      +25     
==========================================
+ Hits         2788     2808      +20     
- Misses        216      221       +5     

@jgabry
Member

jgabry commented Oct 17, 2025

I fixed the tests, so this is passing now. I will hopefully run reverse dependency checks over the weekend. I just put one question for you @avehtari in the review comments about how to label the diagnostic.

@jgabry
Member

jgabry commented Oct 17, 2025

Starting a comment to keep track of reverse dependency issues. The reverse dependency checks are still running (they will take a while). I will update here as I discover new issues.

  • BAMBI
    • Error in main function due to new return type being a data frame instead of a matrix
    • For this particular error I was able to put a fix in loo_compare (it had to do with row/col names) so no PR is needed.
  • BayesERtools
    • Test error due to new return type being a data frame instead of a matrix
    • I forked the repo and fixed the test. Will wait to make a PR until we merge this PR into loo.
  • flocker
    • Test error due to new return type being a data frame instead of a matrix (and with additional columns)
    • I made a fork and fixed the test. Will wait to make a PR.
  • hbamr
    • uses simplify=FALSE in a vignette but not in any package functions
    • I made a fork and removed the use of simplify. Will wait to make a PR.
  • rstanarm
    • has some custom print stuff that assumed a matrix
    • I was actually able to get around this in loo_compare itself so no PR is necessary

@avehtari
Member Author

  1. About the diagnostic messages
    Should we use
    N<100
    |elpd_diff|<4
    khat_diff > 0.5
    or
    N<100, small data
    |elpd_diff|<4, similar predictions
    khat_diff > 0.5, possible outliers
    ...
    Maybe the shorter ones are better?

  2. If one column name is p_worse, is diag_pnorm ok? Or should it be diag_p_worse, or just diag, diag_diff, or diagnostic?

  3. Since we are switching to using data.frame, can we have model names as column and not as row names? I have been using tinytable for pretty printing, but then I need the model name as a column.

  4. Illustration of how the output looks now (+ tinytable) at https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html
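To make the shorter labels from point 1 concrete, here is a sketch of how the per-model flag strings could be assembled (thresholds are the ones quoted above; the names n_obs, elpd_diff, and khat_diff are illustrative, not the PR's actual code):

```r
# Illustrative assembly of the shorter diagnostic labels from point 1.
diag_labels <- function(n_obs, elpd_diff, khat_diff) {
  flags <- character(0)
  if (n_obs < 100)        flags <- c(flags, "N < 100")
  if (abs(elpd_diff) < 4) flags <- c(flags, "|elpd_diff| < 4")
  if (khat_diff > 0.5)    flags <- c(flags, "khat_diff > 0.5")
  paste(flags, collapse = ", ")
}

diag_labels(n_obs = 80, elpd_diff = -4.1, khat_diff = 0.2)  # "N < 100"
```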

@jgabry
Member

jgabry commented Oct 29, 2025

It's p because the equations use p(). I've not seen anyone propose before that we should give up using p() in equations. Instead of prob, Pr would be more logical, as it is sometimes used in equations that evaluate only to a probability (while p() is used for both densities and probabilities).

We could consider Pr_worse instead of p_worse, although all of the other columns in the data frame have lowercase names.

@ndevln

ndevln commented Nov 5, 2025

Hi,

thanks for entertaining my input. Just as context: I teach epidemiologists with differing degrees of statistical background, and because of this I am maybe a little too cautious. I think this package should also be usable by these people; I think your package is fantastic.

Diagnostics

For the diagnostics, I think an option to show all warnings would be nice, even if the value of the warnings is limited. In infectious disease surveillance you want your datasets to be small, because if you have large datasets you have failed to contain the disease. A small sample size warning is therefore often not helpful, since the data often still contains useful information. And this data has to be published to be usable in a meta-analysis at a later time.

If you consider adding such an option, maybe a warning() could be given when some of the diagnostic warnings are hidden, with an explanation of how to show them. I know this complicates the function interface a little, but maybe it's ok.

p_worse

When thinking about this a little longer, I realized my worries with p_worse are maybe more about the interpretation of the probability. While I think Pr_worse is nice and might add clarity, I understand that p is the mathematically correct naming and that the user should be able to differentiate this from a p-value.

My argument with the AUC was much simpler. When looking at the examples, I was unsure how to interpret p_worse, since it often takes values above 0.9 in the examples. For model selection, I have to consider multiple factors, like:

  • elpd
  • known causal and biological relationships and the study design
  • parsimony
  • other model diagnostics...

The main tradeoff is often between elpd and parsimony. And since p_worse is often above 0.9, I would want p_worse to be close to 1 before considering a more complex model superior (everything else being equal). This is what is unclear to me. Did I get the right idea?

Another thought in this regard:
There is no fixed decision threshold for p_worse, and it is more beneficial to interpret p_worse continuously. But since easy rules feel good, a p-value-like decision threshold of 0.95 could be applied. Would this be too low, and would it favor more complex models?

I hope you find my thoughts helpful, and sorry for the rambling.

Greetings,
Alex

@avehtari
Member Author

avehtari commented Nov 5, 2025

@ndevln, thanks for the additional comments! I've been busy with a grant proposal, but now that the deadline has passed, I will soon modify one of the case studies to provide more information about the diagnostics and the interpretation of the results. I'll comment here and tag you when it's ready; we'll see if it helps clarify some things, and your feedback will be useful for judging whether we manage to explain things better.

@avehtari
Member Author

avehtari commented Nov 5, 2025

@ndevln, and don't be sorry; it's great that you rambled and wrote what you think even while uncertain (not everyone has the courage), as that also helps us see where we need to improve the documentation to reduce such uncertainty.

@avehtari
Member Author

avehtari commented Nov 7, 2025

We discussed this PR in our group meeting, and I think at this point we should drop the k_diff diagnostic, as 1) it may produce too many false positives with typical data sizes, and 2) it's not directly connected to the LOO uncertainty paper's point about model misspecification. Although it is likely to be useful, we need more experiments before we can provide better advice on how to interpret it and what additional checks can be done. We can keep the other diagnostics at this time.

@avehtari
Member Author

I have updated a case study using this PR: https://users.aalto.fi/~ave/casestudies/LOO_uncertainty/loo_uncertainty_diag.html

There are now more explanations of how to interpret the diagnostics.

@florence-bockting

florence-bockting commented Apr 1, 2026

After a discussion with @avehtari, we agreed on the following next steps for this PR:

  • add an explanation of the diagnostics and their relationships into loo-glossary
  • include a link with reference to loo-glossary for further information to the print method
  • go over the case study and check whether it is self-contained (i.e., can be read without having read the paper first)

@florence-bockting

I added additional sections to the loo-glossary and added a user message to the loo_compare output with a reference to the loo-glossary, as follows:

  model elpd_diff se_diff p_worse diag_diff diag_elpd
 model1       0.0     0.0      NA                    
 model2      -4.1     0.1    1.00   N < 100          

Diagnostic flags present. See ?`loo-glossary` (sections `diag_diff` and `diag_elpd`)
or https://mc-stan.org/loo/reference/loo-glossary.html.

Member Author

@avehtari avehtari left a comment


Excellent loo-glossary text. Just one comment about the Bengio and Grandvalet reference.

Comment thread R/loo-glossary.R Outdated
Comment thread R/loo_compare.R Outdated
@jgabry
Member

jgabry commented Apr 16, 2026

@florence-bockting @avehtari Thank you in advance for doing the new round of reverse dependency checks.
@florence-bockting please also officially add yourself to the DESCRIPTION file as part of this PR! Or separately from this PR, whatever you prefer.

@florence-bockting

florence-bockting commented Apr 19, 2026

Re-running reverse dependency checks. Below is a list of packages where issues occur:

  • BAMBI 2.3.6
    • extracted model names from rownames, which have now changed due to the new "model" column
    • Public PR: PR#5
  • BayesERtools 0.2.4
    • extracted model names from rownames, which have now changed due to the new "model" column
    • Public PR: PR#20
  • clmstan 0.1.1
    • tests checked for is.matrix; changed to is.data.frame when the "model" column exists
    • Public PR: PR#1
  • disbayes 1.1.0
    • a test uses rownames to extract the model; changed to use the column value if a "model" column exists
    • Public PR: PR#3
  • flocker 1.0.0
    • tests use rownames; the tutorial shows loo_compare output as a comment -> changed the corresponding test and output comment
    • Public PR: PR#116
  • ubms 1.2.8
    • assumes a matrix type when extracting models; updated to switch between data.frame and matrix depending on whether the "model" column exists
    • Public PR: PR#88
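The recurring fix across these packages follows one pattern: branch on whether the new "model" column exists. A sketch of that pattern (not code from any of the PRs above):

```r
# Common pattern from the reverse-dependency fixes: take model names from
# the new "model" column when present, falling back to rownames for loo
# versions that still return a matrix.
get_model_names <- function(comp) {
  if (is.data.frame(comp) && "model" %in% names(comp)) {
    comp$model
  } else {
    rownames(comp)
  }
}
```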

@avehtari: I have now worked out fixes for all affected packages; they are currently still local. If I understand the process correctly, we can now merge this PR into master, and I will then make the PRs in the respective packages public. Or do you have other thoughts on this?


5 participants