The example shows that the usual summary stats aren't enough to pin down the true data. But I wonder whether, in practice, these stats really are reasonable "sufficient stats", because data with strong structure is unlikely in most contexts. In other words, by Bayes' rule:
p(data | stats) = p(stats | data) * p(data) / p(stats)
and p(data) is only high for a "blob / cloud" of points, so when there's some correlation, the observed stats tell you that you most likely have a blob with some degree of correlation.
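To make the p(stats | data) part concrete, here is a rough sketch (my own illustration, not taken from the linked article) that builds a Gaussian blob and a noisy ring and then forces the ring to have exactly the same sample means, variances, and correlation as the blob. The helper names (whiten, match_stats, radial_spread) and all the parameters are made up for this example. Both datasets are equally consistent with the stats, so any preference for the blob has to come from the prior p(data).

```python
import numpy as np

rng = np.random.default_rng(0)

def whiten(points):
    """Center points and rescale them so their sample covariance is the identity."""
    centered = points - points.mean(axis=0)
    chol = np.linalg.cholesky(np.cov(centered, rowvar=False))
    return centered @ np.linalg.inv(chol).T

def match_stats(points, mean, cov):
    """Affinely map points so their sample mean and covariance equal the targets."""
    return whiten(points) @ np.linalg.cholesky(cov).T + mean

def radial_spread(points):
    """Std of distances from the center after whitening; small for ring-like data."""
    return np.linalg.norm(whiten(points), axis=1).std()

# A correlated Gaussian "blob": the kind of data the prior p(data) favors.
blob = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=500)

# A noisy ring: strongly structured data (hypothetical example, not the Datasaurus).
angles = rng.uniform(0.0, 2.0 * np.pi, size=500)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
ring += rng.normal(scale=0.05, size=ring.shape)

# Force the ring to share the blob's sample mean, variances, and correlation.
ring_matched = match_stats(ring, blob.mean(axis=0), np.cov(blob, rowvar=False))

# Same summary stats for both datasets...
print(np.corrcoef(blob, rowvar=False)[0, 1], np.corrcoef(ring_matched, rowvar=False)[0, 1])
# ...but a simple shape statistic still tells them apart.
print(radial_spread(blob), radial_spread(ring_matched))
```

The radial spread check is just one possible extra statistic. The point is only that the means, variances, and correlation cannot distinguish the two datasets by themselves.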
A classic.
See also:
https://en.wikipedia.org/wiki/Datasaurus_dozen
Content warning: This is a baker’s dozen not a regular dozen, in case anyone clicks through expecting to find twelve and is mildly and briefly perturbed.
The scary thing is that, yeah, we can see these in 2D and maybe 3D. But usually there are more than 2 or 3 columns in our data :(
“The Datasaurus Dozen”:
https://blog.revolutionanalytics.com/2017/05/the-datasaurus-...