
Who knew that margarine consumption is correlated with the divorce rate in Maine? There is even a very scientific paper on the subject.

This is just one of thousands of spurious correlations from Tyler Vigen’s hilarious demonstration of data dredging (his figures are even included on the associated Wikipedia page). Data dredging is when you take many variables, say 25,237 like on his website, compare them all against each other, and blindly accept whichever correlations come out statistically significant.
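To see why this goes wrong, here is a minimal sketch in Python that dredges pure noise. Everything in it is an assumption made for illustration and not Vigen's actual method: the variables are independent random numbers, each has only 10 data points, and "significant" means the usual p < 0.05 cut-off, which for 10 points corresponds to roughly |r| > 0.632. The function names `pearson` and `dredge` are made up for this example.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dredge(n_variables=200, n_points=10, r_critical=0.632, seed=0):
    """Correlate every pair of random series and count the 'significant' ones.

    r_critical is the approximate two-tailed p < 0.05 threshold for
    n_points = 10 (an assumption; it changes with the number of points).
    """
    rng = random.Random(seed)
    # Pure noise: no variable has anything to do with any other.
    series = [
        [rng.gauss(0, 1) for _ in range(n_points)]
        for _ in range(n_variables)
    ]
    pairs = 0
    hits = 0
    for a, b in itertools.combinations(series, 2):
        pairs += 1
        if abs(pearson(a, b)) > r_critical:
            hits += 1
    return pairs, hits


pairs, hits = dredge()
print(f"{hits} of {pairs} pairs look 'significant' ({hits / pairs:.1%})")
```

Around 1 in 20 pairs should pass the test even though nothing here is related to anything else. With 25,237 variables there are roughly 318 million pairs, so under the same assumptions you would expect something like 16 million "significant" correlations from chance alone. Real-world series that trend over time will typically make things worse, not better.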

Turns out this is a major problem in the more statistical sciences, so much so that they now have a pre-registration format, in which researchers describe what a study will investigate before any of the data is examined.

This project also provides a great example of generating realistic-looking content, in the form of scientific papers, with LLMs. Each paper shows the sequence of prompts that were used to create it.

AI-generated paper for the relationship between margarine consumption and divorce rates in Maine

The author does point out that:

The silliness of the papers is an artifact of me (1) having fun and (2) acknowledging that realistic-looking AI-generated noise is a real concern for academic research (peer reviews in particular).
The papers could sound more realistic than they do, but I intentionally prompted the model to write papers that look real but sound silly.

Although, I’m sure you could convince some people that Anne Hathaway films are responsible for the number of votes for Republican senators.