This year I volunteered to manage the fall fundraiser for my kids’ Girl Scout troop. The girls sell snacks and magazines to their family and friends, and then get a cut of the proceeds. They can also earn patches and little trinkets based on how many items they sell. The Girl Scouts use the volunteer managers as a free last-mile distribution network, which means that I spent the better part of my morning today going through bags of tchotchkes and packaging them up for the 20+ girls in the troop based on what they earned.
This is fiddly work. The patches are small and easily miscounted or mislaid, and there are more than 15 different kinds of rewards, which makes it easy to accidentally skip over one. And mistakes are painful — I’m on the hook if anything is missing[^money] — so it’s important to get it right. So I borrowed something from my data analysis bag of tricks and used a checksum.
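If you want a feel for the idea, here is a minimal sketch in Python. The reward names, per-girl counts, and inventory numbers are all invented for illustration; the point is simply that the per-girl tallies and the overall inventory give you two totals that have to agree, so a mismatch flags a counting mistake before the bags go out the door.

```python
from collections import Counter

# Rewards earned by each girl (hypothetical names and numbers).
earned = {
    "Ava":  {"patch": 2, "plush": 1},
    "Bea":  {"patch": 1, "keychain": 1},
    "Cora": {"patch": 3, "plush": 1, "keychain": 2},
}

# Starting inventory of each reward type (also hypothetical).
inventory = {"patch": 6, "plush": 2, "keychain": 3}

# Checksum 1: the per-reward totals across all girls must match the inventory.
needed = Counter()
for rewards in earned.values():
    needed.update(rewards)
assert dict(needed) == inventory, f"mismatch: need {dict(needed)}, have {inventory}"

# Checksum 2: after bagging, the number of items handed out should equal
# the grand total implied by the per-girl counts.
grand_total = sum(needed.values())
print(f"Expect {grand_total} items spread across {len(earned)} bags.")
```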
Wisconsin: land of beer, bratwurst, and cheddar cheese. I consumed copious amounts of all three during my time as a grad student at UW-Madison, but none of them is the food I miss the most. That honor is reserved for spicy cheese bread.
I like to imagine that at some point in 2017[^2017] the NFL execs gathered around a long mahogany table in their secret clubhouse NYC headquarters. They took a break from their important discussions about how to downplay the connection between football and CTE, or the best ways to sucker cash-strapped municipalities into funding new stadium development, to grill the middle manager who was in charge of their data. “Hey nerd!” I presume they opened, “Why the hell have we been paying all this money for the last three years for these stupid RFID chips? Teams are barely using them!”
A big problem, possibly the biggest problem in political polling, is that you can’t know in advance which demographic groups will line up behind which candidate, how voters will turn out at the ballot box, or even whether they’ll pick up the phone to respond to a poll.
Previously in this series I discussed the concept of statistical sampling, and how even a perfectly constructed poll will produce a distribution of possible results due to the random chance of who happens to respond. Those are so-called “random errors”, and they’re relatively easy to predict and quantify. Now let’s talk about other kinds of errors, the ones that pollsters spend the bulk of their time worrying about.
In the previous post I simulated an electorate as though every person in it were essentially the same. That was useful to show the effects of statistical sampling, but the real world works differently: different demographic groups vary in candidate preference, in turnout likelihood, and even in how they interact with polls.
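As a rough illustration of that kind of uniform-electorate simulation (the code and numbers here are my own stand-ins, not the post’s; the 52% support level and 1,000-person sample size are arbitrary choices), here is a small Python sketch that polls the same population over and over to show how much the results spread out from sampling alone:

```python
import random

TRUE_SUPPORT = 0.52   # every simulated voter backs candidate A with this probability
SAMPLE_SIZE = 1000    # respondents per simulated poll
NUM_POLLS = 2000      # number of polls to simulate

results = []
for _ in range(NUM_POLLS):
    supporters = sum(random.random() < TRUE_SUPPORT for _ in range(SAMPLE_SIZE))
    results.append(supporters / SAMPLE_SIZE)

mean = sum(results) / len(results)
sd = (sum((r - mean) ** 2 for r in results) / len(results)) ** 0.5
print(f"mean poll result: {mean:.3f}, standard deviation: {sd:.3f}")
# With n = 1000 the spread comes out near sqrt(0.52 * 0.48 / 1000) ≈ 0.016,
# so individual polls routinely land a point or two away from the true 52%.
```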
The central idea that underpins all polling is the concept of statistical sampling, which may sound intimidating but for our purposes really boils down to two things:
In the aftermath of the 2020 U.S. elections, I was confused.