Statistics

A team at Carnegie Mellon released a report a few weeks back, detailing the results of an experiment to determine the how many people would run a suspicious application at different levels of compensation. The report paints a pretty cynical picture of the “average” Internet user, which generally meshes with our intuition on. Basically, the vast majority of participants ran suspicious code for $1, even though they knew it is dangerous to do so. Worse, a significant number of participants ran the code for the low, low price of one cent. This seems to paint a pretty dire picture, in the same vein as previous research where subjects gave up passwords for a candy bar.

However, I noticed a potential problem wit the report. The researchers relied on Amazon’s Mechanical Turk service to find participants for this study. When performing studies like this, it’s important that the population sampled are representative of the population which the study is intending to estimate. If the population sampled is not representative of the broader population, the results will be unreliable for estimating against the broader population.

Consider this scenario: I want to estimate the amount of physical activity the average adult in my city gets per day. So, I set up a stand at the entrance to a shopping center where there is a gym and I survey a those who enter the parking lot. With this methodology, I will not end up with an average amount of physical activity for the city, because I have skewed the numbers by setting up shop near a gym. I will only be able to estimate the amount of physical activity for those people who frequent this particular shopping center.

The researchers cite a previous study which determined that the “workers” of Mechanical Turk are more or less representative of the average users of the Internet at large based on a number of demographic dimensions, like age, income and gender.

I contend that this is akin to finding that the kinds of stores in my hypothetical shopping center draw a representative sample of the city as a whole, based on the same demographic dimensions, and in fact I see that result in my parking lot survey. However, my results are still unreliable, even though the visitors are, in fact, representative of the city. Why is that? Hours of physical activity is orthogonal (mostly) to the demographic dimensions I checked: income, age and gender.

In the same fashion, I contend that while the demographics of Mechanical Turk “workers” match that of the average Internet user, the results are similarly unreliable for estimating all Internet users. Mechanical Turk is a concentration of people who are willing perform small tasks for small amounts of money. I propose that the findings of the report are only representative of the population of Mechanical Turk users, not of the general population of Internet users.

It seems obvious that the average Internet user would indeed fall victim to this at some price, but we don’t know for sure what percentage and at what price points.

I still find the report fascinating, and it’s clear that someone with malicious intent can go to a market place like Mechanical Turk and make some money by issuing “jobs” to run pay per install malware.

Tag: Statistics

It’s All About The Benjamins… Or Why Details Matter