# Empirical example of the ecological fallacy

Having begun my reading about spatial statistics, I naturally started with Applied Spatial Statistics for Public Health Data (2004) by Waller and Gotway. One of the concepts introduced at the beginning of the book is the ‘ecological fallacy’: the relationship between two variables can change depending on the level of data aggregation, so an association seen in grouped data need not hold for individuals. Epidemiological concepts like this are counterintuitive, so I usually have to prove to myself that they are true. In this case, I used simulated data to do so.

I was interested in the relationship between two variables, Y and Z. I generated 100 values of Y, independent and identically distributed (iid) as standard normal. To create a correlation with Z, I made Z the sum of Y and another iid standard normal random variable, A. Thus, Z = Y + A.

I first calculated the correlation between Z and Y at the individual level. Then I took the same 100 observations of Y, split them into 10 groups of 10 observations, and calculated the mean of Y within each group. A new group-level value of Z was then calculated for each group as that group's mean of Y plus one of the first 10 generated values of A. Finally, the correlation between the 10 new values of Z and the 10 means of Y was calculated and compared with the correlation from the individual-level data.
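As a rough check on what this setup should produce on average, the population correlations can be worked out directly from the assumptions above (Y and A independent standard normals). The notation here is introduced only for this sketch: $\bar{Y}_g$ is the mean of Y in group $g$, and $A_g$ is the value of A added to that group.

$$
\operatorname{Cor}(Y, Z) = \frac{\operatorname{Cov}(Y,\, Y + A)}{\sqrt{\operatorname{Var}(Y)\,\operatorname{Var}(Y + A)}} = \frac{1}{\sqrt{1 \cdot 2}} \approx 0.71
$$

$$
\operatorname{Cor}(\bar{Y}_g,\, \bar{Y}_g + A_g) = \frac{\operatorname{Var}(\bar{Y}_g)}{\sqrt{\operatorname{Var}(\bar{Y}_g)\,\bigl(\operatorname{Var}(\bar{Y}_g) + \operatorname{Var}(A_g)\bigr)}} = \frac{1/10}{\sqrt{(1/10)(1/10 + 1)}} = \frac{1}{\sqrt{11}} \approx 0.30
$$

These are expected values only. A sample correlation computed from just 10 group means is noisy, so any single simulated run can land some distance from 0.30.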

### Code

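y<-rnorm(100) # 100 iid standard normal values of Y

ind<-seq(10,100,10) # last index of each group of 10

y.mean<-numeric(length(ind)) # preallocate one mean per group

for(i in 1:10){

begin<-(i-1)*10+1 # first index of group i

end<-i*10 # last index of group i

x<-y[begin:end] # the 10 observations in group i

y.mean[i]<-mean(x)

}

a<-rnorm(100) # iid standard normal noise A

z<-y+a # individual-level Z

z.mean<-y.mean+a[1:10] # group-level Z: group mean of Y plus one of the first 10 values of A

cor(y,z) # individual-level correlation

cor(y.mean,z.mean) # group-level correlation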
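The grouping step can probably also be written without an explicit loop. The following is a minimal sketch, not the code used for the results reported here: the seed and the names `y.mean.vec` and `y.mean.loop` are assumptions introduced for illustration, and the loop version is included only to check that both approaches give the same group means.

```r
set.seed(1)  # hypothetical seed, only so this sketch is repeatable

y <- rnorm(100)  # individual-level Y
a <- rnorm(100)  # individual-level noise A

# matrix() fills column by column, so column j holds y[(10*(j-1)+1):(10*j)];
# colMeans() then returns the mean of each group of 10
y.mean.vec <- colMeans(matrix(y, nrow = 10))

# the same group means computed with an explicit loop, for comparison
y.mean.loop <- numeric(10)
for (i in 1:10) {
  y.mean.loop[i] <- mean(y[((i - 1) * 10 + 1):(i * 10)])
}

all.equal(y.mean.vec, y.mean.loop)  # checks that the two versions agree

z <- y + a                      # individual-level Z
z.mean <- y.mean.vec + a[1:10]  # group-level Z, built as in the code above

cor(y, z)                # individual-level correlation
cor(y.mean.vec, z.mean)  # group-level correlation
```

Because the seed differs from whatever state produced the original run, the correlations printed by this sketch will not match the ones reported in this post.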

### Results

The correlation for the individual-level data was 0.71, while the correlation for the aggregated, mean data was 0.54. Both correlations were estimated from the same simulated values of Y and A, although the aggregated analysis uses only the first 10 values of A. Graphs of the values for Z and Y, in each situation, are shown below.

### Conclusions

Aggregation can produce very different estimates, and sometimes different inferences, from the same underlying individual-level observations. This is the ecological fallacy at work. It is important to remember the limitations of aggregated data when using them to draw conclusions about individuals.