


Behaviour of Spearman's Rho when data are added in batches
Posted: Jun 4, 2017 5:41 PM


Here's a curiosity I've just come across. I'd be interested to hear whether anyone knows anything about it.
I've been trying to evaluate Spearman's Rho on a large number of (x, y) pairs, so I spent a couple of hours experimenting. I am using Spearman's Rho rather than Pearson's r because the values of x and y are not normally distributed.
I have two lists of rectangularly distributed random numbers, each 2000 numbers in length. I treat one as a list of x's and the other as a list of y's. I compute Spearman's Rho and its value is close to zero, as you would expect. Actually its value is 0.00547 to three significant figures.
So I add the numbers in each list in batches of 10, so that I now have two lists of 200 numbers, each number being the sum of ten consecutive numbers in the old lists. I compute Spearman's Rho on the two lists and its value is now -0.0266, a bit less close to zero, and negative.
I tried adding the numbers in batches of 25, so now I had two lists of 80 numbers, and the value of Spearman's Rho was -0.120.
Finally I tried batches of 50, giving two lists of 40 numbers, and I got a Spearman's Rho of -0.0574.
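The batching described above can also be written without explicit loops. This is only a sketch: batch_sum is a helper name made up for illustration (it is not a built-in R function), and it assumes the length of the vector is an exact multiple of the batch size. The batch sizes are the ones used in the code further down.

```r
# Hypothetical helper (not part of base R): sum consecutive batches of size k.
# Assumes length(x) is an exact multiple of k.
batch_sum <- function(x, k) {
  stopifnot(length(x) %% k == 0)
  # matrix() fills column by column, so each column holds one batch of k values
  colSums(matrix(x, nrow = k))
}

set.seed(42)
s <- runif(2000, min = 0, max = 100)
t <- runif(2000, min = 0, max = 100)

# Spearman's rho for the raw data (k = 1) and for batch sizes 10, 25 and 50
batch_sizes <- c(1, 10, 25, 50)
rho <- sapply(batch_sizes, function(k) {
  cor(batch_sum(s, k), batch_sum(t, k), method = "spearman")
})
data.frame(batch_size = batch_sizes, n = 2000 / batch_sizes, rho = rho)
```

Because each column of the matrix is one batch, colSums gives the same sums as the nested loops, and changing the batch size only means changing k.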
Now I wondered whether this always happened. I tried two lists of 2000 different numbers and I got these results:

Batch size    R
Individual    0.00505
10            0.0457
25            0.187
40            0.119
Is this anything interesting, or was it obvious to people who know more statistics than I do? Is there any reason to prefer to batch-add the x's and y's or to keep them separate?
Thanks,
Ken Johnson
Here is the code in R if you want to copy and paste it. I have no idea whether set.seed() works the same way on all implementations.
# First set of numbers
set.seed(42)
s <- runif(2000, min = 0, max = 100)
t <- runif(2000, min = 0, max = 100)
# Second set of numbers
# (use either this block or the one above; running both leaves only the second set in s and t)
set.seed(342)
s <- runif(2000, min = 0, max = 100)
t <- runif(2000, min = 0, max = 100)
# Batch add in batches of 10
s1 <- numeric(200)  # numeric vectors of zeros to accumulate the sums
t1 <- numeric(200)
for (i in 1:200) {
  for (j in (((i - 1) * 10) + 1):(((i - 1) * 10) + 10)) {
    s1[i] <- s1[i] + s[j]
    t1[i] <- t1[i] + t[j]
  }
}
# now try Spearman
cor.test(s, t, method = "spearm")
#         Spearman's rank correlation rho
#
# data:  s and t
# S = 1.326e+09, p-value = 0.8067
# alternative hypothesis: true rho is not equal to 0
# sample estimates:
#         rho
# 0.005473995
cor.test(s1, t1, method = "spearm")
#         Spearman's rank correlation rho
#
# data:  s1 and t1
# S = 1368700, p-value = 0.7085
# alternative hypothesis: true rho is not equal to 0
# sample estimates:
#         rho
# -0.02657766
# same effect - try larger batch size of 25
s2 <- numeric(80)
t2 <- numeric(80)
for (i in 1:80) {
  for (j in (((i - 1) * 25) + 1):(((i - 1) * 25) + 25)) {
    s2[i] <- s2[i] + s[j]
    t2[i] <- t2[i] + t[j]
  }
}
cor.test(s2, t2, method = "spearm")
#         Spearman's rank correlation rho
#
# data:  s2 and t2
# S = 95518, p-value = 0.2903
# alternative hypothesis: true rho is not equal to 0
# sample estimates:
#        rho
# -0.1195265
# batch size of 50, giving 40 sums
s3 <- numeric(40)
t3 <- numeric(40)
for (i in 1:40) {
  for (j in (((i - 1) * 50) + 1):(((i - 1) * 50) + 50)) {
    s3[i] <- s3[i] + s[j]
    t3[i] <- t3[i] + t[j]
  }
}
cor.test(s3, t3, method = "spearm")
#         Spearman's rank correlation rho
#
# data:  s3 and t3
# S = 11272, p-value = 0.7242
# alternative hypothesis: true rho is not equal to 0
# sample estimates:
#         rho
# -0.05741088
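For reference, the S statistic that cor.test prints is related to rho by rho = 1 - 6S / (n(n^2 - 1)), and when there are no ties S is the sum of the squared rank differences. The sketch below recomputes rho from the S values printed above. The name rho_from_S is only illustrative, and because the printed S values are rounded the results are approximate.

```r
# Recover rho from the S statistic: rho = 1 - 6 * S / (n * (n^2 - 1))
# rho_from_S is a made-up helper name, not a base R function.
rho_from_S <- function(S, n) 1 - 6 * S / (n * (n^2 - 1))

# S values as printed in the cor.test output (rounded by R when printing)
S <- c(1.326e9, 1368700, 95518, 11272)
n <- c(2000, 200, 80, 40)
signif(rho_from_S(S, n), 3)
```

This gives a small positive value for n = 2000 and negative values for the three batched runs.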
# Second set of numbers
# Batch size   R
#  1           0.005046081
# 10           0.04569264
# 25           0.1874824
# 40           0.1193246
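One thing that may account for the pattern: when x and y are independent and there are no ties, the sample value of Spearman's Rho has mean 0 and standard deviation 1/sqrt(n - 1), so it is expected to wander further from zero as n shrinks. Since the statistic depends only on ranks, batch-adding independent numbers should matter only through the smaller n. The sketch below compares the theoretical figure with a small simulation. The seed, the number of repetitions (reps) and the helper name sim_sd are arbitrary choices for illustration.

```r
# Theoretical standard deviation of Spearman's rho under independence (no ties)
n <- c(2000, 200, 80, 40)
theory_sd <- 1 / sqrt(n - 1)

# Hypothetical helper: empirical sd of rho over `reps` pairs of independent
# uniform samples of length n
sim_sd <- function(n, reps = 500) {
  sd(replicate(reps, cor(runif(n), runif(n), method = "spearman")))
}

set.seed(1)  # arbitrary seed, chosen only for reproducibility
sim <- sapply(n, sim_sd)
round(data.frame(n = n, theory_sd = theory_sd, simulated_sd = sim), 4)
```

The theoretical values are roughly 0.022, 0.071, 0.113 and 0.160 for n = 2000, 200, 80 and 40. The values of Rho reported above are all within about two such standard deviations of zero, which is what chance alone would be expected to produce.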



