The proportion of Twitter followers of handle X also following Y

#1
OK, someone recently asked me for the probability that a follower of Twitter handle X is also a follower of Twitter handle Y. I used some R code (with the twitteR package) to get 47,000 followers of X and 61,000 followers of Y (the two handles have a LOT of followers). I combined them into one vector and found that roughly 1,200 data points were repeats (so 600 followers came up twice). My question now is: how do I find the proportion of shared followers? This is my reasoning:

p(A and B) = p(B)*p(A)

where p(A and B) = probability of sampling someone who follows both X and Y two times (once when sampling for X and once when sampling for Y) = 1200/(61000+47000) = 1200/108000 = 0.011

p(A) = probability of sampling an X follower who also follows Y

p(B) = probability of sampling a Y follower who also follows X

I am going to assume that p(A) = p(B).

So we can substitute 0.011 for p(A and B) and p(B) for p(A) to get:
0.011 = p(B)*p(B)

0.011 = p(B)^2

0.105 = p(B) = p(A)
So, roughly 10% of X's followers also follow Y. Am I correct in this line of reasoning? Is there anything I am missing?
 

Dason

Ambassador to the humans
#2
I don't believe you are, and I don't really follow your logic. But you have the data, so you could just compute it directly: (# of people that follow both X and Y)/(# of people that follow X) will get you the proportion of X's followers that also follow Y. I couldn't completely follow your procedure, but I think you're saying that there were 600 handles that were in both lists. That means the probability that an X follower also follows Y is 600/47000, which is roughly 1.3%.
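A minimal sketch of that direct computation in R, using made-up screen names rather than the real follower lists (the vectors `x_followers` and `y_followers` are hypothetical). It also shows that the two conditional proportions generally differ, because they share a numerator but not a denominator:

```r
# Hypothetical follower lists standing in for the real data
x_followers <- c("ann", "bob", "cat", "dan", "eve")
y_followers <- c("dan", "eve", "fay")

# Handles that appear in both lists
n_both <- length(intersect(x_followers, y_followers))

# Proportion of X's followers that also follow Y
prop_x_also_y <- n_both / length(unique(x_followers))

# Proportion of Y's followers that also follow X (different denominator)
prop_y_also_x <- n_both / length(unique(y_followers))
```

With these toy lists, the first proportion is 2/5 and the second is 2/3, so there is no reason in general for p(A) and p(B) to be equal.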
 
#3
Thanks for the reply. However, isn't there a pretty good chance that I sample someone from handle X that also follows handle Y, but that person was not in the handle Y sample? In that case, I would be underestimating the proportion of people who follow both. Does that make sense?
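This concern can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population sizes are invented, the true overlap is fixed at 10% of X's followers, and both samples are simple random samples (which a real API download may not be).

```r
set.seed(1)

# Hypothetical populations: 100,000 X followers, 30,000 Y followers
x_followers <- 1:100000
y_followers <- c(sample(x_followers, 10000),  # 10% of X's followers also follow Y
                 100001:120000)               # followers of Y only
true_prop <- mean(x_followers %in% y_followers)

# Partial samples of each follower list
x_sample <- sample(x_followers, 5000)
y_sample <- sample(y_followers, 6000)  # covers 20% of Y's followers

# Naive estimate: share of sampled X followers found in the Y sample
naive <- mean(x_sample %in% y_sample)

# Rescale by the fraction of Y's followers that the Y sample covers
corrected <- naive / (length(y_sample) / length(y_followers))
```

In this setup the naive estimate is too low by roughly the coverage fraction of the Y sample, and dividing by that fraction brings it back near the true proportion.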
 

Dason

Ambassador to the humans
#6
Do you know how many followers each handle has? And how did you get your sample? Post your code if you can.
 
#7
So handle X has roughly 6,000,000 followers and Y has roughly 300,000.
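With those totals, a rough back-of-the-envelope adjustment is possible. This is a sketch only, and it assumes both samples are simple random samples of each handle's followers; if the API returns followers in some systematic order, the adjustment may not be valid. All figures are the approximate ones quoted in this thread.

```r
# Figures quoted in the thread (all approximate)
n_shared  <- 600     # handles found in both samples
n_x_samp  <- 47000   # sampled followers of X
n_y_samp  <- 61000   # sampled followers of Y
n_y_total <- 300000  # total followers of Y

# Naive proportion of sampled X followers seen in the Y sample
naive <- n_shared / n_x_samp

# Fraction of Y's followers that the Y sample covers
y_coverage <- n_y_samp / n_y_total

# Under random sampling, a true X-and-Y follower shows up in the
# Y sample with probability y_coverage, so rescale by it
corrected <- naive / y_coverage

round(c(naive = naive, y_coverage = y_coverage, corrected = corrected), 3)
```

Under that assumption the adjusted figure comes out at roughly 6% of X's followers, compared with about 1.3% for the naive ratio.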

Here is the code (using the twitteR package in R, plus data.table). The handle names have been replaced by X and Y.


library(twitteR)     # getUser() and the user object's getFollowers()
library(data.table)  # rbindlist()

# Assumes an authenticated session, e.g. via setup_twitter_oauth()
X <- getUser("X") # get information for X
# Get a sample of X's followers (I stopped after a few minutes because it took so long, hence only 47,000 followers)
X_follower_IDs <- X$getFollowers(retryOnRateLimit = 300)
Xfollowerdf <- rbindlist(lapply(X_follower_IDs, as.data.frame)) # convert to a data.table
Y <- getUser("Y")
# Again, I stopped after a few minutes, and got about 61,000 followers
Y_follower_IDs <- Y$getFollowers(retryOnRateLimit = 300)
Yfollowerdf <- rbindlist(lapply(Y_follower_IDs, as.data.frame))
# Stack both samples; the column Y labels which handle each row was sampled from
combineddf <- data.frame(
  followers = c(Yfollowerdf$screenName, Xfollowerdf$screenName),
  Y = c(rep("Y", length(Yfollowerdf$screenName)), rep("X", length(Xfollowerdf$screenName))))
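For counting the repeats in a combined data frame like this one, a sketch along the following lines may help. It uses invented screen names (`y_names`, `x_names`) and a label column called `handle`, all hypothetical, and it assumes no handle is listed twice within the same sample.

```r
# Hypothetical screen names standing in for the two follower samples
y_names <- c("ann", "bob", "cat", "dan")
x_names <- c("cat", "dan", "eve")

combined <- data.frame(
  followers = c(y_names, x_names),
  handle = c(rep("Y", length(y_names)), rep("X", length(x_names)))
)

# duplicated() flags only the second occurrence of each repeated name,
# so this counts each shared handle once
n_shared <- sum(duplicated(combined$followers))

# Rows involved in a repeat (both occurrences), i.e. twice n_shared
repeated_names <- combined$followers[duplicated(combined$followers)]
n_repeat_rows <- sum(combined$followers %in% repeated_names)

# The same shared handles, obtained directly from the two vectors
shared_names <- intersect(x_names, y_names)
```

Here two handles are shared and four rows are repeats, which mirrors the "1,200 repeats, so 600 followers" count described in the first post.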