Choosing the best variables for a linear regression

#1
I want to perform a regression between two variables. One is the total revenue of an organisation and the other is the total revenue collected indirectly - this assumes that there is also direct collection. The total revenue is the sum of the indirect collection and the direct collection.

My linear regression aims to establish a relationship between indirect collection and revenue. I did the calculation with excel data analysis and the result was 0.822, which I think is pretty good.

However, when thinking about this I thought, "If indirect collection is a portion of a total collected, wasn't I supposed to run the regression using percentage values instead of the raw nominal values?"

Then I converted the indirect collection values (in the table below, first column) into percentage values (third column), that is, the indirect collection corresponds to X% of the total revenue:

Indirect collection Total revenue Indirect collection Total revenue
384 520,76 ___________ 488 035,07 ________________78,79 _______________488 035,07
318 212,53 ___________ 575 745,64 ________________55,27 _______________ 575 745,64
423 518,04 ___________ 708 524,95 ________________ 59,77 _______________708 524,95
487 925,30___________ 764 408,51 ________________ 63,83 _______________ 764 408,51
580 426,63 ___________745 557,47 ________________ 77,85 _______________ 745 557,47
668 926,14___________ 864 968,77 ________________ 77,34 _______________ 864 968,77
...

The recalculation of the regression is 0.34... In other words, the argument is no longer in effect. So I got really confused. And this would be for use in a paper, so I need to know which of the two methods is the more important to calculate the regression. The aim is to find out whether developments in indirect collection values affect total revenue.

Can you help this noob out? Thanks in advance.
 
#3
0.822 is the R square. The correlation is a bit higher.

What you say makes sense. Do you know any other way to analyse the possible impact of one variable on the other?

Many thanks.
 

katxt

Well-Known Member
#4
You could just average the percentages, and find a confidence interval for the average. I'm not sure of the context, but that seems simple and effective. You can then say something like "I'm not certain of the percentage Indirect, but I'm reasonably sure that it is somewhere between 68% and 82%."
 
#6
The problem is, in that way am I not comparing a static variable to a non static variable? When I calculate the average of the percentage, I am using a value that scales from 0 to 100. In theory this means the variable Total can increase and the Indirect remains at 100%. Isn't that a problem?
 

katxt

Well-Known Member
#7
I don't really follow what you are saying here. Perhaps I haven't grasped what you are trying to do.
For me, saying that on average the indirect collection is some particular percentage of the total revenue is much more meaningful than saying the regression Rsquare is 0.8, even if the calculation was actually correct.
Do you suspect that the percentage changes in an important way as the level of Total revenue changes?
 
#8
Just to give some context. I'm studying the finances of an organisation between the 16th and 19th centuries. Revenue could be collected by private contractors, with the farming system (i.e. indirect collection) or by the organisation's own officers (direct collection). The argument in the literature is that indirect collection was more efficient than direct collection.

My empirical evidence shows that until the early 18th century the organisation I study used more indirect collection, then it switched to direct collection. Looking at my graphs, I see that real (deflated) revenue decreases over time. The thesis I want to find out is if the decrease in revenue is related to the decrease in indirect collection - that is, if the indirect model is more efficient and if the organisation switched to a direct model (which is less efficient), then the cause for the decrease in total revenue might be there.

That's why I thought about a regression. My first proposition, which I mentioned in the first post, is: "the decrease in the gross nominal value of indirect collection led to a decrease in gross nominal total revenue". Here the Rsquare is 0.8.

The second proposition is: "the decrease in the percentage of indirect collection led to a decrease in gross nominal total revenue". Here the Rsquare is 0.3.

In other words, these are two completely different interpretations and they fundamentally change the conclusion.

---------

Now, total revenue is the sum of indirect collection and direct collection. It may happen that indirect collection goes up and direct collection goes down, or vice versa. However, there are situations in which all three variables go down: total revenue decreases as a result of the decrease in the gross nominal totals of indirect or direct collection.

However, (and here is the problem associated with my last publication), indirect collection may fall more than direct collection. This means that, in percentage terms, there is an increase in direct collection, not because there was a nominal increase in direct collection, but because gross nominal indirect collection fell at a greater rate. My insecurity and uncertainty lies in this type of reasoning, whether the analysis of percentages might not be misleading in this type of scenario?
 

katxt

Well-Known Member
#9
That's why I thought about a regression. My first proposition, which I mentioned in the first post, is: "the decrease in the gross nominal value of indirect collection led to a decrease in gross nominal total revenue". Here the Rsquare is 0.8.
The second proposition is: "the decrease in the percentage of indirect collection led to a decrease in gross nominal total revenue". Here the Rsquare is 0.3.
In other words, these are two completely different interpretations and they fundamentally change the conclusion.
Neither of these are valid because you have the same variable on both sides in both cases.
The first is indirect vs indirect + direct. The second is indirect/total vs total. This will always introduce spurious correlations.
Your question isn't precise enough for me. If you want to see if the ratio has changed over the years, try regressing % on to time. There may be problems here but it could be a first exploratory step.
 

katxt

Well-Known Member
#10
A little experiment for you.
in A1 type =RAND() copy down to A20.
In B1 type =A1+RAND() copy down to B20
in C1 type =CORREL(A1:A20,B1:B20)
Even with random numbers the correlation will be about 0.7 simply because column A is included on both sets.
 
#11
Thank you for your replies.

I made the experiment and you are right. But then I tried this: so the TOTAL REVENUE is the sum of INDIRECT and DIRECT collection. This means both variables are included in the total. Which, according to your experience they will have a high correlation because they are included in the TOTAL REVENUE variable. But the correlation and regression results are very different.

The correlation between INDIRECT and TOTAL REVENUE is 0.90 and the Rsquare is 0.82. The correlation between DIRECT and TOTAL REVENUE is 0.34 and the Rsquare is 0.12. If the fact that the variables DIRECT and INDIRECT are included in the variable TOTAL REVENUE determines a link, shouldn't the values be at least somewhat closer?

Perhaps I am complicating things. I guess what confused me was some feedback I got elsewhere poiting in the oposite direction. So I asked this person about the fact that one of the variables is included in the other and that person said "No it is not an issue. All regressions you are assuming the dependent variable is some function of the independent variables. Here you know the structure exactly which is exactly what you want. The coefficient will represent the average ratio of indirect to total."

Thank you for yoyr help.
 

katxt

Well-Known Member
#12
The correlation between INDIRECT and TOTAL REVENUE is 0.90 and the Rsquare is 0.82. The correlation between DIRECT and TOTAL REVENUE is 0.34 and the Rsquare is 0.12. If the fact that the variables DIRECT and INDIRECT are included in the variable TOTAL REVENUE determines a link, shouldn't the values be at least somewhat closer?
The indirect is much bigger than the direct so it has more influence on the total so has a higher correlation. This all makes sense to me.
The coefficient will represent the average ratio of indirect to total.
It is true that if you do the regression of Total on Indirect the slope will be the average ratio. However, the regression Rsquare will be wrong, the correlation will be wrong, the p values will be wrong, the standard error of the slope will be wrong, the F values, ... pretty well everything in the regression output will be wrong except the slope.
But you can also get the average ratio just by averaging them as suggested in post #4.