With only a modest grasp of inferential stats and maths in general, I've been wondering how one might derive the weights for a weighted average from a variable that isn't as straightforward as a frequency score. Consider the following scenario.
Let's say I have measurements of effectiveness for a set of wheat fertiliser products (such as a growth rate), along with measurements of various associated behaviours in the crop (improved absorption levels, nutrient conversion rates, etc.). Observations are made at many farms across the globe and throughout the year, so conditions vary, hence the need for historical averaging in an analysis of the products. Also, notably, the effectiveness of a product and its behaviour profile often vary a lot from one wheat variety to the next, with some products working better on variety A, others on variety B or variety C, etc.
Therefore, if I want averages for the various readings of a particular product on variety A, I could simply exclude all observations made on other varieties from the analysis. However, let's assume the amount of data is limited so that excluding those observations would be too costly. In that case I wouldn't want to treat all of the observations for the product in question with equal relevance since observations on non-A varieties would carry less 'weight'. Taking weighted averages would of course address this obstacle, but how to go about calculating the weights here?
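To pin down what I mean by a weighted average, here is a minimal Python sketch. The function name, the readings and the 0.5 weight are all placeholders I made up for illustration; choosing that weight sensibly is exactly what I'm asking about.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical absorption readings for one product: two on variety A, one on variety B.
readings = [10.0, 12.0, 8.0]
# Variety A observations get weight 1; the 0.5 for variety B is an arbitrary placeholder.
weights = [1.0, 1.0, 0.5]

print(weighted_mean(readings, weights))
```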
My first idea would be to simply use the ratio of a given product's average effectiveness between variety pairs. For example, if a product's average effectiveness is 50 units on variety A and 40 units on variety B, and I want averages of the behaviour readings of this product on variety A, could the weight assigned to observations on variety B be calculated as follows: 40 / 50 = 0.80 (where observations on variety A are assigned a weight of 1)? But then what if the scores were reversed (40 units on variety A and 50 units on variety B) - would the following work:
50 / 40 = 1.25,
1.25 - 1 = 0.25,
1 - 0.25 = weight of 0.75?
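Written out as code, both cases above seem to reduce to a single rule: one minus the relative difference from the target variety's average. This is only a sketch of my own idea, not an established method, and the function name is made up.

```python
def ratio_weight(target_mean, other_mean):
    """Weight for observations on another variety: 1 - |other/target - 1|."""
    if target_mean == 0:
        raise ValueError("target_mean must be non-zero")
    return 1 - abs(other_mean / target_mean - 1)


print(ratio_weight(50, 40))   # first case: 50 on variety A, 40 on variety B
print(ratio_weight(40, 50))   # reversed case: 40 on variety A, 50 on variety B
print(ratio_weight(40, 100))  # other variety's average is more than double the target's
```

The last call comes out negative: the rule drops below zero whenever the other variety's average is more than double the target's, so it would need at least a floor at 0 to be usable as a weight.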
My problem (confusion) with this method is in understanding how a difference in effectiveness (as presented here) can be directly used as a measure of dependence between datasets. If a product is 20% less effective on variety B than on variety A, is there any mathematical reasoning that leads to the conclusion that observations on variety B are 20% less relevant in an analysis of the product on variety A?
Compare this with an alternative approach. I believe I could find the correlation between the average effectiveness scores for each pair of varieties across all of the products (assuming there are enough products) and use the correlation scores directly as the weights, since correlation certainly is a measure of dependence. But wouldn't this technique blanket over the individual differences between the products in a way that the former approach would not, leading to less accurate weighted averages for a given product?
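For comparison, the correlation idea might look something like this. Everything here is assumed for illustration: the per-product averages are invented, and I've written Pearson's r out by hand rather than relying on a library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented average effectiveness per product (same product order in both lists).
variety_a_means = [50.0, 40.0, 62.0, 55.0, 47.0]
variety_b_means = [40.0, 50.0, 58.0, 49.0, 45.0]

# One weight for the whole A-B pair, applied to every product alike.
pair_weight = pearson_r(variety_a_means, variety_b_means)
print(pair_weight)
```

This yields a single number per variety pair, which is what worries me: every product ends up with the same weight regardless of how it behaves individually. A correlation can also come out negative, and I'm not sure what a negative weight would mean here.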
Sorry if some major rookie errors are overcomplicating things here but my head really hurts! Any pointers would be appreciated.