How to compare similar ratings?

#1
Most of the products on Amazon have a 5-star rating (1.0 - 5.0). Few products have ratings below 4.0 and almost none below 3.0. The majority are clustered around 4.2 to 4.7 or so. The ratings include the number of reviews from which those ratings were calculated. Those numbers can vary from just a few to tens of thousands. Clearly a 4.8 rating based on 50,000 reviews is more reliable than another 4.8 rating based just 25 or even 500 reviews. But just how much more reliable? Amazon also shows the breakdown of how many of those reviews are 5 stars, how many are 4 stars, how many 3 stars, and so on.

The table below lists several products (car phone holders) showing their ratings (col I), the breakdown (cols C-G), and the number of reviews (col J). In col K, I calculate the rating by adding up the percentages (C-D/100) multiplied by the corresponding number of stars. For example, K5 = 4.87 = .91*5 + .07*4 + .01*3 + 0*2 + .01*1. This gives a rating with 2 decimal places, whereas Amazon only displays 1 decimal point.

In col L, the values in K are rounded to 1 decimal place to match what Amazon displays. The differences are in col M. Most match the Amazon ratings, but a few are off by 0.1.

Can someone suggest a formula that will generate an "adjusted" rating taking into account the number of reviews? For example, I would assume that of the 3 products that have a rating of 4.9 (rows 4-6), the product in row 5 with 368 reviews would be more reliable than either of the other two.

Amazon Ratings.jpg

Thanks
 

Karabiner

TS Contributor
#2
But just how much more reliable?
You can use the formula for a 95% confidence interval of a mean, and put n=500 in that formula, then n=50.000.
https://www.mathsisfun.com/data/confidence-interval-calculator.html
Amazon also shows the breakdown of how many of those reviews are 5 stars, how many are 4 stars, how many 3 stars, and so on.
Great, so you can also calcuate the standard deviation, which is required in the formula for the confidence interval,
and need not guess.

Can someone suggest a formula that will generate an "adjusted" rating taking into account the number of reviews?
You could weight the individual means by their respective sample sizes, and calculate the total mean with that.

With kind regards

Karabiner
 
#3
Great, so you can also calcuate the standard deviation, which is required in the formula for the confidence interval, and need not guess.
I think I have the mean, but the std deviation is eluding me. Can you help with the formulas for std deviation?

1664680614944.png

Columns C-H contain the Amazon ratings. I-N are based on those numbers.

Columns Q-U are my calcvulations of the actual numbers of ratings in each category. V-X are based on those numbers.

You can find the Excel workbook here:

OneDrive folder containing Excel workbook

Thanks
 
#4
(sigh) I believe I have been thinking backwards, or maybe sideways. I don't want the z scores within the ratings for a single product. I want the z scores for the ratings among (or between) all of the products bweing considered.

I believe this table handles it correctly. Let me know if this is not so.

1664694466882.png

I guess my next step is the confidence interval. Any tips on that?
 
#5
OK. After some reading and some fiddling, I think I have settled on z scores as the best way to compare sample ratings on diverse scales.

The table below illustrates some ways that this could be implemented for Amazon product ratings. These are the actual ratings for 19 actual products. These happen to be cell phone holders for cars,

Amazon ratings are on a 5-star scale. Users can rate a product from 1 to 5.
  • C-H: The percentages of each rating (5, 4, 3 2, 1).
  • H: The sum of those percentages. They don't always add up to 100 because Amazon only reports the numbers rounded to integers.
  • I-K: The actual ratings, the associated z scores, and the ranks.
  • L-N: The number of reviews, the associated z scores, and the ranks.
  • O-Q: The square root of the number of reviews (#^0.5). This is an attempt to reduce the std dev. As you can see, there is no change in the ranks. I see a minimal change at #^0.45. There is not a significant change until #^0.3.
  • R-S: Add the z scores for the ratings and the number of reviews. Is this a better ranking than the z scores for just the rating?
  • T-U: Add the z scores for the ratings and the square root of the number of reviews. The rankings in U are identical to those in S.
  • V-X: The sum of the percentages of 5 and 4 star ratings and the associated z scores and ranks. I wonder if this is a better ranking? Or what about doing one for the sum of 1 & 2 star ratings and subtracting that from the 4 + 5 ratings?
1665123808524.png

I would appreciate any comments or suggestions.

Thanks
 

Karabiner

TS Contributor
#6
Add the z scores for the ratings and the number of reviews. Is this a better ranking than the z scores for just the rating?
This means that you calculate a composite of the ratings and the numer of purchases/responses.
What have you finally achieved with this?

And what precisely do you want to achieve with the whole analysis?

By the way, the Spearman rank correlation between mean rating and number of ratings is very high (rho= -0.74).
So I think the mathematics are less interesting (or difficult) here than finding out what the frequencies and the
mean ratings really express.

With kind regards

Karabiner
 
#7
My end goal is a tool that I can use to help me select the best product of the ones that are available.

Whenever I set out to buy something that has a lot of “factors” that I need to consider, such as price, weight, resolution, response time, calories, capacity, noise level, power, efficiency, and ratings from one or more sources, I get frustrated trying to keep track of so any factors for so many products. They are all on different scales and the scales may go high-to-low or low-to-high.

I thought if I had a tool that would allow me to convert each product factor to a common scale, all going in the same direction (High to Low), I could then compute a composite value. I could even assign weighting factors to each factor to ensure that the ones that are the most important to me get an appropriate weight.

After playing around a bit and reading the opinions and references here and elsewhere, I believe that z scores (standard scores) will allow me to convert pretty much any factor to a common scale with mean=0 and std dev=1. That will eliminate a lot of the vagaries of many of the factors, especially the ratings. It not only arranges the values in order, but also indicates by how much they are above or below the mean for that sample. This is critical for things like the ratings that tend to be bunched at one end of the scale. This should allow me to combine them and add weighting factors.

The table below is my latest attempt. It looks to me like it will do the job. Of course, I would be interested in any criticisms or suggestions.

In the main table, columns C & D contain the product ID and any comments about how I chose the values. These are all fictitious products. Once I get the underlying code working, I’ll test it on real products.

These 8 products are being evaluated on 4 factors: Price (E-G), Rtg (H-J), MPG (K-M), and dB (N-P). The ratings are an Amazon 5-star system. MPG is miles/gallon. dB is the noise level in decibels. In each section, the 1st column is the factor value, the 2nd column is the z score, and the 3rd is the rankings based on the z score.

The rows above the table contain the mean (3), std dev (4), order (5), and weight (6). The mean and stddev are for the z score calculation. The order specifies whether high (HiLo) or low (LoHi) values are favored, which determines the sign on the z scores. The weights allow me to give more or less weight to each factor.

Columns Q & R show the sum of the z scores and the rankings based on them.

Columns S & T do the same but with the mean of the z scores, rather than the sum. Since they both result in the exact same rankings, which is what I would expect, I see no benefit to using the averages, so I’ll probably just use the sums.

Column U contains the z score of the z score sums. I did this this just out of curiosity. As expected, the mean is close to zero. I don’t see any benefit to this, either.

Columns V & W show the weighted sums and the rankings based on them. If the weights are all 1 (or any other number), then this will be identical to columns Q & R. In this example, I assign a weight of zero to the price, indicating that I don’t care what it costs, leaving the composite rating to be based on the other factors. Column X contains the difference between the weighted ranking and the unweighted one. Most of the products only moved by 1 rank.

Product F moved up 5 levels to #1. It was the second most expensive product (rank 7), but that factor was eliminated. It got the lowest rating, but that factor has the lowest weight of the remaining 3. It was #2 on MPG and #1 on dB.

1665441662403.png

Thanks for the help.
 
#8
You can use the formula for a 95% confidence interval of a mean, and put n=500 in that formula, then n=50.000.
https://www.mathsisfun.com/data/confidence-interval-calculator.html

You could weight the individual means by their respective sample sizes, and calculate the total mean with that.

With kind regards

Karabiner
I have been trying to come up with a way to adjust a list of ratings, as in the example below, based on the number of reviews, but have not been able to come up anything. Can you help me with a formula for Column E that makes some sort of adjustment to the Amazon ratings in C based on the number of reviews in D?

As is often the case, the higher ratings have fewer reviews. Part of this is no doubt because some of them are phony reviews submitted by friends or even people paid by the seller. I would think a product with a 4.7 rating based on 13,749 reviews would be worth more than one with a 5.0 rating on just 23 reviews.

I tried to follow your suggestion of using the confidence interval, but that requires that I know the standard deviation, which I do not have.

1666071380278.png
 

Karabiner

TS Contributor
#9
I have been trying to come up with a way to adjust a list of ratings, as in the example below, based on the number of reviews, but have not been able to come up anything. Can you help me with a formula for Column E that makes some sort of adjustment to the Amazon ratings in C based on the number of reviews in D?
I just thought of a simple calculation of the overall mean with weighting by sample sizes,
i.e. if you have two means m1 and m2, and two sample sizes n1 and n2, then
mean(total) = (mean1*n1 + mean2*n2)/(n1+n2).
This means, of course, that one source with 10,000 ratings is worth tremendously more
than 10 sources with 100 ratings each. If one wants to achieve a larger influence of the
small databases, then perhaps there may exist other weighing schemes.

I tried to follow your suggestion of using the confidence interval, but that requires that I know the standard deviation, which I do not have.
You have the percentages for each rating, and the total sample size.
So in the first step, you compute the total sum of squares, then in the
next step you divide this by the total sample size to get the variance.

Say, you have 10 one-star ratings, and 20 two-star ratings, then the
mean is (10*1 + 20*2)/30 = 1.67
The contribution of the one-star ratings to the total sum of squares
is 10*(1.66-1)², the contribution of the two-star ratngs to the total
sum of squares is 20*(1.66-2)². Sum up these contributions, divide
by totatl n, and you have the variance. Take the square root of the
variance, and you have SD.

HTH

Karabiner
 
#10
OK, let's see if I understand anything.

Here's that same data with, I think, the calculations you suggested. I did it step by step so I could see the how it works.

Columns F & G are the first 2 steps. The totals are in Row 25. Column H is confidence interval. Column I is the adjusted rating after subtracting the confidence interval. Column J shows how much that changed the original rating. Column K calculates the ranks of the adjusted ratings. To my surprise, none of then changed at all as shown in Column L. I had to set the Alpha to .01 before any ratings changed rank. And even with it set to 0.0001, there were no changes in the top 8 or bottom 4 ratings.

The data on the right in N6:p13 display the calculations in the table columns. The data in N15:p19 show how I calculated the values you suggested.

The data in Column P shows the actual formulas. This is an Excel table, so the syntax is a little complicated. The text inside the brackets (eg [@Rtg]) refers to table column headers. If it includes the @ sign, it refers to the cell on that row. Without the @ sign, it refers to the entire column.

1666144694892.png

What do you think? Is this valid data? Are there any tweaks you would suggest?
 

Karabiner

TS Contributor
#11
Column I is the adjusted rating after subtracting the confidence interval.
Why do you subtract the confidence interval? I've never seen this, and I cannot imagine
what that is useful for. A question here had been "how reliable?", and the 95% CIs would
be a means to gain am impression of possible chance fluctuations. If you want to use
it as a weighting factor, then some multiplication, not subtraction would be the operation.

With kind regards

Karabiner
 
#12
Why do you subtract the confidence interval? I've never seen this, and I cannot imagine
what that is useful for. A question here had been "how reliable?", and the 95% CIs would
be a means to gain am impression of possible chance fluctuations. If you want to use
it as a weighting factor, then some multiplication, not subtraction would be the operation.

With kind regards

Karabiner
I thought that a 95% (0.05) confidence interval was a +/- range about the mean within which we have 95% confidence. So if the mean is 50 and the 95% confidence interval is 5 (+/-5), then we have 95% confidence that the true value is somewhere between 45 and 55. I did the subtraction to get the lowest end of the confidence interval.

Is that not accurate?

All I want to do is convert the ratings + counts to a single value that I can sort on.