# Normalized Score vs. Z Score

#### Jennifer Murphy

##### Member
I would like to come up with an algorithm that will allow me to compare different products based on ratings from different sources using different scales. This Excel table compares 5 products (A-E) with 3 ratings (Rtg1-Rtg3). The actual ratings are in columns H, K, & N. The maximum and minimum values for each rating are above the table in rows 4 & 5. For example, Product A got a 4.8 rating on a 1-5 scale, a 7 rating on a 0-10 scale, and a 79 rating on a 0-100 scale. The mean and std dev for each set of ratings are below the table in rows 14 & 15.

The values in columns I, L, & O are the ratings normalized to a common 0-1 scale. The equation is (Rating-Min)/(Max-Min). These are then summed in column D.

The values in columns J, M, & P are the Z Scores. The formula is (Rating-Mean)/Std Dev. These are then summed in column F.
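In Python terms, the two transformations look like this (a rough sketch with made-up ratings; note that Excel's STDEV is the sample standard deviation, while `pstdev` below is the population version, so pick whichever matches your sheet):

```python
from statistics import mean, pstdev

ratings = [4.8, 4.1, 3.9, 4.5, 3.2]   # one source's ratings (invented), 1-5 scale
lo, hi = 1.0, 5.0                     # scale minimum and maximum

# Min-max normalization to [0, 1]: (Rating - Min) / (Max - Min)
normalized = [(r - lo) / (hi - lo) for r in ratings]

# Z-score: (Rating - Mean) / Std Dev of the actual ratings
m, s = mean(ratings), pstdev(ratings)
z_scores = [(r - m) / s for r in ratings]
```

The composite for each product is then the sum (or average) of its transformed ratings across sources.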

Does anyone have any comments on this approach?

Are the Z Scores a better measure than the normalized values?

Is there something even better?

Thanks

#### katxt

##### Well-Known Member
The two methods you have chosen seem to give wildly different ranks. This may simply be because you have made up random data to illustrate the problem and so the rankings are in effect just random too.
What makes one method "better" than another is that it reflects the true ranking of the products more closely. If you don't already know the true ranking, you can't determine which of these methods is better. Mathematically they are roughly equivalent.

#### Jennifer Murphy

##### Member
The two methods you have chosen seem to give wildly different ranks. This may simply be because you have made up random data to illustrate the problem and so the rankings are in effect just random too.
What makes one method "better" than another is that it reflects the true ranking of the products more closely. If you don't already know the true ranking, you can't determine which of these methods is better. Mathematically they are roughly equivalent.
I don't think there is a "true" ranking or rating. It depends on who you ask. In this example, I asked three sources. The data is artificial, but not totally random. I have looked at a lot of ratings from different sources. It is very common to have a product rated very highly on one source and much lower on another. My goal is to be able to reliably identify the "consensus" best rated product by combining the ratings from multiple sources.

My question is this: Of these two methods, (1) normalizing to a [0,1] scale and (2) calculating the Z Score, which will do a better job of calculating that composite rating? Or is there something else that is better than either of these?

I was hoping that someone here might have some experience with doing just this sort of thing and could offer some insight.

Thanks...

#### katxt

##### Well-Known Member
There are certainly methods of determining how reliable raters are. Try looking up inter-rater reliability.

#### Jennifer Murphy

##### Member
There are certainly methods of determining how reliable raters are. Try looking up inter-rater reliability.
I'll do that, but I am not trying to determine the reliability of the raters. I am trying to determine an "average" or "composite" rating of the product.

#### katxt

##### Well-Known Member
OK. How about average rank as another approach? Then you don't need any scaling.
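For example (a rough Python sketch with invented numbers; ties are not handled here):

```python
def ranks(values, higher_is_better=True):
    # Rank positions: 1 = best. Does not handle tied values.
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

source1 = [4.8, 4.1, 3.9]   # three products rated by source 1 (1-5 scale)
source2 = [7.0, 9.0, 6.0]   # same products, source 2 (0-10 scale)

# Average rank across sources; no rescaling needed since ranks are unitless
avg_rank = [(a + b) / 2 for a, b in zip(ranks(source1), ranks(source2))]
```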

#### Jennifer Murphy

##### Member
OK. How about average rank as another approach? Then you don't need any scaling.
That's an interesting idea. I put it to work alongside my two methods in this sheet. Columns D-F have the 3 sets of ratings. Notice that the second set is in reverse order. Column G has the sum of the Z Scores in columns L-N. Column H has those ranks. Column I has the sum of the scaled ratings in columns O-Q. Column J has those rankings. Column K has the method you suggested, the average of the rating rankings in columns R-T. Column L (which is missing its label) has those rankings.

I think the Z Scores are the most instructive. Item A is the clear winner. It gets top ratings from all 3 sources. This is most clearly shown in column G, where A gets a score of +4.22, while all the rest are negative or just barely positive. The Normalized ratings are pretty good as are the average rankings, but neither shows how much better A is than all the rest.

Here's the same table with different data. There is no clear winner here. All 3 methods have the same top 3, but in different orders.

In this example, I introduce a weighting factor for each rating. Suppose I have twice as much confidence in the ratings in column F. The result is a clearer winner.
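The weighting idea amounts to something like this sketch (in Python rather than Excel; the products, z-scores, and weights below are invented for illustration):

```python
weights = [1.0, 2.0, 1.0]          # e.g. twice the confidence in source 2
z = {                              # z-scores per product, one per source (invented)
    "A": [1.2, 0.8, 1.5],
    "B": [-0.3, 1.1, -0.6],
}

# Weighted composite: multiply each source's z-score by its weight, then sum
composite = {p: sum(w * zi for w, zi in zip(weights, zs))
             for p, zs in z.items()}
```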

I also found several papers by Evan Miller. They are over my head, so I will need to study them a lot more.

How Not To Sort By Average Rating – Evan Miller
Bayesian Average Ratings – Evan Miller
Ranking Items With Star Ratings: An Approximate Bayesian Approach – Evan Miller

Thanks

#### katxt

##### Well-Known Member
I don't know how much difference it makes, but in your original table the 0-1 sum doesn't equal the sum of the three 0-1 ratings.

#### Karabiner

##### TS Contributor
If you normalize, then the different ratings have different dispersions (although there is a maximum).
If you transform into z-scores, then all standard deviations are equal to 1.
When summing up normalized scores, then those with larger dispersion will have a larger impact on the final ranking.
This may or may not be of interest for you.

In addition, it may or may not be of interest for you to weight by sample size and/or some measure of reliability of the
ratings from different sources.

With kind regards

Karabiner


#### Jennifer Murphy

##### Member
I don't know how much difference it makes, but in your original table the 0-1 sum doesn't equal the sum of the three 0-1 ratings.
Yes, I discovered an error or two in the formulas. I think the most recent tables are correct.

#### Jennifer Murphy

##### Member
If you normalize, then the different ratings have different dispersions (although there is a maximum).
If you transform into z-scores, then all standard deviations are = 1.
When summing up normalized scores, then those with larger dispersion will have a larger impact on the final ranking.
This may or may not be of interest for you.
This is an excellent point. Thank you. Another important vote for Z Scores.

In addition, it may or may not be of interest for you to weight by sample size and/or some measure of reliability of the
ratings from different sources.
In my last example, I did add a weighting factor. My intent was to give me a way to assign weights to the ratings. My goal in this project is to include product properties in addition to ratings. For example, I was recently looking at portable treadmills for home use. I had ratings from Amazon and a couple of other sources. But I also wanted to include product properties such as price, weight, width, length, min speed, max speed, warranty, motor hp, etc.

Your comment about sample size is interesting. Do you know a way to take that into account?

Here is a set of products on Amazon with a range of ratings from 4.3 to 5.0 and a range of number of reviews from 1 to almost 100,000.

Do you have a formula for adjusting the ratings (col D) based on the number of reviews (col E) and/or the percentage of ratings at each level (5-1)?

#### Karabiner

##### TS Contributor
This is an excellent point. Thank you. Another important vote for Z Scores.
Interesting. I would have seen this as an argument in favor of normalizing.
But of course, it depends on the situation and one's goals.
Your comment about sample size is interesting. Do you know a way to take that into account?
The simplest approach would be to sum all the sample sizes to get an overall sample size, calculate
each sample's relative proportion of that overall sample size, and then weight by this
before calculating the average score.
If sample sizes differ very much, then the large ones will dominate the result, of course. One could
consider a transformation in such cases (take the logarithm of the sample sizes, for example).
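A rough sketch of this scheme (Python, with invented numbers; the log transform is the damping option mentioned above):

```python
import math

scores = [4.5, 4.1, 4.8]        # mean rating from each source (invented)
n = [25000, 300, 12]            # number of reviews behind each rating (invented)

# Log-transform first so very large samples do not dominate,
# then weight each source by its share of the total
logn = [math.log(k) for k in n]
weights = [l / sum(logn) for l in logn]     # relative proportions, sum to 1

weighted_avg = sum(w * s for w, s in zip(weights, scores))
```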

With kind regards

Karabiner

#### katxt

##### Well-Known Member
Here is another transformation you might consider: (Rating given - Min rating actually given)/(Max rating actually given - Min rating actually given), which puts the worst at 0% and the best at 100%.
There still seems no way of effectively comparing methods unless you have some idea what the answer should be.

#### Dason

I think they mentioned that they did that particular transformation in their first post.

#### katxt

##### Well-Known Member
I think they mentioned that they did that particular transformation in their first post
I think the one in the first post was (Rating given - Min rating possible)/(Max rating possible - Min rating possible). That gives transformed ratings over a narrower range. But how much difference it makes, I don't know.
There still seems no way of effectively comparing methods unless you have some idea what the answer should be.
Perhaps you could set up a Monte Carlo test. Give true values of say 95, 90, 85, 80, and 75. Add random error of say SD 5.
Try each method and see how close the calculated ranking is to the true ranking (correlation, perhaps?). Choose the method which gives the highest average correlation if there is a significant difference.
Just a thought.

#### Jennifer Murphy

##### Member
Here is another transformation you might consider (Rating given - Min rating actually given)/(Max rating actually given - Min rating actually given) which puts the worst at 0% and the best at 100%.
The normalized comparison is similar. As you say, this one transforms the ratings to the narrower range of the actual ratings, rather than the maximum possible range. This is also what the Z Score does, but the Z Score also centers them about the actual mean, so that I see positive and negative ratings, and I think it does a better job of showing the relative distance between ratings.

There still seems no way of effectively comparing methods unless you have some idea what the answer should be.
Well, yes and no. My goal in comparing methods is to get a sense of which one produces a result that seems "fair" to me. My true objective is to compare products. I am looking for a method that will give me an aggregate rating and a good sense of which product is most likely to be a good fit for me. This is why I have included a weighting factor. It allows me to give more weight to ratings that I feel are more valid or important, and less weight to those I feel are less so.

Here's an example of some ratings that with equal weights make C the winner with B second.

But suppose I have the most confidence in rating 1, next most in rating 2, and the least in rating 3. If I change the weights, I get this. Now A is the clear winner.

And the table will not be limited to ratings. I plan to add product properties with my own ratings of each of them. I'll post an example later.

#### katxt

##### Well-Known Member
OK. It sounds as if you have answered your question yourself, which is often the best way. Mathematically, the correlation between true values and rated values will be the same after any linear transformation of the rated values, so do what you think looks and feels best.

#### Jennifer Murphy

##### Member
Now that I have the z-scores working, I need to generalize to other kinds of ratings.

I would appreciate suggestions for methods for generating rating scales that I can merge with the z-scores to come up with an overall composite score. Here are the challenges:

• Price. I'd like to set a maximum price. Lower prices would get a slightly higher rating. Higher prices would get a significantly lower rating. Maybe I set an "ideal" price and an "absolute maximum" price.
• Amazon Ratings. Here I show the rating and the number of reviews. I can also include the percentage of the ratings that are 5, 4, 3, 2, and 1 stars. Is there a way to factor all that in? I'd like a way to include the number of reviews in the score.
• Fakespot. This is a company that rates ratings. I could assign values, like A=1, B=0.8, C=0.5, D=0.25, F=0.0.
• Width & Length. I'd like to set a minimum and maximum. Anything outside that gets a zero.
• Min Speed & Max Speed. I'd like to set minimums. Anything below gets a zero. Anything above gets a "1".
• HP & Max Load are like Min Speed & Max Speed.
• App & Remote. Yes = 1, No = 0.
• Warranty is like HP.
I thought about adding code to disqualify a product that fails any test.
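Here is a rough sketch of how those rules might be coded (Python; the function names and exact shapes, such as the linear price penalty, are just assumptions at this point):

```python
def price_score(price, ideal, absolute_max):
    """Below the ideal price: small bonus above 1. Between ideal and the
    absolute maximum: falls linearly to 0. At or above the max: 0."""
    if price <= ideal:
        return 1.0 + 0.1 * (ideal - price) / ideal
    if price >= absolute_max:
        return 0.0
    return (absolute_max - price) / (absolute_max - ideal)

def range_score(value, lo, hi):
    """Width & Length: 1 inside [lo, hi], 0 outside (disqualifying)."""
    return 1.0 if lo <= value <= hi else 0.0

def threshold_score(value, minimum):
    """Min/Max Speed, HP, Max Load, Warranty: 0 below the minimum, 1 at or above."""
    return 1.0 if value >= minimum else 0.0

# Fakespot letter grades and yes/no features as simple lookup tables
fakespot = {"A": 1.0, "B": 0.8, "C": 0.5, "D": 0.25, "F": 0.0}
yes_no = {"Yes": 1, "No": 0}
```

Disqualification could then be a check that no property score is 0 before computing the weighted composite.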