Delete Excess Data Without Losing Statistical Significance?

#1
Hi. I'm working with data that may get too large.

I'm wondering if there is a way to keep the data relatively accurate and manageable by deleting "insignificant" data.

For simplicity, let's say it's an all-time top 100 list (counting from when the list was started).
Users enter their favorite song.
The vote is ongoing.

Saving multiple data sets (such as by year) is not an option.

My first thought was capping the total number of unique songs.
But once the cap is reached, new unique songs would never make it onto the list.

So I wondered about cutting the lowest-ranked songs to make room whenever the cap is reached.
For example, when unique songs reach 1000 (the cap), the lowest 500 (the cut) would be removed.
This cycle would repeat every time the cap is reached.
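Roughly what I have in mind, as a sketch (1000 and 500 are just placeholder numbers):

```python
# Sketch of the cap-and-cut idea; CAP and KEEP are placeholders.
votes = {}   # song name -> vote count
CAP = 1000   # maximum number of unique songs kept
KEEP = 500   # songs that survive a cut

def add_vote(song):
    votes[song] = votes.get(song, 0) + 1
    if len(votes) >= CAP:
        # Keep only the KEEP highest-voted songs; drop the rest.
        survivors = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)[:KEEP]
        votes.clear()
        votes.update(survivors)
```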

What I don't know is whether this gives a nearly insurmountable advantage to the surviving songs,
making the list statistically worthless.

Is there a formula to know what sort of "cap" and "cut" is needed
(to give new entries an equal chance to climb their way to the top)?

Thanks for any insights on how to tackle this.
 

Buckeye

Active Member
#2
What kind of analysis do you plan for the data? Do you have a research question? There is no rule of thumb that makes some data useful and other data useless.
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
Perhaps you need to create a database architecture with different parts of the data stored in different locations, linked by primary keys.

Deleting data seems like a bad idea for sure; you would definitely have selection bias. You need to at least store metadata. Do you generate revenue from this that could be fed back into the system to expand resources or personnel?
 
#4
Hi. I'm working with data that may get too large.
I'm wondering if there is a way to keep the data relatively accurate and manageable by deleting "insignificant" data.
If you have only one input, such as the favorite-song example, then the question appears to be a decision about what proportion of the votes constitutes "enough", so that the remainder is an insignificant proportion. Sort the list by most votes and crop it at any desired level of significance, such as the top 99% of votes.
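A minimal sketch of that sort-and-crop step, assuming the votes live in a simple song-to-count mapping:

```python
# Sketch: keep only the most-voted songs that together account for 99% of all votes.
def crop_by_vote_share(votes, share=0.99):
    """votes: dict mapping song -> vote count; returns the cropped dict."""
    total = sum(votes.values())
    kept, running = {}, 0
    for song, count in sorted(votes.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break  # the remaining songs are the "insignificant" proportion
        kept[song] = count
        running += count
    return kept
```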

If on the other hand you have a variety of inputs that each contribute to a single value, then the system for reducing that set of inputs to the most significant group with the least loss of detail is called Principal Component Analysis.
 

hlsmith

Less is more. Stay pure. Stay poor.
#5
Do people vote more than once, and do they give a score or just name a single best song? You could purge your old data if you only wanted the best song of the past 365 days: just recalculate and drop the old data.
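Something along these lines, as a sketch (the dated log here is just an assumed structure):

```python
# Sketch: keep a date-stamped vote log, drop anything older than 365 days, recount.
from datetime import datetime, timedelta

vote_log = []  # assumed structure: list of (timestamp, song) tuples

def top_songs_last_year(log, now=None):
    now = now or datetime.now()
    cutoff = now - timedelta(days=365)
    # Purge old data in place, then tally what remains.
    log[:] = [(ts, song) for ts, song in log if ts >= cutoff]
    counts = {}
    for _, song in log:
        counts[song] = counts.get(song, 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```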
 
#6
Thanks for all the responses.

The only analysis is having a relatively accurate top 100
(or whatever number is ultimately decided).

Selection bias is a concern.
If I understood the question, the list is to inform decisions about allocating resources.

There is only one input.
The complicating factor is the vote doesn't close.

People can "vote" more than once.
Doing it by date could be the fallback.
But the current goal is for the list to be of all-time not just the last year.

For example, if something last year had more votes than anything this year,
it shouldn't disappear just because a new year starts.

Data is often restricted based on dates.
As noted, top songs of the last year.
I'm wondering, what if the constraints are data limits rather than date limits?

Instead of directly capping total unique songs,
what about capping the total votes?
For instance, every 1000 votes, data is cut?

In the date scenario, when combining the top songs of Year A and Year B into a two-year top list, the songs that don't make the new top 100 would be cut, right?

However, how can data be cut safely?
For example, let's say a song ranks 101 in Year A.
It just missed making the chart.
In Year B, it would make the all-time top 100 if Year A's votes were added.
Ideally, the calculations would allow for this while keeping the data size in check.

In the data scenario, once the first 1000 votes are cast, the same questions arise.
How can the top data be saved without letting the stored data grow without bound?

Hope that helps better explain things.
 

Dason

Ambassador to the humans
#7
If the main concern is tracking all-time counts, would it be possible to just have two different views of the data? You certainly need an entry in your database for each individual song regardless, so why not add a column for the total all-time play count as well and increment it any time a song is played. If you're also concerned about tracking plays in the past year, you could keep a date-stamped log of songs and drop data older than your threshold, but you'd still be able to aggregate from there. If space is really a concern, you could add some logic to drop songs from this log when they don't meet a threshold for the past week/month/whatever.
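As a rough sketch (the table and column names are placeholders, not anything from your prototype):

```python
# Sketch of the two-view idea: a permanent all-time counter plus a prunable dated log.
import sqlite3

db = sqlite3.connect("songs.db")
db.execute("CREATE TABLE IF NOT EXISTS songs (name TEXT PRIMARY KEY, all_time INTEGER DEFAULT 0)")
db.execute("CREATE TABLE IF NOT EXISTS vote_log (name TEXT, voted_at TEXT)")

def record_vote(name, voted_at):
    # All-time view: one counter per song, incremented forever, never deleted.
    db.execute("INSERT OR IGNORE INTO songs (name, all_time) VALUES (?, 0)", (name,))
    db.execute("UPDATE songs SET all_time = all_time + 1 WHERE name = ?", (name,))
    # Recent view: date-stamped log that can be pruned without touching the counters.
    db.execute("INSERT INTO vote_log (name, voted_at) VALUES (?, ?)", (name, voted_at))
    db.commit()

def prune_log(cutoff_iso):
    # Drop log rows older than the threshold; the all-time counts are unaffected.
    db.execute("DELETE FROM vote_log WHERE voted_at < ?", (cutoff_iso,))
    db.commit()
```

The point being that the all-time counter never needs pruning, so deleting from the log doesn't bias the all-time list.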
 
#8
The current prototype does track unique songs and how many votes each has.
The goal, however, is to not cut data based on date.
The dilemma is how to cut data to keep within a data cap.

The key question is how to cut songs
without giving the remaining songs an unfair advantage.
Would scaling be the answer?

Every 1000 unique songs,
tally votes of top 500,
tally votes of bottom 500,
tally vote total,
scale down top 500,
cut bottom 500.

Does this make sense?
If so, would the correct calculation to scale things down for each song be...

songA votes - (songA votes x (bottom 500 vote total / top 500 vote total))

Then round the result.

The logic is that the votes being lost with the bottom 500 are deducted proportionally from the top 500.

Would this be proportionally accurate and fair (not give any song an unfair advantage or disadvantage) and hold up after multiple cycles?
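In code, the cycle I'm describing would look roughly like this (same placeholder numbers as above):

```python
# Sketch of the proposed scale-and-cut cycle; 1000/500 are placeholders.
def scale_and_cut(votes, cap=1000, keep=500):
    """votes: dict mapping song -> vote count; only changed when the cap is hit."""
    if len(votes) < cap:
        return votes
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    top, bottom = ranked[:keep], ranked[keep:]
    top_total = sum(count for _, count in top)
    bottom_total = sum(count for _, count in bottom)
    # songA votes - (songA votes x (bottom 500 total / top 500 total)), then round
    factor = 1 - bottom_total / top_total
    return {song: round(count * factor) for song, count in top}
```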
 
#9
The current prototype does track unique songs and how many votes each has.
  1. The goal, however, is to not cut data based on date.
  2. let's say a song ranks 101 in Year A. It just missed making the chart. In Year B, it would make the all-time top 100 if Year A's votes were added.
  3. Would [calculation] be proportionally accurate and fair and hold up after multiple cycles?
What's the difference between "multiple cycles" and "based on date?"
Are they not both methods of expressing the passage of time?
 
#10
Whether the cap is based on dates or on data limits, either unit could be used.
Once there's a solution, I can adapt it to the data model.

The reason capping data was emphasized
(aside from it being the preferred approach)
was that some of the date-based tips didn't account for the "all-time" question.

That's understandable coming at it from a traditional date-cap method:
traditionally, you would just keep a data set for each year.
Since the question didn't seem to be a traditional one,
I was trying to focus it better.
Sorry if that made things more confusing.

I updated the calculation in my previous post.
Thanks for any input on the proposed solution.
 
#11
The dilemma is how to cut data to keep within a data cap.
This seems like a desire to implement memory loss in a finite storage system for the sake of resource re-use.

What score system can be applied to sort competitors into winners and losers for survival when overcrowding prompts the mechanisms of starvation? Game-theory models of resource competition may have related research.
 
#12
That game theory re-framing is helpful and pulls in solid concepts.
I haven't had much luck yet finding related models.

I've been working with some simpler examples
to see what other strategies might work.

The scaling solution I proposed,
when looking at the data at any given moment,
may highlight recent trends too strongly.

The latest thought (based on the 68–95–99.7 rule of the normal distribution) was:
if only 1000 songs (song name & vote count) could be saved at a time,
would removing the bottom 2% of songs
allow enough room for new songs to climb the list?

Or would any new song
in any bottom 2% slot
be bumped before it could (equally) amass enough votes?

While the bottom 2% may constantly change,
would a popular new song eventually power through
even if it gets bumped off the list multiple times
due to the normal distribution of votes coming in?

Also, if opening 2% of the slots is enough room,
what percent of the list would be considered accurate?

Update
After multiple tests with 100K random votes
(comparing each capped list against the full vote tally),
I didn't see a satisfying model for the original question.
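The tests were along these lines, roughly (a sketch, not the exact code; the popularity distribution is made up):

```python
# Sketch: generate random votes, maintain a capped list alongside the full tally,
# then see how much of the "true" top 100 the capped list still contains.
import random

def simulate(n_votes=100_000, n_songs=5_000, cap=1000, keep=980):
    full, capped = {}, {}
    for _ in range(n_votes):
        # Made-up skewed popularity: low song ids collect most of the votes.
        song = "song%d" % (int(random.paretovariate(1.2)) % n_songs)
        full[song] = full.get(song, 0) + 1
        capped[song] = capped.get(song, 0) + 1
        if len(capped) > cap:
            # Trim the lowest-voted songs back down below the cap.
            capped = dict(sorted(capped.items(), key=lambda kv: kv[1], reverse=True)[:keep])

    def top100(tally):
        return {s for s, _ in sorted(tally.items(), key=lambda kv: kv[1], reverse=True)[:100]}

    return len(top100(full) & top100(capped))  # overlap out of 100

print(simulate())
```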

For now, when data exceeds the cap,
the oldest will be the first to go.

A second data set will save the top 100.
However, this "all-time" list differs from the original question:
the maximum score a song can reach is limited to the votes it racks up within a single data cap's worth of records.
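As a sketch, the fallback looks something like this (structures simplified; the cap is a placeholder):

```python
# Sketch of the fallback: drop the oldest votes at the cap, keep a separate top-100 set.
from collections import Counter, deque

DATA_CAP = 100_000       # placeholder cap on stored votes
recent_votes = deque()   # (timestamp, song), oldest on the left
counts = Counter()       # tallies over the stored votes only
all_time_top = {}        # song -> best count it reached while its votes were stored

def add_vote(song, timestamp):
    recent_votes.append((timestamp, song))
    counts[song] += 1
    while len(recent_votes) > DATA_CAP:
        _, oldest = recent_votes.popleft()   # the oldest data is the first to go
        counts[oldest] -= 1
    # Second data set: remember the highest count each song reached while stored.
    all_time_top[song] = max(all_time_top.get(song, 0), counts[song])

def top_100():
    return sorted(all_time_top.items(), key=lambda kv: kv[1], reverse=True)[:100]
```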

I realize this still leads to the same concerns as before.
When data is cut, there's a possibility that
(had it not been cut, in the long run)
it could have outranked others.

This issue probably cannot be completely eliminated,
but there may be ways to reduce it.

For anyone who may pursue the original question,
I wondered whether the songs that would normally not meet the criteria
might show momentum or trends that could be accounted for.

For example, after the previous wave of votes,
one song shows a slight increase
while another shows an unexpected decrease.

Anyway, thanks for all the help.
If anyone comes up with a solution,
I'd be interested in hearing about it.
 