Need help with a pointer in the right direction :)

#1
Hi everyone :)

I'm trying to understand what solutions there are for solving my problem.

What the dataset looks like:
1. I have a dataset in the form of a table.
2. The columns represent days and the rows represent urls.
3. The values in the table are user counts.
4. In total, the dataset represents the number of users who visited each url each day.
5. Let's say that I have 200 days of data for 100 urls = 20,000 values.

Example:
I made an example table in codepen, just to clarify what the dataset looks like: https://codepen.io/EmilWallgren/pen/abzqGzP

What I need:
I need to find a way to cluster the urls with similar users/day patterns.
I don't know if there is a statistical model for doing this. If not, what would be the next best thing to look into?
Could one way be to compute the correlations between all pairs of urls and extract the pairs with values as close to 1 as possible? (Or are there already clustering methods in statistics that do this?)

In your opinion, what would be the best model or resource for solving this problem?
I'm not asking you to write out an extensive answer (since it might take some time). But if anyone could provide insights into what kind of analysis to conduct, or the names of known models/formulas to apply, I would be extremely thankful!

Have a wonderful day :)
 

Karabiner

TS Contributor
#2
You could take a look at "cluster analysis". Roughly speaking, you first find a
method to represent the proximity or the distance between each pair of urls.
I'd guess that the Euclidean or the squared Euclidean distance could be useful here.
Correlation would be useful only if you want to model relative changes. For
example, the number of visitors on days 1 to 5 for
urlA: 9 4 5 6 7, and for
urlB: 1009 1004 1005 1006 1007
would result in a perfect correlation r=1, but obviously the levels are very much
different.
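
To make the difference concrete, here is a minimal Python sketch using only the standard library and the two example series above. The helper name `pearson` and the variable names are just illustrative assumptions, not part of any particular package.

```python
import math

# Visitors on days 1 to 5 for the two example urls.
url_a = [9, 4, 5, 6, 7]
url_b = [1009, 1004, 1005, 1006, 1007]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson(url_a, url_b)            # ignores the level, sees only the shape
distance = math.dist(url_a, url_b)   # Euclidean distance, sensitive to the level

print(f"r = {r:.3f}, Euclidean distance = {distance:.1f}")
# prints: r = 1.000, Euclidean distance = 2236.1
```

So the two urls look identical under correlation and very far apart under Euclidean distance; which one is "right" depends on whether the level of traffic should matter for the clusters.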

In the next step, after the distances (or the proximities) between urls have been
calculated, you choose an algorithm for building clusters. The most difficult
thing often is to determine how many clusters there are.
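
As a rough illustration of those two steps, here is a small agglomerative (hierarchical) clustering sketch in plain Python. It is only one possible choice of algorithm: the toy data, the names `visits`, `cluster_distance` and `agglomerate`, the use of average linkage, and the hand-picked `n_clusters=2` are all assumptions for the example. For real data, libraries such as SciPy offer ready-made hierarchical clustering routines.

```python
import math

# Hypothetical toy data: one row per url, one value per day.
visits = {
    "url_a": [9, 4, 5, 6, 7],
    "url_b": [10, 5, 5, 7, 8],
    "url_c": [1009, 1004, 1005, 1006, 1007],
    "url_d": [1010, 1003, 1006, 1005, 1008],
}

def cluster_distance(c1, c2):
    """Average linkage: mean Euclidean distance over all cross-cluster url pairs."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(math.dist(visits[a], visits[b]) for a, b in pairs) / len(pairs)

def agglomerate(names, n_clusters):
    """Start with one cluster per url and merge the two closest clusters
    repeatedly until only n_clusters remain."""
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters

clusters = agglomerate(list(visits), n_clusters=2)
print(clusters)
# prints: [['url_a', 'url_b'], ['url_c', 'url_d']]
```

Note that the number of clusters is simply fixed by hand here, which is exactly the difficult part mentioned above.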

With kind regards

Karabiner
 
#3
Huge thanks for your reply Karabiner!
I didn't even know where to start looking, but I think I've found a solution now thanks to your answer.

I have spent hours (possibly days) looking for how to go about solving this. And I'm so close now thanks to you!

Once again thanks Karabiner, you rock! :cool: