# Probability Mass/Density Function

#### sohaib.rafique

##### New Member
Hi,

I have a 24x15500 matrix (time x number of observations of an event). Each entry is binary (0/1). I took the proportion of occurrences (the fraction of 1s) at each time, which gave the following 24x1 vector:
1
0.999935612645676
1
0.994913399008435
0.973150473247054
0.901873672010817
0.763827184340995
0.495331916811538
0.388964007468933
0.325478076105853
0.314596613225163
0.348657523662353
0.355160646449037
0.373317880368296
0.433841993432490
0.553022986285494
0.718498486897173
0.820230506728479
0.856287425149701
0.888931813791771
0.924924344858670
0.953190393406735
0.974631382396497
1

Based on the above data, I plotted the following curve. The x-axis is time (t) and the y-axis is the proportion of occurrences of the event (P).

If I find a function P(t) by curve fitting, can it be called a probability density/mass function?

Otherwise, how can I find the best fit (pdf/pmf) for my data?

#### Archidamus

##### Member
Given your data, you can make an empirical CDF (attached). Because the CDF is monotonically non-decreasing, it will be easier to fit a function to it. I fit a 3rd-order polynomial, just because I thought it would be easier to work with; it should be easy to differentiate to get a PDF. If you really want, do a 5th-order fit, and then your PDF will be 4th order. These are just suggestions for your 24 data points. I wouldn't build any models that people's lives depend on from them.
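For anyone following along, here is a rough sketch of one way to read that suggestion. It is not necessarily how the attached plot was made. Several of the 24 proportions equal 1, so they sum to well over 1 and are not a pmf over t on their own. The sketch therefore assumes they are first normalised to sum to 1 and then accumulated; the hour labels 1..24 are also an assumption.

```python
import numpy as np

# The 24 per-hour proportions from the first post.
p = np.array([
    1, 0.999935612645676, 1, 0.994913399008435, 0.973150473247054,
    0.901873672010817, 0.763827184340995, 0.495331916811538,
    0.388964007468933, 0.325478076105853, 0.314596613225163,
    0.348657523662353, 0.355160646449037, 0.373317880368296,
    0.433841993432490, 0.553022986285494, 0.718498486897173,
    0.820230506728479, 0.856287425149701, 0.888931813791771,
    0.924924344858670, 0.953190393406735, 0.974631382396497, 1,
])
t = np.arange(1, 25)  # assumed hour labels 1..24

# Assumption: normalise the proportions so they act as weights summing to 1.
w = p / p.sum()
cdf = np.cumsum(w)  # empirical CDF over the hour of day, ends at 1

# Cubic least-squares fit to the CDF, then differentiate to get a quadratic "PDF".
cdf_coef = np.polyfit(t, cdf, 3)
pdf_coef = np.polyder(cdf_coef)

cdf_fit = np.polyval(cdf_coef, t)
pdf_fit = np.polyval(pdf_coef, t)
```

Note that nothing forces the fitted polynomial to stay between 0 and 1, or its derivative to stay non-negative, so the result should be checked before being used as a density.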

#### sohaib.rafique

##### New Member
Hi Archidamus,

Thanks for your help. I understand that dealing with CDFs will be more convenient. Can you also shed some light on the technique I used to extract these 24 data points?

Note (a brief explanation of data and extraction of 24 points)

The dataset contains 24 rows (representing 'time of day') and more than 15000 columns. All entries in the dataset are binary (0|1). I calculated the proportion of ones (1s) in each row and assumed that these are the probabilities of occurrence of an event. Since each row represents a 'time of day', the calculated proportions can be read as the probabilities of occurrence of the event at the respective 'time of day'. Then I plotted the proportions against 'time of day'.
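In code, the extraction step described above might look like the following minimal sketch. The matrix `X` and the probabilities `true_p` are made-up stand-ins for the real dataset, which is not shown in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: 24 rows (hours) x 15500 columns.
true_p = np.linspace(0.2, 0.9, 24)  # made-up per-hour probabilities
X = (rng.random((24, 15500)) < true_p[:, None]).astype(int)

# Fraction of ones in each row: an estimate of P(event | hour of day).
p_hat = X.mean(axis=1)

# Approximate standard error of each proportion, sqrt(p(1-p)/n).
n = X.shape[1]
se = np.sqrt(p_hat * (1 - p_hat) / n)
```

Each entry of `p_hat` is the estimated parameter of a separate Bernoulli variable for that hour, and `se` gives a rough idea of how precise each estimate is with this many columns.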

Objective:
The objective is to develop a function/model representing the pattern/behavior of occurrences of events, and then to use that model either to randomly generate data with the same pattern or to predict the occurrence of events following the same pattern.

Is this the right approach to achieve the desired objective OR is there any violation of basic statistical practice/rule/phenomenon?

#### Archidamus

##### Member
15000 observations per hour should give you a pretty good estimate of the parameter, so there is validity in that. Since this is time series data, I would suggest trying to fit an appropriate model. Honestly, this looks like a simple sine function could fit the data. Try a nonlinear regression to estimate the parameter p, which is the probability of the event given the hour of day.

As for data generation, if you get a strong prediction model of p, then you could just run random number generators.
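A minimal sketch of that idea, under a few assumptions: the hours are labelled 0..23, the period is fixed at 24 hours, and a single sinusoid is adequate. With the period fixed, the sinusoid is linear in its coefficients, so ordinary least squares is used here in place of a general nonlinear fit. The number of simulated days (`n_days`) is arbitrary.

```python
import numpy as np

# The 24 per-hour proportions from the first post.
p = np.array([
    1, 0.999935612645676, 1, 0.994913399008435, 0.973150473247054,
    0.901873672010817, 0.763827184340995, 0.495331916811538,
    0.388964007468933, 0.325478076105853, 0.314596613225163,
    0.348657523662353, 0.355160646449037, 0.373317880368296,
    0.433841993432490, 0.553022986285494, 0.718498486897173,
    0.820230506728479, 0.856287425149701, 0.888931813791771,
    0.924924344858670, 0.953190393406735, 0.974631382396497, 1,
])
t = np.arange(24)  # assumed hour labels 0..23

# Model: p(t) ~ a + b*sin(w*t) + c*cos(w*t), with the period fixed at 24 h.
w = 2 * np.pi / 24
A = np.column_stack([np.ones(24), np.sin(w * t), np.cos(w * t)])
coef = np.linalg.lstsq(A, p, rcond=None)[0]  # least-squares (a, b, c)

# Fitted probabilities, clipped so they remain valid Bernoulli parameters.
p_fit = np.clip(A @ coef, 0.0, 1.0)

# Data generation: one Bernoulli draw per hour for each simulated day.
rng = np.random.default_rng(42)
n_days = 1000
sim = rng.random((24, n_days)) < p_fit[:, None]
sim_rate = sim.mean(axis=1)  # should track p_fit
```

Comparing `p_fit` with `p` shows how much a single sinusoid misses. If the residuals look systematic, adding higher harmonics (columns for `sin(2*w*t)` and `cos(2*w*t)`) is one possible refinement.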