How to compare simulation outputs to empirical data?


I have two questions about how to compare simulation outputs to empirical data.
I have framed them using smaller numbers; in reality there were more people, more frequent observations, and 17 activity states (although these can be grouped into 3 categories to give a less detailed perspective).

1) How can one compare the following two datasets, one empirical and the other simulated?

Simplified, both datasets contain records for 20 people.
Every 10 seconds, for 1 hour, the people were observed, and their activity state recorded.
There are 4 mutually exclusive activity states: they are just categorical – not ordered.
Hence each record in the datasets looks like this: personid, timestamp, statenumber
and there are 7200 records in each dataset, 360 for each of 20 people.

I would like to know what procedures could be used to compare the datasets in order to say whether there is a significant difference between the simulated data and empirical data, or not.

I have merged the two datasets into records of the form: personid, timestamp, empiricalstatenumber, simulatedstatenumber
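
To make the merged layout concrete, here is a minimal sketch of the join I mean, using only the Python standard library. The data are random placeholders, and the names (make_records, empirical, simulated, merged) are made up for this illustration; they are not my real data or code.

```python
import random

N_PEOPLE, N_STEPS, STEP_SECONDS, N_STATES = 20, 360, 10, 4

def make_records(seed):
    """Synthetic stand-in data: one (personid, timestamp, statenumber) tuple per observation."""
    rng = random.Random(seed)
    return [
        (person, step * STEP_SECONDS, rng.randrange(1, N_STATES + 1))
        for person in range(1, N_PEOPLE + 1)
        for step in range(N_STEPS)
    ]

empirical = make_records(seed=1)  # placeholder for the observed data
simulated = make_records(seed=2)  # placeholder for the simulation output

# Index the simulated records by (personid, timestamp), then join on that key.
sim_by_key = {(p, t): s for p, t, s in simulated}
merged = [(p, t, s_emp, sim_by_key[(p, t)]) for p, t, s_emp in empirical]

print(len(merged))  # 7200 = 360 observations x 20 people
print(merged[0])    # (personid, timestamp, empiricalstatenumber, simulatedstatenumber)
```

The join assumes that every (personid, timestamp) pair appears exactly once in both datasets, which is the case for my data.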
I thought of using a chi-squared statistic to quantify the overall difference.
But it would take a lot of effort to find out more detail using pairwise comparisons (the real table I have is 17x17).
I thought that using a ‘confusion matrix’ (used in machine learning) might be the way to go?
While this would give more detail about discrepancies, I don’t see how it gives a statistic for deciding overall match or no-match.
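
For reference, this is roughly what I have in mind, as a sketch only: a confusion matrix built from the (empirical, simulated) state pairs, the fraction of observations on the diagonal, and a Pearson chi-squared statistic on the same table. The toy pairs and the function names are invented for illustration. As far as I understand, the chi-squared statistic on this table measures association between the two labels rather than agreement, and it treats the observations as independent, which they are not here. That is part of why I am unsure it is the right tool.

```python
from collections import Counter

def confusion_matrix(pairs, n_states):
    """counts[i][j] = number of observations with empirical state i+1 and simulated state j+1."""
    tally = Counter(pairs)
    return [[tally[(i, j)] for j in range(1, n_states + 1)]
            for i in range(1, n_states + 1)]

def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                stat += (observed - expected) ** 2 / expected
    # Count only non-empty rows and columns towards the degrees of freedom.
    dof = (sum(1 for t in row_tot if t > 0) - 1) * (sum(1 for t in col_tot if t > 0) - 1)
    return stat, dof

# Toy (empirical, simulated) state pairs, made up for illustration only.
pairs = [(1, 1), (1, 1), (1, 2), (2, 2), (2, 2), (2, 1), (3, 3), (3, 4), (4, 4), (4, 4)]
table = confusion_matrix(pairs, n_states=4)
agreement = sum(table[k][k] for k in range(4)) / len(pairs)
stat, dof = pearson_chi2(table)
print(table)
print(agreement, stat, dof)
```

The off-diagonal cells show which states are confused with which, but I still do not see how to turn this into a single match or no-match decision.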
Also, this is time-series data, but the observed variable is categorical rather than numeric. How can the information contained in the sequences of states be used?

2) I also need to know how to compare individuals.

For each person there is a summary record containing the relative percentage of time in each of the 4 mutually exclusive states.
Hence in these 2 other datasets the records look like this: personid, state1%time, state2%time, state3%time, state4%time
(so each row totals 100%), and there is one record for each of the 20 people.
Each person has a record in both datasets.
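
In case the layout is unclear, here is a small sketch of how such a summary record can be derived from the per-observation records. It assumes equally spaced observations, so that the share of observations in a state equals the share of time. The function name percent_time and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def percent_time(records, n_states):
    """Map personid -> [percent of observations in state 1, ..., state n_states]."""
    by_person = defaultdict(Counter)
    for personid, _timestamp, state in records:
        by_person[personid][state] += 1
    summary = {}
    for personid, tally in by_person.items():
        total = sum(tally.values())
        summary[personid] = [100.0 * tally[s] / total for s in range(1, n_states + 1)]
    return summary

# Toy records for one person: 4 observations, 10 seconds apart.
records = [(1, 0, 1), (1, 10, 1), (1, 20, 3), (1, 30, 4)]
summary = percent_time(records, n_states=4)
print(summary)  # {1: [50.0, 0.0, 25.0, 25.0]}
```

Running this on the empirical and on the simulated records gives the two summary datasets described above.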

I would like to know what procedures could be used to state that an individual was realistically simulated, i.e. that there was no significant difference between that individual's simulated and empirical data.
I can combine these 2 datasets, but I don’t see what the point would be.
I have no idea how to combine the 4 variables (let alone the actual 17), or how to compare the 4 separately and then combine the results.

Any help will be appreciated!