Coding Data

I am examining the relationship between employees' level of education and the severity of their disciplinary offenses.

I coded education categorically as (1 = High School, 2 = Associate's, 3 = Bachelor's, etc.)

Next, I coded disciplinary offenses using the agency's already existing disciplinary scale (1 = counseling, 2 = oral reprimand, 3 = written reprimand, through 10 = termination).

My question is how to handle/code employees with multiple disciplinary offenses, since many employees have more than one on record. If I intend to conduct a correlational study of the above, should I:

1. just use the highest recorded disciplinary offense for each employee
2. use the sum of each employee's disciplinary offenses as an "offense score"
3. sum up the offenses and divide by the number of offenses committed for an "offense score"
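As a toy illustration (all employee IDs and numbers below are invented), the three candidate codings could be computed from the same offense records like this:

```python
# Hypothetical offense records: employee id -> list of severities on the 1-10 scale.
offenses = {
    "emp_a": [1, 1, 3],   # three minor offenses
    "emp_b": [10],        # one severe offense (termination)
    "emp_c": [2],
}

max_score = {e: max(v) for e, v in offenses.items()}            # option 1: highest offense
sum_score = {e: sum(v) for e, v in offenses.items()}            # option 2: summed "offense score"
mean_score = {e: sum(v) / len(v) for e, v in offenses.items()}  # option 3: average severity

print(max_score)   # {'emp_a': 3, 'emp_b': 10, 'emp_c': 2}
print(sum_score)   # {'emp_a': 5, 'emp_b': 10, 'emp_c': 2}
print(mean_score)  # {'emp_a': 1.67, 'emp_b': 10.0, 'emp_c': 2.0} (approx.)
```

Note that the three schemes can order the same employees quite differently (emp_a scores 3, 5, or about 1.67 depending on the option), which is exactly why the choice matters.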




TS Contributor
Interesting study. A couple of questions:
- When employees earn disciplinary action for the first time, they receive counseling (level 1), but is it always the case that the second action results in #2 (oral reprimand)? I'm asking whether disciplinary action is cumulative or not. If so, the maximum will be the best measure (option #1).
- Rather than thinking about this as a scale of 1-10, you could look at this as simple count data: an employee with three offenses counts as a three. I'm guessing the data follow a Poisson distribution, with many employees at zero or one offense. It might be good to view the frequency histogram to see if there's zero-inflation (and likely overdispersion) in your data.
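The frequency-histogram check is quick to do. A minimal sketch (the per-employee counts below are made up) that tabulates counts and compares the observed share of zeros against what a Poisson with the same mean would predict:

```python
import math
from collections import Counter

# Hypothetical per-employee offense counts (0 = no disciplinary record).
counts = [0, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 0, 3, 1, 0, 0, 1, 0, 0, 5]

freq = Counter(counts)
for k in sorted(freq):                       # crude text histogram
    print(f"{k:>2} offenses: {'#' * freq[k]}")

# Zero-inflation check: a Poisson with this mean predicts P(0) = exp(-mean).
m = sum(counts) / len(counts)
expected_zero = math.exp(-m)
observed_zero = freq[0] / len(counts)
print(f"observed P(0)={observed_zero:.2f}, Poisson-expected P(0)={expected_zero:.2f}")
```

If the observed share of zeros is well above the Poisson-expected share (as in this toy data, 0.65 vs. about 0.50), that points toward a zero-inflated model.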

To summarize my thoughts and more directly answer your question, I would probably avoid the last two options. The disciplinary action scale, I'm guessing, is cumulative and thus not independent. I would worry about creating an index score.
Thank you for the quick response and your thoughts. The disciplinary cases/investigations are separate events and are independent of each other. Hence, depending on the offense(s) committed, one employee might accrue 5 separate level-one offenses while another may only commit one level-10 offense and be terminated. I created the 1-10 scale based on the employees' sustained outcomes (1 = counseling, 2 = written, 3 = 8-hr suspension, 4 = 16-hr suspension, ..., 10 = termination). Your thoughts/advice?

Thank you again,

How you scale your data depends on your research question. If you are interested in what types of offenses people commit, just using the most significant probably makes sense. If you are interested in how many offenses, then you would count them. There is no correct or incorrect coding scheme; it depends on your purpose.

I may have missed this, but what exactly are you trying to test with your data?


TS Contributor
OK, noetsi's question yielded a good answer. Good.

So, considering that each employee can receive more than one reprimand for seemingly independent offenses (i.e., can receive more than one Grade-1 disciplinary action), I would take a mixed-effects approach. Specify individual as a random effect. Still, I would keep the response as count data (1-10), since the rank scale has a baseline starting point of 1. I might shift the scale down to zero (by subtracting 1) so as to avoid dealing with a zero-truncated distribution. So, all your data are now 0-9 and presumably follow a Poisson distribution. As I see it, you can run the raw data (not the maximum) just to determine whether there is an effect of educational level on offense score. After your parameter estimation, simply add 1 again. This would take care of your severity metric. Follow the same approach for the number of disciplinary offenses.

This would be the approach I would take. Others might have opinions about taking this approach with data on categorical scales. Still, the above approach might be the simplest and might be a more than adequate solution.
I think about my audience a lot when I run data, particularly how much they know about statistics (and whether they are interested). If this is going to managers, they will probably want to keep it simple.

A really simple way to do this is just to calculate a count of offenses and compare that to levels of education. You can use chi-square, Cramér's V, etc. That will tell you if there is a meaningful relationship. Then run Spearman's rho, which will give you a correlation (it assumes that both variables are ordinal, but that seems realistic to me if you code education from the lowest to the highest level).
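For what it's worth, Spearman's rho is just Pearson's r computed on ranks (with ties averaged), so the simple route can be sketched in a few lines (the education and offense-count values below are made up):

```python
from statistics import mean

def ranks(xs):
    """Return 1-based ranks, with tied values getting the average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over the tie group
        avg = (i + j) / 2 + 1           # average 1-based rank of the group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

education = [1, 1, 2, 2, 3, 3, 4, 4]     # hypothetical ordinal education codes
n_offenses = [4, 3, 3, 2, 2, 1, 1, 0]    # hypothetical offense counts
print(spearman(education, n_offenses))   # strongly negative in this toy data
```

In practice you would use scipy.stats.spearmanr, which also returns a p-value; the point here is just that the method is simple enough to write by hand.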

You can calculate an average severity and do the same thing for severity.

Of course that is pretty simple; I am still admiring the Poisson analysis jpkelley noted. I have a lot to learn.....
I think you need to add a field that addresses offense number. As you correctly point out, you could have multiple offenses receiving multiple levels of discipline. You need to be able to distinguish between a first offense that receives counseling and a third offense that receives counseling (which is possible). You could also have a single offense that was so egregious that it warranted termination without prior discipline (even though that may be a rare occurrence that doesn't actually show up in your sample).

Perhaps the field in question is "Number of prior disciplinary actions."


Can't make spagetti
You can calculate an average severity and do the same thing for severity.
im having an issue with this part. calculating correlations between averaged data restricts the variance, which also restricts the range of the correlation. the variability of means is less than the variability of raw data points, which is kind of the issue that gave birth to hierarchical linear modeling (as a specific instance of random-coefficient regression...)
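This shrinkage is easy to see numerically. A quick demonstration with made-up clustered data (think: offenses grouped by employee), comparing the population variance of the raw points against the variance of the cluster means:

```python
from statistics import mean, pvariance

# Hypothetical clusters of raw scores, e.g. offense severities per employee.
clusters = [[1, 5], [2, 6], [0, 4], [3, 7]]
raw = [x for c in clusters for x in c]
cluster_means = [mean(c) for c in clusters]

print(pvariance(raw))            # variance of the raw data points
print(pvariance(cluster_means))  # variance of the cluster means (smaller)
```

Here the raw variance is 5.25 but the variance of the means is only 1.25: averaging threw away all the within-cluster spread, which is the information a hierarchical model would keep.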

jpkelley's approach might seem more complicated but it does take into account the clustering found in the design..
It is certainly true it will be less efficient than the method using Poisson. My point was that, if your audience is mostly managers, the simpler method will be far more preferred. I don't think most managers are going to understand what a Poisson distribution is. And the ones I have worked for preferred something easy to understand among non-statisticians, even if it was less accurate than more sophisticated methods.


TS Contributor
You can calculate an average severity and do the same thing for severity.....Of course that is pretty simple, I am still admiring the Poisson analysis jpkelley noted.
spunky already addressed the issue with taking the mean, but I have to second it (though I'm not thinking in the precise terms that spunky is). The consequence of what spunky mentioned is that your parameter estimate (i.e., the effect of education) is going to be inaccurate, since the data aren't from a normal distribution (nor could they be transformed to one).

And don't admire the Poisson approach too much. Once you start admiring the Poisson, the whole world starts looking like count data.


TS Contributor
I must say that it's troubling to think that managers would not want better practices in their organization. Troubling, I say!

Regardless of whether you use Poisson or not, I don't think you can avoid the fact that you have multiple offenses per individual. If you take the median of all offenses per individual, you ignore the total number of incidents. If you take the maximum, you ignore the fact that one individual might have 1000 Grade-1 offenses. If you take the mean, you end up with potentially incorrect parameter estimates. Use the variation between and within individuals to your advantage. And you never have to mention the Poisson distribution. Just say that you conducted a test that took into account the fact that there were some individuals with many offenses and that the data were skewed. Or just go all out and baffle them with stats!


Can't make spagetti
And the ones I have worked for preferred something easy to understand among non-statisticians even if it was less accurate than more sophisticated methods.
uhmm... so... if i follow your argument correctly, i should choose a wrong solution rather than the correct one just because it's simpler? how "less accurate" can i go before my solution is wrong? because if we're talking about simple models, the mean is the simplest (linear) model that exists so one should only show managers means and variances?

whether your audience understands you or not does not depend on the method of choice but on your ability to communicate it. my husband is not a statistician (he didnt even finish high school) but he understands some of the subtleties of maximum-likelihood estimations because i've made sure to bring it down to a level he can understand. i cant see why any manager wouldnt be able to understand a poisson process if you take enough care to explain it properly and present enough examples.

denny borsboom in his annual address to the int'l psychometric society last year was very clear that the new mission of quantitative methodologists in the social sciences was to fight this idea that simple models can address complex questions... the most *appropriate* models should be used to address the most appropriate questions and it is our job to make sure other people understand and use these *appropriate* models... sadly, history is plagued with events where inappropriate statistical analyses (which tended to be simpler) ended up hurting people (like in that horrendous book The Bell Curve) because people (either intentionally or unintentionally) decided to ignore the subtleties involved in analysing data and formulating a correct research design.
uhmm... so... if i follow your argument correctly, i should choose a wrong solution rather than the correct one just because it's simpler?
What is wrong in statistics is not all that clear to me. It's not uncommon to use less sophisticated approaches (which may well be less accurate but still accurate enough for your purpose) even in academic journals. My regression professor was told by a journal to make interval analysis categorical in nature (change the form of regression used) because that is the way they did things.

whether your audience understands you or not does not depend on the method of choice but on your ability to communicate it.
Respectfully, I disagree. I have had the glorious experience of explaining odds ratios to those with little to no statistical background and being told to make the analysis simpler (by a very bright doctorate in economics) because senior managers could not understand the (more accurate) measure I suggested. Most managers have limited interest in statistics; if you bring up something like a Poisson distribution in the discussion, it's essentially over regardless of how well you explain it.

I suspect denny borsboom has not worked a lot in corporate america (or government outside academics). I think if you were to survey those who present data to such real world audiences, what I said here would get a D'OH response (that is its so obvious to them that it's taken for granted).

There is a reason businesses and government ignore academics. Overly complex methods to make acceptable data better is the primary one.

Sorry if this is off topic. :) It is a sore point with me... I was required to make a report significantly simpler today, and all it involved was ANOVA and the like.


Cookie Scientist
I see your general point, but I don't see what it really has to do with this situation. It's certainly the case that a managerial crowd is not going to understand the details of a mixed-effects Poisson regression. But do you really think that they are going to understand the details of Spearman's rank correlation, the solution you suggested instead? :rolleyes: Perhaps more to the point, why is it necessary that they understand the details of the statistical analysis in the first place? It seems to me that the goal is simply to make them understand the conclusions derived from the analysis, a goal which should be pretty much indifferent to what type of procedure you happened to use.


TS Contributor
Just to bring the conversation around to the OP's question...any criticism or questions about the statistical solution proposed in this thread?


Can't make spagetti
oh no, and i mean i totally welcome the discussion because i think it's a very relevant one for any of us who work as "knowledge translators" between statisticians/quantitative methodologists and everyone else... besides, we have a history here of hijacking other people's posts and taking them off on weird tangents.

you touch on a very important point there when you mention it's possible to use less sophisticated approaches (which is always desirable) with the huge caveat that they have to be "accurate enough for your purposes". the point that jpkelley and i are trying to make is that statistical estimates from Berley's data derived through traditional correlational methods (regular OLS regression, pearson's correlation, etc.) could end up being so biased that analyzing them through simple approaches would end up doing more harm than good. i'm not sure what its name is in the ecological sciences (where jpkelley is our local expert) but here in the province of social sciences/educational measurement/psychometrics it's called the "unit of analysis" problem, which is perfectly exemplified by the school-setting paradigm: should analysis be done at the student level? classroom level? school level? district level? performing analysis while ignoring this clustering of the data (which can arise naturally as in the school example or by design as in Berely's case) produces such bad estimates that a whole new area of statistics called hierarchical linear models/multilevel models was created just to tackle this problem. so for starters it is known the estimates derived from averaged data will not be accurate enough, because there's about 20 years' worth of analytical, simulation and real-data studies in the academic literature backing that up.

which takes us to the second point. i dont think denny borsboom has ever dealt with corporate america (my assumption only. i have never asked him for his CV. he is a university professor in amsterdam) but he is well acknowledged as one of the most important living figures in the area of quantitative analysis for the social sciences and *the* most brilliant psychometrician of the post-IRT generation. the point that you make is very good, but i think that's true of any analysis. gov't agencies or the private industry usually care about results, and as someone who's done internships at ETS (developers of the SATs, GREs and pretty much all the major standardised tests used in america and the world today) i understand these people end up wanting the "what" more than the "how did you get that". just as jpkelley said... why would you even mention in the first place a poisson distribution? or regression? or even the variance? you're not talking to experts here; what's relevant to them are the results of the analysis, not how you got there... because how you got there requires a certain degree of technical knowledge most people are not interested in acquiring.

so i ask you... let's assume you're my boss and i'm your number cruncher. what if i asked you: "i can analyze this in a very simple way. it will be wrong and mostly useless, but you'll be able to follow the logic of what i did perfectly. or i can do a super-convoluted analysis that will get you excellent estimates but you wont understand *bleep* of what i did. which one do you prefer?" and if we encourage people to do the wrong thing just because it's easy, we're not gonna get very far, are we?

albert einstein once said "make things as simple as possible... but not simpler". i mean, i could also try and fit some incredibly bizarre likelihood equation with strange discontinuities to Berely's data and probably get estimates just a tiiiiny bit better than would come out of a mixed-effects regression. but the improvement from a regular OLS regression/correlation with averaged data to a mixed-effects regression is so substantial that it is called for, even if it's more complicated to implement and/or understand.


Can't make spagetti
Just to bring the conversation around to the OP's question...any criticism or questions about the statistical solution proposed in this thread?
it's kind of what i would do, but i'd need to have a look at the data to see whether poisson seems like a suitable solution or not.. :D


Can't make spagetti
I agree. I'd like to have a look at the data as well. I wonder if the original poster might provide the forum with a fake data set?
or even better... give us the parameters and the data format and we'll simulate it... even simulated data is better than no data at all... :p
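In that spirit, here is one way such a fake data set could be simulated with only the standard library (every parameter value below is an invented placeholder; the OP would substitute realistic ones):

```python
import math
import random

random.seed(42)

def poisson(lam):
    """Sample one Poisson(lam) variate via Knuth's algorithm (stdlib only)."""
    l, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= l:
            return k
        k += 1

employees = []
for i in range(500):
    edu = random.choice([1, 2, 3, 4])             # education level, 1=HS .. 4=Bachelor's
    n_off = poisson(1.2 - 0.2 * edu)              # assumed: offense count falls with education
    # Each offense gets a severity on the 1-10 scale (shifted Poisson, capped at 10).
    severities = [min(10, 1 + poisson(0.8)) for _ in range(n_off)]
    employees.append({"id": i, "education": edu, "offenses": severities})

print(employees[0])
```

With a shared seed, everyone in the thread would be working from the same simulated data, and the competing approaches (max/sum/mean coding vs. a mixed-effects Poisson model) could be compared directly against the known generating parameters.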