Predictor

noetsi

Fortran must die
#1
I have a predictor with 49 distinct levels. It's not truly interval (possibly interval-like), but it's certainly not categorical either. I am not sure how this affects interpretation.
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
Are they regions? If so (though we have no idea :)), could some type of covariance matrix be used?
 

noetsi

Fortran must die
#4
They are unemployment rates. Over the course of a single year by month and county. But over that year there were only 49 distinct levels of unemployment in all the counties.
 

noetsi

Fortran must die
#6
I have a related question. I am testing a personal theory for my agency: that the median wage in a given county determines how high an income you make when we place you (we are a state agency that finds people jobs in the various counties).

I have an ordinal scale that ranks the counties from highest to lowest median income (there are 67 in my state). I am trying to decide how to use that measure. One possibility is to just use the 67 values, but the regression will assume this is an interval measure when it is not (here it is clearly ordinal, unlike in my other question). Another is to build a dummy or set of dummies, but I have never seen a discussion of how to build dummies in this case. Do I split the counties into bottom half and top half (one dummy)? Top ten percent and bottom ninety percent? ...

Nothing I have read, in the Vocational Administration literature or anywhere else, addresses how you should split the data if you build dummies.
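
One way to experiment with the splits described above is to collapse the 1 to 67 rank into a few near-equal groups and code each group as a 0/1 dummy, leaving one group out as the reference. This is only a sketch: the function names rank_to_group and group_dummies, and the choice of four groups, are assumptions made for illustration, not a recommendation from any literature. Setting n_groups to 2 gives the top-half/bottom-half split.

```python
def rank_to_group(rank, n_counties, n_groups):
    """Map a 1-based rank to a 0-based group index with near-equal group sizes."""
    if not 1 <= rank <= n_counties:
        raise ValueError("rank out of range")
    return (rank - 1) * n_groups // n_counties


def group_dummies(group, n_groups):
    """Return 0/1 dummies for groups 1..n_groups-1; group 0 is the reference."""
    return [1 if group == g else 0 for g in range(1, n_groups)]


n_counties = 67   # counties ranked by median income, 1 = highest
n_groups = 4      # hypothetical choice: roughly quartiles

groups = [rank_to_group(r, n_counties, n_groups) for r in range(1, n_counties + 1)]
dummies = [group_dummies(g, n_groups) for g in groups]

# How many counties fall in each group (group 0 holds the highest ranks).
sizes = [groups.count(g) for g in range(n_groups)]
print(sizes)
```

Each row of dummies has at most one 1, and the counties in the reference group get all zeros, so the regression coefficients are read as differences from that group. How many groups to use remains a judgment call that the code does not settle.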
 

Miner

TS Contributor
#7
They are unemployment rates. Over the course of a single year by month and county. But over that year there were only 49 distinct levels of unemployment in all the counties.
This is probably due to rounding to fewer decimal places. Can you get access to the data used to calculate the rates?
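
To illustrate the rounding point, here is a small sketch using simulated rates. The numbers are made up (uniform between 2% and 7%, 67 counties by 12 months) and are not the actual unemployment data; the only purpose is to show how rounding to one decimal place collapses many distinct values into a few dozen levels.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical data: 67 counties x 12 months of unemployment rates (percent).
raw_rates = [random.uniform(2.0, 7.0) for _ in range(67 * 12)]

# Published rates are often rounded to one decimal place.
rounded_rates = [round(r, 1) for r in raw_rates]

n_raw = len(set(raw_rates))          # distinct values before rounding
n_rounded = len(set(rounded_rates))  # distinct values after rounding
print(n_raw, n_rounded)
```

With rates confined to a 5-point range, one-decimal rounding can produce at most 51 distinct levels no matter how many observations there are, which is the same order as the 49 levels seen in the real data.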
 

Miner

TS Contributor
#9
You can still analyze it as continuous data, but it may show up as "chunky" on normality or residual plots. This can throw off the p-values in a normality test even though all the data points fall along a straight line. It may violate all sorts of assumptions and prevent you from publishing in a journal, but I have built many perfectly useful models with it.
 

noetsi

Fortran must die
#11
I thought that to be continuous a variable had to actually take a certain number of values, not just theoretically have an infinite number of values. So if you only had, say, 40 distinct levels, the data could not be continuous. Which now sounds like a bad assumption on my part. :p

Miner, this is not for publication. It is for work. Thanks for your comments about building useful models with this. I have 20,000 or so data points, so I doubt it will have a huge impact on p-values. Technically I have the whole population, not a sample (although one can argue, I guess, that it is a subsample of what could occur in the future). It is doubtful p-values even apply in this analysis, although I will use White standard errors in any case given your comments. I look at the residuals and other tests for violations of the assumptions.

But with so much data, non-linearity is generally the only assumption I really worry about.
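
For reference, the White standard errors mentioned above can be sketched by hand for a one-predictor regression. This is a minimal illustration under stated assumptions: the function name ols_with_white_se is hypothetical, the data are simulated (49 predictor levels, 20,000 points, noise that grows with x), and the variant shown is the basic HC0 estimator with no small-sample correction. A real analysis would normally use the robust-covariance option of a statistics package instead.

```python
import math
import random


def ols_with_white_se(x, y):
    """Fit y = a + b*x; return (a, b, classical SE of b, White HC0 SE of b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Classical SE assumes one constant error variance for all observations.
    sigma2 = sum(e ** 2 for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # White (HC0) SE pairs each squared residual with that point's squared
    # deviation of x from its mean, so unequal error variances are allowed.
    meat = sum(((xi - x_bar) ** 2) * e ** 2 for xi, e in zip(x, resid))
    se_white = math.sqrt(meat) / sxx
    return a, b, se_classical, se_white


random.seed(1)
levels = [2.0 + 0.1 * k for k in range(49)]   # 49 distinct predictor levels
x = [random.choice(levels) for _ in range(20000)]
# Simulated outcome: true slope 2, noise spread proportional to x.
y = [10.0 + 2.0 * xi + random.gauss(0.0, 0.5 * xi) for xi in x]

a, b, se_classical, se_white = ols_with_white_se(x, y)
print(round(b, 3), round(se_classical, 4), round(se_white, 4))
```

The slope estimate is the same either way; only the standard error changes. Comparing the two standard errors is a quick check on how much heteroskedasticity matters for a given model.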