# Regression analysis(?) for multiple independent variables

#### Heywood

##### New Member
Hello all,

Apologies for posting an elementary query, but my stats is very rusty. Not looking for an explicit solution, necessarily, just a pointer in the right direction. (And if I've posted to the wrong sub-forum, I'd be grateful for suggestions.)

I have N records. Each contains M real values (an individual's known characteristics) and one measurement of that individual's result on a particular test. I know that N>>M. As a simple example, suppose I have the age/height/weight of 1000 individuals (thus M=3, N=1000), as well as each person's time t_run on a 10km run at maximum effort. Importantly, in some cases, the person could not complete the run at all, so t_run for those records is undefined.

I would appreciate any help with understanding the following:

1. Assuming this data is representative of some (larger) population, what is a reasonable way to predict someone's test result (here, 10km time) as a function of known characteristics (here, age/height/weight)? Since N>>M, one idea I had was to compute the least-squares coefficients, k_m (for m = 1 .. M), such that t_run_predicted = k_1*age + k_2*height + k_3*weight.
2. What is the correct term for the approach described in (1) -- linear regression? correlation analysis? (I just need to figure out where to start looking.)
3. I'm concerned that setting t_run = (infinity) for those records where the test subject was unable to complete the run will cause problems (e.g. undefined matrix inverse and/or pseudoinverse). Would setting t_run as, say, 10X the largest t_run recorded by anyone who completed the run be a reasonable workaround?
4. I'm uncertain if the problem is linear in the known characteristics. For example, the run time might be roughly linear in height and weight, but quadratic in age. Is there a standard approach for estimating the best exponents (orders?) in such a polynomial, if any or all of them are not unity? (Again, I'm not necessarily asking for the answer -- just what this analysis is called, so I can try to teach myself how to do it.)
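
For concreteness, the least-squares idea in (1) might be sketched as follows. This is only an illustration: it assumes NumPy, the data are synthetic, and the "true" coefficients are invented. It also adds an intercept column, which the formula in (1) leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic characteristics (made-up ranges, for illustration only).
age = rng.uniform(18, 70, n)        # years
height = rng.uniform(150, 200, n)   # cm
weight = rng.uniform(50, 110, n)    # kg

# Invented "true" relationship plus noise, so the fit has something to recover.
t_run = 20.0 + 0.4 * age - 0.05 * height + 0.3 * weight + rng.normal(0, 3, n)

# Design matrix: a column of ones for the intercept, then one column
# per characteristic.
X = np.column_stack([np.ones(n), age, height, weight])

# Least-squares fit. lstsq uses an SVD-based solver, so no explicit
# matrix inverse is formed.
coef, residuals, rank, _ = np.linalg.lstsq(X, t_run, rcond=None)

def predict(age, height, weight):
    """Predicted 10 km time for one individual, using the fitted coefficients."""
    return coef @ np.array([1.0, age, height, weight])

print(coef)
```

With N>>M and characteristics that are not exact linear combinations of each other, the design matrix has full column rank and the solution is unique.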

Thanks very much in advance for any pointers or suggestions!

-Heywood

#### JesperHP

##### TS Contributor
1. Multiple Linear Regression
2. Specifically >>Multiple Linear Regression<< when M>1 for in specific M simply referred to as >>Linear Regression<<
3. What is it you want to model? You could limit the model to people capable of completing the race, and then make another model (logit) predicting whether or not a person will complete the race.
4. > I'm uncertain if the problem is linear in the known characteristics.

   Not a problem for multiple linear regression. The model is called linear because it is linear in the coefficients, not in the characteristics. Put in some quadratics and check if the coefficients are significant.
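
A rough sketch of points 3 and 4 together, assuming NumPy and entirely synthetic data (the coefficients and the age-dependent completion probability are invented): fit the run time on finishers only with an added age² column, check that coefficient's t-statistic, and fit a separate logit model for completion. The logit fit here is a hand-written Newton iteration for illustration; a statistics package would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
age = rng.uniform(18, 70, n)
weight = rng.uniform(50, 110, n)

# Invented data: older runners are less likely to finish, and the
# finishing time is quadratic in age.
p_finish = 1.0 / (1.0 + np.exp(-(4.0 - 0.06 * age)))
finished = rng.random(n) < p_finish
t_run = 70.0 - 1.2 * age + 0.02 * age**2 + 0.3 * weight + rng.normal(0, 3, n)

# Part 1: regression on finishers only, with age**2 as an extra column.
# This is still linear regression: the model is linear in the coefficients.
a, w, t = age[finished], weight[finished], t_run[finished]
X = np.column_stack([np.ones(a.size), a, a**2, w])
coef, *_ = np.linalg.lstsq(X, t, rcond=None)

# Standard errors and t-statistics, to check whether the quadratic
# coefficient is significant.
resid = t - X @ coef
sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stat = coef / se

# Part 2: logit model for completion, fitted by Newton's method.
# Age is standardized to keep the iteration well conditioned.
Z = np.column_stack([np.ones(n), (age - age.mean()) / age.std()])
y = finished.astype(float)
beta = np.zeros(Z.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-Z @ beta))
    grad = Z.T @ (y - p)                          # gradient of log-likelihood
    hess = Z.T @ (Z * (p * (1.0 - p))[:, None])   # negative Hessian
    beta = beta + np.linalg.solve(hess, grad)

print("quadratic coefficient:", coef[2], "t-statistic:", t_stat[2])
print("logit coefficients:", beta)
```

A large t-statistic on the age² column suggests the quadratic term belongs in the model; a negative logit slope on age says older individuals are less likely to complete the run.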

#### Heywood

##### New Member
Hi Jesper,

Thanks for your explanations! A quick clarification:

> when M>1 for in specific M simply referred to as >>Linear Regression<<

Did you mean "for non-specific M" (that is, for M not known to be a specific value)?

My (primitive) understanding is that the term "Multiple Linear Regression" means M>1, while "Linear Regression" either means M=1 exactly (precise definition) or M>0 (informal definition). Is that right?

> Put in some quadratics and check if coefficients are significant.

OK, I can certainly do that. What I'm wondering is, does there exist a systematic approach to solve for the exponents explicitly, in the same way that LSQ solves for the coefficients? Or is that an ill-posed problem, regardless of how overconstrained (N>>M) the system of equations is?

Sorry if these followups are a bit obtuse. Thanks again for any suggestions,

-Heywood