Estimating the age of labor market entry

#1
I need to estimate the age of labor market entry for a given year (2016), but I can only use four explanatory variables: age, sex, income value, and the category in which the worker classifies themselves (employee, self-employed, domestic worker, or employer). The data come from a household sample survey. Additionally, I thought about running a logistic regression to estimate the probability of contributing to Social Security in that same year and then using that probability as a fifth explanatory variable. For the logistic regression, I could use other variables, such as level of education. In the end, I would have five explanatory variables (age, sex, income value, category, and probability of contribution in 2016), and the response variable would be the age of labor market entry (age when the worker started their first job).
I know it is not much, but is it enough to model the age of entry? Which model could I use? Should I introduce any transformed variable (like the square of age) to capture changes between generations, for example?
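To make the question concrete, here is a minimal sketch of how the five predictors might be encoded, including a squared age term. The function name, the dictionary keys, the `"F"` coding for sex, and the choice of "employee" as the reference category are all assumptions made for illustration; they are not field names from the survey.

```python
# Category labels as described in the survey; the ordering is an assumption.
CATEGORIES = ["employee", "self-employed", "domestic worker", "employer"]


def build_features(age, sex, income, category, p_contrib):
    """Return one row of predictors for a regression on age of entry.

    All names are hypothetical; p_contrib stands for the fitted
    probability of contributing in 2016 from the logistic step.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    row = {
        "age": float(age),
        "age_sq": float(age) ** 2,  # allows a non-linear age effect
        "female": 1.0 if sex == "F" else 0.0,
        "income": float(income),
        "p_contrib": float(p_contrib),
    }
    # Dummy-code the category, dropping the first level as the reference.
    for cat in CATEGORIES[1:]:
        row[f"cat_{cat}"] = 1.0 if category == cat else 0.0
    return row
```

Rows built this way could be fed to any regression routine; whether the squared term is worth keeping would have to be checked against the survey data.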
I need this model in order to apply it to administrative records containing the history of contributions to the social security system. In this official dataset (the whole universe, not a sample), I would use only the same five variables: age, sex, the category in which the worker is officially classified (employee, self-employed, etc.), whether the insured person contributed in 2016, and, if so, the mean official income value registered for that year.
Unfortunately, I only have monthly data for the last 10 years (monthly records from Jan/2007 to Dec/2016). The older records (before Jan/2007) are also available, but they are aggregated, so I can tell how many years a given person contributed before this 10-year period, but I do not know when these contributions were made, nor when this person made their first contribution or entered the labor market. In countries where informality is high, these two ages may differ greatly. Since I want to estimate the contribution density (number of months of contributions up to Dec/2016 ÷ number of months since the worker entered the labor market up to Dec/2016) for the whole population of insured workers, I need to impute an estimate of the age of labor market entry for each insured person in the dataset.
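The density ratio described above can be sketched as follows. The function and parameter names are hypothetical, ages are taken in whole years and converted to months (a simplification), and capping the result at 1 is my own assumption for cases where the estimated entry age is too late.

```python
def contribution_density(months_contributed, entry_age, age_at_end):
    """Months contributed divided by months since labor market entry.

    entry_age is the estimated age of entry; age_at_end is the age
    in Dec/2016. Both are in years (a simplifying assumption).
    """
    months_in_market = (age_at_end - entry_age) * 12
    if months_in_market <= 0:
        raise ValueError("entry age must be below the age at the end of the period")
    density = months_contributed / months_in_market
    # Assumption: an overestimated entry age can push the ratio above 1, so cap it.
    return min(density, 1.0)


# Example: 120 months of contributions, entry at 20, aged 40 in Dec/2016.
print(contribution_density(120, 20, 40))
```

The example makes the dependence explicit: any error in the estimated entry age goes straight into the denominator.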
 

noetsi

Fortran must die
#2
Why do you have to use only these 4 variables? Do you only have data for them? I think, as is always the case, you should try to build your model on theory, your own if you cannot find any from other writers. You talk about acceptable. What does that mean? Why are you doing this analysis, and what would acceptable mean to you? The closer you get the model to the real world the better your answer will be, but it's impossible to know what the real-world answer would look like in most cases (if you knew, you would not be running the model). :p

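Generally, logistic regression results are reported as odds ratios (the exponentiated coefficients), not probabilities. They are not the same thing, although a predicted probability can be computed from the fitted log-odds.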
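A minimal sketch of how the quantities relate, using only standard formulas. The function names are made up for illustration and do not come from any particular statistics package.

```python
import math


def probability_from_log_odds(log_odds):
    """Inverse logit: turn a fitted linear predictor into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_from_probability(p):
    """Odds are p / (1 - p), so a probability of 0.8 means odds of 4 to 1."""
    return p / (1.0 - p)


def odds_ratio(coefficient):
    """Exponentiating a logistic coefficient gives the odds ratio
    for a one-unit increase in that predictor."""
    return math.exp(coefficient)
```

So if a probability is wanted as a predictor, it is the inverse-logit of the fitted linear predictor for each person that is needed, not the odds ratios from the coefficient table.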

It's impossible for any of us, without knowledge of your data, to know whether 120 points is enough or not (what does enough mean: enough statistical power?). You really need to start with a theoretical understanding of what you are measuring, if possible.