fixedLassoInf returns a p-value of 1 for every coefficient: what am I doing wrong?

#1
For my LASSO model, I am using CV to get the one-standard-error-rule lambda (using glmnet). I then try to get p-values for my coefficients using the R package selectiveInference, specifically the fixedLassoInf function. I want to calculate p-values for the coefficients at lambda.1se.


# coefficients at the chosen 1se lambda, intercept dropped via [-1]
bose <- coef(lasso.mod, x=lassotemp$x, y=lassotemp$y, weights=lassotemp$weight, s=cv.out$lambda.1se, exact=TRUE)[-1]
# selective inference at the same lambda, rescaled by 8 (this factor is what my questions below are about)
fixedLassoInf(lassotemp$x, lassotemp$y, bose, lambda=cv.out$lambda.1se*8)


I get the following result:


Standard deviation of noise (specified or estimated) sigma = 0.480

Testing results at lambda = 0.002, with alpha = 0.100

Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1 0.000 3.229 1 -Inf 0.000 0 0
3 0.091 3.579 1 -Inf -2.552 0 NaN
4 -0.029 -1.988 1 1.434 Inf NaN 0
5 -0.039 -3.127 1 1.232 Inf 0 0
6 -0.078 -2.841 NaN 2.735 Inf NaN 0
7 0.104 1.907 1 -Inf -5.448 0 NaN
8 0.101 3.575 1 -Inf -2.839 0 NaN
10 0.000 -1.438 1 0.000 Inf NaN 0
11 -0.077 -1.502 1 5.140 Inf NaN 0
12 0.126 2.713 1 -Inf -4.644 0 NaN
13 0.007 2.414 1 -Inf -0.306 0 NaN
14 -0.002 -1.270 1 0.145 Inf NaN 0
15 0.000 6.834 1 -Inf -0.003 0 NaN
16 0.116 3.741 1 -Inf -3.104 0 NaN
17 0.000 0.007 1 -Inf -1.487 0 0
19 0.005 0.056 1 -Inf -8.744 0 NaN
20 0.070 2.837 1 -Inf -2.458 0 NaN
21 0.050 1.213 1 -Inf -4.081 0 NaN
22 0.005 0.148 1 -Inf -3.287 0 NaN
23 0.021 0.656 1 -Inf -3.181 0 NaN
24 -0.359 -1.767 1 20.348 Inf 0 0
25 -0.074 -1.549 1 4.772 Inf NaN 0
26 -0.206 -8.226 1 2.508 Inf NaN 0
27 -0.063 -1.263 1 5.018 Inf NaN 0
28 0.151 4.101 1 -Inf -3.693 0 NaN
29 0.333 5.950 1 -Inf -5.596 0 NaN
31 -0.028 -0.776 1 3.572 Inf NaN 0
32 -0.116 -3.713 1 3.116 Inf 0 0
33 -0.112 -3.777 1 2.959 Inf NaN 0
34 0.183 6.341 1 -Inf -2.882 0 NaN
35 -0.119 -3.529 1 3.373 Inf NaN 0
36 0.053 1.127 1 -Inf -4.692 0 NaN
37 0.074 1.683 1 -Inf -4.375 0 NaN
38 0.043 0.839 1 -Inf -5.115 0 NaN
39 0.122 2.419 1 -Inf -5.033 0 NaN
40 0.202 2.678 1 -Inf -7.543 0 NaN
41 0.071 1.009 1 -Inf -7.064 0 NaN
42 -0.174 -3.563 1 4.871 Inf NaN 0
43 -0.057 -1.012 1 5.616 Inf NaN 0
44 0.120 2.237 1 -Inf -5.346 0 NaN
45 -0.115 -2.240 1 5.149 Inf NaN 0
46 -0.155 -2.056 1 7.523 Inf NaN 0
47 -0.048 -0.845 1 5.700 Inf NaN 0
49 0.039 0.725 1 -Inf -5.437 0 NaN
50 -0.403 -8.360 1 4.823 Inf 0 0
51 -0.150 -2.285 1 6.584 Inf NaN 0
52 -0.142 -1.978 1 7.205 Inf NaN 0
53 -0.071 -1.254 1 5.693 Inf NaN 0
54 -0.139 -2.336 1 5.963 Inf NaN 0
55 -0.260 -2.739 1 9.506 Inf NaN 0
56 -0.598 -7.131 1 8.392 Inf NaN 0
57 -0.253 -1.899 1 13.332 Inf 0 0
58 -0.544 -7.194 1 7.565 Inf NaN 0
59 -0.392 -4.033 1 9.715 Inf NaN 0
61 -0.140 -1.679 1 8.331 Inf NaN 0
62 -0.407 -3.422 1 11.906 Inf NaN 0
63 -0.259 -4.187 1 6.177 Inf NaN 0
64 -0.416 -5.548 1 7.491 Inf NaN 0
65 -0.595 -8.929 NaN 6.665 Inf NaN 0
66 -0.169 -3.375 1 5.003 Inf NaN 0
67 -0.359 -7.107 1 5.057 Inf NaN 0
68 -0.310 -6.923 1 4.485 Inf NaN 0
69 -0.456 -8.973 1 5.087 Inf 0 0
70 0.019 3.179 1 -Inf -0.583 0 NaN
71 -0.001 -4.969 1 0.011 Inf NaN 0
72 -0.012 -1.471 1 0.816 Inf NaN 0
73 0.000 -0.252 1 0.052 Inf NaN 0
74 0.040 4.468 1 -Inf -0.888 0 0

Note: coefficients shown are partial regression coefficients
Warning messages:
1: In fixedLassoInf(lassotemp$x, lassotemp$y, bose, lambda = cv.out$lambda.min, :
Solution beta does not satisfy the KKT conditions (to within specified tolerances)
2: In fixedLassoInf(lassotemp$x, lassotemp$y, bose, lambda = cv.out$lambda.min, :
Solution beta does not satisfy the KKT conditions (to within specified tolerances). You might try rerunning glmnet with a lower setting of the 'thresh' parameter, for a more accurate convergence.
--------------------------------------------------------------------------------


First question: is it correct to multiply the lambda value by 8 when calling fixedLassoInf? The documentation for fixedLassoInf says to divide the lambda value by 8 when extracting the coefficients with coef(), but then I do not get the coefficients I am looking for. I also found slides from a university professor online who did not divide or multiply anything, so I am not sure whether my approach is correct.
Next: I found that for lambda values around 0.05, fixedLassoInf provides reasonable results, but not for small lambda values such as my chosen 1se lambda of 0.002. Can anybody explain why that is?
Finally: I also found that with a large beta tolerance (tol.beta = 1), fixedLassoInf gives reasonable results even for the small lambda values. Can I just increase the tolerance level, or is that simply wrong?
[PS: There are no missing values in the data, which has been the source of similar problems for others.]
 

hlsmith

Not a robit
#2
I can try to help you with this tomorrow when I am at my computer. Is this an example with a continuous outcome? I haven't done this with a continuous DV, but I have a few times with a binary DV. I think the only difference is the intercept term.

It might also help if you post the generated trace plots etc., so we can follow along with your steps, such as standardizing all IVs if they are not on the same scale.

In my dealings with it, I did not multiply lambda by anything but divided it by n. Can you provide a source reference for multiplying it? To change the KKT tolerance, pass tol.kkt = 0.1; see the sketch below.
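For instance (a sketch; x, y, beta, and lam stand in for your own objects):
Code:
out <- fixedLassoInf(x, y, beta, lam, tol.kkt = 0.1)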

Can you tell us more about the background context of your project, so we have a ballpark idea of what is going into the model?
 
#3
Thanks for your offer! This is a model with a continuous DV, yes. I also have an intercept included in the model.

This is just a very simple econometric model of log(wage) regressed on 74 predictors; nothing too complicated is going on here. I have quite a few dummy variables in there and have already checked for linear dependence among them. I have about 4000 observations. The model itself works just as intended: I perform CV to get my one-standard-error-rule lambda and then proceed to get the coefficients.

The complete code for doing inference reads as follows:

Code:
# center, then standardize, the predictors by hand (hence standardize=FALSE below)
gpitemp <- scale(lassotemp$x, TRUE, FALSE)
gpitemp <- scale(gpitemp, TRUE, TRUE)
# l is a lambda sequence defined elsewhere in my script
cv.out <- cv.glmnet(x=gpitemp, y=lassotemp$y, lambda=l, weights=lassotemp$weight, alpha=1, nfolds=100)
lasso.mod <- glmnet(x=gpitemp, y=lassotemp$y, lambda=l, weights=lassotemp$weight, standardize=FALSE)
library(selectiveInference)
# coefficients at the 1se lambda, intercept dropped
bose <- coef(lasso.mod, x=gpitemp, y=lassotemp$y, weights=lassotemp$weight, s=cv.out$lambda.1se, exact=TRUE, thresh=1E-22)[-1]
fixedLassoInf(gpitemp, lassotemp$y, bose, lambda=cv.out$lambda.1se*8, tol.beta=0.1, tol.kkt=0.1)
I tried raising tol.kkt to get rid of those warnings; even raising it to 10000 [sic!] did not make them disappear.

The issue with dividing lambda by 8 is that I did CV to arrive at that specific value of lambda, so if I later calculate the coefficients using an eighth of the 1se lambda value, I do in fact not get the coefficients of the chosen model. The vignette for fixedLassoInf, though, states that it uses a different definition of the LASSO, so the lambda values have to be adjusted. Dividing one of the lambdas by 8 or multiplying the other by 8 should not make any difference, right? Either way, both functions end up working with the same pair of lambdas. I did try to run the model with the "standard procedure" of dividing the lambda passed to coef(); still, the p-values were all equal to 1.
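To illustrate what I mean by the two routes being equivalent, a minimal sketch (fit, x, y, and lam1se are placeholders for my objects; k is whatever the correct conversion factor turns out to be, 8 in my reading of the documentation):
Code:
# route (a): extract coefficients at glmnet's lambda, rescale for fixedLassoInf
beta_a <- coef(fit, s = lam1se, exact = TRUE, x = x, y = y)[-1]
out_a  <- fixedLassoInf(x, y, beta_a, lambda = k * lam1se)

# route (b): start from the "standard" lambda and divide when extracting
beta_b <- coef(fit, s = (k * lam1se) / k, exact = TRUE, x = x, y = y)[-1]
out_b  <- fixedLassoInf(x, y, beta_b, lambda = k * lam1se)

# (k * lam1se) / k equals lam1se, so beta_a and beta_b coincide:
# both routes evaluate the same model at the same pair of lambdas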
 

hlsmith

Not a robit
#4
nfolds=100 is quite a lot. I have never seen that many; with about 4000 observations it means you are fitting on 3960 observations and validating on a holdout fold of only 40. That seems pretty unbalanced. Do you have a reason for this? You usually use 5-8 folds, or maybe 10 if your dataset is small.

I am not following where the 8 value comes from originally, since you have 100 folds, not 8. Even if you had 8 folds, I still don't follow: with 8 folds you just fit the models in CV and select the most regularized lambda that performs well on the held-out folds. That is the lambda value used. Why do you think you need to multiply by 8? Please provide a resource for this. I would also throw an intercept=TRUE into the Gaussian fixedLassoInf call, to make sure that is happening.

Assuming you get all the code working, you could still just have a KKT issue. I had that once; I remember trying to look up the KKT criteria and getting pretty confused. I also remember looking into the selectiveInference package to see when it generates that error (what prompts it), but that was a long time ago.
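For what it's worth, here is my rough understanding of the condition being checked, for the "standard" lasso objective (a sketch of the math, not the package's actual code; x, y, beta, and lam are placeholders):
Code:
# stationarity for the lasso: t(x) %*% (y - x %*% beta) = lam * s, where
# s_j = sign(beta_j) on the active set and |s_j| <= 1 for inactive variables
g      <- t(x) %*% (y - x %*% beta)
active <- which(beta != 0)
max(abs(g[active] - lam * sign(beta[active])))  # should be near 0
max(abs(g[-active]))                            # should not exceed lam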

P.S., what are your weights in the model?
 

hlsmith

Not a robit
#5
Well, I took a look at the package documentation and it is as I remember:

Code:
lambda = .8
beta = coef(gfit, s=lambda/n, exact=TRUE, x=x, y=y)[-1]

# compute fixed lambda p-values and selection intervals
out = fixedLassoInf(x, y, beta, lambda, sigma=sigma)
out
You are doing the fixed-lambda version, where you fixed lambda based on cv.glmnet. So above, you use your fixed lambda value, i.e. lambda = cv.out$lambda.1se. The lambda correction happens in the beta line, your bose: note the lambda/n, where n is the number of rows in the dataset. So you pass your fixed lambda to fixedLassoInf and adjust it in your bose line of code. Let me know if this works out for you.

Another option, given your sample size: you could fit the LASSO on a 60% training set and use the active feature set in an OLS model on the 40% holdout set, which would give you interpretable categorical variables; see the sketch below. I am guessing you haven't gotten this far in your head yet, but think about what a one-standard-deviation increase in a categorical variable means; it gets pretty uninterpretable.
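An untested sketch of that idea (x and y stand in for your data):
Code:
set.seed(1)
idx   <- sample(nrow(x), size = floor(0.6 * nrow(x)))  # 60% training split
cvfit <- cv.glmnet(x[idx, ], y[idx], alpha = 1)
keep  <- which(coef(cvfit, s = "lambda.1se")[-1] != 0) # active feature set
ols   <- lm(y[-idx] ~ x[-idx, keep])                   # plain OLS on the holdout
summary(ols)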

Pages 6-8: https://cran.r-project.org/web/packages/selectiveInference/selectiveInference.pdf

P.S., Welcome to the forum Phil!
 

hlsmith

Not a robit
#6
Sorry to bombard you, but I have been trying to get better at this approach, so I find it interesting. I will also note that as lambda approaches zero, the model becomes equivalent to least squares. The lambda you posted on the CV site is pretty small, meaning your estimates are probably comparable to those from least squares. The selective inference part helps you address the fact that the final active set was not stated a priori and depends on the modeling process, so it has some merit.

I also found this line in my notes, which is relevant to your other post: the intercept does not get shrunk during the process, since it represents the average or baseline prevalence. This is comparable to the feedback you got on the CV site.
 
#7
Thanks, yeah, you are right about the number of folds; I made a mistake there. Still, that is not the cause of the error.

The 8 value should be the number of observations, so again I made a silly mistake here. It is coming from here: rdocumentation.com/packages/selectiveInference/versions/1.2.4/topics/fixedLassoInf

Be careful! This function uses the "standard" lasso objective (1/2)‖y − xβ‖₂² + λ‖β‖₁. In contrast, glmnet multiplies the first term by a factor of 1/n. So after running glmnet, to extract the beta corresponding to a value lambda, you need to use beta = coef(obj, s=lambda/n)[-1], where obj is the object returned by glmnet (and [-1] removes the intercept, which glmnet always puts in the first component).
I replaced the 8 with the number of observations, so the code now reads:
Code:
gpitemp <- scale(lassotemp$x, TRUE, FALSE)
gpitemp <- scale(gpitemp, TRUE, TRUE)
cv.out <- cv.glmnet(x=gpitemp, y=lassotemp$y, lambda=l, weights=lassotemp$weight, alpha=1, nfolds=10)
lasso.mod <- glmnet(x=gpitemp, y=lassotemp$y, weights=lassotemp$weight, standardize=FALSE)
library(selectiveInference)
bose <- coef(lasso.mod, x=gpitemp, y=lassotemp$y, weights=lassotemp$weight, s=cv.out$lambda.1se, exact=TRUE, thresh=1E-22)[-1]
# lambda rescaled by n = nrow(gpitemp), per the documentation quoted above
fixedLassoInf(gpitemp, lassotemp$y, bose, lambda=cv.out$lambda.1se*nrow(gpitemp), tol.beta=0.1, tol.kkt=0.1, intercept=TRUE)
and then get these results:

Code:
 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1  0.026   2.673       1      -Inf   -0.982           0          0
3  0.043   4.290       1      -Inf   -1.010           0        NaN
4 -0.022  -2.051       1     1.054      Inf         NaN          0
5 -0.035  -3.285       1     1.076      Inf         NaN          0
6 -0.033  -3.081       1     1.059      Inf         NaN          0
8  0.044   4.320       1      -Inf   -1.029           0        NaN
12  0.081   7.562       1      -Inf   -1.069           0        NaN
14  0.086   7.249       1      -Inf   -1.189           0          0
15  0.051   4.508       1      -Inf   -1.135           0          0
18  0.004   0.382       1      -Inf   -1.080           0          0
19  0.028   2.823       1      -Inf   -0.995           0        NaN
22  0.028   2.792       1      -Inf   -1.018           0        NaN
23  0.081   6.718       1      -Inf   -1.209           0          0
24 -0.021  -1.909       1     1.077      Inf         NaN          0
25 -0.008  -0.760       1     1.008      Inf           0          0
26 -0.095  -8.462       1     1.119      Inf         NaN          0
28  0.073   4.846       1      -Inf   -1.505           0        NaN
29  0.106   6.690       1      -Inf   -1.591           0        NaN
31 -0.004  -0.294       1     1.221      Inf           0          0
32 -0.049  -3.606       1     1.361      Inf         NaN          0
33 -0.053  -3.671       1     1.444      Inf         NaN          0
34  0.059   5.572       1      -Inf   -1.054           0          0
35 -0.049  -4.480       1     1.101      Inf         NaN          0
37  0.020   1.812       1      -Inf   -1.113           0        NaN
42 -0.057  -5.599       1     1.022      Inf         NaN          0
44  0.027   2.639       1      -Inf   -1.029           0          0
45 -0.039  -3.789       1     1.023      Inf           0          0
46 -0.027  -1.980       1     1.361      Inf         NaN          0
50 -0.058  -5.288       1     1.094      Inf         NaN          0
54  0.005   0.465       1      -Inf   -1.006           0        NaN
56 -0.061  -6.066       1     1.009      Inf           0          0
58 -0.065  -6.211       1     1.043      Inf         NaN          0
62 -0.024  -2.460       1     0.985      Inf           0          0
64 -0.053  -3.848       1     1.377      Inf         NaN          0
65 -0.073  -7.097       1     1.029      Inf           0          0
67 -0.040  -3.894       1     1.017      Inf         NaN          0
69 -0.071  -6.195       1     1.146      Inf         NaN          0
74  0.093   4.959       1      -Inf   -1.880           0        NaN

Note: coefficients shown are partial regression coefficients
The KKT warnings are gone! But the p-value still reads 1 for every coefficient.

The weights are provided by the dataset I am using; they rescale the observations so that the sample better represents the true population.
 
#8
I just now realized there were two more replies, thank you mate. So if I understand the first one correctly, I should run:
Code:
  n <- nrow(gpitemp)
  bose <- coef(lasso.mod, x=gpitemp, y=lassotemp$y, weights=lassotemp$weight, s=cv.out$lambda.1se/n, exact=TRUE, thresh=1E-22)[-1]
  fixedLassoInf(gpitemp, lassotemp$y, bose, lambda=cv.out$lambda.1se, tol.beta=0.1, tol.kkt=0.1, intercept=TRUE)
This again gives me p-values=1 for all the variables :/
 

hlsmith

Not a robit
#9
Yes, that looks better. The p-values of 1 seem troubling, given that both confidence limits end up on the same side of 0. This probably comes back to the Inf endpoints and the tail areas not being near 0.05. It might not hurt to explicitly put alpha=0.05 in the fixedLassoInf call; it's probably the default, but I prefer to see all of the defaults.
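That is, spelling everything out (a sketch reusing your objects from above):
Code:
fixedLassoInf(gpitemp, lassotemp$y, bose,
              lambda = cv.out$lambda.1se * nrow(gpitemp),
              alpha = 0.05, tol.beta = 0.1, tol.kkt = 0.1, intercept = TRUE)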
 
#10
Are you creating the lambda plot and checking the 1se value? You put lambda=l in your cv.glmnet line. Since you are ending up with a sensible 1se lambda, you are probably fine there, but I usually use the following:

Code:
grid <- 10^seq(10, -2, length = 100)  # decreasing grid of candidate lambdas
cv.out <- cv.glmnet(x, y, alpha = 1, nfolds = 10, lambda = grid)
I think you need to look up the KKT criteria or the function of tol.beta. Please report back what you find.
 
#11
Thank you! I looked into the code, and what tol.beta does is drop from the calculations those coefficients that are smaller than tol.beta scaled by some factor.
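If I read the source correctly, the active set is determined by something like this (my paraphrase, not necessarily the package's exact code):
Code:
# keep variables whose coefficients clear a column-norm-scaled threshold
vars <- which(abs(beta) > tol.beta / sqrt(colSums(x^2)))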
From the manual:
"The confidence interval construction involves numerical search and can be fragile: if the observed statistic is too close to either end of the truncation interval (vlo and vup, see references), then one or possibly both endpoints of the interval of desired coverage cannot be computed, and default to +/- Inf. The output tailarea gives the achieved Gaussian tail areas for the reported intervals; these should be close to alpha/2, and can be used for error-checking purposes."
So I reckon the p-values of 1 show up because some of the variables are too close to vlo and vup. I will adjust the tolerance to get as many variables in as possible, but I don't think there is anything else to do.
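For anyone else hitting this, the returned object appears to expose the truncation limits, so you can inspect them directly (a sketch; component names as I understand them from the documentation):
Code:
out <- fixedLassoInf(gpitemp, lassotemp$y, bose,
                     lambda = cv.out$lambda.1se * nrow(gpitemp))
# compare each observed statistic's p-value with its truncation interval
cbind(var = out$vars, pv = out$pv, vlo = out$vlo, vup = out$vup)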
 
#12
Ideally you don't want to run different models on the same data, since it can come off as mining, but it may be fine given your scenario. As I mentioned before, you could do a data split where you use the LASSO on the training set and then fit OLS with the selected subset of features on the holdout set. I have also seen language stating that you should then use a Bonferroni correction on the estimates from the holdout set, though it was not explicit on whether they meant alpha/2 since it was the second model.
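Applying the correction would look something like this (a sketch; ols stands in for the OLS fit on the holdout set):
Code:
p_raw <- coef(summary(ols))[-1, "Pr(>|t|)"]  # raw p-values, intercept dropped
p.adjust(p_raw, method = "bonferroni")       # Bonferroni-adjusted p-values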
 
#13
Sorry for the long inactivity. Yeah, sample splitting / bootstrapping might work in my case. I found a couple more techniques in Statistical Learning with Sparsity: The Lasso and Generalizations by Trevor Hastie, Robert Tibshirani & Martin Wainwright, which is available online for free. Unfortunately, I could not find any R packages that implement these techniques...
 
#14
I am familiar with that book - it's good. Which methods are you looking at? It's likely I haven't used them, but I could benefit from examining them.
 
#16
Yes, I remember reading those subsections, but I don't regularly use continuous outcomes, so I have not attempted to examine them closely. Have you looked into the papers they cite for a worked-out example?

I have seen references to PoSI before in genetics research, not so much the debiased method.

Of note, I believe I chiseled out of that book that the number of models possible given a set of k candidate covariates is 3^k, since each variable can be excluded, included with a positive effect, or included with a negative effect. This means that with 10 candidate terms there are 3^10 = 59,049 possible models to keep in mind when making corrections. In your case, 3^74 is about 2.0E35.
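Quick check of that arithmetic in R:
Code:
3^10  # 59049 possible models with 10 candidate terms
3^74  # approx. 2.03e35 with your 74 predictors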

If you want to examine one of these approaches I am happy to study along and see if I could be of any assistance.
 
#17
Yes, I am interested in these methods, and I read a couple of papers, but I am afraid I couldn't figure out how to properly apply them. Furthermore, given the high number of possible models you pointed out, I think the required computational power exceeds what I am capable of... :(
 
#18
Yes, I am interested in those procedures as well. Your overall objective is to find the best subset of predictors, correct? You could always do the data split, where you use the LASSO to get a feature set and then apply it to the holdout set. Though I have seen somewhere that you can apply a Bonferroni correction when fitting the OLS on the holdout set. It was unclear in your case whether that would be alpha/74 or alpha divided by the size of the active set kept. Have you seen any literature on this?