Power Analysis Logistic Regression

Buckeye

Active Member
#1
I apologize in advance, but I have to be a little vague to avoid sharing proprietary info. My company is working with a third-party vendor that sells data we think could help improve some of our processes. My idea is to predict a binary (yes/no) outcome using the vendor's data in addition to some in-house variables. As you can imagine, this data comes at a cost. We need to tell the vendor up front how much data we need, keeping in mind that the data comes to us on an "as available" basis: we collect it over time, not all at once.

The main problem is that I don't have a hypothesized model in mind. All I know is the historical rate of the yes/no outcome in the population of interest. I can maybe dig up info on models we've used in the past to predict this outcome, but none of their variables would be similar to the new data from the vendor. So, how do I go about calculating the sample size needed to get value out of this opportunity?
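When only the historical outcome rate is known, one rough starting point is an events-per-variable (EPV) heuristic: plan for enough records that the rarer outcome class appears about 10 times per candidate predictor. The sketch below is a rule-of-thumb calculation, not a formal power analysis. The function name, the default of 10 events per variable, and the example inputs are assumptions for illustration, not figures from this thread.

```python
import math


def epv_sample_size(event_rate, n_predictors, events_per_variable=10):
    """Rough total sample size from an events-per-variable heuristic.

    event_rate: historical proportion of "yes" outcomes (strictly between 0 and 1).
    n_predictors: number of candidate predictors (vendor plus in-house).
    events_per_variable: assumed heuristic target, commonly quoted as 10.
    """
    if not 0 < event_rate < 1:
        raise ValueError("event_rate must be strictly between 0 and 1")
    if n_predictors < 1:
        raise ValueError("n_predictors must be at least 1")
    # The rarer class limits how much information the data carries.
    minority_rate = min(event_rate, 1 - event_rate)
    required_minority_events = events_per_variable * n_predictors
    return math.ceil(required_minority_events / minority_rate)


if __name__ == "__main__":
    # Hypothetical inputs: a 25% historical "yes" rate and 8 candidate predictors.
    print(epv_sample_size(event_rate=0.25, n_predictors=8))
```

Because the data arrives over time, the same function can be rerun as the list of candidate predictors firms up; the result is best read as a lower bound on what to request.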
 

Buckeye

Active Member
#3
I think I mentioned this in a thread here about simulation. I've always been taught to simplify the design and the work, and to recognize that the sample size is an estimate at best.
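A minimal sketch of what a simplified, simulation-based estimate could look like: instead of a full logistic model, assume a single vendor variable split into "low" and "high" groups, guess the outcome rate in each, and count how often a two-proportion z-test detects the difference at a given group size. The simplification to two groups, the function names, and the example rates are all assumptions, not anything specified in this thread.

```python
import math
import random


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulated_power(n_per_group, rate_low, rate_high,
                    alpha=0.05, n_sims=2000, seed=0):
    """Share of simulated studies in which the difference is detected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x_low = sum(rng.random() < rate_low for _ in range(n_per_group))
        x_high = sum(rng.random() < rate_high for _ in range(n_per_group))
        p = two_proportion_p_value(x_low, n_per_group, x_high, n_per_group)
        if p < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    # Hypothetical rates: 20% "yes" in the low group, 25% in the high group.
    for n in (250, 500, 1000, 2000):
        print(n, simulated_power(n, 0.20, 0.25, n_sims=1000))
```

Scanning a few group sizes this way gives a range rather than a single number, which fits the point that the sample size is an estimate at best.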
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
Will all the data go into the same model? For example, I am fitting a patient survival model. I link each patient's census tract to census data (e.g., median income for the tract, home ownership, education, rurality), then add that data to my model, which already includes patient-level data. So if I had to pay for this census data, would it be worth it? Well, how predictive is that data of the outcome? For me, historically, it has not been more informative than the medical-record-level data I already have, so if it weren't free, I wouldn't do it. But since I can do it, I can say that these possibly associated socioeconomic status variables are not helping to explain the outcome. Not sure if this helps or not.
 

Buckeye

Active Member
#5
I know that there are a few other outcomes my company wants to investigate using this data. However, those areas are not my expertise. I am thinking of creating a "control group" (by no means in the experimental sense). This group would have a similar makeup to the population of interest; the only difference is that the vendor would not share data for it.

I want to be able to say something about the value of including these new variables for predicting the outcome, compared with not using them.
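One way to put a number on that comparison, sketched under assumptions: score the predicted probabilities from a baseline model (in-house variables only) and from an augmented model (in-house plus vendor variables) against the same held-out outcomes, then compare their Brier scores. The function names and the toy numbers below are made up for illustration; they are not results from any real model.

```python
def brier_score(outcomes, predicted_probs):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    if len(outcomes) != len(predicted_probs) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for y, p in zip(outcomes, predicted_probs)]
    return sum(squared_errors) / len(outcomes)


def brier_improvement(outcomes, baseline_probs, augmented_probs):
    """Positive values mean the augmented model predicts better (lower Brier)."""
    return (brier_score(outcomes, baseline_probs)
            - brier_score(outcomes, augmented_probs))


if __name__ == "__main__":
    # Toy held-out data, invented purely to show the call pattern.
    y = [1, 0, 0, 1, 0]
    baseline = [0.4, 0.3, 0.3, 0.4, 0.3]   # in-house variables only
    augmented = [0.7, 0.2, 0.1, 0.6, 0.3]  # in-house plus vendor variables
    print(brier_improvement(y, baseline, augmented))
```

Both models would need to be scored on records that neither was fitted on; otherwise the augmented model tends to look better simply because it has more parameters.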