# Help, struggling with statistical analysis

#### siyareigner

##### New Member
I have patient characteristics such as gender, age, tissue type (normal or tumor), tumor location (cardia, corpus, antrum, fundus), Lauren classification (intestinal, diffuse, mixed), etc. The dependent variable is expression of a protein (a continuous variable). I want to see whether these clinical parameters are associated with protein expression, and whether expression differs by tissue type (normal or tumor) for males and females. What is the first step? Do I have to make plots and visually check for associations first? I also want to run a multiple regression with the clinical parameters as predictors and protein expression as the dependent variable.

Your help would be greatly appreciated.

#### Karabiner

##### TS Contributor
How large is your sample size? How many characteristics are included, and how many of them are categorical (such as tumor location), with how many categories?

You could well start with bivariate descriptions and graphs to see what's going on. Is "protein expression" a continuous variable, or something else?

With kind regards

Karabiner

#### siyareigner

##### New Member
Sample size is 94. There are twelve characteristics (13 with protein expression), of which 11 are categorical, with between two and five categories each. And yes, protein expression is a continuous variable. I should add that for each patient I have protein expression for both normal and tumor tissue (the variable is tissue type), so the data are paired. So if I were to compare protein expression across categorical variables, it would be done by tissue type.

#### Karabiner

##### TS Contributor
A categorical predictor with k levels is transformed into k-1 dummy variables if you want to use it in a regression. So you'll soon have two or three dozen predictors in your model, and only 94 observations. Therefore you should pre-select the characteristics to be used in the multiple regression(s). The pre-selection should preferably be based on substantive considerations (theoretical or practical interest), not on statistical pre-tests.
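As a rough sketch of that arithmetic in Python: the example values below are made up, and `dummy_code` and `n_dummies` are hypothetical helper names, not functions from any library. It only illustrates how k levels become k-1 indicator columns and how fast the predictor count can grow with 11 categorical variables of two to five levels each.

```python
def dummy_code(values, reference=None):
    """Turn one categorical variable into k-1 indicator (dummy) columns.

    The reference level gets no column of its own; it is the row of all zeros.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first level alphabetically is the reference
    columns = [lvl for lvl in levels if lvl != reference]
    rows = [[1 if v == lvl else 0 for lvl in columns] for v in values]
    return columns, rows


def n_dummies(level_counts):
    """Total number of dummy predictors for variables with the given level counts."""
    return sum(k - 1 for k in level_counts)


# Made-up tumor locations for five patients (4 levels -> 3 dummy columns).
location = ["cardia", "corpus", "antrum", "fundus", "antrum"]
columns, rows = dummy_code(location)
print(columns)
for value, row in zip(location, rows):
    print(value, row)

# Bounds for 11 categorical variables with between 2 and 5 levels each.
print(n_dummies([2] * 11), "to", n_dummies([5] * 11), "dummy predictors")
```

With only 94 observations, the upper end of that range leaves very few observations per predictor, which is why the pre-selection matters.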

With kind regards

Karabiner

#### siyareigner

##### New Member
So I should first do maybe a visual check or a correlation test to see if there is any correlation, and then take only the variables that look interesting into my regression? Another question: is it OK if my data for tissue type are arranged in one column like this: normal, tumor, normal, tumor, so that every patient has two rows? Or is it better to split my protein expression variable into two variables, i.e. protein expression for normal tissue and protein expression for tumor tissue? Then I wouldn't need to repeat the patient ID in my column, and the tissue type variable would no longer be needed.

Thank you.
Best regards

#### hlsmith

##### Less is more. Stay pure. Stay poor.
What would be the purpose of doing this analysis? How do you plan to use the results?

#### Karabiner

##### TS Contributor
> so I should first do maybe a visual test or correlation test to see if there is any correlation and then take only these variables that are interesting into my regression?

This is quite the opposite of what I wrote. The selection should not be based on statistical criteria, but on theoretical and/or practical considerations.

> Another question, is it ok if my data for tissue type are arranged like this in the column: normal, tumor, normal, tumor for every patient thus 2 values? Or is it better to split my protein expression variable into two variables, such as protein expression for normal tissue and protein expression for tumor tissue? Then I don't need to have patient ID twice every time in my column and also Tissue type variable is no longer needed.

Well, IMHO it depends on which statistical software you use, which statistical analysis you perform, and how familiar you are with both of them.
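Either layout can be converted into the other, so the choice is not final. A minimal Python sketch, assuming made-up expression values and hypothetical field names (`patient_id`, `tissue`, `expression`); the helpers `long_to_wide` and `wide_to_long` are illustrative, not part of any package:

```python
from statistics import mean

# Long format: one row per patient AND tissue type (made-up values).
long_rows = [
    {"patient_id": 1, "tissue": "normal", "expression": 2.1},
    {"patient_id": 1, "tissue": "tumor", "expression": 3.4},
    {"patient_id": 2, "tissue": "normal", "expression": 1.8},
    {"patient_id": 2, "tissue": "tumor", "expression": 2.9},
]


def long_to_wide(rows):
    """One entry per patient, with separate normal and tumor values."""
    wide = {}
    for row in rows:
        wide.setdefault(row["patient_id"], {})[row["tissue"]] = row["expression"]
    return wide


def wide_to_long(wide):
    """Back to one row per patient and tissue type."""
    return [
        {"patient_id": pid, "tissue": tissue, "expression": value}
        for pid, by_tissue in wide.items()
        for tissue, value in by_tissue.items()
    ]


wide = long_to_wide(long_rows)

# The wide layout makes the paired structure explicit:
# each patient contributes one tumor-minus-normal difference.
differences = [v["tumor"] - v["normal"] for v in wide.values()]
mean_difference = mean(differences)
print(wide)
print(mean_difference)
```

Roughly speaking, the wide layout is convenient for paired comparisons (one difference per patient), while the long layout with a patient ID is what repeated-measures or mixed-model routines usually expect.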

With kind regards

Karabiner