Generate large dummy variable

#1
Hey guys.

I'm actually a little depressed that I can't figure this one out myself - mostly because, instinctively, it seems easy.

Let's say I want to create a data set from scratch. I have a variable, var1, and I want it to be a dummy variable with a specific number of each value. Let's say I want var1 to have 19382 observations = 0 and 3827746 observations = 1. I don't care where each observation goes as long as they're all there. I get that I need to go:

set obs 3827746

and then I go:

gen dummy = 0

But what is the right command? How do I make 19382 of the 3827746 = 0 and the rest = 1?
 

bukharin

RoboStataRaptor
#2
I'm a little unclear about the total number of observations you want. I'm assuming it's 3827746.

The observation number is _n (see -help _variables-); you can use this to generate your dummy in one simple step:
set obs 3827746
gen byte dummy=_n>19382

If you didn't know about _n you could do it in two steps:
set obs 19382
gen byte dummy=0
set obs 3827746
replace dummy=1 if missing(dummy)
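
Either way it's worth confirming the split before moving on. A minimal sketch, assuming the total really is 3827746 (so 3827746 - 19382 = 3808364 observations should be 1):

```stata
clear
set obs 3827746
gen byte dummy = _n > 19382

* confirm the split: 19382 zeros and the remaining 3808364 ones
count if dummy == 0
assert r(N) == 19382
count if dummy == 1
assert r(N) == 3808364
```

-assert- stops with an error if a count is off, which is handy inside a do-file.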

If you don't need to specify the exact number of positive/negative dummies, but rather want dummy to be 1 in around 99.5% of the observations, you could generate the dummy randomly using the rbinomial() function:
set obs 3827746
gen byte dummy=rbinomial(1, 0.995)

If you wanted to go down that path it would be worth using -set seed- before rbinomial() to ensure replicable results.
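
A minimal sketch of the seeded version; the seed value 12345 is arbitrary and only there for illustration:

```stata
clear
set seed 12345          // arbitrary seed, any fixed number makes the draw replicable
set obs 3827746
gen byte dummy = rbinomial(1, 0.995)

* the share of 1s will be close to, but not exactly, 99.5%
tabulate dummy
```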
 
#3
This one is just perfect: gen byte dummy=_n>19382, but I have a problem. I was hoping to create 5 variables with a different number of obs in each. It's longitudinal data with enrolment numbers at a specific tertiary education institution, 2005-2010, by gender. So 2005 would maybe be 29384 females and 75637 males, 2006 would be 43394 females and 87483 males, and so on. When the total number of obs increases from one variable to the next, I kinda **** up when I get to a variable where it decreases again.
 
#4
Maybe generate them in different sets and then merge them? Seems like a difficult way to do something you (the RoboStataRaptor) could probably do in one command :)
 

bukharin

RoboStataRaptor
#5
I would just do this:
Code:
input year male count
2005 0 29384
2005 1 75637
2006 0 43394
2006 1 87483
...
end

expand count
Depending on what you'll be doing with this dataset you may not even need to use the -expand- command - you may just be able to run other commands with [fweight=count] to indicate that the "count" variable should be used for frequency weighting.
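
As a rough sketch of the weighted route, using only the two years given in this thread (the remaining years would be entered the same way):

```stata
clear
input year male count
2005 0 29384
2005 1 75637
2006 0 43394
2006 1 87483
end

* each row stands for "count" people; no -expand- needed
tabulate year male [fweight=count]

* for comparison: after -expand count- the unweighted table shows the same cells
expand count
tabulate year male
```

Both tables should show the same cell frequencies; the weighted version just keeps the dataset at one row per year/gender combination.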
 
#7
One last thing: if I use [fweight] in a cross tabulation, how would I do it? If I had, e.g., two numeric variables like this:

Count - Hat
1. 77382 773882
2. 37748 774883


Now I can't choose which one to put the fweight on, since I need them both to be frequencies. Can I do something clever? (And by me, I mean you) :)