append vs. merge, which one should I use in my case?

#1
quantiles with 2 variables

First, I will give a small introduction what I am planning to do.
I have data for approximately 6 years of retail investors. I want to check there performance for e.g. 3 years and then check how this influence their trading behavior (especially volume/turnover) and performance in the next 3 years.

I have the data for every month in separate excel files. So, what I did is: for every month I created a new excel file with the variables I want. I imported this into stata (OBDC data source --> excel files) and created a .dta file. After that I started to create a do-file, and this is where I have some trouble to decide what is the best option for me.

Short example:
Code:
use jan00.dta, clear

    append using febr00.dta
    save master.dta, replace

    append using march00.dta
    save master.dta, replace

    etc
I want to merge the datasets together. However, at the moment not all my variable names are not the same (e.g. net_performance_jan_00, net_performance_febr_00, etc). So, the append command doesn't work properly then. Now my question is: is it better to use "append" and put the data together in these (eight) variables, or is it better to use "merge" and add new variables every month?

The reason why I am asking this question is that I have a variable called account_ID, which are numbers. Most numbers will come back every month, and I am wondering if this can create any problems if I use append to merge the different .dta together.

Thank you for your time and I hope that my question is clear :)
 
Last edited:
#2
I changed some of my variables and tried to do use "append" to merge some of the months. So far it looks okay, so I guess I should use "append" to merge it.

Now my goal is to find the quantiles (thus Gross_performance linked to the accountID). If I do the following:

Code:
summarize Gross_performance, detail
I get the quantiles, but only of Gross_performance of course. But I want to know, which accounts_ID are for example in the top 25 percentile. What is the best way to do this?

I guess, one way could be if I create a new variable like this:

Code:
xtile test = Gross_performance, nquantiles(4)
Then I create a new variable, and the top 25 percentile gets the number 4. However, how can I ask stata to give me the accountID which only have the number 4 in the variable "test"? Or is there maybe an easier way to do this? Furthermore, I have a problem that he takes every ID seperate. So if use 3 monhts, accountID 1 is has a Gross_performance for every single months thus there are 3 observations. I want to add up every accountID in the data set. Does anyone know how to do this?
 
Last edited:

bukharin

RoboStataRaptor
#3
I don't understand what you're trying to do. Do you literally want to add the values of Gross_performance for each accountID over the months, then look at the top 25% of total Gross_performance? If so you could do something like:
Code:
egen total=total(Gross_performance), by(accountID)
egen tag=tag(accountID) // just tag 1 observation per account
sum Gross_performance if tag==1, d
gen top25cent=total>=r(p75)
 
#4
Hi bukharin,

I found it out this morning how to add up the gross_performance for the IDs. It's exactly what you wrote. However, my dataset contains 72 months. As I want to compare different time periods with eachother I need to create different variables for every time period. I need to use the "if" command for that, but I have some problems with that

Example: I want to add up the gross_performance for 3 months (and not only for 72 months).
Code:
accountID, gross_performance,               date
1              .1                         01jan2000             
1              -.1                       01febr2000
1              -.1                       01march2000 
3              .2                         01jan2000
3              .02                        01febr2000
3              .03                        01march2000
4              -.01                       01jan2000
4              -.02                       01febr2000
4              .04                         01march2000
I tried to do something like this

Code:
egen total_gross_quarter00 = total(gross_performance), by(accountID) if date=01jan2000, 01febr2000,01march2000
or
egen total_gross_quarter00 = total(gross_performance), by(accountID) if date>=01jan2000<=01march2000
But that doesn't work, thus I created a new date

Code:
generate newdate = dofc(date)
Then 01jan2000 has the value 14610, but then the if command doesn't work either.

Code:
egen total_gross_quarter = total(gross_performance), by(accountID), if newdate >=14610 <=14670
but if I do it like this, it works:

Code:
egen total_gross_quarter = total(gross_performance), by(accountID), if (newdate <=14670)
However, it would be better for me if I could say if (newdate = 14610, 14641, 14670. So, I guess I make a mistake somewhere but I don't really know what.
 
Last edited:

bukharin

RoboStataRaptor
#5
Are you just looking at it by quarter? If so, I would suggest re-formatting the dates as quarters:
Code:
gen quarter=qofd(date)
and optionally:
Code:
format %tq quarter
Then you can, for example, adapt the code I gave earlier as follows (the example is for the first quarter of 2010):
Code:
egen total=total(Gross_performance), by(accountID quarter)
egen tag=tag(accountID quarter) // just tag 1 observation per account per quarter
sum Gross_performance if tag==1 & quarter==tq(2010q1), d
gen top25cent=total>=r(p75) & quarter==tq(2010q1)
You could be more sophisticated and calculate the top 25th centile for all quarters by replacing the last 2 lines with:
Code:
levelsof quarter, local(quarters)
foreach quarter of local quarters {
    sum Gross_performance if tag==1 & quarter==`quarter', d
    gen top25cent=total>=r(p75) & quarter==`quarter'
}
It is probably (hint: definitely) worthwhile having a read of the Stata User's Guide for the language syntax. For example, where you've written:
Code:
if newdate >=14610 <=14670
you should have:
Code:
if newdate >=14610 & newdate <=14670
However, it is also worth pointing out that your newdate and date are exactly the same, they're just formatted differently. Try doing:
Code:
format %td newdate
list newdate in 1
display %td 14610
display td(01jan2000)
It's the same principle for quarters.

It would be well worth reading -help dates_and_times- ... handling dates and times in any stats package can be complicated! Stata's way of doing it is a bit intimidating at first, but once you get used to it you see it's actually very powerful and makes certain operations (such as the analysis by quarter above) easier than you'd think.
 
#6
No, I don't need quarters. I used quarters more as an easy example.
I am gonna use 6 months, 12, 24, 36 months and compare it with the next 6,12,24,36 months.
regarding the dates: I also read that it was the same, but I thought that was maybe causing an error or something. I am gonna try it!

Alright, this works partly :) thanks. This is the code which I am using as soon as I open the master file:

Code:
generate newdate = dofc(date)
egen total_gross_performance=total(gross_performance), by(accountID)
egen gross_performance_1quarter00=total(gross_performance), by(accountID), if newdate >=14610 & newdate <=14670
egen tag=tag(accountID) 
sum gross_performance_1quarter00 if tag==1, d 
gen top25cent= gross_performance_1quarter00 >=r(p75)
summarize gross_performance_1quarter00 if top25cent==1, detail
However, "gen top25cent= gross_performance_1quarter00 >=r(p75)" gives the value "1" also to months which are not in the first quarter (April,etc). Those months don't have any value in the "gross_performance_1quarter00" variable (they just have a "." as output there). Do you have any idea how I could solve that?
 
Last edited:

bukharin

RoboStataRaptor
#7
It's because missing (.) is a very very very high number, so matches >=

Try this instead:
Code:
gen top25cent= gross_performance_1quarter00 >=r(p75) if !missing(gross_performance_1quarter00)
 
#8
It's because missing (.) is a very very very high number, so matches >=

Try this instead:
Code:
gen top25cent= gross_performance_1quarter00 >=r(p75) if !missing(gross_performance_1quarter00)
I tried this, but what happens now is that he assigns a "0" to the first three months (1st quarter) and a "." to the other 3 months.
 

bukharin

RoboStataRaptor
#9
In terms of the missing data, isn't this what you expected? Why would you want to compare the performance in the other quarters with the centiles of the first quarter? Aren't you comparing quarter by quarter?

In terms of them all being 0 it's probably because you didn't tag accountID just in the quarter of interest (see the code I posted earlier)
 
#10
Alright, I think I can figure out where it went wrong.

For the descriptive data, I want to follow the top performing investors. So, If I know for e.g. 3 months who the top performing investors are, I need only those investors (accounts) for the next 3 months and see how they behave (in terms of performance and trading activity).
 
#11
I still have a question about this:

I used the following code:

RekeningID = accountID

Code:
egen gross_performance_years=total(gross_performance), by(RekeningID year) // create gross_performance per year
egen tag_year=tag(RekeningID year) // tag a 1 for each year per RekeningID

gen topq_2000=gross_performance_years>=r(p75) & year==y(2000)
gen topq_2001=gross_performance_years>=r(p75) & year==y(2001)
gen topq_2002=gross_performance_years>=r(p75) & year==y(2002)
gen topq_2003=gross_performance_years>=r(p75) & year==y(2003)
gen topq_2004=gross_performance_years>=r(p75) & year==y(2004)
gen topq_2005=gross_performance_years>=r(p75) & year==y(2005)
However, I only get "0" values when I create the top quantiles for seperate years. The weird thing is that I got it right for 2000, but when I put in the code for the other years it didnt work anymore.

Can somebody help me out with this? I only need a "1" per accountID if the performance is in the top quantile in that year.
 
Last edited:
#12
I solved it with the xtile command:
xtile topq_2000 = gross_performance_years, nquantiles(4), if tag_year==1 & year==2000 :)

-------------
So now I know which of the people are in the top 25 percentile. For example, if I want to know the mean performance of the top performing group in year 2000, I type:

summarize gross_performance_year if topq_2000==4

However, if I want to know how the top performing group of 2000 performed in 2001 it is harder for me to come up with a code.

my identifier = accountID (but I dont know how to use that to get the results)

If I try this:

summarize gross_performance_year if topq_2000==4 & year==2001 & tag_year==1

I get 0 observations.
Does anybody have any tips?
 
#13
My questions have nothing in common with the topic title. So, I open a new topic which makes it easier if someone uses the search later on.
 
Last edited: