Help with: Benford's Law Second Digit Test (have .ado for First Digit Test)

#1
Hi all,

I'd like to perform Benford tests on a dataset using STATA and would much appreciate any help with this.

To perform the 'basic' Benford test on first digits I found a nice .ado which will do exactly this for me (link).

Does anyone have any input how I could extend the analysis to other digits, most importantly the second digit in each observation? For reference, Wikipedia has a quick briefing of the calculation (link). I imagine it should be possible to create an updated .ado which will perform the second digit test without too much pain, based on the firstdigit ado. Any help would be great.

Thanks
 
Last edited:

Dragan

Super Moderator
#2
Hi all,

Does anyone have any input how I could extend the analysis to other digits, most importantly the second digit in each observation?
I have done this using the bootstrap. More specifically, we know what the mean, variance, skew, and kurtosis are for the second digits ---assuming the data follow Benford's Law. They are mean=4.1873, variance=8.25381, skew=0.133114, and kurtosis=-1.20839. Simply bootstrap these descriptive statistics and obtain confidence intervals. I've done this using S+ ---Stata, however.
 

bukharin

RoboStataRaptor
#3
This is a tweak of Nick Cox's -firstdigit-. I have not tested it extensively but it seems to work. It would be worth validating it if you have an example from elsewhere. You could save this as seconddigit.ado and put it in your personal ADO folder (if in Windows, C:\ado\personal - otherwise run -sysdir- and Stata will tell you)

Code:
* seconddigit
* slight tweak of Nick Cox's -firstdigit- command from SSC
program seconddigit, rclass byable(recall)
	version 9.1 
	syntax varlist(numeric) [if] [in] ///
	[, BY(varname) ALLobs MISSing PERcent ]

	quietly { 
		// what observations to use 
		if "`allobs'" != "" marksample touse, novarlist 
		else marksample touse
		
		if "`by'" != "" & "`missing'" == "" markout `touse' `by', strok 
		
		count if `touse' 
		if r(N) == 0 error 2000

		// variable(s) or group(s) 
		local nv : word count `varlist'

		// by option 
		if "`by'" != "" { 
			if `nv' > 1 { 
				di as err ///
				"by() cannot be combined with `nv' variables"
				exit 198 
			}

			tempname stub
			separate `varlist' if `touse', ///
			by(`by') g(`stub') veryshortlabel `missing' 
			local varlist "`r(varlist)'" 
			local nv : word count `varlist' 
		} 	

		// display preparation and initiation 
		local nam = 0 
		foreach v of local varlist { 
			if "`by'" != "" {
				local nam = ///
				max(`nam', length(`"`: variable label `v''"')) 
			}	
			else local nam = max(`nam', length("`v'")) 
		}	
		local col = min(28, `nam') + 1 
	
		local txt ///
		"       n   chi-sq.  P-value   digit   observed   expected"
		noi di _n as txt "{col `col'}`txt'" 

		local line = `col' - 1 + length("`txt'")
		noi di "{hline `line'}"

		tempvar thisuse 
		gen byte `thisuse' = . 
		local ofmt = cond("`percent'" != "", "%11.2f", "%11.0f") 

		// loop over variables (-separate- results if -by()-) 
		foreach v of local varlist { 
			local which = ///
			cond("`by'" != "", `"`: variable label `v''"', "`v'") 
			local which = abbrev(`"`which'"', `col' - 1) 	
			replace `thisuse' = `touse' & `v' < . 

			mata : sd_work("`v'", "`thisuse'", "`percent'") 
		
			noi di as text `"`which'{col `col'}"'        "`sp'" ///
			       as res %8.0f r(N)                            ///
                               as res %10.2f r(chisq)                       ///
                               as res %9.4f r(p)                            ///
			       as res "       0" `ofmt' r(obs0) %11.2f r(exp0) 		                            
			return scalar obs0 = r(obs0) 
			return scalar exp0 = r(exp0)
 
			noi forval i = 1/9 { 
				di as res "{col `col'}`sp'{space 34}`i'"     ///
				`ofmt' r(obs`i') %11.2f r(exp`i') 
				return scalar obs`i' = r(obs`i') 
				return scalar exp`i' = r(exp`i') 
			}

			return scalar p = r(p) 
			return scalar chisq = r(chisq) 
			return scalar N = r(N) 

			local --nv 
			if `nv' noi di " " 
		}
	}	
end 

mata : 

void sd_work(string scalar varname, 
             string scalar tousename, 
             string scalar percent) 
{ 
	real colvector y, obs, exp 
	real scalar n, i, chisq   
	string scalar name 

        y = st_data(., varname, tousename)    
	n = rows(y) 
	exp = obs = J(10, 1, 0) 

	y = strtoreal(substr(strofreal(y), 2, 1))

	for (i = 0; i <= 9; i++) {
		j = i+1
		obs[j] = colsum(y :== i) 
		exp[j] = n * (log10(1 + 1/(10+i)) + log10(1 + 1/(20+i)) + ///
			log10(1 + 1/(30+i)) + log10(1 + 1/(40+i)) + log10(1 + 1/(50+i)) ///
			 + log10(1 + 1/(60+i)) + log10(1 + 1/(70+i)) + log10(1 + 1/(80+i)) ///
			  + log10(1 + 1/(90+i)))
		name = "r(obs" + strofreal(i) + ")"
		st_numscalar(name, 
			percent == "" ? obs[j] : 100 * obs[j] / n) 
		name = "r(exp" + strofreal(i) + ")"
		st_numscalar(name, 
			percent == "" ? exp[j] : 100 * (log10(1 + 1/(10+i)) + log10(1 + 1/(20+i)) + ///
			log10(1 + 1/(30+i)) + log10(1 + 1/(40+i)) + log10(1 + 1/(50+i)) ///
			 + log10(1 + 1/(60+i)) + log10(1 + 1/(70+i)) + log10(1 + 1/(80+i)) ///
			  + log10(1 + 1/(90+i))))
	} 

	chisq = colsum(((obs - exp):^2) :/ exp) 
	st_numscalar("r(p)", chi2tail(9, chisq)) 
	st_numscalar("r(chisq)", chisq)
	st_numscalar("r(N)", n)
}  	

end
 
#5
Thank you both for your replies, they're both helpful. I'll test the .ado tomorrow and report back with the results. I believe I'll have the chance to test it against a previous example. Cheers!
 
#6
Hi again all,

Firstly, thanks a lot for the help with the code. I've been trying it out and it looks very interesting. One question though. When running the second digit test I noticed that the sum of the obsereved digits do not match the sample size (i.e. the 'n' is different from the total I get if I add up the obsereved digits 0-9 in the 'Observed' column...). Any help on explaining this discreprancy, or with 'debugging' the code would be much appreciated. My initial thoughts are that it has something to do with when the variable 'y' is defined, but I'm not sure...

Thanks
 

bukharin

RoboStataRaptor
#7
Yes, this is what I was getting at above when I mentioned problems with the denominator.

Try replacing line 63:
Code:
replace `thisuse' = `touse' & `v' < .
with
Code:
replace `thisuse' = `touse' & !missing(substr(string(`v'), 2, 1))