Determining tree cover on agricultural fields - is my method statistically sound?

Dear all,

My name is Jan and I am a master's student in tropical ecology. This is my first post and request for help on this forum, because I am struggling with a research proposal I am writing. The study is different from what I have done so far, so I’m not sure about the statistical implementation. I really hope that one of you can share your knowledge about this.

In a large area in Africa, a not-for-profit organisation has established a large-scale project that tries to motivate farmers to increase the number of naturally growing trees on their farms. Each year, the participating farmers report their number of “new” trees to designated farmers, who collect the data from multiple farmers and send the numbers to the organisation responsible for the programme.
The following is applicable:
  • 115,000 farmers are involved
  • 338 villages are included in the project
  • 1200 farmers are collecting the data of all the involved farmers
  • A total of 8.5 million new trees have been reported
  • The size of the total programme area (in hectares) is unknown
  • The size of each farm (in hectares) is unknown
I have to assess how successful this programme is in terms of actual growing trees vs. reported trees by the farmers. Since the total area and the area per farm is unknown, I have to determine the actual amount of new trees in a different way than just selecting random plots in the area. That’s why I came up with the following method:

I will “count” the trees on randomly selected farms (so I use the 115,000 farms as the population). I will compare these counts with the number of trees reported each year for those particular farms in order to assess the accuracy of the numbers reported by the farmers themselves. With that information, I can eventually estimate the actual number of trees in the whole programme. More details:
  • I calculated a sample size of 69 farms, based on the following parameters (these are not just randomly chosen, but based on certain methodology):
    • Confidence level: 90%
    • Margin of error: 10%
    • Population proportion: 50%
    • Population size: 115,000
  • Due to time and resource constraints, I want to use a cluster random sampling method:
    • 1: Random selection of 12 villages
    • 2: Random selection of 6 farms per selected village
  • Each sampled farm will deliver a percentage, namely the number of trees actually found divided by the number of trees reported by the farmer.
  • With the sample, I think I will calculate the mean percentage and use that to “correct” the total number of reported trees.
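For reference, the sample size of 69 can be reproduced with the standard formula for estimating a proportion, followed by a finite population correction. The Python sketch below is only an illustration: the function name is made up, and the formula assumes simple random sampling, which the clustered design is not. With z rounded to 1.65 it returns 69; with the more precise 1.645 it returns 68.

```python
import math

def sample_size_proportion(z, margin, p, population):
    """Sample size for estimating a proportion, with finite population correction."""
    n0 = z**2 * p * (1 - p) / margin**2       # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)      # finite population correction
    return math.ceil(n)

# 90% confidence, 10% margin of error, p = 50%, N = 115,000 (values from the post)
print(sample_size_proportion(1.65, 0.10, 0.50, 115_000))
print(sample_size_proportion(1.645, 0.10, 0.50, 115_000))
```

Because the population is so large, the finite population correction barely changes the result here.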
I was wondering whether this method is statistically sound. I have ample experience with sampling based on area (like using sample plots to measure certain aspects of a forest area), but this is a somewhat different type of research. For example, can I just use this method to calculate the mean percentage and apply that to the total number of reported trees (8.5 million)? In some way it feels too easy.

Thanks in advance for your help!


This is really about the design rather than the statistics, although those types of questions are fine here (just not many of us are experts in that, myself included).

"Since the total area and the area per farm is unknown, I have to determine the actual amount of new trees in a different way than just selecting random plots in the area. That’s why I came up with the following method:"
Why are they not known? If you don't know the true population then you really can't sample it correctly. This is a pretty common problem in analysis. Without knowing how many farms there are, you can't really be sure if you can generalize from your data or not (a validity issue).

I think (it has been a while) that when you do cluster sampling you can't use the normal standard errors. You should look this up.
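A rough way to see why: farms in the same village tend to resemble each other, so a clustered sample carries less information than the same number of independently drawn farms. A common rule of thumb is the design effect, deff = 1 + (m - 1) * ICC, where m is the number of farms sampled per village and ICC is the intraclass correlation. The sketch below is illustrative only; the function names and the ICC values are assumptions, not estimates from this programme.

```python
def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_clusters, cluster_size, icc):
    """Roughly how many independent observations the clustered sample is worth."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)

# 12 villages x 6 farms, as in the proposed design; ICC values are hypothetical.
for icc in (0.0, 0.1, 0.3):
    deff = design_effect(6, icc)
    n_eff = effective_sample_size(12, 6, icc)
    print(f"ICC={icc:.1f}: deff={deff:.2f}, effective n={n_eff:.1f}")
```

If the ICC turns out to be well above zero, the 72 sampled farms may be worth fewer independent observations than the 69 that the simple-random-sampling formula calls for.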

I don't think this would be invalid statistically. I am not sure it is a valid design; I lack the expertise to address that. Certainly this type of sampling is done because of resource limitations and not knowing the true population (I think).

Hi noetsi,

Thank you for your reply. I agree with you that it is not purely statistical. Unfortunately, the not-for-profit organisation didn't think thoroughly about how to monitor the progress/success of the programme, so I have to make do with the information they have.

In my opinion, the true population in my case is the total number of involved farmers (115,000). So, even though I don't know the total area (which is normally used to determine things like biomass or number of trees per hectare), the total number of farmers is known. The design is therefore more "socially focused": it determines how accurately the farmers report the number of trees on their farms. If I can determine from my samples that, for example, only 50% of the trees reported by the farmers themselves can be found, I hope I can use that figure to adjust the total number of reported trees (8.5 million).
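To make the correction step concrete, here is a small Python sketch with made-up farm counts (none of these numbers are real data, and the function names are hypothetical). It compares two ways of turning the sample into a correction factor: the mean of the per-farm ratios, and the ratio of total found to total reported trees. The two can differ when farms that report many trees are more or less accurate than farms that report few.

```python
# Hypothetical (found, reported) tree counts for a few sampled farms.
farms = [(40, 80), (10, 10), (5, 50), (90, 100)]

def mean_of_ratios(pairs):
    """Average of per-farm found/reported ratios (every farm weighted equally)."""
    return sum(found / reported for found, reported in pairs) / len(pairs)

def ratio_of_totals(pairs):
    """Total found divided by total reported (farms weighted by reported trees)."""
    total_found = sum(found for found, _ in pairs)
    total_reported = sum(reported for _, reported in pairs)
    return total_found / total_reported

REPORTED_TOTAL = 8_500_000  # programme-wide reported trees, from the post

for name, estimator in [("mean of ratios", mean_of_ratios),
                        ("ratio of totals", ratio_of_totals)]:
    r = estimator(farms)
    print(f"{name}: {r:.3f} -> corrected total {r * REPORTED_TOTAL:,.0f}")
```

Note that the mean of ratios is undefined for a farm that reported zero trees, whereas the ratio of totals still works in that case.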

Thank you for the tip to look up whether or not I can use normal standard errors when using cluster sampling. I will definitely do that!

So, besides the questions concerning the research approach, I read between the lines that there are no clear "red flags" concerning the statistical part of this study.

Well, if you think on the basis of my reply that there are statistical problems with my research, please let me know. If not, then I would like to thank you for taking the time to reply to my questions!