Skewed Data

#1
Good Morning,

I am wondering if anyone can provide some advice regarding a problem with a large data set I have.

I have a lot of outbreak data for a plant disease across the UK. I am creating a summary map of this data, but for two locations I have significantly more data due to highly active people recording occurrence. It is all valid but due to these highly active reporters it does appear to occur more frequently in certain areas. Is there a method to deal with this, or do I just explain all the time why this is occurring?

Thanks for any advice

Siobhán
 
#2
...but due to these highly active reporters it does appear to occur more frequently in certain areas. Is there a method to deal with this, ...
To my knowledge, no, there is no way to deal with this.

As I understand it, you can't know if it is a high occurrence of the disease, or a high reporting propensity.

One could speculate about doing a sample survey and interviewing potential "reporters", but that would need a large sample to get accurate estimates of reporting proportions, and besides, people don't actually behave as they say they do.

Another speculation: if you have another method of finding a unit with the disease, you could use the capture-recapture method. But that seems to demand even more work.

Let us see if someone else has got a suggestion.
 

bugman

Super Moderator
#3
Well, Greta has some good suggestions. But if I understand correctly, you have a high number of workers, and therefore more data, in one area compared with others that have fewer workers.

This being the case, can't you just standardise these data and report as a percentage?
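
As a rough sketch of what standardising could look like, assuming you can put a number on the active reporters in each region (the thread does not say such counts exist, and the region names and figures below are made up), you could divide reports by reporters:

```python
def reports_per_reporter(report_counts, reporter_counts):
    """Divide each region's report count by its number of active reporters.

    Both arguments are dicts keyed by region name. Regions with no known
    reporters get None, because no rate can be computed for them.
    """
    rates = {}
    for region, n_reports in report_counts.items():
        n_reporters = reporter_counts.get(region, 0)
        if n_reporters <= 0:
            rates[region] = None
        else:
            rates[region] = n_reports / n_reporters
    return rates


# Hypothetical figures, for illustration only.
reports = {"Region A": 400, "Region B": 40, "Region C": 30}
reporters = {"Region A": 20, "Region B": 2, "Region C": 3}

rates = reports_per_reporter(reports, reporters)
for region, rate in rates.items():
    print(region, rate)
```

With these made-up numbers, Region A has ten times the raw reports of Region B but the same rate per reporter. This only corrects for the number of reporters, not for how hard each one works.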
 
#4
But if I understand correctly, you have a high number of workers, and therefore more data, in one area compared with others that have fewer workers.

This being the case, can't you just standardise these data and report as a percentage?
I did not think of this possibility. But if the "workers" all have the same knowledge, the same working intensity and the same reporting propensity, then I agree with Bugman.

I was thinking of the case where the general public reports, with maybe very different knowledge and reporting propensity. Also, the reporting propensity might increase or decrease over time, so that it is not possible to make statements about whether the disease is increasing or decreasing.
 
#6
Good Morning,

Thank you very much for the feedback and advice, very much appreciated!

My data cover a ten-year period and are reported by a combination of the public and official 'scouts' of disease. I know that the data are skewed, as there are some scouts in certain areas who are, you could say, more passionate about the disease than others and thus are much more active and return a far greater number of samples each year. This only occurs in two to three regions of Britain, though, and I know who these active people are. All other areas have occurrences of the disease but lack such highly active reporters. I have simply mapped outbreaks over the ten-year period but, as expected, the intensity in the areas with active scouts is extremely large in comparison with the rest of Britain.

Taking averages or proportions is difficult. I do not have a set number of fields that are monitored each year. The system essentially works on a 'we hope people report disease' basis. We get samples from different fields each year, and sometimes from allotments etc. It is therefore difficult to know what percentage is infected each year. I have attempted this, though, by treating the total disease area as any location that has reported a disease occurrence in the last ten years, but this seems wrong: too many areas have reported only one occurrence in the last ten years.
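
To make the attempt above concrete, here is a minimal sketch of that calculation. It assumes each outbreak record can be reduced to a (location, year) pair; the record layout and the example locations are hypothetical. Repeated samples from the same location in the same year are collapsed, so a very active scout cannot inflate that location's count.

```python
from collections import defaultdict


def yearly_proportion(records):
    """Share of ever-reporting locations that reported disease in each year.

    records is a list of (location, year) pairs, one per reported outbreak.
    The denominator is every location that appears at least once.
    """
    ever_reported = {location for location, _ in records}
    if not ever_reported:
        return {}
    locations_by_year = defaultdict(set)
    for location, year in records:
        locations_by_year[year].add(location)  # duplicates collapse here
    return {
        year: len(locations) / len(ever_reported)
        for year, locations in sorted(locations_by_year.items())
    }


def single_year_locations(records):
    """Locations that reported disease in exactly one year."""
    years_by_location = defaultdict(set)
    for location, year in records:
        years_by_location[location].add(year)
    return {loc for loc, years in years_by_location.items() if len(years) == 1}


# Hypothetical records, for illustration only.
records = [
    ("field_1", 2005), ("field_1", 2005), ("field_1", 2006),
    ("field_2", 2005), ("allotment_3", 2006), ("field_4", 2006),
]

proportions = yearly_proportion(records)
one_offs = single_year_locations(records)
print(proportions)
print(sorted(one_offs))
```

In this toy example three of the four locations reported in only one year, yet all of them sit in the denominator every year, which is exactly the weakness described above.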

Thanks again for your advice

Siobhan
 
#8
Maybe you can use a capture-recapture model. Suppose you have a group of enthusiasts or ordinary people from the public. Their investigation would be the "capture" part (also called "mark"). Then some of your professional investigators come to the same area. That would be the "recapture" part. I believe that could give an objective estimate of the true population size in that area.

You can do the same for each area. Or maybe you could do it for a sample of areas and use that as a calibration model.
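
A minimal sketch of the idea, using the standard Lincoln-Petersen estimator and Chapman's small-sample variant. It assumes the two surveys are independent, cover the same area over the same period, and that an infected site can be recognised as the same site in both lists. The site identifiers below are made up.

```python
def lincoln_petersen(first_survey, second_survey):
    """Estimate the total number of infected sites as n1 * n2 / m.

    n1 and n2 are the numbers of sites found by each survey and m is the
    number of sites found by both.
    """
    first, second = set(first_survey), set(second_survey)
    overlap = len(first & second)
    if overlap == 0:
        raise ValueError("no sites in common, so the estimate is undefined")
    return len(first) * len(second) / overlap


def chapman(first_survey, second_survey):
    """Chapman's variant, (n1 + 1)(n2 + 1) / (m + 1) - 1.

    It is less biased for small samples and is defined even when m is 0.
    """
    first, second = set(first_survey), set(second_survey)
    overlap = len(first & second)
    return (len(first) + 1) * (len(second) + 1) / (overlap + 1) - 1


# Hypothetical site lists: the public found 8 sites, the scouts found 6,
# and 4 sites appear in both lists.
public_sites = {"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"}
scout_sites = {"s5", "s6", "s7", "s8", "s9", "s10"}

estimate = lincoln_petersen(public_sites, scout_sites)
adjusted = chapman(public_sites, scout_sites)
print(estimate)  # 8 * 6 / 4 = 12.0
print(round(adjusted, 1))
```

The fewer sites the two lists share, the larger the estimated number of infected sites that neither group reported.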

(Sorry, I have not (yet) read the link that Bugman gave. I also guess that in this case there is a very good potential for combining data from different methods and sources.)
 

bugman

Super Moderator
#9
(Sorry, I have not (yet) read the link that Bugman gave. I also guess that in this case there is a very good potential for combining data from different methods and sources.)
Yeah, this is becoming a "big thing" now. Long data sets combining government, university and citizen science. I have seen a number of papers trying to address exactly this issue, but this one is, I think, the most relevant because it talks about spatial aspects.