Don't know where to start, please help

:eek: Please, any kind person, help, as I am horribly stuck!!!

I have a *very* large dataset (n > 300,000 rows) of water readings in this format:

Reading1; Date1; Reading2; Date2; Reading3; Date3; ...
50Kl; 23/04/06; 122Kl; 23/07/06; 45Kl; 25/10/06

Two datasets have 16 columns each in this layout, and one has 30 columns (biannual readings 2006 - 2009 and quarterly readings 2006 - 2009). Each reading is taken on the date that follows it and reflects the water used up to that day (i.e. Reading 2 is all the water used between Date 1 and Date 2).

What I want to do is analyse change over time: how water consumption decreased over the period. I also want to look at the dates on which policies were implemented and see whether water consumption figures "responded" accordingly; aggregate water consumption to an areal (census) measure so I can analyse it against Census data; and look at the significance of various other variables, such as land value and lot size.

I know I can do a Repeated Measures analysis for some of this, but the problem I have is that the reading dates are NOT the same across households and can differ by up to two months; i.e. under the column Date 1 there are many different dates.

I haven't the vaguest idea where to start, so any help is greatly appreciated!!!!
I can suggest how I would start. To deal with water-usage records that span portions of the desired date categories, I would find (and test) some reasonable means of estimating the consumption to be assigned to those categories.

In other words I would work out a mapping between the uncategorized data provided and the categorized data required for my analysis.

To give a hypothetical example: suppose I found through a regression study that residential users displayed roughly linear water usage, but with different slopes during the summer months. I would then write a spline function based on the month spans and use it to calculate estimated usage for each month (or week, or whichever date categories I needed) for the residential users. Each user may well have a unique usage profile, requiring a unique mapping.
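The simplest version of that mapping is the special case where usage is assumed constant within each billing interval: spread the recorded consumption across the calendar months the interval touches at a uniform daily rate. A rough Python sketch (the function name and the (start, end] interval convention are my assumptions; a seasonal spline would replace the constant rate with a month-dependent one):

```python
from datetime import date, timedelta

def prorate_by_month(start, end, kilolitres):
    """Spread the consumption recorded for the interval (start, end]
    over calendar months, assuming a constant daily rate.
    Returns a dict mapping (year, month) -> estimated kilolitres."""
    days = (end - start).days
    rate = kilolitres / days            # constant daily usage (assumption)
    monthly = {}
    d = start + timedelta(days=1)       # first day attributed to this reading
    while d <= end:
        key = (d.year, d.month)
        monthly[key] = monthly.get(key, 0.0) + rate
        d += timedelta(days=1)
    return monthly
```

Once every reading for every household is prorated onto a common monthly grid, the unequal reading dates stop being an obstacle, because all households are compared on the same derived categories.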

I would do that on some subset of the 300,000 data records, and then test my results for validity (perhaps by testing against similar analysis of other random subsets) before tackling the whole large data set.

Once I was satisfied that my derived data was a fair (and quantified) representation of the source data then I would carry out the comparison analysis using the derived data.

I have done this for heating oil consumption and, after some hours of programming, data study and revision, I produced results that proved reasonably accurate when tested against validation data.
Thank you both for very comprehensive answers. Just a quick question: what if I ran a Linear Mixed Model on the dataset (after restructuring), with fixed and random effects, given that the data are longitudinal?