Delphi assessment: which tests?

Hi all,

I've recently undertaken a Delphi workshop in an attempt to establish expert opinion on the importance of a series of behaviours to a specific species' welfare.

I worked with two panels of experts independently, one of 15 and the other of 4 (not ideal, I know!).

Each expert was individually asked to rank 23 behaviours against 12 criteria, with a score from 0 to 5. Each criterion gave a slightly different insight into the potential welfare significance of each behaviour. Where experts felt they could not provide a score for a specific behaviour-criterion combination, they recorded a blank.

The questions I'd like to ask are: is there agreement in ranks within and between the two groups (i.e. is there a group effect), and is the ranking process consistent / repeatable?

It would also be interesting to see whether the scores for each behaviour co-vary: do behaviours typically score similarly on all criteria, or is there a spread? I'd also like to determine which criteria were most consistent.
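To make "spread" and "consistent" concrete, here is a minimal descriptive sketch in Python. It is not a formal test, just one way the two ideas could be quantified. The data layout (one dict per expert mapping behaviour to a list of criterion scores, with None for a blank), the function names, the toy numbers, and the choice of standard deviation (which treats the 0-5 scores as interval data) are all assumptions for illustration.

```python
from statistics import mean, pstdev

def criterion_means(panel, behaviour, n_criteria):
    """Mean score per criterion for one behaviour, ignoring blanks (None)."""
    means = []
    for c in range(n_criteria):
        vals = [e[behaviour][c] for e in panel if e[behaviour][c] is not None]
        means.append(mean(vals) if vals else None)
    return means

def spread_across_criteria(panel, behaviour, n_criteria):
    """SD of the criterion means; small means the behaviour scores
    similarly on all criteria."""
    m = [x for x in criterion_means(panel, behaviour, n_criteria) if x is not None]
    return pstdev(m)

def disagreement_on_criterion(panel, behaviours, c):
    """Between-expert SD for criterion c, averaged over behaviours;
    small means the experts were consistent on that criterion."""
    sds = []
    for b in behaviours:
        vals = [e[b][c] for e in panel if e[b][c] is not None]
        if len(vals) >= 2:  # need at least two experts to talk about agreement
            sds.append(pstdev(vals))
    return mean(sds) if sds else None

# Toy data (hypothetical): 2 experts, 2 behaviours, 3 criteria.
behaviours = ["foraging", "resting"]
panel = [
    {"foraging": [5, 4, None], "resting": [2, 3, 2]},
    {"foraging": [5, 4, 5], "resting": [0, 3, 4]},
]

for b in behaviours:
    print(b, spread_across_criteria(panel, b, 3))
for c in range(3):
    print("criterion", c, disagreement_on_criterion(panel, behaviours, c))
```

With the real data this would be run per panel with 23 behaviours and 12 criteria.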

I have already generated a consolidated % score for each expert for each behaviour (combining their 12 ranks), averaged that within each panel (giving a panel-specific % for each behaviour), and carried out a Pearson's correlation to compare the two panels' consolidated scores (n = 22, r = 0.8361, p < 0.00001). I'm not certain that is the right way to go about this, and I am certain the data has more to reveal!

Any help would be much appreciated!