Choosing an inter-rater reliability measure when different sets of raters score items?

I have a set of 77 face stimuli, each of which has been rated on 10 Likert-type questions. The ratings on these questions can be aggregated into scores on 5 subscales, so I end up with 5 scores for each of the 77 stimuli from each rater. I would like to report the inter-rater reliability of each subscale score when averaged across raters, since these average scores will be used in further analyses.

However, I am struggling to identify the correct reliability measure, because the 77 stimuli have not all been scored by the same raters. The stimuli were split into four batches (of 18, 19, 20, and 20), and each batch was scored by a different group of raters, with a different number of raters per group (N = 26, 26, 30, and 30 respectively). This appears to rule out some straightforward measures. Is there a measure that would be suitable here? Or should I calculate inter-rater reliability separately for each of the four batches of stimuli and then average the four estimates? Any advice would be greatly appreciated; I don't have much experience with this type of analysis.
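To clarify what I mean by the per-batch option: my (possibly wrong) understanding is that because each batch has its own set of raters, a one-way random-effects model applies within each batch, so I could compute ICC(1,k) — the reliability of the k-rater average — batch by batch. Here is a minimal sketch of that computation; the batch data here are simulated purely for illustration, and the choice of ICC(1,k) is my assumption, not something I'm sure is correct.

```python
import numpy as np

def icc1_k(X):
    """One-way random-effects ICC for the average of k raters, ICC(1,k).

    X: (n_targets, k_raters) array of scores. The one-way model is
    assumed because raters are not crossed with targets across batches.
    """
    n, k = X.shape
    target_means = X.mean(axis=1)
    grand_mean = X.mean()
    # Between-targets mean square (variance of target means, scaled by k)
    msb = k * np.sum((target_means - grand_mean) ** 2) / (n - 1)
    # Within-targets mean square (disagreement among raters on each target)
    msw = np.sum((X - target_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / msb

# Simulated stand-ins for the four batches: true stimulus effects
# plus rater noise (batch sizes and rater counts match my design).
rng = np.random.default_rng(0)
sizes = [(18, 26), (19, 26), (20, 30), (20, 30)]
batches = [rng.normal(size=(n, 1)) + rng.normal(scale=0.5, size=(n, k))
           for n, k in sizes]

iccs = [icc1_k(b) for b in batches]
# One option: average the four estimates, weighted by batch size.
weights = np.array([n for n, _ in sizes])
pooled = np.average(iccs, weights=weights)
```

I'm unsure whether averaging (weighted or not) across batches is defensible, especially since the batches were rated by different numbers of raters, which is part of what I'm asking.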