Interrater reliability for many raters (20) and ordinal/nominal variables

I am trying to assess interrater reliability for different medical training groups. Each group consists of 20 raters (doctors), and each rater assesses the same 5 cases. Each case involves a series of 30 questions, which are either ordinal or nominal: the ordinal ones have 3 or 4 ranked levels, and the nominal ones are binary yes/no measures.

To assess interrater reliability for the ordinal measures I plan to use the intraclass correlation, specifically ICC(2,20), i.e. a two-way random-effects model with the average of all 20 raters as the unit of reliability.
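In case it clarifies what I mean by ICC(2,20), here is a rough pure-Python sketch of the Shrout & Fleiss ICC(2,k) as I understand it (average-measures reliability from the two-way ANOVA mean squares). The function name and the example ratings are my own made-up illustration, not data from the study; I believe SPSS's Reliability Analysis and R packages such as irr/psych compute this for you, so this is only to pin down the formula.

```python
# Hypothetical sketch of ICC(2,k): two-way random effects, average of k raters
# (Shrout & Fleiss convention). Illustrative only; not the study's data.

def icc_2k(ratings):
    """ratings: one list per subject, each containing one score per rater."""
    n = len(ratings)            # number of subjects
    k = len(ratings[0])         # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)               # mean square, subjects
    msc = ss_cols / (k - 1)               # mean square, raters
    mse = ss_err / ((n - 1) * (k - 1))    # mean square, error

    # ICC(2,k): reliability of the mean of the k raters' scores
    return (msr - mse) / (msr + (msc - mse) / n)
```

For example, three subjects scored by two raters who differ only by a constant offset, `icc_2k([[1, 2], [2, 3], [3, 4]])`, gives 0.8 under this formula.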

For the nominal/binary variables I am more confused. Should I be using Fleiss' kappa? (In which case I'd have to get more familiar with R, as I'm currently using SPSS.) I also considered averaging Cohen's kappa over all rater pairs, but with 20 raters that is 190 pairs, and I'm not the best coder.
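To make sure I understand what Fleiss' kappa would compute, here is a rough pure-Python sketch of the standard formula. The input is a count table (one row per case/question, one column per category, entries summing to the number of raters), and the function name and example numbers are mine for illustration; I believe R's irr package (`kappam.fleiss`) does this directly on a raw ratings matrix.

```python
# Hypothetical pure-Python sketch of Fleiss' kappa. Input is a count table:
# one row per subject, one column per category, each row summing to the
# (constant) number of raters. Illustrative only; not the study's data.

def fleiss_kappa(counts):
    n_subjects = len(counts)
    n_raters = sum(counts[0])           # raters per subject (assumed equal)
    total = n_subjects * n_raters       # total number of assignments

    # overall proportion of assignments falling in each category
    p_j = [sum(row[j] for row in counts) / total
           for j in range(len(counts[0]))]

    # per-subject observed agreement among the raters
    p_i = [(sum(c * c for c in row) - n_raters)
           / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_subjects       # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)
```

For example, with 4 raters and binary categories, perfect agreement such as `fleiss_kappa([[4, 0], [0, 4], [4, 0]])` gives 1.0, as expected.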

Can anyone confirm that I'm headed in the right direction? The ultimate goal is to assess whether a particular kind of training leads to more consistent grading.

Thanks a ton for any help/suggestions!