How to apply multi-level logistic regression in this context?

Dear All,
We've run a pretty standard psychological/cognitive experiment and received reviewer feedback asking for a different kind of analysis. We have no experience with this kind of analysis, and I can't get my head around how to apply it to our context. Some help to get started would be greatly appreciated!

What we have done (our experiment, described with simplified hypothetical example):
- We have a set of stimuli from 3 different categories, say houses, cars, and trees. These are real-life pictures of different exemplars (between 120 and 150 different images per category).
- We wanted to know whether participants are able to identify the images. So we presented one image at a time and offered three clickable answer buttons, one each for house, car, and tree. Our data therefore look like "Shown tree, responded car". We classified each response as either "correct" (the answer matches the shown category, here 'tree') or "incorrect" (the answer doesn't match the shown category, here 'car' or 'house').
- Because testing all ~400 pictures in a single session would take too long, we split the experiment into three 'runs' of ~120 pictures each. Each run used different pictures (so that all of them were rated in the end) and a different set of participants.
- Each participant rated between 35 and 50 pictures of each category
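
To make the structure of the raw data concrete, here is a minimal sketch (in Python, purely for illustration) of how I picture the trial-level data in "long" format, with one row per response. The column names, participant IDs, and image IDs are all made up for this example and are not taken from our actual files.

```python
import csv
import io

# Hypothetical column names; none of these come from our real data files.
FIELDS = ["participant", "run", "image", "shown", "response", "correct"]

# A few made-up trials: (participant, run, image, shown, response).
raw_trials = [
    ("p01", 1, "tree_017", "tree", "car"),
    ("p01", 1, "car_003", "car", "car"),
    ("p02", 1, "tree_017", "tree", "tree"),
    ("p47", 2, "house_101", "house", "house"),
]


def to_long_rows(trials):
    """Return one dict per trial, adding the binary outcome 'correct'."""
    rows = []
    for participant, run, image, shown, response in trials:
        rows.append({
            "participant": participant,
            "run": run,
            "image": image,
            "shown": shown,
            "response": response,
            # 1 if the clicked category matches the shown category, else 0.
            "correct": int(shown == response),
        })
    return rows


rows = to_long_rows(raw_trials)

# Write the rows as CSV text, i.e. the kind of table a stats package would read.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Each participant appears in exactly one run, and each image is seen by several participants within that run.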

What we have done (statistical analysis):
- Based on the classification of each answer as correct or incorrect, we report plain hit rates (mainly for visual presentation) and the so-called unbiased hit rates, which take differences in stimulus frequency and response biases into account (H. L. Wagner, 1993, On measuring performance in category judgment studies of nonverbal behavior, Journal of Nonverbal Behavior, 17, pp. 3-28).
- We find that people are indeed able to identify the category above chance level (chance level as calculated according to Wagner)
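
For readers who don't know the unbiased hit rate, here is a minimal sketch of the calculation for a single participant, as I understand Wagner (1993): the squared number of hits for a category, divided by the product of how often that category was shown and how often that response was chosen. The chance level is the product of the two corresponding proportions. The counts below are invented, and the function name is just a placeholder.

```python
CATEGORIES = ["house", "car", "tree"]

# Invented confusion counts for one participant: counts[shown][response].
counts = {
    "tree": {"tree": 30, "car": 5, "house": 5},
    "car": {"tree": 4, "car": 32, "house": 4},
    "house": {"tree": 6, "car": 3, "house": 31},
}


def unbiased_hit_rates(counts, categories):
    """Return {category: (unbiased hit rate, chance level)} for one participant."""
    total = sum(counts[s][r] for s in categories for r in categories)
    result = {}
    for cat in categories:
        hits = counts[cat][cat]
        n_shown = sum(counts[cat][r] for r in categories)   # row total
        n_chosen = sum(counts[s][cat] for s in categories)  # column total
        denom = n_shown * n_chosen
        # If a response was never chosen, treat the unbiased hit rate as 0.
        hu = hits ** 2 / denom if denom else 0.0
        # Chance: proportion of trials showing cat times proportion answering cat.
        chance = (n_shown / total) * (n_chosen / total)
        result[cat] = (hu, chance)
    return result


for cat, (hu, chance) in unbiased_hit_rates(counts, CATEGORIES).items():
    print(f"{cat}: Hu = {hu:.3f}, chance = {chance:.3f}")
```

This yields one number per participant and category, which is exactly the aggregation the reviewer objects to below.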

Now, the reviewer comments:
"Aggregating between 104 and 141 responses per participant into a single number throws away a lot of information, and assuming that hit rates are normally distributed is unwarranted, if only because these are bounded between 0 and 1. I would recommend performing multi-level logistic regression on individual responses per trial with the appropriate random effects for this analysis."

And then also: "In addition, the perceived emotion category (rather than its correct/incorrect recognition) can be modeled with multinomial regression, providing inference on confusion patterns, which are now presented as observed percentages only." I plan to open a separate thread on this one, so please comment on it only if you think it is directly related to the other point.

We have no experience with multi-level logistic regression. I have tried to read up on it, but I can't quite see how it maps onto our design. Any suggestion for a good read, or pointers on how to frame our design as a multi-level logistic regression, would be greatly appreciated. For example, it would help to see how the data would have to be entered into SPSS and which SPSS routines are the right ones to look into.
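
For what it's worth, my tentative reading so far, which may well be wrong, is a model along the following lines, with one random intercept per participant and one per image. The notation is my own and not the reviewer's: y_{pi} is 1 if participant p answered correctly for image i, c(i) is the category of image i, and one category would serve as the reference level.

```latex
\operatorname{logit} \Pr(y_{pi} = 1) = \beta_0 + \beta_{c(i)} + u_p + v_i,
\qquad u_p \sim \mathcal{N}(0, \sigma_u^2),
\quad v_i \sim \mathcal{N}(0, \sigma_v^2)
```

Whether these are the "appropriate random effects" the reviewer has in mind, and how the three runs should enter the model, is exactly what I am unsure about.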

Many thanks,