Do LLMs laugh at electric memes?

We know that LLMs have a pretty good “understanding” of the world (or can simulate one, if you don’t like the wording). Previous research has shown that transformer-style models outperform traditional cognitive models on many different tasks and even in modelling cognition across species. However, do they understand what humans experience when they look at memes?
What might sound entertaining or nonsensical at first is actually an interesting question: while pure sentence continuation by brute-force probability learning only trains a stochastic parrot (or so the rumor goes), it is not clear which higher-order functions emerge within this parrot. Is the parrot able to understand what goes on in an average human mind?
Enter this cool paper from 2017 by Cowen & Keltner: “Self-report captures 27 distinct categories of emotion bridged by continuous gradients”. Adding to the decades-old dispute over whether emotions are categorical or dimensional, they come up with something that could be paraphrased as “why not both” (obviously, their discussion of the subject is a bit more nuanced than that). Leaving aside this academic question, to me the neat thing is not the paper but the resulting dataset! They took ~2000 GIFs (short memes) and let several thousand people evaluate them using free-text, categorical or dimensional labels, then used PCA to extract a meaningful category-dimensional hybrid for each GIF. The result places each GIF in a 48-dimensional emotional hyperspace, giving us ground-truth values for how an average human will experience the GIF emotionally.
We have successfully used this dataset in a paradigm to elicit emotions in participants; it works much better than the overused IAPS. Plus it’s quite enjoyable for the participant: it was the first of our studies in which people recommended their friends to participate because it was so entertaining.
Explore the GIFs yourself using this map – it’s pretty fun!
[Example GIFs and three of the 48 ratings: 0.88 Amusement, 7.4 valence, 0.0 Anxiety, […] · 0.41 Fear, 1.2 valence, 0.25 Relief, […] · 0.25 Adoration, 6.4 valence, 0.08 Disappointment, […]]
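For those who want to work with the numbers directly, here is a minimal sketch of how one might load and inspect such a ratings table, assuming the per-GIF mean ratings have been exported to a CSV with one row per GIF and one column per label (the file name and column layout are hypothetical):

```python
import pandas as pd

# Hypothetical export: one row per GIF, one column per label
# (34 emotion categories rated 0-10 plus 14 affective dimensions rated 1-9).
ratings = pd.read_csv("cowen_keltner_gif_ratings.csv", index_col="gif_id")

print(ratings.shape)  # roughly (2000, 48)
print(ratings.loc[ratings.index[0], ["Amusement", "Anxiety", "valence"]])

# The GIFs an average viewer finds most amusing:
print(ratings.sort_values("Amusement", ascending=False).head())
```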
Can an LLM predict human emotions when seeing memes?
Coming back to the topic: can we use this dataset to test the understanding of an LLM? If we feed the videos to a multimodal LLM and give it the list of 48 labels, can it accurately predict how the average human would rate the GIF? In other words: can it model human emotions and understand what a human might feel when watching the GIF? This would be pretty amazing; as far as I know, there is no neural network out there that can do this on such a fine-grained label scale. It would also enable us to label new GIFs and stimuli!
To answer this question, I fed a random subset of 500 GIFs into Gemini 2.0, the only LLM that accepts video input and actually looks at the video frame by frame. I then queried it to predict the ratings of each GIF on the 48 labels; a sketch of the query loop follows the prompt below.
The full prompt:
Here’s a Video, please rate it on the following affective and categorical dimensions similar to how a human would perceive and rate the scenes
These are the category ratings, rated from 0 to 10
Admiration,Adoration,Aesthetic Appreciation,Amusement,Anger,Anxiety,Awe,Awkwardness,Boredom,Calmness
Confusion,Contempt,Craving,Disappointment,Disgust,Empathic Pain,Entrancement,Envy,Excitement,Fear
Guilt,Horror,Interest,Joy,Nostalgia,Pride,Relief,Romance,Sadness,Satisfaction,Sexual Desire,Surprise, Sympathy,Triumph,
these are the affective dimensions, rated from 1 to 9
approach,arousal,attention,certainty,commitment,control,dominance,effort,fairness,identity,obstruction,safety,upswing,valence
- Your output should be in JSON dictionary, with an key for each dimension and the score as a value, e.g. {‘Excitement’: 4, …}
- Only output the dictionary in JSON format, nothing else.
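For reference, here is a minimal sketch of the query loop using the google-generativeai Python SDK. The model name, file handling, and JSON cleanup are illustrative and may differ from the exact setup used for the 500 GIFs.

```python
import json
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

PROMPT = "..."  # the full prompt shown above, asking for a JSON dict of all 48 scores


def rate_gif(path: str) -> dict:
    # Upload the video and wait until the File API has finished processing it.
    video = genai.upload_file(path=path)
    while video.state.name == "PROCESSING":
        time.sleep(2)
        video = genai.get_file(video.name)

    # Ask for the 48 ratings and parse the JSON-only reply,
    # stripping a possible markdown code fence around it.
    response = model.generate_content([video, PROMPT])
    text = response.text.strip().removeprefix("```json").removesuffix("```")
    return json.loads(text)


scores = rate_gif("gifs/example.mp4")
print(scores["Amusement"], scores["valence"])
```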
LLMs predict human affect surprisingly well
To my surprise, Gemini 1.5 Pro predicted the emotions humans would experience from a GIF pretty well! On average, we get an r-value of ~0.44, ranging from 0.87 (for sexual arousal) down to -0.05 (for effort). Almost all correlation values were positive. There seems to be a general trend: negative emotions are easier to predict than positive ones, and valence (whether the GIF feels positive or not) is in the top 5. I’m quite surprised that arousal is so low (~0.3), as it is one of the most used scales and simply captures how “activating” a GIF is, which seems more trivial than rating something obscure like “control”.
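Concretely, “how well” here is just the Pearson correlation between the model’s scores and the human mean ratings, computed separately for each of the 48 labels across GIFs. A small sketch, assuming two aligned DataFrames `human` and `predicted` with one row per GIF and one column per label (variable names are hypothetical):

```python
import pandas as pd
from scipy.stats import pearsonr


def per_label_correlations(human: pd.DataFrame, predicted: pd.DataFrame) -> pd.Series:
    """Pearson r between human and model ratings, computed per label across GIFs."""
    return pd.Series({
        label: pearsonr(human[label], predicted.loc[human.index, label])[0]
        for label in human.columns
    }).sort_values(ascending=False)


# r = per_label_correlations(human, predicted)
# r.mean()   # ~0.44 overall in this run
# r.head()   # the best-predicted labels
```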

Categories are easier to predict than dimensions
In psychology, there is a long-standing debate over whether emotions are categorical or dimensional. Both approaches make valid points; however, this dataset shows the limitations of the debate. The categorical ratings (indicated by capitalized labels) are on average much easier to predict than the dimensional scales. Which is not surprising, as many of the dimensional scales barely apply to GIFs: how “dominant” would you rate a video, or how much “identity” does it have? Would you use those words to describe how a movie made you feel? It makes little sense.
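With the per-label correlations from the sketch above, the category-versus-dimension gap boils down to comparing the mean r of the two label groups from the prompt. A small sketch, assuming `r` is the Series of per-label correlations computed earlier:

```python
import pandas as pd

# The 14 affective dimensions from the prompt (rated 1-9);
# every other label is an emotion category (rated 0-10).
DIMENSIONS = [
    "approach", "arousal", "attention", "certainty", "commitment", "control",
    "dominance", "effort", "fairness", "identity", "obstruction", "safety",
    "upswing", "valence",
]


def category_vs_dimension(r: pd.Series) -> tuple[float, float]:
    """Mean correlation for categorical labels vs. dimensional labels."""
    return r.drop(labels=DIMENSIONS).mean(), r[DIMENSIONS].mean()


# cat_r, dim_r = category_vs_dimension(r)  # categories come out clearly ahead
```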
A new LLM-Benchmark?
While some labels are much easier to predict than others, the average performance of 0.44 can surely be improved. This should lend itself nicely to benchmarking multimodal LLMs on their ability to understand human affect! Any ASI should be able to ace this. Nevertheless, one should keep in mind that all of this hinges on the quality of the vision model that is used. Most LLMs are trained on text and vision separately, and the vision model is then connected to the language model using various techniques. So if the vision model sucks, the LLM output cannot make up for that.
Additionally, performance is limited by how the GIF is actually processed: it turns out that 4o reduces the frame rate, while Gemini actually ingests the GIF frame by frame. The latter could also serve as a frame-by-frame annotation and make it much easier to analyse underlying neurophysiological recordings together with time-specific labels.
Individual ratings
Last but not least, here are the individual ratings. Have fun exploring!