Publications

Explainable Classification of Internet Memes

Abstract

Nowadays, the integrity of online conversations is faced with a variety of threats, ranging from hateful content to manufactured media. In such a context, Internet Memes make the scalable automation of moderation interventions increasingly challenging, given their inherently complex and multimodal nature. Existing work on Internet Meme classification has focused on black-box methods that do not explicitly consider the semantics of the memes or the context of their creation. This paper proposes a modular and explainable architecture for Internet Meme classification and understanding. We design and implement multimodal classification methods that perform example- and prototype-based reasoning over training cases, while leveraging state-of-the-art textual and visual models to represent the individual cases. We study the relevance of our modular and explainable models in detecting harmful memes on two existing tasks: Hate Speech Detection and Misogyny Classification. We compare the performance between example- and prototype-based methods, and between text, vision, and multimodal models, across different categories of harmfulness (e.g., stereotype and objectification). We devise a user-friendly interface that facilitates the comparative analysis of examples retrieved by all of our models for any given meme, informing the community about the strengths and limitations of these explainable methods.

Date
November 30, 2025
Authors
Abhinav Kumar Thakur, Filip Ilievski, Hông Ân Sandlin, Zhivar Sourati, Luca Luceri, Riccardo Tommasini, Alain Mermoud
Conference
17th International Workshop on Neural-Symbolic Learning and Reasoning, NeSy 2023