NLP Insight Filter [June 12 2024]
What NLP people are talking about this week on Twitter/X
Tl;dr: Two evaluation/dataset papers this week. The first looks at errors in the MMLU benchmark. MMLU is probably among the top three most-used NLU benchmarks, so finding errors in it is sort of a big deal. The paper then proposes a methodology for removing these errors. Also, you’ve probably heard about the Chatbot Arena, a benchmark for evaluating generative chatbots. Now there’s GenAI Arena, which is basically Chatbot Arena but for multimodal models.
FYI: My ‘popularity emoji’ is based on aggregate statistics of how many people have engaged with a paper on Twitter/X (as well as my own subjective personal interest).
Very popular (you really should know about this): 🔥
Popular (a good amount of people are discussing this): 😄
Less popular (but still worth making a mental note): 🙂
Are We Done with MMLU?
Popularity: 😄
Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation at this https URL.
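To make the "significant discrepancies" point concrete, here is a minimal sketch of re-scoring the same cached predictions against the original MMLU answer key versus the MMLU-Redux re-annotations. The Hugging Face dataset id, config name, field names, and the toy predictions dict are assumptions for illustration, not details confirmed by the abstract; the interesting number is the gap between the two printed accuracies.

```python
# A minimal sketch: the same cached predictions can score differently depending on
# whether you use the original MMLU answer key or the MMLU-Redux re-annotations.
# Assumptions (not stated in the abstract): the Hub dataset id
# "edinburgh-dawg/mmlu-redux", the config "virology", and the field names
# "answer" (original key) and "corrected_answer" (re-annotated key).
# `predictions` is a toy stand-in for your own model's choice indices.
from datasets import load_dataset


def accuracy(predictions: dict[int, int], answer_key: list[int]) -> float:
    """Fraction of answered questions whose prediction matches the given key."""
    hits = sum(int(predictions[i] == k) for i, k in enumerate(answer_key) if i in predictions)
    return hits / max(len(predictions), 1)


ds = load_dataset("edinburgh-dawg/mmlu-redux", "virology", split="test")
predictions = {0: 2, 1: 0, 2: 1}  # question index -> predicted choice index (toy values)

original_key = [ex["answer"] for ex in ds]
redux_key = [
    ex["corrected_answer"] if ex.get("corrected_answer") is not None else ex["answer"]
    for ex in ds
]

print("accuracy vs. original MMLU key:", accuracy(predictions, original_key))
print("accuracy vs. MMLU-Redux key:   ", accuracy(predictions, redux_key))
```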
GenAI Arena: An Open Evaluation Platform for Generative Models
Popularity: 🙂
Generative AI has made remarkable strides in revolutionizing fields such as image and video generation. These advancements are driven by innovative algorithms, architectures, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, and FVD often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes an open platform, GenAI-Arena, to evaluate different image and video generative models, where users can actively participate in evaluating these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three arenas: text-to-image generation, text-to-video generation, and image editing. Currently, we cover a total of 27 open-source generative models. GenAI-Arena has been operating for four months, amassing over 6000 votes from the community. We describe our platform, analyze the data, and explain the statistical methods for ranking the models. To further promote research in building model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt existing multimodal models such as Gemini and GPT-4o to mimic human voting, and we compute the correlation between model votes and human votes to understand their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content: even the best model, GPT-4o, achieves a Pearson correlation of only 0.22 on the quality subscore and behaves like random guessing on the others.
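If you are curious how arena votes turn into a leaderboard, here is a small illustrative sketch: online Elo updates over pairwise votes (the Chatbot-Arena-style approach; the abstract does not spell out GenAI-Arena's exact ranking method), plus the Pearson correlation used to score a model judge against human votes. The K-factor, starting rating, model names, votes, and score arrays are all made up.

```python
# Illustrative sketch only: Elo-style ranking from pairwise votes, and the
# Pearson correlation used to compare a model judge with human voters.
# Nothing here is taken from the paper's implementation.
from collections import defaultdict

import numpy as np

K = 32  # Elo update step (a common default, not taken from the paper)


def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_ratings(votes: list[tuple[str, str, float]]) -> dict[str, float]:
    """votes are (model_a, model_b, outcome): 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    ratings: dict[str, float] = defaultdict(lambda: 1000.0)
    for a, b, outcome in votes:
        e_a = expected_score(ratings[a], ratings[b])
        ratings[a] += K * (outcome - e_a)
        ratings[b] += K * ((1.0 - outcome) - (1.0 - e_a))
    return dict(ratings)


toy_votes = [("model-a", "model-b", 1.0), ("model-b", "model-c", 1.0), ("model-c", "model-a", 0.5)]
leaderboard = sorted(elo_ratings(toy_votes).items(), key=lambda kv: -kv[1])
print(leaderboard)

# Judging ability: Pearson correlation between a model judge's quality scores and
# human quality scores on the same items (toy numbers; the paper reports 0.22 for
# GPT-4o on the quality subscore).
human_scores = np.array([0.9, 0.4, 0.7, 0.2, 0.6])
judge_scores = np.array([0.6, 0.5, 0.8, 0.4, 0.5])
print(float(np.corrcoef(human_scores, judge_scores)[0, 1]))
```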

