Datasets

This page lists the datasets that have come out of the project work, including the respective links and references.

You may also want to check out our presence on Zenodo, where we also list datasets, or the Data Management Plan.

"VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias" Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos & Panagiotis Petrantonakis

January 08, 2024

VERITE (VERification of Image-TExt pairs) is an annotated evaluation benchmark for multimodal (image-caption) misinformation detection that accounts for unimodal biases.

Show dataset

"Synthbuster: Towards Detection of Diffusion Model Generated Images" Quentin Bammey

November 02, 2023

Dataset of 9,000 AI-generated images, described in the paper “Synthbuster: Towards Detection of Diffusion Model Generated Images” (Quentin Bammey, 2023, Open Journal of Signal Processing).

Show dataset

Show related article

"An Open Dataset of Synthetic Speech" Artem Yaroshchuk, Christoforos Papastergiopoulos & Luca Cuccovillo

September 29, 2023

ODSS is a multilingual, multispeaker dataset of synthetic and natural speech, designed to foster research and benchmarking of novel studies on synthetic speech detection.

ODSS comprises audio utterances generated from text by state-of-the-art synthesis methods, paired with their corresponding natural counterparts. The synthetic audio data includes several languages, with an equal representation of genders.

Natural and synthetic speech audio files within ODSS are released under the CC-BY-SA 4.0 license. Usage, extension, and redistribution by the research community are strongly encouraged.

Show dataset

Show related article

"IDMT Audio Phylogeny Dataset" Milica Gerhardt & Luca Cuccovillo

September 26, 2023

The IDMT Audio Phylogeny Dataset contains audio phylogeny trees for the evaluation of audio phylogeny algorithms. It includes two different sets of phylogeny trees with 60 trees each, where every tree contains 20 nodes (audio files). The main difference between the two sets is the set of transformations T used to create the near-duplicates in each set.

This dataset accompanies the following publication. If you use it, please cite:

M. Gerhardt, L. Cuccovillo, P. Aichroth, “Advancing Audio Phylogeny: A Neural Network Approach for Transformation Detection”, in IEEE International Workshop on Information Forensics and Security, in press.

Show dataset

"PolyMeme" Vasileios Arailopoulos, Christos Koutlis, Symeon Papadopoulos & Panagiotis Petrantonakis

September 18, 2023

Internet memes have emerged as a dominant new form of mass media and communication. They often consist of images that combine text with visuals and aim to express humor, irony, or sarcasm, or sometimes to convey hatred and misinformation. Recognizing them helps to characterize trends in today's culture and to avoid issues related to the spread of hateful and harmful content.

Memes take various forms, which can be split into several categories. Existing datasets typically do not recognize these categories and do not provide a set of images of significant size and diversity, so we created one that satisfies those requirements. The collection was gathered from Reddit and semi-automatically labelled. More precisely, the dataset contains ~27k Internet image memes categorized as "Image Macro", "Object Labeling", "Screenshots" and "Text out of Image".

Show dataset

Code made available - you can get your hands on it

Much of vera.ai centers around data and code: working with data, analysing it, experimenting, and more. In October 2024 we started to bring together and make available all code that a) was created in vera.ai and b) is used by vera.ai members, in one central destination on the project website. This includes direct links to the respective repositories, in most cases on GitHub.

Code, Datasets (October 15, 2024)

ODSS: An Open Dataset of Synthetic Speech. Call to use and cooperate

A team of vera.ai researchers from the Fraunhofer Institute for Digital Media Technology (IDMT), Germany, and the Centre for Research and Technology Hellas (CERTH-ITI), Greece, has developed a new synthetic speech detection dataset. Find out why they did it and what the dataset contains.

Datasets (March 12, 2024)

Overcoming Unimodal Bias in Multimodal Misinformation Detection

This post explains the basics behind a paper entitled “VERITE: a robust benchmark for multimodal misinformation detection accounting for unimodal bias”, published in the International Journal of Multimedia Information Retrieval (IJMIR). It has been authored by researchers of the mever group at project partner CERTH-ITI.

Datasets (January 19, 2024)

The persistence and resilience of misinformation in the face of fact checking

A project by vera.ai, entitled “The Persistence and Resilience of Misinformation in the Face of Fact Checking,” was facilitated by Fabio Giglietto, Massimo Terenzi (UNIURB) and Richard Rogers (UvA) at the Digital Methods Initiative Winter School at the University of Amsterdam. We are pleased to share the results with you.

Datasets, Meet us @ (January 17, 2024)

Project partner KInIT at the EMNLP conference in Singapore represented with three papers

The EMNLP (Empirical Methods in Natural Language Processing) conference is one of the top NLP conferences. KInIT researchers Róbert Móro and Ján Čegiň presented three full papers at this prestigious event. Here’s more about what they did, and the related works.

Meet us @, Datasets (January 16, 2024)

Unlocking Insights: paper "Cracking Open the European Newsfeed" is finally out in JQD:DM

A new paper with recent findings from disinformation analysis has been published in the Journal of Quantitative Description: Digital Media. It is entitled “Cracking Open the European Newsfeed” and is co-authored by vera.ai team members based at the University of Urbino “Carlo Bo”, Italy. The paper contributes to the ongoing effort to describe and quantify the quality of information shared on large social media platforms.

Further Material, Datasets (December 21, 2023)

Synthbuster: Towards Detection of Diffusion Model Generated Images

Dataset of 9,000 AI-generated images described in the paper "Synthbuster: Towards Detection of Diffusion Model Generated Images" (Quentin Bammey, 2023, Open Journal of Signal Processing).

Datasets (November 02, 2023)

OpenAI Models for Topic Modelling in Social Media Analysis

Here's a video recording of a talk by Fabio Giglietto of the University of Urbino, given in June 2023 at the Queensland University of Technology (QUT). Topic: using OpenAI models to identify the most salient topics circulated via Facebook links in the run-up to the Italian general elections.

Meet us @, Demos and Trainings, Datasets (July 07, 2023)

Mapping the ‘memory loss’ of disinformation in fact-checks

In this piece we present findings from a study on current fact-checking archiving practices and Facebook post removals using the “War in Ukraine” dataset of the European Digital Media Observatory (EDMO). Insights presented here come from a project that was carried out as part of the Digital Methods Initiative Winter School and Data Sprint 2023.

Demos and Trainings, Datasets, Meet us @ (January 26, 2023)

Mapping post-truth spaces concerning the war in Ukraine

Here's another project that was carried out during the 2023 Winter School organised by the University of Amsterdam’s Digital Methods Initiative (DMI). Among other things, this team, coordinated by the hosts themselves, tested a “detection method” intended to let fact-checkers detect “post-truth spaces” faster and verify actors that spread (in this case pro-Kremlin) propaganda on social networks.

Meet us @, Demos and Trainings, Datasets (January 23, 2023)

vera.ai is co-funded by the European Commission under grant agreement ID 101070093, and the UK and Swiss authorities. This website reflects the views of the vera.ai consortium and respective contributors. The EU cannot be held responsible for any use which may be made of the information contained herein.
