Datasets 

This page lists the datasets produced by the project, with links and references for each.

You may also want to check our presence on Zenodo, where the datasets are also listed, or the Data Management Plan.

"VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias" Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos & Panagiotis Petrantonakis
January 08, 2024

VERITE (VERification of Image-TExt pairs) is an annotated evaluation benchmark for multimodal (image-caption) misinformation detection that accounts for unimodal biases.

Show dataset

"Synthbuster: Towards Detection of Diffusion Model Generated Images" Quentin Bammey
November 02, 2023

Dataset of 9,000 AI-generated images, described in the paper “Synthbuster: Towards Detection of Diffusion Model Generated Images” (Quentin Bammey, 2023, Open Journal of Signal Processing).

Show dataset

Show related article

"An Open Dataset of Synthetic Speech" Artem Yaroshchuk, Christoforos Papastergiopoulos & Luca Cuccovillo
September 29, 2023

ODSS is a multilingual, multispeaker dataset of synthetic and natural speech, designed to foster research and benchmarking of novel studies on synthetic speech detection. 

ODSS comprises audio utterances generated from text by state-of-the-art synthesis methods, paired with their corresponding natural counterparts. The synthetic audio data includes several languages, with an equal representation of genders.

Natural and synthetic speech audio files within ODSS are released under the CC-BY-SA 4.0 license. Usage, extension, and redistribution by the research community are strongly encouraged.

Show dataset

Show related article

"IDMT Audio Phylogeny Dataset" Milica Gerhardt & Luca Cuccovillo
September 26, 2023

The IDMT Audio Phylogeny Dataset contains audio phylogeny trees for the evaluation of audio phylogeny algorithms. It includes two different sets of phylogeny trees with 60 trees each, where every tree contains 20 nodes (audio files). The two sets differ mainly in the set of transformations T used to create the near-duplicates.

This dataset accompanies the following publication. If you use it, please cite:

M. Gerhardt, L. Cuccovillo, P. Aichroth, “Advancing Audio Phylogeny: A Neural Network Approach for Transformation Detection”, in IEEE International Workshop on Information Forensics and Security, in press.

Show dataset

"PolyMeme" Vasileios Arailopoulos, Christos Koutlis, Symeon Papadopoulos & Panagiotis Petrantonakis
September 18, 2023

Internet memes have emerged as a dominant new form of mass media and communication. They often consist of images that combine text with visual content and aim to express humor, irony, or sarcasm, or sometimes to convey hatred and misinformation. Recognizing them helps characterize trends in today's culture and avoid issues related to the spread of hateful and harmful content.

Memes take various forms, which can be grouped into several categories. Existing datasets typically do not distinguish these categories and do not offer image collections of significant size and diversity, so we created one that meets both requirements. The collection was gathered from Reddit and labelled semi-automatically. The dataset contains ~27k Internet image memes categorized as "Image Macro", "Object Labeling", "Screenshots" and "Text out of Image".

Show dataset

vera.ai is co-funded by the European Commission under grant agreement ID 101070093, and the UK and Swiss authorities. This website reflects the views of the vera.ai consortium and respective contributors. The EU cannot be held responsible for any use which may be made of the information contained herein.