Here you can find a list of the datasets produced by the project, including their links and references.
You may also want to visit our presence on Zenodo, where the datasets are listed as well, or consult the Data Management Plan.
VERITE (VERification of Image-TExt pairs) is an annotated evaluation benchmark for multimodal (image-caption) misinformation detection that accounts for unimodal biases.
Dataset of 9,000 AI-generated images, described in the paper “Synthbuster: Towards Detection of Diffusion Model Generated Images” (Quentin Bammey, 2023, Open Journal of Signal Processing).
ODSS is a multilingual, multispeaker dataset of synthetic and natural speech, designed to foster research and benchmarking of novel studies on synthetic speech detection.
ODSS comprises audio utterances generated from text by state-of-the-art synthesis methods, paired with their corresponding natural counterparts. The synthetic audio data includes several languages, with an equal representation of genders.
Natural and synthetic speech audio files within ODSS are released under the CC-BY-SA 4.0 license. Usage, extension, and redistribution by the research community are strongly encouraged.
The IDMT Audio Phylogeny Dataset contains audio phylogeny trees for evaluation of audio phylogeny algorithms. It includes two different sets of phylogeny trees with 60 trees each, where every tree contains 20 nodes (audio files). The main difference between these two sets is in the set of transformations T used to create near duplicates in the set.
This dataset accompanies the following publication. If you use it, please cite:
M. Gerhardt, L. Cuccovillo, P. Aichroth, “Advancing Audio Phylogeny: A Neural Network Approach for Transformation Detection”, in IEEE International Workshop on Information Forensics and Security, in press.
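For orientation, the sizes stated for the IDMT Audio Phylogeny Dataset can be turned into rough node counts. The short sketch below is only arithmetic on the figures given in the description. The variable names are illustrative, and it assumes that no audio file is shared between trees, which the description does not state.

```python
# Figures taken from the dataset description.
NUM_SETS = 2          # two sets of phylogeny trees
TREES_PER_SET = 60    # 60 trees in each set
NODES_PER_TREE = 20   # every tree contains 20 nodes (audio files)

# Assumption: no node (audio file) is shared between trees.
nodes_per_set = TREES_PER_SET * NODES_PER_TREE
total_nodes = NUM_SETS * nodes_per_set

# A tree with n nodes always has n - 1 parent-child edges.
edges_per_tree = NODES_PER_TREE - 1

print(nodes_per_set, total_nodes, edges_per_tree)
```

Under that assumption, each set holds 1,200 nodes and the whole dataset 2,400, with 19 parent-child relations to recover per tree.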
Internet memes have emerged as a dominant new form of mass media and communication. They often consist of images combined with text and aim to express humor, irony, or sarcasm, or sometimes to convey hatred and misinformation. By recognizing them, we can characterize trends in today's culture and help limit the spread of hateful and harmful content.
Memes can take various forms, which can be split into several categories. Existing datasets typically do not recognize these categories and do not provide a set of images of significant size and diversity, so we created one that satisfies those requirements. The collection is gathered from Reddit and is semi-automatically labelled. More precisely, the dataset contains ~27k Internet image memes categorized as "Image Macro", "Object Labeling", "Screenshots" and "Text out of Image".