In an era dominated by the influence of platforms, access to comprehensive data has become paramount for researchers seeking to analyse the dynamics of online content. Tailored to the needs emerging from the vera.ai project, this contribution unveils a series of policy recommendations aimed at enhancing data access tools, streamlining application processes, and promoting transparency in the use of social media data for academic and investigative purposes.
To analyse the transmission and amplification of social media content, researchers need access to a dedicated API endpoint that allows them to search and retrieve content based on specified parameters and search terms. Moreover, in line with the data access needs mandated by the Digital Services Act (DSA), a repository of archived takedowns, removed ads, and any other relevant removed content would enable researchers to improve their investigations.
The tools that the vera.ai project is planning to develop include a customisable search functionality, including the ability to include or exclude specific accounts. Each search query should return posts with accompanying metadata, including reach, content impact indicators, views, exposure, and interaction statistics at the time of the query. Ideally, a history of interactions since publication should also be accessible. Furthermore, researchers should have the option to sort results by criteria such as publishing date or popularity, and to access original multimedia content, such as images and videos, in its unaltered format.
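To make these requirements concrete, the following sketch shows what such a search request could look like from a researcher's side. It is purely illustrative: the endpoint URL, parameter names, and field names are assumptions, as no platform currently exposes this exact interface.

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- illustrative only, not an existing platform API.
BASE_URL = "https://api.example-platform.eu/v1/content/search"

def build_search_query(terms, include_accounts=None, exclude_accounts=None,
                       sort_by="date", since=None):
    """Assemble the query string for a hypothetical content-search request.

    All parameter names (q, include_accounts, exclude_accounts, sort, since)
    are assumed for illustration.
    """
    params = {"q": " ".join(terms), "sort": sort_by}
    if include_accounts:
        params["include_accounts"] = ",".join(include_accounts)
    if exclude_accounts:
        params["exclude_accounts"] = ",".join(exclude_accounts)
    if since:
        params["since"] = since  # ISO 8601 date limiting the search window
    return f"{BASE_URL}?{urlencode(params)}"

# Example: search two terms, exclude one account, sort by popularity.
url = build_search_query(["election", "fraud"],
                         exclude_accounts=["@official_channel"],
                         sort_by="popularity",
                         since="2024-01-01")
print(url)
```

Each returned post would then carry the metadata listed above (views, exposure, interaction statistics) alongside the content itself.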
The project actively monitors developments in the area of tracking trending content across platforms in real time. Important needs in this area include providing a list of the most popular platform-wide content and narratives trending in each EU Member State. Additionally, data on how content recommendations and search results are generated algorithmically should be programmatically accessible, but access should be strictly restricted to vetted researchers to prevent misuse. Knowing if, why, and how a piece of content has been recommended meets the requirements posed by Article 27 of the DSA on recommender system transparency.
To improve platform accountability and identify repeat offenders (DSA Article 23), and to conduct systemic risk assessments (Article 34), researchers should be able to create, update and follow lists of public accounts and venues (e.g., groups, channels) to track their public posts and the performance of those posts.
Researchers should be able to use the provided tools to analyse the volume of content related to specific events or topics, such as elections or migratory crises, during information operations. This analysis should also extend to understanding the demographics (e.g., gender, age, location) of users exposed to this content and measuring the impact of these information operations on users. This will draw from existing frameworks, such as the “breakout scale” for information operations, the “DISARM framework” for Foreign Information Manipulation and Interference (FIMI), and the “impact-risk index” for single hoaxes.
The application process for data access (as defined by Article 40 of the DSA) should be streamlined and focus on key aspects such as the research question or hypothesis to be studied, the researcher’s prior experience, and data protection mechanisms.
Researchers involved in projects related to disinformation, terrorism, extremism, and CSAM (Child Sexual Abuse Material) should be given strong priority.
In addition to academic researchers, valid applicants should also include non-academic researchers from think tanks and research institutes, members of EU-funded research consortia on disinformation and AI (e.g., the various EDMO hubs), fact-checkers, and investigative journalists.
To expedite the process, Digital Services Coordinators (DSCs) should monitor the vetting process and ensure it is completed within a reasonable timeframe.
A researcher vetting process based on references from peers at independent institutions should be considered to ensure the credibility of researchers applying for data access. For instance, all partners of EDMO hubs and EC-funded projects should be automatically vetted, taking into account the thorough process guiding their selection and activities.
There is a need to clearly define data protection measures, including the signing of appropriate agreements and non-disclosure agreements (NDAs) either with the social platforms themselves or with an independent intermediary body which is in charge of vetting researchers and providing them with privacy-preserving access to data.
In line with General Data Protection Regulation (GDPR) requirements, clear ethical guidelines and criteria for data retention and deletion must be respected.
Platforms may request to review research results before publication, but this review should be limited to specific issues related to user privacy infringement.
To ensure that data gathered under Article 40 is used for the purposes envisaged and to minimise the risk of abuses, a carefully planned data access and management plan should be mandatory.
In line with Article 40(13), establishing an independent advisory body or mechanism could help oversee the data access request process. This mechanism can ensure impartiality and transparency in the evaluation of requests and the protection of user and business rights.
Moreover, establishing a feedback mechanism for researchers to report any issues or concerns they encounter while using the data accessed can help improve processes over time.
Similarly, another proposal is to implement regular audits of data access practices and policies to ensure compliance and identify areas for improvement.
For data access interfaces, simple and standard REST APIs are preferred, along with the ability to export data in common formats such as database dumps, CSV, or JSON files.
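As a minimal sketch of why these formats matter in practice, the snippet below converts a set of JSON post records into CSV using only the Python standard library. The field names are assumptions for illustration, not an actual platform schema; the point is that both formats are trivially machine-readable.

```python
import csv
import io
import json

# Illustrative records, as a hypothetical export endpoint might return them.
# Field names (post_id, views, interactions, published) are assumptions.
records_json = '''[
  {"post_id": "p1", "views": 1200, "interactions": 85, "published": "2024-03-01"},
  {"post_id": "p2", "views": 340, "interactions": 12, "published": "2024-03-02"}
]'''

records = json.loads(records_json)

# Re-serialise the same records as CSV, the other common export format.
buf = io.StringIO()
writer = csv.DictWriter(buf,
                        fieldnames=["post_id", "views", "interactions", "published"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

Because both formats map directly onto tabular tooling, researchers can move between them without platform-specific parsers.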
Data formats should be simple to analyse automatically, especially by vetted researchers who do not have a strong technical background. This can be achieved in two ways: firstly, by offering a graphical user interface for researchers without programming skills, and secondly, by providing comprehensive documentation for those with programming skills to develop wrappers for major programming languages (and possibly GUIs).
It would be beneficial to grant vetted researchers access to appropriate high-performance cloud computing resources for large-scale projects, ensuring data remains secure and controlled.
Moreover, the needs of researchers with disabilities need to be considered to ensure inclusivity.
Establishing a common and precise language is desirable, possibly through a standard data dictionary and business glossary, to facilitate communication among Digital Services Coordinators (DSCs), vetted researchers, Very Large Online Platforms (VLOPs), and Very Large Online Search Engines (VLOSEs) without adding unnecessary complexity. Moreover, this standardisation would facilitate meta- and comparative analyses aimed at building further knowledge from the information available.
Certain mechanisms could be enforced to facilitate access to data for researchers meeting the conditions set out in Article 40(12), which include fulfilling specific security and confidentiality requirements in line with the GDPR. Although the GDPR poses some legitimate concerns in this scenario, it should not be used as an excuse to hinder DSA-mandated data access. In addition, the matter will be addressed in the upcoming delegated acts on data access, laying down the technical conditions for data sharing.
A daily updated list of the 100 most popular platform-wide content and narratives circulating in each EU Member State would help researchers focus their efforts. The creation and availability of data dumps would simplify access to this data.
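A minimal sketch of how such a daily dump could be consumed is shown below. The dump structure and field names are assumptions, since no standard format for this data exists yet; the example simply groups the ranked entries by Member State for per-country analysis.

```python
import json
from collections import defaultdict

# Hypothetical shape of a daily "top content" dump -- structure and field
# names are assumptions for illustration, not an existing standard.
dump = json.loads('''{
  "date": "2024-05-01",
  "entries": [
    {"member_state": "DE", "rank": 1, "narrative": "narrative-a", "views": 500000},
    {"member_state": "DE", "rank": 2, "narrative": "narrative-b", "views": 410000},
    {"member_state": "FR", "rank": 1, "narrative": "narrative-a", "views": 620000}
  ]
}''')

# Group the ranked narratives by Member State, preserving rank order.
by_state = defaultdict(list)
for entry in sorted(dump["entries"], key=lambda e: e["rank"]):
    by_state[entry["member_state"]].append(entry["narrative"])

print(dict(by_state))
```

Grouping the daily dump this way would let researchers immediately compare which narratives trend across Member States, e.g. spot that the same narrative ranks first in several countries on the same day.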
Providing limited access to certain data for public interest purposes could also be considered, such as monitoring the spread of disinformation during critical events like elections or public health crises, as it can help mitigate potential risks to society.
These policy recommendations aim to strike a balance between enabling valuable research and protecting user privacy and business interests, ultimately promoting transparency and accountability in the use of social media data for academic and investigative purposes.
Authors: Alexis Gizikis, Maria Giovanna Sessa and Oleh Shchuryk (all EU DisinfoLab), with inputs from vera.ai project partners
Editors: Joanna Wright and Kalina Bontcheva (University of Sheffield) & Jochen Spangenberg (DW)