The LAION-5B dataset, a key resource for training AI image generators like Stable Diffusion and Imagen, has been withdrawn by its creators. This action follows a report revealing the dataset contained numerous instances of potential child sexual abuse material (CSAM).
LAION, the Large-scale Artificial Intelligence Open Network, is a German nonprofit that provides open-source AI models and datasets. These resources have been instrumental in developing several prominent text-to-image models. However, a December 20 report from the Stanford Internet Observatory, part of the university's Cyber Policy Center, identified 3,226 suspected instances of CSAM in the LAION-5B dataset. David Thiel, the Observatory's big data architect and chief technologist, said that much of this material was verified as CSAM by external parties.
Thiel noted that the presence of CSAM in the dataset does not necessarily mean it significantly alters the output of models trained on it, though he acknowledged it could have some impact. He also emphasized the harm caused by repeated instances of the same CSAM, which perpetuate the images of specific victims.
Released in March 2022, the LAION-5B dataset contains 5.85 billion image-text pairs. In response to these findings, LAION announced the removal of both the LAION-5B and the LAION-400M datasets. The organization stated this was a precautionary measure to ensure the datasets’ safety before they are republished.