Leaked Documents Reveal NVIDIA’s Secret AI Training Practices
NVIDIA has used videos from YouTube and other sources to train its AI products, as revealed by internal communications and documents obtained by 404 Media.
Addressing the legal and ethical aspects of using copyrighted content to train AI models, NVIDIA stated that its actions fully comply with copyright law. Internal conversations among NVIDIA employees show that when workers raised concerns about potential legal issues, managers assured them that the company’s top executives had approved the use of this data for AI model training.
A former NVIDIA employee reported that workers were asked to download videos from Netflix, YouTube, and other sources to train AI models for products such as the Omniverse 3D world generator, autonomous driving systems, and digital humans. The internal project behind this effort, named Cosmos, has yet to be unveiled to the public.
The goal of Cosmos was to create an advanced video generation model capable of simulating light, physics, and intelligence in one place, so that it could be used in a variety of applications. Internal messages show that employees used an open-source program, yt-dlp, to download videos from YouTube, and got around blocks by running the downloads on virtual machines with rotating IP addresses.
Project managers discussed using 20-30 virtual machines on Amazon Web Services to download, each day, video content equivalent to 80 years of viewing. In May, an NVIDIA representative stated that the company was finalizing the first version of its data pipeline and preparing to create a video data factory that would generate a human lifetime’s worth of data every day.
An NVIDIA representative stated that the company is confident that its models comply with copyright law, since the law protects particular expressions but not facts, ideas, data, or information, which anyone may use to create their own expression.
Google and Netflix confirmed that NVIDIA’s use of their content violates their terms of service. NVIDIA employees concerned about the legal aspects were told by managers that this was an “executive decision” and that they should not worry about it.
Many researchers and legal experts, however, argue that whether copyrighted content may be used for AI training remains an open legal question. In recent years, academics have increasingly licensed their research data for non-commercial use only, to limit the commercial exploitation of their work.
The Cosmos project used both public and internal videos, as well as data collected by researchers. However, the licenses for many of these datasets restrict their use to academic purposes only.
Discussions within NVIDIA also raised the possibility of using movie clips for training models. Employees suggested uploading films such as “Avatar” and “The Lord of the Rings” to obtain high-quality data. However, this raised concerns about potential conflicts with Hollywood and other stakeholders.
The project faced several technical and legal challenges related to capturing video from games and other sources. Nevertheless, in March, NVIDIA managed to download 100,000 videos in just two weeks, marking a significant milestone for the project.
Notably, NVIDIA’s lead scientist, Francesco Ferroni, created a Slack channel dedicated to building a massive video dataset for the Cosmos project. Ferroni shared a link in the channel to a spreadsheet listing various datasets, including:
- MovieNet (a database of over 1,000 films and 60,000 film trailers);
- WebVid (a video dataset from GitHub, composed of stock footage and removed by its creator after a cease-and-desist request from Shutterstock);
- InternVid-10M (a dataset containing 10 million YouTube video IDs);
- several internal datasets with saved frames from video games.
The Cosmos project clearly illustrates how major tech companies exploit legal gray areas to amass the vast amounts of data needed to train AI models. At the same time, this practice jeopardizes the rights of content creators and raises concerns among researchers and rights advocates.
Until a clear legal framework and transparency standards are established, similar situations will recur, threatening both content creators’ rights and public trust in AI innovations.