Google reportedly allowed OpenAI to scrape data from YouTube videos for GPT-4 training

The AI company used a million hours of videos from YouTube


According to a recent report from The New York Times, OpenAI scraped data from YouTube videos to train its most advanced large language model (LLM), GPT-4. The AI company reportedly used a million hours of YouTube videos for GPT-4 training.

Interestingly, some people at Google, which owns YouTube, reportedly knew about OpenAI’s practice of transcribing YouTube videos.

Google allegedly takes the same approach, so it allowed OpenAI to scrape data from YouTube videos for GPT-4 training

The report suggests that OpenAI developed Whisper, an audio transcription model, which helped the company scrape YouTube video data. The company was reportedly well aware that it might come under scrutiny from government bodies. However, it went ahead with the practice, believing it was fair use.

The NY Times reported that OpenAI scraped data from YouTube videos and podcasts to train its two AI models. The report further mentions the involvement of OpenAI president Greg Brockman in the company’s shady approach to GPT-4 training.

The newspaper further reported that Google has done the same to train its Gemini AI, which may violate creators’ copyrights. However, Google said that it scrapes data from YouTube videos only when the original creator consents to it.

The NY Times also reported that Google tweaked its privacy policy last year. On that change, it wrote:

One motivation for the change, according to members of the company’s privacy team and an internal message viewed by The Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps, and other online material for more of its A.I. products.

Previously, OpenAI CTO Mira Murati said that the company’s AI video model, Sora, was trained on publicly available video data.

YouTube is aware of OpenAI’s practice but seemingly hesitates to interfere

In a recent interview with Bloomberg, YouTube CEO Neal Mohan said such practices are a clear violation of the platform’s terms of service. He added:

One of those expectations is that the terms of service are going to be abided by. It does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service. Those are the rules of the road in terms of content on our platform.

When asked about OpenAI using data from YouTube videos for GPT-4 training, he gave a noncommittal answer. Mohan said that he was aware of the reports, adding that OpenAI may or may not have used data from YouTube videos.

Lastly, it is nothing new for AI companies like OpenAI and Google to use publicly available data for AI training. That said, these companies are well aware that they could face scrutiny from regulators over it.

Do you think companies using user data to train AI is a shady tactic? What’s your take on this? Share your views in the comments below.
