Has OpenAI's "unspeakable secret" come to light? How it acquired YouTube data to train its models may not be pretty
忧郁草民乜
Posted on 2024-3-19 21:47:28
Even as its GPT models keep advancing, OpenAI seems to be met mostly with criticism. Beyond Musk's repeated challenges over how "open" the company really is, the unclear provenance of the data used to train OpenAI's large models may have planted a landmine of infringement litigation for the company.
Training today's large AI models requires feeding them massive datasets. According to public information, OpenAI's data sources may include, but are not limited to: publicly available data from the internet, such as books, web pages, news articles, and academic papers; datasets provided by partners and third-party data providers; purchased domain-specific data, such as medical, legal, or scientific literature; synthetic data, which OpenAI may generate with its own models, for example by training and improving a model on its own output; and data contributed through crowdsourcing and by the community.
Where the data comes from is not the main issue. What outside observers focus on is how OpenAI obtains it.
Was it stolen?
As Business Insider has reported, OpenAI's use of large numbers of YouTube videos to train its models has become an "open secret", and the products that benefit include its newly launched text-to-video model, Sora. The mystery is how OpenAI managed to obtain enough YouTube content.
Bear in mind that YouTube is a Google subsidiary. Google acquired it in 2006 for $1.65 billion, and with Google's backing it quickly grew into the world's largest video-sharing platform.
Google has long been committed to developing AI and is one of OpenAI's main competitors, so it is naturally not going to hand over its gold mine for free. YouTube has long prohibited downloading for commercial purposes and also restricts bulk downloading of its video data. Controls this strict affect individual users too: some say that downloading even a single YouTube video can be very slow and take several hours.
A common guess is that OpenAI used web crawlers to "steal" YouTube data. OpenAI has acknowledged launching a web crawler called GPTBot, which crawls and collects data for training large models.
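For readers curious how a site signals its wishes to a crawler like GPTBot, the usual convention is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check a made-up robots.txt against two user agents. The robots.txt content, the `example.com` URL, the `SomeOtherBot` name, and the `is_allowed` helper are all hypothetical, chosen only for illustration; it does not describe how YouTube or OpenAI actually operate, and robots.txt is a voluntary convention that is separate from a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/videos/1"  # placeholder URL
    print("GPTBot allowed:", is_allowed("GPTBot", url))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", url))
```

Under these assumed rules, the crawler identifying itself as GPTBot is turned away while other crawlers are let through. Whether a crawler honors such rules is up to its operator.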
OpenAI executives have been evasive on the subject, which has indirectly reinforced the company's image as a "data thief". The Wall Street Journal recently asked OpenAI Chief Technology Officer Mira Murati whether the startup used videos from YouTube, Instagram, and Facebook to train Sora.
"I'm actually not sure about this," she said. Asked again about the source of the training data, Murati declined to answer: "I won't disclose any details."
According to a recent Business Insider report, a person familiar with OpenAI's operations said the company has assigned a closely guarded team to obtain training data, and how that data is obtained is kept confidential.
An AI field thick with fences
Google does not tolerate crawlers: its YouTube platform prohibits the use of robots and other automated means to scrape its videos.
For OpenAI, however, accessing YouTube videos in a way that violates Google's terms of service is not necessarily illegal. US case law and the "fair use" doctrine give companies latitude to use online content in various ways.
In short, Google, OpenAI, and other technology companies currently take the position that using copyrighted content to train artificial intelligence models is legal as well. Regulators have not yet issued clear rules on the matter. The AI arena remains a vast wilderness, where the rules of the game for data are either still undetermined or simply ignored.
Vendors are racing in and building their own technological barriers.
OpenAI and other large model developers used to disclose their training data sources in published research papers, but the practice has faded as competition has intensified. Everyone wants to keep their technical secrets to gain a relative advantage, especially the leading vendors that already hold a favorable position. The dispute over open source is another sign of vendors trying to hold on to their trump cards.
The only certainty is that as generative AI technology continues to iterate, disputes of this kind will only multiply.
Large companies are more likely to become targets of criticism. Take data: even if they are willing to take responsibility and bear high data procurement costs, full compliance in data acquisition is not easy to achieve. Because of their large parameter counts, large models have to be trained and deployed using distributed computing and cloud services, which increases the risk of data theft, tampering, abuse, or leakage.
How to balance protecting personal privacy with encouraging technological innovation, and how to find the best path between business survival and compliant operation, has become an unavoidable question for every company committed to generative AI.
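LogoMoney.com is an information publishing platform and provides only information storage services.
Disclaimer: The views in this article are solely those of the author. This article does not represent the position of LogoMoney.com and does not constitute advice; please treat it with caution.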