AI is setting off a great scramble for data

Feeding ever-larger models is requiring makers to get creative

Not so long ago analysts were openly wondering whether artificial intelligence (AI) would be the death of Adobe, a maker of software for creative types. New tools like DALL-E 2 and Midjourney, which conjure up pictures from text, seemed set to render Adobe’s image-editing offerings redundant. As recently as April, Seeking Alpha, a financial news site, published an article headlined “Is AI the Adobe killer?”

Far from it. Adobe has used its database of hundreds of millions of stock photos to build its own suite of AI tools, dubbed Firefly. Since its release in March the software has been used to create over 1bn images, says Dana Rao, an executive at the company. By avoiding mining the internet for images, as rivals did, Adobe has skirted the deepening dispute over copyright that now dogs the industry. The firm’s share price has risen by 36% since Firefly was launched.

Adobe’s triumph over the doomsters illustrates a wider point about the contest for dominance in the fast-developing market for AI tools. The supersized models powering the latest wave of so-called “generative” AI rely on gargantuan amounts of data. Having already helped themselves to much of the internet, often without permission, model builders are now seeking out new data sources to sustain the feeding frenzy. Meanwhile, companies with vast troves of the stuff are weighing up how best to profit from it. A data land grab is under way.

The two essential ingredients for an AI model are datasets, on which the system is trained, and processing power, through which the model detects relationships within and among those datasets. Those two ingredients are, to an extent, substitutes: a model can be improved either by ingesting more data or by adding more processing power. The latter, however, is becoming difficult amid a shortage of specialist AI chips, leading model builders to be doubly focused on seeking out data.

Demand for data is growing so fast that the stock of high-quality text available for training may be exhausted by 2026, reckons Epoch AI, a research outfit. The latest AI models from Google and Meta, two tech giants, are believed to have been trained on over 1trn words. By comparison, the sum total of English words on Wikipedia, an online encyclopedia, is about 4bn.

It is not only the size of datasets that counts. The better the data, the better the model. Text-based models are ideally trained on long-form, well-written, factually accurate writing, notes Russell Kaplan of Scale AI, a data startup. Models that are fed this information are more likely to produce similarly high-quality output. Likewise, AI chatbots give better answers when asked to explain their working step-by-step, increasing demand for sources like textbooks that do that, too. Specialised information sets are also prized, as they allow models to be “fine-tuned” for more niche applications. Microsoft’s purchase of GitHub, a repository for software code, for $7.5bn in 2018 helped it develop a code-writing AI tool.

As demand for data grows, accessing it is getting trickier, with content creators now demanding compensation for material that has been ingested into AI models. A number of copyright-infringement cases have already been brought against model builders in America. A group of authors, including Sarah Silverman, a comedian, are suing OpenAI, maker of ChatGPT, an AI chatbot, and Meta. A group of artists are similarly suing Stability AI, which builds text-to-image tools, and Midjourney.

The upshot of all this has been a flurry of dealmaking as AI companies race to secure data sources. In July OpenAI inked a deal with Associated Press, a news agency, to access its archive of stories. It has also recently expanded an agreement with Shutterstock, a provider of stock photography, with whom Meta has a deal, too. On August 8th it was reported that Google was in discussions with Universal Music, a record label, to license artists’ voices to feed a songwriting AI tool. Fidelity, an asset manager, has said that it has been approached by tech firms asking for access to its financial data. Rumours swirl about AI labs approaching the BBC, Britain’s public broadcaster, for access to its archive of images and films. Another supposed target is JSTOR, a digital library of academic journals.

Holders of information are taking advantage of their greater bargaining power. Reddit, a discussion forum, and Stack Overflow, a question-and-answer site popular with coders, have increased the cost of access to their data. Both websites are particularly valuable because users “upvote” preferred answers, helping models know which are most relevant. Twitter (now known as X), a social-media site, has put in place measures to limit the ability of bots to scrape the site and now charges anyone who wishes to access its data. Elon Musk, its mercurial owner, is planning to build his own AI business using the data.


As a consequence, model builders are working hard to improve the quality of the inputs they already have. Many AI labs employ armies of data annotators to perform tasks such as labelling images and rating answers. Some of that work is complex; an advert for one such job seeks applicants with a master’s degree or doctorate in life sciences. But much of it is mundane, and is being outsourced to places such as Kenya where labour is cheap.

AI firms are also gathering data via users’ interactions with their tools. Many of these have some form of feedback mechanism, where users indicate which outputs are useful. Firefly’s text-to-image generator allows users to pick from one of four options. Bard, Google’s chatbot, similarly proposes three answers. Users can give ChatGPT a thumbs up or thumbs down when it replies to queries. That information can be fed back as an input into the underlying model, forming what Douwe Kiela, co-founder of Contextual AI, a startup, calls the “data flywheel”. A stronger signal still of the quality of a chatbot’s answers is whether users copy the text and paste it elsewhere, he adds. Analysing such information helped Google rapidly improve its translation tool.
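To make the flywheel idea concrete, here is a minimal sketch of how feedback of this kind might be turned into training signals. It is an illustration only, not any firm's actual pipeline: the `FeedbackEvent` type, the function names and the signal weights are all hypothetical, and the assumption that a copy counts for twice a thumbs up is made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One user interaction with a model response (hypothetical schema)."""
    prompt: str
    response: str
    signal: str  # "thumbs_up", "thumbs_down" or "copied"


# Assumed weights: copying text is treated as a stronger signal than a thumbs up.
SIGNAL_WEIGHTS = {"thumbs_up": 1.0, "thumbs_down": -1.0, "copied": 2.0}


def score_responses(events):
    """Sum the weighted signals for each (prompt, response) pair."""
    scores = defaultdict(float)
    for event in events:
        scores[(event.prompt, event.response)] += SIGNAL_WEIGHTS[event.signal]
    return dict(scores)


def preference_pairs(scores):
    """For each prompt, pair its best-scored response with its worst-scored one."""
    by_prompt = defaultdict(list)
    for (prompt, response), score in scores.items():
        by_prompt[prompt].append((score, response))

    pairs = []
    for prompt, scored in by_prompt.items():
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        # Skip prompts where feedback does not separate the responses.
        if best[0] > worst[0]:
            pairs.append((prompt, best[1], worst[1]))
    return pairs


events = [
    FeedbackEvent("Translate 'bonjour'", "hello", "thumbs_up"),
    FeedbackEvent("Translate 'bonjour'", "hello", "copied"),
    FeedbackEvent("Translate 'bonjour'", "good night", "thumbs_down"),
]
pairs = preference_pairs(score_responses(events))
print(pairs)
```

Pairs of preferred and rejected answers like these are one common form in which user feedback can be fed back into a model; the more people use the tool, the more such pairs accumulate.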


Expanding the frontier
There is, however, one source of data that remains largely untapped: the information that exists within the walls of the tech firms’ corporate customers. Many businesses possess, often unwittingly, vast amounts of useful data, from call-centre transcripts to customer spending records. Such information is especially valuable because it can be used to fine-tune models for specific business purposes, like helping call-centre workers answer customers’ queries or business analysts spot ways to boost sales.

Yet making use of that rich resource is not always straightforward. Roy Singh of Bain, a consultancy, notes that most firms have historically paid little attention to the types of vast but unstructured datasets that would prove most useful for training AI tools. Often these are spread across multiple systems, buried in company servers rather than in the cloud.

Unlocking that information would help companies customise AI tools to better serve their specific needs. Amazon and Microsoft, two tech giants, now offer tools to help companies better manage their unstructured datasets, as does Google. Christian Kleinerman of Snowflake, a database firm, says that business is booming as clients look to “tear down data silos”. Startups are piling in. In April Weaviate, an AI-focused database business, raised $50m at a valuation of $200m. Barely a week later Pinecone, a rival, raised $100m at a valuation of $750m. Earlier this month Neon, another database startup, raised an additional $46m in funding. The scramble for data is only just getting started.