The Economist: close-reading notes
AI is setting off a great scramble for data
Feeding ever-larger models is requiring makers to get creative
Not so long ago analysts were openly wondering whether artificial intelligence (AI) would be the death of (the end of) Adobe, a maker of software for creative types. New tools like DALL-E 2 and Midjourney, which conjure up (produce as if by magic) pictures from text, seemed set to render (cause to become) Adobe’s image-editing offerings redundant (no longer needed). As recently as April, Seeking Alpha, a financial news site, published an article headlined “Is AI the Adobe killer?”
Far from it (idiomatic expression: “not at all”). Adobe has used its database of hundreds of millions of stock photos to build its own suite (set) of AI tools, dubbed (named) Firefly. Since its release in March the software has been used to create over 1bn (10^9, one billion) images, says Dana Rao, an executive at the company. By avoiding mining the internet for images, as rivals did, Adobe has skirted (avoided) the deepening dispute over copyright that now dogs (plagues) the industry. The firm’s share price has risen by 36% since Firefly was launched.
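user:
Does the etymology of “skirt” include the sense of “avoid”? Or is the hem of a garment somehow related to avoiding? How did the word come to mean “avoid”?
chatgpt:
The etymology of “skirt” is related to “avoid”, though not directly. “Skirt” originally referred to the lower part of a garment, especially a woman’s skirt. One of its derived senses is “to go around” or “to border”, and that is where the link to “avoid” lies.
This sense probably comes from the idea of a garment’s hem encircling the body. From there “skirt” was extended to mean “go around” or “pass along the edge of”, much like steering around an obstacle. Over time the extended sense came to be used figuratively for avoiding obstacles or problems in real life.
When you want me to explain how a word developed an uncommon sense, you can guide me with a question or instruction that includes the following elements:
Original meaning: first, state the word’s original meaning or use. This gives us a starting point.
Etymology or historical background: give the word’s etymology, history or source, so that I can follow how it changed over different periods.
Extended or transferred sense: say how the word came to carry its uncommon or extended sense in a particular context. This may involve changes tied to its etymology, or shifts driven by social, technological or cultural change.
Related background: if cultural, social or technological background helps explain the word’s development, include it.
For example, you could use a prompt in this format:
“Please explain the meaning of the word [word] in [a specific period/context], and how it gradually developed its present uncommon or extended sense. If possible, give the relevant etymology, history and cultural background.”
With these elements I can better understand what you need and give a detailed answer on how the word evolved.
The development of “dog” is relatively complex, but its extended sense is clearly connected to the way it is used here. Its development and background are as follows:
Etymology and historical background: “dog” first appears in late Old English (as “docga”); its earlier origin is uncertain. It originally denoted the animal, a common companion and helper of humans, known for loyalty and alertness.
Development of the extended sense: over time “dog” came to be used as a verb meaning to chase, follow closely or trouble. This extension probably reflects a dog’s instinct to pursue, track and persist. In the sentence above, “dogs” means “persistently troubles”, like a dog that keeps following and does not give up.
Cultural background: in some cultures dogs are seen as symbols of loyalty, alertness and tenacity, which may have helped shape the extended sense. The use of dogs for hunting, guarding and assisting people also provides some basis for it.
In short, the sense “trouble” or “plague” probably derives from a dog’s persistent pursuit and alertness. It highlights how a problem, dispute or difficulty keeps affecting the industry, like a dog that will not stop following.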
Adobe’s triumph over the doomsters (prophets of doom) illustrates a wider point about the contest for dominance in the fast-developing market for AI tools. The supersized (extremely large) models powering the latest wave of so-called “generative” AI rely on gargantuan (enormous) amounts of data. Having already helped themselves to (here: “taken without permission”, not the literal “helped themselves obtain”; compare “treat oneself to”) much of the internet, often without permission, model builders are now seeking out new data sources to sustain the feeding frenzy (wild, uncontrolled activity). Meanwhile, companies with vast troves (stores) of the stuff are weighing up (assessing) how best to profit from it. A data land grab is under way.
The two essential ingredients for an AI model are datasets, on which the system is trained, and processing power, through which the model detects relationships within and among those datasets. Those two ingredients are, to an extent, substitutes (here: each can stand in for the other): a model can be improved either by ingesting more data or adding more processing power. The latter, however, is becoming difficult amid (in the middle of) a shortage in specialist AI chips, leading model builders to be doubly focused (twice as focused) on seeking out data.
Demand for data is growing so fast that the stock of high-quality text available for training may be exhausted by 2026, reckons (estimates) Epoch AI, a research outfit (organisation). The latest AI models from Google and Meta, two tech giants, are believed to have been trained on over 1trn words. By comparison, the sum total of English words on Wikipedia, an online encyclopedia, is about 4bn.
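Note: the scale of that comparison is easy to check with a little arithmetic. The sketch below uses only the two figures quoted above; treating “over 1trn” as exactly 10^12 is an assumption, so the result is a lower bound.

```python
# Figures quoted in the article (1trn = 10**12, 1bn = 10**9).
training_words = 1 * 10**12   # "over 1trn words": assumed here as a lower bound
wikipedia_words = 4 * 10**9   # "about 4bn" English words on Wikipedia

# How many English Wikipedias would fit into the training corpus?
ratio = training_words / wikipedia_words
print(f"Training corpus is at least {ratio:.0f}x the size of English Wikipedia")
```

In other words, the training text is at least 250 times the size of English Wikipedia.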
It is not only the size of datasets that counts. The better the data, the better the model. Text-based models are ideally trained on long-form, well-written, factually (from “fact”) accurate writing, notes (points out) Russell Kaplan of Scale AI, a data startup. Models that are fed this information are more likely to produce similarly high-quality output. (Grammar note: “feed” is transitive here and takes no preposition, as in “She feeds the cat bread every morning.” It can also be intransitive: “When a baby is thirsty, it feeds more often.”) Likewise, AI chatbots give better answers when asked to explain their working step-by-step, increasing demand for sources like textbooks that do that, too. (Translation note: this means demand is rising for sources such as textbooks because they, too, explain things step by step. It does not mean that demand for resources is rising in the way that textbooks need more paper and ink to explain things step by step.) Specialised information sets are also prized (highly valued), as they allow models to be “fine-tuned” for more niche applications. Microsoft’s purchase of GitHub, a repository for software code, for $7.5bn in 2018 helped it develop a code-writing AI tool (Copilot).
As demand for data grows, accessing it is getting trickier, with content creators now demanding compensation (payment) for material that has been ingested (absorbed) into AI models. A number of copyright-infringement (violation) cases have already been brought against model builders in America. A group of authors, including Sarah Silverman, a comedian, are suing OpenAI, maker of ChatGPT, an AI chatbot, and Meta. A group of artists are similarly suing Stability AI, which builds text-to-image tools, and Midjourney.
The upshot (outcome) of all this has been a flurry (burst) of dealmaking as AI companies race to secure data sources. In July OpenAI inked (signed) a deal with Associated Press, a news agency, to access its archive of stories (news articles). It has also recently expanded an agreement with Shutterstock, a provider of stock photography, with whom Meta has a deal, too (that is, Meta also has a deal with Shutterstock). On August 8th it was reported that Google was in discussions with Universal Music, a record label, to license artists’ voices to feed a songwriting AI tool. Fidelity, an asset manager, has said that it has been approached by tech firms asking for access to its financial data. Rumours (unconfirmed reports) swirl about AI labs approaching the BBC, Britain’s public broadcaster, for access to its archive of images and films. Another supposed target is JSTOR, a digital library of academic journals.
Note: “secure” here stresses obtaining the right or the opportunity to use a data source, usually in a tense, competitive setting. It conveys that companies are working hard to lock in their data sources.
Holders of information are taking advantage of their greater bargaining power. Reddit, a discussion forum, and Stack Overflow, a question-and-answer site popular with coders, have increased the cost of access to their data. Both websites are particularly valuable because users “upvote” (vote up, like) preferred answers, helping models know which are most relevant. Twitter (now known as X), a social-media site, has put in place measures to limit the ability of bots to scrape (extract data automatically, as a web crawler does; literally “scratch off”) the site and now charges anyone who wishes to access its data. Elon Musk, its mercurial owner, is planning to build his own AI business using the data.
Note: “mercurial” usually describes someone whose temperament or behaviour is changeable, unstable and hard to predict. Here “mercurial owner” suggests that Elon Musk can be erratic or unpredictable.
As a consequence, model builders are working hard to improve the quality of the inputs they already have. Many AI labs employ armies of data annotators (people who label data; from “annotate”, to add notes; not to be confused with “anonymous”) to perform tasks such as labelling images and rating answers. Some of that work is complex; an advert for one such job seeks applicants with a master’s degree or doctorate in life sciences. But much of it is mundane (routine), and is being outsourced (contracted out) to places such as Kenya where labour is cheap.
Note: some highly scientific, technical roles may well need deep expertise and a research background, so asking for an advanced degree is reasonable.
AI firms are also gathering data via users’ interactions with their tools. Many of these have some form of feedback mechanism, where users indicate (point out) which outputs are useful. Firefly’s text-to-image generator allows users to pick from one of four options. Bard, Google’s chatbot, similarly proposes three answers. Users can give ChatGPT a thumbs up or thumbs down when it replies to queries. That information can be fed back as an input into the underlying model (the base model behind the tool), forming what Douwe Kiela, co-founder of Contextual AI, a startup, calls the “data flywheel” (a flywheel, here a self-reinforcing loop). A stronger signal still of the quality of a chatbot’s answers is whether users copy the text and paste it elsewhere, he adds. Analysing such information helped Google rapidly improve its translation tool.
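1 What can replace “still” here?
“Yet” fits the same slot (“A stronger signal yet of the quality of a chatbot’s answers ...”), and “even” works if it is moved before the comparative (“An even stronger signal of the quality of a chatbot’s answers is whether users copy the text and paste it elsewhere, he adds.”). Alternatives such as “more pronounced”, “more compelling” or “more distinct” replace “stronger” rather than “still”. Connectives such as “moreover”, “furthermore” and “additionally” do not fit this position, because “still” here intensifies the comparative rather than adding a new point.
2 What is the main meaning of “still” here?
- In this context “still” means “even”: to a further or greater degree. The sentence says that an even stronger indicator of the quality of a chatbot’s answers is whether users copy the text and paste it elsewhere.
- “Still” is an adverb here, intensifying the comparative “stronger”. It comes from Old English “stille”, meaning “motionless” or “not moving”. Over time it came to mean “inactive”, “quiet” or “calm”. With wider use it gained further senses, including “even so”, “nevertheless” and “yet”, which introduce a contrast or additional information.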
- Contrast:
“She was tired, but still managed to finish her work.”
“The weather was cold, but they still went for a hike.”
- Concession:
“I know it’s risky, but I still believe we should try.”
“He had little experience, but he still performed exceptionally well.”
- Emphasis:
“Even in the face of challenges, she still persevered.”
“Despite the odds, they still achieved their goal.”
- Continuation:
“The rain was pouring, but the team still continued with the outdoor event.”
“The concert was delayed, but the audience still waited patiently.”
- Nevertheless:
“The movie received mixed reviews, but it was still a box office success.”
“The plan was risky; nevertheless, they decided to proceed.”
- Yet:
“He was exhausted, yet he still managed to give a great presentation.”
“It was late, yet they continued working on the project.”
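3 How did the sense “even more” develop?
Emphasis:
“Still” was at first used to stress that something is stable or distinctive. When describing a state or situation, it can imply that the state remains unchanged. This emphatic quality laid the ground for the later senses.
Contrast:
As the language developed, people began to use “still” to introduce a contrast. When describing the difference between two situations, “still” can point to one state that holds despite the other. This is the sense “even so” or “nevertheless”.
Intensification, going further:
Seen from the contrast sense, when “still” stresses a state, especially against other possibilities, its meaning may have gradually shifted to “more strongly” or “further”. “Still” singles out one state so that it seems to go beyond the alternatives, which raises the degree being expressed.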
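Note: the user-feedback loop that Kiela calls the “data flywheel” (in the paragraph these notes on “still” refer to) can be pictured with a toy sketch. It is not any company’s actual pipeline. The `FeedbackEvent` type, the signal names, the weights and the threshold are all hypothetical; only the ordering of the weights follows the article, which describes copying the text as a stronger signal than a thumbs up.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights: only their ordering reflects the article
# (copy-and-paste is described as a stronger signal than a thumbs up).
SIGNAL_WEIGHTS = {"thumbs_up": 1.0, "thumbs_down": -1.0, "copied": 2.0}


@dataclass(frozen=True)
class FeedbackEvent:
    response_id: str  # which model output the user reacted to
    signal: str       # one of the keys in SIGNAL_WEIGHTS


def score_responses(events):
    """Sum the weighted feedback signals for each response."""
    scores = defaultdict(float)
    for event in events:
        scores[event.response_id] += SIGNAL_WEIGHTS[event.signal]
    return dict(scores)


def select_training_examples(scores, threshold=1.5):
    """Keep responses whose score clears the (assumed) threshold."""
    return sorted(rid for rid, score in scores.items() if score >= threshold)


events = [
    FeedbackEvent("a", "thumbs_up"),
    FeedbackEvent("a", "copied"),
    FeedbackEvent("b", "thumbs_down"),
    FeedbackEvent("c", "thumbs_up"),
]
scores = score_responses(events)
selected = select_training_examples(scores)
print(scores)    # per-response feedback totals
print(selected)  # responses that would be fed back as training inputs
```

In this sketch, the selected responses would go back into the model as new training inputs, and a better model would in turn attract more feedback. That self-reinforcement is the “flywheel”.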
Expanding the frontier
There is, however, one source of data that remains largely untapped (unused, not yet exploited): the information that exists within the walls of the tech firms’ corporate customers. Many businesses possess (own), often unwittingly (without realising it), vast amounts of useful data, from call-centre transcripts (written records) to customer spending records. Such information is especially valuable because it can be used to fine-tune models for specific business purposes, like helping call-centre workers answer customers’ queries or business analysts spot ways to boost sales.
Note: in the first sentence, “exists within the walls of” is a metaphor. It means the data sits inside a particular place or organisation, and it implies that the data has not yet been fully used or explored. The image is of data enclosed by walls, which stresses how hidden it is and how much untapped potential it has.
Yet making use of that rich resource is not always straightforward. Roy Singh of Bain, a consultancy, notes that most firms have historically paid little attention to the types of vast but unstructured datasets that would prove (turn out to be; active in form, passive in sense) most useful for training AI tools. Often these are spread across multiple systems, buried in company servers rather than in the cloud.
Unlocking that information would help companies customise AI tools to better serve their specific needs. Amazon and Microsoft, two tech giants, now offer tools to help companies better manage their unstructured datasets, as does Google. Christian Kleinerman of Snowflake, a database firm, says that business is booming as clients look to “tear down data silos” (break down isolated stores of data). Startups are piling in (rushing in). In April Weaviate, an AI-focused database business, raised $50m at a value of $200m (that is, it raised $50m in funding with the company valued at $200m). Barely a week later PineCone, a rival, raised $100m at a valuation of $750m. Earlier this month Neon, another database startup, raised an additional $46m in funding. The scramble for data is only just getting started.
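(Me) Summary of the article:
The article describes how companies that train models are mining, collecting and laying claim to data. It stresses how much high-quality datasets matter to a good model (the labelling of the data itself, the signals that mark data as high-quality, and valuable data that has so far been overlooked). It also covers the protection and infringement of data copyright and privacy, and looks at agreements and partnerships between companies, showing how fast the deep-learning field is growing.
chatgpt’s suggestions:
1 In the summary, stress the article’s core theme: the mining of, and competition for, the data needed to train AI models.
2 Mention the importance the article places on high-quality data, and how data can be used to improve a model’s results.
3 Bring in the article’s discussion of data privacy and copyright, to show the challenges model builders face in obtaining data.
4 Mention the cooperation, competition and strategy among companies, highlighting the main firms named in the article and what they are doing.
5 Finally, stress that the scramble for data has only just begun, showing the prospects and growth of the AI field.
chatgpt’s example:
This article examines the fierce competition for data now under way in AI model training. Not long ago people were asking whether AI would threaten certain businesses, such as Adobe, a maker of creative software. In practice Adobe met the challenge by building its own suite of AI tools and making full use of its huge stock-photo library. The article sets out the two key ingredients of an AI model: datasets and processing power. The two can substitute for each other to some extent, but data is receiving more and more attention, especially high-quality, carefully labelled data. Obtaining data is not easy, however, because data privacy and copyright are also drawing growing scrutiny. Companies have to find many routes to data, including licensing agreements and the development of new data sources. Tools for managing and optimising data are also increasingly in demand, so that data can be used efficiently. Competition is already intense, yet the scramble for data in AI model training has only just begun, and the outlook remains exciting.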