AI Writing: A Master Bluffer That Blurs Real and Fake

A mildly fun thing to do when you're bored is start the beginning of a text message, and then use only the suggested words to finish it.

"In five years I will see you in the morning and then you can get it."

The technology behind these text predictions is called a "language model": a computer program that uses statistics to guess the next word in a sentence.
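
To make that concrete, here is a minimal sketch of the idea, assuming nothing more than a tiny bigram model in plain Python. Real phone keyboards and modern language models use far more context and far more data; the training sentence here is just the example above.

    import random
    from collections import Counter, defaultdict

    def train_bigram_model(text):
        """Count how often each word follows each other word."""
        words = text.lower().split()
        counts = defaultdict(Counter)
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
        return counts

    def suggest_next(model, word):
        """Pick a next word in proportion to how often it was seen."""
        followers = model.get(word.lower())
        if not followers:
            return None
        candidates, weights = zip(*followers.items())
        return random.choices(candidates, weights=weights)[0]

    model = train_bigram_model("in five years i will see you in the morning")
    print(suggest_next(model, "in"))  # "five" or "the", at random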

And in the past year, some other, newer language models have gotten really, weirdly good at generating text that mimics human writing.

"In five years, I will never return to this place. He felt his eye sting and his throat tighten..."

The program completely made this up. It's not taken from anywhere else, and it's not using a template made by humans.

For the first time in history, computers can write stories. The only problem is that it's easier for machines to write fiction than to write facts.

Language models are useful for a lot of reasons. They help "recognize speech" properly when sounds are ambiguous in speech-to-text applications, and they can make translations more fluent when a word in one language maps to multiple words in another.
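
As a toy illustration of that disambiguation, the sketch below scores two sound-alike transcriptions by how familiar their word pairs are. The counts are invented for this example; a real system would combine an acoustic score with probabilities from a model trained on huge corpora.

    # Invented counts standing in for statistics learned from real text.
    BIGRAM_COUNTS = {
        ("recognize", "speech"): 120,
        ("wreck", "a"): 40,
        ("a", "nice"): 50,
        ("nice", "beach"): 25,
    }

    def fluency(sentence):
        """Average familiarity of adjacent word pairs; unseen pairs get 1."""
        words = sentence.lower().split()
        pairs = list(zip(words, words[1:]))
        return sum(BIGRAM_COUNTS.get(p, 1) for p in pairs) / len(pairs)

    candidates = ["recognize speech", "wreck a nice beach"]
    print(max(candidates, key=fluency))  # -> "recognize speech"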

But if you asked language models to simply generate passages of text, the results never made much sense.

"So the kinds of things that made sense to do were like generating single words or very short phrases."

For years, Janelle Shane has been experimenting with language generation for her blog "AI Weirdness". Her algorithms have generated paint colors ("Bull Cream"), Halloween costumes ("Sexy Michael Cera"), and pick-up lines ("You look like a thing and I love you").

But this is what she got in 2017 when she asked for longer passages, like the first lines of a novel:

"The year of the island is discovered the Missouri of the galaxy like a teenage lying and always discovered the year of her own class-writing bed. It makes no sense."

Compare that to this opening line from a newer language model called GPT-2:

"It was a rainy, drizzling day in the summer of 1869. And the people of New York, who had become accustomed to the warm, kissable air of the city, were having another bad one."

"It's like it's getting better at bullsh*tting us."

"Yes, yes, it is very good at generating scannable, readable bullsh*t."

Going from word salad to pretty passable prose took a new approach in the field of natural language processing.

Typically, language tasks have required carefully structured data: you need thousands of correct examples to train the program.

For translation, you need a bunch of samples of the same document in multiple languages. For spam filters, you need emails that humans have labeled as spam. For summarization, you need full documents plus their human-written summaries.

Those data sources are limited and can take a lot of work to collect. But if the task is simply to guess the next word in a sentence, the problem comes with its own solution: the training data can be any human-written text, no labeling required. This is called "self-supervised learning."
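
Here is a sketch of what that looks like, assuming only a raw text string: the "labels" are just the words that actually come next, so any text generates its own training examples.

    def make_training_pairs(corpus, context_size=3):
        """Turn raw text into (context, next-word) examples; no human labels."""
        words = corpus.split()
        pairs = []
        for i in range(context_size, len(words)):
            context = words[i - context_size:i]  # input: the preceding words
            target = words[i]                    # "label": the word that follows
            pairs.append((context, target))
        return pairs

    corpus = "for the first time in history computers can write stories"
    for context, target in make_training_pairs(corpus)[:3]:
        print(context, "->", target)
    # ['for', 'the', 'first'] -> time
    # ['the', 'first', 'time'] -> in
    # ['first', 'time', 'in'] -> history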

And it makes it easy and inexpensive to gather data, which means that you can use a LOT of it. Like all of Wikipedia, or 11,000 books, or 8 million web sites.

With that amount of data, plus serious computing resources, and a few tweaks to the architecture and size of the algorithms, these new language models build vast mathematical maps of how every word correlates with every other word, all without being explicitly told any of the rules of grammar or syntax.
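
One crude way to picture that map, on a toy corpus: count which words appear near which other words, then compare words by their neighbor profiles. Real models learn dense vectors with neural networks rather than raw counts, but the correlations still come from nothing more than raw text.

    import math
    from collections import Counter, defaultdict

    def cooccurrence(words, window=2):
        """Count, for every word, the words that appear within `window` of it."""
        table = defaultdict(Counter)
        for i, word in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if i != j:
                    table[word][words[j]] += 1
        return table

    def cosine(a, b):
        """Compare two words by how similar their neighbor counts are."""
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        norm = math.sqrt(sum(v * v for v in a.values()))
        norm *= math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    words = "the cat sat on the mat the dog sat on the rug".split()
    table = cooccurrence(words)
    print(cosine(table["cat"], table["dog"]))  # high score: similar neighbors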

That gives them fluency with whatever language they're trained on, but it doesn't mean they know what's true or false.

Getting language models to generate true stories, like summarizing documents or answering questions accurately, takes extra training. The simplest thing to do without much more work is just to generate passages of text, which are both superficially coherent and also false.

"So give me any headline that you want a fake news story for."

“给我一个假新闻的标题,随便什么标题都可以。”

"Scientists discover Flying Horse."

“科学家发现了会飞的马。”

Adam Geitgey is a software developer who created a fake news website populated entirely with generated text.

亚当·盖奇是一名软件开发工程师,他创建了一个完全用语言模型生成的文本写新闻的假新闻网站。

He used a language model called Grover, which was trained on news articles from 5,000 publications.

他用的是一个叫Grover的语言模型,该模型(训练时)依托的(数据)是5000本刊物里刊登的新闻文章。

"So, ok, this is what we got.

“嗯,这就是我们的结果啦。

More than 1,000 years ago, archaeologists unearthed a mysterious flying animal in France

“1000多年前,考古学家在法国出土了一种神秘的飞禽,

and hailed it the ‘Winged Horse of Afzel' or ‘Horse of Wisdom'"

便将其誉为“‘阿夫泽尔天马’或‘智慧之马’。”

"Yeah. This is amazing, right? Like this is crazy."

“嗯,这就很不可思议了是吧?简直了!”

"It's so crazy, like..."

“太可怕了,就好像……”

"It remains coherent, all the way to the end, you know."

“它还是连贯的,从头到尾都是连贯的你知道嘛。”

"...the animal, which is the size of a horse, was not easy.

“……该动物,体型和马一样大,并不容易。

If we just Google that.

如果我们上谷歌搜一下。

Like there's nothing."

也搜不到。”

"It doesn't exist anywhere."

“其他任何地方都是没有这句话的。”

"And I don't want to say this is perfect.

“我并不想说,程序能写成这样已经很完美了。

But just from a longer term point of view of what people were really excited about three years ago,

但从更长远的角度来看,比较三年前还让人感到非常激动的技术水平

versus what people can do now,

与现在的技术水平,(你会发现)

like this is just like a huge, huge leap."

这方面其实是已经取得了巨大的飞跃了的。”

If you read closely, you can see that the model is describing a creature that is somehow both "mouse-like" and "the size of a horse."

仔细阅读你会发现,这个模型描述的是长得“像老鼠”,体型又“和马差不多大小”的一种生物。

That's because it doesn't actually know what it's talking about. It's simply mimicking the writing style of a news reporter.

And these models can be trained to write in the voice of any source, like a Twitter feed: "I'd like to be very clear about one thing. Shrek is not based on any actual biblical characters. Not even close."

Or whole subreddits:

"I found a potato on my floor."

"A lot of people use the word 'potato' as an insult to imply they are not really a potato, they just 'looked like' one."

"I don't mean insult, I mean as in the definition of the word potato."

"Fair enough. The potato has been used in various ways for a long time."

But we may be entering a time when AI-generated text isn't so funny anymore.

"Islam has taken the place of Communism as the chief enemy of the West."

Researchers have shown that these models can be used to flood government websites with fake public comments about policy proposals, post tons of fake business reviews, argue with people online, and generate extremist and racist posts that can make fringe opinions seem more popular than they really are.

"It's all about, like, taking something you could do and then just increasing the scale of it, making it more scalable and cheaper."

The good news is that some of the developers who built these language models also built ways to detect much of the text generated through their models.
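
One published detection idea (tools such as GLTR take this approach; the sketch below is a toy version built on the bigram counter from earlier, not any real detector): generated text tends to consist of words the model itself ranks as highly probable, while human writers pick surprising words more often, so a consistently low average rank is a red flag.

    from collections import Counter, defaultdict

    def train_bigram_model(text):
        """Same toy next-word statistics as in the earlier sketch."""
        words = text.lower().split()
        counts = defaultdict(Counter)
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
        return counts

    def average_rank(model, text):
        """How predictable each word is under the model; 0 = top guess."""
        words = text.lower().split()
        ranks = []
        for w1, w2 in zip(words, words[1:]):
            followers = model.get(w1, Counter())
            ranked = sorted(followers, key=followers.get, reverse=True)
            # Unseen pairs get a penalty one past the known followers.
            ranks.append(ranked.index(w2) if w2 in ranked else len(ranked) + 1)
        # Consistently low ranks suggest machine-like, "safe" word choices.
        return sum(ranks) / max(len(ranks), 1)

    model = train_bigram_model("the cat sat on the mat and the dog sat on the rug")
    print(average_rank(model, "the cat sat on the rug"))  # toy score: 0.6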

But it's not clear who has the responsibility to fake-check the internet.

And as bots become even better mimics, with faces like ours, voices like ours, and now our language, those of us made of flesh and blood may find ourselves increasingly burdened with not only detecting what's fake, but also proving that we're real.
