Recently, OpenAI came out with a new language model that automatically synthesizes text, called GPT-2.
It’s disturbingly good. You can see some examples (cherry-picked, by their own admission) in OpenAI’s post and in the related technical paper.
I’m not going to write about the machine learning here, but about the examples and what we can infer from them.
The scary thing about GPT-2-generated text is that it flows very naturally if you’re just skimming, reading for writing style and key, evocative words. The “unicorn” sample reads like a real science press release. The “theft of nuclear material” sample reads like a real news story. The “Miley Cyrus shoplifting” sample reads like a real post from a celebrity gossip site. The “GPT-2” sample reads like a real OpenAI press release. The “Legolas and Gimli” sample reads like a real fantasy novel. The “Civil War homework assignment” reads like a real C-student’s paper. The “JFK acceptance speech” reads like a real politician’s speech. The “recycling” sample reads like a real right-wing screed.
If I just skim, without focusing, they all look totally normal. I would not have noticed they were machine-generated. I would not have noticed anything amiss about them at all.
But if I read with focus, I notice that they don’t make a lot of logical sense.
For instance, in the unicorn sample:
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Wait a second, “Ovid” doesn’t refer to a “distinctive horn”, so why would naming them “Ovid’s Unicorn” be naming them after a distinctive horn? Also, you just said they had one horn, so why are you saying they have four horns in the next sentence?
While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”
Wait, unicorns originated from the interbreeding of humans and … unicorns? That’s circular, isn’t it?
Or, look at the GPT-2 sample:
We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text.
Except the second sentence isn’t a restatement of the first sentence — “task-specific training data” and “explicit rules for producing text” aren’t synonyms! So saying “That is” doesn’t make sense.
Or look at the LOTR sample:
Aragorn drew his sword, and the Battle of Fangorn was won. As they marched out through the thicket the morning mist cleared, and the day turned to dusk.
Yeah, day doesn’t turn to dusk in the morning.
Or in the “resurrected JFK” sample:
(1) The brain of JFK was harvested and reconstructed via tissue sampling. There was no way that the tissue could be transported by air. (2) A sample was collected from the area around his upper chest and sent to the University of Maryland for analysis. A human brain at that point would be about one and a half cubic centimeters. The data were then analyzed along with material that was obtained from the original brain to produce a reconstruction; in layman’s terms, a “mesh” of brain tissue.
His brain tissue was harvested…from his chest?! A human brain is one and a half cubic centimeters?!
So, ok, this isn’t actually human-equivalent writing ability. OpenAI doesn’t claim it is, for what it’s worth — I’m not trying to diminish their accomplishment, that’s not the point of this post. The point is, if you skim text, you miss obvious absurdities. The point is OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot.
The point is, I know of a few people, acquaintances of mine, who, even when asked to try to find flaws, could not detect anything weird or mistaken in the GPT-2-generated samples.
There are probably a lot of people who would be completely taken in by literal “fake news”, as in, computer-generated fake articles and blog posts. This is pretty alarming. Even more alarming: unless I make a conscious effort to read carefully, I would be one of them.
Robin Hanson’s post Better Babblers is very relevant here. He claims, and I don’t think he’s exaggerating, that a lot of human speech is simply generated by “low order correlations”, that is, generating sentences or paragraphs that are statistically likely to come after previous sentences or paragraphs:
After eighteen years of being a professor, I’ve graded many student essays. And while I usually try to teach a deep structure of concepts, what the median student actually learns seems to mostly be a set of low order correlations. They know what words to use, which words tend to go together, which combinations tend to have positive associations, and so on. But if you ask an exam question where the deep structure answer differs from answer you’d guess looking at low order correlations, most students usually give the wrong answer.
Simple correlations also seem sufficient to capture most polite conversation talk, such as the weather is nice, how is your mother’s illness, and damn that other political party. Simple correlations are also most of what I see in inspirational TED talks, and when public intellectuals and talk show guests pontificate on topics they really don’t understand, such as quantum mechanics, consciousness, postmodernism, or the need always for more regulation everywhere. After all, media entertainers don’t need to understand deep structures any better than do their audiences.
Let me call styles of talking (or music, etc.) that rely mostly on low order correlations “babbling”. Babbling isn’t meaningless, but to ignorant audiences it often appears to be based on a deeper understanding than is actually the case. When done well, babbling can be entertaining, comforting, titillating, or exciting. It just isn’t usually a good place to learn deep insight.
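To make “low order correlations” concrete, here is a minimal sketch of a bigram babbler: it picks each next word by looking only at which words followed the current word in some source text. This is an illustration of the idea, not a description of how GPT-2 works (GPT-2 is a far larger neural network). The function names `build_bigrams` and `babble` and the toy corpus are made up for this example.

```python
import random
from collections import defaultdict


def build_bigrams(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def babble(followers, start, length, seed=0):
    """Generate up to `length` words by repeatedly sampling a follower."""
    rng = random.Random(seed)  # fixed seed so repeated runs match
    out = [start]
    while len(out) < length:
        options = followers.get(out[-1])
        if not options:  # dead end: this word never had a successor
            break
        out.append(rng.choice(options))
    return " ".join(out)


# Hypothetical toy corpus of small-talk fragments.
corpus = (
    "the weather is nice today and the weather is awful in winter "
    "and the party is awful and the party is nice in winter"
)
model = build_bigrams(corpus)
sample = babble(model, "the", 12)
print(sample)
```

Every adjacent pair of words in the output occurred somewhere in the corpus, so the result sounds locally plausible when skimmed, but nothing in the procedure keeps the sentence as a whole consistent. That is the sense in which babbling can look fine on autopilot and fall apart under focused reading.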
I used to half-joke that the New Age Bullshit Generator was actually useful as a way to get myself to feel more optimistic. The truth is, it isn’t quite good enough to match the “aura” or “associations” of genuine, human-created inspirational text. GPT-2, though, is.
I also suspect that the “lyrical” or “free-associational” function of poetry is adequately matched by GPT-2. The autocompletions of Howl read a lot like Allen Ginsberg — they just don’t imply the same beliefs about the world. (Moloch whose heart is crying for justice! sounds rather positive.)
I’ve noticed that I cannot tell, from casual conversation, whether someone is intelligent in the IQ sense.
I’ve interviewed job applicants, and perceived them all as “bright and impressive”, but found that the vast majority of them could not solve a simple math problem. The ones who could solve the problem didn’t appear any “brighter” in conversation than the ones who couldn’t.
I’ve taught public school teachers, who were incredibly bad at formal mathematical reasoning (I know, because I graded their tests), to the point that I had not realized humans could be that bad at math — but it had no effect on how they came across in friendly conversation after hours. They didn’t seem “dopey” or “slow”, they were witty and engaging and warm.
I’ve read the personal blogs of intellectually disabled people — people who, by definition, score poorly on IQ tests — and they don’t read as any less funny or creative or relatable than anyone else.
Whatever ability IQ tests and math tests measure, I believe that lacking that ability doesn’t have any effect on one’s ability to make a good social impression or even to “seem smart” in conversation.
If “human intelligence” is about reasoning ability, the capacity to detect whether arguments make sense, then you simply do not need human intelligence to create a linguistic style or aesthetic that can fool our pattern-recognition apparatus if we don’t concentrate on parsing content.
I also noticed, upon reading GPT2 samples, just how often my brain slides from focused attention to just skimming. I read the paper’s sample about Spanish history with interest, and the GPT2-generated text was obviously absurd. My eyes glazed over during the sample about video games, since I don’t care about video games, and the machine-generated text looked totally unobjectionable to me. My brain is constantly making evaluations about what’s worth the trouble to focus on, and what’s ok to tune out. GPT2 is actually really useful as a *test* of one’s level of attention.
This is related to my hypothesis in https://srconstantin.wordpress.com/2017/10/10/distinctions-in-types-of-thought/ that effortless pattern-recognition is what machine learning can do today, while effortful attention, and explicit reasoning (which seems to be a subset of effortful attention) is generally beyond ML’s current capabilities.
Beta waves in the brain are usually associated with focused concentration or active or anxious thought, while alpha waves are associated with the relaxed state of being awake but with closed eyes, before falling asleep, or while dreaming. Alpha waves sharply reduce after a subject makes a mistake and begins paying closer attention. I’d be interested to see whether ability to tell GPT2-generated text from human-generated text correlates with alpha waves vs. beta waves.
The first-order effects of highly effective text-generators are scary. It will be incredibly easy and cheap to fool people, to manipulate social movements, etc. There’s a lot of opportunity for bad actors to take advantage of this.
The second-order effects might well be good, though. If only conscious, focused logical thought can detect a bot, maybe some people will become more aware of when they’re thinking actively vs not, and will be able to flag when they’re not really focusing, and distinguish the impressions they absorb in a state of autopilot from “real learning”.
The mental motion of “I didn’t really parse that paragraph, but sure, whatever, I’ll take the author’s word for it” is, in my introspective experience, absolutely identical to “I didn’t really parse that paragraph because it was bot-generated and didn’t make any sense so I couldn’t possibly have parsed it”, except that in the first case, I assume that the error lies with me rather than the text. This is not a safe assumption in a post-GPT2 world. Instead of “default to humility” (assume that when you don’t understand a passage, the passage is true and you’re just missing something) the ideal mental action in a world full of bots is “default to null” (if you don’t understand a passage, assume you’re in the same epistemic state as if you’d never read it at all.)
Maybe practice and experience with GPT2 will help people get better at doing “default to null”?
51 thoughts on “Humans Who Are Not Concentrating Are Not General Intelligences”
I find your perspective fascinating and possibly unusual. First of all, most humans don’t see math as a core skill, but as an insular skill. When they learn math, they tend to use it in a special math context, whereas in social and real-world situations, they resort to informal thought, i.e. they don’t have a formal language that would allow them to analyze their mental representations outside of very narrow domains. (If you ask them to, they may sometimes accuse you of “reductionism”.) I have even seen this with some skilled mathematicians. That’s why poor math ability does not necessarily translate into poor general intelligence.
I also suspect that your difficulty identifying low-IQ subjects in conversation might not mean that it does not show up in conversation, but that you are unusual in the way you Turing test your fellow humans. I do think that humans do deep structure, but most of them with informal representations. Robin Hanson is going to encounter unusual behavior, because normies tend to “wing it” in STEM subjects, since they (perhaps correctly) perceive the benefit of building deep formal structure as too low for the cost.
Wrt GPT-2, I never expected it to discover deep structure, and would have been extremely surprised if it did. The big result is that GPT-2 captures literary style pretty well, while previous predictive text models did not go much beyond contextual vocabulary. Deep structure requires a unified world model.
Most humans seem to have a unified world model, but integrating complex subjects requires hard work. That is not only an issue of attention (like your disregard for video games). For instance, the relationship between intuitive geometry and arithmetic may seem trivial to a mathematician, but is arcane to most humans.
When I was talking about “math”, I didn’t mean the kind of math I studied in grad school, which I’m aware very few people have been privileged with the opportunity to learn.
I was talking about arithmetic, which I’d certainly expect all college-educated people to have been exposed to. Public school teachers generally do not know how to add fractions. Job applicants with advanced degrees in biology mostly cannot solve a word problem using probabilities (even when simplified to a binary “which quantity is bigger” question).
I do think arithmetic and the ability to follow chains of argument, which seem to be related though not identical, are core to “what it means to be human”, i.e. what distinguishes humans from almost all other mammals. Obviously humans do a whole lot of complex processing without using those skills — our ability to distinguish friend from foe is quite subtle and sensitive and seems to run more on pattern-recognition. But I expect other primates to be very good at distinguishing friend from foe as well.
I thought this was a very interesting angle to look at GPT-2 from, but one detail of your experience was extremely surprising to me. You said you can’t tell how smart people are from casual interaction, which was not just mildly surprising, but shocking to me.
I think that I always have been able to and that it’s been so obvious that I never would have considered that it was an active thing I was doing. For example, I wouldn’t say, “I’m very good at determining who is tall and who is short from casual interactions with people. If I’m introduced to a new social group, after an evening of chit chat I can reliably rank order people by height.” I haven’t asked about this before, but intuitively I feel like my friends can detect this and have a similar intuitive sense from casual interaction.
As a quick retroactive test, to make some effort to check that I don’t just have an illusion of competence, I queried my memory about situations where I was introduced to a new group of people (which happened several times because my family moved around in my youth). Typically I think I had a sense of the new class and new group of schoolmates within a day or so, and then when years later a school would track students on the basis of a thinly disguised IQ test the results would match my initial impressions. I can only recall a few big misses, and these tended to be quiet people who would maybe have been harder to judge. Take with a grain of salt though, selective memory is always a real risk.
It’s possible that I’m anomalous here. But it’s also possible you are. The prevalence of this casual detection ability (or of its absence) probably affects how we think about social changes in response to the need to detect bots.
I want to weakly second this. I don’t have anything good to go on — just that, some people really do come across as distinctly kind of dumb in conversation in a way that continues to be borne out later. Sometimes this can take a while to notice, sometimes it comes across immediately. It’s possible this is not nearly as reliable as I would intuitively expect, but the picture you describe, where dumb people can’t at all be distinguished from conversation, just doesn’t seem to be consistent with my experience.
You’re both right, and also missing the point. Of course you can form an impression of someone – what he’s saying is that impression does not correlate to being “smart” in an applied sense (math or otherwise), evidenced by his experience with interviews.
Thanks! I probably was assuming my experience was more universal than it is.
I suspect that some people, maybe including yourself, ask “probing questions” (a less formal version of what happened when I asked interviewees to solve a word problem) that distinguish people based on intelligence, whereas I don’t.
I can obviously tell, when talking about a particular topic, whether someone is knowledgeable or not about it. But topics of general interest, like personal life and family, are familiar to everybody, so I can’t distinguish people that way. And when I’m just trying to be friendly, I tend not to “push” people by bringing up things that might cause discomfort or conflict, so I’m less likely to see what ideas are complex enough to challenge their intellects.
I’ve already encountered enough *human* generated nonsense and bullshit that when something doesn’t quite make sense I tend to assume that it’s because it genuinely doesn’t make any sense rather than assume that it actually means something and I just don’t understand what.
(I have encountered the opposite feeling; the short story “A Good Man is Hard to Find” by Flannery O’Connor is definitely saying something, but damned if I know what!)
I’m pretty sure “A Good Man Is Hard To Find” is about Christianity.
The panic will set in when they can write comedy that makes me laugh.
So Trump’s tweets are actually generated by GPT-2, or a lower-level bot? (Gauging by level of coherence.) Covfefe, anyone?
While I appreciate the dig at Trump, no, the point is that *humans* who are unfocused are probably doing something very similar to a bot. Trump’s tweets are actually plausible as something simpler than GPT-2 — I’m not sure I could distinguish individual Trump tweets from the output of a simple Markov-chain or RNN. I’ve seen some people suspect he is senile, and that wouldn’t shock me. GPT2-level is more like the average Medium post, IMO. (Not that I think many Medium posts to date are literally bot-generated — but they might be next year.)
Can you give some examples of Medium posts (ideally randomly selected, but whatever) that read like GPT2 to you? GPT2 seems obviously internally incoherent in a way I rarely find, even online. I recognize that what I read online is probably unrepresentative, but this is still very far from my experience so I would like to better understand.
Doesn’t surprise me that someone with Trump Derangement Syndrome has trouble judging a person’s computing power after a short conversation. Left-leaning types tend to make snap judgments based on some kind of orthodoxy matrix that I don’t get. I spent a good chunk of my life in the art scene and working around office bugmen and every time (not some of the time, EVERY time) they assumed I was on the same page as them politically because I didn’t look like Hollywood’s archetype of an “evil redneck”. This is also why conservatives and people on the right are dealing with the very real fear that they live in a world filled with NPCs who cannot see anything outside of liberal orthodoxy. We’ve been catching people pretending to be sentient for quite some time and it’s depressing and scary. You guys have zero ability to gauge character, intentions, even basic intellect, and yet instead of trusting people who can, you just imitate characters on TV and try to put out a caring, socially aware vibe which isn’t even real.
Not quite correct: Publius Ovidius Naso (“the one with the nose”)
Looks like AI > human blogger already.
I don’t like the dig at the author. But if GPT-2 were really picking up that association, it would be really cool! 🙂
I now think of general intelligence as a stack of 3 distinct abilities, of increasing level of abstraction.
1 Perception – Pattern recognition
2 Action – Search and exploitation
3 Reasoning – Common-sense knowledge, world models, explanations
GPT-2 works on level (1) only, and I agree with your essay that this is also the most basic (default) level of human intelligence. This is what statistical modeling and neural nets do so well. But will statistical models work in *practice* (not just theory) for levels 2 and 3, or are entirely new paradigms of inference needed? This will really affect AGI time-lines. If it’s the former (just a matter of scaling up of statistical inference), AGI would probably be sooner (75 years).
Personally, I’m now more inclined to think that the practical limitations of statistical models will prevent them solving levels 2 and 3, and this would tend to push AGI time-lines right out, since most attention and effort is directed at statistical inference and neural networking currently.
So what approaches other than statistical models and neural nets might deal with level 3 (reasoning) and really solve NLP? The obvious place to look would be logic-based approaches – going back to some of the ideas of GOFAI. Is the meaning (semantics) of a proposition a set of possible worlds? This would point to modal logic. And homing in further, if world models are about causal reasoning (counter-factuals), this points to temporal logic. If we model time as a branching tree of possibilities (counter-factuals), we are led naturally in the direction of tree-adjoining grammars and branching-time logics.
A nudge in the right direction in the paper below….I’m trying to bring AGI time-lines back in – 😉
‘Controlled Natural Language With Temporal Features’ (Ayoade Olatunde Adeniyi , 2017)
You might enjoy my effort to create a nonverbal ‘language by association’ using neural nets to form photo pairs, partly for inducing logical and statistical thought and emotional awareness in children. A follow-on idea is force-feedback worry beads that ‘speak’ a high-dimensional ‘language of God’. See phobrain.com for work in progress. Here’s an early concept page for phobrain:
SCMETA_Ross.pdf