Epistemic Status: moderately confident
Edit To Add: It’s been brought to my attention that I was wrong to claim that progress in image recognition is “slowing down”. As classification accuracy approaches 100%, obviously improvements in raw scores will be smaller, by necessity, since accuracy can’t exceed 100%. If you look at negative log error rates rather than raw accuracy scores, improvement in image recognition (as measured by performance on the ImageNet competition) is increasing roughly linearly over 2010-2016, with a discontinuity in 2012 with the introduction of deep learning algorithms.
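A minimal sketch of that rescaling, assuming base-10 logarithms (the base only changes the slope) and using made-up accuracy figures rather than actual ImageNet results. The function name `neg_log_error` is mine, purely for illustration:

```python
import math


def neg_log_error(accuracy: float) -> float:
    """Return -log10(1 - accuracy); a gain of 1.0 means the error rate fell tenfold."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0")
    return -math.log10(error)


# Hypothetical accuracies in which the error rate halves at every step.
accuracies = [0.90, 0.95, 0.975]
scores = [neg_log_error(a) for a in accuracies]

# Raw accuracy gains shrink, while gains on the negative-log-error scale stay equal.
raw_gains = [b - a for a, b in zip(accuracies, accuracies[1:])]
log_gains = [b - a for a, b in zip(scores, scores[1:])]

print("raw gains:", raw_gains)
print("negative log error gains:", log_gains)
```

On these placeholder numbers the raw gain halves from one step to the next, but the gain in negative log error is the same each time, which is what "roughly linear" means on that scale.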
Deep learning has revolutionized the world of artificial intelligence. But how much does it improve performance? How have computers gotten better at different tasks over time, since the rise of deep learning?
In games, what the data seems to show is that exponential growth in data and computation power yields exponential improvements in raw performance. In other words, you get out what you put in. Deep learning matters, but only because it provides a way to turn Moore’s Law into corresponding performance improvements, for a wide class of problems. It’s not even clear it’s a discontinuous advance in performance over non-deep-learning systems.
In image recognition, deep learning clearly is a discontinuous advance over other algorithms. But the returns to scale and the improvements over time seem to be flattening out as we approach or surpass human accuracy.
In speech recognition, deep learning is again a discontinuous advance. We are still far away from human accuracy, and in this regime, accuracy seems to be improving linearly over time.
In machine translation, neural nets seem to have made progress over conventional techniques, but it’s not yet clear if that’s a real phenomenon, or what the trends are.
In natural language processing, trends are positive, but deep learning doesn’t generally seem to do better than trendline.
These are Elo ratings of the best computer chess engines over time.
There was a discontinuity in 2008, corresponding to a jump in hardware; this was Rybka 2.3.1, a tree-search-based engine without any deep learning or indeed probabilistic elements. Apart from that, progress looks roughly linear.
Here again is the Swedish Chess Computer Association data on Elo scores over time:
Deep learning chess engines have only just recently been introduced; Giraffe, originated by Matthew Lai at Imperial College London, was created in 2015. It only has an Elo rating of 2412, about equivalent to late-90’s-era computer chess engines. (Of course, learning to predict patterns in good moves probabilistically from data is a more impressive achievement than brute-force computation, and it’s quite possible that deep-learning-based chess engines, once tuned over time, will improve.)
(Figures from the Nature paper on AlphaGo.)
Fan Hui is a human player. AlphaGo performed notably better than its predecessors Crazy Stone (2008, beat human players in mini go games), Pachi (2011), Fuego (2010), and GnuGo, all MCTS programs, but without deep learning or GPUs. AlphaGo uses much more hardware and more data.
Miles Brundage has argued that AlphaGo doesn’t represent that much of a surprise given the improvements in hardware and data (and effort). He also graphed the returns in Elo rating to hardware by the AlphaGo team:
In other words, exponential growth in hardware produces only roughly linear (or even sublinear) growth in performance as measured by Elo score. To do better would require algorithmic innovation as well.
Atari game performance is scored relative to a human professional playtester: (Computer score – random play)/(Human score – random play).
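The normalization is simple enough to write down directly. This is a sketch of the formula as stated, with a hypothetical function name and invented example scores:

```python
def human_normalized_score(agent: float, human: float, random_play: float) -> float:
    """(agent - random) / (human - random): 0.0 is random play, 1.0 is the human tester."""
    if human == random_play:
        raise ValueError("human and random scores must differ")
    return (agent - random_play) / (human - random_play)


# Invented raw scores for a single game, for illustration only.
print(human_normalized_score(agent=50.0, human=100.0, random_play=0.0))
print(human_normalized_score(agent=250.0, human=100.0, random_play=0.0))
```

Anything above 1.0 on this scale counts as superhuman for that game, and an agent that does worse than random play gets a negative score.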
Compare to Elo scores: the ratio of expected scores for player A vs. player B is Q_A / Q_B, where Q_A = 10^(E_A/400), E_A being the Elo score.
Linear growth in Elo scores is equivalent to exponential growth in absolute scores.
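To see why, here is a small sketch of the Elo relationship above. The helper names `q` and `expected_score` are mine, and the ratings are arbitrary round numbers:

```python
def q(elo: float) -> float:
    """Q = 10^(E/400), the 'absolute' strength behind an Elo rating E."""
    return 10 ** (elo / 400)


def expected_score(elo_a: float, elo_b: float) -> float:
    """Expected score of A against B, so that E_A / E_B equals Q_A / Q_B."""
    qa, qb = q(elo_a), q(elo_b)
    return qa / (qa + qb)


# Ratings that grow linearly (+400 per step) give Q values that grow tenfold per step.
ratings = [2000, 2400, 2800]
strengths = [q(r) for r in ratings]
ratios = [b / a for a, b in zip(strengths, strengths[1:])]

print(ratios)
print(expected_score(2400, 2000))
```

Each fixed step in rating multiplies Q by a constant factor, so a straight line on an Elo chart is an exponential curve in Q.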
Miles Brundage’s blog also offers a trend in Atari performance that looks exponential:
This would, of course, still be plausibly linear in Elo score.
Superhuman performance at arcade games is already here:
This was a single reinforcement learner trained with a convolutional neural net over images of the game screen outputting behaviors (arrows). Basically it’s dynamic programming, with a nonlinear approximation of the Q-function that estimates the quality of a move; in Deepmind’s case, that Q-function approximator is a convolutional neural net. Apart from the convnet, Q-learning with function approximation has been around since the 90’s and Q-learning itself since 1989.
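For readers who haven't seen Q-learning, here is a minimal tabular sketch on a toy problem. It is not DeepMind's code: the corridor environment, the constants, and the function names are all hypothetical, and the table of Q-values stands in for the convolutional net that approximates the Q-function in the Atari work.

```python
import random

N_STATES = 5        # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = (-1, 1)   # step left or step right along the corridor
ALPHA, GAMMA = 0.5, 0.9


def step(state, action):
    """Toy deterministic environment: a corridor with a wall at state 0."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            # Act at random; Q-learning is off-policy, so it still learns greedy values.
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Move Q(s, a) toward reward + gamma * max over a' of Q(s', a').
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = nxt
            if done:
                break
    return q


q_table = train()
greedy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)
```

The learned greedy policy walks right toward the reward from every state. A table only works because this toy has five states; the deep learning contribution is replacing it with a function approximator that can cope with raw screen pixels.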
Interestingly enough, here’s a video of a computer playing Breakout:
It obviously doesn’t “know” the law of reflection as a principle, or it would place the bar near where the ball will eventually land, and it doesn’t. There are erratic jerky movements that obviously could not in principle be optimal. It does, however, find the optimal strategy of tunnelling through the bricks and hitting the ball behind the wall. This is creative learning but not conceptual learning.
You can see the same phenomenon in a game of Pong:
The learned agent performs much better than the hard-coded agent, but moves more jerkily and “randomly” and doesn’t know the law of reflection. Similarly, the reports of AlphaGo producing “unusual” Go moves are consistent with an agent that can do pattern-recognition over a broader space than humans can, but which doesn’t find the “laws” or “regularities” that humans do.
Perhaps, contrary to the stereotype that contrasts “mechanical” with “outside-the-box” thinking, reinforcement learners can “think outside the box” but can’t find the box?
Image recognition as measured by ImageNet classification performance has improved dramatically with the rise of deep learning.
There’s a dramatic performance improvement starting in 2012, corresponding to Geoffrey Hinton’s winning entry, followed by a leveling-off. Plausibly accuracy is an S-shaped curve.
How does accuracy scale with processing power?
This paper from Baidu illustrates:
The performance of a deep neural net follows an S-shaped curve over time spent training, but works faster with more GPUs. How much faster?
Each doubling in GPUs provides only a linear boost in speed. At a given time interval for training (as one would have in a timed competition), this means that doubling the number of GPUs would result in a sublinear boost in accuracy.
Using the performance data from Yann LeCun’s website, we can see that deep neural nets hugely improved MNIST digit recognition accuracy. The best algorithms of 1998, which were convolutional nets and boosted convolutional nets due to LeCun, had error rates of 0.7-0.8%. Within 5 years, that had dropped to 0.4%, within 10 years, to 0.39% (also a convolutional net), within 15 years, to 0.23%, and within 20 years, to 0.21%. Clearly, performance on MNIST is leveling off; the error rate took five years to halve and then another 15 years to halve again.
As with ImageNet, we may be getting close to the limits of deep-learning performance (which may easily be human-level.)
Before the rise of deep learning, speech recognition was already progressing rapidly, though it was leveling off in conversational speech at a word error rate well above 10%.
Then, in 2011, the advent of context-dependent deep neural network hidden Markov models produced a discontinuity in performance:
More recently, accuracy has continued to progress:
Nuance, a dictation software company, shows steadily improving performance on word recognition through to the present day, with a plausibly exponential trend.
Baidu has progressed even faster, as of 2015, in speech recognition on Mandarin.
As of 2016, the best performance on the NIST 2000 Switchboard set (of phone conversations) is due to Microsoft, with a word-error rate of 6.3%.
Machine translation is evaluated by BLEU score, which compares the translation to a reference translation via overlap in words or n-grams. BLEU scores range from 0 to 1, with 1 being a perfect match to the reference. As of 2012, Tilde’s systems had BLEU scores in the 0.25-0.45 range, with Google and Microsoft performing similarly but slightly worse.
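To give a feel for what the metric measures, here is a deliberately simplified BLEU-style score. It is a sketch only: it assumes a single reference, whitespace tokenization, n-grams up to length 2, and no smoothing, whereas standard BLEU uses up to 4-grams and is usually computed over a whole corpus. The function names are mine.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: an n-gram is credited at most as often as the reference has it.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are scaled down.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))
print(simple_bleu("the cat sat", "the cat sat on the mat"))
```

Note that a fluent translation that words things differently from the reference scores low here, which is one reason BLEU and human ratings can disagree.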
In 2016, Google came out with a new neural-network-based version of its translation tool. BLEU scores on English -> French and English -> German were 0.404 and 0.263 respectively. Human evaluations, however, rated the neural machine translations 60-87% better.
OpenMT, the machine translation contest, had top BLEU scores in 2012 of about 0.4 for Arabic-to-English, 0.32 for Chinese-to-English, 0.24 for Dari-to-English, 0.27 for Farsi-to-English, and 0.11 for Korean-to-English.
In 2008, Urdu-to-English had top BLEU scores of 0.32, Arabic-to-English scores of 0.48, and Chinese-to-English scores of 0.30.
Set against the 2012 figures, this doesn’t correspond to an improvement in machine translation at all. Apart from Google’s improvement in human ratings, celebrated in this New York Times Magazine article, it’s unclear whether neural networks actually improve BLEU scores at all. On the other hand, scoring metrics may be an imperfect match to translation quality.
Natural Language Processing
The Association for Computational Linguistics Wiki has some numbers on state of the art performance for various natural language processing tasks.
SAT analogies have been becoming more accurate over time, roughly linearly, until the present day when they are roughly as accurate as the average US college applicant. None of these are deep learning techniques.
Question answering (multiple choice of sentences that answer the question) has improved roughly steadily over time, with a discontinuity around 2006. Neural nets did not start being used until 2014, but were not a discontinuous advance from the best models of 2013.
Paraphrase identification (recognizing if one paragraph is a paraphrase of another) seems to have risen steadily over the past decade, with no special boost due to deep learning techniques; the top performance is not from deep learning but from matrix factorization.
On NLP tasks that have a long enough history to graph, there seems to be no clear indication that deep learning performs above trendline.
Trends relative to processing power and time
Performance/accuracy returns to processing power seem to differ based on problem domain.
In image recognition, we see sublinear returns to linear improvements in processing power, and gains leveling off over time as computers reach and surpass human-level performance. This may mean simply that image recognition is a nearly-solved problem.
In NLP, we see roughly linear improvements over time, and in machine translation, it’s unclear if we see any trends in improvements over time, both of which suggest sublinear returns to processing power, but I am not very confident in this.
In games, we see roughly linear returns to linear improvements in processing power, which means exponential improvements in performance over time (because of Moore’s law and increasing investment in AI).
This would suggest that far-superhuman abilities are more likely to be possible in game-like problem domains.
What does this imply about deep learning?
What we’re seeing here is that deep learning algorithms can provide improvements in narrow AI across many types of problem domains.
Deep learning provides discontinuous jumps relative to previous machine learning or AI performance trendlines in image recognition and speech recognition; it doesn’t in strategy games or natural language processing, and machine translation and arcade games are ambiguous (machine translation because metrics differ; arcade games because there is no pre-deep-learning comparison.)
A speculative thought: perhaps deep learning is best for problem domains oriented around sensory data? Images or sound, rather than symbols. If current neural net architectures, like convolutional nets, mimic the structure of the sensory cortex of the brain, which I think they do, one would expect this result.
Arcade games would be more analogous to the motor cortex, and perceptual control theory suggests that something similar to Q-learning may be going on in motor learning, though I’d have to learn more to be confident in that. If mammalian motor learning turns out to look like Q-learning, I’d expect deep reinforcement learning to be especially good in arcade games and robotics, just as deep neural networks are especially good in visual and audio classification.
Deep learning hasn’t really proven itself better than trendline in strategy games (Go and chess) or in natural language tasks.
I might wonder if there are things humans can do with concepts and symbols and principles, the traditional tools of the “higher intellect”, the skills that show up on highly g-loaded tasks, that deep learning cannot do with current algorithms. Obviously hard-coding rules into an AI has grave limitations (the failure of such hard-coded systems was what caused several of the AI winters), but there may also be limitations to non-conceptual pattern recognition. The continued difficulty of automating language-based tasks may be related to this issue.
Miles Brundage points out,
Progress so far has largely been toward demonstrating general approaches for building narrow systems rather than general approaches for building general systems. Progress toward the former does not entail substantial progress toward the latter. The latter, which requires transfer learning among other elements, has yet to have its Atari/AlphaGo moment, but is an important area to keep an eye on going forward, and may be especially relevant for economic/safety purposes.
I agree. General AI systems, as far as I know, do not exist today, and the million-dollar question is whether they can be built with algorithms similar to those used today, or if there are further fundamental algorithmic advances that have yet to be discovered. So far, I think there is no empirical evidence from the world of deep learning to indicate that today’s deep learning algorithms are headed for general AI in the near future. Discontinuous performance jumps in image recognition and speech recognition with the advent of deep learning are the most suggestive evidence, but it’s not clear whether those are above and beyond returns to processing power. And so far I couldn’t find any estimates of trends in cross-domain generalization ability. Whether deep learning algorithms can be general-purpose is perhaps a more theoretical question; what we can say is that recent AI progress doesn’t offer any reason to suspect that they already are.
40 thoughts on “Performance Trends in AI”
Some Atari numbers pre-deep-learning in the appendices of https://arxiv.org/abs/1207.4708
I’ve also plotted some Atari data spanning pre- and post-deep learning, and it looks reasonably smoothly exponential over a 5 year time period: https://twitter.com/Miles_Brundage/status/717607643863318529 That’s just for a handful of games, though, and there are probably differences in evaluation methods. Happy to share a (noisy) spreadsheet with that and other data if there’s interest.
Also, there have been a lot of discontinuities in particular Atari games, e.g. Montezuma’s Revenge going from basically no score to human-levelish in the past year, and it seems just as reasonable to look at those game-level trends as to look at MNIST. One could also, e.g. compare trends in games with hard vs. easy exploration, which seems related to your “arcade” vs. “strategy” distinction (drawing from the taxonomy in the appendix here: https://arxiv.org/abs/1606.01868).
P.S. while I think the investigation I mentioned at the end could be interesting, it may not answer the “how big of a deal is deep learning here” question as some of the recent discontinuities are related to adding new non-deep learning things onto systems that already use deep learning.
“This was a single reinforcement learner trained with a convolutional neural net over images of the game screen outputting behaviors (arrows). Basically it’s Bellman-Ford, with a nonlinear approximation of the Q-function that estimates the quality of a move” – Bellman-Ford is an algorithm that finds shortest paths in graphs, I think you mean “Q-learning” (which is based off the Bellman equation).
If it’s trained over images of the screen, how would it figure out reflection? It can’t know which way the ball is moving.
Timestamped images at each time increment can totally determine where the ball is moving, right? What am I missing?
The DQN agent is given several (downscaled grayscale) screen images, hence it can implicitly learn a limited amount of motion. (That said, it’s kind of impressive such a crippled agent with no memory can do as well as it does; but apparently RNNs are challenging to train.) Certainly nothing as silly as adding a timestamp and expecting it to learn to count by OCR on blurry images.
Thanks for your post.
On MNIST: It’s probably worth noting that this is a very easy task. The error rates are tiny. There are 10,000 test points. The error numbers you report are all fractions of 1%, so the actual error rates of five years ago and today are not really .23 and .21, but .0023 and .0021. Note that this means they are only getting 23 and 21 examples wrong out of 10,000, so it’s not at all convincing that this difference is a real phenomenon. Additionally, when there are so few errors, you can go look at the errors, and a lot of them are very badly drawn, hard to tell maybe-this-digit maybe-that-digits. ML researchers use MNIST as a unit test for rejecting obviously bad methods quickly; reporting top accuracy on original MNIST is a bit of an ongoing joke. Treating MNIST and ImageNet in the same way isn’t really appropriate; qualitatively, deep learning barely improves the state-of-the-art for MNIST, whereas it’s an enormous discontinuous jump for ImageNet.
For speech recognition, it’s hard to compare results, because the big companies [I can speak only of Google, where I work] do most of their work on very large, proprietary data sets that are not shared across companies; the shared benchmarks are tiny by comparison and don’t lead to state-of-the-art recognizers for real tasks. I agree with your conclusion.
Overall, I agree strongly with the general points of your post, and predict most ML researchers would do likewise.
“BLEU scores range from 0 to 1, with 1 being perfect translation… BLEU scores on English -> French and English -> German were 40.4 and 26.30 respectively.” – these second numbers are hundredths of BLEU scores, right?
On “performance” for games, it’s worth noting that we don’t actually know how this relates to other things we’re actually interested in. Intuitively, Elo scores seem more like what we care about than exponentiated Elo scores (or raw Atari scores); every time your Elo score goes up by 400, your odds of winning are multiplied by 10:1. So if you were starting against an exactly equal opponent (so their chance of beating you is 50%), your first 400-point increment reduces their chances to 9.1%, your second 400-point gain reduces their chances to 1%, and so on.
In the Elo formula Q_A = 10^(E_A/400), where odds of A winning vs B are Q_A:Q_B, we can think of each game as a lottery in which each player X gets Q_X tickets.* One ticket is drawn, and the player whose ticket is drawn wins. So if player A has a score of 1600 and player B has a score of 2000, player A gets 10,000 tickets and player B gets 100,000 tickets for 1:10 odds. Apparently “exponential” progress in “absolute” Elo scores is offset by the diminishing marginal effectiveness of lottery tickets.
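The arithmetic in the two paragraphs above can be checked with a short sketch. The function names are hypothetical; the formula is the one quoted from the post:

```python
def tickets(elo: float) -> float:
    """Lottery tickets held by a player: Q = 10^(E/400)."""
    return 10 ** (elo / 400)


def opponent_win_chance(lead: float) -> float:
    """Chance that an opponent rated `lead` points below you wins a game."""
    return 1.0 / (1.0 + 10 ** (lead / 400))


print(tickets(1600), tickets(2000))
for lead in (0, 400, 800, 1200):
    print(lead, round(100 * opponent_win_chance(lead), 2))
```

Each 400-point gain multiplies your ticket count by ten, but the opponent's chance only falls from 50% to about 9.1%, then about 1%, then about 0.1%, so each tenfold increase buys a smaller absolute change in outcomes.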
Exponential progress in buying lottery tickets is a pretty intuitive fit for what’s going on with lookahead search, where more computational power lets you explore more and longer branches, but with diminishing marginal probability of relevance, since you’ll explore the best ones first.
AlphaGo seems relevantly different mainly because most of its capacity was precomputed. The most impressive sentence to me in the AlphaGo paper was this one: “Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play.” This is powerful because it means that if you have to perform the same sort of complex evaluations on many instances, you can do the learning once instead of over and over.
I haven’t found any info on how much computation went into building the policy and value networks – or how expensive it was to run AlphaGo without lookahead – but that seems highly relevant in evaluating AlphaGo. Obviously the former will be a lowball estimate because it doesn’t count the computing power of the humans working on AlphaGo.
* I owe this analogy to Toby Ord, though I may have introduced errors.
Related to this, it is very difficult to map any kind of trend onto progress in many of the domains listed in the post. Something like MNIST or human languages is going to have an inherent “difficulty structure” where some of the training/testing examples are just intrinsically more difficult or even practically impossible. The distribution of difficulty in the training set will dictate whether performance on the test appears to increase linearly or exponentially. If you show a human *or* a machine a blurred smudge with no context, don’t be surprised if both systems fail to recognize it. If you play a garbled sequence of semantically ambiguous words for a human *or* a machine, it won’t be their fault if they misinterpret it. If the hardest 5% of examples are garbage, then that last 5% is going to make it look like the machine’s improvement rate is slowing down.
Another issue that stands in the way of attaining generality is simply how the systems are currently being trained. That is to say, we’re only training them *for* narrow applications. We know that machines can take into account huge amounts of context if that context is fed into their training data. If that context remains outside their training data, we shouldn’t be surprised when they consistently give answers that seem dumb to us because we have access to the full context. An example would be word recognition. If you fed the system metadata about the gist of what a conversation was supposed to be about, I feel confident that word recognition would improve. In fact, humans go into almost every conversation already knowing what to expect. (And have you ever been unexpectedly spoken to in public and found that you couldn’t understand the first sentence spoken to you? Even humans need to calibrate on new voices and new contexts.) Machines may be approaching something like a theoretical maximum performance on word recognition-without-context.
Finally, I wanted to comment that DeepMind released a few papers in 2016 on “one-shot” learning of symbols.
What about the argument that these tasks get harder as the error rates get smaller? The hard-to-classify cases are what is left to improve on. And on a related note, why is absolute reduction the metric of choice? It seems to make some hand-wavy sense to me that we look at relative reductions in error – because this captures how much of what is left has been solved. If anything, intuitively I would think that even relative reduction underexplains the increase in difficulty as you get near the bayes rate in a noisy dataset.
Taking imagenet as an example, here is a picture with the actual numbers on it – https://qph.ec.quoracdn.net/main-qimg-3841c6dd04ae33398d5e6743f1072f69?convert_to_webp=true
So while imagenet top-5 error is leveling off in absolute reductions, the relative reductions are still quite impressive.
There was a 36% RR in 2012.
28% in 2013
42% in 2014
46% in 2015
Framed like that, these results are much more impressive. This also corresponds with the experience of applied practitioners – resnets and so on really do outperform alexnets/vggnets significantly in the metrics we care about.
I haven’t seen any theoretical justification for arguing one way or the other (absolute vs relative improvements) but I am doubtful that we can just take absolute improvement as the one true metric.
I probably do agree with the general thrust of your post, but this seems to be a weak foundation to the argument that I would like to understand more about.
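A quick sketch of the absolute-versus-relative distinction, using invented error rates (not the ImageNet numbers from the linked picture) and a hypothetical helper name:

```python
def relative_reduction(previous_error: float, current_error: float) -> float:
    """Fraction of the previous error that was eliminated."""
    if previous_error <= 0.0:
        raise ValueError("previous_error must be positive")
    return (previous_error - current_error) / previous_error


# Invented error rates that halve every year.
errors = [0.20, 0.10, 0.05]
absolute = [prev - cur for prev, cur in zip(errors, errors[1:])]
relative = [relative_reduction(prev, cur) for prev, cur in zip(errors, errors[1:])]

print("absolute reductions:", absolute)
print("relative reductions:", relative)
```

On these made-up numbers the absolute reduction halves from one year to the next while the relative reduction stays at 50%, so the same data can look like a slowdown or like steady progress depending on which metric you plot.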
That’s a good point and I’ll have to think about that.
Nice post. I was reviewing the ILSVRC 2016 challenge and also noted the diminishing returns in improvement vis-à-vis prior years’ more spectacular results – AFAIK this year is iterative/refining with ensembling. I read an interesting comment recently (indirectly attributed to Andrej Karpathy – can’t find link) that suggested that the asymptotic results on ILSVRC may be due to the way ImageNet was labeled – via AMT (Amazon Mechanical Turk). In other words, the sub-3% top-5 error rate may be approaching ‘ground truth’ for the database, with some of the images incorrectly labeled by the humans who labeled via AMT. (Remember the estimate of human labeling accuracy is 5% error.)
Perhaps it is time for a new image database for such challenges (maybe less heavy on the Dogs this time).
What I think is perhaps more interesting and, unfortunately, less frequently reported is the top 1 score – the “did I classify it correctly score” which has jumped from about 50% in 2011 via HOG/SVM methods to either 81% or 85% in the 2016 challenge (seen different #’s). Getting it completely right even 80% of the time is pretty fantastic – perhaps there is more fine-tuning work to be done here to get the top-1 score up to the 95-97% category on this database.
The 5% comes from https://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/ https://arxiv.org/abs/1409.0575 if anyone is wondering.
Good analysis, thanks.
There’s an ongoing debate w/in AI community whether DL approaches can be taken to AGI; Gary Marcus and some others are skeptics (arguably Andrew Ng is a skeptic, despite being a foremost DL proponent).
The argument for “yes” is I think along the lines of this:
with some small-scale examples of “this sort of thing” here:
Another area worth keeping in mind as a sign of progress is Q&A performance, as it clearly requires human-level abstraction.
> A speculative thought: perhaps deep learning is best for problem domains oriented around sensory data? Images or sound, rather than symbols. If current neural net architectures, like convolutional nets, mimic the structure of the sensory cortex of the brain, which I think they do, one would expect this result.
I’m reminded of this bit from Ben Goertzel’s essay “Are there Deep Reasons Underlying the Pathologies of Today’s Deep Learning Algorithms?“:
> The point of hierarchical architectures for visual and auditory data processing is mainly that, in these particular sensory data processing domains, one is dealing with information that has a pretty strict hierarchical structure to it. It’s very natural to decompose a picture into subregions, subsubregions and so forth; and to define an interval of time (in which e.g. sound or video occurs) into subintervals of times. As we are dealing with space and time which have natural geometric structures, we can make a fixed processing-unit hierarchy that matches the structure of space and time: lower-down units in the hierarchy dealing with smaller spatiotemporal regions; parent units dealing with regions that include the regions dealt with by their children; etc. For this kind of spatiotemporal data processing, a fairly rigid hierarchical structure makes a lot of sense (and seems to be what the brain uses). For other kinds of data, like the semantics of natural language or abstract philosophical thinking or even thinking about emotions and social relationships, this kind of rigid hierarchical structure seems much less useful, and in my view a more freely-structured architecture may be more appropriate.
> In the human brain, it seems the visual and auditory cortices have a very strong hierarchical pattern of connectivity and information flow, whereas the olfactory cortex has more of a wildly tangled-up, “combinatory” pattern. This combinatory pattern of neural connectivity helps the olfactory cortex to recognize smells using complex, chaotic dynamics, in which each smell represents an “attractor state” of the olfactory cortex’s nonlinear dynamics (as neuroscientist Walter Freeman has argued in a body of work spanning decades). The portions of the cortex dealing with abstract cognition have a mix of hierarchical and combinatory connectivity patterns, probably reflecting the fact that they do both hierarchy-focused pattern recognition as we see in vision and audition, and attractor-based pattern recognition as we see in olfaction. But this is largely speculation; most likely, until we can make movies somehow of the neural dynamics corresponding to various kinds of cognition, we won’t really know how these various structural and dynamical patterns come together to yield human thinking.
> My own view is that for anything resembling a standard 2015-style deep learning system (say, a convolutional neural net, stacked autoencoder, etc.) to achieve anything like human-level intelligence, major additions would have to be made, involving various components that mix hierarchical and more heterogeneous network structures in various ways. For example: Take “episodic memory” (your life story, and the events in it), as opposed to less complex types of memory. The human brain is known to deal with the episodic memory quite differently from the memory of images, facts, or actions. Nothing, in currently popular architectures commonly labeled “deep learning”, tells you anything about how episodic memory works. Some deep learning researchers (based on my personal experience in numerous conversations with them!) would argue that the ability to deal with episodic memories effectively will just emerge from their hierarchies, if their systems are given enough perceptual experience. It’s hard to definitively prove this is wrong, because these models are all complex dynamical systems, which makes it difficult to precisely predict their behavior. Still, according to the best current neuroscience knowledge, the brain doesn’t appear to work this way; episodic memory has its own architecture, different in specifics from the architectures of visual or auditory perception. I suspect that if one wanted to build a primarily brain-like AGI system, one would need to design (not necessarily strictly hierarchical) circuits for episodic memory, plus dozens to hundreds of other specialized subsystems.
There are multiple statements like “deep learning doesn’t generally seem to do better than trendline” that seem confusing, like assuming some magical trendline systems that are able to get better. In many of those domains, there is no “trendline” without deep learning – if you skip the non-NN approaches, then the line for the past few years would be nearly flat as the other methods were at a dead end for those domains, significant improvements (i.e. continuing the trendline) weren’t possible even with increasing computational power.
I agree. In the case of AlphaGo, it does make sense to project out the trendline to try to argue that ‘AlphaGo is not that special and not that discontinuous’ because there is a real viable alternative which can turn computing power into better performance – just do even more MCTS, it is scalable & parallelizable. Likewise for chess, the existing methods were already superhuman so obviously they work. (Also, why would anyone bother?) But for all the other domains, whether ImageNet or ALE or NLP, what alternative is there? (eg in the chart of ALE borrowed from Brundage, correct me if I’m wrong, but isn’t every single datapoint there a deep RL system and there are no benchmarks of earlier systems because no one ever tried, RL performance was so bad? I’m reminded of a comment on Reddit some time ago when I suggested using a RL approach for something: “Oh yeah, I forgot, reinforcement learning works now”.) I mean, suppose someone hands you a workstation with 32 Xeon cores but forbids you to use deep learning; how exactly does one go about turning that into deep learning equivalent results? The trendline says you should be able to, if deep learning isn’t important on its own!
“Deep learning matters, but only because it provides a way to turn Moore’s Law into corresponding performance improvements, for a wide class of problems.”
This reflects some very odd thinking and the tyranny of the outside view. (…“Silicon transistors matter compared to vacuum tubes, but only because they provide a way to turn Moore’s Law into corresponding performance improvements, for a wide class of problems.”…) Exponential trends in performance, particularly Moore’s Law, are driven by waves of sigmoids as new more powerful methods or paradigms supersede the old one and are able to keep progress going; and when no successors can be found, the trend collapses. The trend has no existence apart from the individual sigmoids. If earlier AI methods cannot continue improving as they are applied on bigger hardware, then the trend would have collapsed in the absence of deep learning, and deep learning is indeed a powerful and important trend.
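To make the “waves of sigmoids” point concrete, here is a minimal sketch rather than a model of any real technology. It assumes each paradigm is a logistic curve, that each successor arrives a fixed interval after the last, and that each one saturates at twice the ceiling of its predecessor. The function names and all parameter values (`spacing`, `growth`, `steepness`) are illustrative assumptions, not measured quantities.

```python
import math

def sigmoid_wave(t, midpoint, ceiling, steepness=1.0):
    """One paradigm: a logistic curve that saturates at `ceiling`."""
    return ceiling / (1.0 + math.exp(-steepness * (t - midpoint)))

def stacked_trend(t, n_paradigms, spacing=5.0, growth=2.0, steepness=1.0):
    """Total performance at time t from n_paradigms successive sigmoids.

    Assumption: paradigm k becomes half-saturated at t = k * spacing and
    tops out at growth**k, so each successor doubles the previous ceiling.
    """
    return sum(
        sigmoid_wave(t, k * spacing, growth ** k, steepness)
        for k in range(n_paradigms)
    )

def growth_ratio(n_paradigms, t0=25.0, t1=30.0):
    """How much the aggregate trend grows between t0 and t1."""
    return stacked_trend(t1, n_paradigms) / stacked_trend(t0, n_paradigms)

if __name__ == "__main__":
    # With successors still arriving, the sum keeps roughly doubling per interval.
    print("successors keep arriving:", growth_ratio(n_paradigms=10))
    # With no successor after the fourth paradigm, the trend has gone flat.
    print("no successor found:      ", growth_ratio(n_paradigms=4))
```

Under these assumptions the aggregate looks exponential only while new sigmoids keep arriving. Truncate the list of paradigms and the same sum goes flat, which is the sense in which the trend has no existence apart from the individual sigmoids.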
In a way this is worse (or better, if one is a pessimist) than it seems. Demis Hassabis told me that he expects general AI to be a sort of neural net (augmented with memory or something like that, as they did in a recent paper). If that is not the way forward, then solving general intelligence is even more out of reach.
The games chosen have rather small sets of rules, and they are very simplistic twitchy games. Other games, such as the turn-based strategy games of the Civ series, do not appear on the list, and judging from the in-game AI, I’m going to assume that progress is less robust in that area.
Strategy game AIs, whether RTS or turn-based, are famous for cheating. To make the levels harder, the AI cheats more blatantly rather than playing more intelligently. I’m surprised that, at this point, gaming studios like Firaxis haven’t used a learning AI to play more intelligently.
Another place that seems to be a natural fit is MMOs. There would be a massive set of data available. NPCs and mobs theoretically should be able to learn, keeping the game far more fresh and entertaining.
I have never worked in game development or AI, so all my comments should be taken with a massive grain of salt.
There’s very little progress in games like Civ, because there’s very little incentive for anyone to research it. The reason Firaxis etc. don’t invest more in developing AI is that most players are okay with a blatantly cheating AI. It’s more profitable to invest in developing interesting game mechanics etc. than to put those resources into developing better AI (or so Firaxis thinks, but they’re probably not wrong).
I don’t agree. The lack of a decent AI is constantly hit on in reviews. Civ VI is being torn into on Steam over this. I think if AI could learn in more complex gaming environments, the money would be there.
I doubt it is a lack of incentive; I think it is a lack of ability on the part of the developers. I’m surprised that major companies that are on rev6 haven’t built up enough data points to construct a decent AI. It is a single type of game that first came out in 1991. Twenty-six years later, the AI is so bad that fans of the series won’t buy the latest game.
Civ is much like chess, which is why I point it out. You have to arrange the board in your favor, balance attack and defense, and change strategy in response to the opponent. It is more complex in the ways things can go wrong, but one thing naturally leads to the next. It seems odd to me that learning AIs haven’t become a part of games.
It appears that the gaming studios haven’t developed the in-house skills. I think that if AI were capable, the first studio to incorporate it effectively would have an advantage.
One thing about BLEU – my read on the conventional wisdom in MT is that they think it’s a proxy for actual translation quality only for the purpose of comparing two similar systems/architectures on the same dataset. They’ve noted this not just for NMT vs. SMT but in other settings (e.g. transfer vs. SMT, a long while before the newer NMT stuff). So there’s good justification for what Google did in relying on human evaluation for comparing NMT versus their previous SMT system.
It would be interesting to try to reconstruct historical trends in MT performance from, say, the NIST human evaluations they’ve done for many years, though it would be (very? impossibly?) tricky to deal with different human evaluators and such. I don’t know enough about the details of these evaluations to assess the feasibility of this.
Thanks for the analysis! Those were interesting videos of Pong and Breakout. I wonder if they’d approximate human performance better if the researchers were to build in sensorimotor coordination constraints.
(So for example, we’re often better at smooth motions than jerky ones, we need a certain amount of time to react to visual information, we often pre-plan sequences of motor movements which then take time/effort to abort, etc.) The somewhat successful responses that the artificial agents exhibit in the videos would be difficult for a human to implement. But if only sequences that humans *could* implement were allowed, perhaps the best remaining strategies to emerge might resemble (for example) “understanding the law of reflection”. Or look more like predictions of straight trajectories AFTER the reflection has happened. (I say “look like” and “resemble” because it’s not clear what more there is to understanding those things in the very limited context of these games).
An excellent survey. One of my companies works in text machine learning, and I often say, “We’re not traditional rules-based NLP… and we’re not deep learning.” One set of problems we’re good at is multitag classification: looking at a large human-generated text document and picking the best handful of tags out of a library of 1000s or 10,000s of possibilities. Note that this is neither machine translation, where the output is roughly equivalent in size and complexity to the input, nor feature extraction (i.e., word search). This is a very, very difficult problem in AI. Rule-based systems are at least O(n^2) in the size of the tag library. And deep learning – if it ever tries – will have to simultaneously tackle parsing (including negation) and scale-out. Since we’re not from either technology base, we don’t have those problems!
Have a look at multitag classification. I wonder how it compares with the others in your article. Maybe we’ve missed something.