Strong AI Isn’t Here Yet

Epistemic Status: moderately confident. Thanks to Andrew Critch for a very fruitful discussion that clarified my views on this topic.  Some edits due to Thomas Colthurst.

I’ve heard a fair amount of discussion by generally well-informed people who believe that bigger and better deep learning systems, not fundamentally different from those which exist today, will soon become capable of general intelligence — that is, human-level or higher cognition.

I don’t believe this is true.

In other words, I believe that if we develop strong AI in some reasonably short timeframe (less than a hundred years from now or something like that), it will be due to some conceptual breakthrough, and not merely due to continuing to scale up and incrementally modify existing deep learning algorithms.

To be clear on what I mean by a “breakthrough”, I’m thinking of things like neural networks (1957) and backpropagation (1986) [ETA: actually dates back to 1974, from Paul Werbos’ thesis] as major machine learning advances, and types of neural network architecture such as LSTMs (1997), convolutional neural nets (1998), or neural Turing machines (2014) as minor advances.

I’ve spoken to people who think that we will not need even minor advances before we get to strong AI; I think this is very unlikely.

Predicate Logic and Probability

As David Chapman points out in Probability Theory Does Not Extend Logic, one of the important things humans can do is predicate calculus, also known as first-order logic. Predicate calculus allows you to use the quantifiers “for all” and “there exists” as well as the operators “and”, “or”, and “not.”

Predicate calculus makes it possible to make general claims like “All men are mortal”.  Propositional calculus, which consists only of “and”, “or”, and “not”, cannot make such statements; it is limited to statements like “Socrates is mortal” and “Plato is mortal” and “Socrates and Plato are men.”
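
To make the contrast concrete, here is a minimal Python sketch, assuming a tiny closed universe in which every individual is listed in advance. The names `universe`, `is_man`, and `is_mortal` are hypothetical, chosen only for illustration; nothing here addresses the harder case where the individuals are not known beforehand.

```python
# Hypothetical toy universe: every individual is known in advance.
universe = ["Socrates", "Plato", "Zeus"]
is_man = {"Socrates": True, "Plato": True, "Zeus": False}
is_mortal = {"Socrates": True, "Plato": True, "Zeus": False}

# Propositional style: one atomic statement per named individual.
socrates_is_mortal = is_mortal["Socrates"]
plato_is_mortal = is_mortal["Plato"]

# Predicate style: "for all x, man(x) implies mortal(x)".
all_men_are_mortal = all((not is_man[x]) or is_mortal[x] for x in universe)

# Predicate style: "there exists x such that not mortal(x)".
some_immortal_exists = any(not is_mortal[x] for x in universe)

print(all_men_are_mortal)    # True
print(some_immortal_exists)  # True
```

Over a finite, fully enumerated universe the “for all” collapses into a finite conjunction. The quantifier does real work only when the domain is not enumerated in advance.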

Inductive reasoning is the process of making predictions from data. If you’ve seen 999 men who are mortal, Bayesian reasoning tells you that the 1000th man is also likely to be mortal. Deductive reasoning is the process of applying general principles: if you know that all men are mortal, you know that Socrates is mortal.  In human psychological development, according to Piaget, deductive reasoning is more difficult and comes later — people don’t learn it until adolescence.  Deductive reasoning depends on predicate calculus, not just propositional calculus.
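
As a hedged worked example of the inductive half: if we assume a uniform prior over the unknown mortality rate (an assumption not stated above), Laplace’s rule of succession gives the probability that the next man is mortal. The function name `rule_of_succession` is illustrative.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Probability that the next trial succeeds, assuming a uniform
    Beta(1, 1) prior over the unknown success rate (an assumption)."""
    return Fraction(successes + 1, trials + 2)


# 999 men observed, all 999 mortal: how likely is the 1000th to be mortal?
p_next_mortal = rule_of_succession(999, 999)
print(p_next_mortal)  # 1000/1001
```

The answer is close to 1 but never reaches it. Induction of this kind yields a confident prediction about the next case, not the universal statement “all men are mortal.”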

It’s possible to view probability theory as an extension of propositional calculus. For instance, MIRI’s logical induction paper constructs a (not very efficient) algorithm for assigning probabilities to all sentences in a propositional logic language plus some axioms, such that the probabilities learn to approximate the true computed values faster than it would take to compute the truth of propositions.  For example, if we are given the axioms of first-order logic, the logical induction criterion gives us a probability distribution over all “worlds” consistent with those axioms. (A “world” is an assignment of Boolean truth values to sentences in propositional calculus.)

What’s not necessarily known is how to assign probabilities to sentences in predicate calculus in a way consistent with the laws of probability.

Part of why this is so difficult is that it touches on questions of ontology. To translate “All men are mortal” into probability theory, one has to define a sample space. What are “men”?  How many “men” are there? If your basic units of data are 64×64 pixel images, how are you going to divide that space up into “men”?  And if tomorrow you upgrade to 128×128 images, how can you be sure that the collection of “men” you construct from the new data is consistent with the old collection of “men”?  And how do you set up your statements about “all men” so that none of them break when you change the raw data?

This is the problem I alluded to in Choice of Ontology.  A type of object that behaves properly under ontology changes is a concept, as opposed to a percept (a cluster of data points that are similar along some metric.)  Images that are similar in Euclidean distance to a stick-figure form a percept, but “man” is a concept. And I don’t think we know how to implement concepts in machine-learning language, and I think we might have to do so in order to “learn” predicate-logic statements.

Stuart Russell wrote in 2014,

An important consequence of uncertainty in a world of things: there will be uncertainty about what things are in the world. Real objects seldom wear unique identifiers or preannounce their existence like the cast of a play. In the case of vision, for example, the existence of objects must be inferred from raw data (pixels) that contain no explicit object references at all. If, however, one has a probabilistic model of the ways in which worlds can be composed of objects and of how objects cause pixel values, then inference can propose the existence of objects given only pixel values as evidence. Similar arguments apply to areas such as natural language understanding, web mining, and computer security.

The difference between knowing all the objects in advance and inferring their existence and identity from observation corresponds to an important but often overlooked distinction between closed-universe languages such as SQL and logic programs and open-universe languages such as full first-order logic.

How to deduce “things” or “objects” or “concepts” and then perform inference about them is a hard and unsolved conceptual problem.  Since humans do manage to reason about objects and concepts, this seems like a necessary condition for “human-level general AI”, even though machines do outperform humans at specific tasks like arithmetic, chess, Go, and image classification.

Neural Networks Are Probabilistic Models

A neural network is composed of nodes, which take as inputs values from their “parent” nodes, combine them according to the weights on the edges, transform them according to some transfer function, and then pass along a value to their “child” nodes. All neural nets, no matter the difference in their architecture, follow this basic format.
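
The node description above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library’s API: the names `node_output` and `layer_output` are hypothetical, the sigmoid is just one possible transfer function, and bias terms are omitted because the description does not mention them.

```python
import math


def sigmoid(z: float) -> float:
    """One common choice of transfer function (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(parent_values, weights, transfer=sigmoid):
    """Combine parent values using the edge weights, then apply the transfer function."""
    z = sum(w * v for w, v in zip(weights, parent_values))
    return transfer(z)


def layer_output(parent_values, weight_rows, transfer=sigmoid):
    """Each row of weights defines one node that reads the same parent values."""
    return [node_output(parent_values, row, transfer) for row in weight_rows]


# Two input values feed two hidden nodes, which feed a single output node.
hidden = layer_output([1.0, 0.0], [[0.5, -0.5], [-1.0, 1.0]])
output = node_output(hidden, [1.0, 1.0])
print(hidden, output)
```

Different architectures change which nodes are wired to which parents and which weights are shared, but each node still performs this same combine-then-transform step.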

A neural network is, in a sense, a simplification of a Bayesian probability model. If you put probability distributions rather than single numbers on the edge weights, then the neural network architecture can be interpreted probabilistically. The probability of a target classification given the input data is given by a likelihood function; there’s a prior over the weights; and as data comes in, you can update to a posterior distribution over the weights, thereby “learning” the correct weights on the network.  Doing gradient descent on the weights (as you do in an ordinary neural network) finds a point estimate rather than a full posterior: a (local) maximum of the likelihood, or, if a regularizer plays the role of the prior, a (local) maximum of the posterior distribution over the weights in the Bayesian network paradigm.
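
The Bayesian reading above can be written out explicitly. This is a sketch under standard assumptions, with symbols that are not in the original text: $w$ denotes the weights and $D$ the training data.

```latex
\begin{align*}
p(w \mid D) &= \frac{p(D \mid w)\, p(w)}{p(D)} \\
w_{\mathrm{MAP}} &= \arg\max_{w} \big[ \log p(D \mid w) + \log p(w) \big] \\
w_{\mathrm{ML}} &= \arg\max_{w} \log p(D \mid w)
  \quad \text{(flat prior: } \log p(w) \text{ is constant)}
\end{align*}
```

Gradient descent on a loss equal to the negative log-likelihood looks for $w_{\mathrm{ML}}$. Adding a penalty such as $\frac{\lambda}{2}\lVert w \rVert^2$ corresponds to a zero-mean Gaussian prior and looks for $w_{\mathrm{MAP}}$. Either way the result is a single point, and in general only a local optimum, rather than the full posterior.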

What this means is that neural networks are simplifications or restrictions of probabilistic models. If we don’t know how to solve a problem with a Bayesian network, then a fortiori we don’t know how to solve it with deep learning either (except for considerations of efficiency and scale — deep neural nets can be much larger and faster than Bayes nets.)

We don’t know how to assign and update probabilities on predicate statements using Bayes nets, in a coherent and general manner. So we don’t know how to do that with neural nets either, except to the degree that neural nets are simpler or easier to work with than general Bayes nets.

For instance, as Thomas Colthurst points out in the comments, message passing algorithms don’t provably work in general Bayes nets, but do work in feedforward neural nets, which don’t have cycles. It may be that neural nets provide a restricted domain in which modeling predicate statements probabilistically is more tractable. I would have to learn more about this.

Do You Feel Lucky?

If you believe that learning “concepts” or “objects” is necessary for general intelligence (either for reasons of predicate logic or otherwise), then in order to believe that current deep learning techniques are already capable of general intelligence, you’d have to believe that deep networks are going to figure out how to represent objects somehow under the hood, without human beings needing to have conceptual understanding of how that works.

Perhaps, in the process of training a robot to navigate a room, that robot will represent the concept of “chairs” and “tables” and even derive general claims like “objects fall down when dropped”, all via reinforcement learning.

I find myself skeptical of this.

In something like image recognition, where convolutional neural networks work very well, there’s human conceptual understanding of the world of vision going on under the hood. We know that natural 2-d images generally are fairly smooth, so expanding them in terms of a multiscale wavelet basis is efficient, and that’s pretty much what convnets do.  They’re also inspired by the structure of the visual cortex.  In some sense, researchers know some things about how image recognition works on an algorithmic level.

I suspect that, similarly, we’d have to have understanding of how concepts work on an algorithmic level in order to train conceptual learning.  I used to think I knew how they worked; now I think I was describing high-level percepts, and I really don’t know what concepts are.

The idea that you can throw a bunch of computing power at a scientific problem, without understanding of fundamentals, and get out answers, is something that I’ve become very skeptical of, based on examples from biology where bigger drug screening programs and more molecular biology understanding don’t necessarily lead to more successful drugs.  It’s not in-principle impossible that you could have enough data to overcome the problem of multiple hypothesis testing, but modern science doesn’t have a great track record of actually doing that.

Getting artificial intelligence “by accident” from really big neural nets seems unlikely to me in the same way that getting a cure for cancer “by accident” from combining huge amounts of “omics” data seems unlikely to me.

What I’m Not Saying

I’m not saying that strong AI is impossible in principle.

I’m not saying that strong AI won’t be developed, with conceptual breakthroughs.  Researchers are working on conceptually novel approaches like differentiable computing and program induction that might lead to machines that can learn concepts and predicates.

I’m not saying that narrow AI might not be a very big deal, economically and technologically and culturally.

I’m not trying to malign the accomplishments of people who work on deep learning. (I admire them greatly and am trying to get up to speed in the field myself, and think deep learning is pretty awesome.)

I’m saying that I don’t think we’re done.

48 thoughts on “Strong AI Isn’t Here Yet”

  1. This paper presents an interesting, tightly-argued perspective on how current best deep learning models do not learn like humans, or as well as humans, in several (more or less) precisely definable aspects. I recommend it:
    Building Machines That Learn and Think Like People
    Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, Samuel J. Gershman
    https://arxiv.org/abs/1604.00289

  2. I agree with the conclusion here, but some of the arguments don’t make sense to me. In particular, I’m not convinced that Chapman’s “Probability theory does not extend logic” is correct. I’ll just paste here a comment I wrote on SSC on the matter a while back:

    So I looked at Chapman’s “Probability theory does not extend logic” and some things aren’t making sense. He claims that probability theory does extend propositional logic, but not predicate logic.

    But if we assume a countable universe, probability will work just as well with universals and existentials as it will with conjunctions and disjunctions. Even without that assumption, well, a universal is essentially an infinite conjunction, and an existential statement is essentially an infinite disjunction. It would be strange that this case should fail.

    His more specific example is: Say, for some x, we gain evidence for “There exist distinct y and y’ with R(x,y)”, and update its probability accordingly; how should we update our probability for “For all x, there exists a unique y with R(x,y)”? Probability theory doesn’t say, he says. But OK — let’s take this to a finite universe with known elements. Now all those universals and existentials can be rewritten as finite conjunctions and disjunctions. And probability theory does handle this case?

    I mean… I don’t think it does. If you have events A and B and you learn C, well, you update P(A) to P(A|C), and you update P(A∩B) to P(A∩B|C)… but the magnitude of the first update doesn’t determine the magnitude in the second. Why should it when the conjunction becomes infinite? I think that Chapman’s claim about a way in which probability theory does not extend predicate logic, is equally a claim about a way in which it does not extend propositional logic. As best I can tell, it extends both equally well.

    Similarly, I have to ask, is the part about logical induction actually correct? I can’t claim to have read the whole thing, but having read the introduction, I don’t see how it can be called “propositional”. It handles statements in predicate logic; is there some way in which it only treats them “propositionally”? It doesn’t look like it to me, though I’ll admit once again I haven’t read beyond the introduction…

    • From the paper:
      We generally use the symbols φ, ψ, χ to denote well-formed
      formulas in some language of propositional logic L (such as a theory of first order
      logic; see below), which includes the basic logical connectives ¬, ∧, ∨, →, ↔, and
      uses modus ponens as its rule of inference. We assume that L has been chosen so
      that its sentences can be interpreted as claims about some class of mathematical
      objects, such as natural numbers or computer programs. We commonly write S for
      the set of all sentences in L, and Γ for a set of axioms from which to write proofs
      in the language. We write Γ ⊢ φ when φ can be proven from Γ via modus ponens.

      The “sentences” which are being assigned probability values by the logical inductors are sentences in L, the propositional logic language. They are not proofs in the language, which would be predicate statements.

      Am I missing something?

      • Huh, looking again, it looks like you’re right. I honestly hadn’t noticed that earlier. (I said “the introduction”, I really meant “sections 1-3”, I forgot how it was split up, oops.) That’s surprising! Because I mean it assigns probability to arbitrary sentences! And discusses quantified stuff all the time! I guess it’s relying a lot on the separate prover process?

      • You quote:

        > We generally use the symbols φ, ψ, χ to denote well-formed
        formulas in some language of propositional logic L (such as a theory of first order
        logic; see below),

        Then you say:

        > The “sentences” which are being assigned probability values by the logical inductors are sentences in L, the propositional logic language. […] Am I missing something?

        Logical Inductors can use an L which is first-order logic, in which they learn about the structure of first-order proofs by observing the deductive process over time. They then make good guesses about what first-order sentences will be proved later, in a way guaranteed to eventually do as well as any poly-time conjecture-making algorithm. So, they’re eventually pretty good at first-order logic.

        What they cannot do is make the leap from assigning probabilities which approach 1 to all instances of some universal statement, to then believing the universal statement. This is called the Gaifman property, and implies a very strong kind of uncomputability; hence, is impossible to achieve in this kind of theory. This might seem like saying it “can’t do first order logic” in a way. But really, this is connected with the compactness theorem for first-order logic. No finite set of instances implies a forall statement, so it cannot be that *all* the instances imply the forall statement, either. The Gaifman property actually only makes sense in the context of second-order logic, where we can say that the set of natural numbers is the *least* set generated by the successor operation. Then, we really *should* conclude the universal statement from all the instances, because we *know* that’s all the instances. But, as I said, this becomes rather strongly uncomputable. It is not clear what kind of weaker property we might actually want, in order to model what humans are doing when forming beliefs expressible in second-order logic.

      • Thank you! This clarifies a lot.
        Can logical inductors jump to believing the universal statement with some probability that is *almost* 1?

      • > Thank you! This clarifies a lot. Can logical inductors jump to believing the universal statement with some probability that is *almost* 1?

        Yes, the probability will go up roughly under the circumstances when you’d expect it to go up. But, there will be a “gap”; it will approach some number less than 1, rather than approaching 1. How much less than 1? Unfortunately that’s totally arbitrary, based on the initial distribution of wealth among traders in the system. We can manipulate the belief in a target undecidable sentence to be high or low, by adding traders whose only job in life is to put money for/against those sentences.

    • (As someone pretty familiar with logical inductors) I agree that calling logical induction “propositional” seems wrong. Logical induction is more or less an extension of probability theory to allow us to meaningfully work with first-order logic while at the same time remaining computable (though not very feasibly computable!).

      A paper of mine (“Logical Prior Probability”) dealt with what we can do if we drop the computability requirement but keep it approximable. As you say, probability works just as well in this setting; it’s only that we have to drop computability requirements (because the undecidability of first-order logic makes it impossible to computably assign probability zero to contradictions). Fortunately, we can approximate it in a way which gets closer to a coherent distribution over time.

      What logical induction does which that kind of approach could never do is give some guarantees about the _way_ in which a coherent distribution gets approximated; it guarantees that good heuristic guesses will be made. So logical induction formalizes the way mathematicians can make conjectures, and illustrates that it necessarily goes beyond probability theory in a certain way (while also keeping as close to probability theory as possible).

      If anything, I would say that a better characterization of the limits of logical induction is that it doesn’t do _second-order_ logic. It doesn’t have notions of “finite” vs “infinite” like those which second-order logic can supply. It doesn’t come to believe in a standard model of the natural numbers.

  3. Do you have any ways to distinguish a high-level percept from a concept? Perhaps abstractions are an example, but I feel like one could argue they are a high-level percept of a sense that has access to “mind stuff”

  4. There seem to be two modes of modeling, and both are acquired through learning. The first one concerns policies or categories, based on conditions: to get this, do that. Poke your arm into this dark, muddy pond until you touch something, close your hand, pull it back: perhaps you will get a fish. The second one turns on the light: you can see the complete space, its dynamics and rules. Sure, you won’t have all the data all the time, and you may never stop learning new nuances, but you will be able to make sense of practically all new observations by mapping them onto something in that space. The first mode is a very partially specified open world, the second one is closed, at least with respect to some crucial dynamics. Perhaps there is a correspondence to propositional logic vs. FOL as you describe it?
    When does a system go from the first mode to the second? Do we need to build new architecture, or is general learning sufficient? I honestly don’t know. I cannot confidently see why adding an additional layer of abstraction over the dark-mode layer (after exhaustive learning) is insufficient to get to the next mode.

  5. “For example, if we are given the axioms of first-order logic”
    Is this correct? First-order logic is predicate logic, not propositional logic.

    • Yes, I think that’s what the paper says. You make sentences in propositional logic that agree with a set of axioms which are in predicate logic, for example the ones that define Peano arithmetic.

  6. If you believe that learning “concepts” or “objects” is necessary for general intelligence (either for reasons of predicate logic or otherwise), then in order to believe that current deep learning techniques are already capable of general intelligence, you’d have to believe that deep networks are going to figure out how to represent objects somehow under the hood, without human beings needing to have conceptual understanding of how that works.

    Perhaps, in the process of training a robot to navigate a room, that robot will represent the concept of “chairs” and “tables” and even derive general claims like “objects fall down when dropped”, all via reinforcement learning.

    I’m not sure if we’re talking about the same thing, but to me learning concepts from data seems like a very easy task – something that even matrix factorization is capable of:

    https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf

    Factorizing the Netflix user-movie matrix allows us to discover the most descriptive dimensions for predicting movie preferences. We can identify the first few most important dimensions from a matrix decomposition and explore the movies’ location in this new space. Figure 3 shows the first two factors from the Netflix data matrix factorization. Movies are placed according to their factor vectors. Someone familiar with the movies shown can see clear meaning in the latent factors. The first factor vector (x-axis) has on one side lowbrow comedies and horror movies, aimed at a male or adolescent audience (Half Baked, Freddy vs. Jason), while the other side contains drama or comedy with serious undertones and strong female leads (Sophie’s Choice, Moonstruck). The second factorization axis (y-axis) has independent, critically acclaimed, quirky films (Punch-Drunk Love, I Heart Huckabees) on the top, and on the bottom, mainstream formulaic films (Armageddon, Runaway Bride).

    The description of these concepts is done by humans (although I’m working on the problem of doing this automatically right now), but the concepts themselves are learned by simply factoring one user-movie-rating matrix into two.

    Word embedding models are equally simple, and equally capable of picking up concepts without human intervention:

    https://arxiv.org/abs/1607.06520

    A word embedding that represent each word (or common phrase) w as a d-dimensional word vector w∈R^d. Word embeddings, trained only on word co-occurrence in text corpora, serve as a dictionary of sorts for computer programs that would like to use word meaning. First, words with similar semantic meanings tend to have vectors that are close together. Second, the vector differences between words in embeddings have been shown to represent relationships between words [32,26]. For example given an analogy puzzle, “man is to king as woman is to x” (denoted as man:king :: woman:x), simple arithmetic of the embedding vectors finds that x=queen is the best answer because:
    →man − →woman ≈ →king − →queen
    Similarly, x=Japan is returned for Paris:France :: Tokyo:x. It is surprising that a simple vector arithmetic can simultaneously capture a variety of relationships. It has also excited practitioners because such a tool could be useful across applications involving natural language. Indeed, they are being studied and used in a variety of downstream applications (e.g., document ranking [27], sentiment analysis [18], and question retrieval [22]). However, the embeddings also pinpoint sexism implicit in text. For instance, it is also the case that:
    →man − →woman ≈ →computer programmer − →homemaker

    Sparse coding can figure out what music instruments are from just listening to the music, and then transcribe the notes:
    http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/AbdallahPlumbley05-tnn-a4.pdf

    We investigate a data-driven approach to the analysis and transcription of polyphonic music, using a probabilistic model which is able to find sparse linear decompositions of a sequence of short-term Fourier spectra. The resulting system represents each input spectrum as a weighted sum of a small number of “atomic” spectra chosen from a larger dictionary; this dictionary is, in turn, learned from the data in such a way as to represent the given training set in an (information theoretically) efficient way. When exposed to examples of polyphonic music, most of the dictionary elements take on the spectral characteristics of individual notes in the music, so that the sparse decomposition can be used to identify the notes in a polyphonic mixture. Our approach differs from other methods of polyphonic analysis based on spectral decomposition by combining all of the following: (a) a formulation in terms of an explicitly given probabilistic model, in which the process estimating which notes are present corresponds naturally with the inference of latent variables in the model, (b) a particularly simple generative model, motivated by very general considerations about efficient coding, that makes very few assumptions about the musical origins of the signals being processed, and (c) the ability to learn a dictionary of atomic spectra (most of which converge to harmonic spectral profiles associated with specific notes) from polyphonic examples alone; no separate training on monophonic examples is required.

    Autoencoders are basically designed to pick up on latent variables:
    http://multithreaded.stitchfix.com/blog/2015/09/17/deep-style/

    The beauty of these models is that the network learns how to represent the important features in an image (in our case attributes of style) without ever being explicitly told what those representations should look like. What’s more, this process has really given us back two models: one that takes in images and outputs a system of numbers we can use to help make our predictive algorithms on how well a given client will like an item of clothing, the other allows us to choose a random numerical description of clothing in the encoded space and then generate new images of styles yet to be seen.

    This is really delving into some deep waters. With models of this type we can effectively query our computer to design new clothing, and the results can be stunning!

    Heck, on some level, any clustering algorithm can be thought of as trying to uncover underlying concepts – it’s just usually not the most efficient way to do so.

    Given all that, I don’t see how learning concepts from data should be considered a hard problem.

    • See, I used to think that, and I no longer do.
      Clustering, sparse coding, dimension reduction, etc, reveal what I would now call *percepts* — key features or types that appear in the data, simplifications that explain a lot of the variance.

      These are not, in general, built to guarantee “safe” conversion between ontologies, as far as I know. Nor would they emerge naturally given broad uncertainty about ontologies.

      • What’s your definition of a concept then? To me “a model that explains a lot of variance in the data” seemed like the most natural explanation of what a map vs territory even means. I agree that the features models typically learn from the data are overly simplistic – most notably, unlike humans, who appear to be using some sort of a mental 3d model of the world, image recognition models appear to be working in terms of pixels, which is a really poor way to generalize anything. But I’m not sure if it’s the limitation of the algorithm or of the data we present to it. Humans appear to be quite bad at generalizing to unknown stimuli as well:
        http://www.newyorker.com/tech/elements/people-cured-blindness-see

        The philosopher William Molyneux, whose wife was blind, had proposed a thought experiment in the seventeenth century about a person, blind from birth, who could tell apart a cube and a sphere by touch: If his vision were restored and he was presented with the same cube and sphere, would he be able to tell which was which by sight alone?
        […]
        Sinha showed me a video in which a teen-age boy, blind since birth because of opaque cataracts, sees for the first time. The boy sits still and blinks silently, the room around him reflecting in his eyes as a kind of proof of their new transparency. Sinha believes these first moments for the newly sighted are blurry, incoherent, and saturated by brightness—like walking into daylight with dilated pupils—and swirls of colors that do not make sense as shapes or faces or any kind of object. “The moments immediately following bandage removal are not quite as ‘magical’ as Hollywood movies would have us believe,” Sinha told me. To answer Molyneux, then: No. A cube and a sphere are both lost in this confusion.

        So I’m not sure if humans are better at learning concepts from scratch, or they’re just exposed to a sufficiently wide variety of stimuli that the ontologies we have are the only efficient way to explain the variance.

      • I agree that the high-level percepts found by systems like autoencoders, word embedding models, and matrix factoring do not, by themselves, give you built-in safe conversion between ontologies, but I don’t think that this rules them out as “concepts.” I think that you get safe conversion between ontologies when lower-level concepts/percepts can be overridden or redirected by higher-level concepts/percepts (examples: an incorrectly recognized single letter is corrected in the context of an entire word, an incorrectly read single word is corrected in the context of an entire sentence, or an incorrectly interpreted situation is corrected in the context of a broader situation), and when at the very top of the concept/percept stack you have a form of goal-oriented behavior, which gets final say in terms of reinterpretation/correction.

  7. I think that we can definitely write algorithms that learn to see without the developers understanding sight at all; the idea “use a model with translational symmetry” uses only the very mildest biological inspiration. Note also that convnets are effective in domains without smoothness. (And convnets aren’t even necessary for simple vision problems, so at best this would be a claim about degrees.)
    It now looks likely that we can also learn to manipulate objects or walk with no understanding of physics or the algorithms underlying motor control. So it seems to me like the argument you are making needs to either (a) rely on some property of forming concepts that is not shared by these other tasks, or (b) come with an empirical prediction that we won’t have much luck at doing other tasks like manipulation. (That prediction is reasonably likely to be settled soon; I’m 50/50 on basically human-level manipulation in simulation within a few years.)
    I guess you argue that it is hard to construct a trainable computational architecture capable of forming concepts, while it is easy to construct such an architecture for e.g. manipulation. That seems plausible, but having confidence would probably require some argument.
    It’s weird to call a neural Turing machine a minor advance but differentiable computing conceptually novel; they seem like more or less the same thing.
    In some sense a neural network is like a probabilistic model, but that doesn’t mean that a neural network is implementing a human’s probabilistic reasoning according to that particular analogy.
    My post this weekend seems relevant, especially the first part of section V:
    https://sideways-view.com/2017/02/19/the-monkey-and-the-machine-a-dual-process-theory/

    • “It now looks likely that we can also learn to manipulate objects or walk with no understanding of physics or the algorithms underlying motor control.” Do you mean robots, or human babies? If the former, I’m surprised to hear that; do robots really learn motor control “from scratch”? If the latter, I think it’s false: there’s a lot of evidence of “intuitive physics” beginning in infants. I do think that manipulation requires conceptual thinking, and that babies learn some kind of conceptual thinking; they would have to, in order to intentionally seek and grab an object, use an object as a tool, or display object permanence. I do predict that, without conceptual advances, it will be hard to get robots to do this without hard-coding the behavior.

      I didn’t mean “minor advance” to be an insult. I think “minor advances” are examples of conceptually novel advances, distinguished only by the fact that they can occur several times in a decade instead of once every few decades the way “major advances” do. I’m not too married to my classifications.

      • Robots can’t learn very good motor control from scratch (in fact we can’t build very good robots at all, from scratch or otherwise). It looks quite plausible that they will be able to within the next few years; I was giving it a 50/50 probability.

        So can we chalk that up as a prediction of your perspective—that we won’t even have, say, rat-level motor control in simulation within the next few years?

        Whether or not learning manipulation turns out to be hard, I think it is unlikely that the hard parts will end up being things like intuitive physics or object permanence. My point about motor control was not that it requires conceptual reasoning, just that the argument in your post seems to apply essentially verbatim. (It also seems to apply nearly verbatim to vision or translation, though intuitively I can see why you might feel like those cases should be much easier. If you are instead accepting that analogy but claiming that we use some understanding of vision or translation, then I would be happy to disagree with you strongly.)

      • What is ‘very good motor control’ or ‘rat-level motor control’ here? Can you guys be more concrete? For example, in these 3 relatively recent papers, to what extent would you grant they are learning object manipulation with ‘no understanding of physics or the algorithms underlying motor control’ and how close to ‘very good motor control’ would you consider them?

        – “Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection” http://arxiv.org/pdf/1603.02199v1.pdf , Levine et al 2016; video: https://www.youtube.com/watch?v=cXaic_k80uM
        – “Deep Reinforcement Learning for Robotic Manipulation” https://arxiv.org/abs/1610.00633 , Gu et al 2016; video: https://sites.google.com/site/deeproboticmanipulation/ ; blog: https://research.googleblog.com/2016/10/how-robots-can-acquire-new-skills-from.html “How Robots Can Acquire New Skills from Their Shared Experience”
        – “Sim-to-Real Robot Learning from Pixels with Progressive Nets” https://arxiv.org/abs/1610.04286 , Rusu et al 2016b

      • By rat-level motor control, I mean the ability to solve robotics tasks as well as a rat can control its body (in simulation). I agree that’s hard to adjudicate, but it seems hard to make much more concrete without talking in detail about particular robotics tasks. Similarly for human-level control. We want to be able to learn to do any control task that a human could do without deliberation, if we help ourselves to a simulator + reward function + enough demonstrations. Analogously, it seems like we can now learn to do any perceptual task that a human can do without deliberation, given enough labelled data.

        I think the results you cited are good examples of learning motor control without any understanding of physics. There is a question of how far these techniques can get us, and another question of how far they can get us over the coming years (though it’s not clear to me why there would be some particular level of capability at which you need to understand the underlying domain). I think it’s very likely these techniques could get to human-level motor control eventually (e.g. if you used as much computational power as all of human evolution). I am uncertain about whether they can do so over the short term.

      • >So can we chalk that up as a prediction of your perspective—that we won’t even have, say, rat-level motor control in simulation within the next few years?

        I assume Sarah has given no explicit prediction? (Maybe she and Paul communicated privately.)

  8. A finite-sized recurrent neural network (which is really just a neural network tailored for timeseries data) can provably* represent any Turing-complete program. In light of this fact, it is actually irrelevant whether a neural network is “probabilistic” or not. A sufficiently clever person could encode *any* program, obviously including a program which expresses predicate calculus in some form, into the architecture of an RNN. It stands to reason that an RNN can probably learn some way of expressing predicate logic from appropriate training data.

    The fact that an RNN can encode any Turing machine is devastating to any argument that neural networks cannot in principle do some particular thing. This leaves only arguments that modern neural networks *probably* can’t do certain classes of things without unreasonably large quantities of training data. And even this leaves open the option of sudden architectural or algorithmic innovations which allow more efficient use of available training data.

    * http://binds.cs.umass.edu/papers/1992_Siegelmann_COLT.pdf
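As a very small illustration of what "encoding a program into the weights" means, here is a sketch of a hand-wired recurrent network of threshold units that tracks the parity of a bit stream. It is a toy finite-state machine and falls far short of the Turing-machine construction in the paper cited above; the weights, the two-layer-per-step layout, and all names are assumptions chosen for illustration. Nothing is learned: the weights are set by hand.

```python
def step(z):
    """Hard threshold activation: 1 if the weighted input is positive, else 0."""
    return 1 if z > 0 else 0


# Hand-chosen weights (illustrative, not learned).
# Each hidden unit is (weight on state h, weight on input x, bias).
W_HIDDEN = [
    (1.0, 1.0, -0.5),  # fires when h OR x
    (1.0, 1.0, -1.5),  # fires when h AND x
]
# Output unit: (weight on OR unit, weight on AND unit, bias).
W_OUT = (1.0, -1.0, -0.5)  # OR and not AND, i.e. XOR


def rnn_step(h, x):
    """One time step: the new state is XOR(old state, input)."""
    hidden = [step(wh * h + wx * x + b) for wh, wx, b in W_HIDDEN]
    w_or, w_and, bias = W_OUT
    return step(w_or * hidden[0] + w_and * hidden[1] + bias)


def parity(bits):
    """Feed the bits through the recurrence; the final state is their parity."""
    h = 0
    for x in bits:
        h = rnn_step(h, x)
    return h


print(parity([1, 0, 1, 1]))  # three ones, so the parity is 1
```

Whether gradient descent would find weights like these from data is a separate question from whether such weights exist, which is the distinction the comment draws between "in principle" and "probably, given the training data".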

    • Some of my former colleagues have built NN models of FOL. Arguably, they built a lot of the requisite structure into the networks to begin with, and the question would be if this could be learned from scratch, and how.

      We would be looking for a regularization that identifies necessary and sufficient conditions instead of statistical properties beyond a given threshold, which might require that the system uses language to build a conceptual graph and synchronize it with other speakers. Eventually, this does not seem to be very different from training the system to minimize a loss function wrt safe transportation of the learned categories between ontologies?

  9. I would disagree that we need extensions to the logic that we currently have. We can build general purpose computers using only ‘and’, ‘or’ and ‘not’ gates. We have general purpose programming languages capable of coding any logic that we want. What is lacking is the enabling software.
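The claim about ‘and’, ‘or’ and ‘not’ gates can be checked on a small scale. The sketch below builds an adder out of nothing but those three operations; the function names and the 8-bit default width are assumptions made for the example, and it is of course a long way from a general purpose computer.

```python
def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def NOT(a):
    return 1 - a


def XOR(a, b):
    """Exclusive or, composed only from AND, OR and NOT."""
    return AND(OR(a, b), NOT(AND(a, b)))


def full_adder(a, b, carry_in):
    """Add three bits; return (sum bit, carry out)."""
    partial = XOR(a, b)
    total = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return total, carry_out


def add(x, y, width=8):
    """Ripple-carry addition of two non-negative ints, modulo 2**width."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result


print(add(19, 23))  # 42
```

The gates are sufficient for the arithmetic, but everything interesting lives in how they are wired together, which matches the comment's point that the missing piece is software rather than new logic.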

  10. The logic in the “Neural Networks Are Probabilistic Models” section seems off to me. Isn’t it instead the case that precisely because neural networks are special cases of Bayes nets, it is possible that we will solve some problems (such as consistently assigning & updating probabilities to predicate statements) for them before we solve the problem in the general case? (And perhaps even already have?)

    To give just one relevant example: feed forward neural nets are acyclic, so message passing algorithms provably work for them, but not for arbitrary Bayes nets.

  11. You’re absolutely right. I’ve known for over 10 years, and despite yelling about the limitations of probability theory until I was blue in the face, I was totally ignored by the great ‘super-geniuses’ of (then) Singularity Institute (now MIRI). So please understand, I’m not exactly feeling particularly magnanimous toward these people, you know? (read: I think I know in very general terms how to solve concept learning, but my motivation to share the answer is close to zero).
    I don’t think Chapman’s argument is correct though – my guess is that probability theory DOES in fact extend deductive logic (it’s just that no one has fully figured out how to do it yet, but I do think the Bayesian paradigm will crack deduction in general).
    Where I think probability theory fails is in handling types of *non-monotonic logic*, where conclusions don’t strictly follow from premises – the class of failures of Bayes will (in my opinion) include things like reasoning by default and abductive reasoning. See my wikipedia page on “Categorization and Semantics” for my A-Z list of stuff I think is relevant to concept learning:
    https://en.wikipedia.org/wiki/User:Zarzuelazen/Books/Reality_Theory:_Categorization%26Semantics
    If you want to know how to solve concepts, then here’s my hints for you: Look carefully at currently little-known and obscure papers decades into the past of the AI field 😉 Remember how in AI winters past everyone had written off ‘neural nets’ , but suddenly decades later the ideas came roaring back from obscurity? The seeds of the right ideas for solving concepts are already there, somewhere buried in the distant past of the AI field, just biding their time to come roaring into the light….
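For readers unfamiliar with the term, "reasoning by default" can be shown with the textbook bird example. This is only a minimal sketch of what non-monotonic means, not a claim about how concept learning should work; the fact names and the `can_fly` rule are invented for the example.

```python
# Facts that block the default conclusion (an illustrative, made-up list).
EXCEPTIONS = {"penguin", "injured"}


def can_fly(facts):
    """Default rule: a bird is assumed to fly unless a known exception applies."""
    if "bird" not in facts:
        return False
    return not (facts & EXCEPTIONS)


print(can_fly({"bird"}))             # True: concluded by default
print(can_fly({"bird", "penguin"}))  # False: a new fact retracts the conclusion
```

In classical deduction, adding a premise can never remove a conclusion. Here it does, which is the sense in which the logic is non-monotonic.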

  12. >I’ve known for over 10 years, and despite yelling about the limitations of probability theory until I was blue in the face, I was totally ignored by the great ‘super-geniuses’ of (then) Singularity Institute (now MIRI).
    Yes, the probability will go up roughly under the circumstances when you’d expect it to go up.
