Epistemology Sequence, Part 2: Concepts

What are the “things” in our world?

A table is not a “raw” piece of sense data; it is a grouping of multiple sensory stimuli into a single, discrete object. (The concept which the word “table” refers to is even more general, since it includes all instances of individual tables.)

We do not perceive the world in terms of raw sense data; we process it a lot before we can even become conscious of it. Unmediated sensory perception would have no structure; it would be like William James’ “blooming, buzzing confusion,” his phrase for a baby’s sensory experience.

James was wrong on the facts: even babies do not have literal unmediated perceptions.  There is nowhere in the brain that represents a photograph-like picture of the visual field, for instance.  But we do know that object recognition can break down to some degree in humans, yielding examples of people who lack some higher-level sensory processing. Mel Baggs writes about having to consciously and effortfully recognize objects, as a result of autism.  Agnosia is the general term for the inability to recognize sensory phenomena; there are many agnosias, like the inability to distinguish visual shapes, or to distinguish speech from non-speech sounds.  It’s clear that organizing sensory data into discrete objects (let alone forming abstractions from types of objects and their properties) is a nontrivial operation in the brain. And, indeed, image and speech recognition remain ongoing, unsolved areas of machine learning research.

Visual object recognition is currently believed to be well modeled by a hierarchical neural net, shaped like a tree. The lowest leaves on the tree, known as simple cells, recognize local features of the image: say, a (convolution with a) particular line segment, in a particular (x, y) coordinate position, at a particular angle.  Higher levels of the tree integrate multiple nodes from lower on the tree, producing detectors for more complex features (shapes, patterns, boundaries, etc.)  Higher features have invariance properties (the shape of the numeral 2 is recognizable even if it’s translated, rotated, scaled, written in a different color, etc.), which come from integrating many lower features that differ in the “irrelevant” properties like location or color.  Near the top of the tree, we can get as far as having a single feature node for a particular type of object, like “dog.”
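To make the “simple cell” idea concrete, here is a minimal sketch in Python/numpy (the filter values and test image are made up for illustration, not any actual model of V1): a detector that responds to a short vertical line segment at one particular position in the image.

```python
# A "simple cell" as a local filter: it responds to a short vertical
# line segment at one particular (row, col) position in the image.
# Filter values and the test image are made up for illustration.
import numpy as np

def simple_cell(image, filt, row, col):
    """Dot-product of one local image patch with an oriented filter."""
    h, w = filt.shape
    patch = image[row:row + h, col:col + w]
    return float(np.sum(patch * filt))

# A 3x3 filter tuned to a vertical line (bright center column).
vertical_filter = np.array([[-1.0, 2.0, -1.0],
                            [-1.0, 2.0, -1.0],
                            [-1.0, 2.0, -1.0]])

image = np.zeros((8, 8))
image[2:5, 4] = 1.0  # a short vertical segment at column 4

print(simple_cell(image, vertical_filter, 2, 3))  # strong response (6.0)
print(simple_cell(image, vertical_filter, 2, 0))  # no response (0.0)
```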

It is known empirically that individual neurons in the visual cortex are tuned to recognize complex objects like faces, and that this recognition is invariant to changes in e.g. viewing angle or illumination.  Monkeys trained to recognize a novel object will acquire neurons which are selective for that object, which shows that the process of object recognition is learned rather than hard-coded.

We can call any node that’s not a leaf a concept.  A concept is a general category formed by aggregating perceptions or other concepts that have some essential characteristic(s) in common. (In the case of the symbol “2”, the shape is essential, while the color, scale, and position are not.)  To form a concept, the input from the lower nodes must be “pooled” over the inessential dimensions.  In the classic HMAX model of the visual cortex, pooling is implemented with a “max” function: the complex cell’s activity is determined by the strongest signal it receives from the simple cells.  A “pooling” level is followed by a “composition” level, whose nodes are all possible combinations of nearby groups of nodes on the preceding level; after a further pooling level, the nodes represent “complex composite” concepts, composed of smaller shapes.
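As a rough illustration of that pooling/composition alternation (the response values, and the use of min() as the composition rule, are simplifying assumptions for the sketch, not the actual HMAX templates or parameters):

```python
# Alternating "pool" (max over position) and "compose" (combine features)
# stages, loosely in the spirit of HMAX. Numbers are made up.
import numpy as np

def pool(responses):
    """Complex cell: max over copies of one feature at different positions."""
    return float(np.max(responses))

def compose(feature_a, feature_b):
    """Composite cell: strong only when both sub-features are present."""
    return min(feature_a, feature_b)

# Simple-cell responses to a vertical bar at five positions, and to a
# horizontal bar at five positions.
vertical_at = np.array([0.1, 0.2, 0.9, 0.1, 0.0])
horizontal_at = np.array([0.0, 0.8, 0.1, 0.1, 0.2])

v = pool(vertical_at)      # "a vertical bar, somewhere here"   -> 0.9
h = pool(horizontal_at)    # "a horizontal bar, somewhere here" -> 0.8
corner = compose(v, h)     # a "complex composite": corner-like -> 0.8
print(v, h, corner)
```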

HMAX is an example of a convolutional neural net.  In a convolutional neural net, each node’s activity is determined by the activity of a spatially local patch of nodes on the level just below it, and the transfer functions are constrained to be identical across a level. This constraint cuts down dramatically on the computational cost of learning the weights of the neural net.  The max-pooling step in a convolutional neural net makes the composite nodes translation-invariant: the max over a set of convolutions with overlapping patches is robust to translations of the input image.  This gives us a way to implement one kind of generalization, namely translation invariance.  Variants on convolutional neural nets can give other kinds of invariance, such as scale-invariance, rotation-invariance, illumination-invariance, or even invariance with respect to an arbitrary group of transformations.  The general principle is that you can generate higher concepts via measurement omission: pooling over a variety of specific feature detectors which vary in a non-salient characteristic gives you a more general feature detector that only cares about the salient characteristic.
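Here is a toy, one-dimensional demonstration of that principle, under the assumption of a single shared filter and a global max-pool: shifting the input pattern leaves the pooled response unchanged.

```python
# Translation invariance from weight sharing plus max-pooling, in one
# dimension: the same filter is applied at every position, and the max
# over positions does not change when the input pattern is shifted.
# The filter and signal are toy values.
import numpy as np

def conv_valid(signal, filt):
    """Apply one shared filter at every position ('valid' correlation)."""
    n = len(signal) - len(filt) + 1
    return np.array([np.dot(signal[i:i + len(filt)], filt) for i in range(n)])

filt = np.array([1.0, -1.0])           # a tiny "edge" detector
signal = np.zeros(10)
signal[3] = 1.0                        # an edge near the start
shifted = np.roll(signal, 2)           # the same pattern, two steps later

print(np.max(conv_valid(signal, filt)))   # 1.0
print(np.max(conv_valid(shifted, filt)))  # 1.0 -- pooled response unchanged
```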

Having a hierarchical structure of this kind is valuable because it is computationally efficient. Fully-connected neural nets, where each node on layer n is connected to every node on layer n-1, have far too many weights to learn (especially since the inputs on the bottom layer are one neuron per pixel in the image).  Hierarchical structure also allows you to cut down on the number of objects in your vocabulary; you can conceive of “this table” rather than of all the possible parts, viewing angles, and lighting conditions that give you images of the table.
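A back-of-envelope comparison makes the efficiency point vivid. The image size, hidden-layer size, and filter count below are hypothetical, chosen only to show the orders of magnitude involved:

```python
# Rough weight counts for one layer, to show why local, shared-weight
# (convolutional) connectivity is so much cheaper than full connectivity.
pixels = 256 * 256                       # one input neuron per pixel
hidden = 256 * 256                       # a same-sized hidden layer

fully_connected_weights = pixels * hidden   # every input to every hidden unit
convolutional_weights = 64 * 3 * 3          # 64 shared 3x3 filters

print(f"fully connected: {fully_connected_weights:,} weights")  # ~4.3 billion
print(f"convolutional:   {convolutional_weights:,} weights")    # 576
```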

How the brain models more abstract concepts is less well known.  But it seems intuitive that you can generate new concepts from old ones by integration (including multiple concepts under an umbrella heading) or differentiation (dividing a concept into multiple distinct types).

In neural-net language, “integrating” multiple nodes is an OR function, which is implemented with a max-pooling step.  The parent node is active iff at least one of the child nodes is active; with binary (on/off) activations, this is the same as saying that the parent node’s activity is the maximum of the child nodes’ activities.
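With binary activities this is a one-liner; the sketch below just restates the equivalence:

```python
# Integration as OR: with on/off (0/1) child activities, "active iff at
# least one child is active" is exactly a max over the children.
def integrate(child_activities):
    """Umbrella concept: OR of its children, implemented as a max."""
    return max(child_activities)

print(integrate([0, 0, 1]))  # 1: some child fired, so the parent fires
print(integrate([0, 0, 0]))  # 0: no child fired
```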

Differentiation involves subdividing a node into types.  If I understand this correctly, it involves combinations of AND functions (whose implementation can be derived from OR functions) and XOR functions, which are more difficult. For instance, if the parent node is of the form “A OR B” and you need to identify the child node “Exactly one of {A, B}”, you have to define an XOR function with a neural net. XOR provably cannot be computed by a single-layer neural network; implementing an XOR function requires a hidden layer. In high dimensions, parity functions (generalizations of the XOR function) are intractable to learn with neural nets.  It appears that differentiation is qualitatively more difficult than integration. At least some kinds of categorization that humans can do appear to be (mostly) open problems for artificial intelligence.
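For concreteness, here is a tiny hand-wired two-layer network that computes XOR (one standard construction, not the only one): a hidden “OR” unit, a hidden “AND” unit, and an output unit that fires when the OR unit is on but the AND unit is off.

```python
# XOR with one hidden layer of threshold units, using hand-chosen weights.
# A single threshold unit provably cannot compute this function.
def step(x):
    """Threshold activation: 1.0 if the input is positive, else 0.0."""
    return float(x > 0)

def xor_net(a, b):
    h_or = step(a + b - 0.5)         # hidden unit: fires if a OR b
    h_and = step(a + b - 1.5)        # hidden unit: fires if a AND b
    return step(h_or - h_and - 0.5)  # output: fires if OR but not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", int(xor_net(a, b)))
```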

In short: hierarchical organization into concepts is a natural way to construct an ontology that is computationally efficient to work with.  Concepts are generalizations from simpler objects; a concept has some form of invariance over irrelevant characteristics. (Which characteristics are relevant and which are irrelevant? More on that later.)

Claims about the world can be expressed in terms of concepts, subsuming all their sub-components; for instance, the event “there is a black circle in this picture” can be defined entirely in terms of the node that represents “black circle”, and implicitly includes all possible locations of the black circle. Thus, the hierarchical network of concepts also gives rise to a kind of hierarchical structure on states of the world.
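As a sketch (the node names, activities, and threshold below are purely illustrative placeholders, not outputs of a real model), the claim reduces to checking a single high-level node, with location already pooled away below it:

```python
# The claim "there is a black circle in this picture" expressed purely in
# terms of one high-level concept node's activity.
def black_circle_present(top_level_activities, threshold=0.5):
    """True iff the 'black circle' concept node is active for this image."""
    return top_level_activities["black circle"] > threshold

scene = {"black circle": 0.93, "dog": 0.02}  # activities of top-level nodes
print(black_circle_present(scene))           # True, wherever the circle is
```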

This gives us the start of a language for talking about ontologies.  Later we’ll get into: What makes a good ontology? What happens if you change your ontology? What about decision-making?

Note: terms in bold are from ItOE; quantitative interpretations are my own.  I make no claims that this is the only philosophical language that gets the job done. “There are many like it, but this one is mine.”

4 thoughts on “Epistemology Sequence, Part 2: Concepts”

  1. This convinced me that there is a lot worth learning about nets. You also captured features that supposedly motivate Platonic ontologies without the spooky stuff.

    “Monkeys trained to recognize a novel object will acquire neurons which are selective for that object, which shows that the process of object recognition is learned rather than hard-coded.”
    This feels like a misunderstanding on my part, but doesn’t it actually show that the *objects* are learned, but say nothing about whether the *process* (recognition/integration in general) is hard-coded? Or does that distinction just drop out?

    Looking forward to part 3.

  2. Great post, thanks for this.

    I think the section on integrating / splitting concepts uses too literal of a correspondence between neural net activity and relationships between concepts. In convnets, pooling is more about ensuring that the spatial extent of features grows as you go deeper into the net. The pooling is between spatially neighboring copies of a single feature, not between different features.

    Even in the neural net layers where different feature detectors can get nonlinearly combined, I don’t think this corresponds very well to integrating concepts, because although it can combine patterns, its model of the world is entirely limited to those patterns. It’s like how word2vec is only very loosely about semantic knowledge, even though it has enough semantically-relevant patterns represented to be interesting.

    So I’m saying: suppose you had a neuron for the concept of “grandmother” (which might be a big supposition right there, especially for novel concepts where the idea of firmware representations doesn’t make sense). If I’m not totally off base here, this neuron probably does not have direct connections to the neurons for “my grandmother” or “Bob’s grandmother” or “Clarice’s grandmother” or “mother.” It might have connections to neurons with connections to neurons with connections to those neurons. But the neuron for “my grandmother” will also have inputs from neurons that have eventual inputs from the “grandmothers in general” neuron, because the brain is really connected. Introspection about these concepts will not necessarily reflect these most-direct connections, but will instead depend on some complicated functional result of you trying to imagine things using these neurons.
