Epistemology Sequence, Part 3: Values and Evaluation

Post edited based on comments from Shea Levy, Daniel Speyer, Justin Alderis, and Andrew Rettek.

In the view of epistemology I’m putting forward in this sequence, the “best” way to construct a concept is the most useful way, relative to an agent’s goals.  So, before we can talk about the question of “what makes a good concept?” we need to talk about evaluation.

Agents have values.  Agents engage in goal-directed behavior; they have preferences; they have priorities; they have “utility functions”; etc.  I’ll use these phrases more or less interchangeably. (I do not mean to claim that these “utility functions” necessarily obey von Neumann-Morgenstern axioms.)  What they mean is that an agent has a function over states of the world that captures how good each state of the world is, from the agent’s perspective.  This function need not be single-valued, nor need every state of the world have a value.  Humans may very well have complex, multi-dimensional values (we like things like pleasure, beauty, health, love, depth and challenge, etc.).

Now, the natural question is, what is a state of the world?  That depends on your ontology.

The “things” in your model of the world are concepts, as I described in the last post.  Every concept is a generalization, over other concepts or over sensory perceptions; a concept is a collection of things that have a salient property in common, but differ in their irrelevant properties.  

In neural net language, a concept is a non-leaf node in a (generalized) convolutional neural net.  Pooling represents generalization; the lower nodes that feed into a given node on a pooling level are examples of the higher node, as in “dogs, snakes, and fish are all animals.”  Composition represents, well, composition: the nodes on a composition level are composite concepts made up of the lower nodes that feed into them, as in “a unicorn is a horse with a horn on its forehead.”
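To make that structure concrete, here is a minimal sketch in Python; the graph and all the names in it are invented for illustration, not taken from any particular network. It just shows the two kinds of non-leaf nodes described above: pooling nodes that generalize over examples, and composition nodes that combine parts.

```python
# A minimal sketch of a concept graph with two kinds of non-leaf nodes:
# "pooling" nodes generalize over examples, "composition" nodes combine parts.
# Everything here is illustrative, not an implementation from the post.

concepts = {
    # pooling: the children are examples of the parent
    "animal":  {"kind": "pooling",     "children": ["dog", "snake", "fish"]},
    # composition: the children are parts of the parent
    "unicorn": {"kind": "composition", "children": ["horse", "horn"]},
    # leaves stand in for raw percepts or previously built concepts
    "dog": {}, "snake": {}, "fish": {}, "horse": {}, "horn": {},
}

def examples_of(concept):
    """Return the things a pooling concept generalizes over."""
    node = concepts[concept]
    if node.get("kind") == "pooling":
        return node["children"]
    return []

print(examples_of("animal"))  # ['dog', 'snake', 'fish']
```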

So, what does this have to do with values?

In some states of the world, you have direct experiential perception of value judgments.  Pleasure and pain are examples of value judgments.  “Interestingness” or “surprisingness” are also value judgments; “incentive salience” (relevance to an organism’s survival) is one model of what sensations cause a spike in dopamine and attract an organism’s attention.  (More on that in my past post on dopamine, perception, and values.)

Every value judgment that isn’t a direct experience must be constructed out of simpler concepts, via means such as inference, generalization, or composition.  You directly experience being sick as unpleasant and being healthy as pleasant; you have to create the generalizations “sick” and “healthy” in order to understand that preference, and you have to model and predict how the world works in order to value, say, taking an aspirin when you have a headache.

Your value function is multivariate, and it may well be that some of the variables only apply to some kinds of concepts. It doesn’t make sense to ask how “virtuous” your coffee mug is.  I’m deliberately leaving aside the question of whether some kinds of value judgments will be ranked as more important than others, or whether some kinds of value judgments can be built out of simpler kinds (e.g. are all value judgments “made of” pleasure and pain in various combinations?).  For the moment, we’ll just think of value as (very) multivariate.

You can think of the convolutional network as a probabilistic graphical model, with each node as a random variable over “value”, and “child” nodes as being distributed according to functions which take “parent” nodes as parameters.  So, for instance, if you have experiences of really enjoying playing with a cat, that gives you a probability distribution on the node “cat” of how much you like cats; and that in turn updates the posterior probability distribution over “animals.” If you like cats, you’re more likely to like animals. This is a kind of “taxonomic” inference.

Neural networks, in themselves, are not probabilistic; they obey rules without any reference to random variables. (Rules like “The parent node has value equal to the transfer function of the weighted sum of the child node values.”) But these rules can be interpreted probabilistically. (“The parent node value is a random variable, distributed as a logistic function of the child node variables.”)  If you keep all the weights from the original neural net used to construct concepts, you can use it as a probabilistic graphical model to predict values.
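As a toy illustration of that reinterpretation, with invented weights and values and a logistic transfer function assumed for concreteness, one way to read the deterministic rule probabilistically is to treat its output as the mean of a random variable:

```python
import math, random

# Illustrative weights and values only; nothing here comes from a trained network.
weights = {"cat": 1.2, "snake": -0.8}
child_values = {"cat": 0.9, "snake": 0.1}   # how much I like each child concept

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Deterministic neural-net rule: parent value is the transfer function
# of the weighted sum of the child node values.
parent_value = logistic(sum(weights[c] * child_values[c] for c in weights))

# Probabilistic reinterpretation: treat the same quantity as the mean of a
# random variable, e.g. the probability that I like "animal" overall.
parent_sample = 1 if random.random() < parent_value else 0

print(parent_value, parent_sample)
```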

Propagating data about values all over the agent’s ontology is a process of probabilistic inference.  Information about values goes up to update the distributions of higher concepts, and then goes down to update the distributions of lower but unobserved concepts. (I.e. if you like kittens you’re more likely to like puppies.)  The linked paper explains the junction tree algorithm, which indicates how to perform probabilistic inference in arbitrary graphs.
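Here is a deliberately simplified cartoon of that up-and-down propagation. It uses a two-level Gaussian model with made-up numbers rather than the junction tree algorithm: pleasant experiences with cats update a belief about “animals,” which in turn shifts the prediction for the unobserved “dog” node.

```python
# A toy "taxonomic inference" sketch under strong simplifying assumptions:
# a two-level Gaussian model instead of a real junction-tree computation.
# All numbers and variances are made up for illustration.

m0, v0 = 0.0, 1.0      # prior over how much I like "animal" (mean, variance)
s2 = 0.5               # how far an individual species can deviate from "animal"
o2 = 0.3               # noise in any single experience

cat_experiences = [0.9, 1.1, 0.8]          # pleasant episodes of playing with cats
n = len(cat_experiences)
xbar = sum(cat_experiences) / n

# Going "up": the average cat experience acts as an observation of the
# "animal" node with total variance s2 + o2/n.
obs_var = s2 + o2 / n
post_prec = 1.0 / v0 + 1.0 / obs_var
animal_mean = (m0 / v0 + xbar / obs_var) / post_prec
animal_var = 1.0 / post_prec

# Going "down": the unobserved "dog" node inherits the updated "animal" belief.
dog_mean = animal_mean
dog_var = animal_var + s2

print(f"animal: {animal_mean:.2f} +/- {animal_var**0.5:.2f}")
print(f"dog:    {dog_mean:.2f} +/- {dog_var**0.5:.2f}")
```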

Of course, this kind of  taxonomic inference isn’t the only way to make inferences.  A lion on the other side of a fence is majestic; that lion becomes terrifying if he’s on the same side of the fence as you.  You can’t make that distinction just by having a node for “lion” and a node for “fence.” And it would be horribly inefficient to have a node for “lion on this side of fence” and “lion on that side of fence”.  What you do, of course, is value “survival”, and then predict which situation is more likely to kill you.

At present, I don’t know how this sort of future-prediction process could translate into neural-net language, but it’s at least plausible that it does, in human brains.  For the moment, let’s black-box the process, and simply say that certain situations, represented by a combination of active nodes, can be assigned values, based on predictions of the results of those situations and how the agent would evaluate those results.

In other words, f(N_1, N_2, … N_k) = (v_1, v_2, … v_l), where the N’s are nodes activated by perceiving a situation, and the v’s are value assignments, and computing the function f may involve various kinds of inferences (taxonomic, logical, future-predicting, etc.)  This computation is called teleological measurement.

The process of computing f is the process by which an agent makes an overall evaluation of how much it “likes” a situation. Various sense data produce immediate value judgments, but also those sensations are examples of concepts, and the agent may have an opinion on those concepts.  You can feel unpleasantly cold in the swimming pool, but have a positive opinion of the concept of “exercise” and be glad that you are exercising.  There are also logical inferences that can be drawn over concepts (if you go swimming, you can’t use that block of time to go hiking), and probabilistic predictions over concepts (if you swim in an outdoor pool, there’s a chance you’ll get rained on.)  Each of these activates a different suite of nodes, which in turn may have value judgments attached to them.  The overall, holistic assessment of the situation is not just the immediate pleasure and pain; it depends on the whole ontology.

After performing all these inferences, what you’re actually computing is a weighted sum of values over nodes.  (The function f that determines the weights is complicated, since I’m not specifying how inference works.)  But it can be considered as a kind of “inner product” between values and concepts, of the form

f(N) * V,

where V is a vector that represents the values on all the nodes in the whole graph of concepts, and f(N) represents how “active” each node is after all the inferences have been performed.
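Concretely, a toy version of that inner product might look like the following; the node names, activations, and two-dimensional value vectors are all invented, and the real f would be far more complicated than a fixed activation table.

```python
# A minimal sketch of the "inner product" between activations and values.
# Node names, activations, and value vectors are invented for illustration.

# How "active" each node is after inference has run (the f(N) part):
activation = {"swimming": 1.0, "cold": 0.8, "exercise": 0.9, "hiking": 0.0}

# Multivariate value judgments attached to each node (pleasure, health, ...):
values = {
    "swimming": (0.2, 0.5),
    "cold":     (-0.7, 0.0),
    "exercise": (0.3, 0.9),
    "hiking":   (0.4, 0.6),
}

# Weighted sum over nodes, one component per kind of value:
overall = [
    sum(activation[n] * values[n][i] for n in activation)
    for i in range(2)
]
print(overall)  # roughly [-0.09, 1.31]: mildly unpleasant right now, good for health
```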

Note that this “weighted sum” structure will give extra “goodness points” to things that remain good in high generality. If “virtue” is good, and “generosity” is a virtue and also good, and “bringing your sick friend soup” is a species of generosity and also good, then if you bring your sick friend soup, you triple-count the goodness.

This is a desirable property, because we ordinarily think of things as better if they’re good in high generality.  Actions are prudent if they continue to be a good idea in the long term, i.e. invariant over many time-slices.  Actions are just if they continue to be a good idea no matter who does them to whom, i.e. invariant over many people. A lot of ethics involves seeking these kinds of “symmetry” or “invariance”; in other words, going to higher levels on the graph that represents your concept structure.

This seems to me to be a little related to the notion of Haar-like bases on trees.  In the linked paper, a high-dimensional dataset is characterized by a hierarchical tree, where each node on the tree is a cluster of similar elements in the dataset. (Very much like our network of concepts, except that we don’t specify that it must be a tree.) Functions on the dataset can be represented by functions on the tree, and can be decomposed into weighted sums of “Haar-like functions” on the tree; these Haar-like functions are constant on nodes of a given depth in the tree and all their descendants.  This gives a multiscale decomposition of functions on the dataset into functions on the “higher” nodes of the tree.  “Similarity” between two data points is the inner product between their Haar-like expansions on the tree; two data points are more similar if they fall into the same categories. This has the same multiscale, “double-counting” phenomenon that shows up in teleological measurement, which gives extra weight to similarity when it’s not just shared at the “lowest” level but also shared at higher levels of generality.
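Setting the actual Haar-like construction aside, a crude way to see the multiscale “double-counting” effect is to score two leaves of a (made-up) concept tree by how many levels of the hierarchy they share:

```python
# A crude illustration of the multiscale "double-counting" effect: two leaves
# count as more similar the more levels of the hierarchy they share.
# The tree below is invented for illustration.

parent = {
    "kitten": "cat", "puppy": "dog",
    "cat": "mammal", "dog": "mammal", "goldfish": "fish",
    "mammal": "animal", "fish": "animal",
    "animal": None,
}

def ancestors(leaf):
    chain = []
    node = parent[leaf]
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain

def similarity(a, b):
    """Count shared ancestors: one point per level of generality in common."""
    return len(set(ancestors(a)) & set(ancestors(b)))

print(similarity("kitten", "puppy"))     # 2: share "mammal" and "animal"
print(similarity("kitten", "goldfish"))  # 1: share only "animal"
```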

(Haar-like bases aren’t a very good model for teleological measurement, because our function f is both multivariate and in general nonlinear, so the evaluation of a situation isn’t really decomposable into Haar functions.  The situation in the paper is much simpler than ours.)

This gives us the beginning of a computational framework for how to talk about values. A person has a variety of values, some of which are sensed directly, some of which are “higher” or constructed with respect to more abstract concepts.  Evaluating a whole situation or state of the world involves identifying which concepts are active and how much you value them, as well as inferring which additional concepts will be active (as logical, associational, or causal consequences of the situation) and how much you value them.  Adding all of this up gives you a vector of values associated with the situation.

If you like, you can take a weighted sum of that vector to get a single number describing how much you like the situation overall; this is only possible if you have a hierarchy of values prioritizing which values are most important to you.
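A minimal, made-up example of that scalarization, assuming you are willing to put priority weights on the dimensions of value:

```python
# Collapse the value vector into one number using made-up priority weights
# that encode which kinds of value matter most to you.
value_vector = [-0.09, 1.31]   # e.g. (immediate pleasure, health)
priorities   = [0.7, 0.3]      # how much each dimension matters

overall = sum(p * v for p, v in zip(priorities, value_vector))
print(overall)   # about 0.33: worth it on balance
```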

Once you can evaluate situations, you’re prepared to make decisions between them; you can optimize for the choices that best satisfy your values.

An important thing to note is that, in this system, values are relative to the agent’s ontology. Values are functions on your neural network; they aren’t applicable to some different neural network.  Values are personal; they are only shared between agents to the extent that the agents have a world-model in common. Disagreement on values is only possible when an ontology is shared.  If I prioritize X more than you do, then we disagree on values; if you have no concept of X then we don’t really disagree, we’re just seeing the world in alien ways.

Now that we have a concrete notion of how values work, we can go on to look at how an agent chooses “good” concepts and “good” actions, relative to its values, and what to do about ontological changes.

Note: terms in bold are from ItOE; quantitative interpretations are my own.  I make no claims that this is the only philosophical language that gets the job done. “There are many like it, but this one is mine.”

Epistemology Sequence, Part 1: Ontology

This sequence of posts is an experiment in fleshing out how I see the world. I expect to revise and correct things, especially in response to discussion.

“Ontology” is an answer to the question “what are the things that exist?”

Consider a reasoning agent making decisions. This can be a person or an algorithm.  It has a model of the world, and it chooses the decision that has the best outcome, where “best” is rated by some evaluative standard.

A structure like this requires an ontology — you have to define what the states of the world are, what the decision options are, and so on.  If outcomes are probabilistic, you have to define a sample space.  If you are trying to choose the decision that maximizes the expected value of the outcome, you have to have probability distributions over outcomes that sum to one.

[You could, in principle, have a decision-making agent that has no model of the world at all, but just responds to positive and negative feedback with the algorithm “do more of what rewards you and less of what punishes you.” This is much simpler than what humans do or what interesting computer programs do, and leads to problems with wireheading. So in this sequence I’ll be restricting attention to decision theories that do require a model of the world.]

The problem with standard decision theory is that you can define an “outcome” in lots of ways, seemingly arbitrarily. You want to partition all possible configurations of the universe into categories that represent “outcomes”, but there are infinitely many ways to do this, and most of them would wind up being very strange, like the taxonomy in Borges’ Celestial Emporium of Benevolent Knowledge:

Those that belong to the emperor

Embalmed ones

Those that are trained

Suckling pigs

Mermaids (or Sirens)

Fabulous ones

Stray dogs

Those that are included in this classification

Those that tremble as if they were mad

Innumerable ones

Those drawn with a very fine camel hair brush

Et cetera

Those that have just broken the flower vase

Those that, at a distance, resemble flies

We know that statistical measurements, including how much “better” one decision is than another, can depend on the choice of ontology. So we’re faced with a problem here. One would presume that an agent, given a model of the world and a way to evaluate outcomes, would be able to determine the best decision to make.  But the best decision depends on how you construct what the world is “made of”! Decision-making seems to be disappointingly ill-defined, even in an idealized mathematical setting.

This is akin to the measure problem in cosmology.  In a multiverse, for every event, we think of there as being universes where the event happens and universes where the event doesn’t happen. The problem is that there are infinitely many universes where the event happens, and infinitely many where it doesn’t. We can construct the probability of the event as a limit as the number of universes becomes large, but the result depends sensitively on precisely how we do the scaling; there isn’t a single well-defined probability.

The direction I’m going to go in this sequence is to suggest a possible model for dealing with ontology, and cash it out somewhat into machine-learning language. My thoughts on this are very speculative, and drawn mostly from introspection and a little bit of what I know about computational neuroscience.

The motivation is basically a practical one, though. When trying to model a phenomenon computationally, there are a lot of judgment calls made by humans.  Statistical methods can abstract away model selection to some degree (e.g. generate a lot of features and select the most relevant ones algorithmically) but never completely. To some degree, good models will always require good modelers.  So it’s important to understand what we’re doing when we do the illegible, low-tech step of framing the problem and choosing which hypotheses to test.

Back when I was trying to build a Bayes net model for automated medical diagnosis, I thought it would be relatively simple. The medical literature is full of journal articles of the form “A increases/decreases the risk of B by X%.”  A might be a treatment that reduces incidence of disease B; A might be a risk factor for disease B; A might be a disease that sometimes causes symptom B; etc.  So, think of a graph, where A and B are nodes and X is the weight between them. Have researchers read a bunch of papers and add the corresponding nodes to the graph; then, when you have a patient with some known risk factors, symptoms, and diseases, just fill in the known values and propagate the probabilities throughout the graph to get the patient’s posterior probability of having various diseases.
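A toy version of such a graph, with made-up probabilities and brute-force enumeration standing in for real large-scale inference, might look like this: a risk factor raises the probability of a disease, the disease raises the probability of a symptom, and we condition on what we observe about a patient.

```python
# A toy Bayes net with made-up probabilities: risk factor -> disease -> symptom.
# Real inference at scale needs something like the junction tree algorithm;
# here we simply enumerate the joint distribution.

p_risk = 0.3                                   # P(risk factor)
p_disease = {True: 0.20, False: 0.05}          # P(disease | risk factor)
p_symptom = {True: 0.80, False: 0.10}          # P(symptom | disease)

def joint(r, d, s):
    pr = p_risk if r else 1 - p_risk
    pd = p_disease[r] if d else 1 - p_disease[r]
    ps = p_symptom[d] if s else 1 - p_symptom[d]
    return pr * pd * ps

# P(disease | risk factor present, symptom present):
num = joint(True, True, True)
den = sum(joint(True, d, True) for d in (True, False))
print(f"P(disease | risk, symptom) = {num / den:.2f}")   # 0.67 with these numbers
```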

This is pretty computationally impractical at large scales, but that wasn’t the main problem. The problem was deciding what a node is. Do you have a node for “heart attack”? Well, one study says a certain risk factor increases the risk of having a heart attack before 50, while another says that a different risk factor increases the lifetime number of heart attacks. Does this mean we need two nodes? How would we represent the relationship between them? Probably having early heart attacks and having lots of heart attacks are correlated, but we aren’t likely to be able to find a paper that quantifies that correlation.  On the other hand, if we fuse the two nodes into one, then the strengths of the risk factors will be incommensurate.  There’s a difficult judgment call inherent in just deciding what the primary “objects” of our model of the world are.

One reaction is to say “automating human judgment is harder than you thought”, which, of course, is true. But how do we make judgments, then? Obviously I’m not going to solve open problems in AI here, but I can at least think about how to concretize quantitatively the sorts of things that minds seem to be doing when they define objects and make judgments about them.

The Calderon-Zygmund Decomposition as Metaphor

The Calderon-Zygmund decomposition is a classic tool in harmonic analysis.

It’s also reframed how I think since I started being immersed in this field.

The basic statement of the lemma is that all integrable functions can be decomposed into a “good” part, where the function is bounded by a small number, and a “bad” part, where the function can be large, but locally has average value zero; and we have a guarantee that the “bad” part is supported on a relatively small set.

Explicitly,

Let f \in L^1(\mathbb{R}^n), that is, \int_{\mathbb{R}^n} |f(x)| dx < \infty, and let \alpha > 0.  Then there exists a countable collection of disjoint cubes Q_j such that for each j

\alpha < \frac{1}{|Q_j|} \int_{Q_j} |f(x)| dx \le 2^n \alpha

(that is, the average value of f on the “bad” cubes is not too much bigger than \alpha)

\sum |Q_j| \le \frac{1}{\alpha} \int_{\mathbb{R}^n} |f(x)|dx

(that is, we have an upper bound on the size of the “bad” cubes)

and |f(x)| \le \alpha for almost all x not in the union of the Q_j. In other words, f is small outside the cubes, the total size of the cubes isn’t too big, and the average of f isn’t that big even on the cubes.

In particular, if we define

g(x) = f(x) outside the cubes, g(x) = \frac{1}{|Q_j|} \int_{Q_j} f(t) dt on each cube, and b(x) = f(x) - g(x), then b(x) = 0 outside the cubes, and has average value zero on each cube.  The “good” function g is bounded by 2^n \alpha; the “bad” function b is only supported on the cubes, and has average value zero on those cubes.

Why is this true? The basic sketch of the proof involves taking a big grid of cubes and asking on each one whether the average of |f| exceeds \alpha; if it does, the cube is a “bad” cube and we make it one of the Q_j, and if it doesn’t, we keep subdividing, each cube being subdivided into 2^n daughter cubes.
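Here is a rough one-dimensional cartoon of that stopping-time argument: dyadic intervals instead of cubes, a crude numerical average, a finite recursion depth, and a made-up function with a single spike. It is only meant to show the subdivide-and-stop structure, not the full lemma.

```python
# A rough 1-D cartoon of the stopping-time argument: dyadic intervals instead
# of cubes, a finite recursion depth, and a made-up sample function.

def average(f, a, b, samples=256):
    """Crude numerical average of |f| on [a, b]."""
    return sum(abs(f(a + (b - a) * (i + 0.5) / samples)) for i in range(samples)) / samples

def bad_intervals(f, a, b, alpha, depth=10):
    """Return the 'bad' dyadic intervals where the average of |f| first exceeds alpha."""
    if average(f, a, b) > alpha:
        return [(a, b)]                      # stop: this is one of the Q_j
    if depth == 0:
        return []                            # fine at this resolution; stop subdividing
    mid = (a + b) / 2
    return bad_intervals(f, a, mid, alpha, depth - 1) + \
           bad_intervals(f, mid, b, alpha, depth - 1)

# Example: a function that is small except for a spike near x = 0.5.
f = lambda x: 10.0 if 0.49 < x < 0.51 else 0.1
print(bad_intervals(f, 0.0, 1.0, alpha=0.5))   # a few small dyadic intervals around the spike
```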

The intuition here is that functions which are more or less regular (an integrable function has to be small at infinity and can’t be too singular, at least on average) can be split into a “good” part that’s either small or locally constant, and a “bad” part that can be wiggly, but only on small regions, and always with average value zero on those regions.

This is the basic principle behind multiscale decompositions.  You take a function on, say, the plane; you decompose it into a “gist” function which is constant on squares of size 1, and a “wiggle” function which is the difference. Then throw away the gist, look at the wiggle, look at squares of side-length 1/2, and again decompose it into a gist which is constant on squares and a wiggle which is everything else.  And keep going.  Your original function is going to be the sum of all the wiggles — or all the gists, depending on how you want to look at it.

But the nice thing about this is that you’re only using local information.  To compute f(x), you need to know what size-1 box x is in, for the first gist, and then which size-1/2 box, for the first wiggle, and then which size-1/4 box, for the second wiggle, and so on, but you only need to know the wiggles and gists in those boxes.  And if the value of f changed outside the box, the decomposition and approximate value of f(x) wouldn’t change.
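A one-dimensional sketch of the gist-and-wiggle procedure on toy data, with dyadic blocks of length 8, 4, 2, 1 standing in for squares of side 1, 1/2, 1/4, and so on: at each scale the gist is the local average of what’s left, the wiggle is the remainder, and summing the gists at every scale reconstructs the signal.

```python
# A 1-D sketch of the gist/wiggle idea on toy data of length 8 (so the dyadic
# block sizes are 8, 4, 2, 1 rather than 1, 1/2, 1/4, ...).

signal = [3.0, 3.2, 2.9, 3.1, 7.0, 7.2, 6.8, 7.0]

def local_average(xs, block):
    """Replace each block of the given size by its average (the 'gist')."""
    out = []
    for start in range(0, len(xs), block):
        chunk = xs[start:start + block]
        out.extend([sum(chunk) / len(chunk)] * len(chunk))
    return out

gists = []
residual = signal
block = len(signal)
while block >= 1:
    gist = local_average(residual, block)                 # constant on blocks of this size
    residual = [r - g for r, g in zip(residual, gist)]    # the "wiggle" that remains
    gists.append(gist)
    block //= 2

# Summing the gists at every scale reconstructs the original signal exactly.
reconstructed = [sum(level[i] for level in gists) for i in range(len(signal))]
print(all(abs(a - b) < 1e-9 for a, b in zip(signal, reconstructed)))  # True
```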

So how does this reframe how you see things in the real world?

Well, there are endless debates about whether you “can” capture a complex phenomenon with a simple model.  Can human behavior “really” be reduced to an algorithm? Can you “really” describe economics or biology with equations?  Is this the “right” definition to capture this idea?

To my view, that’s a wrong question. The right question is always “How much information do I lose by making this simplifying approximation?”  A “natural” degree of roughness of your approximation is the turning point where more detail won’t give you much more accuracy.

Multiscale decompositions give you a way of thinking about the coarseness of approximations.

In regions where a function is almost constant, or varying slowly, one layer of approximation is pretty good. In regions where it fluctuates rapidly and at varying scales (think “more like a fractal”), you need more layers of approximation.  A function that has rapid decay in its wavelet coefficients (the “wiggles” shrink quickly) can be approximated more coarsely than a function with slow decay.  These are the functions where the “bad part” of the “bad part” of the “bad part” and so on (in the Calderon-Zygmund sense) remains fairly big rather than rapidly disappearing.  (Of course, since the “bad part” is restricted to cubes, you can compute this separately in each cube, and require a different level of accuracy in different parts of the domain of the function.)

Definitions are approximations. You can define a category by its prototypical, average member, and then define subcategories by how they differ from that average, and sub-sub-categories by how they differ from the average of the sub-categories.

The hierarchical structure allows you to be much more efficient; you can skip the extra detail when it’s not warranted.  In fact, there’s a fair amount of evidence that this is how the human brain structures information.

The language of harmonic analysis deals a lot with how to relate measures of regularity (basically, bounds on integrals, or measures of smoothness) with measures of coefficient decay (basically, how deep down the tree of successive approximations do you need to go to get a good estimate). Calderon-Zygmund decomposition is just one of the simpler cases.  But the basic principle of “nicer functions permit rougher approximations” is a really good framing device to dissolve questions about choosing definitions and models. Debates about “this model can never capture all the complexity of the real thing” vs. “this model is a useful simplification” should be replaced by debates about how amenable the phenomenon is to approximation, and which model gives you the most accurate picture relative to its simplicity.

How I Read: the Jointed Robot Metaphor

“All living beings, whether born from eggs, from the womb, from moisture, or spontaneously; whether they have form or do not have form; whether they are aware or unaware, whether they are not aware or not unaware, all living beings will eventually be led by me to the final Nirvana, the final ending of the cycle of birth and death. And when this unfathomable, infinite number of living beings have all been liberated, in truth not even a single being has actually been liberated.” The Diamond Sutra

What do you do when you read a passage like this?

If you’re not a Buddhist, does it read like nonsense?

Does it seem intuitively true or deep right away?

What I see when I read this is a lot of uncertainty.  What is a living being that does not have form?  What is Nirvana anyway, and could there be a meaning of it that’s not obviously incompatible with the laws of physics?  And what’s up with saying that everyone has been liberated and nobody has been liberated?

Highly metaphorical, associative ideas, the kind you see in poetry or religious texts or Continental philosophy, require a different kind of perception than you use for logical arguments or proofs.

The concept of steelmanning is relevant here. When you strawman an argument, you refute the weakest possible version; when you steelman an argument, you engage with the strongest possible version.   Strawmanning impoverishes your intellectual life. It does you no favors to spend your time making fun of idiots.  Steelmanning gives you a way to test your opinions against the best possible counterarguments, and a real possibility of changing your mind; all learning happens at the boundaries, and steelmanning puts you in contact with a boundary.

A piece of poetic language isn’t an argument, exactly, but you can do something like steelmanning here as well.

When I read something like the Diamond Sutra, my mental model is something like a robot or machine with a bunch of joints.

Each sentence or idea could mean a lot of different things. It’s like a segment with a ball-and-socket joint and some degrees of freedom.  Put in another idea from the text and you add another piece of the robot, with its own degrees of freedom, but there’s a constraint now, based on the relationship of those ideas to each other.  (For example: I don’t know what the authors mean by the word “form”, but I can assume they’re using it consistently from one chapter to another.)  And my own prior knowledge and past experiences also constrain things: if I want the Diamond Sutra to click into the machine called “Sarah’s beliefs,” it has to be compatible with materialism (or at least represent some kind of subjective mental phenomenon encoded in our brains, which are made of cells and atoms.)

If I read the whole thing and wiggle the joints around, sooner or later I’ll either get a sense of “yep, that works, I found an interpretation I can use” when things click into place, or “nope, that’s not actually consistent/meaningful” when I get some kind of contradiction.

I picture each segment of the machine as having a continuous range of motion. But the set of globally stable configurations of the whole machine is discrete. They click into place, or jam.

You can think of this with energy landscape or simulated-annealing metaphors. Or you can think of it with moduli space metaphors.

This gives me a way to think about mystical or hand-wavy notions that’s not just free-association or “it could mean anything”, neither of which gives me enough structure.  There is structure, even when we’re talking about mysticism; concepts have relationships to other concepts, and some ways of fitting them together are kludgey while others are harmonious.

It can be useful to entertain ideas, to work out their consequences, before you accept or reject them.

And not just ideas. When I go to engage in a group activity like CFAR, the cognitive-science-based self-improvement workshop where I spent this weekend, I naturally fall into the state of provisionally accepting the frame of that group.  For the moment, I assumed that their techniques would work, engaged energetically with the exercises, and I’m waiting to evaluate the results objectively until after I’ve tried them.  My “machine” hasn’t clicked completely yet — there are still some parts of the curriculum I haven’t grokked or fit into place, and I obviously don’t know about the long-term effects on my life.  But I’m going to be wiggling the joints in the back of my mind until it does click or jam.  People who went into the workshop with a conventionally “skeptical” attitude, or who went in with something like an assumption that it could only mean one thing, tended to think they’d already seen the curriculum and it was mundane.

I’m not trying to argue for credulousness.  It’s more like a kind of radical doubt: being aware there are many possible meanings or models and that you may not have pinned down “the” single correct one yet.