We could regrow livers

There are currently 16,000 Americans on the waiting list for a liver transplant, but there are only enough livers for 6,000 transplants a year.  Every year, more than 1,500 people die waiting for a liver transplant.

One commonly mentioned idea to close the organ donor gap is to pay people for their organs to incentivize more donation.

New science may open other possibilities as well. Eric Lagasse’s lab at the University of Pittsburgh’s McGowan Institute for Regenerative Medicine has been experimenting with lymph nodes as a transplantation site.  Simply put, if you put hepatocytes (liver cells) into a lymph node, the node will grow into a functioning mini-liver.  This rescues mice from lethal liver failure.

Injecting cells into lymph nodes also works with thymus cells, which can give athymic mice a functioning immune system.  And it works with pancreas cells, which, when injected into lymph nodes, can rescue mice from diabetes.

The procedure only partially works with kidneys, which are much more structurally complex — the cells implanted into lymph nodes show some signs of growing into nephrons, but aren’t completely functional.

Interestingly enough, this effect was documented in 1963 by immunologist Ira Green. If you remove the spleen or thymus from a mouse, and replace it with ectopically transplanted spleen or thymus tissue, the tissue grows into a functioning, structurally normal, miniature thymus or spleen.

In 1979, researchers found that hepatocytes injected into the rat spleen (which is part of the lymphatic system and analogous to a large lymph node) functioned normally and grew to take up 40% of the spleen.

There have been clinical studies of hepatocyte transplantation in humans, generally with less-than-impressive results; but in those studies the hepatocytes are typically infused through the portal vein or splenic artery, and only a small fraction of the cells reach the liver and spleen. It’s still possible that injection into lymph nodes would be more effective. As the above article states,

One possible explanation for the discrepancy between the laboratory and clinical outcomes may relate to the route of hepatocyte delivery. Following infusion using direct splenic puncture, dramatic corrections in liver function have accompanied hepatocyte transplantation in laboratory animals. In patients with cirrhosis, however, allogeneic hepatocytes have been delivered to the spleen exclusively through the splenic artery.

A natural question to ask is: why isn’t more being done with this?  “Inject liver cells into lymph nodes” is not a particularly high-tech idea (as far as my layman’s understanding goes.)  Nor is it a completely new idea; researchers have known for decades that hepatocytes grow into functioning liver tissue, particularly when injected into the lymphatic system.  You’d think that a procedure that could replace liver transplants would be profitable and that founding a biotech company to do human trials would be a tremendous opportunity, especially since there is less scientific risk than there often is with other early-stage biomedical research (e.g. preclinical drugs).

Part of the problem is that the business model in such cases is unclear.  This is a pattern we noticed several times at MetaMed: often a medical breakthrough is not a new drug or device, but a novel medical procedure.  It is not permissible to enforce a patent (e.g. to sue someone for infringement) on a medical or surgical procedure.  Medical ethics (for instance, see this statement from the American Academy of Orthopedic Surgeons) generally holds that it’s unethical to patent a surgical procedure.  This means that it’s difficult to profit from the invention of a medical or surgical procedure.  The total value of being able to offer a liver transplant to anyone who wants one would be billions of dollars a year, but it’s not clear how anybody can capture that value, so there’s less incentive (apart from humanitarian motives) to develop and implement such a procedure.

It also means that it’s difficult to disseminate information about new medical or surgical procedures. Learning to perform procedures is an apprenticeship process; one doctor has to teach another. This, combined with natural risk aversion, means the spread of new procedures is slow. If a new surgery had been shown conclusively, by excellent experimental evidence, to be better than the old one, it still would not necessarily sweep the nation; if the clinician who pioneered the new surgery isn’t a natural evangelist, it may never be performed in more than one hospital.

This seems to be an opportunity in search of a viable strategy. There are spectacular results in regenerative medicine (frequently coming out of the McGowan Institute — see Stephen Badylak’s work in tissue regeneration).  It’s not clear to me how one would make those results “scale” in the sense that we’re used to in tech companies.  But if you could figure out a model, the market size is mind-boggling.

If you had a way to regrow organs, how would you validate it experimentally? And how would you get it to patients?  And how would you do it fast enough not to lose tens of thousands of lives from delay?

Aesthetics are moral judgments

I often hear people say things like “It’s ridiculous to judge that someone’s a bad person because of his musical taste!” People assume it’s obvious that aesthetic judgments have no moral weight.

For me, aesthetic judgments are a kind of moral judgment.

I understand “morality” to basically cash out as “priority structure”, “values”, and related concepts: what matters most to me, and what would matter most to me if I knew more and thought more clearly.  With that definition, when I say that kindness is “good” and I say that Camembert is “good”, I’m not using two unrelated meanings of the word — cheese and kindness are both valuable to me.

Aesthetic preferences aren’t really arbitrary; they say things about what you value and how you see the world.

For example, I like Bach.  There’s a pretty well-established correlation between liking Bach and liking math. Gödel, Escher, Bach is a pretty strong marker of membership in my tribe.  And I don’t think that’s arbitrary.  The words I’d use to describe Bach’s music are “complex” and “orderly.”  Polyphony gives the impression of a giant, intricate clock, moving according to regular mechanisms, steady as the stars in their courses and endlessly interesting.  It gives me a sense of cosmos, of natural law.  And the fact that I like that says something about what my priorities are more generally.

These sorts of connections are associative and probabilistic rather than determined. Not literally everyone who likes Bach is getting the same associations from the music as I do. But associations and resonances can be real tendencies in the world even if they’re not strict logical entailments.  Metaphors can be apt. There are some synesthetic/metaphorical connections that correlate across human minds, like the bouba/kiki effect.  In a “clusters-in-thingspace” sense, it can be sort of objectively true that Bach is “about” cosmic natural order.  You can’t stretch these intuitions too far, but they aren’t completely fictitious either.

And it’s possible to learn aesthetic intuitions.

I used to only like paintings with very crisp, precise textures, rather than the cloudy, fuzzy textures that show up in John Singer Sargent or Turner paintings.  The art blog Opulent Joy taught me to appreciate the soft textures; when I realized “oh! he’s appreciating a broader power spectrum than I am!” I immediately noticed that his aesthetic was like mine, but stronger — more general, more nuanced, and therefore an upgrade I would like to make.

Another example: when I was a kid,  I found industrial landscapes horribly ugly.  Machines seemed like a blight on nature.  The more I came to understand that good things are produced by machines, and that machines are made with care and skill, the more I started to see trains and bridges and construction sites and shipping containers as beautiful.  Factual understanding changed my aesthetic appreciation. And if learning facts changes your aesthetic views, that means that they aren’t arbitrary; they actually reflect an understanding of the world, and can be more correct or less so.

Judging people for aesthetics isn’t crazy.  If someone loves Requiem for a Dream, it’s reasonable to infer that they’re a pessimistic person.  If you think pessimism is bad, then you’re indirectly judging them for their taste. Now, your inferences could be wrong (they could just be huge Philip Glass fans); once again, we’re looking at Bayesian evidence, not logical entailments, so being overconfident about what other people’s tastes “say about them” is a bad idea.  But aesthetics do more-or-less mean things.

For me, personally, my aesthetic sensitivities are precise in a way my moral intuitions aren’t.  My “conscience” will ping perfectly innocent things as “bad”; or it’ll give me logically incoherent results; or it’ll say “everything is bad and everyone is a sinner.” I’ve learned to mistrust my moral intuitions.

My aesthetic sensibilities, on the other hand, are stable and firm and specific. I can usually articulate why I like what I like; I’m conscious of when I’m changing my mind and why; I’m confident in my tastes; my sophistication seems to increase over time; intellectual subjects that seem “beautiful” to me also seem to turn out to be scientifically fruitful and important.  To the extent that I can judge such things about myself, I’m pretty good at aesthetics.

It’s easier for me to conceptualize “morality” as “the aesthetics of human relationships” than to go the other way and consider aesthetics as “the morality of art and sensory experience.”  I’m more likely to have an answer to the question “which of these options is more beautiful?” than “which of these options is the right thing to do?”, so sometimes I get to morality through aesthetics. Justice is good because symmetry is beautiful.  Spiteful behavior is bad because resentment is an ugly state to be in.  Preserving life is good, at root, because complexity is more interesting and beautiful than emptiness.  (Which is, again, probably true because I am a living creature and evolutionarily wired to think so; it’s circular; but the aesthetic perspective is more compelling to me than other perspectives.)

It always puzzles me when people think of aesthetics as a sort of side issue to philosophy, and I know I’ve puzzled people who don’t see why I think they’re central.  Hopefully this gives a somewhat clearer idea of how someone’s internal world can be “built out of aesthetics” to a very large degree.

Epistemology Sequence, Part 5: Extension and Universality

One of the properties that you’d like a learning agent to have is that, if your old concepts work well, learning a new concept should extend your knowledge but not invalidate your old knowledge. Changes in your ontology should behave in a roughly predictable manner rather than a chaotic manner.  If you learn that physics works differently at very large or very small scales, this should leave classical mechanics intact at moderate scales and accuracies.

From a goal-based perspective, this means that if you make a desirable change in ontology — let’s say you switch from one set of nodes to a different set — and you choose the “best” map from one ontology to another, in something like the Kullback-Leibler-minimizing sense described here — then when you take preimages of your “utility functions” on the new ontology onto the old ontology, they come out mostly the same.  The best decision remains the best decision.

In the special case where the old ontology is just a subset of the new ontology, this means that the maps between them are a restriction and an extension.  (For example, if we restrict (a, b, c, d, e) to the first three coordinates, it’s just the identity operation on those coordinates, (a, b, c); and if we extend (a, b, c) to (a, b, c, d, e), again the map is the identity on the first three coordinates.)  What we’d like to say is that, when we add new nodes to our ontology, then the function that computes values on that ontology (the f in Part 3 of this sequence) extends to a new f on the new ontology, while keeping the same values on the old nodes.

For example: let’s say I have a regression model that predicts SAT scores from a bunch of demographic variables. The “best” model minimizes the sum of squared errors; sum of squared errors is my utility function.  Now, if I add a variable to my model, the utility function stays the same (it’s still “sum of squared errors”), so if adding that new variable changes the model but reduces the residuals, my old model “wants” to make the upgrade.  On the other hand, an “upgrade” to my model that changes the utility function, like deciding to minimize the sum of squared errors plus the coefficient for the new variable, isn’t necessarily an improvement unless the “best” model by that criterion also shrinks the sum of squared errors relative to the original regression model.
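To make the nested-model point concrete, here’s a minimal sketch (numpy only, synthetic data, all variable names invented): adding a predictor to a least-squares fit can only lower, or leave unchanged, the sum of squared errors, so under the same utility function the old model always “approves” of the extension.

```python
# Minimal sketch: adding a predictor to a least-squares model can only
# lower (or leave unchanged) the sum of squared errors, so under the
# *same* utility function the old model "wants" the upgrade.
# Synthetic data; all variable names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 200
income = rng.normal(size=n)
parent_edu = rng.normal(size=n)
sat = 50 * income + 30 * parent_edu + rng.normal(scale=20, size=n)

def sse(X, y):
    """Sum of squared errors of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
X_old = np.column_stack([ones, income])              # old ontology
X_new = np.column_stack([ones, income, parent_edu])  # old ontology plus one new node

print("SSE, old model:", sse(X_old, sat))
print("SSE, new model:", sse(X_new, sat))  # never larger than the old SSE
```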

From the goal-oriented perspective, the only changes you’d want to make to your ontology are those which, when “projected” onto the old ontology, have you making the same “optimal” choices.

(These statements still need to be made precise. There may be a conjecture in there but I haven’t specified it yet. The whole business smells like Hahn-Banach to me, but I could be entirely mistaken. The universality of neural nets might be relevant to showing that this kind of a “rational learner” is implementable with neural nets in the first place. )

Epistemology Sequence, Part 4: Updating Ontologies Based On Values

What’s a good ontology?

Well, the obvious question is, good relative to what?

Relative to your values, of course. In the last post I talked about how, given an ontology for describing the world, you can evaluate a situation.  You compute the inner product

f(N)*V

where V represents a value function on each of the concepts in your ontology, and f(N) is a function of which concept nodes are “active” in a situation, whether by direct perception, logical inference, predictive inference, association, or any other kind of linkage.  For instance, the situation of being a few yards away from a lion will activate nodes for “tan”, “lion”, “danger”, and so on.

If you can evaluate situations, you can choose between actions. Among all actions, pick the one that has the highest expected value.
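As a toy illustration of “evaluate, then choose” (not a claim about how brains implement it): each candidate action predicts a pattern of node activations, the score is the inner product with a value vector, and the agent picks the highest-scoring action. All nodes, numbers, and actions here are made up.

```python
# Toy sketch of "evaluate, then choose": each candidate action predicts a
# pattern of active concept nodes; the score is the inner product of that
# activation pattern with the value vector V. All numbers are invented.
import numpy as np

nodes = ["lion", "danger", "safety", "novelty"]
V = np.array([0.2, -5.0, 3.0, 1.0])   # value attached to each node, same order as `nodes`

# f(N): predicted node activations under each action (just strengths of activation).
actions = {
    "approach the lion": np.array([1.0, 0.9, 0.0, 0.8]),
    "back away slowly":  np.array([0.3, 0.2, 0.9, 0.1]),
}

scores = {a: float(act @ V) for a, act in actions.items()}
best = max(scores, key=scores.get)
print(scores)
print("chosen action:", best)
```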

One particular action that you might take is changing your ontology.  Suppose you add a new node to your network of concepts.  Probably a generalization or a composition of other nodes. Or you subtract a node.  How would you decide whether this is a good idea or not?

Well, you build a model, using your current ontology, of what would happen if you did that. You’d take different actions.  Those actions would lead to different expected outcomes. You can evaluate how much you like those outcomes using your current ontology and current values.

For modeling the world, the kinds of things you might optimize for are accuracy (how often your model makes correct predictions) and simplicity (how few degrees of freedom are involved).  This is often implemented in machine learning with a loss function consisting of an error term and a regularization term; you choose the model that minimizes the loss function.
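For instance, here’s a minimal ridge-regression sketch of that accuracy-plus-simplicity tradeoff: the loss is squared error plus an L2 penalty, and turning the penalty up buys simplicity (smaller weights) at the cost of accuracy. The data and penalty values are arbitrary.

```python
# Sketch of "accuracy + simplicity" as a single loss: squared error plus an
# L2 penalty (ridge regression). Larger lam trades accuracy for simplicity.
# Synthetic data; parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ w_true + rng.normal(scale=0.5, size=100)

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    w = ridge_fit(X, y, lam)
    err = float(np.sum((y - X @ w) ** 2))
    print(f"lam={lam:6.1f}  error={err:8.2f}  ||w||={np.linalg.norm(w):.2f}")
```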

Notice that, in general, changing your ontology is changing your values. You can’t prioritize “civil rights” if you don’t think they exist.  When you learn that there are other planets besides the Earth, you might prioritize space exploration; before you learned that it was possible, you couldn’t have wanted it.

The question of value stability is an important one. When should you self-modify to become a different kind of person, with different values?  Would you take a pill that turns you into a sociopath?  After all, once you’ve taken the pill, you’ll be happy to be free of all those annoying concerns for other people.  Organizations or computer programs can also self-modify, and those modifications can change their values over time.  “Improvements” meant to increase power or efficacy can cause such agents to change their goals to those that present-day planners would find horrifying.

In the system I’m describing, proposed changes are always evaluated with respect to current values.  You don’t take the sociopath pill, because the present version of you doesn’t want to be a sociopath. The only paths of self-modification open to you are those where future states (and values) are backwards-compatible with earlier states and values.

The view of concepts as clusters in thingspace suggests that the “goodness” of a concept or category is a function of some kind of metric of the “naturalness” of the cluster.  Something like the ratio of between-cluster to within-cluster variance, or the size of the margin to the separating hyperplane.  The issue is that choices of metric matter enormously.  A great deal of research in image recognition, for example, involves competing choices of similarity metrics. The best choice of similarity metric is subjective until people agree on a goal — say, a shared dataset with labeled images to correctly identify — and compete on how well their metrics work at achieving that goal.
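Here’s a small sketch of the metric-dependence point: for a fixed clustering, a crude “naturalness” score (between-cluster variance over within-cluster variance) changes just by rescaling one feature, i.e. by changing the metric. The data are synthetic and the score is only meant to be illustrative.

```python
# Sketch: the "naturalness" of a clustering (here, between-cluster variance
# over within-cluster variance) depends on the similarity metric -- even a
# simple rescaling of one feature changes the score. Data are synthetic.
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(loc=[0, 0], scale=[1.0, 3.0], size=(50, 2))
b = rng.normal(loc=[4, 0], scale=[1.0, 3.0], size=(50, 2))

def cluster_score(groups):
    """Between-cluster variance / mean within-cluster variance."""
    grand = np.vstack(groups).mean(axis=0)
    between = np.mean([np.sum((g.mean(axis=0) - grand) ** 2) for g in groups])
    within = np.mean([np.mean(np.sum((g - g.mean(axis=0)) ** 2, axis=1))
                      for g in groups])
    return between / within

scale = np.array([1.0, 0.1])   # a different "metric": shrink the noisy axis
print("raw metric:     ", cluster_score([a, b]))
print("rescaled metric:", cluster_score([a * scale, b * scale]))
```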

The “goodness” or “aptness” of concepts is a real feature of the world. Some concepts divide reality at the joints better than others. Some concepts are “natural” and some seem contrived.  “Grue” and “bleen” are awkward, unnatural concepts that no real human would use, while “blue” and “green” are natural ones.  And yet, even blue and green are not human universals (the Japanese ao refers to both blue and green; 17th century English speakers thought lavender was “blue” but we don’t.)  The answer to this supposed puzzle is that the “naturalness” of concepts depends on what you want to do with them.  It might be more important to have varied color words in a world with bright-colored synthetic dyes, for instance; our pre-industrial ancestors got by with fewer colors.  The goodness of concepts is objective — that is, there is a checkable, empirical fact of the matter about how good a concept is — but only relative to a goal, which may depend on the individual agent.  Goals themselves are relative to ontology.  So choosing a good ontology is actually an iterative process; you have to build it up relative to your previous ontology.

(Babies probably have some very simple perceptual concepts hard-coded into their brains, and build up more complexity over time as they learn and explore.)

It’s an interesting research problem to explore when major changes in ontology are desirable, in “toy” computational situations.  The early MIRI paper “Ontological crises in artificial agents’ value systems” is a preliminary attempt to look at this problem, and says essentially that small changes in ontologies should yield “near-isomorphisms” between utility functions.  But there’s a great deal of work to be done (some of which already exists) about robustness under ontological changes — when is the answer spit out by a model going to remain the same under perturbation of the number of variables of that model?  What kinds of perturbations are neutral, and what kinds are beneficial or harmful?  Tenenbaum’s work on learning taxonomic structure from statistical correlations is somewhat in this vein, but keeps the measure of “model goodness” separate from the model itself, and doesn’t incorporate the notion of goals.  I anticipate that additional work on this topic will have serious practical importance, given that model selection and feature engineering are still labor-intensive, partly subjective activities, and that greater automation of model selection will turn out to be valuable in technological applications.

Most of the ideas here are from ItOE; quantitative interpretations are my own.

Epistemology Sequence, Part 3: Values and Evaluation

Post edited based on comments from Shea Levy, Daniel Speyer, Justin Alderis, and Andrew Rettek.

In the view of epistemology I’m putting forward in this sequence, the “best” way to construct a concept is the most useful way, relative to an agent’s goals.  So, before we can talk about the question of “what makes a good concept?” we need to talk about evaluation.

Agents have values.  Agents engage in goal-directed behavior; they have preferences; they have priorities; they have “utility functions”; etc.  I’ll use these phrases more or less interchangeably. (I do not mean to claim that these “utility functions” necessarily obey Von Neumann-Morgenstern axioms.)  What they mean is that an agent has a function over states of the world that captures how good each state of the world is, from the agent’s perspective.  This function need not be single-valued, nor need every state of the world have a value.  Humans may very well have complex, multi-dimensional values (we like things like pleasure, beauty, health, love, depth and challenge, etc).

Now, the natural question is, what is a state of the world?  That depends on your ontology.

The “things” in your model of the world are concepts, as I described in the last post.  Every concept is a generalization, over other concepts or over sensory perceptions; a concept is a collection of things that have a salient property in common, but differ in their irrelevant properties.  

In neural net language, a concept is a non-leaf node in a (generalized) convolutional neural net.  Pooling represents generalization; the lower nodes that feed into a given node on a pooling level are examples of the higher node, as in “dogs, snakes, and fish are all animals.”  Composition represents, well, composition: the nodes on a composition level are composite concepts made up of the lower nodes that feed into them, as in “a unicorn is a horse with a horn on its forehead.”

So, what does this have to do with values?

In some states of the world, you have direct experiential perception of value judgments.  Pleasure and pain are examples of value judgments.  “Interestingness” or “surprisingness” are also value judgments; “incentive salience” (relevance to an organism’s survival) is one model of what sensations cause a spike in dopamine and attract an organism’s attention.  (More on that in my past post on dopamine, perception, and values.)

Every value judgment that isn’t a direct experience must be constructed out of simpler concepts, via means such as inference, generalization, or composition.  You directly experience being sick as unpleasant and being healthy as pleasant; you have to create the generalizations “sick” and “healthy” in order to understand that preference, and you have to model and predict how the world works in order to value, say, taking an aspirin when you have a headache.

Your value function is multivariate, and it may well be that some of the variables only apply to some kinds of concepts. It doesn’t make sense to ask how “virtuous” your coffee mug is.  I’m deliberately leaving aside the question of whether some kinds of value judgments will be ranked as more important than others, or whether some kinds of value judgments can be built out of simpler kinds (e.g. are all value judgments “made of” pleasure and pain in various combinations?).  For the moment, we’ll just think of value as (very) multivariate.

You can think of the convolutional network as a probabilistic graphical model, with each node as a random variable over “value”, and “child” nodes as being distributed according to functions which take “parent” nodes as parameters.  So, for instance, if you have experiences of really enjoying playing with a cat, that gives you a probability distribution on the node “cat” of how much you like cats; and that in turn updates the posterior probability distribution over “animals.” If you like cats, you’re more likely to like animals. This is a kind of “taxonomic” inference.
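A minimal sketch of that taxonomic update, treating the parent concept’s value as a latent Gaussian and the child’s observed value as a noisy draw from it (all priors and noise levels are invented):

```python
# Toy "taxonomic inference" sketch: the value of the parent concept
# ("animals") is a latent Gaussian, and the observed value of a child
# concept ("cats") is a noisy draw from it. Observing that you like cats
# shifts the posterior on "animals" upward. Priors and noise are made up.
mu0, var0 = 0.0, 1.0      # prior on how much you like "animals"
obs_var = 0.5             # how loosely a child concept tracks its parent

def update(mu, var, observation, obs_var):
    """Conjugate Normal-Normal update of the parent node."""
    post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    post_mu = post_var * (mu / var + observation / obs_var)
    return post_mu, post_var

cat_value = 2.0           # you really enjoyed playing with a cat
mu1, var1 = update(mu0, var0, cat_value, obs_var)
print("prior on 'animals':     mean %.2f, var %.2f" % (mu0, var0))
print("posterior on 'animals': mean %.2f, var %.2f" % (mu1, var1))
# A higher posterior mean on "animals" then raises the expected value of
# unobserved children like "puppies" (they're also noisy draws from it).
```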

Neural networks, in themselves, are not probabilistic; they obey rules without any reference to random variables. (Rules like “The parent node has value equal to the transfer function of the weighted sum of the child node values.”) But these rules can be interpreted probabilistically. (“The parent node value is a random variable, distributed as a logistic function of the child node variables.”)  If you keep all the weights from the original neural net used to construct concepts, you can use it as a probabilistic graphical model to predict values.

Propagating data about values all over the agent’s ontology is a process of probabilistic inference.  Information about values goes up to update the distributions of higher concepts, and then goes down to update the distributions of lower but unobserved concepts. (I.e. if you like kittens you’re more likely to like puppies.)  The linked paper explains the junction tree algorithm, which indicates how to perform probabilistic inference in arbitrary graphs.

Of course, this kind of  taxonomic inference isn’t the only way to make inferences.  A lion on the other side of a fence is majestic; that lion becomes terrifying if he’s on the same side of the fence as you.  You can’t make that distinction just by having a node for “lion” and a node for “fence.” And it would be horribly inefficient to have a node for “lion on this side of fence” and “lion on that side of fence”.  What you do, of course, is value “survival”, and then predict which situation is more likely to kill you.

At present, I don’t know how this sort of future-prediction process could translate into neural-net language, but it’s at least plausible that it does, in human brains.  For the moment, let’s black-box the process, and simply say that certain situations, represented by a combination of active nodes, can be assigned values, based on predictions of the results of those situations and how the agent would evaluate those results.

In other words, f(N_1, N_2, … N_k) = (v_1, v_2, … v_l), where the N’s are nodes activated by perceiving a situation, and the v’s are value assignments, and computing the function f may involve various kinds of inferences (taxonomic, logical, future-predicting, etc.)  This computation is called teleological measurement.

The process of computing f is the process by which an agent makes an overall evaluation of how much it “likes” a situation. Various sense data produce immediate value judgments, but also those sensations are examples of concepts, and the agent may have an opinion on those concepts.  You can feel unpleasantly cold in the swimming pool, but have a positive opinion of the concept of “exercise” and be glad that you are exercising.  There are also logical inferences that can be drawn over concepts (if you go swimming, you can’t use that block of time to go hiking), and probabilistic predictions over concepts (if you swim in an outdoor pool, there’s a chance you’ll get rained on.)  Each of these activates a different suite of nodes, which in turn may have value judgments attached to them.  The overall, holistic assessment of the situation is not just the immediate pleasure and pain; it depends on the whole ontology.

After performing all these inferences, what you’re actually computing is a weighted sum of values over nodes.  (The function f that determines the weights is complicated, since I’m not specifying how inference works.)  But it can be considered as a kind of “inner product” between values and concepts, of the form

f(N) * V,

where V is a vector that represents the values on all the nodes in the whole graph of concepts, and f(N) represents how “active” each node is after all the inferences have been performed.

Note that this “weighted sum” structure will give extra “goodness points” to things that remain good in high generality. If “virtue” is good, and “generosity” is a virtue and also good, and “bringing your sick friend soup” is a species of generosity and also good, then if you bring your sick friend soup, you triple-count the goodness.
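Here’s a toy sketch of that multiple counting: an action activates its own node and every ancestor, so its score under the weighted sum picks up value at each level of generality. The hierarchy and the numbers are invented.

```python
# Sketch of the "multiple counting" in teleological measurement: an action
# activates a node and all of its ancestors, so something that stays good
# at higher levels of generality picks up value at every level.
parents = {
    "bringing your sick friend soup": "generosity",
    "generosity": "virtue",
    "virtue": None,
}
V = {"bringing your sick friend soup": 1.0, "generosity": 1.0, "virtue": 1.0}

def evaluate(node):
    """Sum the values of the node and every ancestor it activates."""
    total = 0.0
    while node is not None:
        total += V[node]
        node = parents[node]
    return total

print(evaluate("bringing your sick friend soup"))   # 3.0: counted three times
```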

This is a desirable property, because we ordinarily think of things as better if they’re good in high generality.  Actions are prudent if they continue to be a good idea in the long term, i.e. invariant over many time-slices.  Actions are just if they continue to be a good idea no matter who does them to whom, i.e. invariant over many people. A lot of ethics involves seeking these kinds of “symmetry” or “invariance”; in other words, going to higher levels on the graph that represents your concept structure.

This seems to me to be a little related to the notion of Haar-like bases on trees.  In the linked paper, a high-dimensional dataset is characterized by a hierarchical tree, where each node on the tree is a cluster of similar elements in the dataset. (Very much like our network of concepts, except that we don’t specify that it must be a tree.) Functions on the dataset can be represented by functions on the tree, and can be decomposed into weighted sums of “Haar-like functions” on the tree; these Haar-like functions are constant on nodes of a given depth in the tree and all their descendants.  This gives a multiscale decomposition of functions on the dataset into functions on the “higher” nodes of the tree.  “Similarity” between two data points is the inner product between their Haar-like expansions on the tree; two data points are more similar if they fall into the same categories. This has the same multiscale, “double-counting” phenomenon that shows up in teleological measurement, which gives extra weight to similarity when it’s not just shared at the “lowest” level but also shared at higher levels of generality.

(Haar-like bases aren’t a very good model for teleological measurement, because our function f is both multivariate and in general nonlinear, so the evaluation of a situation isn’t really decomposable into Haar functions.  The situation in the paper is much simpler than ours.)

This gives us the beginning of a computational framework for how to talk about values. A person has a variety of values, some of which are sensed directly, some of which are “higher” or constructed with respect to more abstract concepts.  Evaluating a whole situation or state of the world involves identifying which concepts are active and how much you value them, as well as inferring which additional concepts will be active (as logical, associational, or causal consequences of the situation) and how much you value them.  Adding all of this up gives you a vector of values associated with the situation.

If you like, you can take a weighted sum of that vector to get a single number describing how much you like the situation overall; this is only possible if you have a hierarchy of values prioritizing which values are most important to you.

Once you can evaluate situations, you’re prepared to make decisions between them; you can optimize for the choices that best satisfy your values.

An important thing to note is that, in this system, values are relative to the agent’s ontology. Values are functions on your neural network; they aren’t applicable to some different neural network.  Values are personal; they are only shared between agents to the extent that the agents have a world-model in common. Disagreement on values is only possible when an ontology is shared.  If I prioritize X more than you do, then we disagree on values; if you have no concept of X then we don’t really disagree, we’re just seeing the world in alien ways.

Now that we have a concrete notion of how values work, we can go on to look at how an agent chooses “good” concepts and “good” actions, relative to its values, and what to do about ontological changes.

Note: terms in bold are from ItOE; quantitative interpretations are my own.  I make no claims that this is the only philosophical language that gets the job done. “There are many like it, but this one is mine.”

Epistemology Sequence, Part 2: Concepts

What are the “things” in our world?

A table is not a “raw” piece of sense data; it is a grouping of multiple sensory stimuli into a single, discrete object. (The concept which the word “table” refers to is even more general, since it includes all instances of individual tables.)

We do not perceive the world in terms of raw sense data; we process it a lot before we can even become conscious of it. Unmediated sensory perception would have no structure; it would be like William James’ “blooming, buzzing confusion,” which was his phrase for a baby’s sensory experience.

James was wrong on the facts — even babies do not have literal unmediated perceptions.  There is nowhere in the brain that represents a photograph-like picture of the visual field, for instance.  But we do know that object recognition can break to some degree in humans, yielding examples of people who lack some higher-level sensory processing. Mel Baggs writes about having to consciously and effortfully recognize objects, as a result of autism.  Agnosia is the general term for the inability to recognize sensory phenomena; there are many agnosias, like the inability to distinguish visual shapes, or to distinguish speech from non-speech sounds.  It’s clear that organizing sensory data into discrete objects (let alone forming abstractions from types of objects and their properties) is a nontrivial operation in the brain. And, indeed, image and speech recognition are ongoing and unsolved areas of machine learning research.

Visual object recognition is currently believed to be modeled by a hierarchical neural net, shaped like a tree. The lowest leaves on the tree, known as simple cells, recognize local features of the image — say, a (convolution with a) particular line segment, in a particular (x, y) coordinate position, at a particular angle.  Higher levels of the tree integrate multiple nodes from lower on the tree, producing features that recognize more complex features (shapes, patterns, boundaries, etc.)  Higher features have invariance properties (the shape of the number 2 is recognizable even if it’s translated, rotated, scaled, written in a different color, etc) which come from integrating many lower features which have different values for the “irrelevant” properties like location or color.  Near the top of the tree, we can get as far as having a single feature node for a particular type of object, like “dog.”

It is known empirically that individual neurons in the visual cortex are tuned to recognize complex objects like faces, and that this recognition is invariant to changes in e.g. viewing angle or illumination.  Monkeys trained to recognize a novel object will acquire neurons which are selective for that object, which shows that the process of object recognition is learned rather than hard-coded.

We can call a concept a node that’s not a leaf.  A concept is a general category composed of aggregating perceptions or other concepts, which have some essential characteristic(s) in common. (In the case of the symbol “2”, the shape is essential, while the color, scale, and position are not.)  To form a concept, the input from the lower nodes must be “pooled” over such inessential dimensions.  In the classic HMAX model of the visual cortex, pooling is implemented with a “max” function — the complex cell’s activity is determined by the strongest signal it receives from the simple cells.  A “pooling” level is followed by a “composition” level, whose nodes are all possible combinations of nearby groups of nodes on the preceding level; after a further pooling level, the nodes represent “complex composite” concepts, composed of smaller shapes.

HMAX is an example of a convolutional neural net.  In a convolutional neural net, each node’s activity is determined by the activity of a spatially local patch of nodes on the level just below it, and the transfer functions are constrained to be identical across a level. This constraint cuts down dramatically on the computational cost of learning the weights on the neural net.  The max-pooling step in a convolutional neural net makes the composite nodes translation-invariant; the max over a set of convolutions with overlapping patches is robust to translations of the input image.  This gives us a way to implement the ability to generalize or produce translation invariance.  Variants on convolutional neural nets can give other kinds of invariance, such as scale-invariance, rotation-invariance, illumination-invariance, or even invariance with respect to an arbitrary group of transformations.  The general principle is that you can generate higher concepts via measurement omission — pooling over a variety of specific feature-detectors which vary in a non-salient characteristic will give you a more general feature detector that only cares about the salient characteristic.
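A minimal numpy sketch of the translation-invariance point (a 1-D “image,” an arbitrary edge filter, and a max-pool over all positions): shifting the input pattern leaves the pooled feature unchanged.

```python
# Minimal numpy sketch of why max-pooling buys (local) translation
# invariance: convolve a 1-D "image" with an edge detector, then take the
# max over the whole response. Shifting the input doesn't change the
# pooled feature. Filter and signal are arbitrary.
import numpy as np

edge_filter = np.array([-1.0, 1.0])

def pooled_feature(signal):
    """Convolve with the filter, then max-pool over all positions."""
    response = np.convolve(signal, edge_filter, mode="valid")
    return response.max()

image = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=float)
shifted = np.roll(image, 3)

print(pooled_feature(image))    # same value...
print(pooled_feature(shifted))  # ...after shifting the pattern
```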

Having a hierarchical structure of this kind is valuable because it is computationally efficient. Fully-connected neural nets, where each node on layer n is connected to every node on layer n-1, have far too many weights to learn (especially since the inputs on the bottom layer are one neuron per pixel in the image).  Hierarchical structure allows you to cut down on the number of objects in your vocabulary; you can conceive of “this table” rather than all possible parts and viewing angles and lighting choices that give you images of the table.

How the brain models more abstract concepts is less well known.  But it seems intuitive that you can generate new concepts from old ones by integration (including multiple concepts under an umbrella heading) or differentiation (dividing a concept into multiple distinct types.)

In neural-net language, “integrating” multiple nodes is an OR function, which is implemented with a max-pooling step.  The parent node is active iff at least one of the child nodes is active; this is equivalent to saying that the parent node is active iff the maximum over all child nodes is active.

Differentiation involves subdividing a node into types.  If I understand this correctly, this involves combinations of AND functions (whose implementation can be derived from OR functions) and XOR functions, which are more difficult. For instance, if the parent node is of the form “A OR B” and you need to identify the child node “Exactly one of {A, B}”, you have to define an XOR function with a neural net. XOR functions provably cannot be done with single-layer neural networks; implementing an XOR function requires a hidden layer. In high dimensions, parity functions (generalizations of the XOR function) are intractable to learn with neural nets.  It appears that differentiation is qualitatively more difficult than integration. At least some kinds of categorization that humans can do appear to be (mostly) open problems for artificial intelligence.
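To make the contrast concrete, here’s a sketch with hand-set weights rather than learned ones: an OR node is just a max over its children, while XOR needs a small hidden layer. This is only meant to show the shape of the two computations, not how they would be learned.

```python
# Sketch: "integration" (OR) falls out of a single max-pooling step, while
# XOR needs a hidden layer -- here with hand-set weights rather than
# learned ones, just to show the shapes of the two computations.
import numpy as np

def step(x):
    return (x > 0).astype(float)

def or_node(children):
    """A pooling node: active iff any child is active."""
    return np.max(children)

def xor_net(x1, x2):
    """Two hidden units (x1 AND NOT x2, x2 AND NOT x1), then an OR on top."""
    hidden = step(np.array([x1 - x2, x2 - x1]) - 0.5)
    return step(np.array(hidden.sum() - 0.5))

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, "OR:", or_node(np.array([a, b])), "XOR:", xor_net(a, b))
```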

In short: hierarchical organization into concepts is a natural way to construct an ontology that is computationally efficient to work with.  Concepts are generalizations from simpler objects; a concept has some form of invariance over irrelevant characteristics. (Which characteristics are relevant and which are irrelevant? More on that later.)

Claims about the world can be expressed in terms of concepts, subsuming all their sub-components; for instance, the event “there is a black circle in this picture” can be defined entirely in terms of the node that represents “black circle”, and implicitly includes all possible locations of the black circle. Thus, the hierarchical network of concepts also gives rise to a kind of hierarchical structure on states of the world.

This gives us the start of a language for how to talk about ontologies.  Later we’ll get into: what makes a good ontology? What happens if you change your ontology?  What about decision-making?

Note: terms in bold are from ItOE; quantitative interpretations are my own.  I make no claims that this is the only philosophical language that gets the job done. “There are many like it, but this one is mine.”

Epistemology Sequence, Part 1: Ontology

This sequence of posts is an experiment in fleshing out how I see the world. I expect to revise and correct things, especially in response to discussion.

“Ontology” is an answer to the question “what are the things that exist?”

Consider a reasoning agent making decisions. This can be a person or an algorithm.  It has a model of the world, and it chooses the decision that has the best outcome, where “best” is rated by some evaluative standard.

A structure like this requires an ontology — you have to define what are the states of the world, what are the decision options, and so on.  If outcomes are probabilistic, you have to define a sample space.  If you are trying to choose the decision that maximizes the expected value of the outcome, you have to have probability distributions over outcomes that sum to one.

[You could, in principle, have a decision-making agent that has no model of the world at all, but just responds to positive and negative feedback with the algorithm “do more of what rewards you and less of what punishes you.” This is much simpler than what humans do or what interesting computer programs do, and leads to problems with wireheading. So in this sequence I’ll be restricting attention to decision theories that do require a model of the world.]

The problem with standard decision theory is that you can define an “outcome” in lots of ways, seemingly arbitrarily. You want to partition all possible configurations of the universe into categories that represent “outcomes”, but there are infinitely many ways to do this, and most of them would wind up being very strange, like the taxonomy in Borges’ Celestial Emporium of Benevolent Knowledge:

  • Those that belong to the emperor
  • Embalmed ones
  • Those that are trained
  • Suckling pigs
  • Mermaids (or Sirens)
  • Fabulous ones
  • Stray dogs
  • Those that are included in this classification
  • Those that tremble as if they were mad
  • Innumerable ones
  • Those drawn with a very fine camel hair brush
  • Et cetera
  • Those that have just broken the flower vase
  • Those that, at a distance, resemble flies

We know that statistical measurements, including how much “better” one decision is than another, can depend on the choice of ontology. So we’re faced with a problem here. One would presume that an agent, given a model of the world and a way to evaluate outcomes, would be able to determine the best decision to make.  But the best decision depends on how you construct what the world is “made of”! Decision-making seems to be disappointingly ill-defined, even in an idealized mathematical setting.

This is akin to the measure problem in cosmology.  In a multiverse, for every event, we think of there as being universes where the event happens and universes where the event doesn’t happen. The problem is that there are infinitely many universes where the event happens, and infinitely many where it doesn’t. We can construct the probability of the event as a limit as the number of universes becomes large, but the result depends sensitively on precisely how we do the scaling; there isn’t a single well-defined probability.

The direction I’m going to go in this sequence is to suggest a possible model for dealing with ontology, and cash it out somewhat into machine-learning language. My thoughts on this are very speculative, and drawn mostly from introspection and a little bit of what I know about computational neuroscience.

The motivation is basically a practical one, though. When trying to model a phenomenon computationally, there are a lot of judgment calls made by humans.  Statistical methods can abstract away model selection to some degree (e.g. generate a lot of features and select the most relevant ones algorithmically) but never completely. To some degree, good models will always require good modelers.  So it’s important to understand what we’re doing when we do the illegible, low-tech step of framing the problem and choosing which hypotheses to test.

Back when I was trying to build a Bayes net model for automated medical diagnosis, I thought it would be relatively simple. The medical literature is full of journal articles of the form “A increases/decreases the risk of B by X%.”  A might be a treatment that reduces incidence of disease B; A might be a risk factor for disease B; A might be a disease that sometimes causes symptom B; etc.  So, think of a graph, where A and B are nodes and X is the weight between them. Have researchers read a bunch of papers and add the corresponding nodes to the graph; then, when you have a patient with some known risk factors, symptoms, and diseases, just fill in the known values and propagate the probabilities throughout the graph to get the patient’s posterior probability of having various diseases.
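A deliberately crude sketch of that idea, with invented numbers and none of the real statistical machinery (it just multiplies published-style relative risks against a baseline prevalence, as if the risk factors were independent):

```python
# A deliberately crude sketch (not real Bayesian inference and not
# medically meaningful): edges carry published-style relative risks, and we
# combine the risk factors a patient has by multiplying their relative
# risks against a baseline prevalence. Every number here is invented.
edges = {
    # (risk factor, disease): relative risk
    ("smoking", "heart attack"): 2.0,
    ("hypertension", "heart attack"): 1.8,
    ("family history", "heart attack"): 1.5,
}
baseline = {"heart attack": 0.05}   # invented baseline probability

def crude_risk(disease, patient_factors):
    """Multiply relative risks of the patient's factors (assumes, wrongly
    but simply, that factors act independently) and cap at 1.0."""
    p = baseline[disease]
    for factor in patient_factors:
        p *= edges.get((factor, disease), 1.0)
    return min(p, 1.0)

print(crude_risk("heart attack", ["smoking", "hypertension"]))
```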

This is pretty computationally impractical at large scales, but that wasn’t the main problem. The problem was deciding what a node is. Do you have a node for “heart attack”? Well, one study says a certain risk factor increases the risk of having a heart attack before 50, while another says that a different risk factor increases the lifetime number of heart attacks. Does this mean we need two nodes? How would we represent the relationship between them? Probably having early heart attacks and having lots of heart attacks are correlated, but we aren’t likely to be able to find a paper that quantifies that correlation.  On the other hand, if we fuse the two nodes into one, then the strengths of the risk factors will be incommensurate.  There’s a difficult judgment call inherent in just deciding what the primary “objects” of our model of the world are.

One reaction is to say “automating human judgment is harder than you thought”, which, of course, is true. But how do we make judgments, then? Obviously I’m not going to solve open problems in AI here, but I can at least think about how to concretize quantitatively the sorts of things that minds seem to be doing when they define objects and make judgments about them.

Values Affirmation Is Powerful

One of the most startlingly effective things I’ve seen in the psychology literature is the power of “self-affirmation.”

The name is a bit misleading. The “self-affirmation” described in these studies isn’t looking in the mirror and telling yourself you’re beautiful.  It’s actually values affirmation — writing short essays about what’s important to you in life (things like “family”, “religion”, “art”) and why you value them. The standard control intervention is writing about why a value that’s not very important to you might be important to someone else.

Values affirmation has been found in many studies to significantly improve academic performance in “negatively stereotyped” groups (blacks, Hispanics, and women in STEM), and these effects are long-lasting, continuing up to a year after the last exercise.[1]  Values affirmation causes about a 40% reduction in the black-white GPA gap, concentrated in the middle- and low-performing students.[4]

Values affirmation exercises reduce the cortisol response (cortisol is a “stress hormone”) in response to social stress tasks, as well as reducing self-reported stress.[2]  Students assigned to a values-affirmation exercise did not have an increase in urinary epinephrine and norepinephrine (measures of sympathetic nervous system activity) in the weeks before an exam, while control students did.[5]  People who have just done a self-affirmation exercise have less of an increase in heart rate in response to being insulted.[6]

A fifteen-minute values affirmation exercise continued to reduce (questionnaire-measured) relationship insecurity for four weeks after the initial exercise.[3]

The striking phenomenon is that a very short, seemingly minor intervention (spending 15 minutes on a writing task) seems to have quite long-lasting and dramatic effects.

There are lots and lots of studies pointing in this direction, and I haven’t looked in great depth into how sound their methodology is; I still consider it quite possible that this is a statistical fluke or result of publication bias.  But it does seem to mesh well with a lot of ideas I’ve been considering over the years.

There is a kind of personal quality that has to do with believing you are fit to make value judgments.  Believing that you are free to decide your own priorities in life; believing that you are generally competent to pursue your goals; believing that you are allowed to create a model of the world based on your own experiences and thoughts.

If you lack this quality, you will look to others to judge how worthy you are, and look to others to interpret the world for you, and you will generally be more anxious and more likely to unconsciously self-sabotage.

I think of this quality as being a free person or being sovereign.  The psychological literature will often characterize it as “self-esteem”, but in popular language “self-esteem” is overloaded with “thinking you’re awesome”, which is different.  Everybody has strengths and weaknesses and nobody is wonderful in every way.  Being sovereign doesn’t require you to think you’re perfect; it is the specific feeling that you are allowed to use your own mind.

What the self-affirmation literature seems to say is that this quality is incredibly important, and incredibly responsive to practice.

The stereotype threat literature in particular suggests that there is an enormous aggregate cost, in terms of damaged academic and work performance and probably health damage, due to the loss of a sense of sovereignty among people whom society stereotypes as inferior.

Put another way: being a “natural aristocrat”, in the sense of being a person who is confident in his right to think and decide and value, gives you superpowers. My intuition is that people become much, much smarter and more competent when they are “free.”

And if promoting psychological freedom is as easy as the self-affirmation literature suggests, then people interested in maximizing humanitarian benefit should be interested.  Human cognitive enhancement is a multiplier on whatever good you want to do, just as economic growth is; it increases the total amount of resources at your disposal.  Raising IQ seems to be hard, once you get past the low-hanging fruit like reducing lead exposure, but reducing stereotype threat seems to be much easier.  I have a lot of uncertainty about “what is the most useful thing one can do for humanity”, but making saner, freer people arguably deserves a spot on the list of possibilities.

[1]Sherman, David K., et al. “Deflecting the trajectory and changing the narrative: How self-affirmation affects academic performance and motivation under identity threat.” Journal of Personality and Social Psychology 104.4 (2013): 591.

[2]Creswell, J. David, et al. “Affirmation of personal values buffers neuroendocrine and psychological stress responses.” Psychological Science 16.11 (2005): 846-851.

[3]Stinson, Danu Anthony, et al. “Rewriting the Self-Fulfilling Prophecy of Social Rejection: Self-Affirmation Improves Relational Security and Social Behavior up to 2 Months Later.” Psychological Science 22.9 (2011): 1145-1149.

[4]Cohen, Geoffrey L., et al. “Reducing the racial achievement gap: A social-psychological intervention.” Science 313.5791 (2006): 1307-1310.

[5]Sherman, David K., et al. “Psychological vulnerability and stress: the effects of self-affirmation on sympathetic nervous system responses to naturalistic stressors.” Health Psychology 28.5 (2009): 554.

[6]Tang, David, and Brandon J. Schmeichel. “Self-affirmation facilitates cardiovascular recovery following interpersonal evaluation.” Biological Psychology 104 (2015): 108-115.

Changing My Mind: Radical Acceptance

I used to be really against the notion of radical acceptance.  Or, indeed, any kind of philosophy that counseled not getting upset about bad things or not stressing out over your own flaws.

The reason why is that I don’t like the loss of distinctions.  “Science” means “to split.”

If you dichotomize  “justice vs. mercy”, “intense vs. relaxed”, “logic vs. intuition”, and so on, I’m more attracted to the first category. I identify with Inspector Javert and Toby Ziegler. I admire adherence to principle.

And there’s a long tradition of maligning “intense” people like me, often with anti-Semitic or ableist overtones, and I tend to be suspicious of rhetoric that pattern-matches to those associations.  There’s a pattern that either frames intense people as cruel, in a sort of “Mean Old Testament vs. Nice New Testament” way, or as pathetic (“rigid”, “obsessive”, “high need for cognitive closure”, etc).  “Just relax and don’t sweat the small stuff” can be used to excuse backing out of one’s commitments, stretching the truth, or belittling others’ concerns.

There’s also an aesthetic dimension to this. One can prefer crispness and sharpness and intensity to gooey softness.  I think of James Joyce, an atheist with obvious affection for the Jesuitical tradition that taught him.

So, from where I stand, “radical acceptance” sounds extremely unappealing. Whenever I heard “You shouldn’t get mad at reality for being the way it is”, I interpreted it as “You shouldn’t care about the things you care about, you shouldn’t try to change the world, you shouldn’t stand up for yourself, you shouldn’t hold yourself to high standards.  You’re a weird little girl and you don’t matter.”

And of course I reject that. I’m still passionate, still intense, still trying to have integrity, and I don’t ever want to stop caring about the difference between true and false.

But I do finally grok some things about acceptance.

  • It’s just not objectively true that anything short of perfection is worth scrapping.  I can be a person with flaws and my life is still on net extremely worthwhile.  That’s not “bending the rules”, it’s understanding cost-benefit analysis.
  • There’s a sense in which imperfections are both not good and completely okay.  For example: I have a friend that I’ve often had trouble communicating with. Sometimes I’ve hurt his feelings, sometimes he’s hurt mine, pretty much always through misunderstanding.  My Javertian instinct would be to feel like “This friendship is flawed, I’ve sullied it, I need to wipe the slate clean.” But that’s impossible.  The insight is that the friendship is not necessarily supposed to be unsullied.  Friction and disagreement are what happens when you’re trying to connect deeply to people who aren’t exactly like you.  The friendship isn’t falling short of perfection, it’s something rough I’m building from scratch.
  • “Roughness” is a sign that you’re at a frontier. “Mistakes are the portals of discovery.”  Even the most admirable people have experienced disappointment and tried things that didn’t work.  Life doesn’t have to be glossy or free of trouble to be glorious.  Getting through hard times, or making yourself a better person, are legitimate achievements.  Optimizing for “build something” is life-giving; optimizing for “have no flaws” is sterile.
  • Hating injustice, or hating death, is only a starting point. Yes, bad things really are bad, and it’s important to validate that.  Sometimes you have to mourn, or rage, or protest. But what then?  How do you fix the problem?  Once you’ve expressed your grief or anger, once you’ve made people understand that it’s really not all right, what are you going to do?  It becomes a question to investigate, not a flag to raise.  And sometimes people seem less angry, not because they care less, but because they’ve already moved on to the investigation and strategy-building phase of the work.
  • One idea that allows me to grok this is the Jewish idea that G-d chooses not to destroy the world.  Is the world flawed? Heck yes! Is it swarming with human beings who screw up every day?  You bet!  Is it worth wiping out?  No, and there’s a rainbow to prove it.  Which means that the world, in all its messy glory, is net good.  It beats hell out of hard vacuum.

Choice of Ontology

I noticed an interesting phenomenon while reading this paper about a gene therapy that was intended to treat advanced heart failure.

The paper (which documents a Phase I/II study) found an 82% reduction in the risk of cardiac events in the high-dose group vs. the placebo group, with a p-value of 0.048. Weakly significant, but a big effect size.

On the other hand, if you look at the raw numbers you see that some people have a lot of cardiac events, and some have few or none.  If you divide people into “healthy” vs. “unhealthy”, where “unhealthy” people have at least one cardiac event within a year, and “healthy” people don’t, then the placebo group had 7 healthy and 7 unhealthy patients, while the high-dose group had 7 healthy and 2 unhealthy patients.

If you do a one-sided t-test of this, you get a non-significant 0.07 p-value.

And intuitively, it makes sense that 7 out of 14 healthy patients vs. 7 out of 9 healthy patients could very easily be a fluke.

How you frame the problem, what you consider to be an “event” in your probability space, matters. Do you count cardiac events? Mostly healthy vs. mostly unhealthy people? People with no cardiac events vs. people with any cardiac events? (the latter gives you p=0.089).

One way of framing it is that you posit some kind of hierarchical model.  In this case, your risk of having a cardiac event is drawn from a probability distribution which is something like a mixture of two gamma distributions, one with a “low risk” parameter and one with a “high risk” parameter.

You could make a generative model to test the null hypothesis. Under the assumption that the therapy doesn’t work, you could randomly choose the size of the “high risk” vs. “low risk” population, and then for each patient, draw to see whether they’re high risk or low risk, and then draw again (repeatedly) from the appropriate gamma distribution to get their pattern of cardiac events.  Sampling from this repeatedly gives you the distribution of outcomes under the null hypothesis, which you can then compare against the actual data.
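Here’s a rough sketch of such a simulation, with invented gamma parameters, Poisson event counts, and a random high-risk fraction shared by both groups in each run; repeating it builds the distribution of a chosen test statistic under the null, against which the observed statistic could be compared.

```python
# Sketch of the generative null model described above: each simulated
# patient is high- or low-risk, their event rate is drawn from the
# corresponding gamma distribution, and their yearly event count is a
# Poisson draw from that rate. Repeating this builds the distribution of
# a test statistic under "the therapy does nothing." All parameters
# (gamma shapes, group sizes, mixing distribution) are invented.
import numpy as np

rng = np.random.default_rng(3)
n_placebo, n_treated = 14, 9

def simulate_group(n, high_risk_frac):
    high_risk = rng.random(n) < high_risk_frac
    rates = np.where(high_risk,
                     rng.gamma(shape=4.0, scale=1.0, size=n),   # high-risk rates
                     rng.gamma(shape=0.5, scale=0.5, size=n))   # low-risk rates
    return rng.poisson(rates)

def statistic(placebo_events, treated_events):
    """Difference in the share of patients with at least one event."""
    return (placebo_events > 0).mean() - (treated_events > 0).mean()

null_stats = []
for _ in range(5000):
    frac = rng.beta(2, 2)   # random size of the high-risk subpopulation
    null_stats.append(statistic(simulate_group(n_placebo, frac),
                                simulate_group(n_treated, frac)))
null_stats = np.array(null_stats)

# Compare an observed statistic against this null distribution to get a
# simulation-based p-value, e.g. np.mean(null_stats >= observed).
print("null 95th percentile:", np.quantile(null_stats, 0.95))
```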

You could even make the number of clusters in your mixture, or the cutoffs of the clusters, random variables themselves, and average over different models.  That’s not really eliminating the fact that choice of model matters; it’s just pushing your agnosticism up a meta-level. But it may be general enough to be practically like model-agnosticism: adding more levels of hierarchy to the model might eventually cease to change the answer to “is this therapy significantly effective?”  Note that you’re only getting p-value differences of a few percentage points even when only a single parameter is being tweaked.  (At some point I should try this empirically and see how much difference added model flexibility actually makes.)

But there’s a basic principle here which I see in a lot of contexts, where the output of a statistical algorithm is dependent on your choice of ontology.  And, I think, your choice of ontology is ultimately dependent on your goals.  Do I want to measure reduction in number of heart attacks or do I want to measure number of people who become heart-attack-free? There can’t be an answer to that question that’s wholly independent of my priorities.  Even averaging over models is essentially saying “over a wide range of possible priority structures, you tend to get answers to your question lying in such-and-such a range.”  It doesn’t mean you couldn’t construct a really weird ontology that would cause the algorithm to spit out something completely different.