IX, 2/2013

Punishment, Reinforcement Learning & Machine Agency

Article published in the section Robotics and Public Issues.

Friday, 4 April 2014, 11:00

Recently there has been increasing interest in robot ethics and machine responsibility, as well as the legal frameworks for judging the agency of robots and machines in questions of legal responsibility and liability (Wallach - Allen 2009, Asaro 2011, Storrs Hall 2012). In no small measure this interest has revived some of the central philosophical questions of the past few centuries, including the nature of determinism and the deterministic nature of algorithms, the question of free will and its relation to morality, and the relation of punishment to free will and algorithmic decision-making. Among the central questions here are: If machines are algorithmic, and thus deterministic, how can they be responsible, moral agents? What does it mean to punish an algorithmic, and presumably deterministic, machine? And is simple non-determinism, of the probabilistic sort, a sufficient basis for asserting agency or ascribing moral responsibility?
It is important, however, to realize that philosophers did not simply throw up their hands and give up on these questions. While such questions are rarely completely resolved or definitively answered, great intellectual progress was made through their consideration. And while these insights were made in somewhat remote areas of philosophy, it is worth taking a fresh look to see how they might be brought to bear on a contemporary quandary, such as the moral status of machines and robots, in light of our best scientific and philosophical understanding.
In particular, I want to consider a view of machine agency which is being asserted more frequently, and has been articulated by J. Storrs Hall (2012). While the view has many components, I wish to examine it through the assertion that reprogramming a computer or robot is functionally, or even essentially, equivalent to punishment. I believe this view is mistaken for a number of reasons. Primarily it requires an overly narrow interpretation of what punishment is and how it functions. Moreover, it misses the point of how and why we ascribe moral agency in the first place. And finally, it misses an opportunity to shed light on how we might begin to understand agency in algorithmic systems. The paper will begin with a brief summary of the issue. I will then consider the reasons for ascribing moral agency, and administering punishment, at least for humans, and why it is unhelpful to try to apply these to machines, at least in overly simplistic ways. I will then consider how we might begin thinking about machine agency in terms of algorithmic decision-making systems.

Punishment and Reprogramming

Storrs Hall (2012) starts by considering the problem that computers and robots are algorithmic and thus deterministic systems. He entertains various responses to this problem, settling on a view that embraces the unpredictability of a system as being sufficient for us to ascribe agency to it. While I do not agree with his view of physical determinism or the arguments he makes for folk psychological ascriptions of moral agency, his two main conclusions seem quite reasonable. The first is that we must effectively treat the world as non-deterministic, whether or not it is deterministic. I believe it is clear from contemporary physics that the universe is not deterministic, whereas Storrs Hall seems to believe that it is deterministic (for a good introduction to determinism in contemporary physics, I recommend Hawking 1999).
The second is that we make ascriptions of moral agency in light of assessments of the feasible and practical means available to the subject whose actions we are judging.
In regard to the second point, I believe that it is true that we do not apply strict standards or criteria of moral agency when making such ascriptions. But I would suggest that it is better to follow Strawson's (1962) argument on this matter, rather than argue that this has anything to do with the complexity or architecture of the individual, or robot, that we are judging. It is a matter of fact that questions of agency and responsibility only arise when we actively seek to judge an agent's actions, and make an ascription of agency or responsibility. Consequently, such ascriptions are always ad hoc reconstructions. This might look different from the subject's own perspective, when actually considering and making a decision. However, we do not have access to this subjective perspective in the case of human agents, though we may attempt to simulate it, and practically speaking we do not really have access to it in most cases involving computers and robots either. Even if we do have access to the causal structures, and decision architectures of agents, these will not necessarily form the exclusive basis for ascribing moral agency.
From these conclusions, Storrs Hall then suggests that reprogramming and machine learning algorithms, such as reinforcement learning, are essentially equivalent to punishment for computers and robots with decision-making software. As he states it:

«We can clarify the final point by considering the purest possible form of punishment or reward for a rational robot. The point of punishment is to change the effective result of its utility calculation. Thus we could achieve the same effect simply by changing its utility function. Thus to state that something was beyond the robot's control seems equivalent to saying that the robot would not have changed the outcome under any possible utility function. This seems to map quite naturally on to a statement about a human to the effect that "he couldn't have changed it if he had wanted to".
Gazzaniga (2011) points out that a key intuitive difference between humans (and animals such as dogs and horses) and machines is that when a human misbehaves, you punish it, whereas when a machine does, you fix it.
On our present theory, however, it becomes clear that punishing and fixing are essentially the same: punishing is a clumsy, external way of modifying the utility function. Furthermore, a closer analysis reveals that fixing - modifying the robot's utility function directly - is tantamount to punishment, in the sense that the robot would not want it to happen and would act if possible to avoid it» (Storrs Hall 2012, p. 4).

This passage is making a number of logical moves in an effort to establish an equivalence between punishment and «fixing» a computer program. I believe this argument and its conclusion are misleading and mistaken for a number of reasons, which I will attempt to elucidate in this paper. In short, I believe that this view takes an overly reductive and simplistic view of punishment, on the one hand. On the other hand, it confuses the nature of machine learning in a way that I believe obscures a potential insight into the nature of agency and autonomy.
The claim that «it becomes clear that punishing and fixing are essentially the same: punishing is a clumsy, external way of modifying the utility function,» could mean one of two things. It could mean that for a simple decision-making program, these are essentially the same. Or it could mean that punishing and fixing are the same in general, for all systems. With regard to the first point, it is important to consider what kind of perspective a system might have of its own decision-making, i.e. its own reflexivity. And this is where we might actually draw some insight into agency in machines. But first, I want to consider the second point, and whether punishment is essentially «fixing» in general, and will turn now to a consideration of the role of punishment in the law.


In thinking about ascribing moral agency and responsibility, it is helpful to consider why we punish people when we find them responsible for wrong-doing. The reasons turn out to be more complicated and interesting than we might assume. There are, in fact, multiple reasons for punishing people, and these can be at odds with each other. As a society, we do not even need to agree on the reasons for punishing people, and in practice the law and moral judgment often mix and blend these reasons together. If we try to separate them out, the main reasons are retribution, deterrence and reform.
The notion of punishment as a form of retributive justice has roots in ancient law, e.g. «An eye for an eye, a tooth for a tooth,» but also has a modern formulation. Retribution is the idea that your violation of a law as an individual creates a debt to society. You have in some sense taken advantage of everyone else following the law without following it yourself, and thus you have taken an extra privilege against society at large. Under the social contract, we enter into a law-bound society to gain certain protections from harm, and agree not to harm other members of society. For instance, I am not supposed to steal your stuff, and you are not supposed to steal my stuff. If I steal your stuff, I have clearly harmed you, but I have also violated the law. So it is not enough for me to simply return your stuff - that would be the straight liability or tort. Theft is not a tort, but a crime. For an act to be a crime there must be criminal intent, and the punishment goes beyond the monetary value of the theft because there is more that is due to the society whose laws have been broken. While there might be a fine imposed, the fine is paid to the state, not directly to the victims, and other punishments such as imprisonment do not directly benefit the state or the prisoner at all.
In the era of psychological behaviorism and social engineering, deterrence has emerged as a principal function of punishment, at least in the framing of many laws. As a function of punishment that aims to prevent future crimes, deterrence works on two levels. The first is the individual causal level of deterrence, where we confine, exile or kill the guilty person, thereby removing agents from society and preventing them from further bad acts. This has been the justification for extending longer sentences and zero-tolerance policies under the current political rhetoric of getting tough on crime - to keep all the bad apples off the streets.
The other level is the social psychological level of deterrence, wherein we aim to alter everybody's future actions by placing a negative cost on taking wrongful actions. Because we can recognize other people's mistakes, and witness the punishment they receive for their transgressions, we are all meant to learn from their punishment and think twice before doing what they did. This was in part why punishments were often a public display like the pillory and gallows - to demonstrate both that the state would catch you, and that the consequences would be bad.
In the last few decades of social engineering, the most salient and discussed purpose for punishment is reform, wherein the intention is to change the character or behavior of the person who has done something wrong. Reform has multiple interpretations, depending on what we think is wrong with the person who does wrong, and thus upon our moral and psychological theories. If we are utilitarians, we might think that the wrongdoer has miscalculated the values of certain actions, or failed to consider the costs to others and only considered the benefits they might receive. What we need to do in such cases is revise faulty utility functions. For instance, we could impose monetary penalties, and in terms of economic decision making, the rational person will recognize the additional costs of getting caught and being punished, and thus avoid choosing illegal actions.
We could also apply virtue ethics here instead. In this case, our aim is to train people to internalize our moral and legal structures, and reform their aims by providing the moral education that might have previously been lacking. This notion also has important implications if we think about automating law enforcement: Is it sufficient just to get people to obey the rules or do we want them actually to understand why those rules are there and to internalize those rules as members of society? And this begins to get at other notions of virtues, self-realization and autonomy. Consider why we punish children. We want children to learn a specific lesson and not to repeat their behavior, but we also want them to learn a deeper lesson as well. We want them to become a better person, and that is about internalizing the reasons why the rules are in place and that they should not violate them even for a desired advantage, or when they can get away without being caught or punished in a particular situation. And we could also approach reform as Kantians. In this case, we would want the wrongdoer to recognize their duties, and to respect the rights of others, and to internalize these duties and rights into their decision-making as they act in the world.
To return to the claim by Storrs Hall (2012) that punishment and «fixing» are the same, it should now be clear that fixing, or reform, is only one aspect of punishment. Moreover, how we interpret reform and fixing depends on our moral theory and theories of psychology and development. Thus, this claim only makes sense in a very restrictive interpretation of all these concepts. So it does not really hold up, or appear useful, as a general claim. Let us now consider the more restricted version, according to which this equivalence would apply only to a reinforcement learning robot or computer.

Robots, Reinforcement Learning and Punishment

It is important to keep in mind these different variations on the notion of reform, and its distinction from deterrence and retribution, as we consider various proposals for the punishment of robots, such as that of Storrs Hall (2012). According to that proposal, we are meant to consider a robot that is capable of sophisticated decision-making, and also capable of learning. We are further asked to consider that punishments of this system are meant primarily or exclusively as reforms of the system, aimed at improving its future performance and actions.
The first thing to note here is that a robot following a utility function is only deterministic in a qualified sense, not in the metaphysical sense discussed above. We can treat a system that implements a rational decision function as being logically deterministic, and expect it to make the same decision given the same inputs. Apart from building in a randomizing function, we are not really dealing with «magical free will» in such cases (Wallach - Allen 2009, pp. 59-63).
Storrs Hall (2012) appears to have in mind a form of reinforcement learning, a machine learning technique that uses nominal punishments and rewards. The idea is that if the robot makes the wrong choice and you want it to make the right choice in the future, you need to change its decision structure. Typically what we want to do is change the probabilities of making a certain decision or we want to change the values placed on the outcomes such that we re-weight the decision process and the desired outcome becomes more likely, or guaranteed, next time. There are many technical problems with the actual implementation of such learning. Among these is the temporality problem. If you have a robot that has made a very complicated sequence of decisions, e.g. played a long game of chess and then lost, what do you change? How do you decide which move lost the game? Or what was the responsibility of each individual move in the overall sequence of moves that lost the game? How do you decide how you are going to re-weight that whole chain of decisions, based on one outcome at the end? This becomes a really difficult problem, computationally speaking.
With a well-constrained system like chess we can try to deal with that formally, and we have additional information, such as multiple games and multiple alternatives for each position, to help determine more accurately where revisions should take place. It is not a straightforward problem at all; it is very complicated even in a formal closed system like a game of chess. And then we can add the fact that, although chess is not a probabilistic game, in the sense that every state is fully determined, the other player is choosing moves based on different probabilities for expected outcomes. We could further try to model the probability functions and strategy of our opponent, as those diverge from our own model of an ideal player, which raises a whole new set of issues. As we start to consider our robot taking actions in a world with many agents, in which the options are not always clear, figuring out where our decisions might have gone wrong gets even more difficult.
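The difficulty of deciding "which move lost the game" is usually called temporal credit assignment in the machine learning literature. The following is a minimal sketch of one simple heuristic, not a method proposed by Storrs Hall: it spreads a single end-of-game outcome back over the sequence of moves, discounting earlier moves more heavily, and then nudges each move's stored value toward its share. The function names, the discount factor and the learning rate are all illustrative assumptions.

```python
# Illustrative sketch only: one simple way to spread blame for a lost game.
# All names and parameter values are assumptions, not taken from the paper.

def discounted_credit(num_moves, final_outcome, discount=0.9):
    """Assign each move a share of the final outcome.

    The last move receives the full outcome; each earlier move receives
    a share shrunk by one more factor of `discount`.
    """
    return [final_outcome * discount ** (num_moves - 1 - i)
            for i in range(num_moves)]


def update_values(values, moves, credits, learning_rate=0.1):
    """Return a copy of `values` with each played move nudged toward its credit."""
    new_values = dict(values)
    for move, credit in zip(moves, credits):
        old = new_values.get(move, 0.0)
        new_values[move] = old + learning_rate * (credit - old)
    return new_values


# A four-move game that ended in a loss (outcome -1).
moves = ["move_1", "move_2", "move_3", "move_4"]
credits = discounted_credit(len(moves), final_outcome=-1.0, discount=0.5)
updated = update_values({}, moves, credits)
```

Note that the heuristic simply assumes that later moves matter more. Nothing in the single outcome justifies that assumption, which is exactly why the re-weighting problem is hard.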
Things start to get really interesting when we consider what might happen if we programmed the robot to actually reflect on this problem for itself. Storrs Hall (2012) uses an interesting example to get at this. The example involves training a robot assistant, and raises the question of how we think about punishment and its relation to revising the utility functions that form the robot's algorithmic decision-making. For the sake of clarity, here is his full description of the example:

«On our present theory, however, it becomes clear that punishing and fixing are essentially the same: punishing is a clumsy, external way of modifying the utility function. Furthermore, a closer analysis reveals that fixing or modifying the robot's utility function directly is tantamount to punishment, in the sense that the robot would not want it to happen and would act if possible to avoid it.
Consider a robot in a situation with two alternatives: it can pick up a $5 bill or a $10 bill, but not both. Its utility function is simply the amount of money it has. It will choose to pick up the $10.
Suppose we want the robot to pick the $5 instead. We threaten to fine it $6 for picking the $10 bill. It will of course pick up the $5, and be better off than the net $4 resulting from the other choice.
Now suppose we give the robot the choice between being in the situation where it is free to choose unencumbered, and the one in which we impose the fine. It will pick the former, since in that situation it winds up with $10 and in the other, $5.
Suppose instead that we give the robot a choice between the unencumbered situation, and being «fixed» - having its utility function changed to prefer the $5 to the $10. It will choose the unencumbered situation for the same reason as before: it will gain $10 from that and only $5 from the other one.
It would be incorrect to think that the prospect of preferring the $5 after being fixed will make a difference to the first choice. The first choice is being made under the present utility function, which by stipulation was concerned with money only. In fact the logical form of the robot's reasoning is that of a two-player game, where the robot's first choice is its own move, and its second choice after possibly being fixed, is the opponent's move. The rational robot will apply a standard minimax evaluation» (Storrs Hall 2012, p. 4).
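The arithmetic of the quoted example can be checked with a short sketch. This is only an illustration of the passage under its own stipulations (a utility function equal to the money held, a $6 fine, and a «fixed» utility that prefers the $5 bill). The function and variable names are hypothetical, and the fixed utility function is one arbitrary way of encoding a preference for $5.

```python
# Illustrative sketch of the quoted $5/$10 example; names are hypothetical.

def money_utility(payoff):
    """Present utility function: utility is simply the money obtained."""
    return payoff


def fixed_utility(payoff):
    """Assumed 'fixed' utility: prefers exactly $5 over anything else."""
    return 1 if payoff == 5 else 0


def choose(options, utility):
    """Return the label of the option whose payoff has the highest utility."""
    return max(options, key=lambda label: utility(options[label]))


BILLS = {"take_5": 5, "take_10": 10}
FINED = {"take_5": 5, "take_10": 10 - 6}  # $6 fine for taking the $10 bill

# Unencumbered choice under the present utility function.
free_choice = choose(BILLS, money_utility)
free_payoff = BILLS[free_choice]

# Choice under the threat of the fine.
fined_choice = choose(FINED, money_utility)
fined_payoff = FINED[fined_choice]

# Choice after being "fixed": same bills, different utility function.
fixed_choice = choose(BILLS, fixed_utility)
fixed_payoff = BILLS[fixed_choice]

# Meta-choice between situations, evaluated under the PRESENT utility.
situations = {
    "unencumbered": free_payoff,
    "fined": fined_payoff,
    "fixed": fixed_payoff,
}
meta_choice = choose(situations, money_utility)
```

Under these stipulations the meta-choice is the unencumbered situation, as the passage says. The sketch also shows what the example takes for granted: the robot must already know exactly how its fixed future self will choose.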

I think this equation of punishment with fixing is mistaken and reductive not only because it reduces punishment to a particularly narrow interpretation of reform, but also because it takes a very narrow view of how we should represent decisions to change our own methods of making decisions. It will take a bit to explain why I see it this way.
Storrs Hall's example corresponds closely to what are called Ulysses' problems in decision theory. In the story, Ulysses wants to hear the song of the sirens, but he knows that the song is so seductive that anyone who hears it will steer his ship into the troubled waters and sink. So he tells his sailors to plug their ears with wax and tie him to the mast. And he also tells them that when he hears the sirens' song and begs his crew to untie him from the mast or listen to the song, they should ignore his requests and orders. In this situation Ulysses has a certain set of probabilities, values and desires, e.g. he does not want to crash his ship, but he does want to hear the siren song. He also knows that at that future point in time, when he is listening to the sirens, he will be willing to crash his ship to get closer to them and their song. Thus, he knows now that he is not going to be rational at that moment in the future - that he will have a different utility function - and that the one he has now is better, or more desirable in the long run, than the one he will have then. And so he has himself tied to the mast to prevent himself from acting under the irrational utility function.
Ulysses' problems in decision theory deal with meta-choices over different utility functions, and are useful for thinking about the decisions made around drug addiction. If you know you are a drug addict and that you are going to make poor choices when you are under the influence of a drug, then what is the rational way to decide whether or not to take the drug, when you know that it is going to change your values and your rational deliberation at a future point in time? But this also applies to education and learning. When we decide to pursue an education, such as a college degree, we do not really know what exactly we will learn or how the educational experience will change us and our utility functions. In a formal sense, going to college is just as irrational as taking addictive drugs. Of course, we have institutions and social values which aim to ensure the positive value of education, and these are lacking with the siren songs and mind-altering drugs. The point here is that it matters considerably how we interpret the situation in which we revise our decision structures. Not all revisions are good, nor are they all bad, and often we do not have sufficient means for judging these in advance, and less often are we able to choose them freely.
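The structure of a Ulysses problem can be sketched as a meta-choice in which the present self ranks plans by the outcomes its future self is predicted to choose. Everything in the sketch is an illustrative assumption: the outcome labels, the numerical utilities and the function names are hypothetical and are chosen only to reproduce the shape of the story.

```python
# Illustrative sketch of a Ulysses problem; all names and numbers are assumptions.

def utility_now(outcome):
    """Utility function of Ulysses before hearing the song."""
    return {
        "hear_song_and_survive": 10,
        "survive_without_song": 5,
        "shipwreck": -100,
    }[outcome]


def utility_under_song(outcome):
    """Assumed altered utility while the sirens are singing."""
    return {
        "hear_song_and_survive": 1,
        "survive_without_song": 0,
        "shipwreck": 50,  # reaching the sirens now outranks everything
    }[outcome]


# Each plan fixes which outcomes the future self can still choose,
# and whether the future self will be under the song's influence.
PLANS = {
    "bound_to_mast": {"options": ["hear_song_and_survive"], "hears_song": True},
    "free_with_open_ears": {
        "options": ["hear_song_and_survive", "shipwreck"],
        "hears_song": True,
    },
    "wax_in_ears": {"options": ["survive_without_song"], "hears_song": False},
}


def predicted_outcome(plan):
    """The future self picks its best remaining option under ITS utility."""
    later_utility = utility_under_song if plan["hears_song"] else utility_now
    return max(plan["options"], key=later_utility)


def best_plan(plans):
    """The present self ranks plans by the PRESENT utility of predicted outcomes."""
    return max(plans, key=lambda name: utility_now(predicted_outcome(plans[name])))


chosen = best_plan(PLANS)
```

The sketch works only because `utility_under_song` is written down in advance. In the cases of education or open-ended learning discussed above, that future utility function is precisely what the chooser does not have.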
Generally, in reinforcement learning what we are doing is making very small tweaks to our model of the world and to the utility functions. We can change the values for how we evaluate certain outcomes, and can make some outcomes more or less desirable. But we can also change probabilities - our expectations for how the world will behave - and this is what a lot of scientific understanding is about. Empirical knowledge aims to develop better estimates of the probabilities of certain outcomes in the world given certain conditions. In our uncertain and probabilistic universe this is very important.
There is another way to approach the revision of utility functions, pointed to by Storrs Hall's reference to the minimax solution, which includes strategy and risk aversion, and is operative in multi-agent reasoning and game theory. We can have estimations not only of our own utility functions, but also models of our opponents and their decision structures. So we have to weigh our own uncertainties, but also our opponents' probability estimates of outcomes, which implies additional uncertainty about whether we have an accurate representation of their probability estimates and their aversion to risk. In the Cuban missile crisis, can JFK really know whether the Soviets are going to launch their nuclear missiles or not? Are they using the same utility function that JFK would use if he were them? Or does he think they are fundamentally irrational in a particular way? This becomes very complicated, moreover, because agents can have different relations to risk: they can be risk averse, or risk accepting, and they can change their risk aversion. This raises the further question: when and how does one decide to change one's own risk aversion?
The suggestion in the passage above is that the robot can play this multi-agent game with itself, running its alternative utility functions against each other to choose the winner. But this is problematic. First, it is not clear what it means to play a zero-sum game with oneself. The challenge of game-theoretic problems such as the prisoner's dilemma is that one is not able to communicate with the other prisoner to establish cooperation. The successful tit-for-tat strategy in repeated games is essentially a mode of communicating the intention to cooperate. What would it mean to cooperate with oneself in such games? Moreover, this points to the problem of access to information. How is the robot meant to access the information about its future self, and the results of its future decisions, and the influence of its learning on those decisions? For any non-trivial learning, it necessarily lacks access to that information.
For any sufficiently sophisticated learning algorithm, learning amounts to the encoding of experience as an inclination to behave in a certain way in the future, and is a form of data compression. All statistical reinforcement learning algorithms are essentially processes for generating compressed representations of behavioral outputs over sets of inputs. This process is what is known in data compression as a lossy process. That is, there is a loss of information in the process of encoding, making it irreversible in the sense that you cannot recover the full resolution of the original from the copy. The loss is desirable in the sense that the goal is to create a compact and efficient representation rather than store every possible input and treat it individually. But it also means that learning cannot be undone, at least after further learning has taken place. Once you have learned from two different experiences, it becomes difficult or impossible to separate which experiences were responsible for which aspects of your current representation or utility function. Given all of the historical states of the system, we might claim that this could be reconstructed and reversed. This is really the same problem as physical determinism - if we know the whole history of the universe and its states, it becomes deterministic and reversible. But the reality is that the learning process is entropic, and irreversible. And as much as we would like to track all of the states, inputs and outputs of a system, we can only do this for very simple or trivial systems. For any complex system we very quickly run into Bremermann's Limit and require far more information than is possible or practical to manage. Indeed, the whole point of learning algorithms is to compress this sort of information.
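The lossiness of learning can be seen even in the simplest incremental update rule. The sketch below assumes a learner that stores a single value and moves it a fixed fraction of the way toward each new reward. This rule is a common textbook form chosen for illustration, not one specified by Storrs Hall, and the names and numbers are hypothetical.

```python
# Illustrative sketch: two different histories compress to the same learned value.
# The update rule and all numbers are assumptions chosen for illustration.

def learn(rewards, learning_rate=0.5, initial_value=0.0):
    """Fold a sequence of rewards into a single stored value.

    Each step moves the value a fixed fraction of the way toward the
    newest reward, so older experiences are progressively blended away.
    """
    value = initial_value
    for reward in rewards:
        value += learning_rate * (reward - value)
    return value


history_a = [4.0, 0.0]  # a large reward followed by none
history_b = [0.0, 2.0]  # no reward followed by a moderate one

value_a = learn(history_a)
value_b = learn(history_b)
```

Both histories leave the learner in exactly the same state, so the stored value alone cannot tell us which experiences produced it. Undoing the learning would require keeping the full history, which is the information the learner was built to discard.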

Creating Choices

There is a deeper problem here and this is where I think we have to think about robots and learning at a more fundamental level. It gets into the question of when we might start treating robots and machines as agents who might be punishable, as agents. In changing their model of the world, the utility function, most algorithms only deal with values and probabilities for a fixed set of actions and outcomes. They neither add nor delete any options from the set of choices and actions. But autonomous agents can change the world itself, can introduce new entities in the world, can create new alternatives and new options.
How do we do that? It is a very creative process, and it is what happens when we are learning and when a child goes from being a two-year-old, to whom we would not want to ascribe full moral agency, to being an eighteen-year-old, to whom we do ascribe moral agency. Because they understand a lot more about the world, their world is far more sophisticated. It is not just that they have adjusted some utility functions on the world of the two-year-old. Moreover, they create new alternatives, things that their parents, teachers and society did not foresee or provide. In creating these alternatives, if you are a Kantian, you are creating your own moral autonomy, deciding who you want to be by how you see the world and how you make choices. We create a model of ourselves to the extent that we recognize the virtues, values or ethics that we are internalizing as a part of self-knowledge and self-discovery and assertions of our autonomy, of taking responsibility for our actions, instead of just conforming to a behavioral outcome, following the rules. We can follow the rules without believing them, or we can follow the rules by believing and internalizing them, and that is different.
When we punish a robot, is this a punishment or is this just repairing the robot, and why is it important to distinguish these? It is helpful to think about the law of liability and torts here. There are cases where human action brings about damages, or where human inaction with regard to responsibilities or property can cause damages, but there are also situations in which nature itself does damage to property. Hurricanes can do great harm, and nobody is responsible for that - we might call it an act of nature or an act of God. If I own a robot, and the robot does something wrong, this is no longer an act of nature because I have a causal and legal relation to the robot that makes me liable for its actions, even though I do not necessarily intend for the robot to do all the things that it does. But torts and liability are quite capable of dealing with unintentional harms. And if the damages are intentional, then we are talking about the possibility of guilt and culpability of the owner of the robot, and we move into criminal law. This is the Latin mens rea, the guilty mind or intention that is required in order for the act to be criminal - even though there are some difficulties in the context of criminal negligence, because the omission of acts can also imply guilt, and it is a bit peculiar that you can be guilty for not doing things (this notion of guilt goes back to duties: the understanding that you had a duty and failed to carry it out; for more on this, see Asaro 2011).
If we consider another aspect of punishment, retribution, then we are also concerned about the intention behind the actions, and not just the consequences of the actions. It is very provocative to think about what constitutes malice in a robot. This has to be something more than just the harm that is caused by a robot; it also requires a specific intention that its act cause harm. Such a robot would have to be able to represent the world, to represent the agents in the world, represent itself in the world, and have a moral model of itself and other agents in the world. It would also have to recognize that if it does a certain act, it will be violating its own morality by its own model, and recognize that it wants to do that anyway and chooses that at some level. This is very complicated to think about in terms of robots, but it is one direction we might go.
I believe the most promising direction for pursuing the question of machine agency is to focus on the processes by which agents generate alternatives. In a world in which one can merely choose between options, one has very little freedom. If one really wants to transform the world, an agent needs to be capable of inventing new alternatives. While it is quite mysterious how agents do this, it seems to me to be essential for understanding autonomy and moral agency. Moreover, it seems that the ability to choose systems of value is not simply that of choosing between two given sets of values, but also of being able to determine why one set of values is more desirable than another. It may not be easy or even possible to model such systems computationally, but considering such systems would be a good start toward understanding machine agency and moral autonomy.

This paper was originally prepared as a response to J. Storrs Hall, Towards Machine Agency: A Philosophical and Technological Roadmap and was presented at the 2012 "We Robot" conference on law and robotics at the University of Miami Law School. I have attempted to make its arguments more general, by addressing commonly held views of computation, determinism and responsibility expressed in that paper.

Bibliographical References

P. M. Asaro (2011), A Body to Kick, But Still No Soul to Damn: Legal Perspectives on Robotics, in Patrick Lin, Keith Abney, and George Bekey (eds.), Robot Ethics: The Ethical and Social Implications of Robotics, MIT Press, Cambridge (Mass.) 2011, pp. 169-186.
J. Storrs Hall (2012), Towards Machine Agency: A Philosophical and Technological Roadmap, presented at the 2012 "We Robot" Conference, University of Miami Law School.
S. Hawking (1999), Does God Play Dice?
P. F. Strawson (1962), Freedom and Resentment, in «Proceedings of the British Academy», Vol. 48, 1962, pp. 1-25.
W. Wallach, C. Allen (2009), Moral Machines: Teaching Robots Right from Wrong, Oxford University Press, Oxford 2009.