Learning rules

When it comes to dogs, I never stop working. There is always something that sets me in motion, something that inspires me to try, to experiment, and to reason about and analyze rules that seem to be defined and accepted by everybody else. Usually, the first reaction I get is negative: who are you to question what the experts say? Still, I keep trying, experimenting and reasoning, even when nobody is listening or when I only get negative feedback. That is what happened to me, and still happens, with the topic of the variable ratio schedule of reinforcement.

If you have studied the principles and terminology of animal psychology, you will know that a positive reinforcer is a pleasant event connected with a certain type of behavior. Positive reinforcement increases the probability that the dog (or any animal) exhibits that behavior. The dog takes a step back and it gets a treat. The dog takes another step back and it gets another treat. The dog eats the treat, and then it watches us and takes another step back to get yet another treat. This, in short, is the transition from classical conditioning (a purely temporal connection between a behavior and its pleasant consequence) to operant conditioning (the dog offers the behavior to produce a pleasant consequence). The ratio of reinforcement measures the connection between successful behavior and positive consequence. If the dog gets a treat every time it takes a step back, the ratio of reinforcement is fixed at 1:1. If the dog gets a treat every two steps back, the ratio of reinforcement is still fixed, but at 2:1. If the dog gets a treat after a variable number of steps, the ratio of reinforcement is variable.
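To make the difference between fixed and variable ratios concrete, here is a minimal sketch in Python. The function names `fixed_ratio` and `variable_ratio` are my own, and drawing the gap between treats uniformly between 1 and twice the mean minus 1 is only one assumed way to make a ratio "variable"; it is an illustration, not a training protocol.

```python
import random

def fixed_ratio(ratio, responses):
    """One flag per response: True where a treat follows.

    ratio=1 means a treat after every step back (1:1),
    ratio=2 means a treat after every second step back (2:1).
    """
    return [i % ratio == 0 for i in range(1, responses + 1)]

def variable_ratio(mean_ratio, responses, seed=0):
    """Treat after a random number of responses that averages mean_ratio.

    Assumption for this sketch: each gap is drawn uniformly
    from 1 .. 2 * mean_ratio - 1, so its mean is mean_ratio.
    """
    rng = random.Random(seed)
    flags = []
    remaining = rng.randint(1, 2 * mean_ratio - 1)
    for _ in range(responses):
        remaining -= 1
        if remaining == 0:
            flags.append(True)   # this response is reinforced
            remaining = rng.randint(1, 2 * mean_ratio - 1)
        else:
            flags.append(False)  # correct response, but no treat this time
    return flags

print(fixed_ratio(1, 4))  # [True, True, True, True]
print(fixed_ratio(2, 6))  # [False, True, False, True, False, True]
```

With a fixed ratio the dog could, in principle, predict which step pays; with `variable_ratio` the position of the next treat changes from one run of steps to the next.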


IN THE WORLD OF DOG TRAINERS, THERE ARE THOSE WHO ARGUE THAT IT IS USEFUL, IF NOT NECESSARY, TO SWITCH TO VARIABLE SCHEDULES OF REINFORCEMENT AS SOON AS POSSIBLE. HOWEVER, THERE ARE TWO SCHOOLS OF THOUGHT.

One school of thought believes that the unpredictability of reinforcement increases motivation and makes the behavior resistant to extinction. Imagine that when you push a button, you get one Euro. If you get a Euro every time you press the button, and you suddenly don’t get a Euro in return, you will think that the button no longer works. However, if you get the Euro in an unpredictable sequence, the best strategy is to keep on pressing. It is the principle of gambling.
The other school of thought claims that the only workable strategy to achieve and maintain a learned behavior is to apply a fixed ratio of reinforcement of 1:1. I think I heard Ken Ramirez explain this concept in one of his seminars, and I remember thinking: "at last, someone who thinks like me." A fixed ratio of reinforcement of 1:1 does not mean that the dog should get a treat for every step back. That would be a limited view of what learning is, how reinforcement works and what learned behaviors are.
To maintain a fixed ratio of reinforcement of 1:1 and to avoid giving a treat for every step back, there are different strategies:

Varying the reinforcers. You can teach your pet to accept different types of reinforcers: primary or secondary (with a social, high or low value). The dog is reinforced every time, but the reinforcers vary.
You can use a certain behavior to reinforce another behavior. This is the Premack Principle. In Obedience, we can send the dog to the cone and, when it is there, send it to the box. If the dog really enjoys the box, going to the cone is reinforced by knowing it will go to the box afterwards. Only when the box is reached does the dog get its ball. Wanting to go to the box reinforces the dog’s will to go to the cone first. The ratio of reinforcement is still fixed at 1:1.
You can transform the initial behavior into something more complex. Now, we have connected a step back with a treat. We wait, and when the dog offers two steps back, we connect two steps back with a treat. We're still reinforcing in a fixed ratio of reinforcement of 1:1, but our unit has moved from 1 step to 2 steps. This is the principle of durations, sequences and chains. Our brick (one step back) has become a bigger brick, or a collection of different bricks.
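The third strategy is mostly a matter of arithmetic, so a small sketch may help. The function `reinforcers_for` below is a hypothetical name of mine; it only counts treats, on the assumption that every complete unit (brick) earns exactly one reinforcer and leftover steps earn none.

```python
def reinforcers_for(steps, unit_size):
    """Count treats when each complete unit of `unit_size` steps earns one.

    The ratio per unit stays 1:1; only the size of the unit changes.
    """
    units = steps // unit_size      # complete bricks the dog has offered
    return {
        "units": units,
        "reinforcers": units,       # one treat per brick, always 1:1
        "steps_per_treat": unit_size,
    }

# Twelve steps back, first with a one-step brick, then with a two-step brick.
print(reinforcers_for(12, 1))  # {'units': 12, 'reinforcers': 12, 'steps_per_treat': 1}
print(reinforcers_for(12, 2))  # {'units': 6, 'reinforcers': 6, 'steps_per_treat': 2}
```

Seen per step, the dog gets fewer treats; seen per unit, nothing has changed, which is the point of the brick image.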

During a discussion, in my broken English, I realized that there are two distinct phases for me during training:

Learning: The dog learns to connect a behavior with a reinforcer.
Performance: The dog displays a learned behavior.
There is also a third phase, though:
Duration, sequence, chain: The dog learns to display the same behavior for longer, or to show more behaviors at the same time or in a timeline.
And even a fourth phase:
No reinforcements (training for a competition).

REASONING ON THE TWO DIFFERENT PHASES LEARNING/PERFORMANCE, I STARTED THINKING THAT WHAT IS CHANGING, MORE THAN THE SCHEDULE OF REINFORCEMENT, ARE THE RULES OF LEARNING.

We are moving away from the learning rule: "every correct behavior = reinforcer" to the competition rule: "no information (positive or negative) = correct behavior = no reinforcer".

1. LEARNING

At this stage, the dog does not initially know what successful behavior is. It's up to us to find an effective procedure in order to make the dog understand. The rule is that every right behavior is reinforced in order to create a connection between behavior and reinforcer, and every wrong behavior is not reinforced.

This way the dog gets two types of information:

Behavior, reinforcer = this behavior works
Behavior, no reinforcer = this behavior doesn't work

If we are working on luring exercises, the information is only positive.
At this stage, nobody applies a variable ratio of reinforcement, because the dog needs all the relevant information in order to understand, to commit the behavior to memory, and to repeat it. Dogs with little or no experience, in particular, can crumble emotionally if a "right" behavior is not reinforced while the connection between behavior and reinforcer is still weak and the dog lacks confidence and motivation in the learned behavior.
Therefore, at this stage the ratio of reinforcement is fixed at 1:1. This does not mean reinforcing only one behavior; in shaping, we can reinforce different behaviors. I often work in parallel with two sets of criteria, but the principle is that we reinforce every behavior that is useful in making the ultimate objective clear to the dog. In learning by luring, the information is only positive and there are no errors (the behavior is induced). In learning by reinforcement, there are two types of information: correct behavior = reinforcer; misbehavior = no reinforcer. This way the absence of the reinforcer is connected to a mistake.

2. PERFORMANCE, EXHIBIT A LEARNED BEHAVIOR

The behavior has been learned: the dog has connected behavior and reinforcement, it has started to offer that behavior, and that behavior has been connected to a signal. At this stage the reinforcer is used to confirm to the dog that it has chosen the right behavior for the signal, and to help it discriminate or generalize the behavior. Whenever the dog exhibits the right learned behavior in response to the signal, whether it is discriminating or generalizing, the dog is reinforced. If a certain behavior is connected to a signal, and the dog performs the behavior without our signal, the dog is not reinforced.
At this stage, it is important that the dog learns new rules:

The behavior is successful only if it is performed after our signal.
The behavior is successful even if the context changes (generalization).
The behavior is successful only if it is performed in the same way it was learned.

At this stage, at least initially, the ratio of reinforcement is still fixed at 1:1. This is because we want the dog to understand when the learned behavior works, and when it doesn't work (it doesn’t work if performed without a signal; it does not work to exhibit a different behavior from the one connected to the signal or in a form other than the learned one; it works to exhibit the behavior even if the context changes).

At this stage, you can switch to a variable ratio of reinforcement and work on repetitions of the same signal/behavior pair, or on repetitions in general. At first, this strategy can cause frustration (the dog doesn't get a reinforcer although it has exhibited the correct behavior). The intensity of the behavior can increase, but the strategy can also lead to stress behaviors (vocalizations), to security behaviors (learned behaviors that have a long history of reinforcement, or behaviors that the dog likes), or to variations of the learned behavior (if the behavior no longer works, the dog tries to change it). In practice, at this stage you are more likely to introduce different kinds of reinforcers than a complete lack of reinforcement, unless the behavior is rewarding in itself for the dog. The risk is that the dog connects the absence of reinforcement with a failure. Therefore, at this stage the dog always receives positive information.


3. DURATION, SEQUENCE, CHAIN

In durations, sequences and chains, the rule completely changes. Previously the absence of the reinforcer implied an error; now the dog must understand that the absence of the reinforcer means that the behavior is correct.
Let’s go back to the backward steps exercise. I taught the dog to take three steps back, and whenever it takes them, I reinforce it. I have connected the backward steps to a cue, "back", and I have introduced a rule that the backward steps only work if there are three of them and if they are performed immediately after my signal. Now I want a longer duration, for example from 3 to 10 steps. There are different procedures to increase difficulty, but what I want to discuss here is that when the dog has to repeat a behavior, or to continue with a behavior, it must be sure that the behavior is still correct and successful even in the absence of reinforcement. We can reassure the dog with our voice ("bravo", "back"), but we have to get to the point where, even without information from us, the dog does not lose self-confidence and keeps believing that the behavior still works even though it does not get the reinforcer.
In durations, we may think that our brick of one step back has become a brick of three steps back. At this point we have lined up several bricks of three steps back, and through the mechanism of anticipation, the dog, after the signal, has started to propose five steps backward instead of three. This is a fairly simple process to increase the duration.
Thus the new rule is:

After a pair signal/behavior, if I give to the dog another signal, the behavior previously exhibited is correct, and to get the reinforcer the dog must exhibit both required behaviors.
The behavior is correct, it works, but it's not enough to get the reinforcer, the dog has to work harder, to be more committed, in order to get what it wants and likes.
At first, dogs feel lost, but then they adapt quite quickly to this new rule. This is also because dogs with more learning experience in a positive setting tend to have an increased motivation to display learned behaviors: they need less reinforcement as motivation to exhibit them.

At this stage, there is a second rule:

After a pair signal/behavior, if I'm not giving the dog another signal, and I interrupt the duration (sequence, chain), the previously exhibited behavior is incorrect, and we have to start all over again.

This is a fundamental rule, marking the move from a basic level of learning to a higher level. It is no longer the absence of reinforcement that indicates the error, but the interruption of the sequence of events. In a fixed sequence or chain (such as Obedience exercises), the dog learns that starting an exercise all over again implies an error, and it learns to concentrate and work harder to get to the end of the exercise.
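The two rules of this stage can be sketched as a tiny decision procedure. This is only my own illustration: the function `run_chain` and the event labels are hypothetical, and it assumes the reinforcer comes once, after the last behavior of the chain.

```python
def run_chain(outcomes):
    """Walk through a chain, one True/False per signalled behavior.

    First rule: a correct behavior is answered with the next signal,
    and only the last correct behavior is followed by the reinforcer.
    Second rule: an incorrect behavior interrupts the chain and the
    whole exercise starts over.
    """
    events = []
    for index, correct in enumerate(outcomes):
        if not correct:
            events.append("interrupt")
            return events, "restart"
        is_last = index == len(outcomes) - 1
        events.append("reinforcer" if is_last else "next signal")
    return events, "done"

print(run_chain([True, True, True]))
# (['next signal', 'next signal', 'reinforcer'], 'done')
print(run_chain([True, False, True]))
# (['next signal', 'interrupt'], 'restart')
```

Note that in the first run the dog gets no treat for the first two behaviors, yet nothing tells it that it was wrong: the next signal is the information.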

4. NO REINFORCEMENT (COMPETITION PREPARATION)

Is there anything beyond durations, sequences and chains? I have experienced it with Puma, during our preparation for class three of Obedience. Puma suffers from the absence of the reinforcer at the end of the exercise. It's a problem that I have never had with other dogs in the past. I have always been able to use the reinforcer in training without the dog becoming so stressed in competition that it turned deaf to every signal. Puma can't manage it. If the rule is that, in order to get the ball, you have to successfully perform all the behaviors of an exercise, then that is her expectation. I don't think she reads the absence of the reinforcer as her mistake, but rather as a betrayal of her expectations. Puma doesn't lose confidence in learned behaviors, but in me and in the context.

Therefore, I had to take one last step:

behavior - reinforcer
duration, sequence, chain - reinforcer
exercise - reinforcer
sequence of exercises - reinforcer

The initial brick, one step back, has become a group of exercises.
The new rule is:
If the exercise is performed correctly, move on to the next exercise, and at the end of the sequence of exercises, there will be the reinforcer.

Piling bricks, we have built a house!


What has surprised me is how much Puma has adapted to the new rule. At first, it was hard; the stress made her lose concentration on each exercise. When she realized the ball was not gone, it was just delayed, she began to concentrate on the exercises, instead of wasting energy and feeling stressed looking for the ball.
An unexpected side effect was that the motivation for the ball (the reinforcer) increased to the point that I am no longer able to work with the ball in my pocket while training exercises that Puma already knows. If I have the ball in my pocket (usually I leave it at the entrance in a backpack), Puma gets so excited that she cannot concentrate. By contrast, if I don’t have the ball, and I don’t give her the ball, Puma is motivated and focused.
At this level, it is rare to reinforce behaviors or fragments of sequences and chains.
What we reinforce is the whole exercise as it is performed in the competition.

In the recall with two stops, for example, I don’t reinforce the quick start, or the stop and the down, but the exercise as it is in the competition: recall, stop in the stand, go on, stop in the down, and heel position. The reinforcer can be the ball or moving on to the next exercise. One of the interesting aspects of preparing for a competition is that during the competition a correct behavior is no longer connected with any kind of reinforcement, and an error is not reported.
This implies two rules:

Execution is connected with lack of information (the dog receives information only at the end of the exercise, and during the competition it is better to protect the emotional state, maintaining a positive attitude even if the dog makes a mistake),
and execution is connected to the absence of a primary reinforcer. The dog must be autonomous, no longer dependent on us for its own decisions (I mean consequences, of course, and not signals), and not even dependent on a primary reinforcer.

With Puma, the transition from the first to the fourth stage took three years of work. There are more precocious dogs, and more precocious dog trainers, but never try to force the pace. It is one thing to play with rules; it is another to force the dog to do something that would only gratify us.

Learning should be a game in which, no matter how demanding the rules are, the dog always wins.