I Introduction
The ultimate goal of machine learning has always been to produce algorithms that perform at least as well as a human being, and robotics in particular aims at building machines that can mimic human or animal behaviors. With this objective in mind, the imitation learning (IL) approach has received increasing attention, due to its ability to infer the hidden intention (policy) of an expert, which can be a human operator, through the observation of his/her demonstrations. In the literature, two types of IL are predominant: behavioral cloning (BC) [4, 19], which reproduces the sequences of the expert's actions based on the environment state, and inverse reinforcement learning, which maximizes a reward function inferred from the expert's demonstrations [14, 9]. These algorithms have been shown to yield near-optimal policies when trained on high-quality demonstrations performed by experts, highlighting their potential for the production of advanced task-oriented robots that can naturally learn from demonstrations [7, 20].
Unfortunately, all these studies, in both their theoretical and applied aspects, have assumed the presence of experts who always perform optimally, and of sophisticated operating interfaces that can adequately reflect the intentions of the experts, even when they do not make any mistakes. In practice, however, the demonstrators may lack expertise, either at the task itself or because of a non-intuitive operating interface, which means that they may need to be trained to become familiar with the setting before any demonstration can be recorded. This wastes both time and data, and constitutes an impractical constraint for crowdsourced data collection [13]. Furthermore, even after being trained, a human operator may be subject to distractions due to limited attention, tiredness or boredom, making the assumption of optimal and mistake-free demonstrations uncertain. For all these reasons, real-world demonstrations are highly likely to contain unintentional noise and outliers, which makes it difficult for IL agents to extract an optimal policy. Therefore, demonstrations that contain wrong actions are generally excluded from the dataset used to train the agent, even when some parts of the demonstration may be informative. Here, we define such a partially optimal demonstration as an amateur demonstration.
To tackle this issue and allow imitation from amateur demonstrations, several methods have been proposed. For inverse reinforcement learning, we can cite the works [5] and [21], where additional labels provided by the experts are employed to discriminate amateur demonstrations, the work [18], which assumes the amateur actions and states to be Gaussian-distributed noise, and the recent work [17], where a pseudo-labeling technique is used to estimate the data density of the non-expert demonstrations and a classification risk optimization is then performed on the whole demonstration dataset, using a symmetric loss function.
In this study, we focus on neural-network-based BC and, viewing amateur demonstrations as containing noise and outliers, employ the robust t-momentum [10] optimization algorithm to train the imitator. With the t-momentum strategy, the adverse effect of noise and outliers can be implicitly removed according to its robustness hyperparameter during the stochastic gradient descent (SGD) updates. However, in the original version of the t-momentum, the robustness hyperparameter must be specified before training and therefore cannot adapt automatically to the unknown actual ratio of noise and outliers inside amateur data. To address this issue, we extend the t-momentum with a method to automatically adjust the algorithm's robustness in order to deal with the uncertainty on the ratio of wrong demonstration data in real-world robotics applications.
II Preliminaries
II-A Behavioral cloning
Behavioral cloning (BC) [4] is an imitation learning technique which uses a supervised learning approach to capture and reproduce the behavior of a demonstrator, usually referred to as the expert. As the expert performs the task, his/her actions are recorded along with the state that gave rise to the action. The sequence of these state-action records, called a behavior trace or trajectory, is then used as the supervised input signal for the imitator, whose goal is to uncover a set of rules that reproduce the observed behavior. BC is powerful in the sense that the imitator is capable of immediately imitating the demonstrator without having to interact with the environment, making it particularly attractive for robotics applications and for safe and direct transfers of humans' subcognitive skills or behaviors to machines.
Formally, BC is concerned with the problem of finding a good imitation policy from a set of state-action demonstration trajectories $\mathcal{D} = \{\tau_i\}$, where each $\tau_i$ is a trajectory $\tau_i = \{(s_t, a_t)\}_t$. This set of state-action pairs is used to seek the parameters $\theta$ of an imitation policy that best fits the set. This decision problem is usually solved by employing the maximum-likelihood estimation method. Indeed, assuming that the pairs $(s, a)$ in $\mathcal{D}$ are independently and identically distributed (i.i.d.) and with $\pi_\theta(a \mid s)$ defined as the imitator's policy parameterized by $\theta$, BC solves for an optimal solution $\theta^*$ such that:
$$\theta^* = \arg\max_\theta \prod_{(s,a) \in \mathcal{D}} \pi_\theta(a \mid s) \tag{1}$$
$$\theta^* = \arg\max_\theta \sum_{(s,a) \in \mathcal{D}} \log \pi_\theta(a \mid s) \tag{2}$$
With this objective, the imitator's policy $\pi_\theta$ eventually converges to the unknown policy that produced the dataset $\mathcal{D}$.
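As an illustration of the maximum-likelihood objective of Eqs. (1) and (2), the following minimal sketch evaluates the average negative log-likelihood of a set of state-action pairs under a diagonal Gaussian policy. The toy demonstrator (a = 2s), the two candidate policies and all function names are hypothetical and are not taken from the experiments of this paper.

```python
import math

def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian policy pi_theta(a | s)."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )

def bc_negative_log_likelihood(policy, demonstrations):
    """Average negative log-likelihood of the demonstrated state-action pairs."""
    total = 0.0
    for state, action in demonstrations:
        mean, std = policy(state)
        total -= gaussian_log_prob(action, mean, std)
    return total / len(demonstrations)

# Hypothetical one-dimensional demonstrations: the demonstrator applies a = 2 * s.
demos = [([s], [2.0 * s]) for s in (-1.0, 0.0, 0.5, 1.0)]

def good_policy(state):
    return [2.0 * state[0]], [0.1]   # matches the demonstrator

def poor_policy(state):
    return [0.5 * state[0]], [0.1]   # wrong gain

good_nll = bc_negative_log_likelihood(good_policy, demos)
poor_nll = bc_negative_log_likelihood(poor_policy, demos)
print(good_nll, poor_nll)
```

Minimizing this quantity over the policy parameters is equivalent to the maximization in Eq. (2); the policy that matches the demonstrator obtains the lower value.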
II-B Robust optimization with the t-momentum
II-B1 Student's t-based momentum
Under the deep learning framework, complicated functions such as the policies $\pi_\theta$ can be approximated using neural networks, where the parameters $\theta$ are given by the weights and biases of the networks. With neural networks, the optimization problem depicted in Eq. (2) is solved by first-order gradient-based optimization methods. Most of the recent and popular first-order gradient-based methods build upon the momentum strategy [11], where an average of the past gradients is employed in the stochastic gradient descent updates.
At the heart of the momentum methods' success lies the exponential moving average (EMA), which allows recent gradients to have a greater impact on the average due to higher weights, while slowly forgetting observations that are far in the past and that possess exponentially smaller weights. Let $\mathcal{L}(\theta)$ be the objective function evaluated on a random sample from the training dataset, e.g. a subsample set of state-action pairs of size $B$ in BC, with $\theta_t$ the parameters at time $t$, corresponding to the weights and biases. With $g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$ the stochastic gradient of $\mathcal{L}$ with respect to the parameters $\theta$, the regular EMA-based first-order momentum is defined as:
$$m_t = \beta\, m_{t-1} + (1 - \beta)\, g_t \tag{3}$$
where $\beta$, the exponential decay coefficient, is a fixed value that controls how fast past gradients $g_{t-1}, g_{t-2}, \dots$ are forgotten.
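The EMA recursion of Eq. (3) can be sketched in a few lines. The example below uses scalar gradients and a hypothetical stream containing one aberrant value, to show that a single outlier shifts the momentum by its full (1 - beta) share; the function name and the data are illustrative only.

```python
def ema_momentum(gradients, beta=0.9):
    """Return the sequence of EMA momentum values for scalar gradients."""
    m = 0.0
    history = []
    for g in gradients:
        # Every new gradient receives the same fixed weight (1 - beta).
        m = beta * m + (1.0 - beta) * g
        history.append(m)
    return history

clean = ema_momentum([1.0] * 10)
noisy = ema_momentum([1.0] * 9 + [100.0])   # last gradient is an outlier
print(clean[-1], noisy[-1])
```

With beta = 0.9, replacing the last gradient 1 by 100 moves the final momentum by 0.1 x 99 = 9.9, which is the lack of robustness that motivates the t-momentum.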
However, EMA-based momentum methods lack robustness to aberrant values due to the fact that every new observation is given the same weight $(1 - \beta)$. This led to the proposition of the t-EMA, a new EMA algorithm derived from the Student's t-distribution likelihood estimator, and its corresponding momentum, the t-momentum [10]. The particularity of the t-momentum lies in the fact that the decay coefficient is no longer fixed, but adaptive, and depends on the squared Mahalanobis distance $D_t$:
$$m_t = \frac{W_{t-1}}{W_{t-1} + w_t}\, m_{t-1} + \frac{w_t}{W_{t-1} + w_t}\, g_t \tag{4}$$
where
$$w_t = \frac{\nu + d}{\nu + D_t} \tag{5}$$
$$W_t = \frac{2\beta - 1}{\beta}\, W_{t-1} + w_t \tag{6}$$
$$D_t = \sum_j \frac{(g_t^j - m_{t-1}^j)^2}{v_{t-1}^j} \tag{7}$$
where $\nu$ is the Student's t-distribution degrees of freedom parameter which controls the robustness, $d$ is the dimension of the gradient, $j$ in the superscript refers to the $j$-th component of the vector, and $v_{t-1}$ is an exponential moving variance estimate at step $t-1$, which is computed by default in recent methods. When integrated into momentum-based optimization methods such as Adam (adaptive moment estimation) [11], the t-momentum has been shown to improve the robustness of the underlying optimizer and therefore increase the performance of the learning process against heavy-tailed datasets.
II-B2 t-EMA with modified weight decay
The decay strategy of the accumulated weights in Eq. (6) implies that at the time step , the past value is not decayed with respect to the new value and that both have the same importance in the value of .
In order to ensure that the past value is decayed and has less importance than the new value , Eq. (6) has been modified in [12] to yield instead:
$$W_t = \beta\,(W_{t-1} + w_t) \tag{8}$$
which remains consistent with the maximum likelihood derivation of the t-momentum algorithm as described in [10] and where the change of the decay factor's value, from $(2\beta - 1)/\beta$ in Eq. (6) to $\beta$, is set by the requirement that the t-EMA reverts to the EMA in the limit $\nu \to \infty$. With this modification, the value of at the time step is given by:
(9) 
where the value of is effectively reduced with respect to .
In this study, this modified version of the t-EMA is the one we employ for the t-momentum.
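To make the update concrete, the following sketch performs one t-EMA step with the modified weight decay. It rests on several assumptions that are not confirmed by the text above: the modified decay is read as W_t = beta (W_{t-1} + w_t), the accumulated weight is initialized at beta / (1 - beta), a small constant eps stabilizes the division, and the variance estimate v is simply passed in. The function name and the toy values are hypothetical.

```python
def t_ema_step(m, v, W, g, beta=0.9, nu=None, eps=1e-8):
    """One t-EMA update of the mean estimate m with a new gradient g.

    m, v and g are lists of floats (mean, variance estimate, new gradient);
    W is the accumulated weight. eps is an assumed stabilizing constant.
    """
    d = len(g)
    nu = float(d) if nu is None else nu  # assumed default: nu equal to the dimension
    # Squared Mahalanobis distance of g from the current mean.
    D = sum((gj - mj) ** 2 / (vj + eps) for gj, mj, vj in zip(g, m, v))
    # Student's t weight: close to 1 for typical gradients, small for outliers.
    w = (nu + d) / (nu + D)
    m_new = [(W * mj + w * gj) / (W + w) for mj, gj in zip(m, g)]
    # Modified decay (assumed form): the past and the new weights are both decayed.
    W_new = beta * (W + w)
    return m_new, W_new, w

beta = 0.9
W0 = beta / (1.0 - beta)  # assumed initial value, the fixed point for w = 1
outlier = [100.0]
robust, _, w_small = t_ema_step([0.0], [1.0], W0, outlier, beta=beta, nu=1.0)
gaussian, _, w_one = t_ema_step([0.0], [1.0], W0, outlier, beta=beta, nu=1e12)
print(robust[0], w_small)
print(gaussian[0], w_one)
```

With a small degrees of freedom the outlier barely moves the mean, whereas with a very large one the step coincides with the plain EMA update of Eq. (3), which moves the mean to about 10.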
III Robust Behavioral Cloning with Adaptive t-Momentum Optimization
III-A The imperfect demonstrations issue in behavioral cloning
Because BC relies solely on the provided demonstrations in order to find the imitation policy through a supervised learning approach, it requires all trajectories in the dataset to be optimal (i.e. perfect demonstrations) or near-optimal. As a result, human operators who are given a control interface to perform demonstrations must first be trained to become highly efficient at using the interface before they can start demonstrating for the imitator; and even after training, distractions, mistakes and a limited attention span make it nearly impossible for a human to always follow an optimal policy. This leads to trajectories where some state-action pairs are not optimal, causing the imitator to be biased away from the optimal policy.
We again refer to these imperfect demonstrations as being amateur demonstrations, so that the dataset is generated as a mixture of the expert policy and the amateur policy:
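$$\rho(s, a) = (1 - \alpha)\, \rho_E(s, a) + \alpha\, \rho_A(s, a) \tag{10}$$
where $\rho_E$ and $\rho_A$ are respectively the state-action densities of the expert policy $\pi_E$ and of the amateur policy $\pi_A$, and $\alpha$ represents the proportion of amateur state-action pairs in the dataset, assumed to be in the range .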
In the original setting of behavioral cloning, all of the amateur demonstrations are simply discarded so that the policy that produced the dataset is only from the expert, i.e. ; however, this results in a loss of valuable data, since not all of the amateur pairs are necessarily wrong. Because BC typically requires a lot of data in order to produce an optimal policy [16], a strategy that takes advantage of the good parts of the amateur demonstrations (state-action pairs that are similar to the expert's), while ignoring wrong or misleading actions in the imperfect demonstrations, is desirable. In this study, we propose to treat the amateur's imperfect demonstrations as outliers and we show empirically how the t-momentum, a robust optimization algorithm, extended to allow adaptive robustness, can produce robust imitators in the face of the resulting heavy-tailed dataset.
III-B Adaptive t-momentum for automatic robustness
The robustness of the Student's t-distribution, and therefore of the t-momentum derived from it, is controlled by the degrees of freedom parameter $\nu$. Indeed, as can be seen in Eq. (5), if $\nu \to \infty$, then $w_t \to 1$ for every time step $t$ and every value $g_t$ is given the same weight independently of the value of $D_t$, leading back to the non-robust EMA derived from the Gaussian distribution. In contrast, if $\nu \to 0$, then each value $g_t$ is weighted by $w_t \to d / D_t$, leading to a strong sensitivity to the squared Mahalanobis distance and therefore to a very strong filtering effect. In the original formulation of the t-momentum [10], the degrees of freedom is treated as a hyperparameter whose value must be set before starting the optimization process, meaning that the robustness of the t-momentum is fixed throughout the learning operations.
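The two limits discussed above can be checked numerically. The sketch below assumes the weight of Eq. (5) has the form w = (nu + d) / (nu + D), where d is the gradient dimension and D the squared Mahalanobis distance; the function name and the sample values are hypothetical.

```python
def t_weight(nu, d, D):
    """Student's t weight for degrees of freedom nu, dimension d and distance D."""
    return (nu + d) / (nu + D)

d = 10
for D in (1.0, 10.0, 1000.0):
    gaussian_limit = t_weight(1e12, d, D)   # nu very large: weight close to 1
    robust_limit = t_weight(1e-12, d, D)    # nu close to 0: weight close to d / D
    print(D, gaussian_limit, robust_limit, d / D)
```

For a typical gradient (D close to d) both limits give a weight near 1, while for an outlying gradient (D much larger than d) only the small-nu regime shrinks the weight.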
In practice, the proportion of amateur pairs introduced previously is unknown, because it is difficult to keep track of all imperfect pairs. Although one may analyze the dataset to infer its tail heaviness before training the imitator, this section introduces a method for automatically adjusting the robustness of the t-momentum, based on the amount of outlying gradients encountered during training.
This mechanism exploits the batch approximation algorithm developed in [1], in particular the incremental version of the algorithm, which is an efficient set of formulas capable of iteratively estimating the degrees of freedom for a given set of data points. Thanks to its incremental nature, the data do not need to be saved in memory and are instead treated sequentially as they are observed. This feature is of prime importance in the case of optimization methods, where the gradients are observed one at a time and can be arbitrarily large, making it difficult to store every one of them in memory. In the following, we refer to this algorithm as Aeschliman's algorithm.
III-B1 Direct incremental degrees of freedom estimation algorithm
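In order to compute an estimate for the degrees of freedom $\nu$, Aeschliman's direct incremental algorithm proceeds as follows: at each step $t$,
1. Compute a robust estimate $\mu_t$ for the mean, such as the median.
2. Compute the logarithm of the squared Euclidean norm of the difference between the most recently observed data point $x_t$ and the robust mean: $z_t = \log \lVert x_t - \mu_t \rVert^2$.
3. Update the arithmetic variance and mean of the variable $z_t$:
(11) (12)
4. Compute a new estimate for the degrees of freedom:
$$\nu_t = \frac{1 + \sqrt{1 + 4 b_t}}{b_t}, \qquad b_t = \sigma_{z,t}^2 - \psi_1\!\left(\frac{d}{2}\right) \tag{13}$$
where $\sigma_{z,t}^2$ is the variance of $z_t$ obtained in step 3, $d$ is the dimension of the data points and $\psi_1$ is the trigamma function.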
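The steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the reference implementation of [1]: the running mean and variance of z use Welford's incremental formulas (the exact form of Eqs. (11) and (12) is not reproduced here), the closed-form estimate is taken as (1 + sqrt(1 + 4b)) / b with b the variance of z minus the trigamma function at d/2, an infinite estimate is returned when b is not positive, and the robust mean is supplied by the caller. All names, including the sampling helper, are hypothetical.

```python
import math
import random

def trigamma(x):
    """Trigamma function psi_1(x) for x > 0 (recurrence plus asymptotic series)."""
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return total + inv + 0.5 * inv2 + inv * inv2 * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))

class DofEstimator:
    """Incremental estimate of the Student's t degrees of freedom."""

    def __init__(self, dim):
        self.dim = dim
        self.count = 0
        self.mean_z = 0.0
        self.m2 = 0.0  # running sum of squared deviations of z

    def update(self, x, robust_mean):
        sq_norm = sum((xi - mi) ** 2 for xi, mi in zip(x, robust_mean))
        z = math.log(max(sq_norm, 1e-300))
        # Welford update of the mean and variance of z (assumed incremental form).
        self.count += 1
        delta = z - self.mean_z
        self.mean_z += delta / self.count
        self.m2 += delta * (z - self.mean_z)
        var_z = self.m2 / self.count
        b = var_z - trigamma(self.dim / 2.0)
        if b <= 0.0:
            return math.inf  # no excess variance: the data look Gaussian
        return (1.0 + math.sqrt(1.0 + 4.0 * b)) / b

def sample_multivariate_t(dim, nu, rng):
    """Zero-mean, identity-scale multivariate t sample (illustration helper)."""
    u = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
    scale = math.sqrt(nu / u)
    return [rng.gauss(0.0, 1.0) * scale for _ in range(dim)]

rng = random.Random(0)
estimator = DofEstimator(dim=5)
nu_hat = math.inf
for _ in range(20000):
    nu_hat = estimator.update(sample_multivariate_t(5, 3.0, rng), [0.0] * 5)
print(nu_hat)
```

Because the closed form relies on an approximation of the trigamma function, the estimate is expected to land near, but not exactly on, the true value used to generate the samples.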
III-B2 t-momentum with adaptive degrees of freedom
In order to integrate this algorithm into the t-momentum, a few changes are made to Aeschliman's algorithm, mainly in order to reduce the computational cost as much as possible. Namely,
1. The t-momentum is directly used as the estimate of the robust mean, instead of computing the median as Aeschliman et al. did in their paper. Since the t-momentum is considered to be a robust mean estimate, this modification remains consistent with the original algorithm and it avoids the burden of estimating the gradient median, removing the need for a new variable.
2. The squared norm in the computation of the variable $z_t$ is replaced by the squared Mahalanobis distance from Eq. (7), i.e. $z_t = \log D_t$. This modification remains consistent with the original algorithm and can be understood as replacing the variable by a standardized alternative that has zero mean and unit variance.
3. The arithmetic estimates of the variance and mean of the variable $z_t$ are replaced by exponential moving averages, i.e. Eqs. (11) and (12) become:
(14) (15)
With . This particular modification is necessary in order to take into account the fact that machine learning tasks may be non-stationary, which requires the estimated mean and variance of $z_t$ to adapt to the changing data distribution.
The new algorithm is named adaptive Student's t-distribution based momentum, or At-momentum for short, and its pseudocode is given in Algorithm 1.
Note that, for the practical implementation, the modified Aeschliman's algorithm is employed to estimate the degrees of freedom scale factor $k_\nu$, and the degrees of freedom is obtained using the equation $\nu = k_\nu d$ as suggested in the original t-momentum paper [10]. This is necessary in order to keep the updates from being overly robust, since Aeschliman's algorithm tends to produce small values for the degrees of freedom, which can be negligible when compared to the dimension of the neural network gradients.
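Since Algorithm 1 is not reproduced here, the following sketch shows one possible reading of the three modifications combined with the scale-factor rule. Every detail beyond the prose is an assumption: the exponential moving mean and variance of z use a standard incremental form with decay beta_z, the accumulated weight follows the modified decay beta (W + w), the scale factor is capped by a hypothetical k_nu_max when no excess variance is detected, and a small eps is added for numerical stability. The class and its parameters are illustrative names, not the authors' implementation.

```python
import math

def trigamma(x):
    """Trigamma function psi_1(x) for x > 0 (recurrence plus asymptotic series)."""
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return total + inv + 0.5 * inv2 + inv * inv2 * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))

class AdaptiveTMomentum:
    """Sketch of a t-momentum whose degrees of freedom follow the gradient statistics."""

    def __init__(self, dim, beta=0.9, beta_z=0.999, k_nu_max=1e6, eps=1e-8):
        self.dim, self.beta, self.beta_z = dim, beta, beta_z
        self.k_nu_max, self.eps = k_nu_max, eps
        self.m = [0.0] * dim
        self.W = beta / (1.0 - beta)   # assumed initial accumulated weight
        self.mean_z, self.var_z = 0.0, 0.0
        self.k_nu = k_nu_max           # start from the non-robust limit

    def step(self, g, v):
        d = self.dim
        D = sum((gj - mj) ** 2 / (vj + self.eps) for gj, mj, vj in zip(g, self.m, v))
        # Modification 2: z is the log of the squared Mahalanobis distance.
        z = math.log(D + self.eps)
        # Modification 3: exponential moving mean and variance of z (assumed form).
        alpha = 1.0 - self.beta_z
        delta = z - self.mean_z
        self.mean_z += alpha * delta
        self.var_z = self.beta_z * (self.var_z + alpha * delta * delta)
        # Scale factor of the degrees of freedom from the excess variance of z.
        b = self.var_z - trigamma(d / 2.0)
        if b > 0.0:
            self.k_nu = min(self.k_nu_max, (1.0 + math.sqrt(1.0 + 4.0 * b)) / b)
        else:
            self.k_nu = self.k_nu_max
        nu = self.k_nu * d
        w = (nu + d) / (nu + D)
        # Modification 1: the t-momentum itself serves as the robust mean estimate.
        self.m = [(self.W * mj + w * gj) / (self.W + w) for mj, gj in zip(self.m, g)]
        self.W = self.beta * (self.W + w)
        return self.m

momentum = AdaptiveTMomentum(dim=2)
for _ in range(200):
    m = momentum.step([1.0, 1.0], v=[1.0, 1.0])
print(m, momentum.k_nu)
```

In an optimizer such as Adam, the variance estimate v would come from the second-moment accumulator instead of being fixed as in this toy loop.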
IV Experiments
IV-A Algorithm setup
IV-A1 Optimization algorithm's choice
In the following, we employ the t-Adam [10] optimizer, which is the Adam [11] optimizer augmented with the t-momentum. The adaptive t-momentum version is called At-Adam and, in order to investigate the effect of the decay parameter used for the mean and variance of $z_t$ in Eqs. (15) and (14), two values are defined:
(i) one that takes the same value as the decay factor of the considered momentum (here the first-order momentum of Adam), i.e. , and
(ii) a larger value, which is set to be equal to the decay factor of the Adam second moment, i.e. .
The results of training with Adam, without the t-momentum's robustness, are also included for reference.
IV-A2 Policy model description
For all experiments, the imitator agent's policy model is implemented as a PyTorch [15] neural network with five hidden linear layers of 100 neurons each, fitted with layer normalization [3] and the ReLU activation function. The outputs are the mean of the actions and the diagonal elements of the covariance matrix of a multivariate Gaussian distribution. Different random seeds are used for each model, but all optimizers share the same set of seeds, e.g. for trained models, the set is .
IV-A3 Performance measure
For all experiments, we run each of the trained models on the real robot a certain number of times (most often 5 times), and count the number of times the imitator was capable of solving the given task. This performance measure is then represented by the success rate:
$$\text{success rate} = \frac{\text{number of successful runs}}{\text{total number of runs}} \tag{16}$$
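A minimal sketch of this performance measure is given below, assuming the success rate is the fraction of successful runs. The confidence interval uses the normal (Wald) approximation at an assumed 95% level, which is only one possible choice and is not stated in the text; the function names and the run outcomes are hypothetical.

```python
import math

def success_rate(outcomes):
    """Fraction of runs in which the imitator solved the task."""
    return sum(1 for solved in outcomes if solved) / len(outcomes)

def wald_interval(rate, n_runs, z=1.96):
    """Normal-approximation confidence interval (assumed 95% level for z = 1.96)."""
    half_width = z * math.sqrt(rate * (1.0 - rate) / n_runs)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)

# Hypothetical outcomes of five runs of one trained model.
runs = [True, True, False, True, False]
rate = success_rate(runs)
low, high = wald_interval(rate, len(runs))
print(rate, low, high)
```

With few runs per model the interval is wide, which is why the success rate is aggregated over all trained models and runs.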
IV-B Robots and interface setup
IV-B1 Leap Motion hand tracking device
Leap Motion (see Fig. 1(a)) is a hand tracking device that captures the movement of the hands and fingers by using optical sensors and infrared light. The field of view (FOV) of the sensors is about 150 degrees and the detection range goes roughly from 25 to 600 millimeters above the device. Each object (arm, hand or finger) detected in the FOV of the device is represented by a program class that encodes various information such as the position, velocity, direction and other characteristics of the object.
IV-B2 Qbchain robot and control interface
The qbmove [6] is a one degree of freedom (1-DoF) modular actuator with a cubic shape, approximately 66 millimeters wide. Its stiffness can also be controlled at the hardware level, but is fixed in the following experiments for simplicity. As can be seen in Fig. 1(b), the robotic arm employed in this section's experiments is made of 4 cubes assembled such that the first joint axis is vertical, while the three others are horizontal, allowing for an up-and-down and circular motion of the end effector, which consists of a gripper.
The interface developed between the Leap Motion device and the qbmove robotic arm, which allows a human operator to control the robot, uses the palm position and grab strength of the Leap Motion's first detected hand. The palm position is used as the position of the robot's end effector and an inverse kinematics (IK) algorithm is employed to compute the angular positions of the first three joints. In the experiment, we employ ikpy, a Python inverse kinematics library that can import the kinematic chain of the robot from a URDF file and quickly approximate the IK solution with an iterative optimizer. The obtained joint position values are then sent to the qbchain to move the tip of the fixed part of the gripper. The grab strength is then mapped to the last joint in order to open and close the gripper.
The schematic of the interface is depicted in Fig. 2.
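The mapping performed by the interface can be sketched as follows. The IK solver is passed in as a generic callable standing in for the ikpy-based solver, the grab strength is assumed to be reported in the range [0, 1], and the gripper limits are hypothetical values; none of these names come from the actual interface code.

```python
def interface_step(palm_position, grab_strength, solve_ik,
                   gripper_open=0.0, gripper_closed=1.0):
    """Map one hand-tracking frame to four joint targets.

    solve_ik is any callable returning the first three joint angles for a
    desired end-effector position (a stand-in for the ikpy-based solver).
    The gripper limits are hypothetical values in radians.
    """
    arm_joints = list(solve_ik(palm_position))
    if len(arm_joints) != 3:
        raise ValueError("the IK solver must return three joint angles")
    strength = min(1.0, max(0.0, grab_strength))  # clamp to [0, 1]
    gripper = gripper_open + strength * (gripper_closed - gripper_open)
    return arm_joints + [gripper]

def dummy_ik(position):
    """Placeholder solver used only to make the example runnable."""
    return [0.1 * position[0], 0.1 * position[1], 0.1 * position[2]]

targets = interface_step((1.0, 2.0, 3.0), 0.5, dummy_ik)
print(targets)
```

Keeping the solver as a parameter separates the hand-tracking logic from the kinematic model, so the same mapping can be reused with a different robot description.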
IV-B3 D'Claw robot and control interface
D'Claw is a platform introduced by the project ROBEL (RObotics BEnchmarks for Learning) [2] for studying and benchmarking dexterous manipulation. It is a nine degrees of freedom (9-DoF) platform that consists of three identical fingers mounted symmetrically on a base, as shown in Fig. 1(c).
Its control interface also uses the Leap Motion device. In particular, the positions of three of the operator's fingers (the index, the ring and the thumb) are used to control the three fingers of the robot, again through the ikpy library.
IV-C Qbchain robot experiment
IV-C1 Conditions of the experimentation
A simple pick-and-drop task is defined, where the goal is to pick an object, here a soft cube, and drop it inside a box, with an observation consisting of a direct state measure containing information about the angle, the angular velocity and the torque (effort) of each of the four joints (hence, the state space dimension equals $12$). The action space dimension, on the other hand, is set to be equal to and corresponds to the desired next angle of the joints (i.e. position controller).
During training, Gaussian white noise is added to the states by using a scale factor , i.e. , in order to augment the dataset and improve the generalization ability of the models. A small batch size of is used to reduce the computational cost, and to help the gradient updates escape from local optima.
IV-C2 Dataset description
trajectories are collected and then divided into expert trajectories that are almost perfect, and amateur trajectories that contain hesitant or poor demonstrations. The expert trajectories are then further split into two datasets: one, containing trajectories, for training and another one, comprising the remaining trajectories, for validation.
IV-C3 Results
The test results on the robot, for trained policies, are given by the success rate over all models and summarized in Fig. 3, where the error bars correspond to the confidence interval. This success rate is computed by running each trained model times (i.e. total number of runs = ) and applying Eq. (16), counting the number of times the model is able to solve the task (i.e. pick the object and drop it in the box). Each episode is run with a fixed budget of steps and a model is said to have failed if it is not able to complete the task within this number of steps.
The success rates in Fig. 3 show that, using a robust optimization method such as the t-momentum-based Adam algorithm, it is possible to efficiently train a behavioral cloning agent with datasets that contain not only expert demonstrations, but also amateur performances.
Fig. 4, where the success rate of trained models is summarized with total number of runs per model, displays the contribution of the amateur demonstrations. Indeed, we can see that, when considering a small number of expert demonstrations (i.e. trajectories), the addition of the demonstrations containing imperfect pairs increases the success rate of the models trained with the robust t-momentum optimizer. This result highlights the fact that amateur demonstrations are useful and can be used to augment the size of the training dataset, instead of being discarded as is usually done in BC.
However, for Fig. 5, we removed the amateur demonstrations, set the noise scale factor to , and computed the success rates by running again trained models times each (i.e. total number of runs = ). With this modification, we can see that in the absence of imperfect demonstrations and without the Gaussian noise for state augmentation, the Adam optimizer performs better than t-Adam, due to the fixed high robustness of the latter.
This result allows us to display the importance of the adaptive robustness feature of At-Adam. Indeed, in the same Fig. 5, we see how the adaptive t-momentum optimizer improves the success rates of the imitators and performs even better than Adam. Hence, the adaptive robustness allows it to extract more information from the expert dataset than non-robust methods can. At-Adam, thanks to its automatic robustness adjustment, is able to find a compromise between the too-robust t-Adam with its and the non-robust Adam with its , outperforming both methods. Fig. 6 shows the median of the adapting degrees of freedom's factor during the learning. We can see that At-Adam has a median robustness parameter higher than .
IV-D D'Claw robot experiment
To further confirm the ability and limitations of robust BC with the adaptive t-momentum algorithm to adapt to different ratios of imperfect demonstrations, we conducted the following experiments using the D'Claw robot.
IV-D1 Conditions of the experimentation
In the experiments, the task consists of rotating a passive DoF (the object located in the middle of the base in Fig. 1(c)) to a fixed target angle. Specifically, the task consists of turning the object from the angle to the target angle , with success being achieved if the object's position falls within the range . The state space is given by the angular position and velocity of the fingers' nine joints, the target position and the current angular position of the object along with their cosine and sine values, the object's velocity, and finally a success flag and the error between the current position and the target position, for a total dimension of . The actions' dimension is set to , corresponding to the position of the fingers' joints. The batch size is again set to , but this time no noise is added to the states during training.
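The success criterion described above can be sketched as a wrapped angular error compared against a tolerance. The tolerance and the angles used below are hypothetical placeholders, since the actual values are not restated here, and the function names are illustrative.

```python
import math

def angle_error(current, target):
    """Signed angular difference wrapped to the interval (-pi, pi]."""
    diff = current - target
    return math.atan2(math.sin(diff), math.cos(diff))

def is_success(current, target, tolerance):
    """True if the object's angle lies within the tolerance of the target."""
    return abs(angle_error(current, target)) <= tolerance

def angle_features(angle):
    """Cosine and sine encoding of an angle, as used in the state vector."""
    return [math.cos(angle), math.sin(angle)]

# Hypothetical values: angles on either side of the +/- pi discontinuity.
print(angle_error(3.1, -3.1))
print(is_success(3.1, -3.1, tolerance=0.1))
print(angle_features(0.0))
```

Wrapping the error avoids reporting a failure when the object and the target lie on opposite sides of the angle discontinuity, which is also the reason for including cosine and sine values in the state.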
IV-D2 Dataset description
For this task, only 34 demonstrations are recorded, consisting of 14 amateur demonstrations with imperfect state-action pairs, and 20 expert demonstrations. The expert data is then split in half: one half is used for training and the other half for validation. All the demonstrations were successful ones, where the operator was able to solve the task.
IV-D3 Results
Fig. 7 shows the average performance of models with runs each. Each run is given a fixed budget of steps and success is achieved if the imitator is capable of bringing the object's position within the range of the target position, i.e. . As expected, the success rate of Adam decreased with the addition of imperfect demonstrations, but that of At-Adam with also suffered a significant decrease. On the other hand, At-Adam with maintains its performance for half of the amateur demonstrations, but then deteriorates when amateur trajectories are given. Since the success rate of t-Adam with its robustness fixed at increased when the amateur trajectories were added, it is likely that the proposed adjustment rule for the t-momentum's degrees of freedom was incomplete, or that the simultaneous optimization of and caused the policy to fall into one of the local solutions when updating with temporarily high .
For further investigation, Fig. 8 shows the success rates of the models trained using only the amateur data. As we can see, despite being affected by the presence of imperfect demonstrations in the previous result, At-Adam is capable of altering its robustness to extract the most useful information from this imperfect dataset. Interestingly, with amateur trajectories, the success rate in Fig. 8 is higher than that in Fig. 7. This suggests that the decrease in the success rate of At-Adam may be due to a cause outside the proposed method. That is, BC is poor at learning multimodal policies [8], and if the policy optimized on the amateur demonstrations and the one optimized on the expert's are different but can both solve the task, learning with both sets of demonstrations will fail due to the nature of BC.
V Conclusions
In this study, we showed how the t-momentum could be used to produce robust imitators under the BC framework. Taking advantage of Aeschliman's algorithm [1], we introduced a mechanism to automatically adjust the robustness of the t-momentum strategy, in order to deal with different proportions of imperfect and noisy pairs in the demonstrations. The application to two different robots, with different tasks of different degrees of difficulty, displayed the effectiveness of the proposed approach.
As implied by the experiments, the amateur demonstrations may make the policy multimodal; this reaffirms the fact that standard BC and/or the policy model should be modified in order to resolve this multimodality. In addition, the proposed method can be regarded as a kind of safety net, because it removes outliers at the final stage of optimization. An unsupervised classification of demonstrations and/or a robust design of the loss function would be required to actively utilize amateur demonstrations and further bring forth their potential for wider imitation learning applications. In future work, the proposed method will be integrated into such algorithms.
References

[1] (2010) A novel parameter estimation algorithm for the multivariate t-distribution and its application to computer vision. In European Conference on Computer Vision, pp. 594–607.
[2] (2020) ROBEL: robotics benchmarks for learning with low-cost robots. In Conference on Robot Learning, pp. 1300–1313.
[3] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
[4] (1995) A framework for behavioural cloning. In Machine Intelligence 15, pp. 103–129.
[5] (2019) Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pp. 783–792.
[6] (2011) VSA-CubeBot: a modular variable stiffness platform for multiple degrees of freedom robots. In IEEE International Conference on Robotics and Automation, pp. 5090–5095.
[7] (2018) Teaching a robot to grasp real fish by imitation learning from a human supervisor in virtual reality. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 7185–7192.
[8] (2020) A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pp. 1259–1277.
[9] (2016) Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29, pp. 4565–4573.
[10] (2020) Robust stochastic gradient descent with Student-t distribution based first-order momentum. IEEE Transactions on Neural Networks and Learning Systems.
[11] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[12] (2021) T-soft update of target network for deep reinforcement learning. Neural Networks 136, pp. 63–71.
[13] (2018) RoboTurk: a crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pp. 879–893.
[14] (2000) Algorithms for inverse reinforcement learning. In ICML, Vol. 1, pp. 2.
[15] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037.
[16] (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pp. 627–635.
[17] (2021) Robust imitation learning from noisy demonstrations. In International Conference on Artificial Intelligence and Statistics, pp. 298–306.
[18] (2020) Variational imitation learning with diverse-quality demonstrations. In International Conference on Machine Learning, pp. 9407–9417.
[19] (2018) Behavioral cloning from observation. arXiv preprint arXiv:1805.01954.
[20] (2019) Generative adversarial imitation learning with deep P-network for robotic cloth manipulation. In IEEE-RAS International Conference on Humanoid Robots, pp. 274–280.
[21] (2019) Imitation learning from imperfect demonstration. In International Conference on Machine Learning, pp. 6818–6827.