# An Adaptive Imitation Learning Framework for Robotic Complex Contact-Rich Insertion Tasks

Yan Wang, et al.

Jan 13, 2022

## 1 Introduction

Contact-rich insertion is a ubiquitous robotic manipulation skill in both product assembly and home scenarios. Some contact-rich insertion tasks involve nonlinear and low-clearance insertion trajectories and require varying force control policies at different phases, which we define as complex contact-rich insertion tasks, such as ring-shaped elastic part assembly and USB insertion. Such tasks demand skillful maneuvering and control, which makes them challenging for robots.

Imitation learning (IL) is a promising approach to tackle complex contact-rich insertion tasks by reproducing the trajectory and force profiles from human demonstrations. However, there are some concerns that prevent IL from working efficiently and safely in actual applications:

1) Force profiles are not easy to acquire from demonstrations compared with trajectory profiles: Trajectory profiles can be easily obtained from kinesthetic teaching, teleoperation, simulation, among other methods, but force profiles usually demand additional haptic devices (Kormushev et al., 2011). Even with the force sensor that is integrated into the robot, it suffers from the strict position limit, i.e., the hand of the demonstrator should never be between the end-effector (EEF) and the force sensor, which usually makes the demonstrations of the complex contact-rich tasks inconvenient. Also, when there is no real robot available, force profiles from simulated environments can be unsuitable for actual tasks due to the reality gap (Mirletz et al., 2015).

2) Motion shift of the EEF of the manipulator exacerbates the compounding error problem (Ross and Bagnell, 2010; Ross et al., 2011) of IL because IL usually learns a one-step model that takes a state and an action and outputs the next state, and one-step prediction errors can get magnified and lead to unacceptable inaccuracy (Asadi et al., 2019).

3) Demonstrations are usually task specific, so new tasks require repetitive human teaching effort even if demonstrations with topologically similar trajectories have already been collected.

The main contribution of this paper is an adaptive imitation learning framework for robot manipulation (Figure 1). It introduces dynamical movement primitives (DMPs) into a hybrid trajectory and force learning framework in a modular fashion to learn the control policies of a specific class of complex contact-rich insertion tasks from the trajectory profile of a single instance (note that a trajectory profile can include several trajectory demonstrations of a task instance), thus relieving the human demonstration burden. We show that the proposed framework is sample efficient, generalizes to novel tasks, and is safe enough for learning both in a simulated environment and on real hardware.

FIGURE 1. System overview of the adaptive robotic imitation framework. The upper and the lower part are the trajectory learning and the force learning parts, respectively. The switch symbol between the reinforcement learning (RL) agent and the dynamical movement primitives (DMPs) module means the update of DMPs is executed using a modular learning strategy.

The rest of this paper is organized as follows: After discussing the most closely related work in the Related work section, we set up our problem and introduce some techniques applied in our framework in the Preliminaries section. In the Adaptive robotic imitation framework section, we describe the overview and details of the proposed adaptive imitation learning framework. Then, in the Experimental evaluation section, we experimentally evaluate the performance of this framework in a simulated environment and on real hardware using a UR3e robotic arm.

## 2 Related work

In this section, we provide an overview of the application of IL and RL approaches in the context of contact-rich insertion tasks and the position of our work in the existing literature.

### 2.1 Imitation learning

Imitation learning (IL), also referred to as learning from demonstration (LfD), is a powerful approach for complex manipulation tasks, which perceives and reproduces human movements without the need for explicit programming of behavior (Takamatsu et al., 2007; Kormushev et al., 2011; Suomalainen and Kyrki, 2017; Hu et al., 2020). Among the IL approaches, DMPs (Ijspeert et al., 2013) have shown the ability to generalize demonstrations in different manipulation tasks (Peters and Schaal, 2008; Metzen et al., 2014; Hu et al., 2018; Sutanto et al., 2018). However, the forces and torques that a human applies during the demonstrations of contact-rich tasks are required to regress a proper admittance gain of the robot controller (Tang et al., 2016) or to match with modified demonstrated trajectories using DMPs (Abu-Dakka et al., 2015; Savarimuthu et al., 2017). To quickly program new peg-in-hole tasks without analyzing the geometric and dynamic characteristics of workpieces, Abu-Dakka et al. (2014) exploit demonstrations and exception strategies to develop a general strategy that can be applied to objects with similar shapes that need to be inserted. However, force profiles are still essential for such strategies to modify the trajectories of the learned movements.

In contrast, we study the case wherein only the trajectory profile of a single instance is available in a class of complex contact-rich insertion tasks, and based on this trajectory profile, we manage to solve other variations of this instance with different object sizes or shapes but topologically similar insertion trajectories without explicitly knowing the concrete geometric characteristics. In this context, it does not help even if the original force profile is available because the new trajectories are unknown so that we cannot match the trajectory and the force profiles.

### 2.2 Reinforcement learning

Reinforcement learning (RL) methods have been widely used for contact-rich robotic assembly tasks (Inoue et al., 2017; Thomas et al., 2018; Vecerik et al., 2019; Beltran-Hernandez et al., 2020) to circumvent difficult and computationally expensive modeling of environments. However, sample efficiency and safety have always been issues that limit the practicality of RL in complex contact-rich manipulation tasks.

To improve the sample efficiency and guarantee the safety of RL, human prior knowledge is usually incorporated for learning complex tasks. One such way is reward shaping (Ng et al., 1999), where additional rewards auxiliary to the real objective are included to guide the agent toward the desired behavior, e.g., providing punishment when a safety constraint such as collision avoidance is violated (Beltran-Hernandez et al., 2020). Generally, reward shaping is a very manual process. It is as difficult to recover a good policy with reward shaping as to specify the policy itself (Johannink et al., 2019). Although some prior work considers reward shaping as a part of the learning system (Daniel et al., 2015; Sadigh et al., 2017), human effort is still necessary to rate the performance of the system. Another way is therefore to include human prior knowledge in RL through demonstration (Atkeson and Schaal, 1997) to guide the exploration. Some work initializes RL policies from demonstration for learning classical tasks such as cart-pole (Atkeson and Schaal, 1997), hitting a baseball (Peters and Schaal, 2008), and swing-up (Kober and Peters, 2009). Beyond initialization using demonstration, some promising approaches incorporate demonstrations into the RL process through the replay buffer (Vecerik et al., 2017; Nair et al., 2018) and fine-tuning with an augmented loss (Rajeswaran et al., 2018). However, these methods require humans to be able to teleoperate the robot to perform the task so that the observation and action spaces of the demonstration (state–action pairs) are consistent with those of the RL agent, and such teleoperation is not always available for an industrial manipulator.

To cope with the lack of a teleoperation system, residual RL (Johannink et al., 2019) combines the conventional controller that ships with most robots with deep RL to solve complex manipulation tasks: the part of the problem that can be handled with conventional feedback control, e.g., impedance control, is left to the controller, and the residual part, including contacts and external object dynamics, is solved with RL. Based on Johannink et al. (2019), Davchev et al. (2020) propose a residual LfD (rLfD) framework that bridges LfD and model-free RL through an adaptive residual learning policy operating alongside DMPs applied directly to the full pose of the robot to learn contact-rich insertion tasks. However, Davchev et al. (2020) do not discuss how to handle different force requirements at different phases of the insertion task, e.g., the search phase and the insertion phase.

In the proposed framework, we utilize DMPs on the skill level together with a novel hierarchical goal-conditioned imitation learning (HGCIL) approach to provide nominal trajectories for the controller to follow, and we learn the motion policy of the controller by RL. Specifically, the framework learns time-variant force-control gains to behave appropriately at different phases of the insertion task, which is not discussed in Davchev et al. (2020), and the DMPs are also updated by RL to adapt the existing nominal trajectories to new tasks during the training process.

## 3 Preliminaries

In this section, we describe the problem statement and provide fundamentals of some key techniques utilized in our adaptive robotic imitation framework.

### 3.1 Problem statement

Let $\mathcal{A}$ be a complex contact-rich insertion task class, which represents a set of tasks with topologically similar trajectories. We define a task $A^{(n)} \in \mathcal{A}$ as the $n$th instance of $\mathcal{A}$. $P^{(n)}$ is the demonstrated trajectory profile of $A^{(n)}$ consisting of $k$ demonstrated trajectories, $\Gamma$, i.e., $P^{(n)} = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_k\}^{(n)}$, and each $\Gamma$ in $P^{(n)}$ consists of a sequence of the EEF poses, $p$, in the task space. Using the hybrid trajectory and force learning framework proposed by Wang et al. (2021), we can learn a proper control policy for each $A^{(n)}$ if $P^{(n)}$ is accessible.

To clarify, we assume an L-shaped object insertion (L insertion) task class, referred to as $\mathcal{A}$. The goal of L insertion is to insert an L-shaped workpiece held by a robotic gripper into a groove with a corresponding shape, with clearances of no more than 1 mm. There are some instances $A^{(1)}, A^{(2)}, A^{(3)}, A^{(4)} \in \mathcal{A}$, and Figure 2 shows the L-shaped workpiece, $L$, involved in each instance. With $A^{(1)}$ as the base instance, the $L$ of $A^{(2)}$ gets its shape by applying an affine transformation to the $L$ of $A^{(1)}$; the $L$s of $A^{(3)}$ and $A^{(4)}$ further reshape it by extending the bottom and doubling the entity, respectively.

FIGURE 2. A class of L-shaped object insertion tasks. The shapes and sizes of workpieces are different among tasks, but these tasks possess topologically similar insertion trajectories (unit: mm).

In this paper, we assume that only the demonstrated trajectory profile of $A^{(1)}$, $P^{(1)}$, is available, as shown in Figure 3. We know that other instances of $\mathcal{A}$ have trajectories similar to that of $A^{(1)}$, but we have no access to concrete information about these trajectories or the geometric characteristics of the objects involved in these instances. Although we could collect their trajectory profiles through demonstrations, doing so would be time-consuming and tedious when the number of instances is large, which places a heavy burden on the human demonstrator. Therefore, we need an effective trajectory learning approach that can adapt an existing trajectory profile to new similar scenarios to reduce the human burden, and this is our motivation for introducing DMPs into the hybrid trajectory and force learning framework.

FIGURE 3. The insertion trajectory of A(1).

### 3.2 Dynamical movement primitives

##### 3.2.1 Positional dynamical movement primitives

Following the modified formulation of positional DMPs introduced by Park et al. (2008), the differential equation of a one-dimensional positional DMP has three components. The first component is the transformation system that creates the trajectory plan:

$\tau \dot{v} = K\left[(g - x) - (g - x_0)s + f(s)\right] - Dv \quad (1)$

where $x \in \mathbb{R}$ and $v = \tau \dot{x}$ are the position and velocity of a prescribed point of the system, respectively. $\tau \in \mathbb{R}^+$ is a temporal scaling factor. $x_0, g \in \mathbb{R}$ are the initial and goal positions, respectively. $K, D \in \mathbb{R}^+$ are the spring and damping terms, respectively, and $D$ is chosen as $D = 2\sqrt{K}$ to keep the system critically damped. $s$ is a phase variable, and it is governed by the second component of the DMP formulation, a canonical system: $\tau \dot{s} = -\alpha s$, $\alpha \in \mathbb{R}^+$.

The third component is a nonlinear function approximation term (called the forcing term), $f$, to shape the attractor landscape,

$f(s) = \frac{\sum_{i=1}^{N} \omega_i \psi_i(s)}{\sum_{i=1}^{N} \psi_i(s)} s \quad (2)$

where $\psi_i(s) = \exp\left(-h_i (s - c_i)^2\right)$ are Gaussian basis functions with centers $c_i$ and widths $h_i$, and $\omega_i$ are their weights.
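As a minimal sketch of Eq. 2 (not the authors' implementation), the forcing term can be evaluated as a normalized weighted sum of Gaussian basis functions scaled by the phase. The function name `forcing_term`, the number of basis functions, and the heuristic choice of centers and widths are illustrative assumptions that do not come from the paper.

```python
import math


def forcing_term(s, weights, centers, widths):
    """Evaluate f(s) = (sum_i w_i * psi_i(s) / sum_i psi_i(s)) * s (Eq. 2)."""
    psi = [math.exp(-h * (s - c) ** 2) for c, h in zip(centers, widths)]
    total = sum(psi)
    if total == 0.0:
        # Every basis function underflowed; treat the forcing term as zero.
        return 0.0
    return sum(w * p for w, p in zip(weights, psi)) / total * s


# Assumed setup: N basis functions whose centers are evenly spaced in time,
# i.e. exponentially spaced in the phase s = exp(-alpha * t).
N, alpha = 10, 4.0
centers = [math.exp(-alpha * i / (N - 1)) for i in range(N)]
widths = [N ** 1.5 / c for c in centers]  # heuristic widths (assumption)
weights = [0.5] * N                       # placeholder weights (assumption)

f_mid = forcing_term(0.3, weights, centers, widths)
f_end = forcing_term(0.0, weights, centers, widths)
```

Because the basis functions are normalized, equal weights reduce the forcing term to a constant multiple of $s$, and the factor $s$ makes it vanish as the phase decays to zero, so the goal attractor dominates at the end of the movement.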

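In this paper, we utilize three-dimensional DMPs for the three positional degrees of freedom (DoF). Therefore, we rewrite Eq. 1 in multidimensional form as shown in Eq. 3:

$\tau \dot{\mathbf{v}} = \mathbf{K}\left[(\mathbf{g} - \mathbf{x}) - (\mathbf{g} - \mathbf{x}_0)s + \mathbf{f}(s)\right] - \mathbf{D}\mathbf{v}, \qquad \tau \dot{\mathbf{x}} = \mathbf{v} \quad (3)$

Each DoF has its own transformation system and forcing term but shares the same canonical system.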

##### 3.2.2 Orientational dynamical movement primitives

Besides position, insertion tasks are also highly dependent on orientation. Therefore, we also utilize orientational DMPs (Pastor et al., 2011; Ude et al., 2014). A unit quaternion $q \in S^3$ is commonly used to describe an orientation because it provides a singularity-free and nonminimal representation of the orientation (Ude et al., 2014). $S^3$ is a unit sphere in $\mathbb{R}^4$. The transformation system of orientational DMPs is:

$\tau \dot{\eta} = K\left[2 \log(g * \bar{q})\right] - D\eta + f(s), \qquad \tau \dot{q} = \frac{1}{2} \tilde{\eta} * q \quad (4)$

where $g \in S^3$ denotes the goal quaternion orientation, $\bar{q}$ denotes the quaternion conjugation of $q$, and $*$ denotes the quaternion product. $\tilde{\eta} = [0, \eta^T]^T$ is the angular velocity quaternion. $K, D \in \mathbb{R}^{3 \times 3}$ are angular stiffness and damping gains, respectively. The canonical system and the nonlinear forcing term, $f(s)$, are defined in the same way as for the positional DMPs. We also use the quaternion logarithm $\log(\cdot)$ and exponential map $\exp(\cdot)$ as given in Ude et al. (2014).
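The following sketch illustrates how Eq. 4 could be integrated numerically. It is a simplified illustration under stated assumptions rather than the authors' code: the stiffness and damping are scalars instead of 3×3 matrices, the damping is set to $2\sqrt{K}$ by analogy with the positional case, the forcing term is set to zero, and the helper names (`q_mul`, `q_log`, `q_exp`, `rollout_orientation`) are hypothetical. Quaternions are stored as `(w, x, y, z)` tuples.

```python
import math


def q_mul(a, b):
    """Quaternion product a * b."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def q_conj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)


def q_log(q):
    """Quaternion logarithm: unit quaternion -> vector in R^3."""
    w, x, y, z = q
    n = math.sqrt(x * x + y * y + z * z)
    if n < 1e-12:
        return (0.0, 0.0, 0.0)
    k = math.atan2(n, w) / n  # atan2 is better conditioned than acos near w = 1
    return (k * x, k * y, k * z)


def q_exp(r):
    """Exponential map: vector in R^3 -> unit quaternion."""
    n = math.sqrt(sum(c * c for c in r))
    if n < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    k = math.sin(n) / n
    return (math.cos(n), k * r[0], k * r[1], k * r[2])


def rollout_orientation(q0, g, K=100.0, tau=1.0, dt=0.002, steps=3000):
    """Euler integration of Eq. 4 with f(s) = 0 (assumption)."""
    D = 2.0 * math.sqrt(K)  # assumed critically damped scalar gain
    q, eta = q0, (0.0, 0.0, 0.0)
    for _ in range(steps):
        err = q_log(q_mul(g, q_conj(q)))
        eta_dot = [(K * 2.0 * e - D * n) / tau for e, n in zip(err, eta)]
        eta = tuple(n + dt * d for n, d in zip(eta, eta_dot))
        # q(t + dt) = exp(dt / 2 * eta / tau) * q(t)
        q = q_mul(q_exp(tuple(0.5 * dt * n / tau for n in eta)), q)
    return q


# Example: rotate from the identity to a 90 degree rotation about the z axis.
goal = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
q_final = rollout_orientation((1.0, 0.0, 0.0, 0.0), goal)
```

Updating the quaternion through the exponential map keeps it on the unit sphere up to rounding error, which is the reason this integration scheme is preferred over adding $\dot{q}\,dt$ directly.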

##### 3.2.3 Coupling term

Eqs. 3 and 4 can be used to imitate a demonstrated trajectory. However, in practice we sometimes wish to modify the behavior of the system online. To modify a DMP online, an optional coupling term, $C_t$, is usually added to the transformation system of the DMP. For example, a one-dimensional positional DMP with $C_t$ is obtained by expanding Eq. 1 and adding $C_t$:

$\tau \dot{v} = K(g - x) - Dv - K(g - x_0)s + Kf(s) + C_t \quad (5)$

Ideally, $C_t$ would be zero unless a special sensory event requires modifying the DMP. In the field of robotic manipulation, coupling terms have been used to avoid obstacles (Rai et al., 2014), to avoid joint limits (Gams et al., 2009), to grasp under uncertainty (Pastor et al., 2009), etc. This term is vital for our adaptive framework, and we will discuss it in the Adaptive robotic imitation framework section.
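To make the effect of the coupling term concrete, the sketch below integrates a one-dimensional positional DMP with an optional coupling term using explicit Euler steps. It is an illustration under assumptions, not the authors' implementation: the $(g - x_0)$ term is multiplied by the phase $s$ so that the system is consistent with Eq. 1, the damping is set to $2\sqrt{K}$, and the function name `rollout_dmp`, the gains, and the constant coupling value are hypothetical.

```python
def rollout_dmp(x0, g, K=100.0, tau=1.0, alpha=4.0, dt=0.001, steps=5000,
                forcing=lambda s: 0.0, coupling=lambda t, x, v: 0.0):
    """Integrate a 1-D positional DMP with an optional coupling term C_t."""
    D = 2.0 * K ** 0.5  # critical damping
    x, v, s = x0, 0.0, 1.0
    xs = [x]
    for i in range(steps):
        c_t = coupling(i * dt, x, v)
        v_dot = (K * (g - x) - D * v - K * (g - x0) * s
                 + K * forcing(s) + c_t) / tau
        v += dt * v_dot
        x += dt * v / tau
        s += dt * (-alpha * s) / tau  # canonical system
        xs.append(x)
    return xs


# Without a coupling term the DMP converges to the goal g.
plain = rollout_dmp(x0=0.0, g=1.0)

# A constant coupling term (hypothetical value) shifts the equilibrium to
# g + C_t / K, which shows how C_t can bend the nominal trajectory.
shifted = rollout_dmp(x0=0.0, g=1.0, coupling=lambda t, x, v: 5.0)
```

In practice the coupling term would be a function of sensory input or, as in this paper, a quantity chosen by a learning agent, rather than a constant.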

### 3.3 Goal-conditioned imitation learning

In a typical IL setting, the $i$th demonstrated trajectory $\Gamma_i$ in a trajectory profile $P$ is in the form of state–action pairs, i.e., $\Gamma_i = (s_0^i, a_0^i, \ldots, s_T^i, a_T^i)$, where $T$ represents the total number of time steps. For a complex nonlinear trajectory, some specific states, commonly known as bottleneck states, need to be reached to correctly imitate the whole trajectory. Behavior cloning (BC), a conventional approach that learns a policy $\pi(a|s)$ from the state–action pairs, struggles to imitate such a trajectory because of compounding errors in the Markov decision process (MDP). Goal-conditioned IL (GCIL) is a self-supervised method that learns a goal-conditioned policy and has been shown to be more effective than BC in reproducing such complex trajectories (Kaelbling, 1993; Schaul et al., 2015; Ding et al., 2019). In a goal-conditioned setting, the state–action pairs are replaced by state–action–goal triplets, $(s_t^i, a_t^i, s_g^i)$, and a goal-conditioned policy $\pi(a|s, s_g)$, which attempts to match different goals, is learned instead of $\pi(a|s)$. Data relabeling (Lynch et al., 2019) is an effective data augmentation method commonly used by GCIL, which treats each state $s_{t+k}^i$ visited within a demonstrated trajectory from $s_t^i$ to $s_g^i$ as a latent goal state. This technique is particularly effective in the low-data regime where only a few demonstrations are available.
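Data relabeling as described above can be sketched in a few lines. This is a generic illustration, assuming that a trajectory is stored as a list of `(state, action)` pairs; the function name `relabel` and the toy trajectory are hypothetical.

```python
def relabel(trajectory):
    """Turn (state, action) pairs into (state, action, goal) triplets.

    Every state visited later in the same trajectory is treated as a
    latent goal for the current state-action pair.
    """
    states = [state for state, _ in trajectory]
    triplets = []
    for t, (s_t, a_t) in enumerate(trajectory):
        for k in range(1, len(trajectory) - t):
            triplets.append((s_t, a_t, states[t + k]))
    return triplets


# Toy trajectory with four state-action pairs (placeholder labels).
demo = [("s0", "a0"), ("s1", "a1"), ("s2", "a2"), ("s3", "a3")]
augmented = relabel(demo)
```

A trajectory with $T + 1$ pairs yields $T(T+1)/2$ triplets, which is why relabeling helps most when only a few demonstrations are available.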

Algorithm 1 | Modular learning process.

## 4 Adaptive robotic imitation framework

### 4.1 System overview

The architecture of our framework, shown in Figure 1, is built on the hybrid trajectory and force learning framework from our previous work (Wang et al., 2021). It consists of a trajectory learning part and a force learning part. The former takes an existing trajectory profile, $P^{(m)}$, of the task $A^{(m)} \in \mathcal{A}$ as input and generates the nominal trajectory, $\Gamma_N^{(n)}$, of the task $A^{(n)} \in \mathcal{A}$. $\Gamma_N^{(n)}$ is learned from $P^{(m)}$ by an IL agent, which consists of an adaptive DMP (ADMP) module and a skill policy module. The force learning part is composed of an RL agent and a parallel position/force controller (Chiaverini and Sciavicco, 1993). The RL agent learns both the parameters and the position/orientation commands of the controller following $\Gamma_N^{(n)}$ to control the industrial rigid manipulator to finish $A^{(n)}$ with a proper force control policy. In the rest of this section, we introduce each part of this framework in detail.

### 4.2 Modular learning strategy

In the proposed framework, we use a modular learning strategy because end-to-end learning can become very inefficient and even fail as networks grow (Glasmachers, 2017), which is known as the curse of dimensionality. In contrast, structured training of separate modules may be more robust. Moreover, assembly tasks are naturally divided into different subtasks that can be learned in different modules, e.g., in our problem setting, a task can be divided into a trajectory learning part and a force learning part. Therefore, we introduce DMPs into the framework in a modular learning fashion expecting to overcome the curse of dimensionality.

The ADMP module works in the trajectory learning part. It is kept fixed after finding a seemingly suitable nominal trajectory $\Gamma_N^{(n)}$ of $A^{(n)}$ with a small amount of trial and error, and then the framework only updates the parameters of the controller for the force learning at each training step. If the learning performance remains poor with the current $\Gamma_N^{(n)}$ after a certain number of steps, the ADMP is updated again at a given frequency to search for an alternative $\Gamma_N^{(n)}$. This mechanism is represented by the switch symbol in Figure 1. The whole modular learning process is shown in Algorithm 1.
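The switching mechanism described above could be expressed as a simple predicate, sketched below. The criterion for "poor performance" and all parameter names (`reward_threshold`, `patience`, `update_every`) are assumptions made for illustration; the paper does not specify them here.

```python
def admp_update_due(step, episode_rewards, reward_threshold, patience,
                    update_every):
    """Decide whether the ADMP should be updated at this training step.

    Assumed criterion: the ADMP stays frozen unless none of the last
    `patience` episode rewards reached `reward_threshold`; in that case it
    is updated once every `update_every` steps.
    """
    if len(episode_rewards) < patience:
        return False  # not enough evidence yet, keep the ADMP frozen
    recent = episode_rewards[-patience:]
    performing_poorly = max(recent) < reward_threshold
    return performing_poorly and step % update_every == 0


# Hypothetical values: three poor episodes in a row trigger an update on a
# step that is a multiple of 50.
due = admp_update_due(step=100, episode_rewards=[1.0, 2.0, 3.0],
                      reward_threshold=10.0, patience=3, update_every=50)
```

When the predicate is false, only the controller parameters are trained, which matches the modular idea of learning the force policy against a fixed nominal trajectory.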

### 4.3 Trajectory learning

##### 4.3.1 Adaptive action of adaptive dynamical movement primitives

In the trajectory learning, we hope to adapt trajectories in an existing task trajectory profile to new trajectories that are suitable for other similar tasks. Therefore, we introduce ADMP to achieve this goal. As we only use ADMP to realize spatial scaling, we set the temporal scaling factor τ to 1 in Eqs. 3 and 4.

As mentioned in the Coupling term section, the behaviors of ADMP can be modified by changing the coupling terms, Ct, in Eq. 5. Therefore, it is a promising approach to learn proper Ct for ADMP to adapt to new scenarios. Moreover, the forcing term weights, ω, can also affect the resulting trajectories.

To discern how different components of the DMP formulation affect the results, we conduct an investigation by introducing random Ct or adding random noise to ω in the DMP formulation of a sine wave, as depicted in Figure 4. The green line is a sine wave trajectory. We spatially scale the sine wave to match a new goal using two-dimensional DMPs. The first subfigure in Figure 4 shows the scaled trajectory using vanilla DMPs, which means no Ct is added, and ω is chosen to match the original trajectory without noise. With this baseline, we then add 1) only Ct; 2) only ω noise; and 3) both Ct and ω noise to the DMP formulation to observe the effects on the resulting trajectories. The results in Figure 4 indicate that 1) Ct facilitates local exploration around the original trajectory, 2) noise added to ω leads to a locally smooth but globally different trajectory, and 3) adding both Ct and ω noise results in a trajectory with both global shape change and local exploration.

FIGURE 4. Comparison of different resulting trajectories by changing different components of the DMPs formulation. Green line is the original sine wave trajectory. Blue star and red star are the original goal and the new goal, respectively.

Considering our requirements, the global change in trajectory may benefit the coarse adaptation to the geometric characteristics of new workpieces, and the local exploration can help to tackle some delicate bottleneck states along the trajectory. Therefore, we choose to add both Ct and ω noise to the DMP formulation. In the framework, Ct and the ω noise are not arbitrary random variables; they are learned by the RL agent through interaction with the environment.

##### 4.3.2 Hierarchical goal-conditioned imitation learning

We train the skill policy using an HGCIL approach proposed in our previous work (Wang et al., 2021). Following the goal-conditioned setting in the Goal-conditioned imitation learning section, we reorganize the original trajectory profile $P^{(n)}$ into a hierarchical goal-conditioned (HGC) trajectory profile $P_{skill}^{(n)}$. A trajectory $\Gamma \in P_{skill}^{(n)}$ consists of a sequence of poses $(p_0, p_1, \ldots, p_T)$, which are Cartesian poses of the EEF in our framework. Sliding along each sequence in $P^{(n)}$ with two predefined hierarchical windows, $W_s$ and $W_m$, we obtain a new sequence of triplets,

$(p, p_l, p_h) = \left(p_t, p_{t+\min(w, W_m)}, p_{t+w}\right), \quad \text{if } t + w \le T, \quad t = 1, 2, \ldots, T; \; w = 1, 2, \ldots, W_s,$

where $p$, $p_l$, and $p_h$ represent the current pose, the subgoal pose, and the goal pose, respectively. Note that $p_l$ plays the role of the action between two consecutive states here. All these triplets compose $P_{skill}^{(n)}$, and the skill policy $\pi(p_l|p, p_h)$ is trained using a fully connected neural network with three hidden layers, each with 256 units, a dropout rate of 0.1, and ReLU as the activation function, which maps the observation, $(p, p_h)$, to the action, $p_l$. With the skill policy $\pi(p_l|p, p_h)$, the IL agent can automatically find subgoals, $p_l$, for a distant goal along the trajectory and provide $p_l$ to the parallel controller. All these subgoals compose the nominal trajectory, $\Gamma_N^{(n)}$. Since $p_l$ can be periodically updated based on $p$ and $p_h$, the motion drift of the EEF is constrained, and the goal-conditioned setting assists the EEF in recovering from unseen states.
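The sliding-window construction of the (current pose, subgoal, goal) triplets can be sketched as follows. The sketch follows the index ranges of the triplet definition above; the function name `hgc_triplets` is hypothetical, and integer labels stand in for Cartesian poses.

```python
def hgc_triplets(poses, W_s, W_m):
    """Build (p, p_l, p_h) triplets from one pose sequence (p_0, ..., p_T).

    W_s bounds how far ahead the goal pose p_h may lie, and W_m bounds how
    far ahead the subgoal pose p_l may lie.
    """
    T = len(poses) - 1
    triplets = []
    for t in range(1, T + 1):
        for w in range(1, W_s + 1):
            if t + w <= T:
                triplets.append((poses[t],
                                 poses[t + min(w, W_m)],
                                 poses[t + w]))
    return triplets


# Toy example: five poses labeled 0..4, goal window 3, subgoal window 1.
example = hgc_triplets(list(range(5)), W_s=3, W_m=1)
```

With $W_m = 1$ the subgoal is always the next pose, whereas a larger $W_m$ lets the skill policy propose subgoals that are several steps ahead of the current pose.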

### 4.4 Force learning

##### 4.4.1 Reinforcement learning-based controller

An RL-based controller proposed in our previous work (Beltran-Hernandez et al., 2020) is responsible for learning the proper force control policy determined by the gain parameters of the controller, $a_{cp}$, as well as the position/orientation commands of the EEF, $a_p$. The RL-based controller consists of an RL agent and a parallel position/force controller. The parallel position/force controller includes a proportional derivative (PD) controller generating part of the movement command, $p_c^p$, based on the position feedback, and a proportional integral (PI) controller adjusting the movement command by $p_c^f$ according to the force feedback.

The learning process starts with $p_l$ from the trajectory learning at every time step. $p$ is the actual Cartesian pose of the EEF, and $\mathbf{f} = [f, \tau]$ is the contact force, where $f \in \mathbb{R}^3$ is the force vector and $\tau \in \mathbb{R}^3$ is the torque vector. $f_g$ is the reference force of the insertion task. The pose error of the EEF, $p_e = p_l - p$, the velocity of the EEF, $\dot{p}$, and $\mathbf{f}$ serve as inputs to the RL agent, while $p_e$ and $\mathbf{f}$ also serve as feedback to the parallel position/force controller. For the controller, the RL agent gives policy actions consisting of $a_p$ and $a_{cp}$. $a_p = [v, w]$ are the position/orientation commands, where $v \in \mathbb{R}^3$ is the position and $w \in \mathbb{R}^4$ is the quaternion to control the movements of the robot; $a_{cp} = [K_p^p, K_p^f, S]$ are the gain parameters of the controller, where $K_p^p$, $K_d^p = 2\sqrt{K_p^p}$, $K_p^f$, and $K_i^f = 0.01 K_p^f$ are the PD proportional, PD derivative, PI proportional, and PI integral gains, respectively, and

$S = \mathrm{diag}(s_1, s_2, s_3, s_4, s_5, s_6), \quad s_n \in [0, 1] \quad (6)$

is the selection matrix, whose elements correspond to the degree of control that each controller has over a given direction. Finally, the actual position command, $p_c = a_p + p_c^p + p_c^f$, is produced by the controller based on all inputs and sent to the manipulator.
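As a rough illustration of how the three contributions to the position command combine, the sketch below implements a single Cartesian axis of a parallel position/force controller. It rests on assumptions that are not spelled out in the text: the selection value `s` weights the PD output and `1 - s` weights the PI output, the derivative is a finite difference of the pose error, and the class name `ParallelAxisController` and all numeric values are hypothetical.

```python
class ParallelAxisController:
    """One axis of a parallel position/force controller (illustrative)."""

    def __init__(self, dt):
        self.dt = dt
        self.prev_pe = 0.0       # previous pose error, for the derivative
        self.fe_integral = 0.0   # accumulated force error, for the PI term

    def command(self, a_p, pe, fe, kp_p, kd_p, kp_f, ki_f, s):
        # PD term on the pose error, weighted by s (assumed weighting).
        pe_rate = (pe - self.prev_pe) / self.dt
        self.prev_pe = pe
        p_c_p = s * (kp_p * pe + kd_p * pe_rate)
        # PI term on the force error, weighted by 1 - s (assumed weighting).
        self.fe_integral += fe * self.dt
        p_c_f = (1.0 - s) * (kp_f * fe + ki_f * self.fe_integral)
        # Total command: policy command plus both controller corrections.
        return a_p + p_c_p + p_c_f


# s = 1 gives pure position control on this axis; s = 0 gives pure force
# control. The 500 Hz time step and the gains are placeholder values.
position_only = ParallelAxisController(dt=0.002).command(
    a_p=0.0, pe=0.01, fe=5.0, kp_p=2.0, kd_p=0.0, kp_f=1.0, ki_f=0.0, s=1.0)
force_only = ParallelAxisController(dt=0.002).command(
    a_p=0.0, pe=0.01, fe=5.0, kp_p=2.0, kd_p=0.0, kp_f=1.0, ki_f=0.0, s=0.0)
```

Letting the RL agent choose `s` and the proportional gains at every step is what allows the force behavior to change between phases of the insertion, such as a compliant search followed by a stiffer push.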

##### 4.4.2 Algorithm and reward

We use Soft Actor-Critic (SAC) (Haarnoja et al., 2018) as the RL algorithm of the scheme, which is a state-of-the-art model-free, off-policy actor-critic deep RL algorithm based on the maximum entropy RL framework. It encourages exploration according to a temperature parameter, and the core idea is to succeed in the task while acting as randomly as possible. As an off-policy algorithm, it can use a replay buffer to reuse information from recent operations for sample-efficient training. We use a reward function as follows:

$r(s) = w_1 M\left(\left\|\frac{p_e}{p_{max}}\right\|_{1,2}\right) + w_2 M\left(\left\|\frac{f_e}{f_{max}}\right\|_2\right) + \gamma. \quad (7)$

$f_e = f_g - \mathbf{f}$ is the contact force error. $p_{max}$ and $f_{max}$ are predefined maximum values. $y = M(x)$ linearly maps $x \in [0, 1]$ to $y \in [1, 0]$, so a smaller $x$ gives a larger $y$. Therefore, the smaller $p_e$ and $f_e$ are, the higher the reward is. $\|z\|_{1,2}$ is the $\ell_{1,2}$ norm (Levine et al., 2016), which is given by $\frac{1}{2}\|z\|^2 + \sqrt{\alpha + \|z\|^2}$. This norm is used to encourage the EEF to precisely reach the target position, but to also receive a larger penalty when far away. $\gamma$ is the auxiliary term, which can be a positive reward (100) for finishing the task successfully, a negative one (−50) for excessive force, or 0 otherwise. $w_1$ and $w_2$ are hyperparameters that weight the components.
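The reward of Eq. 7 can be sketched as below. Several details are assumptions made for illustration: the linear map is taken to be $M(x) = 1 - x$ with $x$ clipped to $[0, 1]$, the errors are normalized by scalar maxima, and the default value of `alpha`, the weights, and the `outcome` labels are hypothetical.

```python
import math


def l12_norm(z, alpha=1e-5):
    """0.5 * ||z||^2 + sqrt(alpha + ||z||^2), after Levine et al. (2016)."""
    sq = sum(c * c for c in z)
    return 0.5 * sq + math.sqrt(alpha + sq)


def l2_norm(z):
    return math.sqrt(sum(c * c for c in z))


def linear_map(x):
    """Assumed M: clip x to [0, 1], then map 0 -> 1 and 1 -> 0."""
    return 1.0 - max(0.0, min(1.0, x))


def reward(pe, fe, p_max, f_max, w1=1.0, w2=1.0, outcome=None, alpha=1e-5):
    """Pose term + force term + auxiliary term gamma (Eq. 7)."""
    r_pose = linear_map(l12_norm([c / p_max for c in pe], alpha))
    r_force = linear_map(l2_norm([c / f_max for c in fe]))
    # gamma: 100 on success, -50 on excessive force, 0 otherwise.
    gamma = {"success": 100.0, "excessive_force": -50.0}.get(outcome, 0.0)
    return w1 * r_pose + w2 * r_force + gamma


# Placeholder errors: a pose close to the target versus one far from it.
near = reward([0.01, 0.01, 0.01], [0.0, 0.0, 0.0], p_max=0.1, f_max=10.0)
far = reward([0.05, 0.05, 0.05], [0.0, 0.0, 0.0], p_max=0.1, f_max=10.0)
```

The quadratic part of the norm dominates far from the target and the square-root part dominates close to it, which is what produces a strong pull toward precise placement near the goal.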

## 5 Experimental evaluation

In this section, we evaluate the efficacy of our adaptive robotic imitation framework in learning a class of complex contact-rich insertion tasks from a single instance. We perform a sequence of empirical evaluations using the L insertion task class. We divide this section into three parts: first, applying the framework on a simulated environment to study its sample efficiency, generalizability to different task instances, and safety during the training sessions; second, applying the framework to real insertion tasks to further validate its adaptiveness in the physical world; and third, ablation studies to investigate the effect of different components on the overall performance of our framework.

### 5.1 Implementation details

We evaluate the proposed framework both in a simulated environment built in Gazebo 9 and on a real UR3e robotic arm, as shown in Figure 5. The real UR3e robotic arm uses a control frequency of 500 Hz, which is the maximum available for the robot. The RL control policy runs at a frequency of 20 Hz in both the simulated environment and on the real robot. The training sessions are performed on a computer with a GeForce RTX 2060 SUPER GPU and an Intel Core i7-9700 CPU. The implementation of the ADMP method is based on the DMP implementation from the DMP++ (Ginesi et al., 2019) repository, and for the RL agent, we use the SAC implementation from the TF2RL (Ota, 2020) repository.

FIGURE 5. The simulated environment in Gazebo and the real experiment environment with a UR3e robotic arm. We show the setup for A(2) task where L is directly attached to the robot.

### 5.2 Evaluation on simulated environment

First, we evaluated the efficacy of the adaptive robotic imitation framework on the simulated environment. We used the L insertion task class described in the Problem statement section, and we assumed access to only a trajectory profile of A(1) consisting of six demonstrated trajectories.

##### 5.2.1 Sample efficiency

The most concerning point of the learning framework is the sample efficiency. By providing a nominal trajectory learned from demonstration to the RL learning process, the sample efficiency can be largely improved according to our previous work (Wang et al., 2021). However, the framework in this paper indirectly generates the nominal trajectory by adapting existing trajectories using ADMP and may cost more time than using demonstrated trajectories. Therefore, we are interested in whether the framework is still sample-efficient compared with other alternatives.

We compared the learning curves of training sessions on A(2) task with frameworks using different trajectory learning methods: ADMP (ours), demonstrated trajectory (DEMO), and RL from scratch (w/o) as shown in Figure 6.

FIGURE 6. Learning curves of the training sessions on A(2) task with frameworks using different trajectory learning methods: Adaptive DMPs (ADMP), demonstrated trajectory (DEMO), and RL without trajectory learning (w/o). The red dashed line represents the near-optimal reward.

Among these methods, ADMP showed the highest sample efficiency, converging in about 40 K steps, fewer than the baseline DEMO (55 K), and its learning result was as good as that of DEMO. Although the better performance of ADMP compared with DEMO may result from suboptimal demonstrations, this result indicates that introducing the DMP component into our framework is indeed effective in adapting to new tasks and alleviating the human demonstration burden, and that the sample efficiency is at least not lower than when using demonstrated trajectories of new tasks.

##### 5.2.2 Generalizability

Since the proposed framework displayed good adaptation to the A(2) task, we then tested it with A(3) and A(4) to study its generalizability to different tasks. The result is shown in Figure 7. It indicates that the framework can generalize among different kinds of task instances with good sample efficiencies and learning results. In detail, the numbers of steps needed for convergence in learning A(2), A(3), and A(4) were 40 K, 50 K, and 55 K, respectively. We attribute the different sample efficiencies mainly to the different difficulties of the tasks: the object shape in A(2) was the most similar to that of A(1), differing only by an affine transformation, while the other two involved more variations.

FIGURE 7. Learning curves and sample efficiencies of task instances A(2), A(3), and A(4). The red dashed line represents the near-optimal reward.

##### 5.2.3 Safety

Finally, we compared the collision percentage during the training sessions of each task using frameworks with and without ADMP as shown in Figure 8. Five training sessions were implemented for each pair of task and framework, and the collision percentage of each training session, Pcol, is calculated by:

$P_{col} = \frac{\text{Total Collision Number}}{\text{Total Episode Number}} \times 100\%.$

FIGURE 8. Collision percentage during the training sessions.

With ADMP, the collision percentages of A(2), A(3), and A(4) dropped from 30.2%, 39%, and 72.6% to 7.9%, 20.9%, and 30.5%, respectively. The result indicates that the proposed framework with ADMP also meets our requirement of lowering the chance of collision during the training sessions, which reduces equipment wear and tear and the risk of damaging the workpieces on real hardware.
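For reference, the collision percentage and the relative reductions implied by the reported numbers can be computed as in the short sketch below. The function name `collision_percentage` and the example episode counts are hypothetical; the percentages in `reported` are the values quoted in the text.

```python
def collision_percentage(total_collisions, total_episodes):
    """P_col = total collisions / total episodes * 100."""
    return total_collisions / total_episodes * 100.0


# Hypothetical session: 3 collisions in 20 episodes.
example_pcol = collision_percentage(3, 20)

# Reported percentages as (without ADMP, with ADMP) for each task.
reported = {"A(2)": (30.2, 7.9), "A(3)": (39.0, 20.9), "A(4)": (72.6, 30.5)}
relative_reduction = {
    task: (before - after) / before * 100.0
    for task, (before, after) in reported.items()
}
```

Under this calculation, the relative reductions are roughly 74%, 46%, and 58% for A(2), A(3), and A(4), respectively.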

### 5.3 Experiments on a real robot

After evaluating the sample efficiency, generalizability, and safety of the framework on the simulated environment, we applied it to some real insertion tasks belonging to the L insertion task class to test its adaptiveness in the physical world.

##### 5.3.1 Sim-to-real transfer

We first executed sim-to-real transfers using a trained IL agent, which learned the skill policies on simulation and obtained the control policies for A(2), A(3), and A(4) on the real hardware. The L objects in these tasks were directly attached to the robot for stability. The learning curves are shown in Figure 9. Benefiting from the learned skill policies, our framework learned good control policies for A(2), A(3), and A(4) at about 20 K steps. Although it took some time for the RL agent to adapt to the physical world, the result indicated that the skill policies learned by the framework on the simulation provided good initialization and effectively enhanced the sample efficiency of the real learning process.

FIGURE 9. Sim-to-real tasks A(2), A(3), and A(4) and their learning curves. The L objects are directly attached to the robot.

FIGURE 10. Two real assembly tasks and their learning curves. Left: USB insertion task. Right: plug insertion task. The objects are grasped by the gripper. A jig is used in the USB insertion to improve the stability considering the contact area between the USB and the gripper.

Table 1 reports the success rate and the average number of steps over 20 trials for each task. In each trial, the EEF was initially set to a random pose whose distance from the target pose lay in [15, 45] mm and whose pitch angle offset lay in [10, 30]°. Figure 11 shows the results of the initial and the learned control policies of the two tasks, including the Euclidean distance errors of the EEF, the pitch angle errors, and the force/torque data during the evaluation process. Note that although only the result of a single run is provided for each policy, it is typical enough to verify the effectiveness of the proposed framework in learning good policies for the tasks.
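The evaluation protocol above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation code: the offsets are drawn uniformly (the text only gives the ranges), the function names are hypothetical, and the average step count is taken over all trials, although the source does not say whether failed trials were included.

```python
import random


def sample_initial_offset(rng, dist_range_mm=(15.0, 45.0), pitch_range_deg=(10.0, 30.0)):
    """Draw one (distance in mm, pitch in degrees) offset from the target pose.

    Uniform sampling is an assumption; the text only states the ranges.
    """
    distance_mm = rng.uniform(*dist_range_mm)
    pitch_deg = rng.uniform(*pitch_range_deg)
    return distance_mm, pitch_deg


def summarize_trials(results):
    """Turn a list of (success, steps) pairs into (success rate in %, mean steps)."""
    if not results:
        raise ValueError("results must not be empty")
    successes = sum(1 for ok, _ in results if ok)
    mean_steps = sum(steps for _, steps in results) / len(results)
    return successes / len(results) * 100.0, mean_steps


rng = random.Random(0)  # fixed seed so the sampled offsets are reproducible
offsets = [sample_initial_offset(rng) for _ in range(20)]  # 20 trials per task
```

On a real setup, each sampled offset would be applied to the target pose before rolling out the policy, and the resulting `(success, steps)` pairs would be passed to `summarize_trials`.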

TABLE 1. Performance on the two real assembly tasks.

FIGURE 11. Distance errors, pitch angle errors, and force/torque data of a USB insertion (left) and a plug insertion (right) using their initial/learned policies. The error values have been mapped to a range of [0, 1] and the force/torque values have been mapped to a range of [−1, 1].

### 5.4 Ablation studies

In this part, we executed three ablation studies to investigate how different hyperparameters and strategies affected the performance of the proposed framework. We ran each ablation study on the A(2) task following the settings in the Evaluation on simulated environment section.

##### 5.4.1 Effects of the dynamical movement primitive components

In the Adaptive action of adaptive dynamical movement primitives section, we briefly investigated how Ct and ω of the DMPs affect the generalized trajectory and concluded that Ct benefits the local exploration while ω benefits the global change of the trajectory. Because we assume that both the local exploration and the global change are necessary to efficiently learn the new trajectory, we tune both Ct and ω of the DMPs during the learning process.

In this part, we investigated whether such a choice indeed improved the learning performance. We compared the learning performance of four choices: 1) ADMP (tuning both Ct and ω); 2) tuning Ct only; 3) tuning ω only; 4) vanilla DMP (tuning neither Ct nor ω), on the A(2), A(3), and A(4) tasks. The learning curves are shown in Figure 12. The result showed that ADMP achieved both fast and stable learning on new tasks. Although separately tuning Ct or ω could also perform well on some tasks, the outcome was task-dependent, which made these choices less general than ADMP. Also, the vanilla DMPs hardly took effect without parameter tuning through RL, which meant that the intrinsic compliance of the controller alone could not tackle new tasks effectively.
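To make the four choices concrete, the following is a minimal 1-DoF sketch. It assumes the standard discrete DMP formulation (a critically damped spring-damper transformation system, an exponentially decaying canonical variable, and Gaussian basis functions); the formulation used in the paper may differ in detail. The gains, the basis width, the variant labels, and the illustrative adjustments `delta_w` and `ct` are all hypothetical, and in the framework such adjustments would come from the RL agent rather than being hand-set.

```python
import numpy as np


def rollout_dmp(y0, goal, weights, coupling=0.0, n_steps=150, dt=0.01,
                tau=1.0, alpha=25.0, beta=6.25, alpha_x=3.0):
    """Integrate a 1-DoF discrete DMP step by step and return the visited positions."""
    weights = np.asarray(weights, dtype=float)
    # Per-step coupling values; a scalar is broadcast to every step.
    coupling = np.broadcast_to(np.asarray(coupling, dtype=float), (n_steps,))
    # Gaussian basis functions placed along the canonical variable x in (0, 1].
    centers = np.exp(-alpha_x * np.linspace(0.0, 1.0, weights.size))
    width = float(weights.size) ** 2  # single shared width (simplifying assumption)
    y, z, x = float(y0), 0.0, 1.0
    path = [y]
    for k in range(n_steps):
        psi = np.exp(-width * (x - centers) ** 2)
        # Forcing term: weighted basis mix, scaled by x so it vanishes over time.
        forcing = ((psi @ weights) / (psi.sum() + 1e-10)) * x * (goal - y0)
        z += dt * (alpha * (beta * (goal - y) - z) + forcing + coupling[k]) / tau
        y += dt * z / tau
        x += dt * (-alpha_x * x) / tau
        path.append(y)
    return np.array(path)


# (tune_ct, tune_w) flags for the four compared choices.
VARIANTS = {
    "ADMP": (True, True),
    "Ct only": (True, False),
    "omega only": (False, True),
    "vanilla DMP": (False, False),
}


def rollout_variant(name, base_weights, delta_w, ct):
    """Apply only the adjustments that the chosen variant is allowed to tune."""
    tune_ct, tune_w = VARIANTS[name]
    weights = base_weights + (delta_w if tune_w else 0.0)
    return rollout_dmp(0.0, 1.0, weights, coupling=ct if tune_ct else 0.0)


base_weights = np.zeros(10)       # stands in for weights fitted to a demonstration
delta_w = np.full(10, 40.0)       # illustrative global change of the forcing term
ct = np.zeros(150)
ct[40:60] = 20.0                  # illustrative short, local coupling pulse
paths = {name: rollout_variant(name, base_weights, delta_w, ct) for name in VARIANTS}
```

In this sketch the coupling pulse perturbs the trajectory only around the steps where it is active, while the weight change reshapes the whole transient, which mirrors the local versus global roles attributed to Ct and ω in the text.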

FIGURE 12. Effects of the DMP components on the learning performance of A(2) (left), A(3) (middle), and A(4) (right) tasks.

##### 5.4.2 Effects of the number of demonstrated trajectories

In our framework, the skill policy plays an important role in generating the nominal trajectory, whose quality affects the learning performance. Therefore, we investigated how the number of demonstrated trajectories used to train the skill policy would affect the learning results. We tested three numbers, n = 1, 5, 10, and plotted the results as shown in Figure 13.

FIGURE 13. Effects of the number of demonstrated trajectories on the learning performance of A(2) (left), A(3) (middle), and A(4) (right) tasks.

When there was only a single trajectory, the learning performance was poor because it was difficult for the skill policy trained with limited data to handle unseen states during the learning process. However, when there were 10 trajectories, the high redundancy of the larger dataset confused the skill policy, and the performance became unstable. Therefore, we chose 5 as the optimal number of demonstrated trajectories, and all the results in the Evaluation on simulated environment and Experiments on a real robot sections were produced using this number.

##### 5.4.3 Effects of the modular learning strategy

As mentioned in the Modular learning strategy section, we utilized a modular learning strategy for the learning of ADMP parameters, assuming the curse of dimensionality would lower the performance of RL. Table 2 shows the number of parameters to tune in the learning process. First, following the parameter selection of Wang et al. (2021), we used six parameters for the position/orientation command, one $K_p^p$ parameter for the PD control, one $K_p^f$ parameter for the PI control, and six parameters for the selection matrix, S. Then, we assigned six parameters each to the coupling terms, Ct, and the forcing term weights, ω, which were used to adjust the trajectory in the six DoFs. Therefore, there were, in total, 26 parameters for different functional components involved in the learning process. Under the modular strategy in Algorithm 1, the number of parameters was reduced to 14 by fixing the 12 DMP parameters when a promising trajectory was found, which was assumed to be more robust than tuning all the 26 parameters simultaneously.
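The parameter counts above can be checked with a short sketch. The dictionary keys are hypothetical labels for the components named in the text, and the function only reproduces the bookkeeping of which parameters remain tunable once the DMP parameters are fixed; it does not implement Algorithm 1 or its criterion for a promising trajectory.

```python
# Controller-side parameters described in the text (labels are hypothetical).
CONTROL_PARAMS = {
    "pose_command": 6,      # position/orientation command
    "Kp_p": 1,              # PD control gain parameter
    "Kp_f": 1,              # PI control gain parameter
    "selection_matrix": 6,  # selection matrix S
}

# DMP-side parameters: one value per DoF for each component.
DMP_PARAMS = {
    "coupling_terms": 6,    # Ct
    "forcing_weights": 6,   # omega
}


def active_parameters(dmp_fixed: bool) -> dict:
    """Return the parameter groups that are still tuned by RL."""
    active = dict(CONTROL_PARAMS)
    if not dmp_fixed:
        active.update(DMP_PARAMS)
    return active


n_all = sum(active_parameters(dmp_fixed=False).values())      # all components tuned
n_modular = sum(active_parameters(dmp_fixed=True).values())   # DMP parameters frozen
print(f"all parameters: {n_all}, after fixing the DMP parameters: {n_modular}")
```

Freezing the 12 DMP parameters shrinks the search space that RL has to explore in the later phase, which is the motivation the text gives for preferring the modular strategy.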

TABLE 2. Action space of the learning process.

To verify this assumption, we compared the modular learning with the end-to-end (E2E) learning as shown in Figure 14. We used 10 demonstrated trajectories for each task in this comparison. From the results, we found that it was hard for the E2E learning to converge, while the modular learning learned relatively faster. This indicated that modular learning was more suitable for our framework than E2E learning when there were large numbers of parameters with different functions to tune.

FIGURE 14. Effects of the modular learning and the end-to-end (E2E) learning on the learning performance of A(2) (left), A(3) (middle), and A(4) (right) tasks.

## 6 Conclusion

In this work, we propose an adaptive robotic imitation framework for the hybrid trajectory and force learning of complex contact-rich insertion tasks. The framework is composed of learning the nominal trajectory through a combination of IL and RL, and learning the force control policy through an RL-based force controller. We highlight the use of the adaptive DMPs (ADMP), where the coupling terms and the weights of forcing terms in the DMP formulation are learned through RL to effectively adapt the trajectory profile of a single task to new tasks with topologically similar trajectories, which reduces the burden of repeated human demonstrations.

The experimental results show that the proposed framework is about as sample-efficient as a framework using explicitly demonstrated trajectories, generalizes well among different instances in a task class, and meets the safety requirement by lowering the chance of collision during the training sessions compared with the model-free RL approach. Moreover, the ablation studies show that a proper number of demonstrated trajectories and the modular learning strategy play vital roles in the proposed framework, as they affect the speed and the stability of the learning process.

From the experimental results on the real hardware, we also found that the topological similarity of trajectories could affect the learning speed. Therefore, it may improve the efficacy of adapting the DMP parameters if we can represent new trajectories topologically close to the previous ones, and it remains an interesting issue for our future research.

## Data Availability Statement

The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

## Author Contributions

YW formulated the methodology. YW and CB-H provided the software. YW performed the investigation. YW wrote the original draft. YW and CB-H reviewed and edited the manuscript. WW and KH supervised the study, and KH was in charge of the project administration. KH acquired the funding.

## Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

## Acknowledgments

The first author would like to acknowledge the financial support from the China Scholarship Council Postgraduate Scholarship Grant 201806120019.