In the previous sections, we reviewed models that require a large amount of data, either to learn a task directly or to train more generic representations (which allow some tasks to be tackled with less data). However, there is not always enough data to train a model, even from generic representations (by using or adapting them); this is the case for pathological speech [19]. Google is trying to acquire more data of that nature (see footnote 1), but acquiring such data can be expensive and time-consuming. Mustafa et al. recommend the use of adaptive techniques to tackle the limited amount of data in such cases [22]. We think few-shot techniques can be another solution to this problem. Nevertheless, some uncommon tasks, such as pathological speech or dialect identification with few examples, are still hard to train with SOTA techniques based on large speech datasets. This is why we investigate the following few-shot techniques and examine the adaptations required to use them on speech datasets.

Few-shot notations

Let us consider a distribution P from which we draw independent and identically distributed (iid) episodes \(\mathcal{E}\) (also called datasets). An episode \(\mathcal{E}\) is composed of a support set \(\mathcal{S}\), unlabeled data \(\bar{\mathbf{x}}\), and a query set \(\mathcal{Q}\). The support set corresponds to the supervised samples to which the model has access:

$$\mathcal{S} = \{(x_{1}, y_{1}), \ldots, (x_{s}, y_{s}) \},$$


with \(x_{i}\) being samples and \(y_{i}\) being the corresponding labels, such that \(y_{i} \in \{1, 2, \ldots, K\}\), and K being the number of classes appearing in P. The query set is composed of samples to classify \(\hat{\mathbf{x}}\), with \(\hat{\mathbf{y}}\) being the corresponding ground truth.

To summarize, episodes drawn from P have the following form:

$$\begin{aligned} \mathcal{E} = \{ & \mathcal{S} = \{(x_{1}, y_{1}), \ldots, (x_{s}, y_{s}) \},\\ & \bar{\mathbf{x}} = (\bar{x}_{1}, \ldots, \bar{x}_{r}),\\ & \mathcal{Q} = \{(\hat{x}_{1}, \hat{y}_{1}), \ldots, (\hat{x}_{t}, \hat{y}_{t})\} \} \end{aligned}$$


with s, r, and t fixed values that respectively represent the number of supervised samples for the support set, the number of unsupervised samples, and the number of supervised samples for the query set.

In this survey, we will focus on few-shot learning techniques where \(r = 0\), \(t \geq 1\) and \(s = kn\), with n being the number of times each label appears in the support set and k the number of classes selected from P, such that \(k \leq K\). Hence, each episode is an n-shot, k-way (k classes) problem. One-shot learning is just the special case of few-shot learning where \(n = 1\). In some few-shot frameworks, we sample only one episode from P, and it represents our task.
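To make the notation concrete, the following minimal sketch draws one k-way, n-shot episode with r = 0. The function name `sample_episode`, the dictionary layout of the data, and the toy string "utterances" are assumptions made for illustration only; they do not come from any of the reviewed papers.

```python
import random

def sample_episode(data_by_class, k, n, t, rng):
    """Draw one k-way, n-shot episode with t queries and no unlabeled data (r = 0)."""
    classes = rng.sample(sorted(data_by_class), k)  # k classes among the K available
    support, query_pool = [], []
    for label in classes:
        examples = list(data_by_class[label])
        rng.shuffle(examples)
        support += [(x, label) for x in examples[:n]]     # n labeled shots per class
        query_pool += [(x, label) for x in examples[n:]]  # the rest can serve as queries
    query = rng.sample(query_pool, t)                     # t >= 1 samples to classify
    return support, query

# Hypothetical toy data: K = 4 classes with 6 placeholder "utterances" each.
toy_data = {label: [f"{label}_{i}" for i in range(6)] for label in "abcd"}
support, query = sample_episode(toy_data, k=3, n=2, t=5, rng=random.Random(0))
print(len(support), len(query))  # s = k * n = 6 and t = 5
```

In this sketch the support and query samples of an episode never overlap, since queries are drawn only from the examples left over after the n shots are taken.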

Few-shot learning techniques

In this section, we will review frameworks that impacted the few-shot learning field in image processing, frameworks with a formulation that seems suitable for speech processing, and frameworks already successfully used by the speech community.

Siamese technique

Siamese neural networks are designed to be used per episode [50]. They measure the distance between two samples and judge whether or not they are similar. Hence, a siamese network uses the samples from the support set \(\mathcal{S}\) as references for each class. It is then trained using all the combinations of samples from \(\mathcal{S} \cup \mathcal{Q}\), which provides many more training pairs than the \(s + t\) samples available to classical feedforward frameworks. Siamese networks take two samples (\(x_{1}\) and \(x_{2}\)) as input and compute a distance between them, as follows:

$$\phi(x_{1}, x_{2}) = \sigma\left(\sum \boldsymbol{\alpha} \left|Enc(x_{1}) - Enc(x_{2})\right|\right),$$


with Enc being a DNN encoder that represents the input signal, \(\sigma\) being the sigmoid function, \(\boldsymbol{\alpha}\) being learnable parameters that weight the importance of each component of the encoder output, and \(x_{1}\) and \(x_{2}\) being sampled from either the support set or the query set.
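As an illustration of the formula for \(\phi\), here is a minimal sketch that assumes the two encodings \(Enc(x_{1})\) and \(Enc(x_{2})\) have already been computed and are given as plain lists of floats. The function name `siamese_score` and the numeric values are hypothetical.

```python
import math

def siamese_score(enc_1, enc_2, alpha):
    """phi = sigmoid(sum_d alpha_d * |enc_1[d] - enc_2[d]|)."""
    weighted = sum(a * abs(u - v) for a, u, v in zip(alpha, enc_1, enc_2))
    return 1.0 / (1.0 + math.exp(-weighted))

# Identical encodings give a weighted sum of zero, hence sigmoid(0) = 0.5.
same = siamese_score([1.0, 2.0], [1.0, 2.0], alpha=[0.5, -1.0])

# Hypothetical learned weights alpha = [-1, -1]: the weighted sum is -2.
different = siamese_score([0.0, 0.0], [1.0, 1.0], alpha=[-1.0, -1.0])
print(same, different)
```

The sign and scale of \(\boldsymbol{\alpha}\) are learned, so how the score varies with the component-wise differences depends on training; the values above are only placeholders.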

To define the class of a new sample from \(\mathcal{Q}\) or of any new data, we must compute the distance between each reference from \(\mathcal{S}\) and the new sample. An example of comparison between a reference and a new example is shown in Fig. 2. The class of the reference with the lowest distance then becomes the prediction of the model. To train such a model, [50] used this loss function:

$$\mathcal{L} = \mathbb{E}_{y(x_{i}) = y(\tilde{x}_{j})} \log(\phi(x_{i}, \tilde{x}_{j})) + \mathbb{E}_{y(x_{i}) \neq y(\tilde{x}_{j})} \log(1 - \phi(x_{i}, \tilde{x}_{j}))$$

with \(\tilde{\mathbf{x}} = [x_{1}, \ldots, x_{s}, \hat{x}_{1}, \ldots, \hat{x}_{t}]\) from \(\mathcal{S}\) and \(\mathcal{Q}\), and \(y(x)\) being a function that returns the label corresponding to the example x. The last layer of \(\phi\) should be a sigmoid function, as in the definition of \(\phi\) above.

Fig. 2 Example of comparison between a reference (\(x_{i}\)) and a new example (\(\hat{x}_{j}\)) from the query set, where Enc is the same network applied to both \(x_{i}\) and \(\hat{x}_{j}\). The model outputs the distance between \(x_{i}\) and \(\hat{x}_{j}\)

Eloff et al. used a modified version of this framework for multimodal learning, with speech and image signals as the modalities [51], but to our knowledge, there is no study yet that applies it to speech processing alone. The speech signals consist of 11 spoken digit words (zero to nine, plus the "oh" used in phone numbers), with 10 corresponding images ("oh" and zero map to the same image). The problem is to associate speech signals with the corresponding images. In their experiment, the model shows some invariance to speakers (accuracy of 70.12% ± 0.68) using only a one-shot configuration, which is a promising result.

Siamese neural networks are not very suitable when the number of classes K or the number of shots n becomes too high, since this increases the number of references to be compared and the computation time of a forward pass. The primary problem concerns training the model: once the model has been trained, we can reduce this effect at prediction time by pre-calculating the encodings of all examples of the support set. A high K or n also dramatically increases the number of combinations available for the training phase, which can be viewed as a positive point. This framework does not seem appropriate for end-to-end ASR with large vocabularies, such as English (around 470,000 words), though it may be sufficient for languages such as Esperanto (around 16,780 words). Another way to use such a framework in ASR systems is as the acoustic model of a hybrid system, where we can train it on every phoneme (for example, 44 phonemes/sounds in English) or on more refined sound units.

The siamese framework seems interesting for tasks such as speaker identification, as a new speaker can be added without retraining the model (supposing the model has generalized) or changing its architecture. We only have to add at least one sample of the new speaker to the references. Furthermore, the siamese formulation seems well adapted to speaker verification. We only need to replace the pair (x, speaker_id) by the pair \((\mathbf{x}, \mathcal{S}_{top5})\), where \(\mathcal{S}_{top5}\) is a support set composed of signals from the top 5 predictions of the identification sub-task.

Nevertheless, this framework will be of limited use if the number of speakers to identify becomes too high. Even so, it is possible to use such techniques in an end-to-end ASR system when the vocabulary is limited, as in the experiment described in [51]. This framework was also used in emotion recognition [52]. In their experiments, the authors applied their approach to the IEMOCAP dataset [47] using a 3-way task (which differs from all other papers reviewed in this work, which use 4 classes). Nevertheless, they managed to obtain an unweighted average recall of 67.4% using a 10-shot configuration, which is an encouraging result.

Matching network

The matching networks system described in [53] is a few-shot framework designed to be trained with a set of multiple episodes (typically 5-way to 25-way). It consists of a single model \(\varphi\), which evaluates new examples given the support set \(\mathcal{S}\), as in the siamese framework:

$$\varphi(\hat{x}, \mathcal{S}) \rightarrow \hat{y}.$$


In matching learning, \(\varphi\) is as follows:

$$\varphi(\hat{x}, \mathcal{S}) = \sum\limits_{(x_{i}, y_{i}) \in \mathcal{S}} a(\hat{x}, x_{i}) y_{i},$$


with a being the attention kernel.

In [53], this attention kernel is as follows:

$$a(\hat{x}, x_{i}) = \text{softmax}(c(f(\hat{x}), g(x_{i}))),$$


where c is the cosine distance, and f and g are embedding functions.
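The two formulas above can be sketched as follows, assuming the embeddings \(f(\hat{x})\) and \(g(x_{i})\) are already computed and non-zero. In this sketch c is implemented as the cosine similarity, so that closer embeddings receive larger attention weights; this choice, the 0-based integer labels, and the names `cosine` and `matching_predict` are assumptions of the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def matching_predict(query_emb, support_embs, support_labels, num_classes):
    """Attention-weighted sum of one-hot support labels."""
    sims = [cosine(query_emb, g) for g in support_embs]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    attention = [e / total for e in exps]  # softmax over the support set
    probs = [0.0] * num_classes
    for a, y in zip(attention, support_labels):
        probs[y] += a                      # y_i acts as a one-hot vector
    return probs

# Toy 2-way, 1-shot episode with hand-picked embeddings.
probs = matching_predict([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0, 1], num_classes=2)
print(probs)
```

The output is a probability vector over the classes of the episode, and the predicted label is its arg max.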

Vinyals et al. used a recurrent architecture to modulate the representation of f using the support set \(\mathcal{S}\) [53]. The goal is to have f follow the same type of representation as g. To do this, the g function is as follows:

$$g(x_{i}) = \overrightarrow{h_{i}} + \overleftarrow{h_{i}} + g'(x_{i}),$$


where \(\overrightarrow{h_{i}}\) and \(\overleftarrow{h_{i}}\) represent a bi-LSTM output over \(g'(x_{i})\), with \(g'\) being a DNN.

The f function is as follows:

$$f(\hat{x}) = attLSTM(f'(\hat{x}), g(\mathcal{S}), m),$$


with attLSTM being an LSTM with a fixed number of recurrences (here m) and \(g(\mathcal{S})\) representing the application of g to each \(x_{i}\) from the set \(\mathcal{S}\). \(f'\) is a DNN with the same architecture as \(g'\), but not necessarily sharing the parameter values.

Training this framework therefore consists in the maximization of the log likelihood of φ given the parameters of g and f.

Figure 3 illustrates the forward pass of the matching network model. When predicting on new samples, \(g(\mathcal{S})\) can be pre-calculated to save computation time. Nevertheless, matching networks have the same disadvantages as siamese networks when n and/or K become too high. Furthermore, adding new classes to a trained matching network model is not as easy as for siamese network models, as adding an element to the support set requires retraining the matching network model. Despite these disadvantages, matching learning showed better results than the siamese framework on image datasets [53]. This is why it should be investigated in speech processing to see if this is still the case.

Fig. 3 Illustration of the matching network model to predict the class of a new example \(\hat{x}_{i}\)

Prototypical networks

Prototypical networks [54] are designed to work with multiple episodes. In the prototypical framework, the model \(\varphi\) makes its predictions given the support set \(\mathcal{S}\) of an episode, as in the previously seen frameworks. This framework uses training episodes as mini-batches to obtain the final model. This model is formulated as follows:

$$\varphi(\hat{x}, \mathcal{S}) = \text{softmax}_{k}(-d(f(\hat{x}), \mathbf{c}_{k})),$$


where \(\mathbf{c}_{k}\) is the prototype of the class k and d is a Bregman divergence (chosen for its useful properties in optimization; see [54] for more details), with \(d: \mathbb{R}^{M} \times \mathbb{R}^{M} \rightarrow [0, +\infty)\).

Snell et al. used the Euclidean distance for d instead of the cosine distance used in meta learning and matching learning papers [54]. As a result, they obtain better results in their experiments. Next, they go further by reducing the Euclidean distance to a linear function.

In the prototypical framework, there is only one prototype for each class k as illustrated in Fig. 4. It is computed as follows:

$$\mathbf{c}_{k} = \frac{1}{|\mathcal{S}_{k}|} \sum\limits_{(x_{i}, y_{i}) \in \mathcal{S}_{k}} f(x_{i}),$$

with f being a mapping function such that \(f: \mathbb{R}^{D} \rightarrow \mathbb{R}^{M}\) and \(\mathcal{S}_{k}\) being the samples of the support set with label k.

Fig. 4 Illustration of the prototypical network model to predict the class of a new example \(\hat{x}_{i}\)

Prototypical networks require only one comparison per class, and not n per class for n-shot learning as in siamese and matching learning networks. That is why this framework is less subject to the high computation problem when predicting new samples, as it is only influenced by a high K. It will certainly be insufficient for end-to-end ASR systems on the English language due to the large vocabulary issues described in Section 4.2.1, but it is a step towards it.
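The prototype computation and the single comparison per class can be sketched as follows. The sketch assumes the embeddings \(f(x_{i})\) are already computed and uses the squared Euclidean distance for d (an assumption of this example); the names `compute_prototypes` and `proto_predict` and the toy values are hypothetical.

```python
import math

def compute_prototypes(support_embs, support_labels):
    """Average the embedded support samples of each class (one prototype per class)."""
    sums, counts = {}, {}
    for emb, y in zip(support_embs, support_labels):
        if y not in sums:
            sums[y], counts[y] = [0.0] * len(emb), 0
        sums[y] = [s + e for s, e in zip(sums[y], emb)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def proto_predict(query_emb, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    neg_dist = {y: -sum((q - c) ** 2 for q, c in zip(query_emb, proto))
                for y, proto in protos.items()}
    top = max(neg_dist.values())  # subtracted for numerical stability
    exps = {y: math.exp(v - top) for y, v in neg_dist.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

# Toy 2-way, 2-shot episode: one comparison per class, whatever the number of shots.
protos = compute_prototypes([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]],
                            [0, 0, 1, 1])
probs = proto_predict([1.0, 1.0], protos)
print(protos, probs)
```

Increasing the number of shots only changes the cost of `compute_prototypes`, which can be done once per episode; the prediction loop still iterates over K prototypes.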

In speaker recognition, prototypical networks were used on a portion of VoxCeleb1 [25] by [55]. In a 20-way, 5-shot configuration, they obtained an accuracy of 72.77, which is a promising result.


Meta-learning

Meta-learning systems [56] are designed to be trained over multiple episodes (also called datasets). In this framework, a trainee model (\(\mathcal{T}\)) with parameters \(\theta^{\mathcal{T}}\), which usually has a classic DNN architecture, is trained from the start of every episode. The support set and the query set of an episode are considered as the training set and the test set for the trainee model.

Along with this trainee model, a second model is trained: the meta model (\(\mathcal{M}\)) with parameters \(\theta^{\mathcal{M}}\). This meta model is the key to meta-learning: it monitors the trainee model by updating the parameters \(\theta^{\mathcal{T}}\). To train this meta model, Ravi and Larochelle suggest sampling iid episodes from P to form the meta-dataset (\(\mathcal{D}\)) [56]. This meta-dataset is composed of a training set (\(\mathcal{D}_{train}\)), a validation set (\(\mathcal{D}_{valid}\)), and a testing set (\(\mathcal{D}_{test}\)).

While the trainee model is training on an episode \(\mathcal{E}_{j}\), the meta model is used to update its parameters:

$$\theta^{\mathcal{T}_{j}}_{t} = \mathcal{M}(\theta^{\mathcal{T}_{j}}_{t-1}, \mathcal{L}^{\mathcal{T}_{j}}, \nabla_{\theta^{\mathcal{T}_{j}}_{t-1}}\mathcal{L}^{\mathcal{T}_{j}}),$$


with \(\mathcal{L}^{\mathcal{T}_{j}}\) being the loss function of the trainee model learned with the episode \(\mathcal{E}_{j}\) and \(\theta_{t-1}^{\mathcal{T}_{j}}\) being the parameters of the trainee model at step \(t-1\). Also, \(\mathcal{M}\) has to guess the initial weights of the trainee models at step \(t = 0\) (\(\theta^{\mathcal{T}_{j}}_{0}\)).

The learning curve (loss) of the trainee model with \(\mathcal{E}_{j}\) is viewed in [56] as a sequence that can be the input of the meta model \(\mathcal{M}\). For simplicity, we will use the notation \(\mathcal{T}\) instead of \(\mathcal{T}_{j}\) for the next few paragraphs. Figure 5 illustrates the learning steps of the trainee using the meta model.

Fig. 5 Illustration of meta-learning for training with episode \(\mathcal{E}_{j}\) at step t. Here the meta model \(\mathcal{M}\) processes the different training steps of the trainee \(\mathcal{T}\) as a sequence

Trainee parameters update

Ravi and Larochelle identify the learning process of \(\mathcal{T}\), using a classic feedforward update on the episode \(\mathcal{E}_{j}\), as similar to the update of the cell state \(c_{t}\) in the LSTM framework [56]. In the meta-learning framework, \(c_{t}\) is used as the estimator of \(\theta^{\mathcal{T}}_{t}\), as follows:

$$\theta_{t}^{\mathcal{T}} = f_{t} \odot \theta_{t-1}^{\mathcal{T}} + i_{t} \odot \tilde{\theta}_{t}^{\mathcal{T}},$$


with \(\tilde{\theta}_{t}^{\mathcal{T}} = -\alpha_{t} \nabla_{\theta_{t-1}^{\mathcal{T}}} \mathcal{L}_{t}^{\mathcal{T}}\) being the update term of the parameters \(\theta_{t-1}^{\mathcal{T}}\), \(f_{t}\) being the forget gate, and \(i_{t}\) the update gate.
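A minimal element-wise sketch of this update is given below, with the gates supplied as plain lists rather than produced by the meta model. The function name `meta_update` and the numeric values are hypothetical.

```python
def meta_update(theta_prev, grad, forget_gate, update_gate, alpha):
    """theta_t = f_t * theta_{t-1} + i_t * (-alpha_t * grad), element by element."""
    return [f * th + i * (-alpha * g)
            for th, g, f, i in zip(theta_prev, grad, forget_gate, update_gate)]

# With f_t = 1 and i_t = 1, the rule reduces to a plain gradient descent step.
new_theta = meta_update(theta_prev=[1.0, -2.0], grad=[0.5, 1.0],
                        forget_gate=[1.0, 1.0], update_gate=[1.0, 1.0], alpha=0.1)
print(new_theta)  # approximately [0.95, -2.1]
```

Setting a forget gate entry to 0 discards the previous value of the corresponding parameter, which is what allows the trainee to restart part of its training.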

Parameters of the meta model

Both \(i_{t}\) and \(f_{t}\) are part of the meta learner. In the meta-learning framework, the update gate is formulated as follows:

$$i_{t} = \sigma(\mathbf{W}_{I} \cdot [\nabla_{\theta_{t-1}^{\mathcal{T}}}\mathcal{L}_{t}^{\mathcal{T}}, \mathcal{L}_{t}^{\mathcal{T}}, \theta_{t-1}^{\mathcal{T}}, i_{t-1}] + \mathbf{b}_{I}),$$


with \(\mathbf{W}_{I}\) and \(\mathbf{b}_{I}\) being parameters of \(\mathcal{M}\). The update gate is used to control the update term in equation 14 (the trainee parameter update given above), like the learning rate in the classic feedforward approach.

Next, the forget gate in the meta-learning framework is formulated as follows:

$$f_{t} = \sigma(\mathbf{W}_{F} \cdot [\nabla_{\theta_{t-1}^{\mathcal{T}}}\mathcal{L}_{t}^{\mathcal{T}}, \mathcal{L}_{t}^{\mathcal{T}}, \theta_{t-1}^{\mathcal{T}}, f_{t-1}] + \mathbf{b}_{F}),$$


with \(\mathbf{W}_{F}\) and \(\mathbf{b}_{F}\) being parameters of \(\mathcal{M}\).

This gate is here to decide whether the training of the trainee should restart or not. This can be useful to avoid the problem of a sub-optimal local minimum. Note that this gate is not present in classic feedforward approaches (where this gate is equal to one).
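Both gates share the same form, so a single helper can sketch them. The example below treats one parameter coordinate at a time, so the gradient, loss, previous parameter, and previous gate value are scalars; the function name `gate` and all weight and bias values are hypothetical.

```python
import math

def gate(weights, bias, grad, loss, theta_prev, gate_prev):
    """sigma(W . [grad, loss, theta_prev, gate_prev] + b) for one parameter coordinate."""
    features = [grad, loss, theta_prev, gate_prev]
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Zero weights and zero bias give an uninformative gate of 0.5.
neutral = gate([0.0, 0.0, 0.0, 0.0], 0.0, grad=0.3, loss=1.2, theta_prev=0.7, gate_prev=0.5)

# A large positive bias (hypothetical value) keeps the forget gate close to one,
# which corresponds to the classic feedforward case described above.
f_t = gate([0.0, 0.0, 0.0, 0.0], 10.0, grad=0.3, loss=1.2, theta_prev=0.7, gate_prev=1.0)
print(neutral, f_t)
```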

The trainee model (\(\mathcal{T}\)) of this framework can be any kind of model, such as a siamese neural network, and can therefore have the advantages of that framework. It can also avoid the disadvantages of the siamese neural network by using any other framework (usually a classic DNN). This framework is interesting for training efficient models for speech processing (in terms of learning speed) when we have multiple ASR tasks with different vocabularies. For example, suppose we have the following kinds of speech episodes: dialing numbers, commands to a robot A, and commands to a robot B. The model can then initialize good filters for the first layers (as all episodes still involve speech processing). Another example could be training acoustic models for multiple languages (with each episode corresponding to a language).

Graph neural network

Graph neural networks (GNNs) are used by Garcia and Bruna to introduce their few-shot framework [57]. This framework is designed to be used with multiple episodes, which they call tasks. In this framework, one model is used over a complete graph \(G = (V, E)\), in which every node corresponds to an example. For few-shot learning, a GNN consists in applying graph convolution layers over the graph G.

Initial vertices to guess the ground truth of a query \(\tilde{x}_{i}\) from the query set \(\mathcal{Q}\) are constructed as follows:

$$\begin{aligned} V^{(0)} = (& (Enc(x_{1}), h(y_{1})), \ldots, (Enc(x_{s}), h(y_{s})), \\ & (Enc(\bar{x}_{1}), u), \ldots, (Enc(\bar{x}_{r}), u), \\ & (Enc(\tilde{x}_{i}), u)) \end{aligned},$$


where Enc is an embedding extraction function (a neural network or any classic feature extraction technique), h is the one-hot encoding function, and \(u = K^{-1}\mathbf{1}_{K}\) is a uniform distribution for examples with unknown labels (the unsupervised ones from \(\bar{\mathbf{x}}\) and/or from the query set \(\mathcal{Q}\)).

The vertices at each layer l (with 0 being the initial vertices) will henceforth be denoted:

$$V^{(l)} = (v_{1}, \ldots, v_{n}),$$


where \(n = s + r + 1\) and \(V^{(l)} \in \mathbb{R}^{n \times d_{l}}\).

Every layer (with an illustration of a layer in Fig. 6) in a GNN is computed as follows:

$$V^{(l+1)} = Gc(V^{(l)}, A^{(l)}),$$

with \(A^{(l)}\) being the adjacency operators constructed from \(V^{(l)}\) and Gc being the graph convolution.

Fig. 6 Illustration of the input of the first layer (here a graph convolution) of a GNN. Here, we have three samples (represented by vertices \(v_{i}\), \(v_{j}\) and \(v_{k}\)) in the support set and one query (represented by the vertex \(v_{u}\))

Construction of the adjacency operators

The adjacency operators form a set:

$$A^{(l)} = \{\tilde{A}^{(l)}, \mathbf{1}\},$$


with \(\tilde{A}^{(l)}\) being the adjacency matrix of \(V^{(l)}\).

For every \((i, j) \in E\) (remember that we have complete graphs), we compute the values of the adjacency matrix as follows:

$$\tilde{A}^{(l)}_{i, j} = \phi(v_{i}^{(l)}, v_{j}^{(l)}),$$



$$\phi(v_{i}^{(l)}, v_{j}^{(l)}) = f(|v_{i}^{(l)} - v_{j}^{(l)}|),$$


with f being a multi-layer perceptron whose parameters are denoted \(\theta_{f}\). \(\tilde{A}^{(l)}\) is then normalized using the softmax function over each row.
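The construction of \(\tilde{A}^{(l)}\) can be sketched as follows. The learned multi-layer perceptron f is replaced here by a hypothetical stand-in, `toy_mlp`, which simply returns the negative sum of the absolute differences; the function names and vertex values are assumptions of this example.

```python
import math

def adjacency(vertices, mlp):
    """Row-wise softmax of phi(v_i, v_j) = mlp(|v_i - v_j|) over a complete graph."""
    rows = []
    for v_i in vertices:
        scores = [mlp([abs(a - b) for a, b in zip(v_i, v_j)]) for v_j in vertices]
        top = max(scores)  # subtracted for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def toy_mlp(diff):
    # Hypothetical stand-in for the learned MLP f: close vertices get higher scores.
    return -sum(diff)

A_tilde = adjacency([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]], toy_mlp)
print(A_tilde[0])
```

With this stand-in, each row of the matrix sums to one and gives more weight to the vertices closest to the row's vertex; a trained f may of course learn a different notion of closeness.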

Graph convolution

The graph convolution requires the construction of the adjacency operators set and is computed as follows:

$$Gc(V^{(l)}, A^{(l)}) = \rho\left(\sum\limits_{B \in A^{(l)}} B V^{(l)} \theta^{(k)}_{B, l}\right),$$


with B being an adjacency operator from \(A^{(l)}\), \(\theta_{B, l}^{(k)} \in \mathbb{R}^{d_{l} \times d_{l+1}}\) being learnable parameters, and \(\rho\) being a pointwise non-linearity (usually leaky ReLU).
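One graph convolution layer can be sketched with plain nested lists as follows. The operator \(\mathbf{1}\) is interpreted here as the identity matrix (an assumption of this sketch), and the helper names `matmul`, `leaky_relu`, and `graph_conv`, as well as all numeric values, are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def graph_conv(vertices, operators, thetas):
    """rho(sum over B of B V theta_B), with rho a leaky ReLU."""
    n, d_out = len(vertices), len(thetas[0][0])
    total = [[0.0] * d_out for _ in range(n)]
    for B, theta in zip(operators, thetas):
        term = matmul(matmul(B, vertices), theta)
        total = [[t + u for t, u in zip(t_row, u_row)]
                 for t_row, u_row in zip(total, term)]
    return [[leaky_relu(x) for x in row] for row in total]

V = [[1.0, 2.0], [3.0, 4.0]]            # n = 2 vertices with d_l = 2 features
A_tilde = [[0.5, 0.5], [0.5, 0.5]]      # row-normalized adjacency matrix
identity = [[1.0, 0.0], [0.0, 1.0]]     # assumed meaning of the operator 1
theta_A = [[1.0], [0.0]]                # d_l x d_{l+1} = 2 x 1
theta_I = [[0.0], [-1.0]]
V_next = graph_conv(V, [A_tilde, identity], [theta_A, theta_I])
print(V_next)  # approximately [[0.0], [-0.02]]
```

The cost of each layer grows with the number of vertices n, since every operator B is an n by n matrix.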

Training the model

The output of the resulting GNN model is a mapping of the vertices to a K-simplex that gives the probability of \(\tilde{x}_{i}\) being in class k. Garcia and Bruna used the cross-entropy to train the model using all other samples in the query set \(\mathcal{Q}\) [57]. Hence, the GNN few-shot framework consists in learning the parameters \(\theta_{f}\) and \(\theta_{1,l}, \ldots, \theta_{card(A),l}\) over all episodes.

Few-shot GNN on audio

This framework was used by [58] on 5-way audio classification problems. The 5-way episodes are randomly selected from the initial dataset: AudioSet [59] for creating the 5-ways training episodes and TV program (from [60]) data to create the 5-ways test episodes.

Zhang et al. compare the use of per-class (or intra-class) attention and global attention, which gave the best results [58]. They applied it at each layer. Their experiments were performed for 1-shot, 5-shot, and 10-shot configurations, with respective accuracies of 69.4 ± 0.66, 78.3 ± 0.46, and 83.6 ± 0.98. Such results encourage the use of few-shot learning for speech signals. Nevertheless, this framework does not scale well to many classes and shots per episode, since these increase the number of nodes and thus the computation of a forward pass. Hence, it is not suitable for large vocabulary problems.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

