# Auxiliary function-based algorithm for blind extraction of a moving speaker – EURASIP Journal on Audio, Speech, and Music Processing

#### By Jakub Janský, Zbyněk Koldovský, Jiří Málek, Tomáš Kounovský and Jaroslav Čmejla

Jan 4, 2022

A static mixture of audio signals that propagate in an acoustic environment from point sources to microphones can be described by the time-invariant convolutive model. Let there be d sources observed by m microphones. The signal on the ith microphone is described by

$$x_{i}(n)=\sum_{j=1}^{d}\sum_{\tau=0}^{L-1} h_{ij}(\tau)s_{j}(n-\tau),\quad i=1,\dots,m,$$

(1)

where \(n\) is the sample index, \(s_{1}(n),\dots,s_{d}(n)\) are the original signals coming from the sources, and \(h_{ij}\) denotes the time-invariant impulse response of length \(L\) between the \(j\)th source and the \(i\)th microphone.
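The mixing model (1) can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function name `convolutive_mix`, the array layout, and the random test signals are assumptions made here, and the sources are taken to be zero for negative sample indices.

```python
import numpy as np

def convolutive_mix(s, h):
    """Mix d source signals into m microphone signals following Eq. (1).

    s: array of shape (d, n_samples), the source signals s_j(n)
    h: array of shape (m, d, L), the impulse responses h_ij(tau)
    Returns x of shape (m, n_samples).
    """
    m, d, L = h.shape
    n_samples = s.shape[1]
    x = np.zeros((m, n_samples))
    for i in range(m):
        for j in range(d):
            # Full linear convolution, truncated to the observation length;
            # this treats s_j(n) as zero for n < 0 (an assumption of this sketch).
            x[i] += np.convolve(h[i, j], s[j])[:n_samples]
    return x

# Illustrative sizes and random data (hypothetical values).
rng = np.random.default_rng(0)
d, m, L, n_samples = 2, 3, 4, 16
s = rng.standard_normal((d, n_samples))
h = rng.standard_normal((m, d, L))
x = convolutive_mix(s, h)
```

Each microphone signal is thus a sum of `d` filtered sources, which is exactly the double sum in (1).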

In the short-time Fourier transform (STFT) domain, convolution can be approximated by multiplication. Let \(x_{i}(k,\ell)\) and \(s_{j}(k,\ell)\) denote, respectively, the STFT coefficients of \(x_{i}(n)\) and \(s_{j}(n)\) at frequency \(k\) and frame \(\ell\). Then, (1) can be replaced by a set of \(K\) complex-valued linear instantaneous mixtures

$$\mathbf{x}_{k}=\mathbf{A}_{k} \mathbf{s}_{k}, \qquad k = 1,\dots,K,$$

(2)

where \(\mathbf{x}_{k}\) and \(\mathbf{s}_{k}\) are symbolic vectors representing, respectively, \([x_{1}(k,\ell),\dots,x_{m}(k,\ell)]^{T}\) and \([s_{1}(k,\ell),\dots,s_{d}(k,\ell)]^{T}\) for any frame \(\ell =1,\dots,N\); \(\mathbf{A}_{k}\) stands for the \(m\times d\) mixing matrix whose \(ij\)th element is related to the \(k\)th Fourier coefficient of the impulse response \(h_{ij}\); and \(K\) is the frequency resolution of the STFT. For detailed explanations, see, e.g., Chapters 1 through 3 in [3].

### Blind source extraction

For the BSE problem, we can write (2) in the form

$$\mathbf{x}_{k} = \mathbf{a}_{k} s_{k} + \mathbf{y}_{k},\qquad k=1,\dots,K,$$

(3)

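where \(s_{k}\) represents the source of interest (SOI), \(\mathbf{a}_{k}\) is the corresponding column of \(\mathbf{A}_{k}\), called the mixing vector, and \(\mathbf{y}_{k}\) represents the remaining signals in \(\mathbf{x}_{k}\), i.e., \(\mathbf{y}_{k}=\mathbf{x}_{k}-\mathbf{a}_{k}s_{k}\).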

Since there is the ambiguity that any of the original sources can play the role of the SOI, we can assume, without loss of generality, that the SOI corresponds to the first source in (2); hence, \(\mathbf{a}_{k}\) is the first column of \(\mathbf{A}_{k}\). The problem of guaranteeing the extraction of the desired SOI will be addressed in Section 3.3.

The assumption that the original signals in (2) are independent implies that \(s_{k}\) and \(\mathbf{y}_{k}\) are independent. We will also assume that \(m=d\), i.e., that the number of microphones equals the number of sources. It follows that the mixing matrices \(\mathbf{A}_{k}\) are square. Assuming also that they are non-singular (see Footnote 1), their inverses exist, which guarantees the existence of a separating vector \(\mathbf{w}_{k}\) (with \(\mathbf{w}_{k}^{H}\) being the first row of \(\mathbf{A}_{k}^{-1}\)) such that \(\mathbf{w}_{k}^{H}\mathbf{x}_{k}=s_{k}\). We pay for this advantage by the limitation that \(\mathbf{y}_{k}\) belongs to a subspace of dimension \(d-1\). In other words, the covariance of \(\mathbf{y}_{k}\) is assumed to have rank \(d-1\), as opposed to real recordings, where the typical rank is \(d\) (e.g., due to sensor and environment noise). Nevertheless, the assumption \(m=d\) brings more advantages than disadvantages, as shown in [10]. One way to compensate is to increase the number of microphones so that the ratio \(\frac{d-1}{d}\) approaches 1. BSE appears to be computationally more efficient than BSS when \(d\) is large since, in BSE, \(\mathbf{y}_{k}\) is not separated into individual signals.

In [13], the BSE problem is formulated by exploiting the fact that the \(d-1\) latent variables (background signals) involved in \(\mathbf{y}_{k}\) can be defined arbitrarily. An effective parameterization that involves only the mixing and separating vectors related to the SOI has been derived. Specifically, \(\mathbf{A}_{k}\) and \(\mathbf{A}_{k}^{-1}\) (denoted as \(\mathbf{W}_{k}\)) have the structure

$$\mathbf{A}_{k} = \left(\begin{array}{ll} \mathbf{a}_{k} & \mathbf{Q}_{k} \end{array}\right) = \left(\begin{array}{cc} \gamma_{k} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k} & \frac{1}{\gamma_{k}}(\mathbf{g}_{k}\mathbf{h}_{k}^{H}-\mathbf{I}_{d-1}) \end{array}\right),$$

(4)

and

$$\mathbf{W}_{k} = \left(\begin{array}{c} \mathbf{w}_{k}^{H}\\ \mathbf{B}_{k} \end{array}\right) = \left(\begin{array}{cc} \beta_{k}^{*} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k} & -\gamma_{k} \mathbf{I}_{d-1} \end{array}\right),$$

(5)

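where \(\mathbf{I}_{d}\) denotes the \(d\times d\) identity matrix, \(\mathbf{w}_{k}\) denotes the separating vector, which is partitioned as \(\mathbf{w}_{k}=[\beta_{k};\mathbf{h}_{k}]\), and the mixing vector \(\mathbf{a}_{k}\) is partitioned as \(\mathbf{a}_{k}=[\gamma_{k};\mathbf{g}_{k}]\). The vectors \(\mathbf{a}_{k}\) and \(\mathbf{w}_{k}\) are linked through the so-called distortionless constraint \(\mathbf{w}_{k}^{H}\mathbf{a}_{k} = 1\), which, equivalently, means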

$$\beta_{k}^{*}\gamma_{k} + \mathbf{h}_{k}^{H}\mathbf{g}_{k} = 1, \qquad k=1,\dots,K.$$

(6)

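\(\mathbf{B}_{k}=[\mathbf{g}_{k},\,-\gamma_{k}\mathbf{I}_{d-1}]\) is called the blocking matrix because it satisfies \(\mathbf{B}_{k}\mathbf{a}_{k}=\mathbf{0}\). The background signals are given by \(\mathbf{z}_{k}=\mathbf{B}_{k}\mathbf{x}_{k}=\mathbf{B}_{k}\mathbf{y}_{k}\), and it holds that \(\mathbf{y}_{k}=\mathbf{Q}_{k}\mathbf{z}_{k}\). To summarize, (2) is recast for the BSE problem as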

$$\mathbf{x}_{k} = \left(\begin{array}{cc} \gamma_{k} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k} & \frac{1}{\gamma_{k}}(\mathbf{g}_{k}\mathbf{h}_{k}^{H}-\mathbf{I}_{d-1}) \end{array}\right) \left(\begin{array}{c} s_{k}\\ \mathbf{z}_{k} \end{array}\right),\quad k=1,\dots,K.$$

(7)
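The parameterization (4) to (7) can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper name `bse_matrices`, the dimension `d = 4`, and the randomly drawn vectors are hypothetical choices, with the separating vector rescaled so that the distortionless constraint (6) holds.

```python
import numpy as np

def bse_matrices(a, w):
    """Build A_k, W_k and B_k of Eqs. (4)-(5) for one frequency bin.

    a = [gamma; g] is the mixing vector, w = [beta; h] the separating vector.
    The distortionless constraint w^H a = 1 is assumed to hold.
    """
    d = a.shape[0]
    gamma, g = a[0], a[1:]
    h = w[1:]
    I = np.eye(d - 1)
    hH = h.conj()[None, :]                        # row vector h^H
    Q = np.vstack([hH, (np.outer(g, h.conj()) - I) / gamma])
    A = np.hstack([a[:, None], Q])                # A_k = [a_k, Q_k]
    B = np.hstack([g[:, None], -gamma * I])       # blocking matrix B_k
    W = np.vstack([w.conj()[None, :], B])         # W_k = [w_k^H; B_k]
    return A, W, B

# Hypothetical example data.
rng = np.random.default_rng(0)
d = 4
a = rng.standard_normal(d) + 1j * rng.standard_normal(d)
w = rng.standard_normal(d) + 1j * rng.standard_normal(d)
w = w / np.conj(np.vdot(w, a))                    # rescale so that w^H a = 1
A, W, B = bse_matrices(a, w)

inverse_ok = np.allclose(W @ A, np.eye(d))        # W_k is the inverse of A_k
blocking_ok = np.allclose(B @ a, 0)               # B_k a_k = 0
```

Note that `np.vdot` conjugates its first argument, so `np.vdot(w, a)` computes \(\mathbf{w}^{H}\mathbf{a}\) directly.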

### CSV mixing model

Now, we turn to an extension of (7) to time-varying mixtures. Let the available samples of the observed signals (meaning the STFT coefficients from \(N\) frames) be divided into \(T\) intervals; for the sake of simplicity, we assume that the intervals have the same integer length \(N_{b}=N/T\). The intervals will be called blocks and will be indexed by \(t\in \{1,\dots,T\}\).

A straightforward extension of (7) to time-varying mixtures is when all parameters, i.e., the mixing and separating vectors, are block-dependent. However, such an extension brings no advantage compared to processing each block separately. In the constant separating vector (CSV) mixing model, it is assumed that only the mixing vectors are block-dependent while the separating vectors are constant over the blocks. Hence, the mixing and de-mixing matrices on the tth block are parameterized, respectively, as

$$\mathbf{A}_{k,t} = \left(\begin{array}{cc} \mathbf{a}_{k,t} & \mathbf{Q}_{k,t} \end{array}\right) = \left(\begin{array}{cc} \gamma_{k,t} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k,t} & \frac{1}{\gamma_{k,t}}(\mathbf{g}_{k,t}\mathbf{h}_{k}^{H}-\mathbf{I}_{d-1}) \end{array}\right),$$

(8)

and

$$\mathbf{W}_{k,t} = \left(\begin{array}{c} \mathbf{w}_{k}^{H}\\ \mathbf{B}_{k,t} \end{array}\right) = \left(\begin{array}{cc} \beta_{k}^{*} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k,t} & -\gamma_{k,t} \mathbf{I}_{d-1} \end{array}\right).$$

(9)

Each sample of the observed signals on the tth block is modeled according to

$$\mathbf{x}_{k,t}=\mathbf{A}_{k,t} \left(\begin{array}{c} s_{k,t}\\ \mathbf{z}_{k,t} \end{array}\right),$$

(10)

where \(s_{k,t}\) and \(\mathbf{z}_{k,t}\) represent, respectively, the \(k\)th frequency of the SOI and of the background signals at any frame within the \(t\)th block. Note that the CSV model coincides with the static model (7) when \(T=1\).
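To make the CSV structure (8) to (10) concrete, the following sketch generates one frequency bin of a block-wise mixture in which the mixing vector changes with every block while a single separating vector recovers the SOI from all of them. All sizes, the random data, and the helper `crandn` are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, Nb = 3, 4, 50     # channels, blocks, frames per block (illustrative values)

def crandn(*shape):
    """Complex-valued standard normal samples (helper assumed for this sketch)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

w = crandn(d)           # block-independent separating vector w_k = [beta; h]
h = w[1:]
max_err = 0.0
for t in range(T):
    a_t = crandn(d)
    a_t = a_t / np.vdot(w, a_t)              # enforce w^H a_{k,t} = 1 on every block
    gamma_t, g_t = a_t[0], a_t[1:]
    # Q_{k,t} as in Eq. (8): only gamma and g depend on the block, h does not.
    Q_t = np.vstack([h.conj()[None, :],
                     (np.outer(g_t, h.conj()) - np.eye(d - 1)) / gamma_t])
    A_t = np.hstack([a_t[:, None], Q_t])
    s_t = crandn(Nb)                         # SOI frames in block t
    z_t = crandn(d - 1, Nb)                  # background frames in block t
    x_t = A_t @ np.vstack([s_t[None, :], z_t])   # Eq. (10)
    s_hat = w.conj() @ x_t                   # the same w extracts the SOI in every block
    max_err = max(max_err, float(np.max(np.abs(s_hat - s_t))))
```

In this synthetic setting `max_err` stays at the level of floating-point round-off, because the mixture obeys the CSV model exactly; real recordings satisfy it only approximately.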

The practical meaning of the CSV model is illustrated in Fig. 1. While the CSV model admits that the SOI can change its position from block to block (the mixing vectors \(\mathbf{a}_{k,t}\) depend on \(t\)), a block-independent separating vector \(\mathbf{w}_{k}\) is sought that extracts the speaker’s voice from all positions visited during the movement. There are two main reasons for this. First, because \(\mathbf{w}_{k}\) is common to all blocks, the achievable interference-to-signal ratio (ISR) is of order \(\mathcal{O}(N^{-1})\), whereas a block-dependent \(\mathbf{w}_{k}\) yields an ISR of order \(\mathcal{O}(N_{b}^{-1})\); this is confirmed by the theoretical study of Cramér-Rao bounds in [24]. Second, the CSV model enables BSE methods to avoid the discontinuity problem mentioned in the previous section.

The CSV model also brings a limitation. Formally, the mixture must obey the condition that, for each \(k\), a separating vector exists such that \(s_{k,t}=\mathbf{w}_{k}^{H}\mathbf{x}_{k,t}\) holds for every \(t\), a condition that seems to be quite restrictive. Nevertheless, preliminary experiments in [23] have shown that this limitation is not crucial in practical situations and does not differ much from that of static methods (spatially overlapping speakers cannot be separated), especially when the number of microphones is high enough to provide sufficient degrees of freedom. When the speakers are static, the rule of thumb says that they cannot be separated through spatial filtering, or are at least difficult to separate, when their angular positions with respect to the microphone array are the same. Hence, moving speakers cannot be separated based on the CSV model when their angular ranges with respect to the array overlap during the recording. The experimental part of this work presented in Section IV validates these findings.

### Source model

In this section, we introduce the statistical model of the signals adopted from IVE. Samples (frames) of the signals will be assumed to be independent and identically distributed (i.i.d.) within each block according to the probability density function (pdf) of the representing random variable.

Let \(\mathbf{s}_{t}\) denote the vector component corresponding to the SOI, i.e., \(\mathbf{s}_{t}=[s_{1,t},\dots,s_{K,t}]^{T}\). The elements of \(\mathbf{s}_{t}\) are assumed to be uncorrelated (because they correspond to different frequency components of the SOI) but dependent, that is, their higher-order moments are taken into account [9]. Let \(p_{s}(\mathbf{s}_{t})\) denote the joint pdf of \(\mathbf{s}_{t}\) and \(p_{\mathbf{z}_{k,t}}(\mathbf{z}_{k,t})\) denote the pdf (see Footnote 2) of \(\mathbf{z}_{k,t}\). To simplify the notation, \(p_{s}(\cdot)\) will be written without the index \(t\) although it generally depends on \(t\). Since \(\mathbf{s}_{t}\) and \(\mathbf{z}_{1,t},\dots,\mathbf{z}_{K,t}\) are independent, their joint pdf within the \(t\)th block is equal to the product of the marginal pdfs

$$p_{s}(\mathbf{s}_{t})\cdot\prod_{k=1}^{K} p_{\mathbf{z}_{k,t}}(\mathbf{z}_{k,t}).$$

(11)

By applying the transformation theorem to (11) using (10), from which it follows that

$$\left(\begin{array}{c} s_{k,t}\\ \mathbf{z}_{k,t} \end{array}\right)=\mathbf{W}_{k,t}\mathbf{x}_{k,t}= \left(\begin{array}{c} \mathbf{w}_{k}^{H}\mathbf{x}_{k,t}\\ \mathbf{B}_{k,t}\mathbf{x}_{k,t} \end{array}\right),$$

(12)

the joint pdf of the observed signals from the tth block reads

$$\begin{aligned} p_{\mathbf{x}}(\{\mathbf{x}_{k,t}\}_{k}) &= p_{s}\left(\left\{\mathbf{w}_{k}^{H}\mathbf{x}_{k,t}\right\}_{k}\right) \\ &\quad \times\prod_{k=1}^{K} p_{\mathbf{z}_{k,t}}(\mathbf{B}_{k,t}\mathbf{x}_{k,t})\, |\det \mathbf{W}_{k,t}|^{2}. \end{aligned}$$

(13)

Hence, the log-likelihood function as a function of the parameter vectors \(\mathbf{w}_{k}\) and \(\mathbf{a}_{k,t}\) and all available samples of the observed signals in the \(t\)th block is given by

$$\begin{aligned} &\mathcal{L}(\{\mathbf{w}_{k}\}_{k},\{\mathbf{a}_{k,t}\}_{k}|\{\mathbf{x}_{k,t}\}_{k})\\ &\quad= \hat{\mathrm E}\left[\log p_{s}(\{\hat s_{k,t}\}_{k})\right] +\sum_{k=1}^{K} \hat{\mathrm E}\left[\log p_{\mathbf{z}_{k,t}}(\hat{\mathbf z}_{k,t})\right]\\ &\qquad+\sum_{k=1}^{K}\log |\det \mathbf{W}_{k,t}|^{2}, \end{aligned}$$

(14)

where \(\hat s_{k,t}=\mathbf{w}_{k}^{H}\mathbf{x}_{k,t}\) and \(\hat{\mathbf{z}}_{k,t}=\mathbf{B}_{k,t}\mathbf{x}_{k,t}\) denote the current estimates of the SOI and of the background signals, respectively.

In BSS and BSE, the true pdfs of the original sources are not known, so suitable model densities have to be chosen in order to derive a contrast function based on (14). To find an appropriate surrogate of \(p_{s}(\mathbf{s}_{t})\), the variance of the SOI, which can change from block to block (see Footnote 3), has to be taken into account. Let \(f(\cdot)\) be a pdf corresponding to a normalized non-Gaussian random variable. To reflect the block-dependent variance, \(p_{s}(\mathbf{s}_{t})\) should be replaced by

$$p_{s}(\mathbf{s}_{t}) \approx f\left(\left\{\frac{s_{k,t}}{\sigma_{k,t}}\right\}_{k}\right)\left(\prod_{k=1}^{K}\sigma_{k,t}\right)^{-2},$$

(15)

where \(\sigma^{2}_{k,t}\) denotes the variance of \(s_{k,t}\). Its unknown value is replaced by the sample-based variance \(\hat\sigma^{2}_{k,t}\) of \(\hat s_{k,t}\), where \(\hat\sigma_{k,t}=\sqrt{\mathbf{w}_{k}^{H}\widehat{\mathbf{C}}_{k,t}\mathbf{w}_{k}}\) and \(\widehat{\mathbf{C}}_{k,t}=\hat{\mathrm{E}}\left[\mathbf{x}_{k,t}\mathbf{x}_{k,t}^{H}\right]\) is the sample-based covariance matrix of \(\mathbf{x}_{k,t}\).
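The identity behind this estimate can be verified with a few lines of NumPy. The sketch assumes zero-mean signals (so that the sample variance is the mean squared magnitude) and uses hypothetical random data for one block and one frequency bin.

```python
import numpy as np

rng = np.random.default_rng(2)
d, Nb = 4, 200   # channels and frames in the block (illustrative values)

# Frames of one block at one frequency bin, shape (d, Nb); random stand-in data.
x = rng.standard_normal((d, Nb)) + 1j * rng.standard_normal((d, Nb))
w = rng.standard_normal(d) + 1j * rng.standard_normal(d)

C_hat = x @ x.conj().T / Nb                            # sample covariance of x
sigma_hat = np.sqrt(np.real(np.vdot(w, C_hat @ w)))    # sqrt(w^H C_hat w)

s_hat = w.conj() @ x                                   # SOI estimate w^H x per frame
sigma_direct = np.sqrt(np.mean(np.abs(s_hat) ** 2))    # sample standard deviation
```

Both quantities agree up to round-off, so the scale of the SOI estimate can be obtained from the covariance matrix without forming \(\hat s_{k,t}\) explicitly.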

It is worth noting that in the case of the static mixing model, i.e., when \(T=1\), it can be assumed that \(\sigma^{2}_{k,t}=1\) because of the scaling ambiguity.

Similarly to [13], the pdf of the background is assumed to be circular Gaussian with zero mean and (unknown) covariance matrix \(\mathbf{C}_{\mathbf{z}_{k,t}}=\mathrm{E}\left[\mathbf{z}_{k,t}\mathbf{z}_{k,t}^{H}\right]\), i.e., \(p_{\mathbf{z}_{k,t}}\sim \mathcal{CN}(0,\mathbf{C}_{\mathbf{z}_{k,t}})\). Next, by Eq. (15) in [13], it follows that \(|\det\mathbf{W}_{k,t}|^{2}=|\gamma_{k,t}|^{2(d-2)}\), which corresponds to the third term in (14).
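One way to see the determinant identity directly from the structure of the de-mixing matrix in (9), without consulting [13], is through the Schur complement of its lower-right block. The short derivation below is offered as a sanity check under the assumption \(\gamma_{k,t}\neq 0\); it is not necessarily the argument used in [13].

$$\begin{aligned} \det \mathbf{W}_{k,t} &= \det(-\gamma_{k,t}\mathbf{I}_{d-1})\left(\beta_{k}^{*}-\mathbf{h}_{k}^{H}(-\gamma_{k,t}\mathbf{I}_{d-1})^{-1}\mathbf{g}_{k,t}\right)\\ &= (-\gamma_{k,t})^{d-1}\,\frac{\beta_{k}^{*}\gamma_{k,t}+\mathbf{h}_{k}^{H}\mathbf{g}_{k,t}}{\gamma_{k,t}} = (-1)^{d-1}\gamma_{k,t}^{d-2}, \end{aligned}$$

where the last equality uses the distortionless constraint \(\beta_{k}^{*}\gamma_{k,t}+\mathbf{h}_{k}^{H}\mathbf{g}_{k,t}=1\). Taking the squared magnitude gives \(|\gamma_{k,t}|^{2(d-2)}\).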

Now, by replacing the unknown pdfs in (14) and by neglecting the constant terms, we obtain the contrast function in the form

$$\begin{aligned} \mathcal{C}&(\{\mathbf{w}_{k}\}_{k},\{\mathbf{a}_{k,t}\}_{k,t}) \\ &= \frac{1}{T}\sum_{t=1}^{T}\left\{\hat{\mathrm{E}}\left[\log f\left(\left\{\frac{\mathbf{w}_{k}^{H}\mathbf{x}_{k,t}}{\hat\sigma_{k,t}}\right\}_{k}\right)\right] -\sum_{k=1}^{K}\log(\hat\sigma_{k,t})^{2} \right.\\ &\quad-\sum_{k=1}^{K} \hat{\mathrm E}\left[\mathbf{x}_{k,t}^{H}\mathbf{B}_{k,t}^{H}\mathbf{C}_{\mathbf{z}_{k,t}}^{-1}\mathbf{B}_{k,t}\mathbf{x}_{k,t}\right] \\ &\quad+\left.(d-2)\sum_{k=1}^{K}\log |\gamma_{k,t}|^{2}\right\}. \end{aligned}$$

(16)

The nuisance parameter \(\mathbf{C}_{\mathbf{z}_{k,t}}\) will later be replaced by its sample-based estimate \(\widehat{\mathbf C}_{\mathbf{z}_{k,t}}=\hat{\mathrm E}\left[\hat{\mathbf z}_{k,t}\hat{\mathbf z}_{k,t}^{H}\right]\).
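As an illustration of how the contrast function (16) could be evaluated for given parameters, consider the sketch below. Several things in it are assumptions rather than part of the source: the function name `csv_contrast`, the array layout `(K, T, d, Nb)`, the random example data, and in particular the model density, which is taken here as \(\log f(\mathbf{u}) = -\|\mathbf{u}\|_{2}\) up to a constant (a choice often seen in independent vector analysis). The background covariance is replaced by its sample-based estimate.

```python
import numpy as np

def csv_contrast(X, w, a):
    """Evaluate the contrast (16) with C_z replaced by its sample estimate.

    X: (K, T, d, Nb) STFT frames split into blocks (hypothetical layout)
    w: (K, d) separating vectors; a: (K, T, d) mixing vectors
    Assumed model density (not from the source): log f(u) = -||u||_2 + const.
    """
    K, T, d, Nb = X.shape
    total = 0.0
    for t in range(T):
        s_norm = np.empty((K, Nb), dtype=complex)
        val = 0.0
        for k in range(K):
            x = X[k, t]
            C = x @ x.conj().T / Nb                          # sample covariance
            sigma = np.sqrt(np.real(np.vdot(w[k], C @ w[k])))
            s_norm[k] = (w[k].conj() @ x) / sigma            # normalized SOI estimate
            gamma, g = a[k, t, 0], a[k, t, 1:]
            B = np.hstack([g[:, None], -gamma * np.eye(d - 1)])
            z = B @ x                                        # background estimate
            Cz = z @ z.conj().T / Nb
            # Sample mean of z^H Cz^{-1} z over the frames of the block.
            quad = np.real(np.mean(np.sum(z.conj() * np.linalg.solve(Cz, z), axis=0)))
            val += -np.log(sigma ** 2) - quad + (d - 2) * np.log(np.abs(gamma) ** 2)
        # First term of (16): sample mean of log f over the frames of the block.
        val += np.mean(-np.sqrt(np.sum(np.abs(s_norm) ** 2, axis=0)))
        total += val
    return total / T

# Hypothetical example data.
rng = np.random.default_rng(3)
K, T, d, Nb = 3, 2, 3, 40
X = rng.standard_normal((K, T, d, Nb)) + 1j * rng.standard_normal((K, T, d, Nb))
w = rng.standard_normal((K, d)) + 1j * rng.standard_normal((K, d))
a = rng.standard_normal((K, T, d)) + 1j * rng.standard_normal((K, T, d))
for k in range(K):
    for t in range(T):
        a[k, t] = a[k, t] / np.vdot(w[k], a[k, t])   # enforce w_k^H a_{k,t} = 1
c = csv_contrast(X, w, a)
```

With the sample-based covariance plugged in, the quadratic term equals \(\mathrm{tr}(\widehat{\mathbf C}_{\mathbf{z}_{k,t}}^{-1}\widehat{\mathbf C}_{\mathbf{z}_{k,t}})=d-1\) for every \(k\) and \(t\), so in this sketch it contributes only a constant; it is kept explicit to mirror (16) term by term.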