# Low-complexity deep unfolded neural network receiver for MIMO systems based on the probability data association detector – EURASIP Journal on Wireless Communications and Networking

#### By Pedro H. C. de Souza and Luciano L. Mendes

Aug 9, 2022

A classical problem in the MIMO literature is to decide which symbols were transmitted by each antenna when only (2) is available at the receiver. This detection problem can be solved optimally, albeit at great computational effort, by the MLD for MIMO as follows

$$\begin{aligned} \hat{\mathbf{a}}_k = \underset{\mathbf{a}_k \in \mathbb{S}^{2N_\text{t}}}{\arg\min}\; \Vert \mathbf{r}_k - \mathbf{H}_k \mathbf{a}_k \Vert_2^2, \end{aligned}$$

(7)

for which \(\hat{\mathbf{a}}_k \in \mathbb{S}^{2N_\text{t}}\) is the estimated vector of symbols' coordinates.

It is known that the prohibitive complexity of the MLD has motivated research into several alternative MIMO detectors throughout the last decades [3]. The PDA detector is one of these alternatives: it presents significantly lower complexity than the MLD, with an affordable bit error rate (BER) performance loss under specific conditions, as will be detailed in Sects. 3.5 and 4. In Sect. 3.1, the PDA detector's algorithm first proposed in [9] is briefly revisited, followed by our proposed DU-PDA, for which the PDA is the underlying algorithm.

### Probability data association detector

Before the detection task is carried out by the PDA detector, the received signal, \(\mathbf{r}_k\), is preprocessed or equalized using the ZF principle as follows [2, 3, 9]

$$\begin{aligned} \mathbf{z}_k = \mathbf{H}^{\dagger}_{k} \mathbf{r}_k = \mathbf{a}_k + \mathbf{v}_k, \end{aligned}$$

(8)

wherein \(\mathbf{H}^{\dagger}_{k} = (\mathbf{H}^\text{T}_k \mathbf{H}_k)^{-1} \mathbf{H}^\text{T}_k\) is the left Moore–Penrose pseudoinverse and \(\mathbf{v}_k = \mathbf{H}^{\dagger}_{k} \mathbf{n}_k\) is the enhanced AWGN. Let us rewrite (8), such that

$$\begin{aligned} \mathbf{z}_k = \mathbf{e}_i a_k(i) + \underbrace{\sum_{j \ne i}{\mathbf{e}_j a_k(j)} + \mathbf{v}_k}_{\mathcal{V}_i}, \quad \forall i,j \in \{0,1,\ldots,2N_\text{t}-1\}, \end{aligned}$$

(9)

where \(\mathbf{e}_i\) is the vector with 1 (one) at its ith entry and 0 (zero) elsewhere, and \(\mathcal{V}_i\) is a multivariate random variable (RV) that can be seen as the effective interference-plus-noise contaminating \(a_k(i)\) [9]. Therefore, the crux is detecting the symbol transmitted by the ith antenna while considering that all other \(j \ne i\) transmitted symbols are interference added to the noise term, which is described by \(\mathcal{V}_i\).

Therefore, the PDA detector associates, with each \(a_k(i)\), a probability vector \(\mathbf{p}_i \in \mathbb{R}^{\sqrt{M}}\), which is given by the evaluation of \(P_m(a_k(i) = q(m) \mid \mathbf{z}_k, \{\mathbf{p}_j\}_{\forall j \ne i})\); \(q(m) \in \mathbb{S}\) being a coordinate of the M-QAM constellation and \(m \in \{0,1,\ldots,\sqrt{M}-1\}\). It is important to remark that the PDA detector uses all \(\{\mathbf{p}_j\}_{\forall j \ne i}\) associated with interfering symbols already detected, thanks to the incorporation of a strategy similar to that of successive interference cancellation (SIC) detectors. This significantly reduces the computational complexity of calculating \(\mathbf{p}_i\), since otherwise \(P_m(a_k(i) = q(m) \mid \mathbf{z}_k)\) would have to be evaluated, which requires computing multiple integrals for each received symbol and is therefore prohibitive in practice. Dropping the subscript \((\cdot)_k\) to simplify the notation and assuming that \(\mathcal{V}_i\) has a Gaussian distribution [9, 11], the likelihood function of \(\mathbf{z} \mid a(i) = q(m)\) can be defined as

$$\begin{aligned} P_m(\mathbf{z} \mid a(i) = q(m)) \propto \exp\left(\alpha_m(i)\right), \end{aligned}$$

(10)

for which

$$\begin{aligned} \alpha_m(i) = \left(\mathbf{z} - \boldsymbol{\mu}_i - 0.5\,\mathbf{e}_i q(m)\right)^\text{T} \boldsymbol{\Omega}_i^{-1} \mathbf{e}_i q(m), \end{aligned}$$

(11)

wherein \(\mathrm{E}\left[\mathcal{V}_i\right] = \boldsymbol{\mu}_i\) and \(\mathrm{COV}\left[\mathcal{V}_i\right] = \boldsymbol{\Omega}_i\) are given by

$$\begin{aligned} \boldsymbol{\mu}_i = \sum_{j \ne i}{\mathbf{e}_j \left(\mathbf{q}^\text{T} \mathbf{p}_j\right)}, \end{aligned}$$

(12)

$$\begin{aligned} \boldsymbol{\Omega}_i = \sum_{j \ne i}{\mathbf{e}_j \mathbf{e}_j^\text{T} \left(\left(\mathbf{q}^2\right)^\text{T} \mathbf{p}_j - \boldsymbol{\mu}_j^2\right)} + 0.5\,\sigma^2 \mathbf{G}^{-1}, \end{aligned}$$

(13)

where \(\mathbf{q} = [q(0)\; q(1) \ldots q(\sqrt{M}-1)]^\text{T}\) and \(\mathbf{G}^{-1} = (\mathbf{H}^\text{T} \mathbf{H})^{-1}\) is the inverse of the Gram matrix [2] that accounts for the noise enhancement caused by the ZF. To evaluate the posterior probabilities associated with each symbol, we compute

$$\begin{aligned} P_m(a(i) = q(m) \mid \mathbf{z}, \{\mathbf{p}_j\}_{\forall j \ne i}) \approx \frac{P_m(\mathbf{z} \mid a(i) = q(m))}{\sum\limits_{m=0}^{\sqrt{M}-1}{P_m(\mathbf{z} \mid a(i) = q(m))}}, \end{aligned}$$

(14)

which can be seen as an approximate form of Bayes' theorem [11]. Then, substituting (10) into (14) yields

$$\begin{aligned} p_i(m) = \frac{\exp\left(\alpha_m(i)\right)}{\sum\limits_{m=0}^{\sqrt{M}-1}{\exp\left(\alpha_m(i)\right)}}. \end{aligned}$$

(15)

Finally, the PDA detector procedure is given in Algorithm 1.
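As an illustration of (11)–(15), the sketch below computes one probability vector \(\mathbf{p}_i\) with NumPy. This is not the authors' implementation: the 4-QAM coordinate alphabet `q = [-1, +1]`, the uniform initialization of the \(\mathbf{p}_j\), and all function names are assumptions made for the example.

```python
import numpy as np

def pda_posterior(z, G_inv, p, q, sigma2, i):
    """Evaluate p_i(m) for antenna i, following (11)-(15).

    z      : ZF-equalized vector, shape (2*Nt,)
    G_inv  : inverse Gram matrix (H^T H)^{-1}
    p      : current probability vectors, shape (2*Nt, sqrt(M))
    q      : real coordinate alphabet, e.g. [-1., 1.]
    sigma2 : noise variance
    """
    n = z.size
    mu = np.zeros(n)                               # (12): soft interference mean
    var = np.zeros(n)
    for j in range(n):
        if j != i:
            mu[j] = q @ p[j]
            var[j] = (q ** 2) @ p[j] - mu[j] ** 2  # per-symbol soft variance
    Omega = np.diag(var) + 0.5 * sigma2 * G_inv    # (13)
    Omega_inv = np.linalg.inv(Omega)
    e_i = np.eye(n)[i]
    alpha = np.array([(z - mu - 0.5 * e_i * qm) @ Omega_inv @ (e_i * qm)
                      for qm in q])                # (11)
    alpha -= alpha.max()                           # stabilize the exponentials
    return np.exp(alpha) / np.exp(alpha).sum()     # (15)
```

In Algorithm 1, this evaluation would be repeated over the antennas in the order given by (16), with each freshly computed \(\mathbf{p}_i\) immediately reused for the remaining antennas, which is the SIC-like behavior described above.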

Note that the optimal detection sequence [9] used in Algorithm 1 can be found with the aid of the following operation:

$$\begin{aligned} \rho(i) = \frac{1}{\mathbf{f}_i^\text{T} \mathbf{H} \mathbf{f}_i} \max\left\{0,\, \mathbf{f}_i^\text{T} \mathbf{h}_i - \sum\limits_{j \ne i}{|\mathbf{f}_i^\text{T} \mathbf{h}_j|}\right\}^2, \end{aligned}$$

(16)

where \(\mathbf{f}_i^\text{T}\) represents the ith row of \(\mathbf{F} = \mathbf{H}^{\dagger}\) and \(\mathbf{h}_j\) denotes the jth column of \(\mathbf{H}\). Note that larger magnitudes of \(\rho(i)\) mean that the ith antenna suffers less IAI [3]. In other words, the off-diagonal entries of the ith row of \(\mathbf{F}\mathbf{H}\) have, combined, smaller magnitudes than its ith diagonal entry. It is easy to show that the optimal sequence is defined by sorting \(\boldsymbol{\rho} = [\rho(0)\; \rho(1) \ldots \rho(2N_\text{t}-1)]^\text{T}\) in descending order, denoted as \(\{k_i \in \{0,\ldots,2N_\text{t}-1\} \mid \rho(k_0)> \rho(k_1)> \ldots > \rho(k_{2N_\text{t}-1})\}\).
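A sketch of this ordering rule in NumPy follows. The numerator mirrors (16); for the normalization we use \(\Vert \mathbf{f}_i \Vert^2\), the usual ZF noise-enhancement factor, as an assumption made here so that the sketch is dimension-consistent for any \(N_\text{t} \ne N_\text{r}\).

```python
import numpy as np

def detection_order(H):
    """Antenna processing order in the spirit of (16): sort rho(i) descending.

    Numerator follows (16); the denominator ||f_i||^2 is an assumption
    (ZF noise-enhancement factor) so the sketch works for rectangular H."""
    F = np.linalg.pinv(H)                     # F = H^dagger; rows are f_i^T
    n = H.shape[1]
    rho = np.empty(n)
    for i in range(n):
        f = F[i]
        diag = f @ H[:, i]                    # f_i^T h_i: desired-signal gain
        off = sum(abs(f @ H[:, j]) for j in range(n) if j != i)  # residual IAI
        rho[i] = max(0.0, diag - off) ** 2 / (f @ f)
    return np.argsort(-rho)                   # least-interfered antenna first
```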

### Deep unfolding

Prior to presenting our proposed DU-PDA detector, a brief description of NNs and deep unfolding is provided in this section.

In general, the NN architecture has shown great potential for detecting signals, but its design and parameterization, among other problems, impose limitations [4]. Alternatively, this architecture can be adapted such that the iterations of a given algorithm are unfolded onto its layers [5, 6, 12], hence the term "unfolding." It is also commonly assumed that the NN employs several layers and, consequently, the term "deep" is added.

More specifically, consider an algorithm with an input vector denoted by \(\mathbf{x} \in \mathbb{R}^{N}\), for which the output is given by \(\mathbf{y} \in \mathbb{R}^{S}\); this algorithm can then be expressed by [12]

$$\begin{aligned} y(s) = g\left(\mathbf{x}, \boldsymbol{\psi}, \boldsymbol{\Theta}\right), \quad \forall s \in \{0,1,\ldots,S-1\}, \end{aligned}$$

(17)

wherein \(\boldsymbol{\Theta}\) is the set of all parameters used by the algorithm, \(g(\cdot)\) represents a mapping function, usually nonlinear, and \(\boldsymbol{\psi}\) is iteratively updated as follows

$$\begin{aligned} \psi_\ell(s) = f\left(\mathbf{x}, \psi_{\ell-1}(s), \boldsymbol{\Theta}\right), \end{aligned}$$

(18)

where the \(\ell\)th iteration also involves an operation with a mapping function \(f(\cdot)\), and \(\boldsymbol{\psi}_0\) denotes the initial value.

Therefore, in the deep unfolding context, \(\boldsymbol{\psi}_\ell\) can be understood as the input–output relationship at the \(\ell\)th layer of a NN architecture, as illustrated in Fig. 1.

Note that the dimensions of the learnable parameters \(\boldsymbol{\Theta}\) are defined according to the underlying algorithm on which (17), (18) and the architecture depicted in Fig. 1 are based. This includes weights and biases, for example, which are optimized by the NN training algorithm [4, 12]. In other words, the number of layers and neurons is fixed, thereby considerably simplifying the process of defining what are commonly known as the NN hyperparameters.

Moreover, improvements are also obtained by using the aforementioned learnable parameters directly in the iterative algorithm. That way, the learning capabilities of NNs can be applied to optimizing algorithms such that their global performance, computational complexity, or both are improved. In Sect. 3.3, the PDA detector, reviewed in Sect. 3.1, is implemented using the deep unfolded architecture for NNs, unveiling our proposed DU-PDA detector for MIMO systems.
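As a generic example of this construction (not part of the paper's own development), the classic unfolding of the ISTA algorithm for sparse recovery fits (17) and (18) directly: each loop pass becomes a layer, and the per-layer step size and threshold become learnable entries of \(\boldsymbol{\Theta}\). All names below are illustrative.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding nonlinearity (ISTA's denoiser)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def unfolded_ista(A, x, params):
    """params = [(w_0, t_0), ..., (w_{L-1}, t_{L-1})]: per-layer learnables."""
    psi = np.zeros(A.shape[1])                          # psi_0: initial value
    for w, t in params:                                 # one pass == one layer, as in (18)
        psi = soft(psi + w * A.T @ (x - A @ psi), t)    # f(x, psi_{l-1}, Theta)
    return psi                                          # (17): network output y
```

In a trained unfolded network, the `(w, t)` pairs would be fitted by backpropagation instead of being hand-set.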

### Proposed deep unfolded PDA detector

Aiming to take advantage of the iterative algorithm of the PDA detector, we propose the DU-PDA detector. Firstly, in the DU-PDA detector, the received signal, \(\mathbf{r}\), is preprocessed at the \(\ell\)th layer by the following operation [8]; [13, §IV-B, p. 1706]

$$\begin{aligned} \mathbf{z}_\ell = \hat{\mathbf{a}}_\ell + w_\ell \mathbf{H}^\text{T}\left(\mathbf{r} - \mathbf{H}\hat{\mathbf{a}}_\ell\right), \quad \forall \ell \in \{0,1,\ldots,L-1\}, \end{aligned}$$

(19)

where \(\hat{\mathbf{a}}_\ell \in \mathbb{R}^{2N_\text{t}}\) is the estimated transmitted symbol vector and the scalar \(w_\ell \in \mathbb{R}\) represents a learnable parameter. Note that this preprocessing principle differs from the ZF, which is used by the PDA detector, as defined in (8). In contrast, the proposed DU-PDA employs a preprocessing based on the approximate message passing (AMP) algorithm [14], which also bears similarities with the Richardson method [2, §IV-6, p. 9]. In this way, \(\hat{\mathbf{a}}_\ell\) is updated iteratively until it converges to an acceptable approximation of the transmitted symbol vector. Interestingly, when \(\hat{\mathbf{a}}_\ell \rightarrow \mathbf{a}\), the so-called residual term \(\left(\mathbf{r} - \mathbf{H}\hat{\mathbf{a}}_\ell\right) \rightarrow \mathbf{n}\), which gives us a result in (19) similar to (8).
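One layer of (19) is a single line of NumPy. The sketch below is illustrative only: the function name and the fixed value of `w` are assumptions, since \(w_\ell\) is learned during training.

```python
import numpy as np

def du_pda_preprocess(r, H, a_hat, w):
    """Layer update of (19): refine a_hat with the matched-filtered residual."""
    return a_hat + w * H.T @ (r - H @ a_hat)
```

When `a_hat` already equals the true symbol vector and the channel is noiseless, the residual vanishes and the update returns the vector unchanged, matching the limiting behavior noted above.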

The preprocessed signal of (19) is then fed into the following operation:

$$\begin{aligned} \psi_{\ell^*}(m)&= \text{softm}\left(\left(\mathbf{z}_\ell - \boldsymbol{\mu}_{\ell^*} - 0.5\,\mathbf{e}_{\ell^*} q(m)\right)^\text{T} \boldsymbol{\Omega}_{\ell^*}^{-1} \mathbf{e}_{\ell^*} q(m)\right) \nonumber \\&\quad \forall m \in \{0,1,\ldots,\sqrt{M}-1\}, \end{aligned}$$

(20)

where

$$\begin{aligned} \text{softm}\left(x_\ell(m)\right) = \frac{e^{x_\ell(m)}}{\sum_{m=0}^{\sqrt{M}-1} e^{x_\ell(m)}}. \end{aligned}$$

(21)

Note that the nonlinear function \(\text{softm}(\cdot)\) is applied at each layer. This makes (20) identical to (15), except that it is unfolded onto successive layers and that \(\boldsymbol{\psi}_j = \mathbf{p}_j\). Notably, this also distinguishes the proposed DU-PDA from other architectures [5, 8, 13] that instead use optimal denoisers at each layer, which do not account for interfering symbols as the underlying PDA algorithm of the DU-PDA does. Moreover, since the preprocessing is modified, it is necessary to redefine the covariance matrix, \(\boldsymbol{\Omega}_{\ell^*}\), as follows [15, §III-D, p. 2023], [8]

$$\begin{aligned} \boldsymbol{\Omega}_{\ell^*} = \sum_{j \ne \ell^*}{\mathbf{e}_j \mathbf{e}_j^\text{T} \left(\left(\mathbf{q}^2\right)^\text{T} \boldsymbol{\psi}_j - \boldsymbol{\mu}_j^2\right)} + \mathbf{e}_{\ell^*} \mathbf{e}_{\ell^*}^\text{T}\, \mathrm{COV}\left[\mathbf{z}_\ell - \mathbf{a}\right], \end{aligned}$$

(22)

where

$$\begin{aligned} \mathrm{COV}\left[\mathbf{z}_\ell - \mathbf{a}\right] = \frac{[\epsilon_\ell]_+ \Vert \mathbf{I}_{2N_\text{t}} - w_\ell \mathbf{H}^\text{T}\mathbf{H} \Vert_2^2 + 0.5\,\sigma^2 \Vert w_\ell \mathbf{H}^\text{T} \Vert_2^2}{2N_\text{t}}, \end{aligned}$$

(23)

wherein \([x]_+ = \max(0, x)\) and for which

$$\begin{aligned} \epsilon_\ell = \frac{\Vert \mathbf{r} - \mathbf{H}\hat{\mathbf{a}}_\ell \Vert_2^2 - N_\text{r}\,\sigma^2}{\Vert \mathbf{H} \Vert_2^2}. \end{aligned}$$

(24)

Equation (23) can be understood as the empirical mean-squared error (MSE) estimator of the covariance matrix originated from the residual and noise terms of (19). More importantly, note that \(\boldsymbol{\Omega}_{\ell^*}\) is now a diagonal matrix. This means that computing \(\boldsymbol{\Omega}_{\ell^*}^{-1}\) is not as costly as its counterpart in (11), that is, in the PDA detector. More details about such implications are given in Sect. 3.4.
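A sketch of the estimator (23)–(24) follows, under the assumption that the squared matrix norms \(\Vert \cdot \Vert_2^2\) in (23) are taken as Frobenius norms; the function name and test values are illustrative.

```python
import numpy as np

def residual_cov(r, H, a_hat, w, sigma2):
    """Scalar COV[z_l - a] of (23), with epsilon_l from (24).

    H has shape (2*Nr, 2*Nt) in the real-valued MIMO model; matrix norms
    are Frobenius norms (an assumption about the notation in (23))."""
    two_Nr, two_Nt = H.shape
    eps = ((np.linalg.norm(r - H @ a_hat) ** 2 - (two_Nr / 2) * sigma2)
           / np.linalg.norm(H) ** 2)                   # (24); N_r = two_Nr / 2
    eps = max(eps, 0.0)                                # [x]_+
    I = np.eye(two_Nt)
    num = (eps * np.linalg.norm(I - w * H.T @ H) ** 2
           + 0.5 * sigma2 * np.linalg.norm(w * H.T) ** 2)
    return num / two_Nt                                # (23); denominator 2*Nt
```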

Therefore, by considering developments presented in this subsection and the general model described in Sect. 3.2, we have

$$\begin{aligned} \psi_{\ell^*+1}(m) = \text{softm}\left(\mathbf{z}_\ell, \psi_{\ell^*}(m), \{w_\ell, \boldsymbol{\mu}_{\ell^*}, \boldsymbol{\Omega}_{\ell^*}\}\right), \end{aligned}$$

(25)

which is similar to what is evaluated in (15), with the addition, however, of a learnable parameter and a different preprocessing of the received signal. Note also that \(\boldsymbol{\psi}_L = \mathbf{y}\), meaning that the last layer output is also given by (25). Furthermore, let

$$\begin{aligned} \hat{\mathbf{a}}_{\ell+1} = \sum_{j \ne \ell}{\mathbf{e}_j z_\ell(j)} + \mathbf{e}_\ell \left(\mathbf{q}^\text{T} \boldsymbol{\psi}_{\ell^*}\right), \end{aligned}$$

(26)

such that the convergence of (19) might be improved, given that the soft combination of the symbols' coordinates with their estimated probabilities is fed forward to the next layer.

In Algorithm 2, we detail the general procedure carried out by the proposed DU-PDA detector.

The ground truth used for training the NN is defined by \(\mathbf{I}_{\ell^*} = [I(0)\; I(1) \ldots I(\sqrt{M}-1)]^\text{T}\), such that \(\mathcal{I} = \{\mathbf{I}_{\ell^*}\}_{\forall \ell^*}\). It indicates the known constellation coordinates that are transmitted during the training procedure; thus, \(I_{\ell^*}(m) \in \{0, 1\}\; \forall m\). Observe also that the PDA detector outputs approximate posteriors, as shown in (15), which is leveraged by our proposed DU-PDA detector in Algorithm 2 when employing the categorical cross-entropy loss function:

$$\begin{aligned} \mathcal{L}\left(\mathcal{I}, \boldsymbol{\psi}\right) = \frac{-1}{\sqrt{M}} \sum\limits_{\ell^*}{\mathbf{I}_{\ell^*} \log\left(\boldsymbol{\psi}_{\ell^*}\right) + \left(1 - \mathbf{I}_{\ell^*}\right) \log\left(1 - \boldsymbol{\psi}_{\ell^*}\right)}. \end{aligned}$$

(27)

Bear in mind that the loss is calculated considering the output of all L unfolded layers and not only the last one. Also, note that the use of (27) contrasts with the popular choice of the MSE loss function [5]. Additionally, the cross-entropy loss function is well known to be more appropriate for classification tasks.
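The loss (27) can be sketched in NumPy as follows; the clipping constant is an assumption added here to guard against \(\log 0\), and the function name is illustrative.

```python
import numpy as np

def pda_cross_entropy(I_all, psi_all, eps=1e-12):
    """Categorical cross-entropy of (27), summed over all layer outputs.

    I_all, psi_all : arrays of shape (num_outputs, sqrt(M));
    rows of I_all are one-hot ground-truth vectors I_{l*}."""
    M_sqrt = I_all.shape[1]
    psi = np.clip(psi_all, eps, 1.0 - eps)          # guard log(0)
    terms = I_all * np.log(psi) + (1 - I_all) * np.log(1 - psi)
    return -terms.sum() / M_sqrt
```

A perfect prediction drives the loss to (numerically) zero, while a confidently wrong one is heavily penalized, which is the classification behavior the text argues for.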

### Simplified DU-PDA

The model of the DU-PDA presented in the previous subsection can be simplified even further if some assumptions are made. Therefore, a variation of the proposed DU-PDA detector, namely the simplified DU-PDA detector, is presented in this subsection. For this detector, the calculations performed in (23) are simplified and the scalar \(0.5\sigma^2\) is applied directly in (22). The reasoning behind this approach lies in the asymptotic case, that is, when \(N_\text{t} \rightarrow \infty\) and \(N_\text{r} \rightarrow \infty\). In this case, the first term of (23) vanishes, since

$$\begin{aligned} \mathbf{H}^\text{T}\mathbf{H} \rightarrow \mathbf{I}_{2N_\text{t}}, \end{aligned}$$

(28)

and similarly for the second term we have

$$\begin{aligned} \Vert w_\ell \mathbf{H}^\text{T} \Vert_2^2 \rightarrow 2N_\text{t}, \end{aligned}$$

(29)

which yields

$$\begin{aligned} \mathrm{COV}\left[\mathbf{z}_\ell - \mathbf{a}\right]&\rightarrow \frac{[\epsilon_\ell]_+ \Vert \mathbf{I}_{2N_\text{t}} - \mathbf{I}_{2N_\text{t}} \Vert_2^2 + N_\text{t}\,\sigma^2}{2N_\text{t}} \nonumber \\&\rightarrow 0.5\,\sigma^2, \end{aligned}$$

(30)

wherein, for the sake of simplicity, the learnable parameter \(w_\ell\) is omitted. This is analogous to the channel hardening effect present in massive MIMO systems [2, 3], where the values of \(N_\text{t}\) and \(N_\text{r}\) are large. Although we demonstrate via computational simulations in Sect. 4 that the simplified DU-PDA only presents marginal losses in performance, it is still unknown whether other similar architectures proposed in the literature [5, 6, 8, 13] are robust enough to allow such simplifications.
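The hardening limit (28) is easy to check numerically. The sketch below draws i.i.d. Gaussian channels normalized to variance \(1/(2N_\text{r})\) per entry (a normalization assumed here so the Gram matrix concentrates around the identity) and measures how far \(\mathbf{H}^\text{T}\mathbf{H}\) is from \(\mathbf{I}_{2N_\text{t}}\).

```python
import numpy as np

rng = np.random.default_rng(7)

def gram_deviation(Nt, Nr):
    """Relative Frobenius distance ||H^T H - I|| / ||I|| for a random
    H with i.i.d. N(0, 1/(2*Nr)) entries (normalization is an assumption)."""
    H = rng.standard_normal((2 * Nr, 2 * Nt)) / np.sqrt(2 * Nr)
    G = H.T @ H
    I = np.eye(2 * Nt)
    return np.linalg.norm(G - I) / np.linalg.norm(I)

small = gram_deviation(2, 4)     # few receive antennas: noticeable deviation
large = gram_deviation(2, 512)   # many receive antennas: G hardens toward I
```

The deviation shrinks roughly as \(1/\sqrt{2N_\text{r}}\), illustrating why the first term of (23) can be dropped in the massive MIMO regime.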

### Computational complexity

According to the guidelines presented in [4, §IV-C, p. 122404], the global computational complexity of the PDA detector is approximately given by

$$\begin{aligned} \mathcal{O}(16N_\text{t}^4 + 8\sqrt{M}N_\text{t}^3 + 8N_\text{t}^2(N_\text{r} + \sqrt{M}) + 4N_\text{t} N_\text{r}). \end{aligned}$$

(31)

However, if we let \(N_\text{r} \gg \sqrt{M}\) and drop constants, then it can be written more compactly as

$$\begin{aligned} \mathcal{O}(N_\text{t}^4 + \sqrt{M}N_\text{t}^3 + N_\text{t}^2 N_\text{r} + N_\text{t} N_\text{r}). \end{aligned}$$

(32)

Note that \(\mathcal{O}(8N_\text{t}^3 + 16N_\text{t}^2 N_\text{r} + 4N_\text{t} N_\text{r})\) refers to the local cost of (8), where the inverse of \(\mathbf{G}\) costs \(\mathcal{O}(8N_\text{t}^3)\), and \(\mathcal{O}(16N_\text{t}^4 + 8\sqrt{M}(N_\text{t}^3 + N_\text{t}^2))\) is the complexity due to computing (11), for which \(\boldsymbol{\Omega}_i^{-1}\) costs \(\mathcal{O}(8N_\text{t}^3)\) [9] per outer iteration in Algorithm 1.

Moreover, the DU-PDA detector has an approximate global complexity of

$$\begin{aligned} \mathcal{O}(4LN_\text{t}^2 + 4LN_\text{t}(4N_\text{r} + \sqrt{M}) + LN_\text{r}). \end{aligned}$$

(33)

Assuming again that all constants are dropped and that \(N_\text{r} \gg \sqrt{M}\), (33) simplifies to

$$\begin{aligned} \mathcal{O}(LN_\text{t}^2 + LN_\text{t}N_\text{r} + LN_\text{r}). \end{aligned}$$

(34)

The global complexity is composed mainly of the local cost of (19), given by \(\mathcal{O}(8N_\text{t} N_\text{r})\) per layer, and the local cost of (23), expressed by \(\mathcal{O}(4N_\text{t}^2 + 8N_\text{t} N_\text{r} + N_\text{r})\) for each layer. The NN training stage cost is not taken into account when calculating the computational complexity of the detection stage, since the training stage is assumed to be computed offline, as discussed in [4]. Nevertheless, in general, the backpropagation algorithm used for training NNs has a complexity that scales linearly with the number of training samples, \(N_\text{TR}\), and training iterations, say \(N_\text{TI}\). It also grows with the number of layers L because of the chain-rule derivatives calculated during backpropagation. In principle, this is a high complexity compared with the detection or forward-pass complexity, but once trained, the NN-based detector may serve multiple users over a prescribed timeline [16]. This means that the training complexity is amortized over time and users, whereas the detection complexity is fixed for each user and transmission cycle. Hence, since training is not performed in the detection cycle, its complexity is not considered, enabling a fair comparison with other detectors.

Furthermore, recall that the simplified form of calculation demonstrated by (30) reduces the global complexity of the proposed DU-PDA detector even further. More specifically, the global complexity of the simplified DU-PDA detector is approximately \(\mathcal{O}(LN_\text{t}N_\text{r})\), meaning that the cost is reduced by one order of magnitude when compared to the DU-PDA detector.

From the computational complexity associated with each detector, it is possible to conclude that the PDA is more complex than the proposed DU-PDA. More specifically, this cost difference is due to the higher-order term \(N_\text{t}^4\) in the PDA global complexity. This is expected because of the matrix inversions performed by the PDA detector, which are not necessary for either the DU-PDA or the simplified DU-PDA detector. Also, notice that for both of these detectors, the total number of layers L might significantly increase the global complexity. It is demonstrated in Sect. 4, however, that this number is a multiple of \(N_\text{t}\), thus still implying a lower global complexity for the DU-PDA when compared to the PDA. In fact, the simplified DU-PDA complexity becomes even lower than that of the ZF in the aforementioned case. Additionally, an optimal detection sequence, such as (16), is not a general requirement for the DU-PDA, which further reduces its global complexity in relation to the PDA.
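The leading-term counts (32) and (34) can be compared directly; the sketch below evaluates them for an illustrative regime in which L is a small multiple of \(N_\text{t}\). The particular numbers, including the factor 4 in L, are assumptions for the example, not the paper's simulation setup.

```python
def pda_cost(Nt, Nr, M):
    """Leading terms of the PDA global complexity (32)."""
    return Nt**4 + (M ** 0.5) * Nt**3 + Nt**2 * Nr + Nt * Nr

def du_pda_cost(Nt, Nr, L):
    """Leading terms of the DU-PDA global complexity (34)."""
    return L * Nt**2 + L * Nt * Nr + L * Nr

# Illustrative regime: 16-QAM, Nr = 2*Nt, and L a small multiple of Nt
Nt, Nr, M = 16, 32, 16
L = 4 * Nt
```

For these values the quartic term already dominates the PDA count, so the DU-PDA remains cheaper even with \(L = 4N_\text{t}\) layers.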

Despite shedding light on how the detectors' computational complexities compare to each other, these are only asymptotic predictions of complexity. A detailed evaluation of system end-to-end latency [17, 18], for example, is out of the scope of this work. However, it can be verified for the typical \(4 \times 8\) MIMO considered in Sect. 4 that the symbol detection (see Line 15 of Algorithm 2) of the DU-PDA takes approximately 50 milliseconds on average, with negligible variance. Note that this time value depends heavily on the implementation of the proposed detector, which in this work is based on the TensorFlow library [19] and is not yet optimized for a full-fledged hardware implementation. Indeed, implementations using a hardware description language (HDL) can provide a more reliable analysis of the end-to-end latency of the proposed detector.

For convenience, Table 1 summarizes the global computational complexity of all detectors of interest. Observe that the AMP detector and the sphere detector (SD) are also included for the sake of completeness. For the AMP, \(N_\text{I}\) refers to the number of iterations or updates executed, whereas for the SD we considered the fixed-complexity SD [3, §VIII-D, p. 20], since its performance is near-optimum. To conclude, note also in Table 1 how the complexity of all detectors increases polynomially with the number of transmit antennas \(N_\text{t}\). The exceptions are the MLD and the SD, whose complexities increase exponentially with \(N_\text{t}\) and \(\sqrt{N_\text{t}}\), respectively, as expected.