3.2.2 Selected Memory Module
With pairs of multi-modal representations \(\{(h_1^I, h_1^T), \dots , (h_z^I, h_z^T)\}\) as input, this module aims to generate high-quality hash codes and to select negative samples for contrastive learning. The process is as follows: we first cluster the multi-modal pairs to obtain pseudo-category labels. We then store all the representations in a memory bank and generate hash codes using the hash function. Next, we conduct contrastive learning, selecting negative samples from different categories. Meanwhile, we use the pseudo labels together with class centers initialized from a Hadamard matrix to optimize the hash codes. We now explain each step in detail.
The first step is to cluster the representations to obtain pseudo labels for the pairs. We first concatenate the outputs of the two encoder modules to obtain the multi-modal representation \(H\), as shown in Eq. 1:
$$\begin{aligned} H = concat(H_I, H_T) \end{aligned}$$
(1)
We then use the k-means clustering algorithm [3] to cluster \(H\) into \(m\) groups. The initial cluster centers are chosen by randomly selecting \(m\) representations. We then iteratively compute the distance between each representation and the cluster centers, assign each representation to its closest center, and update the centers for the next round. In this way, each multi-modal representation receives a pseudo-class label, denoted \(L=\{l_1,l_2,\dots ,l_n\}\). We set \(m\) to 24 in this work.
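A minimal sketch of this step (Python; scikit-learn's KMeans stands in for the k-means algorithm cited as [3], and the array names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_labels(H_I: np.ndarray, H_T: np.ndarray, m: int = 24, seed: int = 0):
    """Concatenate the image/text representations (Eq. 1) and cluster them into
    m groups; each pair's cluster assignment serves as its pseudo-class label."""
    H = np.concatenate([H_I, H_T], axis=1)            # H = concat(H_I, H_T)
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(H)
    return H, km.labels_, km.cluster_centers_          # labels play the role of L = {l_1, ..., l_n}

# Example with 512 pairs of 128-d image and text representations:
# H_I, H_T = np.random.randn(512, 128), np.random.randn(512, 128)
# H, L, centers = pseudo_labels(H_I, H_T)
```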
The second step is to store all the multi-modal representations in a memory bank. Unlike traditional methods, which usually learn representations only within a batch, we store global multi-modal information of different categories by building a memory bank, which alleviates the restriction of batch-wise learning. The image-text representations stored in the bank are later used to provide negative samples for contrastive learning. The input of the memory bank is the set of multi-modal representations produced by the encoders. Given a query \(h_i^*, * \in \{I,T\}\), we aim to directly retrieve the correlated/positive keys from \(K=\{k_1, k_2,\dots , k_n \}\). The \(i\)-th key \(k_i\) corresponds to the \(i\)-th image-text pair. We update the memory bank by the following momentum update:
$$\begin{aligned} {\textbf{v}}_{i}= & {} \delta {\textbf{v}}_{i}+(1-\delta ) \frac{{\textbf{h}}_{i}^{I}+{\textbf{h}}_{i}^{T}}{2} \end{aligned}$$
(2)
$$\begin{aligned} k_i= & {} sign(v_{i}) \end{aligned}$$
(3)
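A minimal sketch of the momentum update in Eqs. 2 and 3 (numpy only; the one-slot-per-pair layout and the `MemoryBank` class name are illustrative assumptions):

```python
import numpy as np

class MemoryBank:
    """Keeps one continuous slot v_i per image-text pair; keys are the signs of the slots."""
    def __init__(self, n: int, dim: int, delta: float = 0.9):
        self.V = np.zeros((n, dim))   # slots v_1, ..., v_n
        self.delta = delta            # momentum coefficient delta in Eq. 2

    def update(self, idx, h_img, h_txt):
        # Eq. 2: v_i <- delta * v_i + (1 - delta) * (h_i^I + h_i^T) / 2
        self.V[idx] = self.delta * self.V[idx] + (1 - self.delta) * (h_img + h_txt) / 2.0

    def keys(self):
        # Eq. 3: k_i = sign(v_i)
        return np.sign(self.V)
```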
The training objective of the memory bank is to support search by keys: given a multi-modal representation \(h\), we compute its key and use it to retrieve the relevant entries from the memory bank \(K\). The retrieved (positive) result is denoted \(\{k_i^{+}\}\). Next, we improve the negative-sample selection for contrastive learning by drawing negatives from different categories of the memory bank. The negative samples, denoted \(\{k_j^{-}\}_{j=1}^K\), come from classes other than the query's, which enhances the discriminative power of contrastive learning and avoids similar samples being mistakenly selected as negatives. Using a hash function \(f\), composed of an MLP followed by a sign function, we obtain image hash codes \(B_I=\{b_{n}^I\}\) and text hash codes \(B_T=\{b_{n}^T\}\) for each pair. Each hash code \(b_{n}^*, * \in \{I,T\}\), has length \(q\) for convenient retrieval.
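A sketch of the category-aware negative selection and of the hash function \(f\). The single affine layer standing in for the MLP, the sampling routine, and all names here are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def sample_negatives(keys: np.ndarray, labels: np.ndarray, query_label: int, K: int, rng=None):
    """Pick K negative keys whose pseudo-class label differs from the query's label,
    so contrastive negatives come from other clusters stored in the memory bank."""
    rng = rng if rng is not None else np.random.default_rng(0)
    candidates = np.where(labels != query_label)[0]   # indices of other classes
    picked = rng.choice(candidates, size=K, replace=False)
    return keys[picked]                               # {k_j^-}_{j=1}^K

def hash_codes(H: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Stand-in for the hash function f: an affine map followed by sign.
    The paper uses an MLP before the sign; a single layer keeps the sketch short."""
    return np.sign(H @ W + b)                         # q-bit codes in {-1, 1}
```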
The third step is to optimize the class hash centers, pushing them as far apart as possible; the optimized centers are then used in the subsequent hash-code training. In brief, we train the image and text hash codes based on the optimized class hash centers and the pseudo-class labels from the first step. We initialize the class hash centers from a Hadamard matrix, which ensures that every class center vector consists of 1 and \(-1\) entries. In addition, any two rows (or columns) of a Hadamard matrix are orthogonal, so the constructed class center vectors are orthogonal to each other.
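A minimal sketch of the Hadamard initialization (Python; `scipy.linalg.hadamard` is assumed, which requires the code length \(q\) to be a power of two; taking the first \(m\) rows as the \(m\) initial centers is our reading of the construction):

```python
import numpy as np
from scipy.linalg import hadamard

def init_centers(m: int, q: int) -> np.ndarray:
    """Initial class hash centers: the first m rows of a q x q Hadamard matrix.
    Entries are +/-1 and the rows are mutually orthogonal."""
    assert q >= m and q & (q - 1) == 0, "hadamard() needs q to be a power of two"
    H_q = hadamard(q)                  # q x q matrix with entries in {-1, 1}
    return H_q[:m].astype(np.float64)  # C = {c_1, ..., c_m}
```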
We now optimize the hash centers to maximize the distances between class hash centers. We follow [25], which optimizes the class hash centers under the constraint that the Hamming distance between any two centers is no less than a minimum distance \(d\), and uses the Gilbert-Varshamov bound [23] to obtain a large \(d\). Each hash center corresponds to one class. Given samples from \(m\) classes, we optimize the \(m\) hash centers \(C=\{c_1, c_2,\dots , c_m\}\) by maximizing the objective under the constraint shown in Eq. 4:
$$\begin{aligned} \begin{aligned}&\max _{c_1,..., c_m \in \{-1,1\}^{q}}\frac{1}{m(m-1)} \sum _{i}\sum _{j: j \ne i}\left\| c_i - c_j \right\| _H \\&\text{ s.t. } \left\| c_i - c_j \right\| _H \ge d(1 \le i,j \le m, j \ne i), \end{aligned} \end{aligned}$$
(4)
where \(q\) is the hash code length, \(d\) is the minimal distance, and \(\Vert \cdot \Vert _H\) denotes the Hamming distance. We use the Gilbert-Varshamov bound [23] to determine a large minimal distance \(d\) while keeping the optimization feasible. Specifically, there exist \(m\) \(q\)-bit codes whose pairwise Hamming distance is at least \(d\) as long as \(m\), \(q\) and \(d\) satisfy \(\sum _{i=0}^{d-1}\left( \begin{array}{c} q \\ i \end{array}\right) \le \frac{2^{q}}{m}\). Hence, to obtain a large \(d\), we take the maximum \(d\) that satisfies this condition. Since \(f(d)= \sum _{i=0}^{d}\left( \begin{array}{c} q \\ i \end{array}\right)\) is monotonically increasing in \(d\), the maximal feasible \(d^{\star }\) is characterized by
$$\begin{aligned} \left\{ \begin{array}{l} \sum _{i=0}^{d^{\star }-1}\left( \begin{array}{l} q \\ i \end{array}\right) \le \frac{2^{q}}{m} \\ \sum _{i=0}^{d^{\star }}\left( \begin{array}{l} q \\ i \end{array}\right) > \frac{2^{q}}{m} \end{array}\right. \end{aligned}$$
(5)
Since \(d^{\star }\) is an integer in \(\{1, 2, \dots , q\}\), we can find its value by exhaustive search. This choice makes the hash centers of different categories as far apart as possible.
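A minimal sketch of the exhaustive search for \(d^{\star }\) in Eq. 5 (Python; `math.comb` assumed):

```python
from math import comb

def max_min_distance(m: int, q: int) -> int:
    """Largest d such that m * sum_{i=0}^{d-1} C(q, i) <= 2^q; by the
    Gilbert-Varshamov bound (Eq. 5) this guarantees m q-bit codes whose
    pairwise Hamming distance is at least d."""
    d_star = 1
    for d in range(1, q + 1):
        if m * sum(comb(q, i) for i in range(d)) <= 2 ** q:
            d_star = d
        else:
            break
    return d_star

# Example: 24 classes and 64-bit codes.
# print(max_min_distance(24, 64))
```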
We further leverage the following facts to improve the implementation of Eq. 4.
1. Up to an affine transformation, the Hamming distance \(\left\| c_i - c_j \right\| _H\) is equivalent to \(-c_{i}^{T} c_{j}\); hence maximizing the Hamming distance is equivalent to minimizing the inner product [26].
2. \(4\left\| c_i-c_j \right\| _H=\left\| c_i - c_j \right\| _2^2= c_{i}^{T} c_{i}+c_{j}^{T} c_{j}-2c_{i}^{T} c_{j}=2q - 2c_{i}^{T}c_{j}\) [25], where \(\Vert \cdot \Vert _2\) is the \(\ell _2\) norm.
3. Bounding the inner product of two hash centers from above enforces the minimal Hamming distance \(d\): by fact 2, the constraint \(\left\| c_i-c_j \right\| _H \ge d\) is equivalent to \(c_{i}^{T} c_{j} \le q-2d\).
Thus, Eq. 4 can be simplified as follows:
$$\begin{aligned} \begin{aligned} \min _{c_1, c_2,..., c_m \in \{-1,1\}^{q}}&\sum _{i}\sum _{j: j \ne i} c_{i}^{T}c_{j} \\ \text{ s.t. }&c_{i}^{T} c_{j} \le (q-2 d)\\&(1 \le i, j \le m, i \ne j). \end{aligned} \end{aligned}$$
(6)
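A quick numeric check of facts 1-3 with randomly drawn \(\pm 1\) codes (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = 32, 10
ci = rng.choice([-1.0, 1.0], size=q)
cj = rng.choice([-1.0, 1.0], size=q)

hamming = np.sum(ci != cj)                        # ||c_i - c_j||_H
assert 4 * hamming == np.sum((ci - cj) ** 2)      # fact 2: 4 * Hamming = squared l2 distance
assert 4 * hamming == 2 * q - 2 * ci @ cj         # fact 2: ... = 2q - 2 c_i^T c_j
# fact 3: ||c_i - c_j||_H >= d  <=>  c_i^T c_j <= q - 2d
assert (hamming >= d) == (ci @ cj <= q - 2 * d)
```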
To simplify the implementation, we adopt an alternating optimization procedure that updates one hash center \(c_i\) at a time while keeping the other centers \(c_j\ (1 \le j \le m, j \ne i)\) fixed. Specifically, when all \(c_j\ (j\ne i)\) are fixed, the subproblem w.r.t. \(c_i\) can be formulated as:
$$\begin{aligned} \begin{aligned}&\min _{c_{i} \in \{-1,1\}^{q}}\sum _{j: j \ne i}c_{i}^{T} c_{j} \\&\text{ s.t. } c_{i}^{T} c_{j} \le q-2d (1 \le j \le m, j \ne i), \end{aligned} \end{aligned}$$
(7)
To solve the subproblem in Eq. 7, we utilize the \(\ell _{p}\)-box algorithm proposed in [29], which shows that the binary constraint \(z\in \{-1,1\}^q\) is equivalent to \(z\in [-1,1]^{q} \bigcap \left\{ z:\Vert z\Vert _{p}^{p}=q\right\}\); for the proof we refer the reader to [25]. We set \(p = 2\) for simplicity: within the box \([-1,1]^{q}\) every coordinate satisfies \(z_i^2 \le 1\), so \(\Vert z\Vert _2^2 = q\) forces \(z_i^2 = 1\) for all \(i\). Replacing the binary constraint accordingly, we reformulate Eq. 7 into the following equivalent form.
$$\begin{aligned} \begin{aligned} \min _{c_{i}, z_{1}, z_{2}}&\sum _{j: j \ne i} c_{i}^{T} c_{j} \\ \text{ s.t. }&c_{i}^{T} c_{j} \le (q-2 d) \\&(1 \le j \le m, i \ne j) \\&c_{i}= z_{1}, c_{i} = z_{2}, z_{1} \in {\mathcal {S}}_{b}, z_{2} \in {\mathcal {S}}_{p}, \end{aligned} \end{aligned}$$
(8)
where \(S_b=\left\{ z:-1_q \le z \le 1_q \right\}\) and \(S_p=\left\{ z:\left\| z\right\| ^2_2=q \right\}\) are the box and sphere sets of the \(\ell _2\)-box reformulation above, and \(1_q\) denotes the \(q\)-dimensional all-ones vector.
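A tiny numeric illustration of the \(p=2\) case (numpy): inside the box \([-1,1]^q\), only \(\pm 1\) vectors can reach \(\Vert z\Vert _2^2=q\).

```python
import numpy as np

q = 16
rng = np.random.default_rng(1)

z_binary = rng.choice([-1.0, 1.0], size=q)            # a point of {-1, 1}^q
z_inside = np.clip(rng.normal(size=q), -0.9, 0.9)     # a point strictly inside the box

# Binary vectors lie in the box S_b and on the sphere S_p = {z : ||z||_2^2 = q} ...
assert np.all(np.abs(z_binary) <= 1) and np.isclose(z_binary @ z_binary, q)
# ... while a point strictly inside the box cannot reach ||z||_2^2 = q.
assert z_inside @ z_inside < q
```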
The inequality constraint can further be replaced by an equality constraint through an auxiliary variable \(z_3\): the constraints \(c_{i}^{T} c_{j} \le (q-2 d)\) are equivalent to \(c_{i}^{T}C_{\sim i}+z_{3} =(q-2 d)1_{m-1}\), where \(C_{\sim i}=[c_1, c_2,\dots ,c_{i-1}, c_{i+1},\dots , c_m]\), \(z_3\in R_{+}^{m-1}\) and \(R_{+}^{m-1}=\left\{ z: z \in [0,+\infty )^{m-1}\right\}\). This further reformulates the problem in Eq. 8 as:
$$\begin{aligned} \begin{aligned}&\min _{c_{i}, z_{1}, z_{2}, z_{3}}\sum _{j: j \ne i}c_{i}^{T} c_{j} \\&\text{ s.t. } c_{i}^{T}C_{\sim i}+z_{3} =(q-2 d)1_{m-1}\\&\quad (1 \le j \le m, i \ne j)\\& \quad c_{i} = z_{1}, c_{i} =z_{2}, z_{1} \in {\mathcal {S}}_{b}, z_{2} \in {\mathcal {S}}_{p},z_{3} \in R_{+}^{m-1}. \end{aligned} \end{aligned}$$
(9)
The augmented Lagrangian of Eq. 9 is:
$$\begin{aligned} \begin{aligned}&L\left( c_{i}, z_{1}, z_{2}, z_{3}, y_{1}, y_{2}, y_{3}\right) = \sum _{j \ne i} c_{i}^{T} c_{j} \\&\quad + y_{1}^{T}\left( c_{i}-z_{1}\right) + \frac{\mu }{2}\left\| c_{i}-z_{1}\right\| _{2}^{2} \\&\quad + y_{2}^{T}\left( c_{i}-z_{2}\right) + \frac{\mu }{2}\left\| c_{i}-z_{2}\right\| _{2}^{2} \\&\quad + y_{3}^{T}\left( c_{i}^{T} C_{\sim i}+z_{3}-e\right) \\&\quad + \frac{\mu }{2}\left\| c_{i}^{T} C_{\sim i}+z_{3}-e\right\| _{2}^{2} \\&\quad \text{ s.t. } z_{1} \in S_{b}, z_{2} \in S_{p}, z_{3} \in R_{+}^{m-1}, \\&e = (q-2d)1_{m-1} \end{aligned} \end{aligned}$$
(10)
where \(y_1, y_2, y_3\) are Lagrange multipliers. We update the variables \(c_i,z_1,z_2,z_3\) as follows.
1. Update \(c_i\): We update \(c_i\) by fixing all other variables. The subproblem of \(L\) in Eq. 10 w.r.t. \(c_i\) is then an unconstrained objective, and the gradient of \(L\) w.r.t. \(c_i\) is
$$\begin{aligned} \begin{aligned} \frac{\partial L\left( c_{i}\right) }{\partial c_{i}}&=2\mu c_{i}+\mu C_{\sim i}C_{\sim i}^{T}c_i + \sum _{j\ne i}c_j \\&\quad + y_1 + y_2 + C_{\sim i} y_3 \\&\quad - \mu (z_1 + z_2 + C_{\sim i}e - C_{\sim i} z_3) \end{aligned} \end{aligned}$$
(11)
Setting the gradient in Eq. 11 to zero, we update \(c_i\) by
$$\begin{aligned}&c_{i}\leftarrow (2\mu I_{q}+ \mu C_{\sim i}C_{\sim i}^{T})^{-1}\nonumber \\&\quad \big (\mu (z_1+z_2+C_{\sim i}e-C_{\sim i}z_3)\nonumber \\&\quad - \sum _{j\ne i}c_j-y_1-y_2-C_{\sim i}y_3\big ) \end{aligned}$$
(12)
2. Update \(z_1,z_2,z_3\): The subproblem of \(L\) in Eq. 10 w.r.t. \(z_1,z_2,z_3\) is:
$$\begin{aligned} \begin{aligned} \left\{ \begin{aligned} L\left( z_{1}\right)&= y_{1}^{T}\left( c_{i}-z_{1}\right) +\frac{\mu }{2}\left\| c_{i}-z_{1}\right\| _{2}^{2} \\ L\left( z_{2}\right)&= y_{2}^{T}\left( c_{i}-z_{2}\right) +\frac{\mu }{2}\left\| c_{i}-z_{2}\right\| _{2}^{2} \\ L\left( z_{3}\right)&= y_{3}^{T}\left( c_{i}^{T} C_{\sim i}+z_{3}-e\right) \\&\quad +\frac{\mu }{2}\left\| c_{i}^{T} C_{\sim i}+z_{3}-e\right\| _{2}^{2} \\&\text{ s.t. } z_{1} \in S_{b},z_{2} \in S_{p},z_{3} \in R_{+}^{m-1} \end{aligned}\right. \end{aligned} \end{aligned}$$
(13)
Setting the gradients in Eq. 13 to zero and projecting onto the corresponding feasible sets, we update \(z_1,z_2,z_3\) by
$$\begin{aligned} \begin{aligned} \left\{ \begin{array}{l} z_{1} \leftarrow P_{S_{b}}\left( c_{i}+\frac{1}{\mu } y_{1}\right) \\ z_{2} \leftarrow P_{S_{p}}\left( c_{i}+\frac{1}{\mu } y_{2}\right) \\ z_{3} \leftarrow P_{R_{+}^{m-1}}\left( e-c_{i}^{T} C_{\sim i}-\frac{1}{\mu } y_{3}\right) \end{array}\right. \end{aligned} \end{aligned}$$
(14)
Following [29], we project \(z_1,z_2,z_3\) onto \(S_{b},S_{p},R_{+}^{m-1}\), respectively. All of these projections have closed-form solutions:
$$\begin{aligned} \left\{ \begin{array}{l} z_1 \leftarrow min(1,max(-1,c_i+\frac{1}{\mu } y_1)) \\ z_2 \leftarrow \sqrt{q} \frac{c_i+\frac{1}{\mu } y_2}{\left\| c_i+\frac{1}{\mu } y_2 \right\| _2} \\ z_3 \leftarrow max(0,e-c_i^T C_{\sim i}-\frac{1}{\mu } y_3) \end{array} \right. \end{aligned}$$
(15)
3. Update \(y_1,y_2,y_3\): The Lagrange multipliers \(y_1, y_2\) and \(y_3\) are updated by
$$\begin{aligned} \left\{ \begin{array}{l} y_1 \leftarrow y_1 + \mu (c_i-z_1) \\ y_2 \leftarrow y_2 + \mu (c_i-z_2) \\ y_3 \leftarrow y_3 + \mu (c_i^T C_{\sim i}+z_3-e) \end{array} \right. \end{aligned}$$
(16)
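Putting Eqs. 12, 15 and 16 together, the following is a minimal numpy sketch of the inner loop that refines one center \(c_i\) while the other centers are fixed. The value of \(\mu\), the number of iterations, the random initialization, and the final binarization of \(c_i\) are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def refine_center(C_rest: np.ndarray, q: int, d: int, mu: float = 1.0,
                  iters: int = 50, seed: int = 0) -> np.ndarray:
    """ADMM-style solve of the subproblem in Eq. 7 for one center c_i, with the
    other m-1 centers fixed as the columns of C_rest (shape q x (m-1)).
    Follows the updates of Eqs. 12, 15 and 16; a sketch, not the paper's code."""
    rng = np.random.default_rng(seed)
    m1 = C_rest.shape[1]
    e = (q - 2 * d) * np.ones(m1)                        # right-hand side (q - 2d) 1_{m-1}

    c = rng.choice([-1.0, 1.0], size=q)                  # initial c_i
    z1, z2, z3 = c.copy(), c.copy(), np.zeros(m1)        # auxiliary variables
    y1, y2, y3 = np.zeros(q), np.zeros(q), np.zeros(m1)  # Lagrange multipliers

    sum_rest = C_rest.sum(axis=1)                        # sum_{j != i} c_j
    A = 2 * mu * np.eye(q) + mu * C_rest @ C_rest.T      # matrix inverted in Eq. 12

    for _ in range(iters):
        # Eq. 12: closed-form update of c_i
        rhs = (mu * (z1 + z2 + C_rest @ e - C_rest @ z3)
               - sum_rest - y1 - y2 - C_rest @ y3)
        c = np.linalg.solve(A, rhs)
        # Eq. 15: projections onto S_b, S_p and R_+^{m-1}
        z1 = np.clip(c + y1 / mu, -1.0, 1.0)
        v = c + y2 / mu
        z2 = np.sqrt(q) * v / np.linalg.norm(v)
        z3 = np.maximum(0.0, e - C_rest.T @ c - y3 / mu)
        # Eq. 16: update of the Lagrange multipliers
        y1 = y1 + mu * (c - z1)
        y2 = y2 + mu * (c - z2)
        y3 = y3 + mu * (C_rest.T @ c + z3 - e)
    return np.sign(c)                                    # binarize the refined center

# In the outer loop one would sweep i = 1, ..., m, refining each center in turn
# until the set of centers stops changing.
```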
Overall, we learn the class hash centers by solving the optimization above. Next, we describe the loss functions, based on the class hash centers and the pseudo-class labels, that improve the image hash codes \(B_I\) and text hash codes \(B_T\).