Privacy-Preserving Speaker Verification using Ranking-of-Element Hashing

Pattern Recognition · 2025

Hong-Hanh Nguyen-Le, Lam Tran, Dinh Song An Nguyen, Nhien-An Le-Khac, Thuc Nguyen

University College Dublin · Ohio State University · University of Science, HCMC, Vietnam

**Figure 1.** The overall architecture of the speaker verification system. During enrollment, raw voice references are pre-processed, passed through a Deep Feature Extractor to obtain voice embeddings, then protected by the RoE-based Template Protection module. During verification, the same pipeline produces a protected template that is compared via Soft Matching.

Abstract

Advances in automatic speaker recognition have led to the use of voice data in verification systems, which raises concerns about the security of storing voice templates in plaintext. In this paper, we propose a novel cancellable biometric scheme that does not require users to manage random matrices or tokens.

First, we pre-process the raw voice data and feed it into a deep feature extraction module to obtain embeddings. Next, we propose a hashing scheme, Ranking-of-Elements (RoE), which generates compact hashed codes by recording the number of elements whose values are lower than that of a random element. This approach captures more information from smaller-valued elements and prevents the adversary from guessing the ranking value through Attacks via Record Multiplicity. Lastly, we introduce a Soft Matching method to mitigate the variations in templates resulting from environmental noise. We evaluate the performance and security of our method on two datasets: TIMIT and VoxCeleb1.

Motivation

Biometric Template Protection (BTP) has gained attention as a promising approach to securing biometric embeddings. Voice-based BTP schemes must meet security properties including irreversibility, unlinkability, and revocability, while maintaining high recognition performance.

Existing methods like Winner-Take-All (WTA) hashing have several limitations:

Limitation 1

Sensitivity to High-Value Elements

WTA exclusively concentrates on identifying maximum values within the template. Larger values are more likely to be selected as output, leading to performance degradation and information loss from smaller-value elements.

Limitation 2

Vulnerability to Reversion Attacks

The local ranking of each element in the original vector is vulnerable to reversion attacks, which violates the security requirements of revocability and unlinkability.

Limitation 3

Dependency on User-Specific Data

State-of-the-art voice-based cancellable biometrics (CB) using WTA requires users to manage binary orthogonal matrices, which raises practical security and usability concerns.

Limitation 4

Large Hashed-Code Size

WTA-based approaches often require a large hashed-code size to preserve accuracy, resulting in higher storage cost.

Core Idea: RoE records the number of elements whose values are lower than that of a randomly selected element — giving every element an equal chance to be selected, eliminating information loss from smaller-valued elements, and preventing the adversary from guessing rankings through record multiplicity attacks.

Proposed System

The system comprises three main modules: Voice Processing, Deep Feature Extraction (RawNet3), and RoE-based Template Protection.

Ranking-of-Elements Hashing

RoE hashing introduces two key modifications to address WTA's limitations:

Equal selection probability: Instead of emphasizing the index of the largest element, RoE records the number of elements whose values are smaller than that of a selected element. Each element has an equal chance to be the output.
Random element selection within each window: To prevent the adversary from guessing the ranking value via Attacks via Record Multiplicity (ARM), RoE randomly selects an element within each $k$-sized window.

Given window size $k$, the RoE algorithm randomly selects an element $a$ within each $k$-sized window and records the count of elements whose values are lower than that of element $a$. This produces hashed codes in the range $[0, k-1]$.

**Figure 2.** Example of the RoE hashing scheme, where m = 3, p = 2, and k = 5. The process involves four steps: (1) Random permutation generation, (2) Hadamard product vector generation, (3) k-window construction, and (4) Index based on Ranking-of-Elements. The selected element's rank (count of smaller values) becomes the hashed code.
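The four steps listed for Figure 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `roe_hash`, the `seed` argument (standing in for however the system derives its permutations and selected positions), and the use of the first $k$ entries as the window (borrowed from common WTA/IoM-style constructions) are all hypothetical choices.

```python
import random

def roe_hash(x, m, p, k, seed):
    """Illustrative RoE-style hash: returns m codes, each in [0, k-1]."""
    if k > len(x):
        raise ValueError("window size k must not exceed the embedding length")
    rng = random.Random(seed)  # assumption: one seed drives all randomness
    n = len(x)
    codes = []
    for _ in range(m):
        # (1) + (2): draw p random permutations of x and take their
        # element-wise (Hadamard) product.
        prod = [1.0] * n
        for _ in range(p):
            perm = rng.sample(range(n), n)
            prod = [acc * x[j] for acc, j in zip(prod, perm)]
        # (3): assumed window construction, keep the first k entries.
        window = prod[:k]
        # (4): pick a random element a in the window and record how many
        # window entries are strictly smaller than a.
        a = window[rng.randrange(k)]
        codes.append(sum(1 for v in window if v < a))
    return codes

# Toy embedding with the Figure 2 parameters (m = 3, p = 2, k = 5).
embedding = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.26]
codes = roe_hash(embedding, m=3, p=2, k=5, seed=42)
print(codes)
```

Because only the relative order inside each window matters, scaling the embedding by a positive constant leaves the codes unchanged. In this sketch the same seed must be used at enrollment and verification so that both templates refer to the same permutations and selected positions.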

Soft Matching

Environmental factors like microphone quality and channel conditions can cause variations in voice features. The Soft Matching procedure incorporates a threshold t to provide robust tolerance for ranking changes:

$$\text{soft\_matching}(T^R, T^P) = 1 - \frac{\text{sum}(d)}{m}$$

where $d_i$ is softened: if $|d_i| \leq t$, then $d_i = \frac{|d_i|}{t+1}$; otherwise $d_i = 1$. This allows the system to tolerate small differences from environmental factors while still distinguishing between genuine users and impostors.
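The scoring rule above can be written as a short Python sketch. It assumes that $d_i$ is the difference between the $i$-th hashed codes of the two templates $T^R$ and $T^P$ and that $m$ is the number of codes; the function name and the toy inputs are hypothetical.

```python
def soft_matching(t_ref, t_probe, t):
    """Similarity in [0, 1] between two equal-length lists of hashed codes."""
    if len(t_ref) != len(t_probe):
        raise ValueError("templates must have the same length")
    m = len(t_ref)
    total = 0.0
    for r, q in zip(t_ref, t_probe):
        d = abs(r - q)  # assumed definition of |d_i|
        # Differences within the threshold are penalised only partially;
        # larger differences count as a full mismatch.
        total += d / (t + 1) if d <= t else 1.0
    return 1.0 - total / m

# Toy templates: differences are 0, 1, 4, 0 with t = 2.
score = soft_matching([3, 0, 4, 1], [3, 1, 0, 1], t=2)
print(score)
```

Under this reading, setting `t = 0` reduces the score to the fraction of codes that match exactly, so the threshold controls how much rank drift is forgiven.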

Main Results

EER Comparison (Recognition Performance)

RoE consistently outperforms WTA and IoM (Index-of-Maximum) across both datasets and all parameter settings.

Table 1 · EER (%) Comparison of WTA, URP-IoM, and RoE hashing (m = 512)

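| Method | VoxCeleb1 k=8 | VoxCeleb1 k=16 | VoxCeleb1 k=32 | VoxCeleb1 k=64 | TIMIT k=8 | TIMIT k=16 | TIMIT k=32 | TIMIT k=64 |
|---|---|---|---|---|---|---|---|---|
| WTA | 7.51 | 6.03 | 5.23 | 3.61 | 5.15 | 4.72 | 2.61 | 1.82 |
| IoM | 4.82 | 3.94 | 3.23 | 2.01 | 3.12 | 2.03 | 1.31 | 0.93 |
| RoE | 4.21 | 3.2 | 1.53 | 1.14 | 2.02 | 1.11 | 0.84 | 0.7 |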

RoE achieves the lowest EER across all configurations. Lower is better.

TMR Comparison (True Match Rate at FMR=0.1%)

Table 2 · TMR (%) Comparison of WTA, URP-IoM, and RoE hashing (m = 512)

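| Method | VoxCeleb1 k=8 | VoxCeleb1 k=16 | VoxCeleb1 k=32 | VoxCeleb1 k=64 | TIMIT k=8 | TIMIT k=16 | TIMIT k=32 | TIMIT k=64 |
|---|---|---|---|---|---|---|---|---|
| WTA | 85.16 | 86.18 | 87.45 | 88.51 | 86.27 | 87.19 | 88.62 | 89.10 |
| IoM | 89.07 | 89.84 | 90.75 | 91.59 | 89.89 | 90.57 | 91.38 | 94.18 |
| RoE | 93.61 | 94.67 | 96.23 | 96.78 | 95.63 | 96.24 | 97.01 | 98.47 |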

RoE achieves the highest TMR across all configurations. Higher is better.
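For readers unfamiliar with the two metrics reported in Tables 1 and 2, the sketch below shows one common way to estimate EER and TMR at a fixed FMR from lists of genuine and impostor similarity scores. The function names and toy scores are hypothetical, and the paper's evaluation protocol may differ in details such as threshold sweeping or interpolation.

```python
def rates_at(threshold, genuine, impostor):
    """True match rate and false match rate when accepting score >= threshold."""
    tmr = sum(s >= threshold for s in genuine) / len(genuine)
    fmr = sum(s >= threshold for s in impostor) / len(impostor)
    return tmr, fmr

def eer(genuine, impostor):
    """Error rate at the threshold where FMR and FNMR are closest."""
    best = None
    for th in sorted(set(genuine) | set(impostor)):
        tmr, fmr = rates_at(th, genuine, impostor)
        fnmr = 1.0 - tmr
        gap = abs(fmr - fnmr)
        if best is None or gap < best[0]:
            best = (gap, (fmr + fnmr) / 2)
    return best[1]

def tmr_at_fmr(genuine, impostor, target_fmr):
    """Highest TMR among observed thresholds whose FMR is <= target_fmr."""
    best_tmr = 0.0
    for th in sorted(set(genuine) | set(impostor)):
        tmr, fmr = rates_at(th, genuine, impostor)
        if fmr <= target_fmr:
            best_tmr = max(best_tmr, tmr)
    return best_tmr

# Toy scores (higher means more similar).
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
toy_eer = eer(genuine_scores, impostor_scores)
toy_tmr = tmr_at_fmr(genuine_scores, impostor_scores, target_fmr=0.0)
print(toy_eer, toy_tmr)
```

With real data, `target_fmr=0.001` corresponds to the FMR = 0.1% operating point used in Table 2.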

Impact of Soft Matching

Table 3 · Comparison with and without Soft Matching (m = 512, k = 128, t = 2)

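| Dataset | Method | EER (%), with Soft Matching | TMR (%), with Soft Matching | EER (%), without Soft Matching | TMR (%), without Soft Matching |
|---|---|---|---|---|---|
| VoxCeleb1 | WTA | 1.83 | 90.27 | 9.34 | 79.26 |
| VoxCeleb1 | IoM | 1.21 | 92.76 | 8.26 | 81.24 |
| VoxCeleb1 | RoE | 0.92 | 97.40 | 7.56 | 85.68 |
| TIMIT | WTA | 1.32 | 91.07 | 8.75 | 82.05 |
| TIMIT | IoM | 0.61 | 95.62 | 7.60 | 83.48 |
| TIMIT | RoE | 0.53 | 98.81 | 5.98 | 84.91 |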

For RoE, Soft Matching reduces EER by 6.64 percentage points on VoxCeleb1 and 5.45 on TIMIT, while increasing TMR by 11.72 and 13.90 percentage points, respectively.
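The gains quoted above are absolute differences between the RoE rows of Table 3. The short check below reproduces the arithmetic; the dictionary layout is just an illustrative way to hold the table values.

```python
# RoE rows of Table 3: (EER %, TMR %) with and without Soft Matching.
table3_roe = {
    "VoxCeleb1": {"with": (0.92, 97.40), "without": (7.56, 85.68)},
    "TIMIT": {"with": (0.53, 98.81), "without": (5.98, 84.91)},
}

gains = {}
for dataset, rows in table3_roe.items():
    eer_with, tmr_with = rows["with"]
    eer_without, tmr_without = rows["without"]
    # Differences are in percentage points, rounded to two decimals.
    gains[dataset] = (round(eer_without - eer_with, 2),
                      round(tmr_with - tmr_without, 2))

print(gains)
```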

Security & Privacy Analysis

Extensive security and privacy analyses demonstrate that RoE fulfills the requirements specified in the ISO/IEC 24745 standard:

Security

Irreversibility

Computationally infeasible to reconstruct the original embedding from a protected template. With optimal parameters, a brute-force attack requires $2^{25088}$ attempts for $m = 512$.

Privacy

Unlinkability

Mated and non-mated score distributions almost overlap, with $D^{sys}_{\leftrightarrow} = 0.045$ on VoxCeleb1 and $D^{sys}_{\leftrightarrow} = 0.0225$ on TIMIT — satisfying the unlinkability requirement.

Security

Revocability

Pseudo-impostor and impostor score distributions completely overlap, indicating that newly issued cancellable voice templates are as dissimilar from the old ones as templates from different speakers.

Robustness

Resistance to ARM Attacks

Random element selection within each window requires adversaries to collect at least $2^7$ protected templates per individual. Combined with $\binom{256}{128} \approx 2^{252}$ feature component possibilities, the attack is infeasible in practice.
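The size of the binomial term can be checked directly with Python's standard library. This only verifies the quoted approximation; it does not model the attack itself, and the variable names are illustrative.

```python
import math

templates_needed = 2 ** 7               # minimum templates per individual
component_choices = math.comb(256, 128)  # ways to choose 128 of 256 components
bits = math.log2(component_choices)      # search-space size in bits

print(templates_needed, round(bits, 2))
```

The logarithm comes out just under 252, consistent with the $2^{252}$ approximation.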

Citation

If you find this work useful in your research, please consider citing:

@article{nguyenle2025roe,
  title     = {Privacy-preserving speaker verification system
               using Ranking-of-Element hashing},
  author    = {Nguyen-Le, Hong-Hanh and Tran, Lam
               and Nguyen, Dinh Song An
               and Le-Khac, Nhien-An and Nguyen, Thuc},
  journal   = {Pattern Recognition},
  volume    = {159},
  pages     = {111107},
  year      = {2025},
  publisher = {Elsevier}
}

Acknowledgments

This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6183. This publication has also been supported by DAC-lab@FIT, the Decentralized Applied Cryptography Lab of the Faculty of Information Technology, VNU-HCM University of Science.