Research projects and teams in PSST
Team 1: Defence
Fellow 1: Speech anonymisation for privacy-preserving emotion recognition
Hosts: AALTO (21 months) & RUB (24 months)
We will examine how non-verbal elements of speech help identify speakers and how these cues can be hidden, and develop efficient techniques to anonymize speech while keeping the speaker’s emotions intact. This is challenging because the same speech features convey both identity and emotion. We will therefore develop a new method to separate and resynthesize speech features using deep neural networks.
Partnering with Loihde, we will enhance and use an emotion recognition system to measure emotion detection accuracy in original and anonymized speech and an ASR system to measure speech recognition accuracy. This project builds on our previous work with variational information bottlenecks [1]. We will collaborate with other Fellows to separate speaker attributes, eliminate identifying cues, and evaluate our methods.
- Nelus, A. and Martin, R.: “Privacy-Preserving Audio Classification Using Variational Information Feature Extraction,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2864-2877, 2021
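The variational-information-bottleneck idea can be sketched numerically: an encoder outputs a mean and log-variance per utterance, a code is sampled via the reparameterisation trick, and a KL penalty bounds how much information (including identity cues) the code can carry. A minimal NumPy illustration with invented dimensions and names (not code from [1]):

```python
import numpy as np

def vib_bottleneck(mu, log_var, rng):
    """Sample a bottleneck code z ~ N(mu, sigma^2) with the
    reparameterisation trick, and return the KL penalty that limits
    how much information z can carry about the input speech."""
    sigma = np.exp(0.5 * log_var)
    z = mu + sigma * rng.standard_normal(mu.shape)
    # KL( N(mu, sigma^2) || N(0, I) ), summed over code dimensions
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - log_var, axis=-1)
    return z, kl

rng = np.random.default_rng(0)
# Encoder outputs for two utterances, 4-dimensional code each
mu = np.array([[0.0, 0.0, 0.0, 0.0],
               [1.0, -1.0, 0.5, 0.0]])
log_var = np.zeros_like(mu)   # unit variance

z, kl = vib_bottleneck(mu, log_var, rng)
# kl[0] = 0 (a code identical to the prior carries no information);
# kl[1] = 1.125 nats
```

In training, this KL term would be weighted against an emotion-recognition loss, so the bottleneck passes emotional content while squeezing out identity cues.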
Expected Results:
- A de-identification technique that maintains emotions (with AALTO and RUB).
- An improved emotion recognition system (with Loihde) and evaluation methods (with RUB and AALTO).
Supervisors: Tom Bäckström, Rainer Martin and Tuomas Lahtinen
Secondment: Loihde, adviser Tuomas Lahtinen (6 months).
Fellow 2: Disentangled representations for selective attribute suppression
Hosts: EURECOM (19 months) & RUB (26 months)
Current solutions for privacy through de-identification mainly suppress voice identity [1]. We aim to go beyond this with help from experts at EURECOM, RUB, and Orange to develop new methods for selective attribute suppression.
Working with Fellow 4 on resynthesizing anonymized attributes and Fellow 6 on stronger attack models, we use multi-task learning [2] to create speech representations that can handle various privacy-sensitive attributes, such as age, gender, dialect, health, and emotions. This approach offers users more control over their privacy in smart speech technology.
- Tomashenko, N., Wang, X., Vincent, E., Patino, J., Srivastava, B.M.L., Noé, P.-G., Nautsch, A. et al. “The VoicePrivacy 2020 Challenge: Results and findings.” Computer Speech & Language 74 (2022): 101362.
- Pascual, S. et al., ‘Learning Problem-Agnostic Speech Representations from Multiple Self-Supervised Tasks’, in Proc. Interspeech, 2019, 161–165
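To show what selective suppression could look like downstream of such a representation, here is a hedged sketch: we assume, purely for illustration, that the learned embedding is organised into per-attribute blocks, so suppression reduces to overwriting the chosen blocks with a neutral value. Block names and sizes are invented:

```python
import numpy as np

# Hypothetical layout of a disentangled utterance embedding:
# one contiguous block per attribute (sizes purely illustrative).
BLOCKS = {"content": slice(0, 8), "identity": slice(8, 12),
          "age": slice(12, 14), "emotion": slice(14, 16)}

def suppress(embedding, attrs, fill=0.0):
    """Return a copy of the embedding with the blocks of the selected
    privacy-sensitive attributes overwritten by a neutral value."""
    out = embedding.copy()
    for name in attrs:
        out[..., BLOCKS[name]] = fill
    return out

e = np.arange(16, dtype=float)
anon = suppress(e, ["identity", "age"])
# content and emotion blocks are untouched; identity and age are erased
```

A user-facing control surface would then map each block to a disclose/suppress toggle, which is the kind of per-attribute choice the project targets.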
Expected Results:
- The first privacy-enhancing technology (with RUB) that lets users control their privacy.
- Users can choose which attributes to disclose or suppress, with feedback on the impact based on performance assessments from AALTO and Orange.
Supervisors: Nicholas Evans, Rainer Martin and Nicolas Gengembre
Secondment: Orange, adviser Nicolas Gengembre (6 months).
Fellow 3: Protection against deepfakes in speech
Hosts: INESC ID (32 months) & AALTO (13 months)
Speech technologies can create synthetic voices that sound genuine, posing a misuse risk comparable to deepfake images and videos. This concern has driven efforts to detect synthetic speech, such as the ASVspoof challenge [1]. Another line of research protects against voice cloning attacks, either by applying adversarial perturbations that prevent voice synthesis or by embedding watermarks to trace the origin of speech signals [2-4].
We aim to enhance both protection methods: watermarking (building on AALTO’s work) and adversarial perturbations (leveraging INESC ID’s expertise) [5]. Our methods will be tested in real-world scenarios with VoiceMod and in collaboration with Fellows 5, 9, and 11.
- Delgado, Héctor, et al. “ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan.” (2021)
- Wang, Yuanda, et al. “VSMask: Defending Against Voice Synthesis Attack via Real-Time Predictive Perturbation.” arXiv preprint arXiv:2305.05736 (2023).
- Juvela, Lauri, and Xin Wang. “Collaborative Watermarking for Adversarial Speech Synthesis.” arXiv preprint arXiv:2309.15224 (2023).
- Chen, Guangyu, et al. “WavMark: Watermarking for Audio Generation.” arXiv preprint arXiv:2308.12770 (2023).
- Shamsabadi, Ali Shahin, et al. “FoolHD: Fooling speaker identification by highly imperceptible adversarial disturbances.” In Proc. ICASSP 2021, IEEE, 2021.
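To make the watermarking route concrete, here is a minimal spread-spectrum sketch in NumPy (our own toy example, not the methods of [3] or [4]): a low-amplitude carrier derived from a secret key is added to the signal, and detection correlates the signal against the same keyed carrier.

```python
import numpy as np

def keyed_carrier(key, n):
    """±1 pseudo-random carrier derived from a secret key."""
    return np.random.default_rng(key).choice([-1.0, 1.0], size=n)

def embed(signal, key, strength=0.05):
    """Spread-spectrum embedding: add a low-amplitude keyed carrier."""
    return signal + strength * keyed_carrier(key, len(signal))

def detect(signal, key):
    """Normalised correlation with the keyed carrier; a value near the
    embedding strength indicates the watermark is present."""
    return float(signal @ keyed_carrier(key, len(signal))) / len(signal)

rng = np.random.default_rng(7)
speech = rng.standard_normal(16000)     # stand-in for 1 s of audio
marked = embed(speech, key=1234)
# the correct key correlates strongly; a wrong key does not
```

Real audio watermarks must additionally survive compression, resampling, and deliberate removal, which is where the learned approaches of [3, 4] come in.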
Expected results: Watermarking (with AALTO) and perturbation (with INESC-ID) frameworks for privacy-enhanced speech data, tested in real-world settings (with VoiceMod).
Supervisors: Isabel Trancoso, Lauri Juvela and Pritish Chandna
Secondment: VoiceMod, adviser Pritish Chandna (6 months).
Fellow 4: Transparent exchange of speaker attributes
Hosts: SRU (33 months) & AALTO (12 months)
Fellow 4 will focus on transparent speaker attribute exchange, working with Fellow 2 on protecting attributes that are hard for people to perceive. Transparent exchange involves converting one voice attribute (e.g., gender, age, emotion) into another in an understandable way (e.g., changing pitch, speed, loudness, pronunciation).
This exchange can protect people’s attributes in recorded speech or hide attributes in real time while speakers retain their own voices. The modified voices will be tested for clarity, naturalness, ease of use, and effectiveness against attribute inference attacks.
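A deliberately simple example of a “transparent” modification of the kind described above is uniform resampling, which changes speed and pitch together in a way users can readily understand. This is an illustrative sketch, not the project’s method:

```python
import numpy as np

def change_speed(signal, factor):
    """Resample by linear interpolation: factor > 1 plays the voice
    faster and raises its pitch; factor < 1 slows and lowers it."""
    n_out = int(len(signal) / factor)
    t = np.arange(n_out) * factor
    return np.interp(t, np.arange(len(signal)), signal)

sr = 16000
time = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * time)   # a 220 Hz stand-in "voice"
shifted = change_speed(tone, 1.5)         # sounds at ~330 Hz, is shorter
```

The user can be told exactly what happened (“played 1.5× faster, pitch raised accordingly”), which is the understandability the project asks of its conversions.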
Expected Results: Techniques to convert voices (e.g., changing the speaker’s gender) in understandable terms, tested with help from EURECOM, VoiceInteraction, AALTO, and TUB.
Supervisors: Martha Larson, Tom Bäckström and Carlos Mendes
Secondment: VoiceInteraction, adviser Carlos Mendes (6 months).
Team 2: Attack
Fellow 5: Revealing social relationships in conversations
Hosts: SRU (25 months) & INESC-ID (20 months)
When people talk, privacy can be compromised beyond the audio itself. Conversations can reveal social relationships, such as whether the participants in a business meeting are related or who holds more authority. This information is carried by turn-taking, word choice, and speech style and patterns. Publishing such data may therefore leak sensitive information, yet this area remains under-researched. This project aims to identify these revealing factors and find ways to detect and hide them using natural language processing and audio analysis.
Fellow 5 will work with Fellow 4 to develop methods to conceal this information defensively and with Fellow 3 to watermark edited recordings to differentiate them from originals. Fellows 8 and 12 will help introduce evaluation metrics for this task.
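As a minimal example of the kind of conversational signal involved, turn-taking statistics alone can already hint at authority asymmetries. A sketch with invented data:

```python
from collections import Counter

def turn_stats(turns):
    """turns: list of (speaker, word_count) pairs in order.
    Returns per speaker: (number of turns, share of all words) --
    crude proxies for conversational dominance."""
    n_turns, n_words = Counter(), Counter()
    for speaker, words in turns:
        n_turns[speaker] += 1
        n_words[speaker] += words
    total = sum(n_words.values())
    return {s: (n_turns[s], n_words[s] / total) for s in n_turns}

meeting = [("A", 40), ("B", 5), ("A", 35), ("C", 8), ("A", 30), ("B", 2)]
stats = turn_stats(meeting)
# A holds 3 of 6 turns and 87.5% of the words -- a pattern
# consistent with an authority asymmetry among the participants
```

Concealing such cues while keeping meaning intact is harder than it looks, since redistributing turns or rephrasing utterances changes the conversation itself.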
Expected Results: An approach to analyze conversations, identify words and speech patterns revealing social relationships, retrieve this information, and create a privacy-protected version of the conversation that keeps the meaning intact, with contributions from SRU, ELDA, INESC ID, and AALTO.
Supervisors: Martha Larson, Alberto Abad, Khalid Choukri, Victoria Arranz, and Marwa Hadj Salah
Secondment: ELDA, advisers Khalid Choukri, Victoria Arranz, and Marwa Hadj Salah (6 months).
Fellow 6: Robust attack models and tools for the credible evaluation of anonymisation and attribute suppression
Hosts: EURECOM (24 months) & TUB (21 months)
Current methods for evaluating anonymization [1] often rely on automatic speaker verification models that can be overly optimistic. Fellow 6 will develop stronger attack models and metrics that analyze speech signals at multiple levels. This work aims to show that privacy-sensitive attributes can still be re-identified, demonstrating that existing privacy protections may fall short.
Using robust attack models, we can improve anonymization and attribute suppression through techniques like adversarial training. Fellow 6 will work with Fellows 1, 2, and 9 to use these attack models to enhance defenses, apply them to legal frameworks with Fellow 7, and improve evaluation metrics with Fellow 12.
- Delgado, Héctor, et al. “ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan.” (2021)
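Attack strength in such evaluations is typically summarised by the equal error rate (EER) of the attacker’s speaker verification system. A minimal NumPy sketch of EER computation (a simple threshold sweep, not an official toolkit implementation):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate of an attacker's speaker verification system:
    the operating point where the miss rate equals the false-alarm
    rate. Lower EER = stronger attack = weaker anonymisation."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return (miss[i] + fa[i]) / 2.0

# perfectly separable scores -> EER 0; indistinguishable -> EER 0.5
easy = eer(np.array([2.0, 2.5, 3.0]), np.array([-1.0, 0.0, 0.5]))
blind = eer(np.array([0.0, 1.0]), np.array([0.0, 1.0]))
```

An optimistic evaluation reports a high EER (near 0.5) for an anonymisation system; a stronger attack model may drive the EER back down, exposing residual speaker information.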
Expected Results: A set of open-source tools for credible anonymization and attribute suppression evaluation using robust attack models. These tools will enhance future VoicePrivacy Challenge evaluations, with contributions from TUB, Omilia, RUB, and INRIA.
Supervisors: Nicholas Evans, Dorothea Kolossa and Themos Stafylakis
Secondment: Omilia, adviser Themos Stafylakis (6 months).
Fellow 7: Privacy impact assessment for comprehensive attacks exploiting audio, speech, and metadata
Hosts: INRIA Nancy (31 months) & SRU (14 months)
Recital 26 of the GDPR states that “to determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used”. Fellow 7 will go beyond standard utterance-level speaker verification attacks and explore additional attacks that exploit spoken content [1], metadata (e.g., IP address), or external data (e.g., social media) [2], following the Article 29 Data Protection Working Party’s Opinion 05/2014 on Anonymisation Techniques, which distinguishes linkability, singling-out, and inference attacks [3]. These attacks will be combined into more complex forms and assessed for the increased privacy threats they pose. The study will also consider re-identifying speaker groups and inferring sensitive traits without re-identification, both of which the GDPR recognizes as privacy threats.
We will collaborate with Fellows 10 and 4 on the privacy impact assessment (PIA) of two concrete use cases (speech-affecting diseases and meetings), and with Fellows 6 and 8, who will in parallel develop improved utterance-level attacks and theoretical metrics, respectively.
- P. Lison, I. Pilán, D. Sánchez, M. Batet, and L. Øvrelid, “Anonymisation models for text data: State of the art, challenges and future directions”, in Proc. ACL-IJCNLP. pp. 4188–4203, 2021.
- D. Romanini, S. Lehmann, and M. Kivelä, “Privacy and uniqueness of neighborhoods in social networks”, Scientific Reports 11:20104, 2021.
- Article 29 Data Protection Working Party, Opinion 05/2014 on Anonymisation Techniques: https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
Expected Results:
- Qualitative and quantitative assessment of privacy risks from complex attacks beyond standard speaker verification (with EURECOM).
- Proposal for a standardized PIA procedure for speech data (with SRU, CNIL, INESC ID).
Supervisors: Emmanuel Vincent, Martha Larson and Vincent Toubiana
Secondment: CNIL, adviser Vincent Toubiana (6 months).
Fellow 8: Attacking information bottlenecks – Theoretical metrics and bounds of privacy
Hosts: AALTO (23 months) & INRIA Saclay (22 months)
Current speech anonymization approaches lack transparency and provability. Key questions include how to provide evidence for anonymization performance and the best theoretical performance metric. Neural networks used for anonymization rely on information bottlenecks to allow desired information through while discarding private data [1]. We aim to develop theories to analyze this information content based on bottleneck size and neural network complexity.
We will use differential privacy [2] and quantitative information flow [3] theory to measure the performance by the accuracy of desired versus private information. Collaborating with Fellow 7, we will apply this methodology to various speech characteristics such as emotion (with Fellow 1), linguistic content (with Fellow 5), and model updates (with Fellow 11).
- R. Aloufi, H. Haddadi, and D. Boyle, “Configurable privacy-preserving automatic speech recognition”, in Proc. Interspeech, pp. 861-865, 2021.
- C. Dwork, “Differential privacy”, in International Colloquium on Automata, Languages, and Programming, pp. 1-12, 2006.
- N. Fernandes, A. McIver, and P. Sadeghi, “Explaining ε in local differential privacy through the lens of quantitative information flow”, in Proc. IEEE 37th Computer Security Foundations Symposium (CSF), pp. 419-432, 2024.
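To illustrate how differential privacy [2] trades accuracy of desired information against protection of private data, here is a classic randomised-response sketch (our own minimal example, not the project’s mechanism): each private bit is flipped with a probability set by ε, yet the population rate remains estimable.

```python
import numpy as np

def rr_release(bits, eps, rng):
    """ε-differentially private release of one private bit per user
    (randomised response): keep a bit with probability e^ε/(e^ε+1),
    flip it otherwise."""
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)
    flip = rng.random(bits.size) >= p_keep
    return np.where(flip, 1 - bits, bits)

def rr_estimate(reports, eps):
    """Invert the flipping bias to estimate the true population rate."""
    p = np.exp(eps) / (np.exp(eps) + 1.0)
    return (reports.mean() - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
trait = (rng.random(50_000) < 0.3).astype(int)  # 30% have a private trait
noisy = rr_release(trait, eps=1.0, rng=rng)
est = rr_estimate(noisy, eps=1.0)
# each individual report is deniable, yet the aggregate survives: est ≈ 0.3
```

The same accounting, applied to a neural bottleneck rather than a single bit, is what a theory of bottleneck size versus leaked private information must deliver.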
Expected Results:
- Define anonymization evidence and performance metrics (with INRIA, TUB) using differential privacy, balancing computational complexity (with Loihde) and bottleneck entropy.
- Experimental and theoretical analysis of information in different voice aspects.
Supervisors: Tom Bäckström, Catuscia Palamidessi and Tuomas Lahtinen.
Secondment: Loihde, adviser Tuomas Lahtinen (6 months).
Team 3: Utility
Fellow 9: Robust privacy-preserving industrial voice interfaces
Hosts: RUB (22 months) & AALTO (23 months)
Fellow 9 will work with VIC on privacy in voice-controlled interfaces for noisy industrial and medical environments using touch-free headsets and distant microphones. Using our privacy-preserving wake word verification [1] and the SpeechBrain toolkit [2], we will develop a new wake word detection system for multiple connected acoustic sensors with limited computational power. This system will process raw audio locally and send privacy-protected information to a cloud service for sensor fusion, user authentication, noise reduction, and speech recognition.
The project aims to create privacy-preserving audio representations that still allow flexible server-based processing. Fellow 9 will cooperate with Fellow 3 on authentication methods based on watermarking, Fellow 11 on distributed architectures, and Fellow 6 to ensure that the attack models and evaluation methods used are realistic.
- Koppelmann, T. et al. (2021): Privacy-Preserving Feature Extraction for Cloud-Based Wake Word Verification, in Proc. INTERSPEECH, 2021
- https://speechbrain.github.io/
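A toy sketch of the local-processing principle: raw audio stays on the device, and only frames flagged by a simple on-device detector are released. The energy detector below is a stand-in for the wake word verification of [1]; thresholds and frame sizes are invented:

```python
import numpy as np

def local_gate(audio, threshold=0.01, frame=400):
    """On-device gating: split audio into frames, compute energy
    locally, and release only frames where the detector fires, so
    background audio never leaves the device."""
    n = len(audio) // frame
    frames = audio[:n * frame].reshape(n, frame)
    energy = (frames ** 2).mean(axis=1)
    keep = energy > threshold
    return frames[keep], keep

rng = np.random.default_rng(1)
background = 0.001 * rng.standard_normal(1600)   # four quiet frames
command = 0.5 * rng.standard_normal(400)         # one loud frame
audio = np.concatenate([background, command])
sent, mask = local_gate(audio)
# only the final, loud frame is released to the cloud service
```

In the actual system, what is released would itself be a privacy-protected representation rather than raw frames, so that cloud-side sensor fusion and recognition remain possible without exposing the audio.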
Expected Results:
- Develop a robust and privacy-preserving voice-controlled interface using multiple microphones.
- Improve acoustic processing with multiple sensors and sensor fusion (with RUB).
- Create new authentication methods for distributed device scenarios (with AALTO, INESC ID, VIC).
Supervisors: Rainer Martin, Lauri Juvela and Diane Hirschfeld
Secondment: VIC, adviser Diane Hirschfeld (6 months).
Fellow 10: Detection of speech-affecting diseases in anonymized speech
Hosts: INESC ID (20 months) & TUB (25 months)
State-of-the-art de-identification solutions [1] aim to remove all speaker information while keeping linguistic content. However, certain speaker traits are useful for detecting speech-affecting diseases. Anonymized speech retaining these traits can help diagnose and monitor such diseases while protecting patient identity.
Fellow 10, working with Fellow 2, will identify feature sets related to speech-affecting diseases and to speaker identity. These sets will be used in adversarial training to preserve health-related cues while removing identifying ones. The anonymized samples will be evaluated both objectively and subjectively.
- Delgado, Héctor, et al. “ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan.” (2021)
Expected Results: Develop anonymization technology (with EURECOM and RUB) that preserves features for diagnosing speech-affecting diseases (with INESC ID and TUB) while preventing identification of the speaker.
Supervisors: Alberto Abad, Sebastian Möller and Johannes Tröger
Secondment: ki:elements, adviser Johannes Tröger (6 months).
Fellow 11: Utility of speech samples as privacy-preserving, transparent and reusable model updates for distributed learning
Hosts: AALTO (16 months) & EURECOM (29 months)
Distributed learning (e.g., federated learning) protects privacy by sharing only model updates, not the private training data. However, implementing and evaluating such systems is difficult due to model-specific requirements. Fellow 11 will explore using short speech samples to represent model updates [1], making the updates resource-efficient, transparent, and model-agnostic, and thus amenable to conventional speech coding and privacy evaluation.
Initially, we will approximate updates with speech samples and later derive these samples using model inversion techniques [2]. These updates will be privacy-protected with Fellow 3, implemented in a distributed wake word scenario with Fellow 9, and evaluated with Fellows 8 and 12.
- Maouche, M., Srivastava, B. M. L., Vauquier, N., Bellet, A., Tommasi, M., and Vincent, E. “Enhancing speech privacy with slicing”, in Proc. Interspeech, 2022.
- Yang, Z., Zhang, J., Chang, E. C., & Liang, Z. (2019, November). Neural network inversion in adversarial setting via background knowledge alignment. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (pp. 225-240).
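The sample-selection idea can be sketched as a nearest-gradient search: among candidate speech samples, pick the one whose gradient best approximates the true model update, and share that sample instead of the raw parameters. A toy NumPy example with invented two-dimensional “gradients”:

```python
import numpy as np

def select_sample(target_update, candidate_grads):
    """Pick the candidate speech sample whose gradient points most
    nearly in the direction of the true model update (cosine
    similarity); that sample is shared in place of the raw update."""
    u = target_update / np.linalg.norm(target_update)
    sims = [float(u @ (g / np.linalg.norm(g))) for g in candidate_grads]
    return int(np.argmax(sims)), sims

update = np.array([1.0, 1.0])
grads = [np.array([1.0, 0.9]),    # sample 0: nearly aligned
         np.array([0.0, 1.0]),    # sample 1: partially aligned
         np.array([-1.0, 0.0])]   # sample 2: opposing
best, sims = select_sample(update, grads)
# best == 0: sample 0 approximates the update most closely
```

Because the shared object is an ordinary speech sample, it can be coded with standard speech codecs and audited with standard privacy evaluations, which is the transparency argument above.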
Expected Results:
- A sample-selection approach for approximating model updates in disentanglement tasks using speech samples (with EURECOM).
- Theoretical requirements for model structures that allow inversion, and experiments to see how well these estimates match natural speech (with TUB).
Supervisors: Tom Bäckström, Nicholas Evans and Laurent Besacier
Secondment: Naver, adviser Laurent Besacier (6 months).
Fellow 12: Methods for subjective and objective evaluation of privacy
Hosts: TUB (26 months) & INRIA Nancy (19 months)
For speech services to gain acceptance, privacy-preservation effects must be clear to users [1]. Fellow 12 will create methods to quantify the trade-off between privacy and utility in anonymized speech, using both objective classifier tools and subjective perceptual evaluations [2,3].
First, subjective methods will evaluate the impact of anonymization on speaker identity, emotion, speech quality, and computational complexity; crowdsourcing will be used to run these evaluations efficiently. Results will be compared with ML-based classifier performance to derive common evaluation criteria for Team 2.
Second, methods will assess perceived privacy and user experience, and observe user behavior when interacting with a service, with results forwarded for standardization with Fellow 7. Both lab studies and field tests with sensor data tracking will be used.
The methods will assess the work of Fellows 4, 5, 6, 10, and 11.
- A. Leschanowsky, S. Rech, B. Popp, and T. Bäckström, “Evaluating privacy, security, and trust perceptions in conversational AI: A systematic review”, in Computers in Human Behavior 159:108344, 2024.
- B. O’Brien, N. Tomashenko, A. Chanclu, and J.-F Bonastre, “Anonymous speaker clusters: Making distinctions between anonymised speech recordings with clustering interface”, in Proc. Interspeech, pp. 3580-3584, 2021.
- J. Williams, K. Pizzi, N. Tomashenko, and S. Das, “Anonymizing speaker voices: Easy to imitate, difficult to recognize?”, in Proc. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12491-12495, 2024.
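Crowdsourced subjective ratings of this kind are commonly aggregated as a mean opinion score (MOS) with a confidence interval. A minimal sketch (illustrative ratings, normal-approximation interval):

```python
import math

def mos(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% confidence
    interval, for aggregating crowdsourced ratings of e.g. perceived
    privacy or naturalness (a 5-point scale is assumed here)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean, (mean - half, mean + half)

score, ci = mos([4, 5, 3, 4, 4, 5, 3, 4])
# score = 4.0, with a confidence interval of roughly ±0.52
```

Plotting such subjective scores against objective classifier metrics (e.g., attacker EER) for the same anonymisation systems is one way to quantify the privacy-utility trade-off the project targets.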
Expected Results:
- Create subjective and objective methods (with Vocapia) to quantify the effect of anonymization (with INESC ID, SRU) and prepare for standardization (with INRIA).
- Develop standards for subjective evaluation of perceived privacy, user experience, and behavior tracking.
Supervisors: Sebastian Möller, Dorothea Kolossa, Slim Ouni and Bernard Prouts
Secondment: Vocapia, adviser Bernard Prouts (6 months).