Publications
2026
- [Preprint] PRiSM: Benchmarking Phone Realization in Speech Models. Shikhar Bharadwaj*, Chin-Jou Li*, Yoonjae Kim*, Kwanghee Choi, Eunjung Yeo, and 11 more authors. 2026.
Phone recognition (PR) serves as the atomic interface for language-agnostic modeling in cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, that encoder-CTC models are the most stable, and that specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability.
2025
- [Preprint] POWSM: A Phonetic Open Whisper-Style Speech Foundation Model. Chin-Jou Li*, Kalvin Chang*, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, and 3 more authors. 2025.
Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. We release our training data, models, and code to foster open science.
- [Interspeech] Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages. Chin-Jou Li*, Eunjung Yeo*, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, and 5 more authors. In Interspeech, 2025.
Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion (VC) model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
- [ICLR] Prompt-MII: Meta-Learning Instruction Induction for LLMs. Emily Xiao, Yixiao Zeng, Ada Chen, Chin-Jou Li, Amanda Bertsch, and 1 more author. 2025.
A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.
- [ACL] Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention. Emily Xiao, Chin-Jou Li, Yilin Zhang, Graham Neubig, and Amanda Bertsch. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025.
Many-shot in-context learning has recently shown promise as an alternative to finetuning, with the major advantage that the same model can be served for multiple tasks. However, this shifts the computational burden from training time to inference time, making deployment of many-shot ICL challenging to justify in practice. This cost is further increased if a custom demonstration set is retrieved for each inference example. We present Dynamic Block-Sparse Attention, a training-free framework for retrieval-based many-shot in-context learning. By combining carefully designed block-sparse attention and retrieval of cached groups of demonstrations, we achieve comparable per-example latency to finetuning while maintaining on average >95% of the best method's accuracy across strong ICL and finetuning baselines. We hope that this will further enable the deployment of many-shot ICL at scale.
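The retrieval-plus-block-sparsity idea above can be illustrated with a minimal sketch: each query block attends only to itself and the demonstration blocks retrieved for it, so attention cost scales with the number of retrieved blocks rather than the full context. This is an illustrative NumPy mask builder under assumed uniform block boundaries, not the paper's implementation; the function name and layout are assumptions.

```python
import numpy as np

def block_sparse_mask(n_blocks, block_size, retrieved):
    """Boolean attention mask for block-sparse many-shot ICL.

    Query block i may attend to itself plus the cached demonstration
    blocks listed in retrieved[i]. True means "attend"; everything
    else stays masked out.
    """
    n = n_blocks * block_size
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n_blocks):
        rows = slice(i * block_size, (i + 1) * block_size)
        for j in set(retrieved[i]) | {i}:           # retrieved blocks + self
            cols = slice(j * block_size, (j + 1) * block_size)
            mask[rows, cols] = True
    return mask
```

For example, with three blocks of size 2 where blocks 1 and 2 each retrieve block 0, only 20 of the 36 attention cells are active instead of all of them.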
2024
- [EMBC] Epileptic Seizure Classification with Patient-level and Video-level Contrastive Pretraining. Chin-Jou Li, Chien-Chen Chou, Yen-Cheng Shih, Li-Chuan Kuo, Yu-Te Wang, and 4 more authors. In 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2024.
Accurate classification of epileptic seizure types through seizure semiology analysis demands significant clinical expertise. While previous studies have employed various action recognition modules, the scarcity of labeled clinical videos has hindered the deployment of larger models. In this study, we leverage unlabeled data to pretrain a transformer-based model with a contrastive loss, exploiting information that circumvents the need for additional annotation from medical professionals. We maximize the similarity between embeddings from the same patient and video while minimizing the similarity between embeddings from different patients and videos. Subsequently, a classification head is fine-tuned to distinguish temporal lobe epilepsy (TLE) from extratemporal lobe epilepsy (exTLE), achieving a 5-fold accuracy of 0.93 and an F1 score of 0.88 at the video level (N = 57). Our results outperform other state-of-the-art seizure classification models, demonstrating the efficacy of our approach. This suggests potential applications in clinical practice, where unlabeled data could serve as a valuable aid in improving seizure classification accuracy and patient care.
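The patient-level contrastive objective described above (pull together embeddings from the same patient, push apart the rest) can be sketched as an InfoNCE-style loss. This is a minimal NumPy illustration of the idea only; the actual model, pairing scheme, function name, and temperature are assumptions, not the paper's implementation.

```python
import numpy as np

def patient_contrastive_loss(embeddings, patient_ids, temperature=0.1):
    """InfoNCE-style loss sketch: clips sharing a patient ID are treated
    as positive pairs; all other clips act as negatives."""
    # L2-normalize so the dot product is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature        # pairwise similarity logits
    n = len(patient_ids)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        for j in others:
            if patient_ids[j] == patient_ids[i]:   # positive pair
                total += log_denom - sim[i, j]     # -log softmax of the positive
                count += 1
    return total / count
```

Correctly matched patient labels should yield a lower loss than shuffled labels, which is the signal that drives the pretraining.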
- [Interspeech Satellite] Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement. Shafique Ahmed, Chia-Wei Chen, WenZe Ren, Chin-Jou Li, Ernie Chu, and 5 more authors. In 3rd COG-MHEAR Workshop on Audio-Visual Speech Enhancement (AVSEC), 2024.
Recent studies have acknowledged the advantages of incorporating visual data into speech enhancement (SE) systems. We introduce a novel audio-visual SE approach, termed DCUC-Net (deep complex U-Net with conformer network). DCUC-Net leverages complex-domain features and conformer blocks. The encoder and decoder of DCUC-Net use a complex U-Net-based framework. Audio and visual signals are processed using a complex encoder and a ResNet-18 model, respectively. These signals are fused using conformer blocks and transformed into enhanced speech waveforms via a complex decoder. The conformer blocks consist of self-attention mechanisms and convolutional operations, enabling DCUC-Net to capture global and local audio-visual dependencies. Our experimental results show DCUC-Net outperforms the baseline model from the COG-MHEAR AVSE Challenge 2023 by 0.14 in terms of PESQ. Additionally, DCUC-Net performs comparably to a state-of-the-art model and outperforms other models on the Taiwan Mandarin speech with video (TMSV) dataset.
- Face swapping in seizure videos for patient deidentification. Chin-Jou Li, Jen-Cheng Hou, Chien-Chen Chou, Yen-Cheng Shih, Stephane Dufau, and 3 more authors. 2024.
This study aimed to test different AI-based face-swapping models applied to videos of epileptic seizures, with the goal of protecting patient privacy while retaining clinically useful seizure semiology. We hypothesized that specific models would show differences in semiologic fidelity compared to the original clinical videos. This is the first study evaluating AI face swapping models in epileptic seizure video clips. Optimization of AI face-swapping models could enhance the accessibility of seizure videos for education and research while protecting patient privacy and maintaining semiology.
2023
- Artificial Intelligence-Based Face Transformation in Patient Seizure Videos for Privacy Protection. Jen-Cheng Hou, Chin-Jou Li, Chien-Chen Chou, Yen-Cheng Shih, Si-Lei Fong, and 5 more authors. 2023.
We aim to investigate the feasibility and accuracy of artificial intelligence (AI) methods of facial deidentification in hospital-recorded epileptic seizure videos, for improved patient privacy protection while preserving clinically important features of seizure semiology.