Zipformer

Zipformer is a speech encoder developed by the Next-Gen Kaldi team at Xiaomi. Compared with Conformer, it offers better accuracy, faster computation, and lower memory usage. It is the first known speech encoder to surpass, on a single dataset (LibriSpeech), the results reported in Google's Conformer paper. Zipformer was accepted as an Oral paper (top 1.2%) at ICLR 2024.

Overview

Zipformer introduces numerous innovations, including:

  • Efficient model architecture: Downsampled encoder structure and Zipformer block
  • New normalization: BiasNorm
  • New activation function: Swoosh
  • New optimizer: ScaledAdam
  • Activation value limiting strategies: Balancer and Whitener

For more details, please refer to the paper. Chinese readers can also check our blog post.
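
Two of the components listed above, Swoosh and BiasNorm, have simple closed forms. The sketch below writes them out in plain Python as we understand them from the paper: SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687, SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035, and BiasNorm(x) = x / RMS(x - b) * exp(γ). The function names and the list-based interface are illustrative assumptions, not the API of the zipformer package; the paper remains the authoritative reference for the definitions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swoosh_r(x: float) -> float:
    """SwooshR: the offset is chosen so that swoosh_r(0) is (almost exactly) 0."""
    return softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: float) -> float:
    """SwooshL: the same shape as SwooshR, shifted to the right."""
    return softplus(x - 4.0) - 0.08 * x - 0.035


def bias_norm(x: list[float], bias: list[float], log_scale: float) -> list[float]:
    """BiasNorm over one feature vector: x / RMS(x - bias) * exp(log_scale).

    `bias` and `log_scale` stand in for the learnable parameters b and gamma.
    No epsilon is added here, so RMS(x - bias) must be non-zero.
    """
    rms = math.sqrt(sum((xi - bi) ** 2 for xi, bi in zip(x, bias)) / len(x))
    scale = math.exp(log_scale) / rms
    return [xi * scale for xi in x]


if __name__ == "__main__":
    print(swoosh_r(0.0), swoosh_l(0.0))
    print(bias_norm([3.0, 4.0], [0.0, 0.0], 0.0))
```

With a zero bias and a zero log-scale, BiasNorm reduces to plain RMS normalization, which makes for a quick sanity check of the sketch.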

Quick Start

pip install zipformer

Note

The examples below use the non-streaming medium model. For more models, please refer to the documentation.

Command Line

# Use jit scripted model
# Transducer
zipformer inference --ms-model pkufool/zipformer-medium --model-type jit --ctc 0 en.wav zh.wav

# CTC
zipformer inference --ms-model pkufool/zipformer-medium --model-type jit --ctc 1 en.wav zh.wav

# Use onnx model
# Transducer
zipformer inference --ms-model pkufool/zipformer-medium --model-type onnx --ctc 0 en.wav zh.wav

# CTC
zipformer inference --ms-model pkufool/zipformer-medium --model-type onnx --ctc 1 en.wav zh.wav

Python API

from zipformer import inference

# jit scripted model
# Transducer
result = inference(['en.wav', 'zh.wav'], ms_model='pkufool/zipformer-medium', model_type='jit', ctc=False)

# CTC
result = inference(['en.wav', 'zh.wav'], ms_model='pkufool/zipformer-medium', model_type='jit', ctc=True)

# onnx model
# Transducer
result = inference(['en.wav', 'zh.wav'], ms_model='pkufool/zipformer-medium', model_type='onnx', ctc=False)

# CTC
result = inference(['en.wav', 'zh.wav'], ms_model='pkufool/zipformer-medium', model_type='onnx', ctc=True)

# fp16 model (onnx)
# Transducer
result = inference(['en.wav', 'zh.wav'], ms_model='pkufool/zipformer-medium', model_type='onnx', ctc=False, dtype='fp16')

# CTC
result = inference(['en.wav', 'zh.wav'], ms_model='pkufool/zipformer-medium', model_type='onnx', ctc=True, dtype='fp16')
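
To compare the configurations above side by side, the calls can be driven from a small loop. This is a minimal sketch: `run_all` is a hypothetical helper, not part of the zipformer package, and it takes the inference function as an argument so that it makes no assumption about the structure of the returned result. It covers only the `model_type` and `ctc` combinations shown above and leaves out the `dtype='fp16'` variants.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def run_all(
    infer: Callable[..., Any],
    wavs: List[str],
    ms_model: str = "pkufool/zipformer-medium",
) -> Dict[Tuple[str, bool], Any]:
    """Call `infer` once per (model_type, ctc) pair and collect the results.

    `infer` is expected to accept the keyword arguments used in the
    examples above: ms_model, model_type and ctc.
    """
    results: Dict[Tuple[str, bool], Any] = {}
    for model_type, ctc in product(("jit", "onnx"), (False, True)):
        results[(model_type, ctc)] = infer(
            wavs, ms_model=ms_model, model_type=model_type, ctc=ctc
        )
    return results


# Usage, assuming the package is installed and the audio files exist:
#
#     from zipformer import inference
#     results = run_all(inference, ["en.wav", "zh.wav"])
#     for (model_type, ctc), result in results.items():
#         print(model_type, "CTC" if ctc else "Transducer", result)
```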
