
Training

This page describes how to train a Zipformer speech recognition model, including data format, training scripts, and evaluation scripts.

Data

This repository loads data with atdataset, a dataloader built on top of webdataset.

Training

The examples below use the medium model. For parameter settings of other variants, see the Model Documentation.

Single-node multi-GPU

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

zipformer train \
    --world-size 8 \
    --exp-dir zipformer/exp_medium \
    --num-encoder-layers 2,2,3,4,3,2 \
    --feedforward-dim 512,768,1024,1536,1024,768 \
    --encoder-dim 192,256,384,512,384,256 \
    --encoder-unmasked-dim 192,192,256,256,256,192 \
    --bpe-model zh-en-8776 \
    --training-sets data/training_set.lst \
    --num-epochs 20 \
    --use-fp16 1 \
    --start-epoch 1 \
    --use-cr-ctc 1 \
    --use-ctc 1 \
    --base-lr 0.045 \
    --use-transducer 1 \
    --use-attention-decoder 0 \
    --enable-spec-aug 0 \
    --ctc-loss-scale 0.2 \
    --cr-loss-scale 0.02 \
    --time-mask-ratio 2.5 \
    --lr-hours 50000 \
    --num-workers 2 \
    --max-duration 600

Note

To train a streaming model, add --causal 1 to the command above.

Multi-node multi-GPU

Note

All nodes must use identical arguments except for --world-size, --local-rank-start, and --local-world-size. In the two-machine example below, only --local-rank-start differs between the nodes.

Assume two machines, each with 8 GPUs.

  • First machine (the master node; 127.0.0.3 stands in for its IP address, which must be reachable from the second machine)
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

zipformer train \
    --world-size 16 \
    --master-addr 127.0.0.3 \
    --master-port 8808 \
    --local-rank-start 0 \
    --local-world-size 8 \
    --exp-dir zipformer/exp_medium \
    --num-encoder-layers 2,2,3,4,3,2 \
    --feedforward-dim 512,768,1024,1536,1024,768 \
    --encoder-dim 192,256,384,512,384,256 \
    --encoder-unmasked-dim 192,192,256,256,256,192 \
    --bpe-model zh-en-8776 \
    --training-sets data/training_set.lst \
    --num-epochs 20 \
    --use-fp16 1 \
    --start-epoch 1 \
    --use-cr-ctc 1 \
    --use-ctc 1 \
    --base-lr 0.045 \
    --use-transducer 1 \
    --use-attention-decoder 0 \
    --enable-spec-aug 0 \
    --ctc-loss-scale 0.2 \
    --cr-loss-scale 0.02 \
    --time-mask-ratio 2.5 \
    --lr-hours 50000 \
    --num-workers 2 \
    --max-duration 600
  • Second machine
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

zipformer train \
    --world-size 16 \
    --master-addr 127.0.0.3 \
    --master-port 8808 \
    --local-rank-start 8 \
    --local-world-size 8 \
    --exp-dir zipformer/exp_medium \
    --num-encoder-layers 2,2,3,4,3,2 \
    --feedforward-dim 512,768,1024,1536,1024,768 \
    --encoder-dim 192,256,384,512,384,256 \
    --encoder-unmasked-dim 192,192,256,256,256,192 \
    --bpe-model zh-en-8776 \
    --training-sets data/training_set.lst \
    --num-epochs 20 \
    --use-fp16 1 \
    --start-epoch 1 \
    --use-cr-ctc 1 \
    --use-ctc 1 \
    --base-lr 0.045 \
    --use-transducer 1 \
    --use-attention-decoder 0 \
    --enable-spec-aug 0 \
    --ctc-loss-scale 0.2 \
    --cr-loss-scale 0.02 \
    --time-mask-ratio 2.5 \
    --lr-hours 50000 \
    --num-workers 2 \
    --max-duration 600
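
The rank arguments in the two commands above follow a pattern: --world-size is the total GPU count (2 × 8 = 16), and --local-rank-start is the number of GPUs on the preceding machines (0, then 8). The sketch below computes them from a node index. It assumes every node has the same number of GPUs and that the pattern carries over to more than two nodes, which this page does not confirm. NUM_NODES, GPUS_PER_NODE, MASTER_ADDR, MASTER_PORT, and dist_args are illustrative shell names, not zipformer options.

```sh
#!/bin/sh
# Sketch only: derive the distributed arguments for one node from its index.
# All variable and function names here are illustrative, not part of zipformer.
NUM_NODES=2
GPUS_PER_NODE=8
MASTER_ADDR=127.0.0.3   # placeholder address taken from the example above
MASTER_PORT=8808

# Print the distributed arguments for the node whose index is $1
# (0 for the master node, 1 for the second machine).
dist_args() {
    echo "--world-size $(( $NUM_NODES * $GPUS_PER_NODE ))" \
         "--master-addr $MASTER_ADDR --master-port $MASTER_PORT" \
         "--local-rank-start $(( $1 * $GPUS_PER_NODE ))" \
         "--local-world-size $GPUS_PER_NODE"
}

dist_args 0   # arguments for the first machine
dist_args 1   # arguments for the second machine
```

On the second machine, the output of dist_args 1 could replace the five hand-written distributed arguments, with the remaining model and training arguments left unchanged.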

Evaluation

zipformer decode \
    --exp-dir zipformer/exp_medium \
    --num-encoder-layers 2,2,3,4,3,2 \
    --feedforward-dim 512,768,1024,1536,1024,768 \
    --encoder-dim 192,256,384,512,384,256 \
    --encoder-unmasked-dim 192,192,256,256,256,192 \
    --epoch ITER \
    --avg AVG \
    --bpe-model zh-en-8776 \
    --test-sets test_clean,data/librispeech_test_clean.lst \
                test_other,data/librispeech_test_other.lst \
    --decoding-method rnnt-greedy-search

Export

zipformer export \
    --use-ctc 1 \
    --use-transducer 1 \
    --num-encoder-layers 2,2,3,4,3,2 \
    --feedforward-dim 512,768,1024,1536,1024,768 \
    --encoder-dim 192,256,384,512,384,256 \
    --encoder-unmasked-dim 192,192,256,256,256,192 \
    --exp-dir zipformer/exp_medium \
    --bpe-model zh-en-8776 \
    --iter ITER \
    --avg AVG
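
The train, decode, and export commands on this page all repeat the same encoder arguments, --exp-dir, and --bpe-model, presumably because each command must describe the same architecture as the checkpoint. One way to keep them in sync is to define them once in a shell variable. This is a minimal sketch, not part of the zipformer tooling: MODEL_ARGS is an illustrative name, and ITER and AVG remain placeholders as in the commands above.

```sh
#!/bin/sh
# Sketch only: keep the arguments shared by train, decode, and export in one place.
# MODEL_ARGS is an illustrative variable name, not a zipformer setting.
MODEL_ARGS="--num-encoder-layers 2,2,3,4,3,2 \
--feedforward-dim 512,768,1024,1536,1024,768 \
--encoder-dim 192,256,384,512,384,256 \
--encoder-unmasked-dim 192,192,256,256,256,192 \
--exp-dir zipformer/exp_medium \
--bpe-model zh-en-8776"

# $MODEL_ARGS is left unquoted on purpose so the shell splits it into
# separate arguments; this is safe here because no value contains spaces.
# The leading "echo" only prints the command; remove it to run the export.
echo zipformer export $MODEL_ARGS --use-ctc 1 --use-transducer 1 --iter ITER --avg AVG
```

The same variable can be passed to zipformer train and zipformer decode, so a change to the model variant only needs to be made in one place.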
