Speech Recognition Terminology – 360converter Blog

LM: Language Model

MFCC: Mel-Frequency Cepstral Coefficients, Mel cepstral features

PLP: Perceptual Linear Prediction, PLP features

fBank: filter bank features (log Mel filter-bank energies)

CMVN: Cepstral Mean and Variance Normalization
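
To make the CMVN entry concrete, here is a minimal per-utterance sketch in plain Python. The function name `cmvn`, the toy input, and the variance floor are assumptions made for illustration; real toolkits may compute the statistics per speaker and apply them differently.

```python
import math

def cmvn(frames):
    """Normalize every feature dimension to zero mean and unit variance.

    frames: list of equal-length lists of floats (one list per frame).
    """
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frames) / num_frames
        for d in range(dim)
    ]
    # Floor the standard deviation so constant dimensions do not divide by zero
    # (the floor value is an arbitrary choice for this sketch).
    stds = [max(math.sqrt(v), 1e-10) for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]

# Toy "utterance": 3 frames of 2-dimensional features (made-up numbers).
utt = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = cmvn(utt)
```

After normalization both dimensions have the same scale, even though the second one started out ten times larger.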

Mono: monophone, monophone model training

Triphone: triphone model training; typically tri1: deltas; tri2: delta+delta-delta; tri3a: LDA+MLLT
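
The "deltas" in the Triphone entry are time derivatives appended to the static features. A minimal sketch using the common regression formula follows; the window of 2 frames and the edge handling (repeating the first and last frame) are assumptions for illustration, not a statement of what any particular script does.

```python
def add_deltas(frames, window=2):
    """Append delta features to each frame.

    Uses d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2), n = 1..window.
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))

    def frame_at(t):
        # Repeat the first/last frame beyond the utterance edges.
        return frames[min(max(t, 0), num_frames - 1)]

    out = []
    for t in range(num_frames):
        delta = [
            sum(n * (frame_at(t + n)[d] - frame_at(t - n)[d])
                for n in range(1, window + 1)) / denom
            for d in range(len(frames[t]))
        ]
        out.append(list(frames[t]) + delta)
    return out

# A 1-dimensional feature that rises by 1 per frame (made-up data).
static = [[0.0], [1.0], [2.0], [3.0], [4.0]]
with_deltas = add_deltas(static)
```

Applying the same function again to the delta columns would give delta-delta features.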

GMM: Gaussian Mixture Model

HMM: Hidden Markov Model

sGMM: Subspace Gaussian Mixture Model (subspace GMM), which can effectively reduce the number of GMM parameters

GMM-HMM: MFCC+Mono+Triphone

MLLT: Maximum Likelihood Linear Transform, used in the training stage

CMLLR/fMLLR: Constrained Maximum Likelihood Linear Regression / feature-space Maximum Likelihood Linear Regression; improves robustness to speaker characteristics, used in the alignment stage

SAT: Speaker Adaptive Training

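VTLN: Vocal Tract Length Normalisation. Mainly used in speech recognition to remove differences in vocal tract length between male and female speakers. Source code is available in HTK, and the HTK Book describes it. It works by modifying the center frequencies of the Mel filter bank.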

LDA: Linear Discriminant Analysis

PLDA: Probabilistic Linear Discriminant Analysis

MMI/BMMI: Maximum Mutual Information / Boosted MMI (minimizes the sentence error rate?), steps/train_mmi.sh

LF-MMI: Lattice-Free Maximum Mutual Information

MPE: Minimum Phone Error, minimizes the error rate at a chosen level of granularity (here, phones), steps/train_mpe.sh

sMBR: state-level Minimum Bayes Risk, minimizes the state-level error rate

lattice: word lattice; used by lmrescore

EM: Expectation Maximization

LMWT: language model weight

acwt: acoustic weight (acoustic scale), the weight of the acoustic model
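
How LMWT and acwt interact can be sketched with a toy rescoring example. It assumes the common convention that the language model weight is the reciprocal of the acoustic scale (so acwt = 0.1 corresponds to LMWT = 10); the hypotheses and their scores are invented for illustration.

```python
def best_with_acwt(hyps, acwt):
    """Scale the acoustic log-likelihood, leave the LM log-probability as is."""
    return max(hyps, key=lambda h: acwt * h["ac_loglike"] + h["lm_logprob"])

def best_with_lmwt(hyps, lmwt):
    """Equivalent ranking: leave the acoustic score, scale the LM score."""
    return max(hyps, key=lambda h: h["ac_loglike"] + lmwt * h["lm_logprob"])

# Two competing hypotheses with made-up log scores.
hyps = [
    {"text": "recognize speech", "ac_loglike": -120.0, "lm_logprob": -4.0},
    {"text": "wreck a nice beach", "ac_loglike": -110.0, "lm_logprob": -9.0},
]

scaled = best_with_acwt(hyps, acwt=0.1)["text"]
inverse = best_with_lmwt(hyps, lmwt=10.0)["text"]
unscaled = best_with_acwt(hyps, acwt=1.0)["text"]
```

With the acoustic score scaled down, the language model decides in favour of the first hypothesis; with no scaling, the slightly better acoustic score of the second one wins.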

Below are some terms and abbreviations I ran into while reading Kaldi scripts

hires: hi-res, high resolution; used to describe MFCC features

scp: script file; each line is a pair of [utterance id] and [wav file or zipped wav file]
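
As a rough illustration of the scp layout described above, this sketch reads such a file into a dictionary. The function name `parse_scp`, the utterance ids, and the paths are hypothetical; the second line shows the zipped case written as a command, which is an assumption about how such an entry might look.

```python
def parse_scp(text):
    """Map each utterance id to the rest of its line (a path or a command)."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only once: everything after the id belongs to the value.
        utt_id, value = line.split(None, 1)
        table[utt_id] = value
    return table

# Hypothetical file contents.
example = """\
utt001 /data/audio/utt001.wav
utt002 gunzip -c /data/audio/utt002.wav.gz |
"""
wav_scp = parse_scp(example)
```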

ark: archive file, laid out as: token1 [something] token2 [something] token3 [something] …

dur: duration; for example, the utt2dur file lists pairs of [utterance id] and [duration]

feats: features; for example, feats.scp contains pairs of [utterance id] and [MFCC feature ark file]

phones: phonemes, like phones.txt

int and txt: file extensions; the .txt file lists symbols such as #1, #2, #3, while the .int file holds the corresponding integers, for example disambig.int and disambig.txt
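
The relation between a .txt and an .int file can be sketched as a lookup through a symbol table. The table contents below are invented, and the "symbol id" line layout is an assumption about what a file like phones.txt contains.

```python
def load_symbol_table(text):
    """Parse lines of the form '<symbol> <integer id>' into a dict."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            symbol, idx = line.split()
            table[symbol] = int(idx)
    return table

# Hypothetical symbol table in the style of phones.txt.
phones_txt = """\
<eps> 0
a 1
b 2
#0 3
#1 4
"""
symbols = load_symbol_table(phones_txt)

# Symbols as they would appear in a .txt file, and their .int counterparts.
disambig_txt = ["#0", "#1"]
disambig_int = [symbols[s] for s in disambig_txt]
```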

disambig: short for disambiguation; disambiguation symbols are used for the minimization and determinization of the FST

lat: lattice, e.g. lat.1.gz

CTM: a time-marked conversation file; it contains a time-aligned phoneme transcription of the utterances. Its format is:
utt_id channel_num start_time phone_dur phone_id
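
A small sketch of reading one line in the five-field format above. The class name `CtmEntry`, the function name, and the sample line are made up for illustration.

```python
from typing import NamedTuple

class CtmEntry(NamedTuple):
    utt_id: str
    channel: str
    start: float      # start_time
    duration: float   # phone_dur
    phone_id: str

def parse_ctm_line(line):
    """Split one CTM line into its five fields and convert the times."""
    utt_id, channel, start, dur, phone_id = line.split()[:5]
    return CtmEntry(utt_id, channel, float(start), float(dur), phone_id)

# Hypothetical line: utterance utt001, channel 1, phone 42 from 0.25 s for 0.07 s.
entry = parse_ctm_line("utt001 1 0.25 0.07 42")
end_time = entry.start + entry.duration
```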

egs: Examples

rm: Resource Management

wsj: Wall Street Journal

s5: Script version 5

exp: Experiments

acc: Accumulate

accs: Accumulated statistics

ali: Alignment

mdl: Model

occs: Occurrence counts/occupancy

am: Acoustic model

csl: colon-separated list files
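
A colon-separated list is simple to read and write; the sketch below assumes the entries are integers, and the helper names and values are made up for illustration.

```python
def parse_csl(text):
    """Turn '1:2:3' into [1, 2, 3], ignoring surrounding whitespace."""
    return [int(tok) for tok in text.strip().split(":") if tok]

def format_csl(ids):
    """Turn [1, 2, 3] back into '1:2:3'."""
    return ":".join(str(i) for i in ids)

ids = parse_csl("1:2:3:4:5\n")
round_trip = format_csl(ids)
```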