
Seq2seq and Transformer Models Explained: Architecture, Applications, and Optimization Tips | ML Transformer

July 2, 2025 · 6 min read

Sequence-to-sequence model (Seq2seq)

The output length is determined by the model.
Applications: Speech Recognition, Machine Translation, Speech Translation

Applications

Application 1 - Speech Recognition, Hokkien (閩南語, Taiwanese)

Application 2 - Text-to-Speech (TTS) Synthesis

Application 3 - Chatbot

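Most natural language processing tasks can be framed as a QA (Question Answering) problem.

That said, a model tailored to a specific task can achieve better results than a generic Seq2seq model.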

Application 4 - Syntactic Parsing

An application that brute-forces syntactic parsing with a Seq2seq model:

Application 5 - Multi-label classification

Multi-label classification vs. Multi-class classification

Multi-class classification: pick exactly one class out of several.
Multi-label classification: the same item can belong to several classes at once.

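This can be brute-forced with Seq2seq: since the output length is determined by the model, it decides how many labels to emit.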
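A minimal sketch of the difference between the two settings. The class names, scores, and the 0.5 threshold are illustrative assumptions, not taken from the original notes.

```python
# Hypothetical per-class scores in [0, 1] for one document.
scores = {"sports": 0.81, "politics": 0.64, "tech": 0.12}

def multi_class(scores):
    """Multi-class: return exactly one class, the highest-scoring one."""
    return max(scores, key=scores.get)

def multi_label(scores, threshold=0.5):
    """Multi-label: return every class whose score passes the threshold."""
    return [name for name, s in scores.items() if s >= threshold]

print(multi_class(scores))  # sports
print(multi_label(scores))  # ['sports', 'politics']
```

A fixed threshold is only one way to decide how many labels to keep. A Seq2seq model sidesteps that choice by emitting labels one by one until it decides to stop.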

Application 6 - Object Detection

Other - Notes on which type of model suits which application

How to make a Seq2seq model

Architecture

Transformer

Encoder

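Residual connection: the input of a block is added to its output.
Layer Norm: makes the error surface smoother so that training is easier. For a single feature vector (one example), compute the mean and standard deviation $\sigma$ across its dimensions, then normalize.
Batch Norm (for comparison): normalizes the same dimension across different feature vectors (examples) in the batch.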
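The contrast between the two normalizations, plus the residual connection, can be sketched in plain Python. This is a simplified illustration: the learnable scale and shift parameters and the running statistics used by real implementations are omitted, and `add_and_norm` is a name chosen here for the residual-then-normalize step.

```python
import math

def normalize(values, eps=1e-5):
    """Shift values to zero mean and scale them to (roughly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """Normalize each example across its own dimensions (row-wise)."""
    return [normalize(example) for example in batch]

def batch_norm(batch):
    """Normalize each dimension across the examples in the batch (column-wise)."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by Layer Norm."""
    summed = [[a + b for a, b in zip(xr, sr)] for xr, sr in zip(x, sublayer_out)]
    return layer_norm(summed)

# Toy batch: 2 examples, 3 dimensions each.
batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 60.0]]

ln = layer_norm(batch)   # every row has mean 0
bn = batch_norm(batch)   # every column has mean 0
```

Layer Norm does not depend on the other examples in the batch, which is convenient for sequences of varying length.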

This corresponds to the Encoder part of the Transformer shown earlier.

Decoder - Autoregressive (AT)

(Using speech recognition as the example)

The Encoder encodes the input; the Decoder produces the output.

Ignore the input from the encoder here.

Self-attention -> Masked Self-attention

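When computing the output for the second position, the model can only see that position and the ones before it (to the left); it cannot take anything to the right into account.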
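The masking can be sketched as follows: scores for positions to the right are set to negative infinity before the softmax, so their weights become zero. The all-zero score matrix is a toy input chosen so that the effect of the mask is easy to read.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row.

    scores[i][j] is the raw score of query position i on key position j.
    Position i may only attend to positions j <= i.
    """
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return weights

# Toy input: 4 positions, all raw scores equal.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```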

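AT vs. Decoder - Non-autoregressive (NAT)

How does a NAT decoder decide how long its output should be?

NAT is faster, and its output length is easier to control, which in turn controls the output speed.
NAT performs worse than AT; matching AT quality takes a lot of extra effort.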
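A toy sketch of the two decoding styles. The lookup tables stand in for trained decoders and are purely hypothetical. For the length question, this sketch assumes one common strategy (emit a fixed number of positions and cut at the first end token); a separate length predictor is another option.

```python
BOS, EOS = "<BOS>", "<EOS>"

# Hypothetical stand-ins for trained decoders (not real models).
TOY_NEXT = {BOS: "ma", "ma": "chine", "chine": EOS}  # AT: previous token -> next token
TOY_POSITIONS = ["ma", "chine", EOS, EOS]            # NAT: position -> token

def at_decode(next_token, max_len=10):
    """Autoregressive: one token per step, each output is fed back as input."""
    output, prev = [], BOS
    for _ in range(max_len):
        token = next_token(prev)
        if token == EOS:
            break
        output.append(token)
        prev = token
    return output

def nat_decode(token_at, length):
    """Non-autoregressive: every position is predicted independently."""
    tokens = [token_at(i) for i in range(length)]  # could run in parallel
    # Assumed length strategy: keep everything before the first EOS.
    return tokens[:tokens.index(EOS)] if EOS in tokens else tokens

at_result = at_decode(lambda prev: TOY_NEXT[prev])
nat_result = nat_decode(lambda i: TOY_POSITIONS[i], length=4)
print(at_result, nat_result)  # ['ma', 'chine'] ['ma', 'chine']
```

The AT loop needs one sequential step per output token, while the NAT positions have no dependency on each other, which is where the speed advantage comes from.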

Encoder - Decoder

Cross attention
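A minimal single-head sketch of cross attention: the queries come from the decoder, while the keys and values come from the encoder output. The learned projection matrices are left out (treated as identity) to keep the example short, and the toy vectors are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(decoder_states, encoder_states):
    """For each decoder state (query), return a weighted sum of encoder states.

    Keys and values are both the encoder states; projections are omitted.
    """
    d = len(encoder_states[0])
    outputs = []
    for q in decoder_states:
        scores = [dot(q, k) / math.sqrt(d) for k in encoder_states]
        weights = softmax(scores)
        context = [sum(w * v[i] for w, v in zip(weights, encoder_states))
                   for i in range(d)]
        outputs.append(context)
    return outputs

# Toy data: 3 encoder positions, 1 decoder position, dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[5.0, 0.0]]
context = cross_attention(decoder_states, encoder_states)
```

The query here points along the first dimension, so the resulting context vector leans toward the encoder states that are large in that dimension.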

The Encoder and Decoder can be connected in different ways.

Mismatch between training and testing

How can this be addressed? One option is to feed the decoder some incorrect inputs during training.

See Scheduled Sampling (under Training > Tips).

Training

Process

Producing each single output token is similar to a classification problem.
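Seen this way, the training loss for a sequence is the sum of one cross-entropy term per output token. The sketch below uses a made-up three-token vocabulary and made-up predicted distributions.

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Cross-entropy against a one-hot target: -log p(correct token)."""
    return -math.log(predicted_probs[target_index])

def sequence_loss(step_probs, target_indices):
    """Sum of per-token cross-entropies over all decoding steps."""
    return sum(cross_entropy(p, t) for p, t in zip(step_probs, target_indices))

# Hypothetical vocabulary: index 0 = <EOS>, 1 = "ma", 2 = "chine".
step_probs = [
    [0.10, 0.80, 0.10],  # step 1, correct token is index 1
    [0.20, 0.10, 0.70],  # step 2, correct token is index 2
    [0.90, 0.05, 0.05],  # step 3, correct token is <EOS>
]
targets = [1, 2, 0]

loss = sequence_loss(step_probs, targets)
print(round(loss, 3))  # about 0.685
```

Note that the end token is part of the target: the model also has to learn when to stop.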

Tips

Copy Mechanism

Applications: chatbot, summarization

Summarization scenario

Guided Attention

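Applications: speech recognition, speech synthesis

Problem: a seq2seq model sometimes skips part of the input, for example dropping some sounds in synthesis or missing part of the audio in recognition.

During training, force the attention to read the input in a fixed direction.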
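One way to express such a constraint is a penalty that grows as attention mass moves away from the diagonal, added to the training loss. The sketch below follows a formulation from the TTS literature as I recall it; it is an assumption and not necessarily the method the original notes refer to. The width `g=0.2` is an assumed default.

```python
import math

def guided_attention_penalty(attention, g=0.2):
    """Mean attention weight, scaled by its distance from the diagonal.

    attention[t][n] is the weight that output step t puts on input position n.
    Weights near the diagonal (n/N close to t/T) are barely penalized.
    """
    T, N = len(attention), len(attention[0])
    total = 0.0
    for t, row in enumerate(attention):
        for n, a in enumerate(row):
            distance = n / N - t / T
            w = 1.0 - math.exp(-(distance ** 2) / (2 * g * g))
            total += a * w
    return total / (T * N)

# Toy attention maps: reading left to right vs. right to left.
monotonic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
backwards = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

print(guided_attention_penalty(monotonic) < guided_attention_penalty(backwards))  # True
```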

Beam Search

Issues

This method helps in some cases and not in others.
Creative tasks (such as sentence completion or TTS) need more randomness. For deterministic tasks, Beam Search may be effective.
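A small sketch of Beam Search on a toy model where greedy decoding (beam width 1) misses the most probable sequence. The probability table is invented for illustration; a real decoder would compute these distributions from the prefix.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"A": 0.55, "B": 0.45},
    ("B",): {"A": 0.9, "B": 0.1},
}

def beam_search(model, steps, beam_width):
    """Keep the `beam_width` best prefixes at every step; return the best one."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_seq, greedy_lp = beam_search(TOY_MODEL, steps=2, beam_width=1)
beam_seq, beam_lp = beam_search(TOY_MODEL, steps=2, beam_width=2)
print(greedy_seq, round(math.exp(greedy_lp), 2))  # ('A', 'A') 0.33
print(beam_seq, round(math.exp(beam_lp), 2))      # ('B', 'A') 0.36
```

Greedy decoding commits to "A" because it looks best at the first step, while a beam of width 2 keeps "B" alive long enough to find the higher-probability sequence.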

Optimizing Evaluation Metrics

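Training minimizes cross-entropy, while evaluation looks at the BLEU score.
One could use the negative BLEU score as the loss, so that minimizing the loss maximizes BLEU. However, BLEU is not differentiable, so such a loss cannot be optimized with gradient descent.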

Scheduled Sampling
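The idea from "Mismatch between training and testing" (occasionally showing the decoder incorrect inputs during training) can be sketched as below. This is a simplified illustration: `model_predictions` is assumed to be given up front, whereas in practice each prediction depends on the inputs chosen at earlier steps, and `sampling_prob` is usually increased over the course of training.

```python
import random

def scheduled_sampling_inputs(ground_truth, model_predictions, sampling_prob, rng):
    """Choose the decoder input for each step of one training sequence.

    With probability `sampling_prob` the decoder is fed the model's own
    prediction; otherwise it is fed the ground-truth token.
    """
    inputs = []
    for truth, predicted in zip(ground_truth, model_predictions):
        use_prediction = rng.random() < sampling_prob
        inputs.append(predicted if use_prediction else truth)
    return inputs

# Toy sequences (hypothetical): the model got the second token wrong.
ground_truth = ["ma", "chine", "learn", "ing"]
model_predictions = ["ma", "gic", "learn", "ing"]

rng = random.Random(0)
mixed = scheduled_sampling_inputs(ground_truth, model_predictions, 0.5, rng)
```

With `sampling_prob=0.0` this reduces to always feeding the ground truth, and with `sampling_prob=1.0` the decoder only ever sees its own outputs, as it does at test time.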

Terminology

one-hot vector: a vector in which exactly one dimension is 1 and all others are 0
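For illustration, a token can be turned into such a vector by its index in the vocabulary. The four-token vocabulary is a made-up example.

```python
def one_hot(index, size):
    """Return a vector of length `size` with a 1 at `index` and 0 elsewhere."""
    vector = [0] * size
    vector[index] = 1
    return vector

# Hypothetical vocabulary.
vocab = ["<BOS>", "<EOS>", "ma", "chine"]
print(one_hot(vocab.index("ma"), len(vocab)))  # [0, 0, 1, 0]
```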