[EMNLP 2019] Can You Unpack That? Learning to Rewrite Questions-in-Context (Elgohary et al., 2019)
Focuses on coreference and ellipsis in question answering, and introduces the task of question-in-context rewriting, also referred to in the paper as de-contextualization.
We introduce the task of question-in-context rewriting: given the context of a conversation's history, rewrite a context-dependent question into a self-contained question with the same answer.
For the four cases shown in the figure above, the generation probability $p_{g}\left(y_{t} \mid \cdot\right)$ and the copy probability $p_{c}\left(y_{t} \mid \cdot\right)$ are computed separately:
$$
p_{g}\left(y_{t} \mid \cdot\right)=
\begin{cases}
\frac{1}{Z} e^{\psi_{g}\left(y_{t}\right)} & \text{if } y_{t} \in V \\
0 & \text{if } y_{t} \in X \text{ and } y_{t} \notin V \\
\frac{1}{Z} e^{\psi_{g}(\mathrm{UNK})} & \text{if } y_{t} \notin X \cup V
\end{cases}
$$
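A minimal sketch of the three generation cases above; the shared normalizer $Z$ (over generation and copy scores) is applied elsewhere, and `psi_g`, the vocabulary and the UNK handling are illustrative assumptions rather than the authors' code:

```python
import math

def generation_score(y_t: str, psi_g: dict, input_words: set, vocab: set) -> float:
    """Unnormalized generation score for the three cases in the equation above."""
    if y_t in vocab:                    # y_t in V: normal in-vocabulary generation
        return math.exp(psi_g[y_t])
    if y_t in input_words:              # y_t in X but not in V: only the copy mode can produce it
        return 0.0
    return math.exp(psi_g["<unk>"])     # y_t outside both X and V: fall back to the UNK score
```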
(L/T)-Gen: Pure generation-based model. Words are generated from a fixed vocabulary. (worst)
(L/T)-Ptr-Net: Pure pointer-based model. Words can only be copied from the input.
(L/T)-Ptr-Gen: Hybrid pointer+generation model. Words can be either copied from the input or generated from a fixed vocabulary.
(L/T)-Ptr-λ: Our proposed model, which splits the attention by a coefficient λ. (best)
[EMNLP 2019] Improving Open-Domain Dialogue Systems via Multi-Turn Incomplete Utterance Restoration (Pan et al., 2019)
Work from Tencent AI Lab. The paper releases a large multi-turn dialogue dataset, Restoration-200K, and proposes a "pick-and-combine" method to restore incomplete utterances from the dialogue context, comparing it against a syntactic model (Kumar and Joshi, 2016), a sequence-to-sequence model (Seq2Seq), and the pointer-generator network (See et al., 2017).
The span-splitting task is yet again cast as sequence labeling (Split or Retain). BiDAF (Seo et al., 2017) is used to capture the semantic interaction between $x$ and $y$. The embedding concatenates character, word, and sentence embeddings: $\phi=\left[\phi_{c} ; \phi_{w} ; \phi_{s}\right]$
Context layer: GloVe + BiLSTM (c and x are jointly encoded)
Encoding layer (focuses on local rather than global information): concatenate 1. element-wise similarity (Ele Sim.), 2. cosine similarity (Cos Sim.), and 3. learned bi-linear similarity (Bi-Linear Sim.) into a D-dimensional feature vector $F \in \mathbb{R}^{M \times N \times D}$ (a sketch follows after the layer descriptions)
Segmentation layer (to capture global information): Conv + pool + skip connections + FFN -> $Y \in \mathbb{R}^{M \times N}$
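A minimal sketch of the encoding layer's similarity features, assuming the context and utterance encodings share one hidden size; the module name and the resulting $D = \text{hidden} + 2$ are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityFeatureMap(nn.Module):
    """For every (context word, utterance word) pair, concatenate element-wise,
    cosine and learned bi-linear similarities into a feature vector in R^{M x N x D}."""
    def __init__(self, hidden: int):
        super().__init__()
        self.bilinear = nn.Bilinear(hidden, hidden, 1)   # learned bi-linear similarity

    def forward(self, c: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # c: (M, hidden) context encodings, x: (N, hidden) utterance encodings
        M, N = c.size(0), x.size(0)
        ci = c.unsqueeze(1).expand(M, N, -1)
        xj = x.unsqueeze(0).expand(M, N, -1)
        ele = ci * xj                                               # element-wise similarity
        cos = F.cosine_similarity(ci, xj, dim=-1).unsqueeze(-1)     # cosine similarity
        bil = self.bilinear(ci.contiguous(), xj.contiguous())      # bi-linear similarity
        return torch.cat([ele, cos, bil], dim=-1)                   # (M, N, hidden + 2)
```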
The same architecture can also use a pre-trained model (e.g. BERT) to obtain distributed representations; RUN + BERT reports more convincing results than RUN alone.
Before generation, a standardization step ensures that every predicted region in $Y$ is rectangular (based on the Hoshen–Kopelman algorithm), and connection words are added to keep the output fluent.
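A minimal sketch of the rectangle standardization, using `scipy.ndimage.label` (a connected-component labeling routine solving the same problem as Hoshen–Kopelman) as a stand-in for the authors' implementation:

```python
import numpy as np
from scipy import ndimage

def rectangularize(Y: np.ndarray) -> np.ndarray:
    """Replace every connected region of the binary map Y with its bounding rectangle."""
    labeled, n_regions = ndimage.label(Y > 0)
    out = np.zeros_like(Y)
    for sl in ndimage.find_objects(labeled):   # one bounding-box slice pair per region
        if sl is not None:
            out[sl] = 1
    return out

# e.g. rectangularize(np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0]])) fills the 2x2 corner
```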
Results on the REWRITE dataset (Su et al., 2019):
[EMNLP 2019] Unsupervised Context Rewriting for Open Domain Conversation (Zhou et al., 2019)
Pseudo Data Generation: during generation, every word within two positions of each keyword is considered as a candidate insertion word. To decide where to insert, a multi-layer RNN language model is used, and the three highest-scoring generated sentences are kept. These sentences are then fed into an encoder-decoder generation model $M_{s2s}$ and a response-selection model $M_{ir}$:
$$
L_{M_{s2s}}\left(r \mid s^{*}\right)=-\frac{1}{n} \sum_{i=1}^{n} \log p\left(r_{i} \mid r_{1}, \ldots, r_{i-1}, s^{*}\right)
$$

$$
L_{M_{ir}}\left(po, ne, s^{*}\right)=M_{ir}\left(po, s^{*}\right)-M_{ir}\left(ne, s^{*}\right)
$$
$$
e_{i, t, l}=
\begin{cases}
1 & \text{if } x_{i, t, l} \text{ is an entity and } x_{i, t, l} \in \mathcal{Y}_{i, t}^{*} \\
-1 & \text{if } x_{i, t, l} \text{ is an entity and } x_{i, t, l} \notin \mathcal{Y}_{i, t}^{*} \\
0 & \text{otherwise}
\end{cases}
$$
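A minimal sketch of the two training signals and the entity labeling rule above, under the assumption that $M_{ir}$ returns a scalar matching score; all function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F
from typing import List, Set

def seq2seq_nll(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L_{M_s2s}: average negative log-likelihood of the reference response tokens.
    logits: (n, |V|) decoder outputs; target: (n,) gold token ids."""
    return F.cross_entropy(logits, target)

def ir_margin(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    """L_{M_ir}: score difference between a positive and a negative response."""
    return score_pos - score_neg

def entity_labels(context_words: List[str], entities: Set[str], rewrite_words: Set[str]) -> List[int]:
    """e_{i,t,l}: +1 for a context entity kept in the reference rewrite Y*,
    -1 for a context entity that is dropped, 0 for non-entity words."""
    return [
        (1 if w in rewrite_words else -1) if w in entities else 0
        for w in context_words
    ]
```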
[EMNLP 2019] GECOR: An End-to-End Generative Ellipsis and Co-reference Resolution Model for Task-Oriented Dialogue (Quan et al., 2019)
Code: https://github.com/terryqj0107/GECOR
Oral: https://vimeo.com/424412465
Web: https://multinlp.github.io/GECOR/
Proposes the Generative Ellipsis and Co-reference Resolution model (GECOR); the key idea is to switch between a generation mode and a copy mode to produce a semantically complete utterance. The paper also releases a new annotated dataset built on CamRest676, which contains 676 dialogues with 2,744 user utterances; among these, 1,174 ellipses and 1,209 co-references are annotated. Following the data split of Su et al. (2019), CamRest676 contains 1,331 positive examples with ellipsis or co-reference and 1,413 negative examples that need no rewriting.
GECOR does not condition on the syntactic type of the ellipsis or co-reference; the missing material can be a word, a phrase, or even a short sentence. The approach also needs no set of candidate mentions to resolve, whereas previous work such as Lee et al. (2018) typically has to traverse the text when multiple ellipses or co-references are present, which is computationally expensive.
The embedding layer uses GloVe; the Utterance Encoder and the Context Encoder encode the utterance to be rewritten and the dialogue context, respectively. In the decoder, a generation probability $p_{g}\left(y_{t} \mid \cdot\right)$ and a copy probability $p_{c}\left(y_{t} \mid \cdot\right)$ similar to Gu et al. (2016) are computed:
The copy mechanism above (Gu et al., 2016) can be replaced with a gated copy mechanism adapted from See et al. (2017), i.e. introducing a $p_{gen} \in (0,1)$:
$$
\begin{aligned}
p_{gen} &= \sigma\left(W_{h} h_{L}^{*}+W_{s} s_{l}+W_{y} y_{l-1}+b_{l}\right) \\
P\left(y_{l}\right) &= p_{gen} P^{g}\left(y_{l}\right)+\left(1-p_{gen}\right) P^{c}\left(y_{l}\right)
\end{aligned}
$$
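A minimal sketch of the gated copy mixture above; layer shapes, names, and the way the inputs are passed in are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class GatedCopy(nn.Module):
    """A scalar gate p_gen mixes the generation distribution P^g with the copy distribution P^c."""
    def __init__(self, hidden: int, emb: int):
        super().__init__()
        self.w_h = nn.Linear(hidden, 1, bias=False)  # W_h for the attention context vector h*_L
        self.w_s = nn.Linear(hidden, 1, bias=False)  # W_s for the decoder state s_l
        self.w_y = nn.Linear(emb, 1, bias=True)      # W_y and bias b_l for the previous output y_{l-1}

    def forward(self, h_star, s_l, y_prev, p_g, p_c):
        # p_g, p_c: generation / copy distributions over the output vocabulary, shape (V,)
        p_gen = torch.sigmoid(self.w_h(h_star) + self.w_s(s_l) + self.w_y(y_prev))
        return p_gen * p_g + (1.0 - p_gen) * p_c     # P(y_l)
```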
[ACL 2020] Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation (Song et al., 2020)
Tencent AI Lab again; they are really prolific.
[EMNLP 2020] Semantic Role Labeling Guided Multi-turn Dialogue ReWriter (Xu et al., 2020)
Tencent AI Lab's third entry here. The paper points out that the decoders of the preceding work mostly use global attention and therefore attend to all words in the dialogue context. Without an explicit prior focus, that attention can be drawn to unimportant words. A natural idea follows: use Semantic Role Labeling (SRL) to identify the predicate-argument structure of a sentence, capturing "who did what to whom" as a prior that guides the decoder. The example in the figure below illustrates the role semantic components play in the utterance: "粤语" (Cantonese) and "普通话" (Mandarin) are recognized as two different arguments and can receive more attention under SRL guidance.
Tencent AI Lab's annotation team labeled 3,000 dialogue sessions from the Duconv dataset (Wu et al., 2019), covering 33,673 predicates over 27,198 utterances. An off-the-shelf BERT-based SRL model (Shi and Lin, 2019) serves as the SRL parser for predicate-argument identification and is first pre-trained on CoNLL 2012 (117,089 examples).
Experimental results on Restoration-200K (Pan et al., 2019) are shown below; the model still does not beat the PAC model proposed by Pan et al. (2019).
we find that the SRL information mainly improves the performance on the dialogues that require information completion. One omitted information is considered as properly completed if the rewritten utterance recovers the omitted words. We find the SRL parser naturally offers important guidance into the selection of omitted words
[EACL 2021] Ellipsis Resolution as Question Answering: An Evaluation (Aralikatte et al., 2021)
Previous work usually handles ellipsis and co-reference with different operations, typically Insert and Replace. R~A~ST replaces these two operations with a single deletion-then-insertion scheme. The supervision is generated with a longest common sub-sequence (LCS) algorithm, discarding the examples in REWRITE (Su et al., 2019) and RESTORATION-200K (Pan et al., 2019) that do not meet the requirements.
Deletion $\in \{0, 1\}$: the word $x_n$ is deleted (i.e. 1) or not (i.e. 0);
Insertion $[start, end]$: a span position in the dialogue context; if no phrase is inserted, the span is $[-1, -1]$.
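A minimal sketch of deriving the two tag types with an LCS-style alignment, using `difflib.SequenceMatcher` as a stand-in for the LCS algorithm mentioned above; mapping the inserted spans back to positions in the dialogue context is omitted:

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def lcs_edit_labels(utterance: List[str], rewrite: List[str]) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Return a 0/1 deletion flag per word of the original utterance and the
    [start, end) spans of the rewrite that have no aligned source word (insertions)."""
    sm = SequenceMatcher(a=utterance, b=rewrite, autojunk=False)
    deleted = [1] * len(utterance)
    inserted: List[Tuple[int, int]] = []
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "equal":                  # words kept by the rewrite
            for i in range(a1, a2):
                deleted[i] = 0
        elif op in ("insert", "replace"):  # rewrite-side words that were added
            inserted.append((b1, b2))
    return deleted, inserted
```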
Adam is an adaptive optimization algorithm that combines Momentum and RMSprop: it uses a momentum term as the parameter update direction while keeping an exponentially weighted average of the squared gradients. Adam is used pervasively across deep learning; according to Nature Index and Google Scholar, it has been the most-cited scientific paper of the past five years and is jokingly called an "AI paper counter".
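A worked sketch of one Adam update, spelling out the two ideas mentioned above: a momentum-style first-moment estimate for the direction and an RMSprop-style second-moment estimate for per-parameter scaling:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient grad at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment: momentum-style direction
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: RMSprop-style squared-gradient average
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                  # bias correction for the second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```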