A Survey of Existing Transformer Variants

The multi-head self-attention mechanism in the standard Transformer spends a large amount of time and memory computing the similarity (attention) matrix, and this cost grows quadratically with the sequence length L, i.e., O(L^2). Since the Transformer and BERT appeared, the NLP community has produced a flood of methods that attack this complexity. This post briefly lists a few promising, instructive directions and the dozens of methods within them.
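As a quick illustration of where the O(L^2) cost comes from, here is a minimal NumPy sketch of standard single-head softmax attention; the L × L score matrix is exactly what every method below tries to avoid materializing in full. Shapes and names are illustrative only, not taken from any particular implementation.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard single-head attention: the score matrix is (L, L),
    so time and memory grow quadratically with sequence length L."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # (L, L) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                         # (L, d)

L, d = 4096, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
out = softmax_attention(Q, K, V)   # materializes a 4096 x 4096 matrix along the way
```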

Solutions

  • Memory caching
    • Transformer-XL [Dai et al., 2019]
      • Caches the hidden states of the previous segment and reuses them as extra context for the current segment
      • Relative positional encodings
  • Restrict to local neighborhoods
    • Image Transformer [Parmar et al., 2018]
  • Structural Prior
    • Sparsity:
      • Sparse Transformer [Child et al., 2019]
        • Sparse attention computes only a limited subset of the similarity scores over a sequence, rather than all possible pairs, which reduces both the compute time and the memory footprint of the attention mechanism and yields a sparse matrix instead of a full one. The sparse entries can be hand-designed or chosen at random.
        • Each token attends to an important local region (near the diagonal) instead of the whole sequence, which introduces sparsity into the attention layer (see the mask sketch after this list).
    • Compression:
      • Compressive Transformer [Rae et al., 2020]
        • Memories are stored stack-style: the oldest compressed memories are discarded, while the oldest n hidden states are compressed and pushed into the compressed memory (see the sketch after this list)
    • Clustering:
      • Routing Transformer [Roy et al., 2020]
        • A shared random weight matrix projects the queries and keys into a common space
        • K-means clustering, so that each token only attends to tokens assigned to the same cluster (see the sketch after this list)
      • Reformer [Kitaev et al., 2020]
        • Locality-sensitive hashing (LSH): tokens are hashed into buckets, and attention is computed only within each bucket (see the sketch after this list)
        • Chunking for parallel computation
        • Reversible Residual Network
    • Sliding windows:
      • Longformer [Beltagy et al., 2020]
        • Windowed self-attention over the local context, plus global attention activated by the end task (see the sketch after this list)
    • Truncated targeting:
      • Faster transformer [Chelba et al., 2020]
    • Low-rank kernels substituting softmax:
      • Transformers are RNNs [Katharopoulos et al., 2020]
      • Efficient Attention [Shen et al., 2018]
      • Linformer [Wang et al., 2020]
        • Observes that the self-attention matrix is low-rank
        • Linear approximation: an SVD-style low-rank approximation of the attention matrix, giving a biased estimate (see the sketch after this list)
      • Performer [Choromanski et al., 2020]
        • Fast Attention Via positive Orthogonal Random features (FAVOR+).
        • Uses random features to efficiently approximate arbitrary attention matrices at linear complexity. It is a scalable kernel method that is fully compatible with regular Transformers (see the sketch after this list).
          • Provably accurate (unbiased) estimation, uniform convergence, low variance.
          • Linear complexity.
          • Does not rely on any priors (such as sparsity or low-rankness).
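The sketches below illustrate several of the mechanisms named above. All of them are minimal NumPy illustrations with made-up shapes, window sizes, and helper names; they are sketches of the underlying ideas, not reference implementations of the cited papers.

First, the fixed sparsity pattern behind Sparse Transformer: each query position only scores a local band around the diagonal plus a strided set of positions, so far fewer than L^2 entries of the attention matrix are ever computed.

```python
import numpy as np

def sparse_attention_mask(L, window=4, stride=8):
    """Boolean (L, L) mask combining a local band near the diagonal with a
    strided pattern, in the spirit of Sparse Transformer's fixed patterns."""
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    local = np.abs(i - j) < window        # attend to nearby positions
    strided = (j % stride == stride - 1)  # attend to every stride-th "summary" position
    causal = j <= i                       # autoregressive masking
    return (local | strided) & causal

mask = sparse_attention_mask(L=64)
print(mask.sum(), "of", 64 * 64, "entries are computed")  # far fewer than L^2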
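Next, the memory layout of the Compressive Transformer: a rough sketch of how states evicted from the regular memory are compressed (here simple average pooling stands in for the compression function) and pushed into a bounded compressed memory whose oldest entries are dropped.

```python
import numpy as np

def update_memories(memory, comp_memory, new_states, mem_len=8, comp_len=8, rate=2):
    """Stack/FIFO-style update: states evicted from `memory` are compressed by a
    factor of `rate` (mean pooling here) and appended to `comp_memory`;
    the oldest compressed memories are discarded to respect `comp_len`."""
    memory = np.concatenate([memory, new_states], axis=0)
    evicted, memory = memory[:-mem_len], memory[-mem_len:]
    if len(evicted):
        # compress `rate` consecutive states into one by averaging
        n = (len(evicted) // rate) * rate
        compressed = evicted[:n].reshape(-1, rate, evicted.shape[-1]).mean(axis=1)
        comp_memory = np.concatenate([comp_memory, compressed], axis=0)[-comp_len:]
    return memory, comp_memory

d = 16
memory, comp_memory = np.zeros((0, d)), np.zeros((0, d))
for _ in range(6):                       # process 6 segments of length 8
    segment = np.random.randn(8, d)
    memory, comp_memory = update_memories(memory, comp_memory, segment)
print(memory.shape, comp_memory.shape)   # both stay bounded
```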
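For the Routing Transformer, a toy sketch of content-based routing: queries and keys are projected with a shared random matrix, assigned to their nearest centroid in a single step, and attention is then restricted to positions that fall into the same cluster. Centroid refinement and cluster balancing are omitted.

```python
import numpy as np

def route_by_clusters(Q, K, n_clusters=4, d_proj=16, seed=0):
    """Project queries and keys with a shared random matrix and assign each
    position to its nearest centroid; attention would then only be computed
    between positions sharing a cluster (k-means refinement omitted)."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((Q.shape[-1], d_proj))       # shared random projection
    Qp, Kp = Q @ R, K @ R
    centroids = rng.standard_normal((n_clusters, d_proj))
    q_cluster = np.argmin(((Qp[:, None] - centroids) ** 2).sum(-1), axis=-1)
    k_cluster = np.argmin(((Kp[:, None] - centroids) ** 2).sum(-1), axis=-1)
    return q_cluster[:, None] == k_cluster[None, :]      # (L, L) boolean mask

L, d = 64, 32
allowed = route_by_clusters(np.random.randn(L, d), np.random.randn(L, d))
print(allowed.mean())   # fraction of query-key pairs actually scored
```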
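For the Reformer, a sketch of the LSH bucketing step only: vectors are projected onto random directions and the argmax over the signed projections decides the hash bucket, so attention need only be computed inside each bucket. Chunking and the reversible residual layers are not shown.

```python
import numpy as np

def lsh_buckets(x, n_buckets=8, seed=0):
    """Angular LSH in the spirit of Reformer: project onto random directions
    and take the argmax over [xR, -xR] as the bucket id."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))
    xR = x @ R
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)  # (L,)

L, d = 64, 32
qk = np.random.randn(L, d)                 # Reformer ties queries and keys
buckets = lsh_buckets(qk)
same_bucket = buckets[:, None] == buckets[None, :]
print(same_bucket.mean())                  # fraction of pairs that get scored
```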
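For the Longformer, a sketch of an attention mask combining a sliding local window with a few globally attending tokens (e.g., a classification token chosen by the end task); the window size and global positions here are made up.

```python
import numpy as np

def longformer_mask(L, window=4, global_positions=(0,)):
    """Boolean (L, L) mask: every token sees a local window around itself,
    and designated global positions see, and are seen by, every token."""
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    mask = np.abs(i - j) <= window       # sliding-window local attention
    for g in global_positions:           # task-driven global attention
        mask[g, :] = True                # global token attends everywhere
        mask[:, g] = True                # every token attends to the global token
    return mask

mask = longformer_mask(L=64, window=4, global_positions=(0,))
print(mask.sum(), "of", 64 * 64, "entries")   # grows roughly linearly in L
```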
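For the Linformer, a sketch of the low-rank idea: keys and values are projected along the length dimension from L down to a fixed k before attention, so the score matrix is L × k instead of L × L. The projection matrices here are random placeholders; in the paper they are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, k=64, seed=0):
    """Project K and V along the sequence dimension (L -> k) so that the
    attention score matrix is only (L, k); a biased low-rank approximation."""
    rng = np.random.default_rng(seed)
    L, d = K.shape
    E = rng.standard_normal((k, L)) / np.sqrt(L)   # length-wise projections
    F = rng.standard_normal((k, L)) / np.sqrt(L)
    K_proj, V_proj = E @ K, F @ V                  # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)             # (L, k) instead of (L, L)
    return softmax(scores) @ V_proj                # (L, d)

L, d = 4096, 64
out = linformer_attention(np.random.randn(L, d), np.random.randn(L, d), np.random.randn(L, d))
print(out.shape)
```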
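Finally, for the Performer, a simplified sketch of linear attention with positive random features approximating the softmax kernel (the core idea of FAVOR+): queries and keys are mapped through phi(x) = exp(w·x - ||x||^2/2) / sqrt(m), after which attention can be computed as phi(Q) (phi(K)^T V) in time linear in L. Orthogonalization of the random features and numerical-stability tricks from the paper are omitted.

```python
import numpy as np

def positive_random_features(X, W):
    """phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m): positive random features
    whose inner products approximate the softmax kernel exp(q . k) in expectation."""
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

def performer_attention(Q, K, V, m=256, seed=0):
    """Linear-time attention: compute phi(K)^T V (an m x d matrix) once,
    then multiply by phi(Q); no (L, L) matrix is ever formed."""
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    W = rng.standard_normal((m, d))
    Qf = positive_random_features(Q / d ** 0.25, W)   # scaling mimics QK^T / sqrt(d)
    Kf = positive_random_features(K / d ** 0.25, W)
    KV = Kf.T @ V                                     # (m, d)
    normalizer = Qf @ Kf.sum(axis=0)                  # (L,) row normalization
    return (Qf @ KV) / normalizer[:, None]            # (L, d)

L, d = 4096, 64
out = performer_attention(np.random.randn(L, d), np.random.randn(L, d), np.random.randn(L, d))
print(out.shape)
```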

Limitations of Existing Variants before Performer

  • Do not approximate regular attention
  • Simpler and more tractable attention.
    • Mainly optimized for Transformer models and generative pre-training
    • Usually stack more attention layers to compensate for the sparser representation, which limits reusability
  • Lack of theoretical guarantees for the representation power.

References

  • Beltagy, I., Peters, M. E., and Cohan, A. (2020). Longformer: The long-document transformer.

  • Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating long sequences with sparse transformers.

  • Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L., and Weller, A. (2020). Rethinking attention with performers.

  • Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

  • Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast autoregressive transformers with linear attention.

  • Kitaev, N., Kaiser, L., and Levskaya, A. (2020). Reformer: The efficient transformer. In International Conference on Learning Representations.

  • Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. (2018). Image transformer.

  • Rae, J. W., Potapenko, A., Jayakumar, S. M., and Lillicrap, T. P. (2019). Compressive transformers for long-range sequence modelling.

  • Roy, A., Saffar, M., Vaswani, A., and Grangier, D. (2020). Efficient content-based sparse attention with routing transformers.

  • Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. (2020). Linformer: Self-attention with linear complexity.