Exploring Controllable Text Generation Techniques - 2020

4 minute read

Information

Link: Arxiv

Paper from: Carnegie Mellon University (CMU)

Why is this paper important?: Puts different mechanisms to control generated text into a framework.

Code: NA

Summary

Controllable text generation is the task of generating text whose attributes can be controlled. These attributes include, but are not limited to, style, demographics, and content: stylistic attributes can be politeness, sentiment, or formality; demographic attributes can be gender and age; content attributes can be information, keywords, entities, ordering of information, or events. Controllable text generation can be modeled as a conditional language generation task; however, the attributes can be controlled through five types of mechanisms:

  • External Input
  • Sequential Input
  • Generator Operations
  • Output
  • Training Objective

All of these fit into the generation process as follows:

Figure 1 - Different Mechanisms for Control

External Input

Standard (non-controlled) mechanism: \(h_{0} = h_{e}\) where \(h_0\) is the generator initialization and \(h_e\) is the representation of the input sentence by the encoder.

Controlled Mechanisms:

Arithmetic or Linear Transform

  1. Concatenate a control vector \(s\) to \(h_{e}\): \(h_{0} = [h_{e};s]\). Some examples:
    • To control the style in text generation
    • To control dialogue response generation with information from external sources
    • To control visual stories with a personality control vector derived from a separate corpus
  2. Arithmetic operations on \(h_{e}\): the paper on visual stories also experimented with arithmetic operations instead of concatenation: \(h_{0} = h_{e} - S + P\), where \(S\) is the average representation of the story and \(P\) is the personality representation.
  3. Linear Transform: \(h_{0} = tanh(W_{1}h_{e} + W_{2}s + b)\). This has been reported to perform better than the first two options (all three variants are sketched below).
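
Below is a minimal PyTorch sketch of the three options above; the dimensions, variable names, and randomly initialized tensors are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

d_enc, d_ctrl, d_dec = 512, 64, 512
h_e = torch.randn(1, d_enc)   # encoder representation of the input sentence
s = torch.randn(1, d_ctrl)    # control vector (e.g., a style embedding)

# 1. Concatenation: the decoder must accept a (d_enc + d_ctrl)-sized state.
h0_concat = torch.cat([h_e, s], dim=-1)

# 2. Arithmetic: requires S and P to live in the same space as h_e.
S = torch.randn(1, d_enc)     # average story representation
P = torch.randn(1, d_enc)     # personality representation
h0_arith = h_e - S + P

# 3. Linear transform: h0 = tanh(W1 h_e + W2 s + b)
W1 = nn.Linear(d_enc, d_dec, bias=False)
W2 = nn.Linear(d_ctrl, d_dec, bias=True)
h0_linear = torch.tanh(W1(h_e) + W2(s))
```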

Stochastic Changes

Stochastically draw a latent variable \(z\) from a Gaussian distribution and base \(h_{0}\) on this \(z\). This has been used to guide generation with the topics of a document, and also to control for sentiment in the style transfer task.
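
A hedged sketch of this stochastic variant, assuming the Gaussian parameters are predicted from \(h_{e}\) and that the reparameterization trick is used; module names and sizes are assumptions.

```python
import torch
import torch.nn as nn

d_enc, d_z = 512, 32
h_e = torch.randn(1, d_enc)

to_mu = nn.Linear(d_enc, d_z)
to_logvar = nn.Linear(d_enc, d_z)
to_h0 = nn.Linear(d_z, d_enc)

mu, logvar = to_mu(h_e), to_logvar(h_e)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # z ~ N(mu, sigma^2)
h0 = torch.tanh(to_h0(z))                                # base h0 on the latent z
```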

Decompose

This mechanism decomposes \(h_{e}\) into multiple subspaces, where each subspace captures a different attribute you wish to control (a sketch follows the examples). Some examples:

  • Structure & semantic information of a document.
  • Force the first \(n\) dimensions to capture meaning and the remaining to capture structure.
  • Decompose into a form vector \(f\) and a meaning vector \(m\). A discriminator ensures that \(m\) does not carry any form information, while another loss encourages \(f\) to carry form information: \(m = tanh(W_{m}h_{e} + b_{m})\) and \(f = tanh(W_{f}h_{e} + b_{f})\).
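
A small sketch of the decomposition formulas above; the sizes and the way the two subspaces are recombined are assumptions, and the discriminator is omitted.

```python
import torch
import torch.nn as nn

d_enc, d_m, d_f = 512, 384, 128
h_e = torch.randn(1, d_enc)

W_m = nn.Linear(d_enc, d_m)  # m = tanh(W_m h_e + b_m)
W_f = nn.Linear(d_enc, d_f)  # f = tanh(W_f h_e + b_f)

m = torch.tanh(W_m(h_e))     # meaning subspace
f = torch.tanh(W_f(h_e))     # form subspace
h0 = torch.cat([m, f], dim=-1)  # one possible way to recombine the subspaces
```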

External Feedback

An adversarial loss is used to alter the latent space (a rough sketch follows the list). Some examples:

  • A multi-layer perceptron is used for predicting style labels.
  • Multiple losses to control style and content.
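
A rough sketch of the multi-layer perceptron variant, assuming a simple setup where a discriminator learns to predict the style from the latent code while the encoder is pushed to make that prediction uninformative; all details below are assumptions.

```python
import torch
import torch.nn as nn

d_enc, n_styles = 512, 2
h_e = torch.randn(4, d_enc, requires_grad=True)   # a batch of encoder states
style_labels = torch.tensor([0, 1, 0, 1])

discriminator = nn.Sequential(
    nn.Linear(d_enc, 128), nn.ReLU(), nn.Linear(128, n_styles)
)

logits = discriminator(h_e)
# Discriminator step: learn to recognize the style from the latent code.
d_loss = nn.functional.cross_entropy(logits, style_labels)
# Adversarial (encoder) step: push the latent code toward a style-agnostic
# prediction by maximizing the classifier's entropy (minimizing its negative).
probs = torch.softmax(logits, dim=-1)
adv_loss = (probs * torch.log(probs + 1e-9)).sum(dim=-1).mean()
```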

Sequential Input

Standard (non-controlled) mechanism: \(x_{t}\), the input to the decoder at time step \(t\), is not altered: \(x_{t} = y_{t-1}\), where \(y_{t-1}\) is the token generated at the previous time step.

Controlled Mechanism:

Arithmetic or Linear Transform

Concatenate a control vector \(s\) to the input of the generator at each time step: \(\tilde{x_{t}} = [x_{t};s]\) (sketched after the examples). Some examples:

  • Representations of external information sources are concatenated as \(s\) to the input of the generator.
  • Concatenate side constraints that represent style and personality.
  • Add the personality representation \(P\) at each time step: \(\tilde{x_{t}} = x_{t} - S + P\), where \(S\) is again the average representation of the story.
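
A minimal sketch of concatenating \(s\) to the decoder input at every time step; the GRU cell, the sizes, and the toy decoding loop are assumptions.

```python
import torch
import torch.nn as nn

d_tok, d_ctrl, d_hid, T = 256, 64, 512, 5
cell = nn.GRUCell(d_tok + d_ctrl, d_hid)

s = torch.randn(1, d_ctrl)        # control vector, fixed across time steps
h = torch.zeros(1, d_hid)         # decoder state
y_prev = torch.randn(1, d_tok)    # embedding of the previously generated token

for t in range(T):
    x_t = torch.cat([y_prev, s], dim=-1)  # x~_t = [x_t ; s]
    h = cell(x_t, h)
    y_prev = torch.randn(1, d_tok)        # stand-in for embedding the new token
```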

Generator Operations

These are variant architectures used as the generator.

  • Recurrent Neural Networks (RNNs)
  • Transformer
  • Pre-trained Models

Output

\(o_{t}\) is the output of the generator at time step \(t\), which is projected onto the vocabulary space to predict the next token.

Attention

  • Global attention: computationally expensive for long sequences.
  • Local attention: more efficient because it is calculated over a window of size \(D\).
  • Self-attention: used to control for style by adding a style token to the source sequence (see the sketch below).
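
A small sketch contrasting global and local attention for a single decoder step \(t\); the window handling and sizes are assumptions, and self-attention with a prepended style token is not shown.

```python
import torch

L, d, t, D = 20, 64, 10, 3
query = torch.randn(1, d)          # decoder state at step t
keys = values = torch.randn(L, d)  # encoder outputs

# Global attention: score every source position.
scores = query @ keys.T / d ** 0.5
global_ctx = torch.softmax(scores, dim=-1) @ values

# Local attention: only positions within the window [t - D, t + D].
lo, hi = max(0, t - D), min(L, t + D + 1)
local_scores = query @ keys[lo:hi].T / d ** 0.5
local_ctx = torch.softmax(local_scores, dim=-1) @ values[lo:hi]
```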

External Feedback

Output latent space can be changed using adversarial loss.

  • One paper encourages realistic generation and attribute-compatible sentences by trying to match the distribution of sentence and attribute vector pairs \((x,s)\).
  • Another paper provides different rewards for style, semantics and fluency within a reinforcement learning setup.

Arithmetic or Linear Transform

Some examples (a compact sketch of all three follows):

  • \(\tilde{o_{t}} = o_{t} + s\) where \(o_{t}\) is the output of an RNN.
  • \(\tilde{o_{t}} = tanh(W_{o}o_{t} + W_{s}s + b_{o})\).
  • \(\tilde{o_{t}} = [o_{t};s]\).
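
A compact sketch of the three output-side variants above, applied to \(o_{t}\) before the vocabulary projection; the dimensions are assumptions.

```python
import torch
import torch.nn as nn

d_out, d_ctrl = 512, 512
o_t = torch.randn(1, d_out)   # RNN output at time step t
s = torch.randn(1, d_ctrl)    # control vector

o_add = o_t + s                                   # requires d_ctrl == d_out
W_o = nn.Linear(d_out, d_out, bias=False)
W_s = nn.Linear(d_ctrl, d_out)
o_linear = torch.tanh(W_o(o_t) + W_s(s))          # tanh(W_o o_t + W_s s + b_o)
o_concat = torch.cat([o_t, s], dim=-1)            # [o_t ; s]
```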

Training Objective

General Loss Objective

These objectives do not control for any attribute.

Cross Entropy Loss

The classic categorical cross entropy loss.

Unlikelihood Loss

Keeps a set of negative candidates, based on repeated tokens or n-grams, which is updated at each time step. This loss minimizes repetitions and is used to augment the maximum likelihood objective.
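
A simplified sketch of a token-level unlikelihood term, assuming the negative candidates at step \(t\) are simply the tokens that already appeared in the prefix; the inputs are toy values.

```python
import torch

V, T = 100, 6
logits = torch.randn(T, V)                  # model logits for T time steps
targets = torch.tensor([5, 7, 5, 9, 7, 5])  # gold tokens

probs = torch.softmax(logits, dim=-1)
mle_loss = -torch.log(probs[torch.arange(T), targets]).mean()

unlikelihood = 0.0
for t in range(1, T):
    negatives = set(targets[:t].tolist()) - {targets[t].item()}  # candidate set C_t
    for c in negatives:
        # Push down the probability of repeating a previous token.
        unlikelihood = unlikelihood - torch.log(1 - probs[t, c] + 1e-9)

loss = mle_loss + 0.5 * unlikelihood / T    # augment MLE with the penalty
```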

Diversity-Promoting Objective

Aims to generate varied sentences given similar inputs. Maximum Mutual Information (MMI) also tries to reduce generic responses. The objective is \(\hat{\textbf{T}} = argmax_{\textbf{T}}\{\log p(\textbf{T}|\textbf{S}) - \lambda \log p(\textbf{T})\}\).
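
A toy sketch of the re-scoring implied by this formula: among candidate responses \(T\) for a source \(S\), pick the one maximizing \(\log p(T|S) - \lambda \log p(T)\). The candidate strings and scores below are made up.

```python
# Hypothetical candidates with made-up log-probabilities.
candidates = {
    "i don't know":         {"logp_t_given_s": -4.0, "logp_t": -2.0},
    "the game starts at 7": {"logp_t_given_s": -5.0, "logp_t": -9.0},
}
lam = 0.5

def mmi_score(c):
    # log p(T|S) - lambda * log p(T)
    return c["logp_t_given_s"] - lam * c["logp_t"]

best = max(candidates, key=lambda k: mmi_score(candidates[k]))
# The generic response is penalized by its high unconditional likelihood.
```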

KL Divergence

Quantifies how much one probability distribution differs from another; for probability distributions \(P\) and \(Q\), it is written \(KL(P||Q)\). In the text domain, KL divergence is combined with the evidence lower bound (ELBO).
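
A short sketch of KL divergence for discrete distributions, plus the closed-form Gaussian KL term that typically appears inside the ELBO; all numbers are toy values.

```python
import torch

# KL(P || Q) for two discrete distributions.
P = torch.tensor([0.7, 0.2, 0.1])
Q = torch.tensor([0.5, 0.3, 0.2])
kl_pq = (P * (P / Q).log()).sum()

# KL(N(mu, sigma^2) || N(0, 1)), the regularizer inside the ELBO of a
# variational model (here with mu = 0 and log-variance = 0).
mu, logvar = torch.zeros(32), torch.zeros(32)
kl_gauss = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
```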

Classifier Loss

Used to ensure that the generated tokens are in line with the control attributes. This operates on the token level and becomes less effective as the number of styles increases.
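
A rough sketch of one way a classifier loss can be wired in, assuming a differentiable soft-embedding trick and a small attribute classifier (which in practice would be pre-trained and frozen); everything below is an assumption for illustration.

```python
import torch
import torch.nn as nn

V, d_emb, T, n_attrs = 100, 64, 6, 2
gen_logits = torch.randn(T, V, requires_grad=True)  # generator outputs
desired_attr = torch.tensor([1])                    # target attribute label

embedding = nn.Embedding(V, d_emb)
classifier = nn.Sequential(nn.Linear(d_emb, 64), nn.ReLU(), nn.Linear(64, n_attrs))

soft_tokens = torch.softmax(gen_logits, dim=-1)       # differentiable "tokens"
soft_embeds = soft_tokens @ embedding.weight          # (T, d_emb)
attr_logits = classifier(soft_embeds.mean(dim=0, keepdim=True))
classifier_loss = nn.functional.cross_entropy(attr_logits, desired_attr)
```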

Task Specific Loss

Strategy Loss

Uses ground-truth strategies to guide the model toward better responses in dialogue.

Coverage Loss

Penalizes repeatedly attending to the same locations of the source document.
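
A small sketch of a coverage-style penalty in the spirit described above: a coverage vector accumulates past attention, and re-attending to already-covered source positions is penalized. The attention values are random placeholders.

```python
import torch

T, L = 4, 10
attn = torch.softmax(torch.randn(T, L), dim=-1)  # attention per decoder step

coverage = torch.zeros(L)
cov_loss = 0.0
for t in range(T):
    # Overlap between current attention and what has already been covered.
    cov_loss = cov_loss + torch.minimum(attn[t], coverage).sum()
    coverage = coverage + attn[t]
```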

Structure Loss

A structural compression loss is used to generate a sentence by compressing several sentences from a specific source in the abstractive summarization task, while a structural coverage loss encourages covering the more salient information of the original document.

Comments