Does String-Based Neural MT Learn Source Syntax?
Author Affiliation: Information Sciences Institute & Computer Science Department, University of Southern California
Paper Abstract:
We investigate whether a neural encoder-decoder translation system learns syntactic information on the source side as a by-product of training. We propose two methods to detect whether the encoder has learned local and global source syntax. A fine-grained analysis of the syntactic structure learned by the encoder reveals which kinds of syntax are learned and which are missing.
Goals
- To investigate whether neural machine translation systems learn syntactic features about the source sentence during training, as a by-product.
- To see what kind of syntactic information is learned, if any.
- To evaluate both global (sentence-level) and local (word-level) syntactic information.
Why is this important?
Non-neural systems rely on many sources of information for machine translation. Apart from other resources (word vectors, word-translation tables), syntactic information may be provided by:
- Adding source syntax [Tree-to-String]
- Adding target syntax [String-to-Tree]
- Adding both [Tree-to-Tree]
Neural models (here: a vanilla seq2seq model) need none of this machinery to deliver competitive translation quality.
This paper is a step towards understanding what goes on inside such models and why they perform so well.
A nice example
Setup
- English-French NMT system trained on 110M tokens of bilingual data (token count measured on the English side).
- English-English autoencoder system.
Steps
- Take 10K held-out English sentences and label each for voice (active/passive).
- Run the trained NMT/autoencoder over them to obtain 10K corresponding 1000-dimensional encoding vectors (seq to vec to seq).
- Train a logistic regression model on 9K of these vectors to predict voice (see the sketch below).
- Test on the remaining 1K.
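A minimal sketch of this probing step, assuming the encoding vectors have already been extracted; the random arrays below are placeholders for real encoder outputs and voice labels:

```python
# Voice-probing sketch: train logistic regression on 9K encoder
# vectors, test on 1K. `encodings` stands in for the 1000-dim
# vectors from the trained NMT (or autoencoder) encoder; `labels`
# stands in for the 0/1 active/passive annotations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
encodings = rng.normal(size=(10_000, 1000))  # placeholder encoder outputs
labels = rng.integers(0, 2, size=10_000)     # placeholder voice labels

# 9K train / 1K test split, as in the setup above.
X_train, X_test = encodings[:9_000], encodings[9_000:]
y_train, y_test = labels[:9_000], labels[9_000:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"voice prediction accuracy: {acc:.3f}")
```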
Observation
Accuracy: 92.7% with NMT encodings vs. 82.6% with autoencoder encodings.
Verdict: syntactic information is learned by the NMT system, but far less so by the autoencoder.
Experiments
Models: two-layer LSTM-based seq2seq models
- E2E (English-to-English) [lower-bound model]
- PE2PE (permuted-English-to-permuted-English) [lower-bound model]
- E2P (English-to-parse-tree) [upper-bound model]
- E2F (English-to-French)
- E2G (English-to-German)
An improvement in syntactic prediction over the lower-bound models indicates that the encoder learns syntactic information, whereas a decline from the upper-bound model shows that the encoder loses certain syntactic information.
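For concreteness, a minimal PyTorch sketch of this shared architecture: a two-layer LSTM encoder whose per-layer final states (c0/h0 lower, c1/h1 upper) are what the probing classifiers read out. The vocabulary size here is illustrative, not the paper's exact hyperparameter:

```python
# Two-layer LSTM encoder sketch; the decoder (English, permuted
# English, parse tree, French, or German) is omitted since the
# probes only look at the encoder's states.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=50_000, dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

    def forward(self, tokens):
        _, (h, c) = self.lstm(self.embed(tokens))
        # h, c have shape (num_layers, batch, dim); index 0 is the lower layer.
        return {"h0": h[0], "c0": c[0], "h1": h[1], "c1": c[1]}

enc = Encoder()
states = enc(torch.randint(0, 50_000, (4, 12)))  # batch of 4 length-12 sentences
print({k: tuple(v.shape) for k, v in states.items()})
```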
Global Syntax
- Voice: active or passive
- Tense: past or non-past
- TSS: top-level syntactic sequence of the constituency tree (only the 19 most frequent sequences are kept; the rest are treated as "other"); see the tree-label sketch after the Local Syntax list
Local Syntax
- POS: part-of-speech tag of each word
- SPC: smallest phrase constituent above each word (a phrase-level tag one step above POS); see the tree-label sketch below
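A sketch, using nltk, of how these tree-derived labels can be read off a bracketed parse; the example parse is illustrative, not from the paper's data:

```python
# Reading TSS, POS, and SPC off a bracketed constituency parse.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))) (. .))"
)

# TSS: the sequence of labels directly under the root node.
tss = [child.label() for child in parse]
print("TSS:", tss)                        # ['NP', 'VP', '.']

# POS: the preterminal tag above each word.
print("POS:", parse.pos())                # [('The', 'DT'), ('cat', 'NN'), ...]

# SPC: the smallest phrase constituent above each word, i.e. the
# label of the preterminal's parent.
spc = []
for pos in parse.treepositions("leaves"):
    spc.append(parse[pos[:-2]].label())   # grandparent of the leaf
print("SPC:", spc)                        # ['NP', 'NP', 'VP', 'PP', 'NP', 'NP', 'S']
```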
Results
- The NMT encoder learns significant sentence-level syntactic information: it can distinguish the voice and tense of the source sentence and knows its structure to some extent.
- Different syntactic information is captured at different layers, as can be seen from Figure 3 (the model has two layers): local features are preserved in the lower layer, whereas global, abstract information is stored in the upper layer.
- Units = hidden states. When predicting POS from the lower layer (c0), the gap between the NMT encoders and the neural parser (E2P) is much smaller. In the neural parser, a small subset of units explicitly takes charge of POS information, whereas in NMT the POS information is more distributed and implicit.
- No large difference between E2F and E2G
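A sketch of the layer-wise comparison behind these results: probe the same label (here POS, as a tag ID per word) from lower-layer and upper-layer states and compare accuracies. The arrays are placeholders for states extracted with an encoder like the one sketched above:

```python
# Compare probing accuracy from the lower-layer state c0 against
# the upper-layer state c1; placeholder data stands in for real
# extracted states and Penn Treebank tag IDs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
c0_states = rng.normal(size=(10_000, 1000))
c1_states = rng.normal(size=(10_000, 1000))
pos_labels = rng.integers(0, 45, size=10_000)  # placeholder tag IDs

for name, X in [("c0 (lower)", c0_states), ("c1 (upper)", c1_states)]:
    clf = LogisticRegression(max_iter=1000).fit(X[:9_000], pos_labels[:9_000])
    acc = accuracy_score(pos_labels[9_000:], clf.predict(X[9_000:]))
    print(f"{name}: POS accuracy {acc:.3f}")
```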
Extracting Syntactic Tree from Encoder
Evaluation Tools
- EVALB tool to calculate labeled bracketing F1-score
- zss package for tree edit distance (see the sketch below)
- Berkeley Parser Analyser to analyse parsing error types.
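A small sketch of the zss package's Zhang-Shasha tree-edit-distance API; the two toy trees are illustrative, whereas the paper compares gold parses against trees decoded from encoder states:

```python
# Tree edit distance between two toy constituent skeletons.
from zss import Node, simple_distance

gold = Node("S", [Node("NP"), Node("VP", [Node("NP")])])
predicted = Node("S", [Node("NP"), Node("VP")])

print(simple_distance(gold, predicted))  # 1: one missing NP node
```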