@graemenail

Description

Motivation

This is a port of PR #2 onto the browsermt fork of Marian.
It is necessary in order to use models that require intgemm.

Caveats

  1. --gemm-precision int8shiftAlphaAll currently doesn't work with marian-scorer; int8 and int8shift do, however. The problem is related to loading the precomputed alphas; work on this is ongoing.

This PR adds the ability to use a shortlist during (re)scoring in marian-scorer. The aim is to obtain word-scores from marian-scorer that are comparable to those produced during decoding.

During decoding, tensor indices corresponding to non-shortlist tokens are discarded. This reduction in tensor size lowers the cost of subsequent computations and improves decoder performance. As a result, the softmax+cross-entropy operation only ever sees shortlisted tokens. To imitate this in marian-scorer, we perform a modified softmax whose normalisation factor is computed from the sum over the subset defined by the shortlist. The sum over shortlist tokens alone is correctly normalised to unity, while the sum over the full vocabulary is greater than (or equal to) unity. Consequently, when scoring encounters a token that is not in the shortlist, its log-probability is not bounded above by 0 and may be positive.
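Concretely, for logits $s$ and shortlist $S$, the modified log-softmax sketched above is (our notation here, not taken from the Marian source):

$$
\log p(i) = s_i - \log \sum_{j \in S} \exp(s_j),
$$

so that $\sum_{i \in S} p(i) = 1$, while summing over the full vocabulary gives a total greater than or equal to 1; for $i \notin S$, $\log p(i)$ can exceed 0.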

You must maintain the same batching as used in decoding! The generated shortlist depends on the contents of each batch, specifically on the distinct tokens it contains.

For performance, Marian's cross-entropy operation implements the softmax sum as part of its node operation. This implementation differs: it composes the same result from several node operations, as the sketch below illustrates.
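To make the behaviour concrete, here is a minimal numeric sketch of the shortlist-restricted normalisation on plain arrays, outside the expression graph. The function name `shortlistLogProb` and all values are illustrative, not part of Marian's API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Log-probability of `target` under a softmax whose normaliser sums only
// over `shortlist` indices: log p(i) = s_i - log(sum_{j in S} exp(s_j)).
// Illustrative sketch; not Marian's actual implementation.
double shortlistLogProb(const std::vector<double>& logits,
                        const std::vector<int>& shortlist,
                        int target) {
  // Subtract the shortlist maximum for numerical stability, mirroring the
  // max-subtraction step of a standard softmax.
  double maxS = -INFINITY;
  for (int j : shortlist) maxS = std::max(maxS, logits[j]);

  double sum = 0.0;
  for (int j : shortlist) sum += std::exp(logits[j] - maxS);

  // `target` need not be in the shortlist; if it is not, the result is
  // not bounded above by 0 and can be positive.
  return (logits[target] - maxS) - std::log(sum);
}

int main() {
  std::vector<double> logits = {2.0, 1.0, 0.5, 3.0};  // vocabulary of 4
  std::vector<int> shortlist = {0, 1};                // tokens 0 and 1 only

  // In-shortlist token: properly normalised, log-probability <= 0.
  std::printf("log p(0): %f\n", shortlistLogProb(logits, shortlist, 0));
  // Out-of-shortlist token: may exceed 0, as described above.
  std::printf("log p(3): %f\n", shortlistLogProb(logits, shortlist, 3));
}
```

Decomposed this way, the result takes several elementary steps (max, subtract, exp, masked sum, log), which is why the graph version uses several node operations rather than a single fused cross-entropy kernel.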

Finally, decoding and scoring are two distinct modes of operation that use different code paths and therefore different expression graphs: the decoder generates tokens sequentially, while the scorer is given them ahead of time. As such, floating-point errors propagate differently, and results may differ numerically.

Added dependencies: none

How to test

Using the same shortlist settings as in decoding (e.g. --shortlist lex.s2t.gz 100 100), you should obtain roughly similar word-scores when rescoring decoder output.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

The cross_entropy_shortlist operation implements cross-entropy with a
modified softmax stage. This modified softmax uses the shortlist indices
to define the subset over which the softmax is normalized.

The motivation is for entries inside the shortlist to take the values
they would have in the absence of non-shortlist entries. This allows
results in scoring and decoding modes to be compared.
The default shortlist behaviour performs an index select to retain
only the entries corresponding to shortlist candidates. This is
desirable for decoding, where the reduction in tensor size improves
performance.

The new behaviour retains the full size of the tensors and is
designed to be used with marian-scorer. The list of shortlist
candidate indices can then be used during loss computation.