Experimental: Allow shortlists in marian-scorer (browsermt) #3
Description

NOTE! `--gemm-precision int8shiftAlphaAll` currently doesn't work with `marian-scorer`; `int8` and `int8shift` do, however. The problem is related to the loading of the precomputed alphas. Work is ongoing here.

This PR adds the possibility to use a shortlist during (re)scoring in marian-scorer. Its aim is to obtain word-scores from marian-scorer which are comparable to those produced during decoding.

Motivation

This is a port of PR #2 onto the browsermt fork of marian. This is necessary to use models that require intgemm.

Caveats
During decoding, tensor indices corresponding to non-shortlist tokens are discarded. This reduction in tensor size lowers the computational cost of subsequent operations and improves decoder performance. As a result, the softmax + cross-entropy operation only ever sees shortlisted tokens. To imitate this in marian-scorer, we perform a modified softmax whose normalisation factor is computed from the sum over the subset defined by the shortlist. The probabilities of shortlist tokens correctly sum to unity, while the sum over the full vocabulary is greater than (or equal to) unity. Consequently, when scoring encounters tokens that are not in the shortlist, their log-scores are not bounded above by 0 and may be positive.
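As a rough illustration, here is a minimal standalone sketch of such a shortlist-normalised log-softmax (not the actual Marian implementation; all names are invented):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Log-softmax whose normalisation factor sums only over shortlisted indices:
//   log p(w) = s_w - log( sum over v in shortlist of exp(s_v) ).
// Shortlisted probabilities sum to unity; tokens outside the shortlist are
// not bounded above by 0 and may receive positive log-scores.
std::vector<float> shortlistLogSoftmax(const std::vector<float>& logits,
                                       const std::vector<int>& shortlist) {
  float denom = 0.f;
  for (int idx : shortlist)
    denom += std::exp(logits[idx]);  // real code would subtract max(logits) first
  const float logDenom = std::log(denom);
  std::vector<float> scores(logits.size());
  for (std::size_t i = 0; i < logits.size(); ++i)
    scores[i] = logits[i] - logDenom;
  return scores;
}

int main() {
  const std::vector<float> logits = {2.f, 1.f, 0.5f, 3.f};  // full vocabulary
  const std::vector<int> shortlist = {0, 1};                // shortlisted ids
  for (float s : shortlistLogSoftmax(logits, shortlist))
    std::printf("%f\n", s);
  // Index 3 is outside the shortlist and scores 3 - log(e^2 + e^1) = +0.69.
}
```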
You must maintain the same batching as used in decoding! The size of the generated shortlist depends on the contents of a particular batch, specifically the set of distinct tokens it contains; see the sketch below.
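To see why, consider this hypothetical sketch (names invented, not Marian's shortlist code) of how a lexical shortlist is assembled from a batch: the candidate set is the union of the top-k target tokens for every source token present, so different batches produce different shortlists:

```cpp
#include <map>
#include <set>
#include <vector>

// Union of the top-k target candidates for every source token in the batch.
// A different batch therefore yields a different (and differently sized)
// shortlist, which changes what the modified softmax normalises over.
std::set<int> batchShortlist(const std::vector<int>& batchSourceTokens,
                             const std::map<int, std::vector<int>>& topKTargets) {
  std::set<int> candidates;
  for (int src : batchSourceTokens) {
    auto it = topKTargets.find(src);
    if (it != topKTargets.end())
      candidates.insert(it->second.begin(), it->second.end());
  }
  return candidates;
}

int main() {
  const std::map<int, std::vector<int>> topK = {{1, {10, 11}}, {2, {11, 12}}};
  auto small = batchShortlist({1}, topK);     // candidates {10, 11}
  auto large = batchShortlist({1, 2}, topK);  // candidates {10, 11, 12}
  (void)small; (void)large;
}
```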
For performance, the cross-entropy operation in Marian implements the softmax sum as part of its node operation. This implementation differs: it uses several node operations to accomplish the same result.
Finally, decoding and scoring are two distinct modes of operation, utilising different code paths and therefore different expression graphs: the decoder generates tokens sequentially, while the scorer has them provided ahead of time. As such, floating-point errors propagate differently, and results may differ numerically.
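To illustrate the kind of discrepancy involved, the following standalone sketch (illustrative only, not Marian code) compares a log-sum-exp computed as several separate steps against a fused, max-subtracted evaluation; the results can differ in the final bits, and such differences accumulate differently along the two code paths:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// "Composed" evaluation: separate exp, sum and log steps, as a graph built
// from several small node operations would compute them.
float composedLogSumExp(const std::vector<float>& v) {
  float sum = 0.f;
  for (float x : v) sum += std::exp(x);
  return std::log(sum);
}

// "Fused"-style evaluation with the usual max-subtraction for stability,
// as a single softmax/cross-entropy kernel would typically compute it.
float fusedLogSumExp(const std::vector<float>& v) {
  float m = *std::max_element(v.begin(), v.end());
  float sum = 0.f;
  for (float x : v) sum += std::exp(x - m);
  return m + std::log(sum);
}

int main() {
  const std::vector<float> logits = {10.1f, 9.7f, 0.2f, -3.5f};
  // The two values agree closely but need not match bit-for-bit; small
  // discrepancies like this accumulate differently when decoding and scoring.
  std::printf("composed: %.9f\nfused:    %.9f\n",
              composedLogSumExp(logits), fusedLogSumExp(logits));
}
```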
Added dependencies: none
How to test
Using the same shortlist settings as during decoding (e.g. `--shortlist lex.s2t.gz 100 100`), you should receive roughly similar word-scores when rescoring decoder output.

Checklist