Description
Currently, the spellcheck-autoencoder is trained only to reconstruct the exact characters of its input tokens; nothing conditions it towards a more human-like assessment of whether a reconstruction is correct.
Because the decoder works left-to-right, it currently learns mostly to represent string length, and then greedily reconstructs longer and longer identical prefixes. For example, decoder output over epochs currently looks like this:
| Original Token | Reconstruction at Epoch 1 | Reconstruction at Epoch 2 | Reconstruction at Epoch 3 |
|---|---|---|---|
| rue de la 24 di septiembre | rue de laiji97jfoefokp | rue de la 22 de septxcg | rue de la 24 de septiembdr |
| chicago | chicgxha | chicagxo | chicago |
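The prefix-growth pattern in the table can be made explicit by counting, per epoch, how many leading characters of the reconstruction match the original. A minimal sketch in Python; `common_prefix_len` is a hypothetical helper name, not part of the project:

```python
import os


def common_prefix_len(original: str, reconstruction: str) -> int:
    """Number of leading characters shared by the two strings."""
    # os.path.commonprefix compares character by character, not path by path.
    return len(os.path.commonprefix([original, reconstruction]))


original = "chicago"
epochs = ["chicgxha", "chicagxo", "chicago"]  # rows taken from the table above
print([common_prefix_len(original, r) for r in epochs])  # [4, 6, 7]
```

The matched prefix grows monotonically over the epochs, while everything after the first mismatch is still noise.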
As such, latent coordinates currently represent mostly string length and prefix information. The decoding process should be changed as follows:
- Make length-decoding independent from lexical decoding. In a preflight process, a decoder RNN should create a "decoding bed", which relieves the lexical decoder from having to learn string length decoding.
- Instead of lexical decoding, try phonetic decoding. It is still to be determined which phonetic labels to use; as a start, spellfix1 transcription could be implemented. [2]
- Make lexical decoding bidirectional. To prevent greedy prefix learning, lexical unrolling will be performed right-to-left first, and the resulting sequence is then concatenated with the decoding bed. [1]
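As an illustration of the first and third points, the training targets for the preflight and backward decoders might be derived from a token roughly as follows. This is only a sketch: the symbols `X`, `~` and `$` are taken from the diagram in [1], while the function names and the choice to reverse the end-marked token are assumptions.

```python
END = "$"  # end-of-sequence marker, as used in the diagram in [1]


def decoding_bed(token: str) -> str:
    """Length-only target: 'X' per character, '~' per space, then the end marker."""
    return "".join("~" if ch == " " else "X" for ch in token) + END


def backward_target(token: str) -> str:
    """Lexical target in right-to-left generation order (assumed layout)."""
    return (token + END)[::-1]


print(decoding_bed("st louis"))     # XX~XXXXX$
print(backward_target("st louis"))  # $siuol ts
```

The decoding bed carries no lexical information, so learning it should not interfere with what the lexical decoders have to represent.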
[1] Stacked bidirectional decoder architecture

    Encoded Vector '                                    Decoder
    ---------------'------------------------------------------------------------------------------
                   (Forward Decoder)    "s"    "t"    " "    "l"    "o"    "u"    "i"    "s"    "$"
                   '                     ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧
                   '                     |      |      |      |      |      |      |      |      |
                   '                    [ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]
                   '                    /|     /|     /|     /|     /|     /|     /|     /|     /|
                   '                   / |    / |    / |    / |    / |    / |    / |    / |    / |
                   '                  |  |   |  |   |  |   |  |   |  |   |  |   |  |   |  |   |  |
                   (Backward Decoder) | "?"  | "?"  | " "  | "?"  | "o"  | "u"  | "i"  | "s"  | "$"
                   '                  |  ∧   |  ∧   |  ∧   |  ∧   |  ∧   |  ∧   |  ∧   |  ∧   |  ∧
                   '                  |  |   |  |   |  |   |  |   |  |   |  |   |  |   |  |   |  |
                   '                  | [ ]<---[ ]<---[ ]<---[ ]<---[ ]<---[ ]<---[ ]<---[ ]<---[ ]
                ___'__________________|__|___|__|___|__|___|__|___|__|___|__|___|__|___|__|___|__|
    [0]        |   '                     |      |      |      |      |      |      |      |      |
    [0]        |   (Preflight Decoder)  "X"    "X"    "~"    "X"    "X"    "X"    "X"    "X"    "$"
    [0]        |   '                     ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧
    [0] ------>|   '                     |      |      |      |      |      |      |      |      |
    [0]        |   '                    [ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]--->[ ]
    [0]        |   '                     ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧      ∧
    [0]        |___'_____________________|______|______|______|______|______|______|______|______|
    [0]            '

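One possible reading of the wiring in the diagram, written as plain Python so that the order of the three passes is explicit. Everything here is an assumption made for illustration: the cell interface `(state, inputs) -> (new_state, output)`, the function names, and the toy cells, which only record the order in which positions are visited and stand in for real RNN cells.

```python
END = "$"


def run_stacked_decoder(encoded, preflight_cell, backward_cell, forward_cell, max_len=64):
    """Run the three decoders in the order suggested by the diagram."""
    # 1. Preflight decoder: left to right, fed the encoded vector at every
    #    step, until it emits the end marker. Its outputs form the decoding bed.
    bed, state = [], None
    while len(bed) < max_len:
        state, out = preflight_cell(state, encoded)
        bed.append(out)
        if out == END:
            break

    # 2. Backward decoder: right to left over the decoding bed.
    backward, state = [None] * len(bed), None
    for i in reversed(range(len(bed))):
        state, backward[i] = backward_cell(state, bed[i])

    # 3. Forward decoder: left to right; each step sees the backward output
    #    together with the decoding bed symbol at the same position.
    forward, state = [], None
    for i in range(len(bed)):
        state, out = forward_cell(state, (backward[i], bed[i]))
        forward.append(out)
    return bed, backward, forward


# Toy cells: the state is a step counter. In toy_preflight, `encoded` is
# just a target length standing in for the encoded vector.
def toy_preflight(state, encoded):
    step = 0 if state is None else state
    return step + 1, ("X" if step < encoded else END)


def toy_backward(state, bed_symbol):
    step = 0 if state is None else state
    return step + 1, step


def toy_forward(state, inputs):
    step = 0 if state is None else state
    return step + 1, (step, inputs)


bed, backward, forward = run_stacked_decoder(3, toy_preflight, toy_backward, toy_forward)
print(bed)       # ['X', 'X', 'X', '$']
print(backward)  # [3, 2, 1, 0]
```

The backward outputs show that the last position is decoded first, and every forward step receives a backward output that was produced before the forward pass started.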
[2] Spellfix1 Phonetic Hashing (spellfix1_phonehash)
Phonetic replacements:
- A, E, I, O, U, Y ⟶ A
- B, F, P, V ⟶ B
- C, G, J, K, Q, S, X, Z ⟶ C
- D, T ⟶ D
- L ⟶ L
- M, N ⟶ M
- R ⟶ R
- H, W ⟶ _
Other rules (to be ignored/amended/selected):
- Omit double letters
- Omit vowels beside L and R
- Omit T before CH
- Omit W before R
- Omit D before J
- Omit K or G before N at beginning of word
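The replacement table in [2] could be turned into phonetic labels with a simple character mapping. The sketch below covers only that table and none of the "other rules"; it is not the actual `spellfix1_phonehash` implementation. The name `phonetic_classes`, the pass-through of non-letter characters, and the `silent` parameter (whether `_` is kept as a placeholder or dropped) are assumptions.

```python
# Letter groups and their class symbol, copied from the replacement table.
GROUPS = {
    "AEIOUY": "A",
    "BFPV": "B",
    "CGJKQSXZ": "C",
    "DT": "D",
    "L": "L",
    "MN": "M",
    "R": "R",
}
LETTER_TO_CLASS = {letter: cls for letters, cls in GROUPS.items() for letter in letters}


def phonetic_classes(token: str, silent: str = "_") -> str:
    """Map each letter to its phonetic class; H and W become `silent`."""
    out = []
    for ch in token.upper():
        if ch in "HW":
            out.append(silent)
        else:
            # Assumption: digits, spaces and other characters pass through unchanged.
            out.append(LETTER_TO_CLASS.get(ch, ch))
    return "".join(out)


print(phonetic_classes("chicago"))             # C_ACACA
print(phonetic_classes("st louis"))            # CD LAAAC
print(phonetic_classes("chicago", silent=""))  # CACACA
```

With such a mapping, near-homophone misspellings such as `chikago` and `chicago` receive the same label sequence, which is the kind of tolerance the phonetic decoding target is meant to provide.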
Alternatively, try CMU Logios (http://www.speech.cs.cmu.edu/tools/lextool.html).
Alternatively, try CMU G2P (https://github.com/cmusphinx/g2p-seq2seq)
Notably, all these phonetic transcriptions are optimized for English/Latin languages.