BLISS is a dataset for testing the generalization capabilities of artificial models for language induction. The benchmark score reflects how well a model generalizes relative to the amount of data it was trained on: the less data a model needs in order to generalize, the better its score.
This repository contains the datasets and data generation scripts for training and testing a model on BLISS.
For the full method and specs, see the paper Benchmarking Neural Network Generalization for Grammar Induction. The benchmark covers the following formal languages:
- aⁿbⁿ
- aⁿbⁿcⁿ
- aⁿbⁿcⁿdⁿ
- aⁿbᵐcⁿ⁺ᵐ
- Dyck-1
- Dyck-2
Please use the following citation if you use the datasets in your work:
@inproceedings{Lan_Chemla_Katzir_2023,
title={Benchmarking Neural Network Generalization for Grammar Induction},
author={Lan, Nur and Chemla, Emmanuel and Katzir, Roni},
booktitle={Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)},
pages={131--140},
year={2023}
}
Following Gers & Schmidhuber (2001), all sequences start and end with the symbol #. This makes it possible to test for strict acceptance/rejection.
All files contain strings surrounded with # from both sides. Inputs and targets need to be trimmed accordingly.
Example:
| aⁿbⁿ | |
| --- | --- |
| Input string | #aaabbb |
| Target string | aaabbb# |
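A minimal sketch of the trimming step, assuming one `#`-delimited string per line as described above (the string here is illustrative):

```python
# Derive input/target pairs from a '#'-delimited string.
line = "#aaabbb#"
input_symbols = list(line[:-1])   # ['#', 'a', 'a', 'a', 'b', 'b', 'b']
target_symbols = list(line[1:])   # ['a', 'a', 'a', 'b', 'b', 'b', '#']
print(input_symbols, target_symbols)
```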
All datasets are provided with boolean mask tensors for testing model outputs:
- Deterministic step masks - some languages have deterministic phases where a model's accuracy can be tested. For example, aⁿbⁿ sequences become deterministic after seeing the first b: a good model will not assign any probability to a after seeing the first b.
- Valid symbol masks - languages like Dyck don't have any deterministic parts (a new parenthesis can always be opened), but the set of valid symbols at each time step is limited. For example, for a Dyck-1 sequence, after seeing #((, a good model must not assign any probability to the end-of-sequence symbol.
| aⁿbⁿ | |
| --- | --- |
| String example | aaabbb |
| Input sequence | [#,a,a,a,b,b,b] |
| Target sequence | [a,a,a,b,b,b,#] |
| Vocabulary | {"#": 0, "a": 1, "b": 2} |
| Deterministic step mask (boolean) | [0,0,0,0,1,1,1] |
| Deterministic step mask shape | (batch_size, sequence_length) |
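A minimal sketch of how the deterministic-step mask could be used, assuming a model that outputs next-symbol probabilities of shape (batch_size, sequence_length, vocabulary_size); the probability values and the accuracy metric below are illustrative assumptions, not the benchmark's official scoring:

```python
import numpy as np

# Hypothetical next-symbol probabilities for the aⁿbⁿ example above.
# Vocabulary: {"#": 0, "a": 1, "b": 2}.
probs = np.array([[
    [0.2, 0.7, 0.1],   # after #       -> model favors 'a'
    [0.1, 0.6, 0.3],   # after #a
    [0.1, 0.5, 0.4],   # after #aa
    [0.1, 0.4, 0.5],   # after #aaa
    [0.0, 0.1, 0.9],   # after #aaab   (deterministic: must be 'b')
    [0.0, 0.1, 0.9],   # after #aaabb  (deterministic: must be 'b')
    [0.8, 0.1, 0.1],   # after #aaabbb (deterministic: must be '#')
]])
targets = np.array([[1, 1, 1, 2, 2, 2, 0]])           # [a,a,a,b,b,b,#]
det_mask = np.array([[0, 0, 0, 0, 1, 1, 1]], dtype=bool)

# Accuracy restricted to the deterministic time steps.
pred = probs.argmax(axis=-1)
det_accuracy = (pred[det_mask] == targets[det_mask]).mean()
print(det_accuracy)  # 1.0 for this made-up model
```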
| Dyck-1 | |
| --- | --- |
| String example | (())() |
| Input sequence | [#,(,(,),),(,)] |
| Target sequence | [(,(,),),(,),#] |
| Vocabulary | {"#": 0, "(": 1, ")": 2} |
| Valid symbols mask (boolean) | [[1,1,0], [0,1,1], [0,1,1], [0,1,1], [1,1,0], [0,1,1], [1,1,0]] |
| Valid symbols mask shape | (batch_size, sequence_length, vocabulary_size) |
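Similarly, a sketch of checking model outputs against the valid-symbol mask; the probabilities and the acceptance criterion (no probability mass above a small epsilon on invalid symbols) are illustrative assumptions:

```python
import numpy as np

# Hypothetical next-symbol probabilities for the Dyck-1 example above.
# Vocabulary: {"#": 0, "(": 1, ")": 2}.
probs = np.array([[
    [0.3, 0.7, 0.0],   # after #
    [0.0, 0.5, 0.5],   # after #(
    [0.0, 0.6, 0.4],   # after #((
    [0.0, 0.4, 0.6],   # after #(()
    [0.2, 0.8, 0.0],   # after #(())
    [0.0, 0.3, 0.7],   # after #(())(
    [0.9, 0.1, 0.0],   # after #(())()
]])
valid_mask = np.array([[
    [1, 1, 0], [0, 1, 1], [0, 1, 1], [0, 1, 1], [1, 1, 0], [0, 1, 1], [1, 1, 0],
]], dtype=bool)

# A sequence passes if the model never puts probability mass (above a small
# epsilon) on a symbol that is invalid at that time step.
epsilon = 1e-6
invalid_mass = np.where(valid_mask, 0.0, probs)   # zero out the valid positions
accepted = (invalid_mass < epsilon).all(axis=(1, 2))
print(accepted)  # [ True] for this made-up model
```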
Each `<language_name>` folder in datasets has the following structure:
- `train_<batch_size>_p_<prior>_seed_<seed>.txt.zip` – train set of size `batch_size`, sampled using probability `prior` and the random `seed`.
- `test.txt.zip` – first 15,000 strings of the language sorted by length. aⁿbᵐcⁿ⁺ᵐ is sorted by n+m values. Dyck languages are sorted by length + lexicographically.
- `preview.txt` – first 10 strings of the language.
- `test_deterministic_mask.npz` – boolean mask for deterministic time steps, for relevant languages (all but the Dyck languages). Shape: (batch_size, sequence_length).
- `test_valid_next_symbols.npz` – boolean mask for valid next symbols, for the Dyck languages. Shape: (batch_size, sequence_length, vocabulary_size).
Load npz mask files using:
np.load(filename)["data"]
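For example, a quick sanity check (the path below is hypothetical):

```python
import numpy as np

# Load a deterministic-step mask and inspect it; the path is hypothetical.
mask = np.load("datasets/an_bn/test_deterministic_mask.npz")["data"]
print(mask.shape)  # expected: (batch_size, sequence_length)
```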
To generate new training data using a different seed, prior, or batch size, run:
python generate_dataset.py --lang [language-name] --seed [seed] --prior [prior]
Example:
python generate_dataset.py --lang an_bn --seed 100 --prior 0.3
🚨 Why are the dataset files password-protected? To prevent test set contamination by large language models that train on crawled data and then test on it, all dataset files except the previews are zipped and password-protected. The password to all zip files is 1234.
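A sketch of reading a protected test file with Python's standard zipfile module; the path is hypothetical, and this assumes the archives use standard ZipCrypto encryption:

```python
import zipfile

# Read a password-protected dataset archive; the path is hypothetical.
with zipfile.ZipFile("datasets/an_bn/test.txt.zip") as archive:
    name = archive.namelist()[0]
    with archive.open(name, pwd=b"1234") as f:
        strings = f.read().decode("utf-8").splitlines()

print(strings[:3])  # first few '#...#' strings
```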
Each dataset folder contains preview.txt for easy inspection of the data.
- Python ≥ 3.5
Quick setup:
pip install -r requirements.txt