BLISS is a dataset for testing the generalization capabilities of artificial models for language induction. The benchmark score reflects how well a model generalizes relative to the amount of data it was trained on: the less data a model needs in order to generalize, the better its score.
This repository contains the datasets and data generation scripts for training and testing a model on BLISS.
For the full method and specs, see the paper Benchmarking Neural Network Generalization for Grammar Induction. The benchmark covers the following formal languages:
- aⁿbⁿ
- aⁿbⁿcⁿ
- aⁿbⁿcⁿdⁿ
- aⁿbᵐcⁿ⁺ᵐ
- Dyck-1
- Dyck-2
Please use the following citation if you use the datasets in your work:
@inproceedings{Lan_Chemla_Katzir_2023,
title={Benchmarking Neural Network Generalization for Grammar Induction},
author={Lan, Nur and Chemla, Emmanuel and Katzir, Roni},
booktitle={Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)},
pages={131--140},
year={2023}
}
Following Gers & Schmidhuber (2001), all sequences start and end with the symbol #. This makes it possible to test for strict acceptance/rejection.
All files contain strings surrounded with # from both sides. Inputs and targets need to be trimmed accordingly.
Example:
| aⁿbⁿ | |
| --- | --- |
| Input string | #aaabbb |
| Target string | aaabbb# |
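A minimal sketch of the trimming step, assuming one `#`-delimited string per line as described above (the string here is illustrative):

```python
# Derive input/target pairs from a '#'-delimited string.
line = "#aaabbb#"
input_symbols = list(line[:-1])   # ['#', 'a', 'a', 'a', 'b', 'b', 'b']
target_symbols = list(line[1:])   # ['a', 'a', 'a', 'b', 'b', 'b', '#']
print(input_symbols, target_symbols)
```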
All datasets are provided with boolean mask tensors for testing model outputs:
- Deterministic step masks - some languages have deterministic phases where a model's accuracy can be tested. For example, aⁿbⁿ sequences become deterministic after seeing the first b: a good model will not assign any probability to a after seeing the first b.
- Valid symbol masks - languages like Dyck don't have any deterministic parts (a new parenthesis can always be opened), but the set of valid symbols at each time step is limited. For example, for a Dyck-1 sequence, after seeing #((, a good model must not assign any probability to the end-of-sequence symbol.
| aⁿbⁿ | |
| --- | --- |
| String example | aaabbb |
| Input sequence | [#,a,a,a,b,b,b] |
| Target sequence | [a,a,a,b,b,b,#] |
| Vocabulary | {"#": 0, "a": 1, "b": 2} |
| Deterministic step mask (boolean) | [0,0,0,0,1,1,1] |
| Deterministic step mask shape | (batch_size, sequence_length) |
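A minimal sketch of how the deterministic-step mask could be used, assuming a model that outputs next-symbol probabilities of shape (batch_size, sequence_length, vocabulary_size); the probability values and the accuracy metric below are illustrative assumptions, not the benchmark's official scoring:

```python
import numpy as np

# Hypothetical next-symbol probabilities for the aⁿbⁿ example above.
# Vocabulary: {"#": 0, "a": 1, "b": 2}.
probs = np.array([[
    [0.2, 0.7, 0.1],   # after #       -> model favors 'a'
    [0.1, 0.6, 0.3],   # after #a
    [0.1, 0.5, 0.4],   # after #aa
    [0.1, 0.4, 0.5],   # after #aaa
    [0.0, 0.1, 0.9],   # after #aaab   (deterministic: must be 'b')
    [0.0, 0.1, 0.9],   # after #aaabb  (deterministic: must be 'b')
    [0.8, 0.1, 0.1],   # after #aaabbb (deterministic: must be '#')
]])
targets = np.array([[1, 1, 1, 2, 2, 2, 0]])           # [a,a,a,b,b,b,#]
det_mask = np.array([[0, 0, 0, 0, 1, 1, 1]], dtype=bool)

# Accuracy restricted to the deterministic time steps.
pred = probs.argmax(axis=-1)
det_accuracy = (pred[det_mask] == targets[det_mask]).mean()
print(det_accuracy)  # 1.0 for this made-up model
```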
| Dyck-1 | |
| --- | --- |
| String example | (())() |
| Input sequence | [#,(,(,),),(,)] |
| Target sequence | [(,(,),),(,),#] |
| Vocabulary | {"#": 0, "(": 1, ")": 2} |
| Valid symbols mask (boolean) | [[1,1,0], [0,1,1], [0,1,1], [0,1,1], [1,1,0], [0,1,1], [1,1,0]] |
| Valid symbols mask shape | (batch_size, sequence_length, vocabulary_size) |
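Similarly, a sketch of checking model outputs against the valid-symbol mask; the probabilities and the acceptance criterion (no probability mass above a small epsilon on invalid symbols) are illustrative assumptions:

```python
import numpy as np

# Hypothetical next-symbol probabilities for the Dyck-1 example above.
# Vocabulary: {"#": 0, "(": 1, ")": 2}.
probs = np.array([[
    [0.3, 0.7, 0.0],   # after #
    [0.0, 0.5, 0.5],   # after #(
    [0.0, 0.6, 0.4],   # after #((
    [0.0, 0.4, 0.6],   # after #(()
    [0.2, 0.8, 0.0],   # after #(())
    [0.0, 0.3, 0.7],   # after #(())(
    [0.9, 0.1, 0.0],   # after #(())()
]])
valid_mask = np.array([[
    [1, 1, 0], [0, 1, 1], [0, 1, 1], [0, 1, 1], [1, 1, 0], [0, 1, 1], [1, 1, 0],
]], dtype=bool)

# A sequence passes if the model never puts probability mass (above a small
# epsilon) on a symbol that is invalid at that time step.
epsilon = 1e-6
invalid_mass = np.where(valid_mask, 0.0, probs)   # zero out the valid positions
accepted = (invalid_mass < epsilon).all(axis=(1, 2))
print(accepted)  # [ True] for this made-up model
```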
Each `<language_name>` folder in datasets has the following structure:
- `train_<batch_size>_p_<prior>_seed_<seed>.txt.zip` – train set of size `batch_size`, sampled using probability `prior` and the random `seed`.
- `test.txt.zip` – first 15,000 strings of the language sorted by length. aⁿbᵐcⁿ⁺ᵐ is sorted by n+m values. Dyck languages are sorted by length + lexicographically.
- `preview.txt` – first 10 strings of the language.
- `test_deterministic_mask.npz` – boolean mask for deterministic time steps, for relevant languages (all but the Dyck languages). Shape: (batch_size, sequence_length).
- `test_valid_next_symbols.npz` – boolean mask for valid next symbols, for the Dyck languages. Shape: (batch_size, sequence_length, vocabulary_size).
Load npz mask files using:
np.load(filename)["data"]
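For example, a quick sanity check (the path below is hypothetical):

```python
import numpy as np

# Load a deterministic-step mask and inspect it; the path is hypothetical.
mask = np.load("datasets/an_bn/test_deterministic_mask.npz")["data"]
print(mask.shape)  # expected: (batch_size, sequence_length)
```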
To generate new training data using a different seed, prior, or batch size, run:
python generate_dataset.py --lang [language-name] --seed [seed] --prior [prior]
Example:
python generate_dataset.py --lang an_bn --seed 100 --prior 0.3
🚨 Why are the dataset files password-protected? To prevent test set contamination by large language models that train on crawled data and then test on it, all dataset files except the previews are zipped and password-protected. The password to all zip files is 1234.
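A sketch of reading a protected test file with Python's standard zipfile module; the path is hypothetical, and this assumes the archives use standard ZipCrypto encryption:

```python
import zipfile

# Read a password-protected dataset archive; the path is hypothetical.
with zipfile.ZipFile("datasets/an_bn/test.txt.zip") as archive:
    name = archive.namelist()[0]
    with archive.open(name, pwd=b"1234") as f:
        strings = f.read().decode("utf-8").splitlines()

print(strings[:3])  # first few '#...#' strings
```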
Each dataset folder contains preview.txt for easy inspection of the data.
- Python ≥ 3.5
Quick setup:
pip install -r requirements.txt