LeCarnet: A French Dataset for Small Language Models


1. Introduction

LeCarnet is a text dataset of 2 million French children's stories written with a very simple vocabulary, inspired by the TinyStories dataset. The purpose of this work is to provide a reliable, high-quality resource for training and evaluating small language models (SLMs) in French, aimed at educational and experimental use. This repository contains minimalist code for data generation, training, evaluation, and inference.

This dataset was created by synthetically generating French short stories using Mistral-Large-Instruct-2411.

The dataset (MaxLSB/LeCarnet) and the models (MaxLSB/LeCarnet-3M, LeCarnet-8M, and LeCarnet-21M) are available on Hugging Face.
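
As a quick sketch of how the dataset can be pulled down, assuming it is published as a standard Hugging Face dataset (split and field names below are assumptions, not verified against the dataset card):

```python
from datasets import load_dataset

# Load the French stories dataset from the Hugging Face Hub.
dataset = load_dataset("MaxLSB/LeCarnet")

print(dataset)               # available splits and their sizes
print(dataset["train"][0])   # inspect one story ("train" split name is assumed)
```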

2. Quick Setup

This project uses uv for fast and reliable dependency management.

```bash
# Basic environment setup
make env
```

That's it! You can now run any of the commands below.

⚠️ You might need to perform the following two steps manually before running make env:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

3. Training

The training pipeline supports Weights & Biases (WandB) for tracking training and validation losses, as well as perplexity (see the sketch after the table below).

| Task | Make Command | Equivalent CLI Command | Default Values |
|---|---|---|---|
| Training | `make train` | `python src/train/train.py --model_config MODEL_CONFIG` | `MODEL_CONFIG=3M` |
| Distributed Training | `make train-dist` | `python src/train/train_dist.py --model_config MODEL_CONFIG` | `MODEL_CONFIG=3M`, `CUDA_VISIBLE_DEVICES=0` |
| Push Model to HF | `make push-model` | `python src/inference/push-model.py --repo_name HF_REPO --model_dir MODEL_DIR` | `HF_REPO=MaxLSB/LeCarnet-3M`, `MODEL_DIR=LeCarnet-3M/model_weights/` |

⚠️ Check src/train/configs.py for fine-grained hyperparameter tuning. Set MODEL_CONFIG=custom to use your own custom model config.
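
As a reminder of how the logged quantities relate, perplexity is simply the exponential of the mean cross-entropy loss. Below is a minimal sketch of logging both to WandB; the values and project name are illustrative, not the repository's actual training loop:

```python
import math

import wandb

# Hypothetical value standing in for one validation pass.
val_loss = 1.85               # mean cross-entropy, in nats per token
val_ppl = math.exp(val_loss)  # perplexity = exp(mean loss)

run = wandb.init(project="lecarnet")  # project name is illustrative
run.log({"val/loss": val_loss, "val/perplexity": val_ppl})
run.finish()
```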

4. Data Generation

For generation tasks, set the API key of the provider you want to use:

```bash
# Linux/macOS
export MISTRAL_API_KEY=your_api_key
export OPENAI_API_KEY=your_api_key
```

```powershell
# Windows (PowerShell)
$env:MISTRAL_API_KEY="your_api_key"
$env:OPENAI_API_KEY="your_api_key"
```

| Task | Make Command | Equivalent CLI Command | Default Values |
|---|---|---|---|
| Generate with Mistral | `make generate-mistral` | `python src/data/mistral.py --model_name MISTRAL_MODEL --total_requests MISTRAL_REQUESTS --num_workers NUM_WORKERS` | `MISTRAL_MODEL=mistral-large-2411`, `MISTRAL_REQUESTS=100000`, `NUM_WORKERS=4` |
| Generate with OpenAI | `make generate-openai` | `python src/data/openai.py --model_name OPENAI_MODEL --total_requests OPENAI_REQUESTS` | `OPENAI_MODEL=gpt-3.5-turbo`, `OPENAI_REQUESTS=100000` |
| Push Dataset to HF | `make push-dataset` | `python src/data/push_dataset.py --folder_path FOLDER_PATH --repo_name REPO_NAME` | `FOLDER_PATH=./dataset/`, `REPO_NAME=MaxLSB/LeCarnet` |
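
For reference, a single generation request against the Mistral API looks roughly like the sketch below, assuming the mistralai v1 Python client; the prompt is illustrative, not the repository's actual generation prompt (see src/data/mistral.py for that):

```python
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Illustrative prompt asking for a simple French children's story.
response = client.chat.complete(
    model="mistral-large-2411",
    messages=[{
        "role": "user",
        "content": "Écris une courte histoire pour enfants en français, "
                   "avec un vocabulaire très simple.",
    }],
)
print(response.choices[0].message.content)
```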

5. Evaluation & Inference

To run the evaluation, you also need to set your Mistral API key, since the judge model is queried through the Mistral API.

| Task | Make Command | Equivalent CLI Command | Default Values |
|---|---|---|---|
| Evaluation | `make eval` | `python src/eval/eval.py --model_name EVAL_MODEL --judge_model_name JUDGE_MODEL` | `EVAL_MODEL=MaxLSB/LeCarnet-3M`, `JUDGE_MODEL=mistral-large-2411` |
| Inference | `make inference` | `python src/inference/inference.py --model_name MODEL_NAME --prompt PROMPT --max_new_tokens MAX_NEW_TOKENS` | `MODEL_NAME=MaxLSB/LeCarnet-3M`, `PROMPT="Il était une fois"`, `MAX_NEW_TOKENS=512` |
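
If you prefer to bypass the Makefile, the default inference call corresponds roughly to the following sketch, assuming the published checkpoints load as standard transformers causal language models (not verified here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MaxLSB/LeCarnet-3M"  # LeCarnet-8M and LeCarnet-21M are also available
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Mirror the defaults above: French prompt, up to 512 new tokens.
inputs = tokenizer("Il était une fois", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```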

6. Results

| Model | Judge | Grammar | Creativity | Coherence | Logic |
|---|---|---|---|---|---|
| LeCarnet-3M | mistral-large-2411 | 6.12 | 6.42 | 5.94 | 5.90 |
| LeCarnet-8M | mistral-large-2411 | 7.06 | 7.20 | 7.56 | 7.28 |
| LeCarnet-21M | mistral-large-2411 | 7.72 | 7.48 | 8.32 | 7.90 |

7. References
