This repository focuses on collecting, cleaning, and structuring the ETHICS dataset for efficient use in generative AI training. It provides tools to export raw data from Hugging Face, clean and prune it, compute statistics, and convert it into dense, machine‑readable formats (JSONL → Protocol Buffers) for model ingestion.
Large language models benefit from curated ethical reasoning data, but the original ETHICS dataset is distributed as CSVs that require cleaning and normalization. This project standardizes the full pipeline and provides pruning, statistics, and schema‑based serialization for high‑throughput training.
The ETHICS dataset (Hendrycks et al., 2021) evaluates a model’s ability to understand human moral reasoning across five subsets: Commonsense, Deontology, Virtue Ethics, Utilitarianism, and Justice. This repository retrieves the dataset from Hugging Face, cleans it, prunes long examples, analyzes distributions, and serializes the data to Protocol Buffers for efficient training.
Dataset source:
Hendrycks et al. (2021). ETHICS: Aligning AI With Shared Human Values.
https://arxiv.org/abs/2008.02275
Steps are ordered to match the actual workflow:
- Download the raw ETHICS CSVs and convert them to JSONL
- Calculate the text‑length distribution
- Choose a cutoff
- Prune examples exceeding the cutoff
- Convert JSONL → Protobuf
Setup:
uv venv .venv
source .venv/bin/activate
uv sync
Optional dependency install:
uv add datasets protobuf grpcio-tools zstandard transformers tokenizers
Download the raw data and export it as JSONL:
uv run python scripts/get_raw_training_data.py --out data/raw
Outputs:
data/raw/
commonsense-train.jsonl
commonsense-test.jsonl
commonsense-test_hard.jsonl
deontology-train.jsonl
deontology-test.jsonl
deontology-test_hard.jsonl
justice-train.jsonl
justice-test.jsonl
justice-test_hard.jsonl
utilitarianism-train.jsonl
utilitarianism-test.jsonl
utilitarianism-test_hard.jsonl
virtue-train.jsonl
virtue-test.jsonl
virtue-test_hard.jsonl
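For reference, the export step amounts to roughly the following (a minimal sketch; the Hugging Face dataset ID, field names, and split naming are assumptions, and scripts/get_raw_training_data.py is authoritative):

```python
# Minimal sketch of the raw export step. The dataset ID "hendrycks/ethics" and
# the split names it yields are assumptions; the real script may map Hugging
# Face splits onto the test/test_hard file naming shown above.
import json
from pathlib import Path
from datasets import load_dataset

out = Path("data/raw")
out.mkdir(parents=True, exist_ok=True)

for subset in ["commonsense", "deontology", "justice", "utilitarianism", "virtue"]:
    ds = load_dataset("hendrycks/ethics", subset)
    for split, rows in ds.items():
        with open(out / f"{subset}-{split}.jsonl", "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
```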
The statistics step analyzes character‑length distributions for all commonsense examples to determine a pruning cutoff:
cargo run --bin calculate_text_length_stats
Output written to:
data/stats/commonsense_length_stats.toml
Use this file to choose a cutoff (1,000 characters recommended).
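As a quick cross-check without the Rust tool, the same character-length percentiles can be approximated in Python (the "input" field name is an assumption; adjust to the actual JSONL schema):

```python
# Character-length percentiles for one JSONL file; mirrors the stats step.
import json
import statistics

with open("data/raw/commonsense-train.jsonl") as f:
    lengths = [len(json.loads(line)["input"]) for line in f]  # "input" assumed

qs = statistics.quantiles(lengths, n=100)  # 99 cut points: qs[49] is the median
print(f"count={len(lengths)} mean={statistics.fmean(lengths):.0f} "
      f"p50={qs[49]:.0f} p90={qs[89]:.0f} p99={qs[98]:.0f} max={max(lengths)}")
```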
Prune all commonsense JSONL files:
cargo run --bin prune_data_by_length
Or prune specific files:
cargo run --bin prune_data_by_length -- data/raw/commonsense-train.jsonl
Pruned output is saved in:
data/filtered/
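The pruning rule itself is simple. A sketch of the equivalent logic in Python (the "input" field name is assumed; the Rust binary is the real implementation):

```python
# Drop records whose text exceeds the cutoff; keep everything else verbatim.
import json
from pathlib import Path

CUTOFF = 1000  # characters, per the recommendation above
Path("data/filtered").mkdir(parents=True, exist_ok=True)

with open("data/raw/commonsense-train.jsonl") as src, \
     open("data/filtered/commonsense-train.jsonl", "w") as dst:
    for line in src:
        if len(json.loads(line)["input"]) <= CUTOFF:  # field name assumed
            dst.write(line)
```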
cargo run --bin ethics-pipeline
This pipeline:
- Reads from data/filtered/
- Applies the schema in proto/ethics.proto
- Shards into 32–64 MB protobuf files
- Compresses with zstd
- Writes to data/processed/<subset>/<split>-00000.pb.zst
The output format is optimized for training throughput.
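Conceptually, the sharding stage works like the following Python sketch (the actual implementation is the Rust ethics-pipeline binary; the varint length-prefix framing is an assumption, so check src/ and proto/ethics.proto for the real format):

```python
# Conceptual sketch of the sharding stage: frame each serialized message with a
# length prefix, roll to a new shard past the size target, compress with zstd.
import zstandard as zstd

SHARD_BYTES = 48 * 1024 * 1024  # target size inside the 32-64 MB range

def varint(n: int) -> bytes:
    """Encode n as a protobuf-style base-128 varint."""
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if n == 0:
            return bytes(out)

def flush(buf, path):
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor().compress(b"".join(buf)))

def write_shards(payloads, out_prefix):
    """payloads: iterable of serialized protobuf messages (bytes)."""
    shard, buf, size = 0, [], 0
    for p in payloads:
        framed = varint(len(p)) + p
        buf.append(framed)
        size += len(framed)
        if size >= SHARD_BYTES:
            flush(buf, f"{out_prefix}-{shard:05d}.pb.zst")
            shard, buf, size = shard + 1, [], 0
    if buf:
        flush(buf, f"{out_prefix}-{shard:05d}.pb.zst")
```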
To generate Python protobuf bindings:
mkdir -p training/gen
uv run python -m grpc_tools.protoc -I proto --python_out=training/gen proto/ethics.proto
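With the bindings in place, a shard can be read back along these lines (a sketch assuming varint-length-delimited messages and a message named EthicsExample; the actual message name lives in proto/ethics.proto):

```python
# Minimal shard reader: zstd stream of varint-length-delimited protobuf messages.
import zstandard as zstd
from training.gen import ethics_pb2  # generated by the protoc command above

def read_varint(stream):
    """Read a protobuf-style base-128 varint; return None at end of stream."""
    shift, value = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            return None
        value |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            return value
        shift += 7

with open("data/processed/commonsense/train-00000.pb.zst", "rb") as f:
    stream = zstd.ZstdDecompressor().stream_reader(f)
    while (size := read_varint(stream)) is not None:
        example = ethics_pb2.EthicsExample()  # message name assumed
        example.ParseFromString(stream.read(size))
```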
Repository layout:
/scripts/ # Python exporters & utilities
/proto/ # Protobuf schema
/src/ # Rust modules
/src/bin/ # CLI tools
/training/ # Python helpers
/data/
raw/ # Raw JSONL
filtered/ # Pruned JSONL
processed/ # Protobuf shards
stats/ # Length statistics
Design principles:
- Schema consistency
- Compact encoding (protobuf + zstd)
- Efficiency (pruning removes long, low‑value examples)
- Reproducibility
Goal: structured ethical‑reasoning data for efficient and reproducible model training.