SynthGL/fintag

FinTag

AI-powered XBRL tag classification and financial anomaly detection, built on SEC EDGAR data.

Live demo: fintag.synthgl.com (coming soon)

What it does

Three tools for working with SEC financial data:

FinTag - XBRL Tag Classifier

Enter any financial line item (e.g. "Accounts receivable, net") and get the most likely US-GAAP XBRL tag.

  • Training data: 1.75M label-tag pairs extracted from 68 SEC quarterly datasets (44M raw rows, 16,647 filers)
  • Model: QLoRA fine-tuned Qwen3-4B (r=16, alpha=32, 3 epochs on A100-40GB)
  • Baseline: bigram Dice-coefficient matching against a 1,000-tag vocabulary (86.4% coverage)
  • Input: Line item text + statement type (IS/BS/CF/EQ/CI) + optional SIC code
  • Output: Top-K tag predictions with confidence scores, balance type, period type, and measure
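The bigram Dice baseline above can be sketched in a few lines. This is a minimal illustration of the similarity measure, not the repo's actual matching engine (which lives in demo/src/lib/fintag.ts); the set-based bigram treatment is an assumption.

```python
def bigrams(s: str) -> set[str]:
    """Set of overlapping character bigrams, case-insensitive."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a: str, b: str) -> float:
    """Sorensen-Dice coefficient over bigram sets: 2|A & B| / (|A| + |B|)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))
```

Ranking every tag in the vocabulary by this score against the input label and returning the top-K gives the baseline predictions.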

FinAnomaly - Financial Ratio Anomaly Detector

Compare any SEC filer against industry benchmarks across 10 financial ratios.

  • Data: 74,993 company-period records across 7,553 companies, 71 industries (2022-2024)
  • Ratios: Gross/operating/net margins, ROA, ROE, current ratio, D/E, D/A, R&D intensity, AR turnover
  • Method: IQR-based scoring within 2-digit SIC peer groups (robust to fat-tailed financial distributions)
  • Source: SEC EDGAR Financial Statement Data Sets (num.txt + sub.txt)

Validator - Pre-Submission XBRL Error Catcher

Scan a draft filing for likely mis-tagged line items before submission.

  • Combines FinTag predictions with peer industry benchmarks
  • Flags mismatches by severity (CRITICAL / HIGH / MEDIUM / LOW)
  • Looks up real EDGAR filings by CIK to validate against actual submissions
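The severity bucketing above might combine the classifier's confidence with how widely the predicted tag is used by industry peers. The thresholds below are hypothetical, chosen only to illustrate the shape of the logic; the repo's actual rules may differ.

```python
def severity(pred_confidence: float, tag_matches: bool, peer_usage: float) -> str:
    """Map a FinTag disagreement to a severity bucket.

    pred_confidence: model confidence in its predicted tag (0-1)
    tag_matches:     True if the filer's tag equals the prediction
    peer_usage:      fraction of industry peers using the predicted tag (0-1)
    All thresholds here are illustrative.
    """
    if tag_matches:
        return "OK"
    if pred_confidence >= 0.9 and peer_usage >= 0.8:
        return "CRITICAL"  # model is confident and peers overwhelmingly agree
    if pred_confidence >= 0.75:
        return "HIGH"
    if pred_confidence >= 0.5:
        return "MEDIUM"
    return "LOW"
```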

Data Pipeline

68 SEC quarterly zips
    -> extract_fintag_data.py    -> 1.75M SFT training pairs
    -> build_finanomaly_db.py    -> DuckDB (74K company-period ratios)
    -> modal_fintag_train.py     -> QLoRA adapter (Modal A100-40GB)
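The first pipeline stage can be sketched as follows. In the SEC Financial Statement Data Sets, the presentation file (pre.txt) maps each human-readable line-item label (plabel) to its XBRL tag and statement type, which is presumably where the label-tag training pairs come from; the exact columns used by extract_fintag_data.py are an assumption here.

```python
import csv
from typing import Iterator, TextIO

STATEMENTS = {"IS", "BS", "CF", "EQ", "CI"}  # statement types kept for training

def extract_pairs(pre_txt: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (label, statement, tag) pairs from an EDGAR pre.txt stream.

    pre.txt is tab-delimited with a header row; plabel is the label shown
    on the filed statement and tag is the US-GAAP element it was mapped to.
    """
    for row in csv.DictReader(pre_txt, delimiter="\t"):
        label, tag, stmt = row.get("plabel"), row.get("tag"), row.get("stmt")
        if label and tag and stmt in STATEMENTS:
            yield (label, stmt, tag)
```

Each yielded triple then becomes one SFT example: the label and statement type go into the prompt, the tag into the target.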

Project Structure

.
├── train_qlora.py          # Local QLoRA fine-tuning (Qwen2.5-7B)
├── train_gemma4.py         # Gemma 4 fine-tuning via Unsloth
├── modal_train.py          # Modal GPU training (H100)
├── modal_benchmark.py      # Modal GPU benchmarking (A10G)
├── benchmark.py            # Local CPA FAR benchmark harness
├── serve_model.py          # Modal inference endpoint
├── prepare_cpa_sft.py      # CPA exam -> SFT format converter
├── requirements.txt        # Python training dependencies
├── setup_lambda.sh         # Lambda Cloud GPU setup
├── upload_to_lambda.sh     # Upload data to Lambda instance
├── recover_and_benchmark.sh # Recover checkpoint + run benchmark
├── check_training.sh       # Monitor training progress
└── demo/                   # Next.js 16 demo UI
    ├── src/app/fintag/     # FinTag classifier page
    ├── src/app/anomaly/    # FinAnomaly detector page
    ├── src/app/validator/  # XBRL validator page
    ├── src/lib/fintag.ts   # Bigram similarity matching engine
    ├── src/lib/edgar.ts    # EDGAR data access layer
    ├── src/lib/db.ts       # DuckDB integration
    └── Dockerfile          # Fly.io deployment

Quick Start

Demo UI

cd demo
pnpm install
pnpm dev
# Open http://localhost:3000

Training

# Install Python dependencies
pip install -r requirements.txt

# Train on Modal (recommended - no local GPU needed)
modal run modal_train.py --model qwen7b --run combined

# Or train locally with a GPU
python train_qlora.py --run combined --epochs 3

# Benchmark against CPA FAR exam
modal run modal_benchmark.py --n 500 --compare-base

Tech Stack

Layer         Tech
Training      PyTorch, Transformers, PEFT/QLoRA, TRL, bitsandbytes
GPU compute   Modal (H100/A10G), Lambda Cloud
Demo UI       Next.js 16, React 19, Tailwind CSS 4
Data source   SEC EDGAR Financial Statement Data Sets
Deployment    Fly.io (demo), Modal (inference)

Models

Model                 Base                       Data               GPU         Status
Qwen2.5-7B combined   Qwen/Qwen2.5-7B-Instruct   Journal + CPA      H100        Trained
Qwen3-4B FinTag       Qwen/Qwen3-4B              1.75M XBRL pairs   A100-40GB   Trained
Gemma 4 31B           google/gemma-4-31b-it      Combined           H100        Experimental
Gemma 4 E4B           google/gemma-4-E4b-it      Combined           H100        Experimental

License

MIT
