SynthGL/fintag

FinTag

AI-powered XBRL tag classification and financial anomaly detection, built on SEC EDGAR data.

Live demo: fintag.synthgl.com (coming soon)

What it does

Three tools for working with SEC financial data:

FinTag - XBRL Tag Classifier

Enter any financial line item (e.g. "Accounts receivable, net") and get the most likely US-GAAP XBRL tag.

  • Training data: 1.75M label-tag pairs extracted from 68 SEC quarterly datasets (44M raw rows, 16,647 filers)
  • Model: QLoRA fine-tuned Qwen3-4B (r=16, alpha=32, 3 epochs on A100-40GB)
  • Baseline: bigram Dice-coefficient matching against a 1,000-tag vocabulary (86.4% coverage)
  • Input: Line item text + statement type (IS/BS/CF/EQ/CI) + optional SIC code
  • Output: Top-K tag predictions with confidence scores, balance type, period type, and measure
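The bigram Dice baseline above can be sketched in a few lines. This is a minimal illustration of the similarity measure, not the repo's actual matching engine (which lives in demo/src/lib/fintag.ts); the set-based bigram treatment is an assumption.

```python
def bigrams(s: str) -> set[str]:
    """Set of overlapping character bigrams, case-insensitive."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a: str, b: str) -> float:
    """Sorensen-Dice coefficient over bigram sets: 2|A & B| / (|A| + |B|)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))
```

Ranking every tag in the vocabulary by this score against the input label and returning the top-K gives the baseline predictions.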

FinAnomaly - Financial Ratio Anomaly Detector

Compare any SEC filer against industry benchmarks across 10 financial ratios.

  • Data: 74,993 company-period records across 7,553 companies, 71 industries (2022-2024)
  • Ratios: Gross/operating/net margins, ROA, ROE, current ratio, D/E, D/A, R&D intensity, AR turnover
  • Method: IQR-based scoring within 2-digit SIC peer groups (robust to fat-tailed financial distributions)
  • Source: SEC EDGAR Financial Statement Data Sets (num.txt + sub.txt)

Validator - Pre-Submission XBRL Error Catcher

Scan a draft filing for likely mis-tagged line items before submission.

  • Combines FinTag predictions with peer industry benchmarks
  • Flags mismatches by severity (CRITICAL / HIGH / MEDIUM / LOW)
  • Looks up real EDGAR filings by CIK to validate against actual submissions
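The severity bucketing above might combine the classifier's confidence with how widely the predicted tag is used by industry peers. The thresholds below are hypothetical, chosen only to illustrate the shape of the logic; the repo's actual rules may differ.

```python
def severity(pred_confidence: float, tag_matches: bool, peer_usage: float) -> str:
    """Map a FinTag disagreement to a severity bucket.

    pred_confidence: model confidence in its predicted tag (0-1)
    tag_matches:     True if the filer's tag equals the prediction
    peer_usage:      fraction of industry peers using the predicted tag (0-1)
    All thresholds here are illustrative.
    """
    if tag_matches:
        return "OK"
    if pred_confidence >= 0.9 and peer_usage >= 0.8:
        return "CRITICAL"  # model is confident and peers overwhelmingly agree
    if pred_confidence >= 0.75:
        return "HIGH"
    if pred_confidence >= 0.5:
        return "MEDIUM"
    return "LOW"
```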

Data Pipeline

68 SEC quarterly zips
    -> extract_fintag_data.py    -> 1.75M SFT training pairs
    -> build_finanomaly_db.py    -> DuckDB (74K company-period ratios)
    -> modal_fintag_train.py     -> QLoRA adapter (Modal A100-40GB)
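The first pipeline stage can be sketched as follows. In the SEC Financial Statement Data Sets, the presentation file (pre.txt) maps each human-readable line-item label (plabel) to its XBRL tag and statement type, which is presumably where the label-tag training pairs come from; the exact columns used by extract_fintag_data.py are an assumption here.

```python
import csv
from typing import Iterator, TextIO

STATEMENTS = {"IS", "BS", "CF", "EQ", "CI"}  # statement types kept for training

def extract_pairs(pre_txt: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (label, statement, tag) pairs from an EDGAR pre.txt stream.

    pre.txt is tab-delimited with a header row; plabel is the label shown
    on the filed statement and tag is the US-GAAP element it was mapped to.
    """
    for row in csv.DictReader(pre_txt, delimiter="\t"):
        label, tag, stmt = row.get("plabel"), row.get("tag"), row.get("stmt")
        if label and tag and stmt in STATEMENTS:
            yield (label, stmt, tag)
```

Each yielded triple then becomes one SFT example: the label and statement type go into the prompt, the tag into the target.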

Project Structure

.
├── train_qlora.py          # Local QLoRA fine-tuning (Qwen2.5-7B)
├── train_gemma4.py         # Gemma 4 fine-tuning via Unsloth
├── modal_train.py          # Modal GPU training (H100)
├── modal_benchmark.py      # Modal GPU benchmarking (A10G)
├── benchmark.py            # Local CPA FAR benchmark harness
├── serve_model.py          # Modal inference endpoint
├── prepare_cpa_sft.py      # CPA exam -> SFT format converter
├── requirements.txt        # Python training dependencies
├── setup_lambda.sh         # Lambda Cloud GPU setup
├── upload_to_lambda.sh     # Upload data to Lambda instance
├── recover_and_benchmark.sh # Recover checkpoint + run benchmark
├── check_training.sh       # Monitor training progress
└── demo/                   # Next.js 16 demo UI
    ├── src/app/fintag/     # FinTag classifier page
    ├── src/app/anomaly/    # FinAnomaly detector page
    ├── src/app/validator/  # XBRL validator page
    ├── src/lib/fintag.ts   # Bigram similarity matching engine
    ├── src/lib/edgar.ts    # EDGAR data access layer
    ├── src/lib/db.ts       # DuckDB integration
    └── Dockerfile          # Fly.io deployment

Quick Start

Demo UI

cd demo
pnpm install
pnpm dev
# Open http://localhost:3000

Training

# Install Python dependencies
pip install -r requirements.txt

# Train on Modal (recommended - no local GPU needed)
modal run modal_train.py --model qwen7b --run combined

# Or train locally with a GPU
python train_qlora.py --run combined --epochs 3

# Benchmark against CPA FAR exam
modal run modal_benchmark.py --n 500 --compare-base

Tech Stack

Layer         Tech
Training      PyTorch, Transformers, PEFT/QLoRA, TRL, bitsandbytes
GPU compute   Modal (H100/A10G), Lambda Cloud
Demo UI       Next.js 16, React 19, Tailwind CSS 4
Data source   SEC EDGAR Financial Statement Data Sets
Deployment    Fly.io (demo), Modal (inference)

Models

Model                 Base                       Data               GPU         Status
Qwen2.5-7B combined   Qwen/Qwen2.5-7B-Instruct   Journal + CPA      H100        Trained
Qwen3-4B FinTag       Qwen/Qwen3-4B              1.75M XBRL pairs   A100-40GB   Trained
Gemma 4 31B           google/gemma-4-31b-it      Combined           H100        Experimental
Gemma 4 E4B           google/gemma-4-E4b-it      Combined           H100        Experimental

License

MIT
