AI-powered XBRL tag classification and financial anomaly detection, built on SEC EDGAR data.
Live demo: fintag.synthgl.com (coming soon)
Three tools for working with SEC financial data:
**FinTag** — enter any financial line item (e.g. "Accounts receivable, net") and get the most likely US-GAAP XBRL tag.
- Training data: 1.75M label-tag pairs extracted from 68 SEC quarterly datasets (44M raw rows, 16,647 filers)
- Model: QLoRA fine-tuned Qwen3-4B (r=16, alpha=32, 3 epochs on A100-40GB)
- Baseline: Bigram Dice coefficient matching against 1,000-tag vocabulary (86.4% coverage)
- Input: Line item text + statement type (IS/BS/CF/EQ/CI) + optional SIC code
- Output: Top-K tag predictions with confidence scores, balance type, period type, and measure
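The bigram Dice baseline works by comparing character bigrams of the input label against a tag vocabulary. A minimal sketch in Python (the vocabulary entries below are illustrative, not the actual 1,000-tag list):

```python
def bigrams(text: str) -> set[str]:
    """Character bigrams of a normalized label."""
    t = text.lower().strip()
    return {t[i:i + 2] for i in range(len(t) - 1)}

def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams: 2|A∩B| / (|A| + |B|)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

# Illustrative vocabulary; the real engine scores against ~1,000 US-GAAP tags.
VOCAB = {
    "AccountsReceivableNetCurrent": "accounts receivable net",
    "CashAndCashEquivalentsAtCarryingValue": "cash and cash equivalents",
    "RevenueFromContractWithCustomerExcludingAssessedTax": "revenue",
}

def best_tag(label: str, k: int = 3) -> list[tuple[str, float]]:
    """Top-K tags ranked by Dice similarity to the line-item label."""
    scored = [(tag, dice(label, desc)) for tag, desc in VOCAB.items()]
    return sorted(scored, key=lambda x: -x[1])[:k]
```

Bigram Dice is a strong lexical baseline because XBRL tag names closely mirror line-item wording; the fine-tuned model is needed mainly for labels that paraphrase or abbreviate the standard terms.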
**FinAnomaly** — compare any SEC filer against industry benchmarks across 10 financial ratios.
- Data: 74,993 company-period records across 7,553 companies, 71 industries (2022-2024)
- Ratios: Gross/operating/net margins, ROA, ROE, current ratio, D/E, D/A, R&D intensity, AR turnover
- Method: IQR-based scoring by 2-digit SIC code (robust against fat-tailed financial data)
- Source: SEC EDGAR Financial Statement Data Sets (num.txt + sub.txt)
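The IQR-based scoring above can be sketched as a robust z-like score: distance from the peer median measured in IQR units, which stays stable under the fat tails typical of financial ratios. The peer values and scoring formula here are illustrative assumptions:

```python
import statistics

def iqr_score(value: float, peer_values: list[float]) -> float:
    """Distance from the peer median in IQR units.

    Unlike a mean/std z-score, the median and IQR are barely moved
    by extreme outliers, so one distressed filer in the peer group
    does not distort everyone else's score.
    """
    q1, q2, q3 = statistics.quantiles(peer_values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        return 0.0
    return (value - q2) / iqr

# Illustrative: net margins of peers sharing a 2-digit SIC code.
peers = [0.05, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.30]
print(iqr_score(0.45, peers))  # large positive score: well above the peer median
```

Bucketing by 2-digit SIC code keeps the comparison within an industry, since a "normal" margin or turnover ratio differs wildly between, say, software and grocery retail.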
**XBRL Validator** — scan a draft filing for likely mis-tagged line items before submission.
- Combines FinTag predictions with peer industry benchmarks
- Flags mismatches by severity (CRITICAL / HIGH / MEDIUM / LOW)
- Looks up real EDGAR filings by CIK to validate against actual submissions
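The source does not spell out the severity rules, but a plausible sketch is a cascade that ranks tag-prediction disagreement above peer-ratio deviation. The threshold values below are made up for illustration:

```python
def severity(tag_matches: bool, model_confidence: float,
             ratio_iqr_dist: float) -> str:
    """Bucket a flagged line item by severity.

    Thresholds are illustrative assumptions, not the project's
    actual rules: a confident model disagreement with the filed
    tag outranks a statistical outlier against industry peers.
    """
    if not tag_matches and model_confidence > 0.9:
        return "CRITICAL"  # model strongly disagrees with the filed tag
    if not tag_matches and model_confidence > 0.6:
        return "HIGH"      # model disagrees, but with moderate confidence
    if abs(ratio_iqr_dist) > 3.0:
        return "MEDIUM"    # ratio is a statistical outlier vs peers
    return "LOW"           # minor or ambiguous signal
```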
68 SEC quarterly zips
-> extract_fintag_data.py -> 1.75M SFT training pairs
-> build_finanomaly_db.py -> DuckDB (74K company-period ratios)
-> modal_fintag_train.py -> QLoRA adapter (Modal A100-40GB)
.
├── train_qlora.py # Local QLoRA fine-tuning (Qwen2.5-7B)
├── train_gemma4.py # Gemma 4 fine-tuning via Unsloth
├── modal_train.py # Modal GPU training (H100)
├── modal_benchmark.py # Modal GPU benchmarking (A10G)
├── benchmark.py # Local CPA FAR benchmark harness
├── serve_model.py # Modal inference endpoint
├── prepare_cpa_sft.py # CPA exam -> SFT format converter
├── requirements.txt # Python training dependencies
├── setup_lambda.sh # Lambda Cloud GPU setup
├── upload_to_lambda.sh # Upload data to Lambda instance
├── recover_and_benchmark.sh # Recover checkpoint + run benchmark
├── check_training.sh # Monitor training progress
└── demo/ # Next.js 16 demo UI
    ├── src/app/fintag/        # FinTag classifier page
    ├── src/app/anomaly/       # FinAnomaly detector page
    ├── src/app/validator/     # XBRL validator page
    ├── src/lib/fintag.ts      # Bigram similarity matching engine
    ├── src/lib/edgar.ts       # EDGAR data access layer
    ├── src/lib/db.ts          # DuckDB integration
    └── Dockerfile             # Fly.io deployment
cd demo
pnpm install
pnpm dev
# Open http://localhost:3000

# Install Python dependencies
pip install -r requirements.txt
# Train on Modal (recommended - no local GPU needed)
modal run modal_train.py --model qwen7b --run combined
# Or train locally with a GPU
python train_qlora.py --run combined --epochs 3
# Benchmark against CPA FAR exam
modal run modal_benchmark.py --n 500 --compare-base

| Layer | Tech |
|---|---|
| Training | PyTorch, Transformers, PEFT/QLoRA, TRL, bitsandbytes |
| GPU compute | Modal (H100/A10G), Lambda Cloud |
| Demo UI | Next.js 16, React 19, Tailwind CSS 4 |
| Data source | SEC EDGAR Financial Statement Data Sets |
| Deployment | Fly.io (demo), Modal (inference) |
| Model | Base | Data | GPU | Status |
|---|---|---|---|---|
| Qwen2.5-7B combined | Qwen/Qwen2.5-7B-Instruct | Journal + CPA | H100 | Trained |
| Qwen3-4B FinTag | Qwen/Qwen3-4B | 1.75M XBRL pairs | A100-40GB | Trained |
| Gemma 4 31B | google/gemma-4-31b-it | Combined | H100 | Experimental |
| Gemma 4 E4B | google/gemma-4-E4b-it | Combined | H100 | Experimental |
MIT