StockPredict AI

Full-stack stock analytics & ML prediction platform for the S&P 100
Real-time data · 10+ sentiment sources · 77 engineered features · plain-English AI explanations

stockpredict.dev

Overview

StockPredict AI predicts stock prices for all 100 S&P 100 companies across three time horizons (1-day, 7-day, 30-day) using a LightGBM gradient-boosted decision tree trained on 42+ engineered features. Every prediction is explained in plain English by Google Gemini, backed by SHAP feature-importance decomposition.

A fully automated nightly pipeline (GitHub Actions) runs after market close — fetching data from 10+ sources, training models, generating predictions, running SHAP analysis, writing AI explanations, and evaluating accuracy — all stored in MongoDB and served through a three-tier architecture.

Live at stockpredict.dev

Key Features

Feature	Description
ML Price Predictions	LightGBM forecasts across 1-day, 7-day, and 30-day horizons with confidence scores, price ranges, and trade recommendations
AI Explanations	SHAP decomposes each prediction into bullish/bearish drivers; Gemini writes a structured plain-English report per stock
Multi-Source Sentiment	10+ sources (Finviz, Reddit, SEC filings, Finnhub, FMP, Marketaux, Yahoo RSS, Seeking Alpha) scored with FinBERT, RoBERTa, and VADER
Real-Time Market Data	Live quotes via Finnhub WebSocket, interactive TradingView charts, and computed technical indicators (RSI, MACD, Bollinger Bands)
Sankey Financial Flows	Interactive income-statement Sankey diagrams (Apache ECharts) showing revenue sources → expenses → profit paths per company
Watchlist & Alerts	Track symbols with real-time price updates; notification system for market events
Unified News Feed	Aggregated from Yahoo Finance, Seeking Alpha, Finnhub, Marketaux, TickerTick — with inline VADER sentiment scoring
Market Intelligence	Fear & Greed Index, market open/close status with holiday detection, trading hours timeline

How It Works — The Three "Models"

This is the most important concept. The project uses three very different "models" — only one predicts prices.

1. The Predictor — LightGBM

The only component that predicts stock prices. A gradient-boosted decision tree trained on 42+ numeric features (price history, sentiment scores, macro data, insider activity). Outputs a predicted log-return per stock per horizon.

2. The Explainer — SHAP + Gemini

SHAP mathematically decomposes why LightGBM made its prediction (which features pushed the price up or down). Gemini then reads the SHAP results, sentiment data, news headlines, macro indicators, insider trades, and short interest — and writes a structured plain-English explanation. Gemini does not predict prices.

3. The Sentiment Scorers — FinBERT, RoBERTa, VADER

Read news headlines, Reddit posts, and SEC filings; produce a sentiment score (−1 to +1). These scores become input features for the LightGBM predictor — they are not predictions themselves.

News / Reddit / SEC  →  FinBERT / RoBERTa / VADER  →  sentiment score (NUMBER)
                                                              │
Price history        →  Feature engineering          →  42+ features (NUMBERS)
                                                              │
                                                              ▼
                                                       LightGBM Predictor
                                                              │
                                                      predicted return (NUMBER)
                                                              │
                                                  ┌───────────┴───────────┐
                                                  ▼                       ▼
                                            SHAP Analysis            Stored in DB
                                                  │
                                            feature contributions
                                                  │
                                                  ▼
                                            Gemini AI Explainer
                                                  │
                                            plain-English explanation (TEXT)
                                                  │
                                                  ▼
                                            Shown to user in UI

System Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         USER (Browser)                           │
│   Next.js App Router  ·  React  ·  TradingView  ·  ECharts      │
└───────────────────────────┬──────────────────────────────────────┘
                            │  HTTP / WebSocket
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│                    NODE.JS BACKEND (Express)                      │
│   Port 5000  ·  API Gateway  ·  News Aggregation  ·  Watchlist   │
│   Proxies ML Backend  ·  Finnhub WebSocket  ·  Notifications     │
└──────────┬──────────────────────────────────────────┬────────────┘
           │  HTTP                                     │  WebSocket
           ▼                                           ▼
┌───────────────────────┐                ┌───────────────────────┐
│  ML BACKEND (FastAPI)  │                │  Finnhub WebSocket     │
│  Port 8000             │                │  wss://ws.finnhub.io   │
│  Predictions · SHAP    │                │  Real-time trades      │
│  Sentiment · Training  │                └───────────────────────┘
│  Gemini Explanations   │
└──────────┬─────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────┐
│                        MONGODB ATLAS                              │
│  historical_data · sentiment · stock_predictions                  │
│  prediction_explanations · feature_importance                     │
│  insider_transactions · macro_data_raw · notifications            │
└──────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────┐
│                      REDIS (Optional)                             │
│  Prediction caching (60s)  ·  Rate limiting  ·  Holiday cache     │
└──────────────────────────────────────────────────────────────────┘

Layer	Role
Frontend (Next.js 15, port 3000)	UI, TradingView charts, Sankey diagrams, search, watchlist, stock detail pages
Node Backend (Express, port 5000, Koyeb)	API gateway, news aggregation, watchlist, Finnhub WebSocket, proxies ML endpoints
ML Backend (FastAPI, port 8000)	Predictions, sentiment analysis, model training, SHAP, Gemini explanations
MongoDB Atlas	All persistent data — historical prices, sentiment, predictions, explanations, insider trades, macro data
Redis (optional)	Caching (predictions, Sankey data, holidays), rate limiting (sliding window)

Daily Automated Pipeline

Runs every weeknight via GitHub Actions (~6:15 PM ET, after market close). Total runtime: ~60 minutes on a standard runner (7 GB RAM, 2 CPUs).

 ╔════════════════════════════════════════════════════════════════════╗
 ║  STEP 1: Gather Sentiment (~5 min)                                ║
 ║  Fetch news/social data for all 100 tickers from 10+ sources.     ║
 ║  Score with FinBERT, RoBERTa, VADER. Blend into composite score.  ║
 ║  Non-fatal — pipeline continues with stale sentiment if needed.   ║
 ╠════════════════════════════════════════════════════════════════════╣
 ║  STEP 2: Train Models (~15 min)                                   ║
 ║  Ingest OHLCV from Yahoo Finance. Engineer 42+ features.          ║
 ║  Train ONE pooled LightGBM model per horizon (3 models total).    ║
 ║  FATAL if fails — no predictions without trained models.          ║
 ╠════════════════════════════════════════════════════════════════════╣
 ║  STEP 3: Generate Predictions (~20 min)                           ║
 ║  Run models on all 100 tickers (10 batches × 10).                 ║
 ║  Per stock: predicted return, price, confidence, trade signal.     ║
 ║  Canary verification: checks 8 benchmark tickers for freshness.   ║
 ╠════════════════════════════════════════════════════════════════════╣
 ║  STEP 4: Explain Predictions (~15 min)                            ║
 ║  SHAP analysis decomposes each prediction into feature drivers.   ║
 ║  Gemini reads 11 data sources and writes per-stock explanations.  ║
 ╠════════════════════════════════════════════════════════════════════╣
 ║  STEP 5: Evaluate & Monitor (~5 min)                              ║
 ║  Compare last 60 days of predictions vs actuals.                  ║
 ║  Drift monitor: PSI, rolling accuracy, calibration checks.        ║
 ║  Quality gate: ≥80% tickers predicted, ≤20% data failures.       ║
 ╚════════════════════════════════════════════════════════════════════╝

Failure Handling

Step	Fatal?	Behavior
Sentiment cron	No	Logs warning, continues with stale data
Model training	Yes	Job fails immediately
Predictions	Partial	Fails if >30% of batches fail
Freshness verification	Yes	Asserts canary tickers are <3 hours old
SHAP analysis	Partial	Fails if >50% of batches fail
AI explanations	No	Logs warning, continues
Evaluation / Drift	No	Reports saved as artifacts

Machine Learning Deep Dive

Why LightGBM?

LightGBM (gradient-boosted decision trees) was chosen over deep learning approaches like LSTMs because:

Tabular data advantage — The 42+ pre-engineered numeric features are cross-sectional (many tickers × many features), which suits tree models far better than sequential RNNs
Speed — Trains across 100 stocks in ~15 minutes on a free GitHub Actions runner
Handles missing values natively — Critical when some API sources fail on any given day
Interpretable — Built-in feature importance + fast TreeSHAP for explanations
Robust — Huber loss function is resilient to outlier returns

Hyperparameters

Parameter	Value	Rationale
Objective	Huber (delta=0.8)	Robust to outlier returns (earnings gaps, black swans)
Learning rate	0.02	Slow learning for stability
Max depth	5	Allows slightly deeper feature interactions
Num leaves	20	Conservative tree complexity
N estimators	300	More rounds with slower learning rate
Min child samples	30	Requires sufficient evidence per leaf
Regularization	L1=0.5, L2=0.5	Strong regularization prevents overfitting
Subsampling	70% rows, 70% columns	Reduces variance, forces diverse feature usage
Walk-forward folds	4	Rolling validation for robust evaluation

Market-Neutral Alpha

The model predicts alpha (excess return over SPY), not just absolute price direction:

target = stock_return − SPY_return

This means predictions capture how much a stock will outperform or underperform the S&P 500, rather than simply whether it goes up.

Trade Recommendation Filters

A prediction generates a trade_recommended = True signal only when:

Predicted alpha > 0.1%
P(return > 0) > 52%
Predicted return exceeds transaction costs (10 basis points)

77 Engineered Features

Features are organized into ten categories, all using shift(1) to ensure point-in-time safety (no future data leakage):

Price & Return Features (6)

Feature	Description
`log_return_1d`	Yesterday's log return
`log_return_5d`	5-day log return
`log_return_21d`	21-day log return
`volatility_20d`	20-day rolling volatility
`intraday_range`	(High − Low) / Close
`overnight_gap`	Today's Open vs yesterday's Close

Volume Features (3)

Feature	Description
`volume_ratio`	Volume / 20-day average volume
`volume_z60`	Volume z-score over 60 days
`volume_vol_ratio`	Volume volatility ratio

Technical Features (7)

Feature	Description
`rsi`	14-day Relative Strength Index
`rsi_divergence`	Price-RSI divergence signal
`bb_position`	Position within Bollinger Bands
`price_vs_sma20`	Price relative to 20-day SMA
`price_vs_sma50`	Price relative to 50-day SMA
`momentum_5d`	5-day price momentum
`trend_20d`	20-day linear trend slope

Market Regime Features (5)

Feature	Description
`vix_level`	Current VIX (fear index) level
`vix_vol_20d`	VIX 20-day volatility
`spy_vol_20d`	S&P 500 20-day volatility
`spy_vol_regime`	Quantile-based volatility regime
`vol_regime`	Stock's own volatility regime

Sector-Relative Features (8)

Feature	Description
`sector_id`	Numeric sector identifier
`ticker_id`	Numeric ticker identifier
`sector_etf_return_20d`	Sector ETF 20-day return
`sector_etf_return_60d`	Sector ETF 60-day return
`sector_etf_vol_20d`	Sector ETF 20-day volatility
`excess_vs_sector_5d`	Stock return minus sector return (5d)
`excess_vs_sector_20d`	Stock return minus sector return (20d)
`sector_momentum_rank`	Sector rank by recent momentum

Sentiment Features (6)

Feature	Description
`sent_mean_1d`	Yesterday's composite sentiment
`sent_mean_7d`	7-day rolling average sentiment
`sent_mean_30d`	30-day rolling average sentiment
`sent_momentum`	Sentiment regime change (7d − 30d)
`news_count_7d`	Rolling 7-day article count
`news_spike_1d`	Unusual news activity detector

Macro & Insider Features (5+)

Feature	Description
`macro_spread_2y10y`	Treasury yield curve spread (recession indicator)
`macro_fed_funds`	Federal funds rate
`insider_net_value_30d`	Net insider trading value (30-day)
`insider_buy_ratio_30d`	Insider buy/sell ratio
`insider_cluster_buying`	Multiple insiders buying simultaneously

Earnings Features (4) — v2.0

Feature	Description
`earnings_surprise`	EPS actual − EPS estimated (latest earnings)
`earnings_beat`	+1 if beat, −1 if missed, 0 if met
`earnings_recency`	1/(days since last earnings + 1) decay weight
`earnings_surprise_pct`	Surprise normalized by estimate magnitude

Fundamental Features (5) — v2.0

Feature	Description
`fund_pe_ratio`	Price-to-Earnings ratio (TTM)
`fund_pb_ratio`	Price-to-Book ratio
`fund_dividend_yield`	Indicated annual dividend yield
`fund_roe`	Return on Equity (TTM)
`fund_beta`	Stock beta vs market

Short Interest Features (3) — v2.0

Feature	Description
`si_short_float_pct`	Short interest as % of float
`si_days_to_cover`	Short interest / avg daily volume
`si_available`	1 if short interest data exists, 0 otherwise

Confidence Scoring

A separate LightGBM binary classifier predicts P(return > 0) — the probability the stock will go up:

Confidence Level	Range	Interpretation
High	> 65%	Strong signal — multiple features agree
Medium	55–65%	Moderate signal — some conflicting indicators
Low	50–55%	Near coin-flip — model is uncertain
Contrarian	< 50%	Model leans bearish

Sentiment Analysis Engine

Sentiment is collected from 10+ sources, scored with three NLP models, and blended into a single composite score per stock per day.

NLP Models

Model	Type	Strength
FinBERT	Transformer (fine-tuned BERT)	Financial domain–specific sentiment
RoBERTa	Transformer	General-purpose sentiment robustness
VADER	Rule-based lexicon	Fast, reliable baseline for headlines

Source Blend Weights

Priority	Source	Blend Weight	Rate Limited?
1	RSS News (Yahoo + Seeking Alpha)	22%	No (free RSS)
2	Marketaux	15%	Yes (95/day budget)
3	SEC Filings	10%	No (free scrape)
4	Reddit (PRAW)	10%	Yes (90/min)
5	Finnhub (insider + recommendations)	10% + 10%	Yes (55/min)
6	FMP (analyst estimates, ratings)	8%	Yes (3/sec)
7	Finviz	5%	No (free scrape)
8	Seeking Alpha Comments	5%	No (Playwright)

Resilience guarantee: Even if all rate-limited APIs are exhausted, 42% of blend weight comes from free, unlimited sources that never fail.

What Gemini Reads for Explanations (11 Data Sources)

The AI explanation prompt is stock-specific — tailored with company name, sector, and industry context:

LightGBM predictions (all 3 horizons with confidence, alpha vs SPY, price ranges)
Technical analysis (RSI, MACD, Bollinger Bands, SMAs, EMAs, volume ratio, 52-week range)
News headlines (Finviz, RSS, Reddit, Marketaux — aggregated from MongoDB)
Sentiment scores (blended + per-source breakdown)
SHAP feature drivers (human-readable names with contribution values)
Macro economic context (Fed rate, CPI, unemployment, yield curve, GDP)
Insider trading activity (buy/sell ratio, recent transactions with names/prices)
Short interest data (short float %, days to cover)
Finnhub basic financials (P/E, P/B, ROE, dividend yield, market cap, beta)
FMP earnings data (EPS actual vs estimated, earnings surprise)
FMP analyst ratings and price targets

Data Sources & APIs

Provider	Purpose	Data Stored?
Finnhub	Quotes, profiles, search, news, insider trades, WebSocket prices	Insider trades + financials in MongoDB; prices in-memory
Yahoo Finance	Historical OHLCV (via yfinance), RSS news	OHLCV in MongoDB; RSS returned to frontend only
FRED	13 macro indicators (GDP, CPI, Fed rate, Treasury yields, unemployment)	MongoDB `macro_data_raw`
FMP	Income statements, product segmentation, earnings, analyst estimates, ratings, price targets	MongoDB + Redis (Sankey data cached 14 days)
Marketaux	Financial news articles	Sentiment scores in MongoDB
Reddit	Social sentiment via PRAW (r/wallstreetbets, r/stocks, r/investing)	Sentiment scores in MongoDB
SEC / Kaleidoscope	SEC filing analysis (10-K, 10-Q, 8-K)	Filing sentiment in MongoDB
Seeking Alpha	Comment sentiment (Playwright scraping)	Sentiment scores in MongoDB
Finviz	News headlines + short interest (fallback)	Sentiment scores in MongoDB
Nasdaq	Short interest data (settlement-cycle updates)	Short interest in MongoDB
Groq (Llama 3.1)	Primary AI explanation generation	Explanations in MongoDB
Google Gemini	Fallback AI explanation generation (auto-fallback: pro → flash → flash-lite)	Explanations in MongoDB
TradingView	Chart widgets, heatmaps, economic calendar	Not stored (client-side embed)
Calendarific	US market holidays for market-status detection	Redis (1-year TTL)

Technology Stack

Layer	Technologies
Frontend	Next.js 15, React 18, TypeScript 5, Tailwind CSS, Shadcn/UI, Framer Motion, TradingView Widgets, Apache ECharts (Sankey), Recharts
Backend	Node.js 18+, Express.js, MongoDB Atlas, Redis
ML / AI	Python 3.11+, FastAPI, LightGBM, SHAP (TreeSHAP), Groq (Llama 3.1-8b), Google Gemini 2.5, yfinance, FinBERT, RoBERTa, VADER
Data	Finnhub, Yahoo Finance, FRED, FMP, Marketaux, Reddit (PRAW), Seeking Alpha (Playwright), SEC/Kaleidoscope, Nasdaq
Infrastructure	Vercel (frontend), Koyeb (backends), GitHub Actions (daily pipeline), MongoDB Atlas, Redis Cloud
Monitoring	Drift monitor (PSI), stored prediction evaluation, quality gates, Vercel Analytics

Frontend Architecture

Built with Next.js 15 App Router — fully server-side rendered, SEO-friendly, with automatic code splitting per route.

Pages

Route	Purpose
`/`	Landing page with market overview
`/stocks/[symbol]`	Stock detail — predictions, AI explanation, TradingView chart, news, technical indicators
`/predictions`	Overview of all 100 stocks with stored predictions
`/news`	Unified multi-source news feed
`/sankey`	Interactive income-statement Sankey flow visualization
`/watchlist`	User watchlist with real-time price updates
`/fundamentals`	Financial fundamentals (Jika.io embeds)
`/how-it-works`	Educational guide on ML methodology
`/methodology`	Technical breakdown of the prediction pipeline
`/disclaimer`	Legal compliance page

Key Components

Component	Purpose
`AIExplanationWidget`	Displays Gemini-generated structured explanation with outlook, key drivers, and bottom line
`EnhancedQuickPredictionWidget`	Prediction lookup showing price targets, confidence, and trade signals
`SankeyChart`	Apache ECharts Sankey diagram — revenue sources → expenses → profit paths
`MarketSentimentBanner`	Fear & Greed Index with visual gauge
`TechnicalIndicators`	RSI, MACD, SMA, EMA gauges with historical context
`NotificationWidget`	Real-time notification bell (polls every 30s)
`SearchWidget`	Debounced stock search with autocomplete
`WebSocketProvider`	React Context for real-time Finnhub price updates

Pipeline Hardening & Reliability

The pipeline is engineered to run reliably on a free GitHub Actions runner (7 GB RAM, 2 CPUs) without crashing, timing out, or getting rate-limited.

Rate Limiting

Every external API has an enforced client-side rate limiter using a sliding-window token-bucket algorithm:

API	Plan Limit	Enforced Throttle
Finnhub	60/min	55/min + 25/sec burst
FMP	250/day	3/sec, 4 endpoints only
Marketaux	100/day	95/day hard budget (top-50 tickers)
Reddit	100 QPM	90/min, max 3 subreddits/ticker
FRED	~120/min	100/min (defensive)

Retry Logic

All external API calls use retry-with-exponential-backoff-and-jitter:

Finnhub: 3 attempts, honors Retry-After on 429
FRED: 2 attempts, re-raises config errors immediately
MongoDB: 3 attempts with driver-level + application-level retry wrappers

Safety Mechanisms

Mechanism	Purpose
Batched predictions (10 × 10)	Prevents CPU thermal throttling + API rate limit pressure
10s cool-down between batches	Lets APIs reset rate windows
Canary verification	Checks 8 benchmark tickers (AAPL, AMZN, JPM…) for freshness after predictions
Quality gate	Fails pipeline if <80% tickers predicted or >20% data failures
Feature-name enforcement	Skips predictions if >50% feature columns mismatch (prevents garbage output)
NaN preservation	Sentiment/insider missing values preserved as NaN instead of zero-filled
MongoDB connection hardening	Relaxed timeouts, connection pooling (10–50), retry writes/reads enabled

Model Validation & Backtesting

Daily Out-of-Sample Evaluation

Every pipeline run compares the last 60 days of stored predictions against actual market outcomes. These predictions were generated before outcomes were known — truly out-of-sample.

Walk-Forward Validation

Training uses a rolling split with purge gaps to prevent data leakage:

|────── Train (70%) ──────|─purge─|──── Validate (15%) ────|─purge─|──── Holdout (15%) ────|
                           5 days                            5 days

Tracked Metrics

Metric	What It Measures
Sharpe Ratio	Return per unit of risk (annualized)
Max Drawdown	Largest peak-to-trough decline
Win Rate	% of profitable trades
Directional Accuracy	% of correct up/down predictions
Rank Correlation	Spearman correlation: predicted vs actual returns
Brier Score	Probability calibration quality
Drift Monitor (PSI)	Detects prediction distribution shifts over time

Full Documentation

DOCUMENTATION.md — 2,500+ lines covering architecture, data flow, every API endpoint, database schemas, field-level data mappings, MongoDB document structures, file-by-file breakdown, pipeline details, and more.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

If you deploy this software as a network service, you must make the complete source code available to users of that service under the same license.

Built by Yogesh Vadivel

Name		Name	Last commit message	Last commit date
Latest commit History 220 Commits
.github		.github
app		app
backend		backend
components		components
hooks		hooks
lib		lib
ml_backend		ml_backend
public		public
views		views
.gitignore		.gitignore
DOCUMENTATION.md		DOCUMENTATION.md
LICENSE		LICENSE
README.md		README.md
components.json		components.json
next.config.mjs		next.config.mjs
package-lock.json		package-lock.json
package.json		package.json
postcss.config.mjs		postcss.config.mjs
tailwind.config.ts		tailwind.config.ts
tsconfig.json		tsconfig.json
vercel.json		vercel.json

Folders and files

Latest commit

History

Repository files navigation

StockPredict AI

Overview

Table of Contents

Key Features

How It Works — The Three "Models"

1. The Predictor — LightGBM

2. The Explainer — SHAP + Gemini

3. The Sentiment Scorers — FinBERT, RoBERTa, VADER

System Architecture

Daily Automated Pipeline

Failure Handling

Machine Learning Deep Dive

Why LightGBM?

Hyperparameters

Market-Neutral Alpha

Trade Recommendation Filters

77 Engineered Features

Confidence Scoring

Sentiment Analysis Engine

NLP Models

Source Blend Weights

What Gemini Reads for Explanations (11 Data Sources)

Data Sources & APIs

Technology Stack

Frontend Architecture

Pages

Key Components

Pipeline Hardening & Reliability

Rate Limiting

Retry Logic

Safety Mechanisms

Model Validation & Backtesting

Daily Out-of-Sample Evaluation

Walk-Forward Validation

Tracked Metrics

Full Documentation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages