Fundamental Data Source - An AI-powered Python library that extends pandas to import and analyze complex, unstructured files.
Fundas uses the OpenRouter API and generative AI to extract features and structured data from PDFs, images, audio, web pages, and video based on simple prompts. Each supported source is converted into a pandas DataFrame ready for analysis.
- 📄 read_pdf() - Extract structured data from PDF documents
- 🖼️ read_image() - Extract data and text from images
- 🎵 read_audio() - Process audio files and extract information
- 🌐 read_webpage() - Scrape and structure web content
- 🎥 read_video() - Analyze video content from frames, audio, or both
All functions return pandas DataFrames, making the data ready for immediate analysis!
pip install fundas

Or install from source:
git clone https://github.com/AMSeify/fundas.git
cd fundas
pip install -e .

First, set your OpenRouter API key using one of the following options:
Option 1: Use environment variable
export OPENROUTER_API_KEY="your-api-key-here"

Option 2: Use .env file (recommended)
# Copy the example file
cp .env.example .env
# Edit .env and add your credentials:
# OPENROUTER_API_KEY=your-api-key-here
# OPENROUTER_MODEL=openai/gpt-3.5-turbo  # Optional: set default model

Option 3: Pass directly to functions
import fundas as fd
df = fd.read_pdf("document.pdf", api_key="your-api-key-here")

import fundas as fd
# Extract invoice data
df = fd.read_pdf(
    "invoice.pdf",
    prompt="Extract invoice items with product name, quantity, and price"
)
print(df)

# Extract data from a chart or screenshot
df = fd.read_image(
    "sales_chart.png",
    prompt="Extract the sales data points from this chart"
)
print(df)
# Process a receipt
df = fd.read_image(
    "receipt.jpg",
    prompt="Extract items and their prices",
    columns=["item", "price", "quantity"]
)

# Scrape product information
df = fd.read_webpage(
    "https://example.com/products",
    prompt="Extract product names, descriptions, and prices"
)
print(df)
# Extract article data
df = fd.read_webpage(
    "https://news.example.com/article",
    columns=["title", "author", "date", "content"]
)

# Transcribe and extract meeting notes
df = fd.read_audio(
    "meeting.mp3",
    prompt="Extract speaker names and key discussion points"
)

# Analyze video frames
df = fd.read_video(
    "presentation.mp4",
    prompt="Extract slide titles and key points from this presentation",
    from_="pics"  # Extract from video frames
)
# Process audio track
df = fd.read_video(
    "lecture.mp4",
    prompt="Transcribe the lecture and identify key topics",
    from_="audios"  # Extract from audio track
)
# Analyze both video and audio
df = fd.read_video(
    "interview.mp4",
    prompt="Extract interview questions and answers",
    from_="both"  # or from_=["pics", "audios"]
)

You can specify which columns you want to extract:
df = fd.read_pdf(
    "report.pdf",
    prompt="Extract quarterly financial data",
    columns=["quarter", "revenue", "expenses", "profit"]
)

Use different AI models via OpenRouter:
# Option 1: Pass model parameter to each function
df = fd.read_image(
    "complex_diagram.png",
    prompt="Extract relationships between components",
    model="anthropic/claude-3-opus"
)
# Option 2: Set default model in .env file
# OPENROUTER_MODEL=anthropic/claude-3-sonnet
df = fd.read_image("diagram.png", prompt="Extract data")  # Uses model from .env
# Option 3: Set via environment variable
import os
os.environ["OPENROUTER_MODEL"] = "openai/gpt-4"
df = fd.read_pdf("document.pdf", prompt="Extract info")

Since all functions return pandas DataFrames, you can immediately use pandas operations:
import fundas as fd
# Read and analyze in one workflow
df = fd.read_pdf("sales.pdf", prompt="Extract sales data")
print(df.head())
print(df.describe())
print(df.groupby('region')['sales'].sum())

- Python >= 3.8
- pandas >= 1.3.0
- requests >= 2.25.0
- PyPDF2 >= 3.0.0
- Pillow >= 10.3.0
- beautifulsoup4 >= 4.9.0
- opencv-python >= 4.8.1.78
Fundas includes an intelligent caching system to reduce redundant API calls:
import fundas as fd
# Enable caching (enabled by default)
df = fd.read_pdf("document.pdf", prompt="Extract data")
# The same file with the same prompt will use cached results
df2 = fd.read_pdf("document.pdf", prompt="Extract data") # No API call
# Disable caching if needed
from fundas import OpenRouterClient
client = OpenRouterClient(api_key="key", use_cache=False)

Export your DataFrames with AI-powered summarization:
import fundas as fd
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({"product": ["A", "B", "C"], "sales": [100, 200, 150]})
# Export to CSV
fd.to_summarized_csv(df, "output.csv")
# Export to Excel with AI summary
fd.to_summarized_excel(
df,
"summary.xlsx",
prompt="Add a summary row with totals"
)
# Generate AI summary
summary = fd.summarize_dataframe(df, prompt="Summarize sales performance")
print(summary)

Fundas includes robust error handling with automatic retries:
import fundas as fd
try:
    df = fd.read_pdf("document.pdf", prompt="Extract data")
except FileNotFoundError:
    print("File not found")
except ValueError as e:
    print(f"Invalid parameters: {e}")
except RuntimeError as e:
    print(f"API error: {e}")

Control the cache behavior:
from fundas import get_cache
cache = get_cache()
# Clear all cache entries
cache.clear()
# Clear only expired entries
cache.clear_expired()
# Disable/enable cache
cache.disable()
cache.enable()

All read functions share similar parameters:
Common Parameters:
- filepath or url (str | Path): Source file or URL
- prompt (str): Description of what data to extract
- columns (List[str], optional): Column names to extract
- api_key (str, optional): OpenRouter API key
- model (str, optional): AI model to use (default: gpt-3.5-turbo)
Returns: pandas DataFrame
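Because the read functions share one calling convention, a small dispatcher can choose the reader from the source itself. The sketch below is illustrative only: `pick_reader` and the extension table are hypothetical and not part of Fundas, and only the five function names come from the feature list above. It returns the function name rather than calling the library, so it runs without an API key.

```python
from pathlib import Path

# Hypothetical extension table. The formats each Fundas reader actually
# accepts may differ, so treat these entries as assumptions.
READER_BY_SUFFIX = {
    ".pdf": "read_pdf",
    ".png": "read_image",
    ".jpg": "read_image",
    ".jpeg": "read_image",
    ".mp3": "read_audio",
    ".wav": "read_audio",
    ".mp4": "read_video",
}


def pick_reader(source: str) -> str:
    """Return the name of the read function that matches a path or URL."""
    # URLs go to the webpage reader regardless of any suffix in the path.
    if source.startswith(("http://", "https://")):
        return "read_webpage"
    suffix = Path(source).suffix.lower()
    try:
        return READER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No reader known for {source!r}") from None
```

With the library installed, something like `getattr(fd, pick_reader(path))(path, prompt="Extract data")` would then call the matching function.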
All export functions accept:
Parameters:
- df (pd.DataFrame): DataFrame to export
- filepath (str | Path): Output file path
- prompt (str, optional): AI transformation prompt
- api_key (str, optional): OpenRouter API key
- model (str, optional): AI model to use
read_pdf(): Extract structured data from PDF files.
Parameters:
- filepath (str | Path): Path to the PDF file
- prompt (str): Description of what data to extract
- columns (List[str], optional): Column names to extract
- api_key (str, optional): OpenRouter API key
- model (str, optional): AI model to use
Returns: pandas DataFrame
read_image(): Extract structured data from image files.
Parameters:
- filepath (str | Path): Path to the image file
- prompt (str): Description of what data to extract
- columns (List[str], optional): Column names to extract
- api_key (str, optional): OpenRouter API key
- model (str, optional): AI model to use
Returns: pandas DataFrame
read_audio(): Extract structured data from audio files.
Parameters:
- filepath (str | Path): Path to the audio file
- prompt (str): Description of what data to extract
- columns (List[str], optional): Column names to extract
- api_key (str, optional): OpenRouter API key
- model (str, optional): AI model to use
Returns: pandas DataFrame
read_webpage(): Extract structured data from web pages.
Parameters:
- url (str): URL of the webpage
- prompt (str): Description of what data to extract
- columns (List[str], optional): Column names to extract
- api_key (str, optional): OpenRouter API key
- model (str, optional): AI model to use
Returns: pandas DataFrame
read_video(): Extract structured data from video files.
Parameters:
- filepath (str | Path): Path to the video file
- prompt (str): Description of what data to extract
- from_ (str | List[str]): Source to extract from - 'pics', 'audios', or 'both'
- columns (List[str], optional): Column names to extract
- api_key (str, optional): OpenRouter API key
- model (str, optional): AI model to use
- sample_rate (int): Frame sampling rate (default: 30)
Returns: pandas DataFrame
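The `from_` and `sample_rate` parameters can be pictured with a short sketch. Both helpers are hypothetical, and how Fundas interprets these values internally is not documented here. The sketch assumes that 'both' is shorthand for ['pics', 'audios'], as the usage example suggests, and that a sample rate of N means keeping every Nth frame.

```python
from typing import List, Union

VALID_SOURCES = ("pics", "audios")


def normalize_sources(from_: Union[str, List[str]]) -> List[str]:
    """Expand 'both' and validate the requested sources (assumed semantics)."""
    if from_ == "both":
        return list(VALID_SOURCES)
    sources = [from_] if isinstance(from_, str) else list(from_)
    unknown = [s for s in sources if s not in VALID_SOURCES]
    if unknown:
        raise ValueError(f"Unknown source(s): {unknown}")
    return sources


def sampled_frame_indices(total_frames: int, sample_rate: int = 30) -> List[int]:
    """Indices of the frames kept when every sample_rate-th frame is taken."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be at least 1")
    return list(range(0, total_frames, sample_rate))
```

Under these assumptions, a 100-frame clip with the default rate would contribute four frames (indices 0, 30, 60, and 90), which is why a larger rate reduces the number of images sent to the model.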
to_summarized_csv(): Export DataFrame to CSV with optional AI-powered summarization.
Parameters:
- df (pd.DataFrame): DataFrame to export
- filepath (str | Path): Path to save the CSV file
- prompt (str, optional): Prompt to transform/summarize data
- api_key (str, optional): OpenRouter API key
- model (str, optional): AI model to use
- **kwargs: Additional arguments for pd.DataFrame.to_csv()
to_summarized_excel(df, filepath, prompt=None, sheet_name="Sheet1", api_key=None, model=None, **kwargs)
Export DataFrame to Excel with optional AI-powered summarization.
Export DataFrame to JSON with optional AI-powered summarization.
summarize_dataframe(): Generate an AI-powered summary of a DataFrame.
Returns: str (AI-generated summary)
- OPENROUTER_API_KEY: Your OpenRouter API key
- OPENROUTER_MODEL: Default model to use (optional)
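As a rough illustration of how settings such as the API key and model might be resolved, the sketch below checks an explicit argument first, then the process environment, then a .env-style file, and finally a default. This precedence is an assumption made for the example, and `parse_env_file` and `resolve_setting` are hypothetical helpers rather than Fundas functions.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comment lines."""
    values: Dict[str, str] = {}
    if not path.exists():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional double quotes.
        values[key.strip()] = value.strip().strip('"')
    return values


def resolve_setting(
    name: str,
    explicit: Optional[str] = None,
    env_file: Path = Path(".env"),
    default: Optional[str] = None,
) -> Optional[str]:
    """Assumed precedence: explicit argument, environment, .env file, default."""
    if explicit is not None:
        return explicit
    if name in os.environ:
        return os.environ[name]
    return parse_env_file(env_file).get(name, default)
```

For example, `resolve_setting("OPENROUTER_MODEL", default="openai/gpt-3.5-turbo")` would fall back to the default only when neither the environment nor the file defines the variable.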
The cache is stored in ~/.fundas/cache/ by default. You can configure:
- Cache directory location
- Time-to-live (TTL) for cache entries
- Enable/disable caching
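To make these options concrete, here is a minimal sketch of a file-backed cache with a TTL, keyed on the file's content and the prompt. It is not Fundas's implementation: the class name `TTLFileCache`, the key scheme, and the JSON layout are all assumptions chosen for illustration.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Any, Optional


class TTLFileCache:
    """Illustrative cache: one JSON file per entry, expired by age."""

    def __init__(self, directory: Path, ttl_seconds: float = 3600.0, enabled: bool = True):
        self.directory = Path(directory)
        self.ttl_seconds = ttl_seconds
        self.enabled = enabled
        self.directory.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def make_key(file_bytes: bytes, prompt: str) -> str:
        # Hash the content and the prompt together so that changing
        # either one produces a different cache entry.
        digest = hashlib.sha256()
        digest.update(file_bytes)
        digest.update(b"\0")
        digest.update(prompt.encode("utf-8"))
        return digest.hexdigest()

    def _path(self, key: str) -> Path:
        return self.directory / f"{key}.json"

    def get(self, key: str) -> Optional[Any]:
        if not self.enabled:
            return None
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # entry is older than the TTL, so drop it
            return None
        return entry["value"]

    def set(self, key: str, value: Any) -> None:
        if self.enabled:
            payload = {"stored_at": time.time(), "value": value}
            self._path(key).write_text(json.dumps(payload))
```

Keying on file content rather than file name means an edited document is treated as new input, while a renamed copy of the same document still hits the cache.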
- Use caching: Keep caching enabled (default) to avoid redundant API calls
- Specify columns: When you know what columns you need, specify them explicitly
- Choose the right model: Balance speed, cost, and accuracy by selecting appropriate models
- Batch operations: Process multiple files in sequence to leverage cache warming
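The batch-processing tip can be sketched as a loop that applies one reader to many files and records failures instead of stopping at the first one. `read_many` is a hypothetical helper, not a Fundas function. The reader is passed in as an argument (for example `fd.read_pdf`), and the exception types it catches mirror the error-handling example earlier in this README.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


def read_many(
    paths: Iterable[str],
    reader: Callable[..., Any],
    prompt: str,
    columns: Optional[List[str]] = None,
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Apply `reader` to each path; return (results, errors) keyed by path."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        kwargs: Dict[str, Any] = {"prompt": prompt}
        if columns is not None:
            kwargs["columns"] = columns
        try:
            results[path] = reader(path, **kwargs)
        except (FileNotFoundError, ValueError, RuntimeError) as exc:
            # Keep going so that one bad file does not abort the batch.
            errors[path] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

With Fundas installed, `read_many(["a.pdf", "b.pdf"], fd.read_pdf, prompt="Extract data")` would return one DataFrame per file, which could then be combined with `pd.concat(list(results.values()))`.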
MIT License - see LICENSE file for details
Contributions are welcome! We appreciate bug fixes, new features, documentation improvements, and more.
Please see our Contributing Guide for details on:
- Setting up your development environment
- Coding standards and style guide
- Testing requirements
- Pull request process
Quick start:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
For issues and questions, please open an issue on GitHub.