Skip to content

Latest commit

Β 

History

History
291 lines (218 loc) Β· 8.75 KB

File metadata and controls

291 lines (218 loc) Β· 8.75 KB

CADS Scripts Directory

πŸ”§ Overview

This directory contains organized scripts for CADS data processing, migration, and maintenance tasks. Scripts are categorized by function for easy discovery and maintenance.

πŸ“ Directory Structure

scripts/
β”œβ”€β”€ README.md                    # This documentation
β”œβ”€β”€ migration/                   # Database migration scripts
β”‚   β”œβ”€β”€ execute_cads_migration.py    # βœ… MAIN MIGRATION SCRIPT
β”‚   └── legacy/                      # πŸ“ Archived migration attempts
β”‚       β”œβ”€β”€ execute_cads_migration_direct.py
β”‚       β”œβ”€β”€ execute_cads_migration_alternative.py
β”‚       β”œβ”€β”€ execute_cads_migration_fixed.py
β”‚       β”œβ”€β”€ execute_cads_migration_ipv6.py
β”‚       β”œβ”€β”€ execute_cads_migration_port_6543.py
β”‚       β”œβ”€β”€ execute_cads_migration_retry.py
β”‚       └── execute_cads_migration.py
β”œβ”€β”€ processing/                  # Data processing scripts
β”‚   β”œβ”€β”€ process_cads_with_openalex_ids.py    # βœ… RECOMMENDED
β”‚   └── migrate_cads_data_to_cads_tables.py  # βœ… ESSENTIAL
└── utilities/                   # Utility and verification scripts
    β”œβ”€β”€ check_cads_data_location.py
    β”œβ”€β”€ check_existing_cads_data.py
    β”œβ”€β”€ test_cads_parsing.py
    └── [other utility scripts]

πŸš€ Quick Start Workflow

For Fresh CADS Database Setup:

# Step 1: Create CADS tables and basic structure
python3 scripts/migration/execute_cads_migration.py

# Step 2: Process all CADS professors with OpenAlex IDs  
python3 scripts/processing/process_cads_with_openalex_ids.py

# Step 3: Migrate data to CADS-specific tables
python3 scripts/processing/migrate_cads_data_to_cads_tables.py

# Step 4: Verify data location and completeness
python3 scripts/utilities/check_cads_data_location.py

πŸ“‚ Script Categories

πŸš€ Migration Scripts (migration/)

Main Migration Script βœ…

  • execute_cads_migration.py - WORKING VERSION
    • Uses IPv4 pooler connection (recommended)
    • Creates CADS tables and migrates data
    • Handles SQL syntax issues properly
    • Status: βœ… Tested and working

Legacy Scripts πŸ“

Located in migration/legacy/ - These are archived versions with various connection approaches:

  • execute_cads_migration_direct.py - Direct DATABASE_URL (has IPv6 issues)
  • execute_cads_migration_alternative.py - Multiple connection methods
  • execute_cads_migration_fixed.py - DNS resolution fix attempt
  • execute_cads_migration_ipv6.py - IPv6 specific approach
  • execute_cads_migration_port_6543.py - Port 6543 testing
  • execute_cads_migration_retry.py - Retry logic implementation
  • execute_cads_migration.py - Original migration script

πŸ“Š Processing Scripts (processing/)

OpenAlex Integration βœ…

  • process_cads_with_openalex_ids.py - RECOMMENDED
    • Processes all 42 CADS professors using known OpenAlex IDs
    • Most reliable approach for data collection
    • Handles all professors with confirmed OpenAlex profiles
    • Status: βœ… Tested and working

Data Migration βœ…

  • migrate_cads_data_to_cads_tables.py - ESSENTIAL
    • Migrates data from main tables to CADS-specific tables
    • Fixes data location issues
    • Required after running main processing scripts
    • Status: βœ… Tested and working

πŸ” Utility Scripts (utilities/)

Data Verification

  • check_cads_data_location.py - Verify where CADS data is stored
  • check_existing_cads_data.py - Analyze existing CADS data
  • test_cads_parsing.py - Test CADS data parsing functionality

🎯 Script Details

Migration Script

File: scripts/migration/execute_cads_migration.py

Purpose: Creates CADS database tables and initial structure

Features:

  • IPv4 pooler connection (resolves DNS issues)
  • Complete CADS schema creation
  • Error handling and logging
  • Verification of table creation

Usage:

python3 scripts/migration/execute_cads_migration.py

Expected Output:

  • Creates cads_researchers, cads_works, cads_topics tables
  • Sets up indexes and relationships
  • Enables vector extension for embeddings

Processing Script

File: scripts/processing/process_cads_with_openalex_ids.py

Purpose: Fetches and processes research data for all CADS faculty

Features:

  • Uses known OpenAlex IDs for reliable data retrieval
  • Processes ~42 CADS professors
  • Generates semantic embeddings
  • Handles API rate limiting

Usage:

python3 scripts/processing/process_cads_with_openalex_ids.py

Expected Output:

  • ~32 researchers in database
  • ~2,454 research works
  • ~6,834 research topics
  • Complete embeddings for all works

Data Migration Script

File: scripts/processing/migrate_cads_data_to_cads_tables.py

Purpose: Moves data from main tables to CADS-specific tables

Features:

  • Transfers data between table structures
  • Maintains relationships and integrity
  • Handles duplicate prevention
  • Provides migration summary

Usage:

python3 scripts/processing/migrate_cads_data_to_cads_tables.py

Expected Output:

  • Data moved to cads_* tables
  • Verification of successful migration
  • Summary of migrated records

βš™οΈ Configuration

Environment Requirements

All scripts require these environment variables:

# IPv4 Pooler Connection (Required)
user=postgres.zsezliiffdcgqekwggjq
password=cadstxst2025
host=aws-0-us-east-2.pooler.supabase.com
port=5432
dbname=postgres

# OpenAlex API (Required)
OPENALEX_EMAIL=test@texasstate.edu

# Optional: Groq API for theme generation
GROQ_API_KEY=your-groq-api-key

Dependencies

Scripts require these Python packages:

  • psycopg2-binary - PostgreSQL connection
  • requests - HTTP requests for APIs
  • pandas - Data manipulation
  • python-dotenv - Environment variable loading

πŸ“ˆ Success Metrics

Expected Results After Full Workflow:

Metric Expected Value Description
CADS Researchers ~32 Faculty from CS Department
Research Works ~2,454 Academic papers and publications
Research Topics ~6,834 Topic classifications
Embeddings 100% All works have semantic vectors
Citations Complete Citation data for all works

🚨 Troubleshooting

Common Issues

1. IPv6 Connection Errors

# Solution: Use IPv4 pooler scripts only
python3 scripts/migration/execute_cads_migration.py

2. Data in Wrong Tables

# Solution: Run migration script
python3 scripts/processing/migrate_cads_data_to_cads_tables.py

3. Missing OpenAlex Data

  • Check API rate limits (10 requests/second)
  • Verify network connectivity
  • Check OPENALEX_EMAIL configuration

4. SQL Syntax Errors

  • Use the main migration script (handles syntax properly)
  • Avoid legacy scripts unless debugging

Debug Commands

# Check data location
python3 scripts/utilities/check_cads_data_location.py

# Verify existing data
python3 scripts/utilities/check_existing_cads_data.py

# Test parsing functionality
python3 scripts/utilities/test_cads_parsing.py

πŸ“ Logging

All scripts generate detailed logs:

  • Execution logs: Saved to console and log files
  • Error tracking: Full stack traces for debugging
  • Progress monitoring: Real-time status updates
  • Performance metrics: Timing and success rates

πŸ”„ Maintenance

Regular Tasks

  1. Data Updates: Re-run processing scripts monthly
  2. Schema Updates: Apply new migrations as needed
  3. Performance Monitoring: Check script execution times
  4. Error Monitoring: Review logs for issues

Script Updates

When updating scripts:

  1. Test in development environment first
  2. Backup database before major changes
  3. Update documentation
  4. Archive old versions to legacy folder

🎯 Script Status Summary

Script Status Purpose Recommended
migration/execute_cads_migration.py βœ… Working Database setup Yes
processing/process_cads_with_openalex_ids.py βœ… Working Data collection Yes
processing/migrate_cads_data_to_cads_tables.py βœ… Working Data organization Yes
utilities/check_cads_data_location.py βœ… Working Verification Yes
migration/legacy/* ⚠️ Archived Alternative approaches No

πŸ”— Integration

With CADS Pipeline

  • Scripts prepare data for pipeline processing
  • Pipeline reads from tables created by migration scripts
  • Processing scripts generate data consumed by pipeline

With Visualization

  • Scripts create data structure for visualization
  • Migration ensures proper table relationships
  • Processing provides complete dataset for display

🎯 Scripts organized and ready for reliable CADS database management!