This directory contains organized scripts for CADS data processing, migration, and maintenance tasks. Scripts are categorized by function for easy discovery and maintenance.
scripts/
├── README.md                                   # This documentation
├── migration/                                  # Database migration scripts
│   ├── execute_cads_migration.py               # ✅ MAIN MIGRATION SCRIPT
│   └── legacy/                                 # Archived migration attempts
│       ├── execute_cads_migration_direct.py
│       ├── execute_cads_migration_alternative.py
│       ├── execute_cads_migration_fixed.py
│       ├── execute_cads_migration_ipv6.py
│       ├── execute_cads_migration_port_6543.py
│       ├── execute_cads_migration_retry.py
│       └── execute_cads_migration.py
├── processing/                                 # Data processing scripts
│   ├── process_cads_with_openalex_ids.py       # ✅ RECOMMENDED
│   └── migrate_cads_data_to_cads_tables.py     # ✅ ESSENTIAL
└── utilities/                                  # Utility and verification scripts
    ├── check_cads_data_location.py
    ├── check_existing_cads_data.py
    ├── test_cads_parsing.py
    └── [other utility scripts]
# Step 1: Create CADS tables and basic structure
python3 scripts/migration/execute_cads_migration.py
# Step 2: Process all CADS professors with OpenAlex IDs
python3 scripts/processing/process_cads_with_openalex_ids.py
# Step 3: Migrate data to CADS-specific tables
python3 scripts/processing/migrate_cads_data_to_cads_tables.py
# Step 4: Verify data location and completeness
python3 scripts/utilities/check_cads_data_location.py

execute_cads_migration.py - WORKING VERSION
- Uses IPv4 pooler connection (recommended)
- Creates CADS tables and migrates data
- Handles SQL syntax issues properly
- Status: ✅ Tested and working
Located in migration/legacy/. These are archived versions that tried various connection approaches:
- execute_cads_migration_direct.py - Direct DATABASE_URL (has IPv6 issues)
- execute_cads_migration_alternative.py - Multiple connection methods
- execute_cads_migration_fixed.py - DNS resolution fix attempt
- execute_cads_migration_ipv6.py - IPv6-specific approach
- execute_cads_migration_port_6543.py - Port 6543 testing
- execute_cads_migration_retry.py - Retry logic implementation
- execute_cads_migration.py - Original migration script
process_cads_with_openalex_ids.py - RECOMMENDED
- Processes all 42 CADS professors using known OpenAlex IDs
- Most reliable approach for data collection
- Handles all professors with confirmed OpenAlex profiles
- Status: ✅ Tested and working
migrate_cads_data_to_cads_tables.py - ESSENTIAL
- Migrates data from main tables to CADS-specific tables
- Fixes data location issues
- Required after running main processing scripts
- Status: ✅ Tested and working
- check_cads_data_location.py - Verify where CADS data is stored
- check_existing_cads_data.py - Analyze existing CADS data
- test_cads_parsing.py - Test CADS data parsing functionality
File: scripts/migration/execute_cads_migration.py
Purpose: Creates CADS database tables and initial structure
Features:
- IPv4 pooler connection (resolves DNS issues)
- Complete CADS schema creation
- Error handling and logging
- Verification of table creation
Usage:
python3 scripts/migration/execute_cads_migration.py
Expected Output:
- Creates cads_researchers, cads_works, cads_topics tables
- Sets up indexes and relationships
- Enables vector extension for embeddings
File: scripts/processing/process_cads_with_openalex_ids.py
Purpose: Fetches and processes research data for all CADS faculty
Features:
- Uses known OpenAlex IDs for reliable data retrieval
- Processes ~42 CADS professors
- Generates semantic embeddings
- Handles API rate limiting
Usage:
python3 scripts/processing/process_cads_with_openalex_ids.py
Expected Output:
- ~32 researchers in database
- ~2,454 research works
- ~6,834 research topics
- Complete embeddings for all works
File: scripts/processing/migrate_cads_data_to_cads_tables.py
Purpose: Moves data from main tables to CADS-specific tables
Features:
- Transfers data between table structures
- Maintains relationships and integrity
- Handles duplicate prevention
- Provides migration summary
Usage:
python3 scripts/processing/migrate_cads_data_to_cads_tables.py
Expected Output:
- Data moved to cads_* tables
- Verification of successful migration
- Summary of migrated records
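
The scripts described above run in a fixed order (migration, processing, table migration, verification). As a rough illustration, that sequence could be wrapped in a small runner like the one below. This is only a sketch: `run_steps` and `STEPS` are hypothetical names, and it assumes each script exits with a non-zero code on failure, which this README does not confirm.

```python
import subprocess
import sys

# Script paths are taken from this README, in the documented order.
STEPS = [
    "scripts/migration/execute_cads_migration.py",
    "scripts/processing/process_cads_with_openalex_ids.py",
    "scripts/processing/migrate_cads_data_to_cads_tables.py",
    "scripts/utilities/check_cads_data_location.py",
]


def run_steps(steps, runner=subprocess.run):
    """Run each script in order and stop at the first non-zero exit code.

    Returns the list of scripts that completed successfully.
    Assumption: a failing script signals failure through its exit code.
    """
    completed = []
    for script in steps:
        # Use the current interpreter so the same environment is reused.
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"Step failed: {script} (exit code {result.returncode})")
            break
        completed.append(script)
    return completed
```

Calling `run_steps(STEPS)` from the repository root would run the four steps and skip the later ones if an earlier one fails, so processing does not start against tables that were never created.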
All scripts require these environment variables:
# IPv4 Pooler Connection (Required)
user=postgres.zsezliiffdcgqekwggjq
password=cadstxst2025
host=aws-0-us-east-2.pooler.supabase.com
port=5432
dbname=postgres
# OpenAlex API (Required)
OPENALEX_EMAIL=test@texasstate.edu
# Optional: Groq API for theme generation
GROQ_API_KEY=your-groq-api-key

Scripts require these Python packages:
- psycopg2-binary - PostgreSQL connection
- requests - HTTP requests for APIs
- pandas - Data manipulation
- python-dotenv - Environment variable loading
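
To show how the environment variables listed above might be consumed, here is a minimal sketch that checks for the required keys and collects the database settings. `load_db_settings` and `REQUIRED_DB_KEYS` are hypothetical names, not taken from the scripts, and the sketch uses only the standard library; the actual scripts may load and validate configuration differently.

```python
import os

# Variable names as listed in this README (lowercase for the database keys).
REQUIRED_DB_KEYS = ("user", "password", "host", "port", "dbname")
REQUIRED_OTHER_KEYS = ("OPENALEX_EMAIL",)


def load_db_settings(env=None):
    """Return database settings as a dict, or raise if anything is missing.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    missing = [
        key for key in REQUIRED_DB_KEYS + REQUIRED_OTHER_KEYS if not env.get(key)
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED_DB_KEYS}
    settings["port"] = int(settings["port"])  # environment values are strings
    return settings
```

If the variables live in a `.env` file, python-dotenv's `load_dotenv()` can populate `os.environ` before this function is called. The returned dict uses the keyword names that `psycopg2.connect` accepts, so it could be passed as `psycopg2.connect(**settings)`.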
| Metric | Expected Value | Description |
|---|---|---|
| CADS Researchers | ~32 | Faculty from CS Department |
| Research Works | ~2,454 | Academic papers and publications |
| Research Topics | ~6,834 | Topic classifications |
| Embeddings | 100% | All works have semantic vectors |
| Citations | Complete | Citation data for all works |
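
The expected values in the table above are approximate, so a check after a run could compare actual row counts against them with some slack. The sketch below assumes the three metrics map to the `cads_researchers`, `cads_works`, and `cads_topics` tables and uses an arbitrary 10% tolerance; both the mapping and the tolerance are assumptions, and `compare_counts` is a hypothetical helper rather than part of the utilities.

```python
# Approximate targets from the table above; the table-name mapping is assumed.
EXPECTED = {
    "cads_researchers": 32,
    "cads_works": 2454,
    "cads_topics": 6834,
}


def compare_counts(actual, expected=EXPECTED, tolerance=0.10):
    """Return {table: bool} telling whether each count is within tolerance.

    `actual` maps table names to row counts (for example from SELECT COUNT(*)).
    A missing table is treated as zero rows.
    """
    report = {}
    for table, target in expected.items():
        count = actual.get(table, 0)
        report[table] = abs(count - target) <= tolerance * target
    return report
```

For example, `compare_counts({"cads_researchers": 32, "cads_works": 2000, "cads_topics": 6900})` flags only `cads_works` as outside the tolerance.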
# Solution: Use IPv4 pooler scripts only
python3 scripts/migration/execute_cads_migration.py

# Solution: Run migration script
python3 scripts/processing/migrate_cads_data_to_cads_tables.py

- Check API rate limits (10 requests/second)
- Verify network connectivity
- Check OPENALEX_EMAIL configuration
- Use the main migration script (handles syntax properly)
- Avoid legacy scripts unless debugging
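
For the rate-limit item above (10 requests/second), one simple way to stay under the limit is to enforce a minimum interval between calls. The sketch below is illustrative only: `Throttle` is a hypothetical class, not the mechanism used by the processing script, and the clock and sleep functions are injectable purely so the behavior can be tested without real delays.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block just long enough to keep calls at or below the configured rate."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each HTTP request spaces requests at least 0.1 seconds apart with the default setting.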
# Check data location
python3 scripts/utilities/check_cads_data_location.py
# Verify existing data
python3 scripts/utilities/check_existing_cads_data.py
# Test parsing functionality
python3 scripts/utilities/test_cads_parsing.py

All scripts generate detailed logs:
- Execution logs: Saved to console and log files
- Error tracking: Full stack traces for debugging
- Progress monitoring: Real-time status updates
- Performance metrics: Timing and success rates
- Data Updates: Re-run processing scripts monthly
- Schema Updates: Apply new migrations as needed
- Performance Monitoring: Check script execution times
- Error Monitoring: Review logs for issues
When updating scripts:
- Test in development environment first
- Backup database before major changes
- Update documentation
- Archive old versions to legacy folder
| Script | Status | Purpose | Recommended |
|---|---|---|---|
| migration/execute_cads_migration.py | ✅ Working | Database setup | Yes |
| processing/process_cads_with_openalex_ids.py | ✅ Working | Data collection | Yes |
| processing/migrate_cads_data_to_cads_tables.py | ✅ Working | Data organization | Yes |
| utilities/check_cads_data_location.py | ✅ Working | Verification | Yes |
| migration/legacy/* | Archived | Alternative approaches | No |
- Scripts prepare data for pipeline processing
- Pipeline reads from tables created by migration scripts
- Processing scripts generate data consumed by pipeline
- Scripts create data structure for visualization
- Migration ensures proper table relationships
- Processing provides complete dataset for display
🎯 Scripts organized and ready for reliable CADS database management!