TubeScript/
├── backend/
│ ├── app.py # Main application server (FastAPI)
│ ├── preload_models.py # Script to download and cache models
│ ├── requirements.txt # Python dependencies
│ ├── modules/
│ │ ├── __init__.py
│ │ ├── youtube.py # YouTube download functionality (yt-dlp)
│ │ ├── diarization.py # Speaker diarization (pyannote)
│ │ ├── transcription.py # Transcription (Whisper)
│ │ └── assembler.py # Transcript assembly and formatting
│ └── utils/
│   ├── __init__.py
│   ├── audio.py # Audio processing utilities
│   └── validators.py # Input validation functions
├── frontend/
│ ├── index.html # Main HTML page
│ ├── package.json # JS dependencies
│ ├── src/
│ │ ├── main.js # Main application entry point
│ │ ├── components/ # UI components
│ │ │ ├── App.js
│ │ │ ├── VideoInput.js
│ │ │ ├── TranscriptView.js
│ │ │ ├── SpeakerEditor.js
│ │ │ └── ExportOptions.js
│ │ ├── utils/ # Frontend utilities
│ │ │ ├── api.js # API communication
│ │ │ └── formatters.js # Transcript formatting
│ │ └── styles/ # CSS styles
│ └── public/ # Static assets
└── README.md # Project documentation
- Set up FastAPI environment with necessary dependencies
- Implement YouTube download functionality using `yt-dlp`
- Integrate `pyannote.audio` for speaker diarization
- Implement Whisper transcription with punctuation
- Create transcript assembly logic
- Build REST API endpoints for frontend communication
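
The transcript assembly step can be sketched as follows. This is a minimal illustration, not the project's actual `assembler.py`: it assumes diarization yields speaker turns and transcription yields its own timestamped segments, and it labels each segment with the speaker whose turn overlaps it most. The names `Turn`, `Segment`, `assemble`, and the `UNKNOWN` fallback label are all hypothetical. If each diarized turn is transcribed directly instead, the overlap matching becomes unnecessary and only the merging step applies.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """A diarization turn: one speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """A transcribed span of text with its time range (seconds)."""
    start: float
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assemble(turns, segments, unknown="UNKNOWN"):
    """Label each segment with its best-overlapping speaker, then merge
    consecutive segments that belong to the same speaker."""
    lines = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg.start, seg.end, t.start, t.end),
            default=None,
        )
        if best is not None and overlap(seg.start, seg.end, best.start, best.end) > 0:
            speaker = best.speaker
        else:
            speaker = unknown  # no turn covers this segment
        text = seg.text.strip()
        if lines and lines[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the previous line.
            lines[-1]["end"] = seg.end
            lines[-1]["text"] += " " + text
        else:
            lines.append(
                {"speaker": speaker, "start": seg.start, "end": seg.end, "text": text}
            )
    return lines
```

Each output line is a plain dict with `speaker`, `start`, `end`, and `text` keys, a shape chosen here only because it serializes directly to JSON.
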
- Create responsive UI for YouTube URL input
- Implement API communication with backend
- Design transcript display with speaker labels and timestamps
- Build speaker renaming functionality
- Implement export options (.txt, .srt, .vtt)
- Add styling and UX improvements
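
The three export formats differ mainly in their timestamp syntax and cue layout. The sketch below shows one way to produce them, written in Python on the assumption that export is served by the backend; the same logic could equally live in `formatters.js`. The function names and the transcript line shape (dicts with `speaker`, `start`, `end`, `text`) are assumptions for illustration.

```python
def format_timestamp(seconds, sep=","):
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT, sep='.')."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def to_srt(lines):
    """SubRip: numbered cues, comma before milliseconds, blank line between cues."""
    blocks = []
    for index, line in enumerate(lines, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(line['start'])} --> {format_timestamp(line['end'])}\n"
            f"{line['speaker']}: {line['text']}\n"
        )
    return "\n".join(blocks)


def to_vtt(lines):
    """WebVTT: 'WEBVTT' header, dot before milliseconds, <v Name> voice spans."""
    cues = ["WEBVTT\n"]
    for line in lines:
        cues.append(
            f"{format_timestamp(line['start'], '.')} --> "
            f"{format_timestamp(line['end'], '.')}\n"
            f"<v {line['speaker']}>{line['text']}\n"
        )
    return "\n".join(cues)


def to_txt(lines):
    """Plain text: one line per speaker turn with a leading timestamp."""
    return "\n".join(
        f"[{format_timestamp(line['start'], '.')}] {line['speaker']}: {line['text']}"
        for line in lines
    )
```

Keeping the speaker name inside the cue text (SRT) or in a voice span (WebVTT) means renamed speakers carry through to every export without extra bookkeeping.
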
| Endpoint | Method | Description |
|---|---|---|
| `/api/process` | POST | Process YouTube URL, returns job ID |
| `/api/status/{job_id}` | GET | Get processing status |
| `/api/transcript/{job_id}` | GET | Get completed transcript |
| `/api/rename/{job_id}` | POST | Rename speakers in transcript |
| `/api/export/{job_id}` | GET | Export transcript in requested format |
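
The endpoints above all revolve around a job ID, so the server needs some registry of jobs. The following is a framework-independent sketch of such a registry, assuming jobs are held in memory; the `JobStore` class, its method names, and the status values (`queued`, `processing`, `complete`, `failed`) are hypothetical. Route handlers in `app.py` could delegate to methods like these.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Job:
    url: str
    status: str = "queued"  # assumed lifecycle: queued -> processing -> complete | failed
    transcript: list = field(default_factory=list)


class JobStore:
    """In-memory job registry (hypothetical). Contents are lost on restart."""

    def __init__(self):
        self._jobs = {}

    def create(self, url):
        """Backs POST /api/process: register a job and return its ID."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = Job(url=url)
        return job_id

    def status(self, job_id):
        """Backs GET /api/status/{job_id}."""
        return {"job_id": job_id, "status": self._jobs[job_id].status}

    def complete(self, job_id, transcript):
        """Called by the processing pipeline when the transcript is ready."""
        job = self._jobs[job_id]
        job.transcript = transcript
        job.status = "complete"

    def transcript(self, job_id):
        """Backs GET /api/transcript/{job_id}: only valid once complete."""
        job = self._jobs[job_id]
        if job.status != "complete":
            raise RuntimeError(f"job {job_id} is {job.status}")
        return job.transcript

    def rename(self, job_id, mapping):
        """Backs POST /api/rename/{job_id}: apply {old_label: new_name}."""
        for line in self._jobs[job_id].transcript:
            line["speaker"] = mapping.get(line["speaker"], line["speaker"])
```

An unknown job ID raises `KeyError` here; a real handler would presumably translate that into a 404 response.
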
- User submits YouTube URL via frontend
- Backend downloads audio and processes in sequence:
  - Extract audio from video
  - Perform speaker diarization
  - Transcribe each speaker segment
  - Assemble final transcript
- Frontend polls status endpoint until complete
- Transcript is displayed with speaker labels
- User can rename speakers and export in desired format
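
The flow above has two moving parts: a backend that runs its stages strictly in order, and a client that polls until the job finishes. Both can be sketched in a few lines. The function names, the `(name, func)` stage convention, and the `complete`/`failed` status strings are assumptions for illustration. The polling loop is shown in Python for consistency, although in this project it would live in the frontend's `api.js`.

```python
import time


def run_pipeline(url, stages, on_progress=lambda name: None):
    """Run each (name, func) stage in order, feeding each result to the next.

    on_progress is called with the stage name before the stage starts, so the
    caller can expose it through the status endpoint.
    """
    result = url
    for name, func in stages:
        on_progress(name)
        result = func(result)
    return result


def poll_until_done(get_status, interval=1.0, timeout=600.0, sleep=time.sleep):
    """Call get_status() until it returns 'complete' or 'failed', or time out."""
    waited = 0.0
    while waited <= timeout:
        status = get_status()
        if status in ("complete", "failed"):
            return status
        sleep(interval)
        waited += interval
    raise TimeoutError("job did not finish in time")


if __name__ == "__main__":
    # Stub stages standing in for the real download/diarize/transcribe/assemble code.
    stages = [
        ("extract_audio", lambda url: url + ".wav"),
        ("diarize", lambda audio: {"audio": audio, "turns": ["SPEAKER_00"]}),
        ("transcribe", lambda d: {**d, "texts": ["hello"]}),
        ("assemble", lambda d: list(zip(d["turns"], d["texts"]))),
    ]
    print(run_pipeline("video", stages, on_progress=print))
```

Passing `sleep` as a parameter keeps the polling loop testable without real delays.
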
- Backend core functionality (YouTube download, diarization, transcription)
- API endpoint implementation
- Frontend basic UI and API integration
- Speaker renaming functionality
- Export options
- Testing and refinement
- Documentation and deployment instructions