Web Media Parser

A powerful application for parsing and downloading media files from websites with an intuitive GUI and intelligent URL prioritization.

Features

Smart Media Discovery: Automatically detects images, videos, and other media files on web pages
Intelligent URL Prioritization: Prioritizes media URLs over navigation links, ensuring complete media extraction from each page
Full-size Image Detection: Automatically discovers and downloads full-size versions of images (not just thumbnails)
Multi-threaded Processing: Concurrent parsing and downloading for optimal performance
Site Pattern Support: Extensible pattern system for site-specific optimizations
Modern GUI: Dark-themed user interface built with PySide6
Download Management: Progress tracking, pause/resume functionality, and duplicate detection
Configurable Settings: Customizable depth limits, thread counts, file filters, and more

Installation

Requirements

Python 3.8 or higher
PySide6 (Qt6 for Python)
Required Python packages (see requirements.txt)

Quick Setup

Clone this repository:

git clone https://github.com/yourusername/yourreponame.git
cd yourreponame

Install dependencies:

pip install -r requirements.txt

Run the application:

python main.py

Building Executable

To create a standalone executable:

python build_exe.py

Usage

Launch the Application: Run python main.py
Enter URL: Paste the URL of the website you want to parse
Select Download Folder: Choose where to save downloaded media
Configure Settings (optional):
- Search depth (how many levels deep to follow links)
- Number of parser and downloader threads
- File type filters
- Minimum image dimensions
Start Parsing: Click "Start" to begin media discovery and downloading

Advanced Features

Pattern Management: Add custom site patterns for better media detection
Stop Words: Filter out unwanted URLs containing specific keywords
Domain Restrictions: Limit parsing to the same domain as the starting URL
JavaScript Processing: Enable dynamic content parsing for modern websites

Architecture

The application uses a sophisticated multi-stage processing pipeline:

URL Queue Management: Intelligent prioritization ensures media URLs are processed before navigation
Webpage Parsing: HTML analysis to discover media files and follow relevant links
Media Classification: Automatic detection of full-size images vs thumbnails
Download Optimization: Concurrent downloading with progress tracking and error handling

Key Components

src/parser/priority_url_queue.py: Intelligent URL prioritization system
src/parser/webpage_parser.py: HTML parsing and media discovery
src/parser/parser_manager.py: Main coordination and workflow management
src/downloader/media_downloader.py: Multi-threaded file downloading
src/gui/main_window.py: User interface implementation

Configuration

The application supports various configuration options:

Search Depth: Control how deep to follow links (default: 3 levels)
Thread Counts: Adjust parser and downloader thread counts for performance
File Filters: Set minimum dimensions and file type restrictions
Custom Patterns: Add site-specific parsing rules

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup

Fork the repository
Create a feature branch: git checkout -b feature-name
Make your changes and add tests if applicable
Commit your changes: git commit -am 'Add some feature'
Push to the branch: git push origin feature-name
Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

Version 1.0.0

Initial release with core parsing and downloading functionality
Intelligent URL prioritization system
Full-size image detection
Multi-threaded processing
Modern GUI with dark theme

Troubleshooting

Common Issues

Media files are not being detected:

Check if the website uses JavaScript to load content (enable dynamic processing)
Verify the URL is accessible and contains media
Check the application logs for parsing errors

Download speeds are slow:

Increase the number of downloader threads in settings
Check your internet connection
Some sites may have rate limiting

Application crashes on startup:

Ensure all dependencies are installed: pip install -r requirements.txt
Check Python version compatibility (3.8+ required)
Try running from command line to see error messages

Support

If you encounter any issues or have questions:

Check the Issues page
Create a new issue with detailed information about the problem
Include log files and screenshots if relevant

Acknowledgments

Built with PySide6 for the GUI framework
Uses aiohttp for asynchronous web requests
BeautifulSoup for HTML parsing
Thanks to all contributors and testers

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
resources		resources
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
backup.py		backup.py
build_exe.py		build_exe.py
main.py		main.py
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Web Media Parser

Features

Installation

Requirements

Quick Setup

Building Executable

Usage

Advanced Features

Architecture

Key Components

Configuration

Contributing

Development Setup

License

Changelog

Version 1.0.0

Troubleshooting

Common Issues

Support

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

Sucotasch/web-media-parser

Folders and files

Latest commit

History

Repository files navigation

Web Media Parser

Features

Installation

Requirements

Quick Setup

Building Executable

Usage

Advanced Features

Architecture

Key Components

Configuration

Contributing

Development Setup

License

Changelog

Version 1.0.0

Troubleshooting

Common Issues

Support

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages