LLM-vision-Captioning is a Python-based tool designed to automatically generate natural language captions for images using large language models (LLMs) with vision capabilities. This tool is ideal for researchers, dataset curators, content creators, and developers looking to automate the annotation of image datasets.
- ✅ Automatic image captioning using LLMs with vision support
- 📁 Batch processing of images from a directory
- ⏱ Progress tracking with estimated remaining time
- 🖼️ Supports common image formats (`.jpg`, `.png`)
- 🌈 Terminal color output for easier monitoring (with `colorama`)
- 🧩 Modular codebase for easy extension and customization
LLM-vision-Captioning/
├── libs/ # Core model execution logic
├── ui/ # User interface logic (if any)
├── captioning/ # Folder for storing image data
├── main.py # Entry point script
├── requirements.txt # Python dependencies
└── README.md # Project documentation
To install and run this tool, follow these steps:
# 1. Clone the repository
git clone https://github.com/adigayung/LLM-vision-Captioning
# 2. Navigate into the project directory
cd LLM-vision-Captioning
# 3. Install all required Python packages
pip install -r requirements.txt

✅ Python 3.8 or higher is required.
Once dependencies are installed, run the tool using:
python main.py

By default, it will:
- Load the selected model
- Load prompts for guiding the caption generation
- Iterate through all `.jpg` and `.png` images in the selected folder
- Generate captions
- Display progress in the terminal, including:
  - Iteration status, e.g. `[3 / 10]`
  - Time taken per image
  - Total elapsed time
  - Estimated remaining time
Example output:
[3 / 10] Processing: /images/cat.jpg
✅ Success
⏱ Time taken : 2.4 seconds
⏳ Elapsed time : 6.5 seconds
🕒 Estimated remaining: 17.1 seconds
--------------------------------------------------
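The progress readout above can be produced by a straightforward timing loop. The following is a minimal sketch under stated assumptions, not the project's actual code: `generate_caption` is a placeholder for the model call, and saving each caption to a `.txt` file next to the image is an assumption about output handling.

```python
import time
from pathlib import Path

def caption_folder(folder, generate_caption):
    """Caption every .jpg/.png image in `folder` and print progress with an ETA."""
    images = sorted(p for p in Path(folder).iterdir()
                    if p.suffix.lower() in {".jpg", ".png"})
    start = time.time()
    for i, path in enumerate(images, start=1):
        t0 = time.time()
        caption = generate_caption(path)              # placeholder for the model call
        path.with_suffix(".txt").write_text(caption)  # assumption: caption saved next to the image
        taken = time.time() - t0
        elapsed = time.time() - start
        remaining = (elapsed / i) * (len(images) - i)  # ETA from the average time per image
        print(f"[{i} / {len(images)}] Processing: {path}")
        print("✅ Success")
        print(f"⏱ Time taken : {taken:.1f} seconds")
        print(f"⏳ Elapsed time : {elapsed:.1f} seconds")
        print(f"🕒 Estimated remaining: {remaining:.1f} seconds")
        print("-" * 50)
```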
The core logic resides in `libs/ExecuteModel.py` and `libs/LoadModel.py`, where images are passed to a vision-language model along with a text prompt. The model then returns a natural language description of the visual content. You can modify or extend the prompt logic as needed.
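As an illustration of the general flow, here is a minimal, self-contained sketch that captions a single image with a BLIP checkpoint from Hugging Face. The checkpoint name, prompt, and image path are examples only; the repository's own loading and prompt logic in `libs/LoadModel.py` and `libs/ExecuteModel.py` may differ.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Example checkpoint only; the project may load a different vision-language model.
checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("images/cat.jpg").convert("RGB")
prompt = "a photography of"  # optional text prompt that guides the caption

inputs = processor(image, prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```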
Key Python libraries used:
- `transformers`
- `Pillow`
- `colorama`
- `gradio` (optional UI integration)
- `torch`
- `tqdm`
All dependencies are listed in requirements.txt.
This project is designed to work with checkpoint-based vision-language models. Make sure the model path and configuration are correctly set inside the source code if you're loading a specific architecture (e.g., BLIP, MiniGPT, LLaVA).
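For example, if the loader uses a Hugging Face hub identifier or a local checkpoint directory, switching architectures is typically just a matter of changing that path. The variable name and Auto classes below are hypothetical; check `libs/LoadModel.py` for the names actually used.

```python
from transformers import AutoProcessor, AutoModelForVision2Seq

# Hypothetical configuration: a Hugging Face hub id or a local checkpoint directory.
MODEL_PATH = "Salesforce/blip-image-captioning-base"
# MODEL_PATH = "./checkpoints/my-finetuned-vlm"   # a local checkpoint folder also works

processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = AutoModelForVision2Seq.from_pretrained(MODEL_PATH)
```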
This project is released under the MIT License. See LICENSE for details.
Contributions are welcome! Feel free to:
- Submit issues
- Open pull requests
- Suggest new features or enhancements
Created by @adigayung
If you use this tool for your work or research, a star ⭐️ on the repo would be appreciated!