Image Captioning with Phi 3.5 Vision Instruct Model

This repository provides a comprehensive implementation for generating image captions using the microsoft/Phi-3.5-vision-instruct model. The Phi-3.5 model is designed to generate descriptive text based on the content of images, using cutting-edge deep learning techniques. Whether you're building a computer vision application, annotating datasets, or experimenting with AI, this repository will help you easily integrate image-to-text capabilities.

Introduction

The microsoft/Phi-3.5-vision-instruct model is a highly versatile machine learning model designed for image captioning tasks. This repository demonstrates how to use the model to generate meaningful and contextually relevant text captions based on image inputs.

This model leverages advanced instruction-based techniques, making it suitable for vision tasks that require comprehensive understanding of visual inputs. By integrating it into your workflows, you can convert images into descriptive, human-like sentences.

About Phi 3.5 Vision Instruct

Phi 3.5 Vision Instruct is part of the larger "Phi" series developed by Microsoft for multimodal AI tasks. This model is particularly strong in image-to-text tasks, where its deep learning framework can analyze intricate details of an image to produce accurate, concise, and contextually appropriate descriptions.

It excels in tasks such as:

Image Captioning
Image Annotation for Datasets
Visual Question Answering (VQA)
General vision-related NLP tasks

The model has been trained on a large and diverse dataset, making it highly robust across various domains like photography, art, landscapes, and even object-based images.

Installation

To set up the repository on your machine and start generating captions with the Phi 3.5 Vision Instruct model, follow these simple steps:

Clone the Repository:

First, clone the repository to your local machine by running:
```
git clone https://github.com/adigayung/captioning-phi-3.5.git
```
Navigate to the Directory:

Change your working directory to the cloned folder:
```
cd captioning-phi-3.5
```
Grant Execution Permission for the Installation Script:

To ensure the installation script has the correct permissions, run:
```
chmod +x install.sh
```
Run the Installation Script:

Finally, execute the script to install all dependencies and set up the model:
```
./install.sh
```
The installation script will handle the following tasks:
- Install Python 3.11 if needed
- Set up the transformers library and required dependencies
- Install additional libraries such as torch, Pillow, and requests
- Download the Phi 3.5 Vision Instruct model and tokenizer

Usage

Once the installation is complete, you can use the model to generate captions for images. Here's how you can use it:

1. Generate Captions for All Images in a Directory

To generate descriptions for all images in a directory, run:

python run.py --image-dir /path/to/image-directory

Replace /path/to/image-directory with the path of the folder containing the images. The descriptions for each image will be saved in a .txt file with the same name as the image in the same folder.

2. Generate Captions from an Image URL

You can provide a URL of an image and get its description by running:

python run.py --url-image https://example.com/image.jpg

The script will fetch the image from the URL, generate its description, and print the result.

3. Generate Captions for a Specific Image File

To generate a description for a specific image file, run:

python run.py --image-file /path/to/image.jpg

The description will be saved as a .txt file in the same directory as the image.

Model Details

The microsoft/Phi-3.5-vision-instruct model is pre-trained and fine-tuned for generating natural language captions from images. Some of its key features include:

Multimodal learning: The model is trained to handle both visual and textual data, combining image understanding with natural language generation.
Instruction-based architecture: By following given instructions, it can generalize across a wide variety of tasks such as generating captions, answering visual questions, and classifying images.
Pre-training and Fine-tuning: The model has been pre-trained on vast image datasets and fine-tuned specifically for image captioning tasks. This ensures that it can capture even subtle details in images, providing highly accurate captions.

Performance

The Phi 3.5 Vision Instruct model is optimized for real-time captioning, making it suitable for applications like:

Automatic image annotation
Content creation for media
Dataset labeling for machine learning

Its underlying architecture uses deep convolutional neural networks (CNNs) combined with attention-based mechanisms to understand context from the image and generate text that reflects the content accurately.

Requirements

Make sure your system meets the following requirements:

Python 3.11 or higher
8 GB RAM (recommended for handling image processing tasks)
An NVIDIA GPU with CUDA support (optional, but recommended for faster processing)
Libraries: transformers, torch, Pillow, requests

If you are using a CPU-only system, the performance may be slower, but the model will still work.

Acknowledgements

This repository uses the Phi 3.5 Vision Instruct model developed by Microsoft. Special thanks to the open-source community and the Hugging Face platform for providing access to powerful AI models.

For more details about the Phi model series, visit Hugging Face's Phi Model page.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
LICENSE		LICENSE
README.md		README.md
install.sh		install.sh
kill.sh		kill.sh
model-load.py		model-load.py
run.py		run.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Image Captioning with Phi 3.5 Vision Instruct Model

Table of Contents

Introduction

About Phi 3.5 Vision Instruct

Installation

Usage

1. Generate Captions for All Images in a Directory

2. Generate Captions from an Image URL

3. Generate Captions for a Specific Image File

Model Details

Performance

Requirements

Acknowledgements

About

Uh oh!

Releases

Packages

Languages

License

adigayung/captioning-phi-3.5

Folders and files

Latest commit

History

Repository files navigation

Image Captioning with Phi 3.5 Vision Instruct Model

Table of Contents

Introduction

About Phi 3.5 Vision Instruct

Installation

Usage

1. Generate Captions for All Images in a Directory

2. Generate Captions from an Image URL

3. Generate Captions for a Specific Image File

Model Details

Performance

Requirements

Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages