Instagram Crawler

Overview

This Instagram Crawler systematically scrapes posts from popular Instagram accounts, capturing both textual and visual content. Starting from an initial set of profile URLs, it collects each account's posts along with their metadata and downloads the images for further analysis. Because it records post descriptions and content types, it is well suited to building captioned image datasets and training vision models. Videos are identified but not downloaded.

The collected data is stored in structured PostgreSQL tables, so it can feed machine learning pipelines, AI-based captioning systems, and multimodal applications. The crawler also keeps expanding its dataset: it finds the accounts with the most followers and scrapes the accounts they follow, so the repository of Instagram data keeps growing and diversifying.

Account Authentication

Open the cookies/.account_env file and set the email (or username) and password variables to the credentials of your Instagram account.

email="your_email@example.com"
password="your_password"
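
For illustration, a file in this `key="value"` form can be read with a few lines of standard-library Python. This is a minimal sketch, not the crawler's own loader: the function name `load_env_file` is hypothetical, and the project may well use a library for this instead.

```python
from pathlib import Path


def load_env_file(path):
    """Parse KEY="value" lines into a dict (hypothetical helper, not project code)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values
```

Calling `load_env_file("cookies/.account_env")` would then return a dict with the `email` and `password` entries shown above.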

Docker Database Configuration

If running the crawler in Docker, ensure that DB_HOST is set to postgres in the db/.database_env file.
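
As a rough sketch of why this setting matters, the snippet below turns environment-style settings into connection parameters. Only `DB_HOST` comes from this README; the other variable names (`DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`), their defaults, and the function name `build_db_params` are assumptions made for the example.

```python
def build_db_params(env):
    """Map environment-style settings to connection keyword arguments.

    Only DB_HOST is named in the README; every other key and default
    here is an assumption for illustration.
    """
    return {
        # Inside Docker the database is reached via the service name
        # "postgres"; outside Docker it is usually "localhost".
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env.get("DB_NAME", "instagram"),
        "user": env.get("DB_USER", "postgres"),
        "password": env.get("DB_PASSWORD", ""),
    }
```

The returned keys (`host`, `port`, `dbname`, `user`, `password`) are the keyword names accepted by libpq-based drivers such as psycopg2, so the dict could be passed on as `connect(**params)` if the project uses such a driver.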

Features

Scrapes Instagram posts including:
- Post descriptions
- Content type (Image or Video)
- Follower count of the account owner
- Download path of images
Dynamically discovers new accounts based on high-follower connections
Stores all data in a PostgreSQL database
Stored data can be exported to CSV or other formats
Optimized for computer vision dataset collection
Currently only downloads images (Videos are identified but not downloaded)
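
To make the CSV export concrete, here is a minimal sketch using Python's standard `csv` module. It assumes the rows have already been fetched from the database as dictionaries keyed by the `posts` columns listed under Data Structure; the function name `export_posts_to_csv` is hypothetical and not part of the project.

```python
import csv

# Column names taken from the `posts` table described in this README.
POST_FIELDS = ["post_url", "unique_id", "content_type", "download_path", "description"]


def export_posts_to_csv(rows, path):
    """Write an iterable of post dicts to a CSV file and return the row count."""
    count = 0
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not in POST_FIELDS.
        writer = csv.DictWriter(handle, fieldnames=POST_FIELDS, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Captions often contain commas, quotes, and line breaks; `csv.DictWriter` quotes such values automatically, which is the main reason to prefer it over joining strings by hand.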

Data Structure

The crawler collects and stores the following data fields:

`posts` Table

Field Name	Description
`post_url`	URL of the Instagram post
`unique_id`	Unique identifier for the post
`content_type`	`TRUE` for images, `FALSE` for videos
`download_path`	Local path of the downloaded image
`description`	Caption/description of the post

`accounts` Table

Field Name	Description
`id`	Unique identifier for the account
`account_name`	The Instagram username of the account
`account_url`	Profile URL of the account
`follower_number`	Number of followers the user has
`following_scraped`	Whether the following list has been scraped (`TRUE` or `FALSE`)
`posts_scraped`	Whether posts have been scraped (`TRUE` or `FALSE`)
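
The two tables can be sketched as a schema. The example below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a database server. The column names come from the tables above, but the column types, the key constraints, and the helper names `create_schema` and `pending_accounts` are assumptions, not the project's actual definitions.

```python
import sqlite3

# Column names follow the README; types and constraints are assumed.
SCHEMA = """
CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    account_name TEXT NOT NULL,
    account_url TEXT NOT NULL,
    follower_number INTEGER,
    following_scraped BOOLEAN,
    posts_scraped BOOLEAN
);
CREATE TABLE posts (
    post_url TEXT NOT NULL,
    unique_id TEXT PRIMARY KEY,
    content_type BOOLEAN,  -- TRUE for images, FALSE for videos
    download_path TEXT,
    description TEXT
);
"""


def create_schema(conn):
    """Create both tables on an open connection."""
    conn.executescript(SCHEMA)


def pending_accounts(conn):
    """Return names of accounts whose posts are not yet scraped, largest first."""
    cursor = conn.execute(
        "SELECT account_name FROM accounts "
        "WHERE NOT posts_scraped ORDER BY follower_number DESC"
    )
    return [row[0] for row in cursor]
```

The two boolean flags on `accounts` are what let a crawl resume after an interruption: a query like the one in `pending_accounts` finds the work that is still outstanding.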

How It Works

Start with Initial URLs: The crawler begins with a predefined set of Instagram accounts.
Scrape Account Data: Collects metadata such as follower count and account details.
Scrape Posts: Gathers post-related metadata, including descriptions and content type.
Download Images: Saves images to a local directory.
Expand Network: Identifies the most followed user and scrapes the accounts they follow.
Repeat Process: The cycle continues, ensuring data growth.
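
The cycle above can be sketched as a small in-memory loop. This is an illustration under stated assumptions, not the crawler's actual code: the three `fetch_*` callables are hypothetical stand-ins for the real scraping functions, state is kept in a dict rather than PostgreSQL, and picking the most-followed account whose following list has not yet been scraped is one plausible reading of the "Expand Network" step.

```python
def crawl(seed_accounts, fetch_followers, fetch_posts, fetch_following, max_rounds=3):
    """Run the scrape/expand cycle with injected (hypothetical) fetchers."""
    accounts = {}  # account name -> follower count and progress flags
    posts = []

    def add_account(name):
        # Scrape Account Data: record follower count once per account.
        if name not in accounts:
            accounts[name] = {
                "follower_number": fetch_followers(name),
                "following_scraped": False,
                "posts_scraped": False,
            }

    # Start with Initial URLs.
    for name in seed_accounts:
        add_account(name)

    for _ in range(max_rounds):
        # Scrape Posts / Download Images for accounts not yet processed.
        for name, state in accounts.items():
            if not state["posts_scraped"]:
                posts.extend(fetch_posts(name))
                state["posts_scraped"] = True

        # Expand Network: choose the most-followed unexpanded account.
        candidates = [n for n, s in accounts.items() if not s["following_scraped"]]
        if not candidates:
            break
        top = max(candidates, key=lambda n: accounts[n]["follower_number"])
        for followed in fetch_following(top):
            add_account(followed)
        accounts[top]["following_scraped"] = True

    return accounts, posts
```

The `max_rounds` limit is an addition for the sketch so that the loop terminates. Accounts discovered in the final round are recorded but their posts are not scraped until a further round runs.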

Installation

Prerequisites

Ensure you have the following installed:

Python 3.8+
PostgreSQL
Docker (if using the Docker method)
Required dependencies from requirements.txt

Manual Setup (Using Virtual Environment)

# Clone the repository
git clone https://github.com/hikmatazimzade/instagram-crawler.git
cd instagram-crawler

# Create a virtual environment
python -m venv venv

# Activate the virtual environment (Windows)
venv\Scripts\activate

# Activate the virtual environment (Mac/Linux)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Docker Setup

Here is how you can run the crawler using Docker.

# Clone the repository
git clone https://github.com/hikmatazimzade/instagram-crawler.git
cd instagram-crawler

# Build the Docker container
docker build -t instagram-crawler .

# Run the crawler container (named so it can be restarted with docker start)
docker run -d --name instagram-crawler instagram-crawler

Run the crawler manually with:

# On Windows
py -m crawler.main

# On Mac/Linux
python3 -m crawler.main

Or if using Docker:

docker start instagram-crawler

Potential Use Cases

Training computer vision models
Social media trend analysis
Large-scale dataset collection
NLP applications on Instagram captions

Notes

This project is for educational and research purposes only.
Scraping Instagram without permission may violate their Terms of Service.
Ensure compliance with local laws and regulations when using this tool.

License

This project is licensed under the MIT License.
