📊 Data Extraction & NLP Analysis

Python Pandas NLTK BeautifulSoup

Hey there! Welcome to my project on Data Extraction and NLP Analysis. I built this to automate the process of analyzing financial articles. The goal was to take a list of URLs, scrape the text from each one, and then perform a detailed textual analysis to calculate sentiment, readability, and other linguistic scores.

This repository contains the complete pipeline I built, from fetching the data to generating the final structured output in an Excel file.


🤔 The Challenge

The main challenge was to create a robust system that could:

  1. Reliably extract only the main article text from different web pages, ignoring headers, footers, and ads.
  2. Clean the extracted text by removing stop words specific to financial and general contexts.
  3. Perform complex NLP calculations to derive meaningful variables like polarity, subjectivity, and readability.
  4. Handle potential errors (like dead links or server issues) without crashing.
  5. Present the final analysis in a clean, structured Excel format as required.

✨ Core Features & My Solution Approach

Here’s a breakdown of how I tackled the problem:

  1. Resource Loading:

    • Before starting, I pre-loaded all the necessary resources: a custom set of stop words from the StopWords directory and the positive/negative word lists from the MasterDictionary.
    • To improve accuracy, I cleaned the master dictionaries by removing any words that also appear in the stop words list (see the first sketch after this list).
  2. Data Extraction:

    • I used the Pandas library to read the Input.xlsx file.
    • Then, I looped through each URL, using the Requests library to fetch the HTML content. I used BeautifulSoup to parse the HTML and extract just the article title (from <h1> tags) and the main body text (from <article> or relevant <div> tags).
    • I saved each successfully scraped article as a .txt file in the Scraped_Articles/ directory (a sketch of this loop, including the error handling mentioned above, follows this list).
  3. NLP Analysis Pipeline:

    • For each article's text, I built a function that calculates all the required variables. I used the NLTK library for tokenization (breaking text into words and sentences) and Textstat for syllable counting (see the metrics sketch after this list).
    • The analysis includes:
      • Sentiment Analysis: Positive, Negative, Polarity, and Subjectivity Scores.
      • Readability Analysis: the Gunning Fog Index.
      • Other Linguistic Metrics: Word Count, Complex Word Count, Average Sentence Length, Personal Pronouns, and more.
  4. Structured Output Generation:

    • Finally, I collected all the calculated scores for each article and merged them back with the original input data (see the final sketch after this list).
    • The complete result is saved to output.xlsx, matching the required output structure.
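
To make this concrete, here's a minimal sketch of the resource-loading step. The directory and file names match the project layout shown later; the load_word_set helper, the ISO-8859-1 encoding, and the "WORD | comment" line format are assumptions about how the word lists are stored.

    from pathlib import Path

    def load_word_set(path: Path) -> set[str]:
        """Read one word per line; skip blanks, ';' comment lines, and '|' annotations."""
        words = set()
        for line in path.read_text(encoding="ISO-8859-1").splitlines():
            word = line.split("|")[0].strip().lower()
            if word and not word.startswith(";"):
                words.add(word)
        return words

    # Merge every list under StopWords/ into one combined stop word set.
    stop_words = set()
    for file in Path("StopWords").glob("*.txt"):
        stop_words |= load_word_set(file)

    # Load the master dictionaries, then drop any entries that are also stop words.
    positive_words = load_word_set(Path("MasterDictionary/positive-words.txt")) - stop_words
    negative_words = load_word_set(Path("MasterDictionary/negative-words.txt")) - stop_words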
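
Next, a sketch of the extraction loop. The Input.xlsx column names (URL_ID, URL), the User-Agent header, and the fallback <div> class are assumptions; the try/except is what keeps dead links and server errors from crashing the run.

    from pathlib import Path

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    Path("Scraped_Articles").mkdir(exist_ok=True)
    df = pd.read_excel("Input.xlsx")

    for _, row in df.iterrows():
        try:
            resp = requests.get(row["URL"], headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
            resp.raise_for_status()
        except requests.RequestException as exc:
            print(f"Skipping {row['URL_ID']}: {exc}")  # dead link or server error; move on
            continue

        soup = BeautifulSoup(resp.text, "html.parser")
        title = soup.find("h1")
        body = soup.find("article") or soup.find("div", class_="td-post-content")

        text = (title.get_text(strip=True) + "\n") if title else ""
        text += body.get_text(separator=" ", strip=True) if body else ""
        Path(f"Scraped_Articles/{row['URL_ID']}.txt").write_text(text, encoding="utf-8")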
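
The metric function itself, sketched with the standard formulas this kind of assignment specifies: Polarity = (pos - neg) / (pos + neg + 0.000001), Subjectivity = (pos + neg) / (word count + 0.000001), and Fog Index = 0.4 × (average sentence length + percentage of complex words). The exact output column names are assumptions.

    import re

    import textstat
    from nltk.tokenize import sent_tokenize, word_tokenize

    def analyze(text, positive_words, negative_words, stop_words):
        sentences = sent_tokenize(text)
        tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
        words = [t for t in tokens if t not in stop_words]

        pos = sum(1 for w in words if w in positive_words)
        neg = sum(1 for w in words if w in negative_words)
        complex_count = sum(1 for w in words if textstat.syllable_count(w) > 2)

        avg_sentence_len = len(words) / max(len(sentences), 1)
        pct_complex = complex_count / max(len(words), 1)

        # Count I/we/my/ours/us, but not the country abbreviation "US".
        pronouns = re.findall(r"\b(I|we|my|ours|us)\b", text, flags=re.IGNORECASE)

        return {
            "POSITIVE SCORE": pos,
            "NEGATIVE SCORE": neg,
            "POLARITY SCORE": (pos - neg) / (pos + neg + 0.000001),
            "SUBJECTIVITY SCORE": (pos + neg) / (len(words) + 0.000001),
            "AVG SENTENCE LENGTH": avg_sentence_len,
            "PERCENTAGE OF COMPLEX WORDS": pct_complex,
            "FOG INDEX": 0.4 * (avg_sentence_len + pct_complex),
            "COMPLEX WORD COUNT": complex_count,
            "WORD COUNT": len(words),
            "PERSONAL PRONOUNS": sum(1 for p in pronouns if p != "US"),
        }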
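
And the final merge-and-save step, reusing analyze() and the word sets from the sketches above (the URL_ID join key is the same assumption as before):

    from pathlib import Path

    import pandas as pd

    input_df = pd.read_excel("Input.xlsx")

    all_scores = []
    for _, row in input_df.iterrows():
        article = Path(f"Scraped_Articles/{row['URL_ID']}.txt")
        if not article.exists():
            continue  # skipped during scraping (dead link), so no scores for it
        text = article.read_text(encoding="utf-8")
        all_scores.append({"URL_ID": row["URL_ID"],
                           **analyze(text, positive_words, negative_words, stop_words)})

    result = input_df.merge(pd.DataFrame(all_scores), on="URL_ID", how="left")
    result.to_excel("output.xlsx", index=False)  # the .xlsx write goes through openpyxl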

🛠️ Tech Stack

  • Core Language: Python
  • Data Manipulation: Pandas
  • Web Scraping: Requests, BeautifulSoup
  • NLP & Text Analysis: NLTK, Textstat
  • Excel Handling: openpyxl

⚙️ Setup and Installation

  1. Clone the repository:

    # Replace with your repository URL
    git clone https://github.com/your-username/Data_Extraction-NLP_Analysis_Project.git
    cd Data_Extraction-NLP_Analysis_Project
  2. Create and activate a virtual environment:

    # It is recommended to use Python 3.9 or higher for this project
    python -m venv venv
    .\venv\Scripts\activate # On Windows
    # source venv/bin/activate # On macOS/Linux
  3. Install the required packages (a sample requirements.txt is sketched at the end of this list):

    pip install -r requirements.txt
  4. One-Time NLTK Setup: The first time you run this, you'll need to download NLTK's tokenizer models. Open a Python interpreter from your activated environment and run the following:

    import nltk
    nltk.download('punkt')
    nltk.download('punkt_tab')  # also required on newer NLTK releases (3.9+)
    exit()
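
For reference, a requirements.txt covering the tech stack above might look like this (exact versions not shown; pin them if you need reproducible installs):

    pandas
    requests
    beautifulsoup4
    nltk
    textstat
    openpyxl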

🚀 How to Run the Project

Once you've completed the setup, running the entire analysis pipeline is just a single command.

  1. Make sure your Input.xlsx file is in the root directory.
  2. Run the main script from your terminal:
    python main.py

The script will scrape each article, run the analysis, and generate the Scraped_Articles/ directory along with the final output.xlsx file when it's done.


🔬 Project Structure

I've organized the project with a clear separation for data, dictionaries, and code, which makes it easy to manage.

Blackcoffer_Assignment/
│
├── main.py                     # Main script for the entire text analysis pipeline
├── Input.xlsx                  # Input file containing URLs to be analyzed
├── output.xlsx                 # The final output file with calculated scores
├── requirements.txt            # List of Python dependencies
├── instructions.md             # Project instructions and details
│
├── Scraped_Articles/
│   └── (This directory gets created to store the extracted article text files)
│
├── MasterDictionary/
│   ├── positive-words.txt
│   └── negative-words.txt
│
└── StopWords/
    ├── StopWords_Auditor.txt
    └── ... (and other stop word files)

