Hey there! Welcome to my project on Data Extraction and NLP Analysis. I built this to automate the process of analyzing financial articles. The goal was to take a list of URLs, scrape the text from each one, and then perform a detailed textual analysis to calculate various scores like sentiment, readability, and other linguistic metrics.
This repository contains the complete pipeline I built, from fetching the data to generating the final structured output in an Excel file.
The main challenge was to create a robust system that could:
- Reliably extract only the main article text from different web pages, ignoring headers, footers, and ads.
- Clean the extracted text by removing stop words specific to financial and general contexts.
- Perform complex NLP calculations to derive meaningful variables like polarity, subjectivity, and readability.
- Handle potential errors (like dead links or server issues) without crashing.
- Present the final analysis in a clean, structured Excel format as required.
Here’s a breakdown of how I tackled the problem:
- Resource Loading:
  - Before starting, I pre-loaded all the necessary resources. This includes a custom set of stop words from the `StopWords` directory and the positive/negative word lists from the `MasterDictionary` directory.
  - To improve accuracy, I made sure to clean the master dictionaries by removing any words that were also present in the stop words list (see the sketch below).
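
  In essence, the loading step looks something like this (a simplified sketch; the exact filenames come from the `StopWords/` and `MasterDictionary/` folders, and the encoding and `|` handling are assumptions about how those files are formatted):

  ```python
  from pathlib import Path

  def load_word_set(path):
      """Read a word-list file into a lowercase set."""
      words = set()
      # The dictionary files aren't guaranteed to be UTF-8, so latin-1 is a safe fallback.
      for line in Path(path).read_text(encoding="latin-1").splitlines():
          # Some stop-word files annotate entries after a '|' separator.
          word = line.split("|")[0].strip().lower()
          if word:
              words.add(word)
      return words

  # Collect every stop word from all files in StopWords/
  stop_words = set()
  for file in Path("StopWords").glob("*.txt"):
      stop_words |= load_word_set(file)

  # Load the master dictionaries and drop any entries that are also stop words
  positive_words = load_word_set("MasterDictionary/positive-words.txt") - stop_words
  negative_words = load_word_set("MasterDictionary/negative-words.txt") - stop_words
  ```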
- Data Extraction:
  - I used the Pandas library to read the `Input.xlsx` file.
  - Then, I looped through each URL, using the Requests library to fetch the HTML content. I used BeautifulSoup to parse the HTML and extract just the article title (from the `<h1>` tag) and the main body text (from `<article>` or relevant `<div>` tags).
  - Each successfully scraped article is saved as a `.txt` file in the `Scraped_Articles/` directory. A simplified version of the scraping logic is sketched below.
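
  Roughly, the per-URL scraping looks like this (a sketch, not the exact selectors used in `main.py`: real pages vary, and the fallback container logic here is illustrative):

  ```python
  import requests
  from bs4 import BeautifulSoup

  def scrape_article(url):
      """Return (title, body_text), or None if the page can't be fetched."""
      try:
          response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
          response.raise_for_status()
      except requests.RequestException as err:
          # Dead links or server errors are logged and skipped, not fatal.
          print(f"Skipping {url}: {err}")
          return None

      soup = BeautifulSoup(response.text, "html.parser")
      title_tag = soup.find("h1")
      title = title_tag.get_text(strip=True) if title_tag else ""

      # Prefer an <article> tag; otherwise fall back to all paragraphs on the page.
      container = soup.find("article") or soup
      body = "\n".join(p.get_text(strip=True) for p in container.find_all("p"))
      return title, body
  ```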
- NLP Analysis Pipeline:
  - For each article's text, I built a function that calculates all the required variables. I used the NLTK library for tokenization (breaking text into words and sentences) and Textstat for syllable counting. A condensed sketch of the scoring function follows this list.
  - The analysis includes:
    - Sentiment Analysis: Calculating Positive, Negative, Polarity, and Subjectivity Scores.
    - Readability Analysis: Calculating the Gunning Fog Index.
    - Other Linguistic Metrics: Word Count, Complex Word Count, Average Sentence Length, Personal Pronouns, and more.
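
  Here's that condensed sketch. It assumes the usual definitions for this kind of analysis (Polarity = (Positive − Negative) / (Positive + Negative + 0.000001), Subjectivity = (Positive + Negative) / (cleaned word count + 0.000001), Fog Index = 0.4 × (average sentence length + proportion of complex words)); the exact formulas and output column names in `main.py` may differ slightly:

  ```python
  import nltk
  import textstat

  def analyze(text, positive_words, negative_words, stop_words):
      sentences = nltk.sent_tokenize(text)
      words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
      cleaned = [w for w in words if w not in stop_words]

      # Sentiment scores from the cleaned master dictionaries
      positive_score = sum(1 for w in cleaned if w in positive_words)
      negative_score = sum(1 for w in cleaned if w in negative_words)
      polarity = (positive_score - negative_score) / ((positive_score + negative_score) + 1e-6)
      subjectivity = (positive_score + negative_score) / (len(cleaned) + 1e-6)

      # Readability: words with more than two syllables count as "complex"
      complex_count = sum(1 for w in words if textstat.syllable_count(w) > 2)
      avg_sentence_length = len(words) / max(len(sentences), 1)
      pct_complex = complex_count / max(len(words), 1)
      fog_index = 0.4 * (avg_sentence_length + pct_complex)

      return {
          "POSITIVE SCORE": positive_score,
          "NEGATIVE SCORE": negative_score,
          "POLARITY SCORE": polarity,
          "SUBJECTIVITY SCORE": subjectivity,
          "AVG SENTENCE LENGTH": avg_sentence_length,
          "PERCENTAGE OF COMPLEX WORDS": pct_complex,
          "FOG INDEX": fog_index,
          "COMPLEX WORD COUNT": complex_count,
          "WORD COUNT": len(cleaned),
      }
  ```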
- Structured Output Generation:
  - Finally, I collected all the calculated scores for each article and merged them back with the original input data.
  - The final, comprehensive result is saved to `output.xlsx`, matching the required data structure. The merge-and-save step is sketched below.
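
  In essence, this step boils down to the following (a sketch; the hypothetical `results` argument is the list of per-article score dictionaries produced by the analysis function):

  ```python
  import pandas as pd

  def save_output(results, input_path="Input.xlsx", output_path="output.xlsx"):
      """Merge per-article scores back onto the input rows and write the Excel output."""
      df_input = pd.read_excel(input_path)
      df_scores = pd.DataFrame(results)  # one dict of scores per input row, in order
      df_output = pd.concat([df_input.reset_index(drop=True), df_scores], axis=1)
      df_output.to_excel(output_path, index=False)  # pandas writes .xlsx via openpyxl
  ```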
Here's the tech stack behind the project:

- Core Language: Python
- Data Manipulation: Pandas
- Web Scraping: Requests, BeautifulSoup
- NLP & Text Analysis: NLTK, Textstat
- Excel Handling: openpyxl
To get it running locally:

- Clone the repository:

  ```bash
  # Replace with your repository URL
  git clone https://github.com/your-username/Data_Extraction-NLP_Analysis_Project.git
  cd Data_Extraction-NLP_Analysis_Project
  ```
- Create and activate a virtual environment:

  ```bash
  # It is recommended to use Python 3.9 or higher for this project
  python -m venv venv
  .\venv\Scripts\activate       # On Windows
  # source venv/bin/activate    # On macOS/Linux
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- One-Time NLTK Setup: The first time you run this, you'll need to download a tokenizer model from NLTK. Open a Python interpreter from your activated environment and run the following:

  ```python
  import nltk
  nltk.download('punkt')
  exit()
  ```
Once you've completed the setup, running the entire analysis pipeline is just a single command.
- Make sure your `Input.xlsx` file is in the root directory.
- Run the main script from your terminal:

  ```bash
  python main.py
  ```
The script will scrape the articles, perform the analysis, and produce the `Scraped_Articles/` directory along with the final `output.xlsx` file once it's done.
I've organized the project with a clear separation for data, dictionaries, and code, which makes it easy to manage.
<details>
<summary>Click to view the project layout</summary>

```
Blackcoffer_Assignment/
│
├── main.py               # Main script for the entire text analysis pipeline
├── Input.xlsx            # Input file containing URLs to be analyzed
├── output.xlsx           # The final output file with calculated scores
├── requirements.txt      # List of Python dependencies
├── instructions.md       # Project instructions and details
│
├── Scraped_Articles/
│   └── (This directory gets created to store the extracted article text files)
│
├── MasterDictionary/
│   ├── positive-words.txt
│   └── negative-words.txt
│
└── StopWords/
    ├── StopWords_Auditor.txt
    └── ... (and other stop word files)
```

</details>