Hey there! Welcome to my project on Data Extraction and NLP Analysis. I built this to automate the process of analyzing financial articles. The goal was to take a list of URLs, scrape the text from each one, and then perform a detailed textual analysis to calculate various scores like sentiment, readability, and other linguistic metrics.
This repository contains the complete pipeline I built, from fetching the data to generating the final structured output in an Excel file.
The main challenge was to create a robust system that could:
- Reliably extract only the main article text from different web pages, ignoring headers, footers, and ads.
- Clean the extracted text by removing stop words specific to financial and general contexts.
- Perform complex NLP calculations to derive meaningful variables like polarity, subjectivity, and readability.
- Handle potential errors (like dead links or server issues) without crashing.
- Present the final analysis in a clean, structured Excel format as required.
Here’s a breakdown of how I tackled the problem:
- Resource Loading:
  - Before starting, I pre-loaded all the necessary resources. This includes a custom set of stop words from the `StopWords` directory and the positive/negative word lists from the `MasterDictionary` directory.
  - To improve accuracy, I made sure to clean the master dictionaries by removing any words that were also present in the stop words list (see the sketch below).
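
  In essence, the loading step looks something like this (a simplified sketch; the exact filenames come from the `StopWords/` and `MasterDictionary/` folders, and the encoding and `|` handling are assumptions about how those files are formatted):

  ```python
  from pathlib import Path

  def load_word_set(path):
      """Read a word-list file into a lowercase set."""
      words = set()
      # The dictionary files aren't guaranteed to be UTF-8, so latin-1 is a safe fallback.
      for line in Path(path).read_text(encoding="latin-1").splitlines():
          # Some stop-word files annotate entries after a '|' separator.
          word = line.split("|")[0].strip().lower()
          if word:
              words.add(word)
      return words

  # Collect every stop word from all files in StopWords/
  stop_words = set()
  for file in Path("StopWords").glob("*.txt"):
      stop_words |= load_word_set(file)

  # Load the master dictionaries and drop any entries that are also stop words
  positive_words = load_word_set("MasterDictionary/positive-words.txt") - stop_words
  negative_words = load_word_set("MasterDictionary/negative-words.txt") - stop_words
  ```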
- Data Extraction:
  - I used the Pandas library to read the `Input.xlsx` file.
  - Then, I looped through each URL, using the Requests library to fetch the HTML content. I used BeautifulSoup to parse the HTML and extract just the article title (from the `<h1>` tag) and the main body text (from `<article>` or relevant `<div>` tags).
  - Each successfully scraped article is saved as a `.txt` file in the `Scraped_Articles/` directory. A simplified version of the scraping logic is sketched below.
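
  Roughly, the per-URL scraping looks like this (a sketch, not the exact selectors used in `main.py`: real pages vary, and the fallback container logic here is illustrative):

  ```python
  import requests
  from bs4 import BeautifulSoup

  def scrape_article(url):
      """Return (title, body_text), or None if the page can't be fetched."""
      try:
          response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
          response.raise_for_status()
      except requests.RequestException as err:
          # Dead links or server errors are logged and skipped, not fatal.
          print(f"Skipping {url}: {err}")
          return None

      soup = BeautifulSoup(response.text, "html.parser")
      title_tag = soup.find("h1")
      title = title_tag.get_text(strip=True) if title_tag else ""

      # Prefer an <article> tag; otherwise fall back to all paragraphs on the page.
      container = soup.find("article") or soup
      body = "\n".join(p.get_text(strip=True) for p in container.find_all("p"))
      return title, body
  ```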
- NLP Analysis Pipeline:
  - For each article's text, I built a function that calculates all the required variables. I used the NLTK library for tokenization (breaking text into words and sentences) and Textstat for syllable counting. A condensed sketch of the scoring function follows this list.
  - The analysis includes:
    - Sentiment Analysis: Calculating Positive, Negative, Polarity, and Subjectivity Scores.
    - Readability Analysis: Calculating the Gunning Fog Index.
    - Other Linguistic Metrics: Word Count, Complex Word Count, Average Sentence Length, Personal Pronouns, and more.
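
  Here's that condensed sketch. It assumes the usual definitions for this kind of analysis (Polarity = (Positive − Negative) / (Positive + Negative + 0.000001), Subjectivity = (Positive + Negative) / (cleaned word count + 0.000001), Fog Index = 0.4 × (average sentence length + proportion of complex words)); the exact formulas and output column names in `main.py` may differ slightly:

  ```python
  import nltk
  import textstat

  def analyze(text, positive_words, negative_words, stop_words):
      sentences = nltk.sent_tokenize(text)
      words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
      cleaned = [w for w in words if w not in stop_words]

      # Sentiment scores from the cleaned master dictionaries
      positive_score = sum(1 for w in cleaned if w in positive_words)
      negative_score = sum(1 for w in cleaned if w in negative_words)
      polarity = (positive_score - negative_score) / ((positive_score + negative_score) + 1e-6)
      subjectivity = (positive_score + negative_score) / (len(cleaned) + 1e-6)

      # Readability: words with more than two syllables count as "complex"
      complex_count = sum(1 for w in words if textstat.syllable_count(w) > 2)
      avg_sentence_length = len(words) / max(len(sentences), 1)
      pct_complex = complex_count / max(len(words), 1)
      fog_index = 0.4 * (avg_sentence_length + pct_complex)

      return {
          "POSITIVE SCORE": positive_score,
          "NEGATIVE SCORE": negative_score,
          "POLARITY SCORE": polarity,
          "SUBJECTIVITY SCORE": subjectivity,
          "AVG SENTENCE LENGTH": avg_sentence_length,
          "PERCENTAGE OF COMPLEX WORDS": pct_complex,
          "FOG INDEX": fog_index,
          "COMPLEX WORD COUNT": complex_count,
          "WORD COUNT": len(cleaned),
      }
  ```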
- Structured Output Generation:
  - Finally, I collected all the calculated scores for each article and merged them back with the original input data.
  - The final, comprehensive result is saved to `output.xlsx`, matching the required data structure. The merge-and-save step is sketched below.
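
  In essence, this step boils down to the following (a sketch; the hypothetical `results` argument is the list of per-article score dictionaries produced by the analysis function):

  ```python
  import pandas as pd

  def save_output(results, input_path="Input.xlsx", output_path="output.xlsx"):
      """Merge per-article scores back onto the input rows and write the Excel output."""
      df_input = pd.read_excel(input_path)
      df_scores = pd.DataFrame(results)  # one dict of scores per input row, in order
      df_output = pd.concat([df_input.reset_index(drop=True), df_scores], axis=1)
      df_output.to_excel(output_path, index=False)  # pandas writes .xlsx via openpyxl
  ```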
Here's the tech stack behind the project:

- Core Language: Python
- Data Manipulation: Pandas
- Web Scraping: Requests, BeautifulSoup
- NLP & Text Analysis: NLTK, Textstat
- Excel Handling: openpyxl
To get it running locally:

- Clone the repository:

  ```bash
  # Replace with your repository URL
  git clone https://github.com/your-username/Data_Extraction-NLP_Analysis_Project.git
  cd Data_Extraction-NLP_Analysis_Project
  ```
- Create and activate a virtual environment:

  ```bash
  # It is recommended to use Python 3.9 or higher for this project
  python -m venv venv
  .\venv\Scripts\activate       # On Windows
  # source venv/bin/activate    # On macOS/Linux
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- One-Time NLTK Setup: The first time you run this, you'll need to download a tokenizer model from NLTK. Open a Python interpreter from your activated environment and run the following:

  ```python
  import nltk
  nltk.download('punkt')
  exit()
  ```
Once you've completed the setup, running the entire analysis pipeline is just a single command.
- Make sure your `Input.xlsx` file is in the root directory.
- Run the main script from your terminal:

  ```bash
  python main.py
  ```
The script will scrape the articles, perform the analysis, and produce the `Scraped_Articles/` directory along with the final `output.xlsx` file once it's done.
I've organized the project with a clear separation for data, dictionaries, and code, which makes it easy to manage.
<details>
<summary>Click to view the project layout</summary>

```
Blackcoffer_Assignment/
│
├── main.py               # Main script for the entire text analysis pipeline
├── Input.xlsx            # Input file containing URLs to be analyzed
├── output.xlsx           # The final output file with calculated scores
├── requirements.txt      # List of Python dependencies
├── instructions.md       # Project instructions and details
│
├── Scraped_Articles/
│   └── (This directory gets created to store the extracted article text files)
│
├── MasterDictionary/
│   ├── positive-words.txt
│   └── negative-words.txt
│
└── StopWords/
    ├── StopWords_Auditor.txt
    └── ... (and other stop word files)
```

</details>