In this project, we aim to conduct a sentiment analysis of Twitter data. Our primary objective is to compare traditional statistical methods against modern AI/ML techniques for classifying tweet sentiment. We plan to analyze the tweets stored in the PostgreSQL Twitter database using several approaches and tools.
- Owen Andreasen
  - GitHub/GitLab Username: andreaseno
  - Role: AI/ML sentiment analysis using Python and libraries like TensorFlow, Keras, scikit-learn, SciPy, and PyTorch.
- Justin Park
  - GitHub/GitLab Username: Chang-park
  - Role: Traditional statistical analysis, primarily using SAS and SAS Text Miner.
This is a sentiment analysis project with a unique angle of comparing traditional statistics and AI/ML models such as BERT transformers and XGBoost. Our analysis will include metrics like accuracy, recall, and precision and will consider the detection of sarcasm and complex sentiments.
The Twitter data used in this project is sourced from the PostgreSQL Twitter database. The dataset contains tweets collected over a specific timeframe and stored in a structured format. Prior to analysis, the data will be preprocessed to remove noise and irrelevant information. Additionally, we reference the external datasets and resources listed in the Datasets and Other Resources sections below.
We will employ a combination of traditional statistical analysis and AI/ML techniques for sentiment analysis:
- Traditional Statistical Analysis: Justin will utilize SAS and SAS Text Miner to perform statistical analysis on the Twitter data. This approach may involve lexicon-based methods and other statistical techniques.
- AI/ML Sentiment Analysis: Owen will implement AI/ML models using Python and libraries such as TensorFlow, Keras, and scikit-learn. This approach may include using BERT transformers and feeding their embeddings into an XGBoost model.
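As a rough illustration of the preprocessing and lexicon-based ideas mentioned above, here is a minimal Python sketch. It is not the SAS workflow used in the project. The tiny `LEXICON`, the cleaning rules, and the function names `clean_tweet`, `lexicon_score`, and `classify` are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical toy lexicon for illustration only; a real analysis would
# rely on a published sentiment lexicon or a tool such as SAS Text Miner.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z']+")


def clean_tweet(text):
    """Strip URLs and @mentions, then lowercase the remaining text."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return text.lower()


def lexicon_score(text):
    """Sum the lexicon weights of every token in the cleaned tweet."""
    tokens = TOKEN_RE.findall(clean_tweet(text))
    return sum(LEXICON.get(token, 0) for token in tokens)


def classify(text, threshold=0):
    """Map a lexicon score to a positive, negative, or neutral label."""
    score = lexicon_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

A scorer like this labels "great, another delay" as positive because it only sees the word "great", which is one reason sarcasm detection is part of the comparison.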
We will evaluate the performance of our sentiment analysis methods using the following metrics:
- Accuracy: The proportion of correctly classified tweets.
- Recall: The proportion of actually positive tweets that the model correctly identifies as positive.
- Precision: The proportion of correctly classified positive tweets out of all predicted positive tweets.
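The three metrics above can be computed directly from true and predicted labels. The sketch below treats one class as "positive" and everything else as "not positive". The function name `binary_metrics` and the label strings are assumptions for this example, and in practice a library such as scikit-learn could be used instead.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, and recall for one target class."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually another class.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: actually positive but predicted as another class.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

For a multiclass model (positive, neutral, negative), the same function can be called once per class and the results averaged.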
Upon completion of the sentiment analysis, we will summarize the findings and discuss the implications of the results. We will compare the performance of traditional statistical methods with AI/ML techniques, highlighting any differences in accuracy, recall, and precision. Additionally, we will explore the detection of sarcasm and complex sentiments in the Twitter data.
Datasets
- Twitter Sentiment Dataset (Twitter_Data) (Kaggle dataset)
  - This dataset was a possible training dataset for multiclass (positive, neutral, negative) models, but the data looked low quality
- Sentiment140 Dataset (Kaggle dataset)
  - This dataset only uses positive and negative classes (no neutral), but the data looked high quality and it has a large number of rows
- Tweet Sentiment Extraction (Kaggle dataset)
  - This dataset uses positive, negative, and neutral classes, and the data looks fairly high quality
- MultiClassLabeledCustomTwitterSentiments.csv
  - Labeled "Gold Standard" dataset created for analysis in SentimentAnalysis.ipynb
- BinaryLabeledCustomTwitterSentiments.csv
  - Binary version of MultiClassLabeledCustomTwitterSentiments in case I wanted to use it instead
- CustomTwitterSentiments.csv
  - MultiClassLabeledCustomTwitterSentiments before I hand-labeled the data
- MultiClassLabeledCombined.csv
  - Experimental dataset created by combining MultiClassLabeledCustomTwitterSentiments with data from the Tweet Sentiment Extraction dataset
Other Resources
- Twitter Sentiment Extraction Analysis (Kaggle notebook)
- Twitter Sentiment Analysis with BERT vs RoBERTa (Kaggle notebook)
- Sentiment Analysis of Unlabelled Text Using Word2Vec Model (Stack Overflow discussion)
- Emojis in Social Media Sentiment Analysis (article)
- Usage of SpaCy as a Text Meta Feature generator (Kaggle notebook)
Files
- SentimentAnalysis.ipynb: Notebook that performs the sentiment analysis
- TestData.ipynb: Notebook that creates the custom dataset from PostgreSQL
- CountRows.py: Helper script to count the rows in a CSV file, since some entries span multiple lines
- AppendRows.py: Helper script to combine CSV files. Used to create MultiClassLabeledCombined.csv
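To show why a plain line count is misleading for these files, here is a minimal sketch of the two helper tasks using Python's standard `csv` module. It is not the actual contents of CountRows.py or AppendRows.py. The function names `count_records`, `count_lines`, and `append_csvs` are hypothetical, and the sketch assumes UTF-8 files that each start with the same header row.

```python
import csv


def count_records(path, has_header=True):
    """Count CSV records, treating quoted multi-line fields as one record."""
    with open(path, newline="", encoding="utf-8") as f:
        n = sum(1 for _ in csv.reader(f))
    return max(n - 1, 0) if has_header else n


def count_lines(path):
    """Count physical lines, which over-counts multi-line records."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def append_csvs(paths, out_path):
    """Concatenate CSV files that share a header, writing the header once."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        header_written = False
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)
```

Opening the files with `newline=""` lets the `csv` module handle line breaks inside quoted tweet text, so a tweet containing a newline is still read and written as a single record.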