The goal of this project is to practice and improve my data wrangling skills, which include gathering, assessment, transformation, cleaning, visualisation and analysis. These activities are carried out on the Twitter archive of the @RateDogs account, which rates people's dogs and adds humorous comments about them.
This README summarises how I approached the data wrangling for this project and shows some of the visualisations that produced my insights.
For this project, I worked with three datasets provided by Udacity. Each contained different information needed for the analysis and reporting.
The first dataset was a CSV file named twitter_archive_enhanced. It contained information about 2356 tweets and was downloaded manually.
The second dataset was a TSV file named image_prediction. Udacity provided a URL to the file on its servers, which I used to download it programmatically. It contained 2075 predictions classifying dogs by breed, based on the pictures attached to the tweets.
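As a rough illustration of the programmatic download, the sketch below streams a file from a URL to disk using only the Python standard library. The function name `download_file` and the destination path are assumptions made for this example; the notebook may well have used a different library, and the actual Udacity URL is not reproduced here.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the local file `dest` (hypothetical helper)."""
    dest_path = Path(dest)
    # Create the target folder if it does not exist yet
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in chunks so the whole file is never held in memory
        shutil.copyfileobj(response, out)
    return dest_path
```

The function would be called with the URL supplied by Udacity and a local path such as `data/image_prediction.tsv` (an assumed location).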
The third dataset, which I downloaded manually as a TXT file, contained tweet information in JSON format. It held extra details about 2357 tweets, such as the retweet count and favorite count each one received.
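A minimal sketch of how such a file could be loaded is shown below, assuming it holds one JSON object per line and that the keys are `id`, `retweet_count` and `favorite_count`. Both the layout and the key names are assumptions for illustration; the function name `read_tweet_json` is hypothetical.

```python
import json

import pandas as pd


def read_tweet_json(lines) -> pd.DataFrame:
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Keep only the fields needed for the analysis (assumed key names)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(records, columns=["tweet_id", "retweet_count", "favorite_count"])


# Usage with a file on disk (the file name here is an assumption):
# with open("tweet_json.txt", encoding="utf-8") as handle:
#     tweet_counts = read_tweet_json(handle)
```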
The datasets were assessed visually and programmatically for quality and tidiness issues. After these issues were corrected, the three dataframes were merged into one master dataset, which I then used for my investigative analysis.
This section of the wrangling process was broken down into three parts:
- Define: The cleaning process to be carried out was explained.
- Code: The code needed to achieve the cleaning goal that had been defined.
- Test: Code to confirm that the cleaning goal had been achieved.
To begin, copies of the three datasets were created. These copies were used to carry out the cleaning activities.
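The Define, Code and Test steps, applied to a copy of the data, can be sketched as follows. The tiny dataframe and the particular cleaning goal (dropping rows with a missing `rating_numerator`) are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for one of the datasets (values are invented)
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "rating_numerator": [13.0, None, 10.0],
})

# Work on a copy so the original dataframe stays untouched
archive_clean = archive.copy()

# Define: drop rows where rating_numerator is missing.

# Code: carry out the step on the copy.
archive_clean = archive_clean.dropna(subset=["rating_numerator"])

# Test: confirm the goal was achieved and the original is unchanged.
assert archive_clean["rating_numerator"].notna().all()
assert len(archive) == 3
```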
Some of the cleaning processes carried out on the datasets were as follows:
- Some rows and columns containing null values were dropped.
- Attribute types were converted to appropriate types.
- Some columns were concatenated or unpivoted to form single columns.
- The three datasets were merged
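The cleaning steps listed above could look roughly like the sketch below. It assumes the archive has a `timestamp` column stored as text and four dog-stage columns named `doggo`, `floofer`, `pupper` and `puppo`, and that the three datasets share a `tweet_id` key. These names, the helper functions and the choice of an inner join are assumptions for illustration, not a record of what the notebook does.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]  # assumed column names


def collapse_stages(archive: pd.DataFrame) -> pd.DataFrame:
    """Unpivot the four stage columns into a single 'dog_stage' column."""
    long = archive.melt(id_vars=["tweet_id"], value_vars=STAGES, value_name="dog_stage")
    # Keep only cells that actually name a stage (drops NaN and placeholder text)
    long = long[long["dog_stage"].isin(STAGES)]
    # A tweet can mention more than one stage, so join them into one string
    stage = long.groupby("tweet_id")["dog_stage"].agg(lambda s: ",".join(sorted(s)))
    return archive.drop(columns=STAGES).merge(stage.reset_index(), on="tweet_id", how="left")


def clean_archive(archive: pd.DataFrame) -> pd.DataFrame:
    """Convert types and tidy the stage columns on a copy of the archive."""
    clean = archive.copy()
    clean["timestamp"] = pd.to_datetime(clean["timestamp"])
    return collapse_stages(clean)


def build_master(archive, predictions, counts) -> pd.DataFrame:
    """Merge the three cleaned datasets on tweet_id (inner join is an assumption)."""
    master = archive.merge(predictions, on="tweet_id", how="inner")
    return master.merge(counts, on="tweet_id", how="inner")
```

An inner join keeps only tweets present in all three datasets; a left join from the archive would keep every archived tweet instead, at the cost of missing values.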
- The chart suggests a correlation between some attributes of the data. The strongest correlation is between favourite count and retweet count.
- The date column also seems to be related to favourite count and rating numerator and, to a much lesser extent, to retweet count and tweet length.
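The relationships described above can be quantified with a correlation table. The sketch below is one possible way to build it, assuming a master dataframe with `favorite_count`, `retweet_count`, `rating_numerator`, `text` and a datetime `timestamp` column (all assumed names). Dates are converted to days elapsed since the first tweet so that they can take part in a Pearson correlation.

```python
import pandas as pd


def correlation_table(master: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlations between engagement, rating, tweet length and date."""
    numeric = master[["favorite_count", "retweet_count", "rating_numerator"]].copy()
    numeric["tweet_length"] = master["text"].str.len()
    # Express the date as a number: days since the earliest tweet
    elapsed = master["timestamp"] - master["timestamp"].min()
    numeric["days_since_first_tweet"] = elapsed.dt.total_seconds() / 86400
    return numeric.corr(method="pearson")
```

The resulting table can be passed to a heatmap plot; values near 1 or -1 indicate a strong linear relationship, and values near 0 a weak one.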
- The visual above shows us the popularity of breeds or occurrence of breeds rated by the twitter account.
- Terriers were the most talked about dogs
- Investigation showed that 6 of these breeds were small sized dogs
- Among the most popular breeds (those that occur more than 14 times), the highest rated is the Samoyed.
- Golden Retriever, which happens to be the most commonly tweeted about dog, is the second highest rated dog.
- 5 of our most common breeds within the time frame of our entire data, also happen to be among the Top 10 rated breeds.
- The French Bulldog and the Cocker Spaniel, both small dogs, are two of the top 3 most popular breeds by Twitter user engagement.
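The breed comparisons above rest on counting tweets per breed, keeping the popular breeds and averaging their ratings and engagement. A minimal sketch of that aggregation is shown below. The `breed` column and the function name are assumptions; the default threshold of 15 tweets mirrors the "more than 14 times" rule mentioned above.

```python
import pandas as pd


def breed_summary(master: pd.DataFrame, min_count: int = 15) -> pd.DataFrame:
    """Per-breed tweet counts and averages, restricted to popular breeds."""
    summary = master.groupby("breed").agg(
        tweets=("tweet_id", "count"),
        mean_rating=("rating_numerator", "mean"),
        mean_favorites=("favorite_count", "mean"),
        mean_retweets=("retweet_count", "mean"),
    )
    # Keep breeds that appear at least `min_count` times
    popular = summary[summary["tweets"] >= min_count]
    return popular.sort_values("mean_rating", ascending=False)
```

Sorting by `mean_favorites` or `mean_retweets` instead gives the engagement ranking rather than the rating ranking.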
These include other insights not shown in this summarised report. These can be found in the Jupyter notebook used to carry out the wrangling and analysis process.
- The handler of this Twitter account uses the abbreviation "af" a lot. We can assume they used it to emphasise their description of, or sentiment towards, the dog being rated.
- Of the four dog classifications, "pupper" was the handler's favourite description. According to the 'dogtionary', these are physically small or young dogs.
- In the time period under consideration, the maximum length of a tweet was 140 characters. The handler tended to write long tweets, with most between 100 and 140 characters.
- The handler was very generous with ratings, giving most dogs a numerator of 10 or more.
- The two most popular breeds were both Retrievers (Golden and Labrador). If people sent in their dogs' pictures to be rated, this suggests that most dog owners aware of the account owned Retrievers. If the handler found the pictures on their own, is it simply easier to find pictures of Retrievers than of other breeds?
- It appears the handler has a bias towards the Golden Retriever. Of the most popular dogs, it had the second highest rating. The Labrador Retriever is also in the top 10 for popular dogs with the highest ratings.
- The most popular dogs also had the most total engagements in terms of Favourites and retweets.
- Followers of the account seemed to have lots of love for the French Bulldog, Cocker Spaniel and Samoyed, as these were the popular breeds that received the most likes or retweets on average.
- The Bedlington Terrier and Saluki are probably the breeds that followers of the account liked the most. They received the most engagements on average despite their lack of popularity. (These engagement values could, however, be skewed by just one of their pictures receiving a very large number of engagements.)
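The text-based observations in this list (how often "af" appears and how long the tweets are) can be checked with a few string operations. The sketch below assumes the tweet text is available as a pandas Series; the function name and the returned keys are hypothetical.

```python
import pandas as pd


def text_insights(texts: pd.Series) -> dict:
    """Share of long tweets and number of tweets containing the word 'af'."""
    lengths = texts.str.len()
    long_share = ((lengths >= 100) & (lengths <= 140)).mean()
    # \b matches word boundaries, so words like "after" are not counted
    af_count = texts.str.contains(r"\baf\b", case=False, regex=True).sum()
    return {"share_100_to_140": float(long_share), "tweets_with_af": int(af_count)}
```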






