MOD1 LAB
By Tuyen Vu and Tarik Salay. The lab's repository can be found here. The first 10 minutes of the video recording can be found here and the second part here; alternatively, a short video can be found here.
Tasks:
1. Write a program to find the longest substring without repeating characters in a string read from the console. Sample Input: 'ababcdxa'. Sample Output: abcdx
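A minimal sketch of the first task using the standard sliding-window approach (the lab's own solution may differ; the function name is ours):

```python
def longest_unique_substring(s):
    """Return the longest substring of s without repeating characters."""
    seen = {}   # char -> index of its most recent occurrence
    start = 0   # left edge of the current window
    best = ""
    for i, ch in enumerate(s):
        if ch in seen and seen[ch] >= start:
            start = seen[ch] + 1          # jump past the previous occurrence
        seen[ch] = i
        if i - start + 1 > len(best):
            best = s[start:i + 1]         # new longest window
    return best

print(longest_unique_substring("ababcdxa"))  # abcdx
```

The window only ever moves forward, so the whole scan is O(n) in the length of the input.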
2. Suppose you have a list of tuples as follows: [('John', ('Physics', 80)), ('Daniel', ('Science', 90)), ('John', ('Science', 95)), ('Mark', ('Maths', 100)), ('Daniel', ('History', 75)), ('Mark', ('Social', 95))]. Create a dictionary with keys as names and values as lists of (subject, marks) tuples in sorted order: {'John': [('Physics', 80), ('Science', 95)], 'Daniel': [('History', 75), ('Science', 90)], 'Mark': [('Maths', 100), ('Social', 95)]}
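The grouping described in the second task can be sketched with `dict.setdefault` (one of several reasonable approaches):

```python
records = [('John', ('Physics', 80)), ('Daniel', ('Science', 90)),
           ('John', ('Science', 95)), ('Mark', ('Maths', 100)),
           ('Daniel', ('History', 75)), ('Mark', ('Social', 95))]

result = {}
for name, subject_mark in records:
    # i[0] is the name (key), i[1] is the (subject, mark) pair (value)
    result.setdefault(name, []).append(subject_mark)
for name in result:
    result[name].sort()   # sort each list of (subject, mark) tuples

print(result)
# {'John': [('Physics', 80), ('Science', 95)], 'Daniel': [('History', 75), ('Science', 90)], 'Mark': [('Maths', 100), ('Social', 95)]}
```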
3. Write a Python program to create any one of the following management systems: a. Airline Booking Reservation System (e.g. classes Flight, Person, Employee, Passenger, etc.) b. Library Management System (e.g. Student, Book, Faculty, Department, etc.)
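A bare-bones sketch of the airline option's class hierarchy (attribute names here are illustrative, not the lab's actual design):

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

class Employee(Person):           # inherits the Person constructor via super()
    def __init__(self, name, age, employee_id):
        super().__init__(name, age)
        self.employee_id = employee_id

class Passenger(Person):
    def __init__(self, name, age, ticket_no):
        super().__init__(name, age)
        self.ticket_no = ticket_no

class Flight:
    def __init__(self, number, destination):
        self.number = number
        self.destination = destination
        self.passengers = []      # relation: a Flight holds Passenger objects

    def book(self, passenger):
        self.passengers.append(passenger)

flight = Flight("AA101", "Kansas City")
flight.book(Passenger("Alice", 30, "T-001"))
print(len(flight.passengers))  # 1
```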
4. Go to https://scikit-learn.org/stable/modules/clustering.html#clustering and fetch the comparison of the clustering algorithms in scikit-learn. Hint: use the BeautifulSoup package.
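The scraping step can be sketched as below. To keep the example self-contained, a small inline snippet stands in for the page's comparison table; in the real task the HTML would come from `requests.get(...)` on the URL above:

```python
from bs4 import BeautifulSoup

# In the actual lab the page is fetched first, e.g.:
#   import requests
#   html = requests.get("https://scikit-learn.org/stable/modules/clustering.html").text
# Here a tiny snippet stands in for the clustering-comparison table.
html = """
<table>
  <tr><th>Method name</th><th>Parameters</th></tr>
  <tr><td>K-Means</td><td>number of clusters</td></tr>
  <tr><td>DBSCAN</td><td>neighborhood size</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")]
for row in rows:
    print(row)   # each row of the table as a list of cell strings
```

The extracted rows can then be written to a file, as the lab's write-up mentions.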
5. Pick any dataset from the dataset sheet in the class sheet, or online, that includes both numeric and non-numeric features. a. Perform exploratory data analysis on the dataset (handling null values, removing features not correlated with the target class, encoding the categorical features, ...). b. Apply the three classification algorithms Naïve Bayes, SVM, and KNN to the chosen dataset and report which classifier gives the better result.
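The classifier comparison in part (b) can be sketched as follows; the iris dataset stands in here for the lab's own CSV, which would first go through the EDA steps from part (a):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Stand-in dataset; the lab used its own (already cleaned) data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scores = {}
for name, model in [("Naive Bayes", GaussianNB()),
                    ("SVM", SVC()),
                    ("KNN", KNeighborsClassifier(n_neighbors=3))]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Report the classifiers from best to worst test accuracy
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

`sklearn.metrics.classification_report` (which the write-up mentions) would add per-class precision/recall to this comparison.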
6. Choose any dataset of your choice. Apply K-means to the dataset and visualize the clusters using matplotlib or seaborn. a. Report which K is best using the elbow method. b. Evaluate with the silhouette score or other scores relevant for unsupervised approaches (before clustering, clean the dataset with the EDA techniques learned in class).
7. Write a program that takes an input file and uses the simple approach below to summarize it. Link to input file: https://umkc.box.com/s/7by0f4540cdbdp3pm60h5fxxffefsvrw a. Read the data from the file. b. Tokenize the text into words and apply lemmatization to each word. c. Find all the trigrams of the words. d. Extract the top 10 most repeated trigrams by count. e. Go through the text in the file. f. Find all the sentences containing the most repeated trigrams. g. Extract those sentences and concatenate them. h. Print the concatenated result.
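The pipeline in steps (b)-(h) can be sketched with the standard library alone (the lab uses NLTK's `word_tokenize` and `WordNetLemmatizer` for steps b; a regex tokenizer keeps this sketch dependency-free, and the short text here stands in for the input file):

```python
import re
from collections import Counter

# Stand-in for the contents of the input file (step a)
text = ("Natural language processing is fun. Processing language is useful. "
        "Natural language processing helps summarize text.")

# b. Tokenize into words (lemmatization via NLTK would go here)
words = re.findall(r"[a-z]+", text.lower())

# c./d. All trigrams, then the top 10 most repeated ones by count
trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
top10 = Counter(trigrams).most_common(10)

# e./f./g. Collect and concatenate sentences containing the most repeated trigram
most_common = top10[0][0]
sentences = re.split(r"(?<=[.!?])\s+", text)
summary = " ".join(s for s in sentences
                   if " ".join(most_common) in s.lower())

# h. Print the concatenated result
print(summary)
```

On this toy text the most repeated trigram is "natural language processing", so the summary keeps the first and last sentences.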
8. Create a multiple regression model on a dataset of your choice (again, before evaluating, clean the dataset with the EDA techniques learned in class). Evaluate the model using RMSE and R2, and report whether you saw any improvement before and after the EDA.
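The fit-and-evaluate step for this task can be sketched as below; a synthetic regression dataset stands in for the chosen (already cleaned) data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the chosen dataset (5 features -> multiple regression)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE
r2 = r2_score(y_test, pred)                       # R2
print(f"RMSE: {rmse:.2f}, R2: {r2:.3f}")
```

Running this once on the raw data and once after EDA (imputation, dropping uncorrelated features, encoding) lets you compare the two RMSE/R2 pairs, as the task asks.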
The first three tasks dealt with basic Python functionality. In the first task we took a string as input from the user and built substrings using a for loop. In the second we started by defining an empty dictionary and filled it from the tuples, making the first element of each tuple (i[0]) the key and the second element (i[1]) the value, and finally sorting the value lists. The third question dealt with classes and inheritance: after picking the airline option, we defined the classes with their constructors and inherited from the parent classes where appropriate to model the relationships between them.
With the 4th task we started applying the machine-learning-oriented material of this course. Using the BeautifulSoup and requests libraries, we parsed the HTML content, fetched the needed information from the website, and finally saved the data to a file.
The 5th task asked us to apply three classification algorithms, Naive Bayes, SVM, and KNN, but the first requirement was to perform EDA (exploratory data analysis) on the data, which, according to its founder John Tukey, is about planning the gathering of data to make its analysis easier, more precise, or more accurate. We used the usual machine learning libraries, such as seaborn, pandas, matplotlib, and sklearn. The only additional library was "warnings", along with seaborn's FacetGrid method, which we used to visualize the data and analyze our feature correlations. Finally, we applied the requested methods and techniques: classification with three different algorithms, Naive Bayes, K-Nearest Neighbors (k=3), and Support Vector Machine, with the models evaluated via a classification report.
For task 6 (and also the 8th) we spent a significant amount of time deciding which dataset to use, since some of those we tried didn't work as well as planned. After settling on a dataset, the task asked us to apply k-means clustering, report which K was best using the elbow method, and evaluate the relevant scores. As always, we started by reading the CSV with the pandas library and storing it in a dataframe. After making sure there were no null values and so on, we moved on to visualization, using matplotlib. Applying the k-means algorithm and visualizing the elbow method, after several trials we found that k=2 fit best in this case in terms of silhouette score. For this data, scaling the features and applying PCA turned out to be a bad idea, since the resulting silhouette score was very low.
For the 7th task we worked with text/streams of words. Using the Natural Language Toolkit, we tokenized the text into words and sentences, and from there extracted the trigrams of the text.
Task1
Task2
Task3
Task4
Task5
Task6
Task7
Installing NLTK Data downloads
Task8
Kaggle has been really helpful to us throughout this project. The datasets can be found at:
For the first few tasks, none. Once we started working with machine learning techniques, yes. Parameters have an enormous place in machine learning, and as we understand it, the differences between parameters and hyperparameters will be discussed in the second module of this course. Parameters are key to machine learning algorithms: even though they did not greatly affect our results here, they are the part of the model that is learned from the historical training data.
The 5th, 6th, and 8th tasks worked with datasets and supervised/unsupervised algorithms. Before applying any algorithm, we performed EDA (exploratory data analysis) on the data: visualizing feature relationships, handling null values, encoding categorical features, and so on. The only additional library we used was "warnings", along with seaborn's FacetGrid method, which we used to visualize the data and analyze our feature correlations.