mariah-may23/WebCrawler

High Level Overview

We parse the command-line arguments to get the username and password for the program. We then create a TLS-wrapped socket to connect to the HTTP host. Once the connection is established, we send a GET request to the server to retrieve the root page; this is necessary for the server to respond with the CSRF token used to log in. We send this token along with another GET request to the login page. The server responds with both a session cookie and a csrfmiddlewaretoken, and the HTML parser is used to locate this token. With all of the needed data in hand, the user can submit a login request. This request must contain both cookies, the csrfmiddlewaretoken, and the username and password of the current user. Upon a successful login, we are given a new pair of cookies, which are stored and sent along with every subsequent GET request in our program. We then send a new GET request to the user’s Fakebook profile.

Once the user is logged into their Fakebook profile, we start crawling through the web pages. We maintain a set of previously visited links (to ensure uniqueness), a queue of links to be visited, and a set of secret flags for the user. We use the HTMLParser object to parse the server response, looking for “a” tags, which contain URLs we need to visit to complete our search, and “h2” tags, which may contain our secret flags. Once a URL has been parsed, it is added to the visited set so we do not parse it again. Crawling continues until all five secret flags are found and the program breaks out of the loop, or until we run out of URLs in the queue.

We also ensure that appropriate error handling is done while crawling these pages. Responses with status 200 and 302 are parsed as normal. We abandon the current URL if we encounter a 403 (Forbidden) or 404 (Not Found) error. For a 500 (Internal Server Error) response, we add the link back into the queue to be requested again. A separate function handles 301 (Moved Permanently) responses: it extracts the new URL location from the response (if it exists) and adds it to our queue.
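As a rough illustration of the connection step described above, the sketch below opens a TLS-wrapped socket and sends a raw GET request over it. The host name, port, and helper names here are placeholders, not the exact ones used in our program.

```python
import socket
import ssl

# Placeholder host/port; the real crawler connects to the Fakebook server over HTTPS.
HOST = "www.example.com"
PORT = 443

def open_tls_socket(host=HOST, port=PORT):
    """Open a TLS-wrapped TCP socket to the HTTP host."""
    context = ssl.create_default_context()
    raw_sock = socket.create_connection((host, port))
    return context.wrap_socket(raw_sock, server_hostname=host)

def send_get(sock, path, host=HOST, cookies=""):
    """Send a minimal HTTP/1.1 GET request over the open socket."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Connection: keep-alive\r\n"
    )
    if cookies:
        request += f"Cookie: {cookies}\r\n"  # session cookies, once we have them
    request += "\r\n"
    sock.sendall(request.encode())
```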

Mariah - I started by researching ways to form a persistent connection to the server so that multiple GET requests could be sent from our program, and by researching the structure of these requests using the resources provided in the spec. Once I was able to connect to the root page, I spent time looking at the browser’s inspector tools while going through the website. After careful inspection of each request in the browser, I created the necessary functions for the GET and POST requests. Some more research was needed to understand the request body required to send a complete login request. I coded part of the HTML parser to do this: by searching through the input tags, I was able to locate the csrfmiddlewaretoken. I also created functions to store the cookies needed to log in. I used these functions together to code the beginning of the program, walking through each page so we could access the user’s Fakebook profile. I also worked on refactoring and editing some code after Shriya completed her part of the assignment.
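A minimal sketch of that input-tag search, assuming the token sits in a hidden input named csrfmiddlewaretoken; the class and variable names here are illustrative rather than our exact code.

```python
from html.parser import HTMLParser

class CSRFTokenParser(HTMLParser):
    """Scan <input> tags for the hidden csrfmiddlewaretoken value."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == "csrfmiddlewaretoken":
            self.token = attrs.get("value")

# Usage: feed the login page HTML, then read parser.token.
# parser = CSRFTokenParser()
# parser.feed(login_page_html)
```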

Shriya - I worked on restructuring the code to accept the username and password for the user and on making a secure connection to the host server, with error handling around that connection. I had to dig into ways to parse the content of the URLs, which required an understanding of the structure of HTTP requests and responses, as well as of the resources (from the allowed set for this project) that would be useful in parsing and structuring our setup for the web crawler. It took some research and experimenting with different code designs to settle on the most useful resource: HTMLParser, which our program uses both to find the appropriate tags and to extract the content between them for obtaining the secret flags. I set up the appropriate data structures (sets for visited links and flags, and a queue for links to be visited) and then built the crawler for URLs. The crawler also required additional error handling, which we implemented by extracting status codes and handling them as described above. We also had to refactor our receive function to read the content length and receive the entire message based on the closing tag of the HTML.
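The crawl loop built around those data structures might look roughly like the sketch below; fetch() and extract_links_and_flags() are hypothetical stand-ins for our request and parsing functions.

```python
from collections import deque

def crawl(start_url, fetch, extract_links_and_flags):
    """Breadth-first crawl until five flags are found or the queue is empty.

    fetch(url) is assumed to return the page HTML (or None if the URL was
    abandoned); extract_links_and_flags(html) is assumed to return a
    (links, flags) pair produced by the HTML parser.
    """
    visited = set()       # previously visited links, for uniqueness
    flags = set()         # secret flags found so far
    to_visit = deque([start_url])

    while to_visit and len(flags) < 5:
        url = to_visit.popleft()
        if url in visited:
            continue
        visited.add(url)

        html = fetch(url)
        if html is None:
            continue

        links, found = extract_links_and_flags(html)
        flags.update(found)
        for link in links:
            if link not in visited:
                to_visit.append(link)

    return flags
```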

Together, we worked out the most effective way to handle errors in our program and, as a final step toward completing the project, refactored the existing code for receiving responses from the server.
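A sketch of what that refactored receive step might look like, assuming the response is read until Content-Length bytes of body have arrived, with the closing </html> tag as a fallback stop condition; the exact buffering in our program may differ.

```python
def receive_response(sock):
    """Read a complete HTTP response from an open socket."""
    data = b""
    # Read until the blank line that separates headers from body.
    while b"\r\n\r\n" not in data:
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk

    headers, _, body = data.partition(b"\r\n\r\n")

    # Pull Content-Length out of the headers, if the server sent it.
    length = None
    for line in headers.split(b"\r\n"):
        if line.lower().startswith(b"content-length:"):
            length = int(line.split(b":", 1)[1].strip())
            break

    # Keep reading until the full body has arrived (or </html> appears).
    while (length is not None and len(body) < length) or \
          (length is None and b"</html>" not in body.lower()):
        chunk = sock.recv(4096)
        if not chunk:
            break
        body += chunk

    return headers.decode(errors="replace"), body.decode(errors="replace")
```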

Challenges you faced

Mariah - The most challenging part of this assignment was understanding how to use the cookies along with the requests to log in. It took me quite a bit of time to look through the pages individually; I looked at the request and response cookies for each page to get a better idea of what was needed in each request. The contents of the headers for these requests also took some research, because of the different requirements the headers have under each HTTP version. I sent many requests to the server during this time while attempting to log in. At first, I had overlooked the hidden fields in the login page, which caused many of these attempts to fail. Once I inspected the HTML more carefully, I understood that the hidden field (csrfmiddlewaretoken) had to be sent with the POST request and that it changed each time, so I had to find a way to look through the HTML first before submitting the login info. After some trial and error, this was solved.
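For illustration, a login POST along the lines described here might be assembled as below. The path, the cookie string format, and the form field names other than csrfmiddlewaretoken are assumptions for the sketch rather than our exact values.

```python
from urllib.parse import urlencode

def build_login_post(path, host, username, password, csrf_token, cookies):
    """Build a login POST carrying both cookies and the hidden CSRF token."""
    body = urlencode({
        "username": username,                 # field names assumed for illustration
        "password": password,
        "csrfmiddlewaretoken": csrf_token,    # the hidden field found in the HTML
    })
    return (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        f"Cookie: {cookies}\r\n"              # e.g. "csrftoken=...; sessionid=..."
        f"Connection: keep-alive\r\n"
        f"\r\n"
        f"{body}"
    )
```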

Shriya - The most challenging part of this project for me was figuring out the appropriate library/methodology to parse the URLs and extract the contents of the h2 tags, as there is no direct way to get the content between tags using HTMLParser or any of the other allowed resources. After some trial and error, I was finally able to extract the content once I understood how the parser’s different methods behave. It also took me a while to sift through the URL content and work out how to make the crawl loop as efficient as possible to reduce search time. Implementing the error handling for the program was also tricky, as we dug into several different approaches and checkpoints before settling on one.
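One way to pull the text out from between the h2 tags with HTMLParser is to track whether the parser is currently inside an h2 element and collect the data events seen there. This is a sketch; filtering the collected text down to the actual flag values is left out.

```python
from html.parser import HTMLParser

class FlagParser(HTMLParser):
    """Collect the text found inside <h2> tags, where secret flags may appear."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.h2_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # handle_data fires for the text between tags; record it only
        # while we are inside an <h2> element.
        if self.in_h2 and data.strip():
            self.h2_text.append(data.strip())
```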

Testing the Code

This program required testing at practically every step. We tested our code by printing out and studying the responses we received before implementing further functionality. Extraction of data and content from each URL had to be tested against unwanted responses, so appropriate checks and exception handlers were put in place. We also added error-handling functionality for our parsing, connections, and crawling, as mentioned in the description above.
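The error handling referred to above comes down to a dispatch on the status code. A minimal sketch, assuming the response has already been split into a status code and a header dictionary, with parse_page and to_visit as stand-ins for our parsing step and URL queue:

```python
from collections import deque

to_visit = deque()   # hypothetical frontier queue, shared with the crawler

def handle_status(status, headers, url, parse_page):
    """Dispatch on the HTTP status code as described in the overview."""
    if status in (200, 302):
        parse_page()                        # parse the page as normal
    elif status in (403, 404):
        pass                                # abandon this URL
    elif status == 500:
        to_visit.append(url)                # retry the same URL later
    elif status == 301:
        new_url = headers.get("Location")   # follow the redirect, if one is given
        if new_url:
            to_visit.append(new_url)
```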

About

cs5700 proj 2
