ravisraju/vector_space_retreival

Simple Search Engine

Simple search engine based on Boolean Retrieval and Vector Space Retrieval.

Filename: simple_search_engine.py

The corpus has not been uploaded due to size constraints. Place the corpus to be indexed ("nsf award abstracts" in our case) in the same folder as the Python file before running it.

Run the file as

python simple_search_engine.py

Building the initial index takes about 25 seconds.
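As a rough illustration of what an initial index can look like, here is a minimal sketch of an inverted index that maps each term to the documents containing it. The names `tokenize` and `build_index`, the tokenization rule, and the in-memory `docs` dictionary are assumptions made for this example; the actual script reads the corpus files from disk and may structure its index differently.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a hypothetical inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


# Tiny stand-in corpus; the real corpus is a folder of abstract files.
docs = {
    "a1.txt": "Stephen Palumbi studies marine populations",
    "a2.txt": "Marine sensors and marine robotics",
}
index = build_index(docs)
print(index["marine"])
```

Storing term frequencies alongside the document ids means the same structure can serve both the Boolean lookup and the later TF-IDF step.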

Computing the TF-IDF weights and building the TF-IDF vectors for all files takes about another 50 seconds.
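The sketch below shows one way to build such vectors, using the `log(1.0*N/df)` form of IDF listed under the resources. Using the raw term count as the TF part is an assumption, as are the function name `tfidf_vectors` and the sample data; the script itself may weight terms differently.

```python
import math
from collections import Counter


def tfidf_vectors(docs_tokens):
    """Turn {doc_id: [tokens]} into ({doc_id: {term: weight}}, idf table).

    Weight = raw term count * log(1.0 * N / df), where N is the number of
    documents and df is the number of documents containing the term.
    """
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens.values():
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(1.0 * n_docs / count) for term, count in df.items()}
    vectors = {}
    for doc_id, tokens in docs_tokens.items():
        tf = Counter(tokens)
        vectors[doc_id] = {term: count * idf[term] for term, count in tf.items()}
    return vectors, idf


# Hypothetical tokenized documents, for illustration only.
docs_tokens = {
    "d1": ["marine", "marine", "robot"],
    "d2": ["marine", "award"],
}
vectors, idf = tfidf_vectors(docs_tokens)
print(vectors["d1"])
```

A term that occurs in every document gets an IDF of zero under this formula, so it contributes nothing to the ranking.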

That is about 75 seconds of setup in total. The work is done up front so that answering queries doesn't take a long time.

After this, the query prompt opens.

The query structure is "<bool/vector> query".

The "bool" option performs a Boolean search, while the "vector" option computes the cosine similarity between the query and every document, ranks the documents, and prints the top 50 results.
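The two modes can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the function names are hypothetical, the Boolean mode is assumed to require every query term (AND semantics), and documents and queries are assumed to be sparse `{term: weight}` dictionaries.

```python
import math


def bool_search(index, terms):
    """Return the ids of documents containing every term (AND is an assumption)."""
    postings = [set(index.get(term, ())) for term in terms]
    return set.intersection(*postings) if postings else set()


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def vector_search(doc_vectors, query_vector, k=50):
    """Score every document against the query and return the top k (doc_id, score)."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by id
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical index and vectors, for illustration only.
index = {"stephen": {"a1": 1, "a3": 1}, "palumbi": {"a1": 2}}
doc_vectors = {
    "a1": {"ricardo": 1.0, "osuna": 1.0},
    "a2": {"ricardo": 1.0, "sensor": 1.0},
    "a3": {"marine": 2.0},
}
matches = bool_search(index, ["stephen", "palumbi"])
results = vector_search(doc_vectors, {"ricardo": 1.0, "osuna": 1.0})
print(matches)
print([doc_id for doc_id, _ in results])
```

Scoring every document is simple but scales with the corpus size, which is one reason to precompute the document vectors before the prompt opens.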

Example queries

1) bool stephen palumbi

2) vector ricardo osuna


To exit, type "exit". Note: if you type "bool exit", the application will search for the keyword "exit".
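The prompt behaviour described above could be handled along these lines. The function name `parse_query`, the lowercasing of input, and the handling of malformed lines are assumptions made for this sketch.

```python
def parse_query(line):
    """Split a prompt line into (mode, terms).

    mode is "exit", "bool", "vector", or None when the line is malformed.
    """
    words = line.strip().lower().split()
    if words == ["exit"]:
        return "exit", []
    if len(words) >= 2 and words[0] in ("bool", "vector"):
        return words[0], words[1:]
    return None, []


print(parse_query("bool stephen palumbi"))
print(parse_query("bool exit"))
print(parse_query("exit"))
```

Only a bare "exit" ends the session here; "bool exit" is parsed as a Boolean search for the term "exit", matching the note above.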

The times above were measured on my personal machine (8 GB RAM, Core i5).

On the Sun machines, indexing takes considerably longer and the time varies: the total indexing time ranged from 8 to 14 minutes.


Resources used:

"glob" - from Piazza
"log(1.0*N/df)" - from Piazza
Python documentation
Stack Overflow
