KnlG/JBEmbeddingSearchEngine

Embeddings-based code search engine

This repository implements and evaluates a simple embeddings-based code search engine. It consists of three parts and an appendix.

  • Part 1: Embeddings-based search engine - implementation

    • A sample dataset from Kaggle, the Single-Topic RAG Evaluation Dataset, was used for an initial check that the search engine works. Its documents.csv file contains 20 documents, which serve as the corpus for the search engine.
    • The NLTK library was used to split the documents into sentences.
    • usearch was used to store and search the embeddings.
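
The flow described in Part 1 (split documents into sentences, embed each sentence, store the vectors, retrieve the nearest ones for a query) can be illustrated with a dependency-free sketch. Everything here is a stand-in and not the notebook's code: `split_sentences` is a naive regex splitter in place of NLTK, `embed` is a toy bag-of-words vector in place of a neural embedding model, and `BruteForceIndex` is an exact cosine search in place of the usearch index. The two sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive regex sentence splitter (stand-in for NLTK sentence tokenization)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a neural embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class BruteForceIndex:
    """Exact nearest-neighbour search (stand-in for the usearch index)."""

    def __init__(self):
        self.entries = []  # list of (sentence, vector) pairs

    def add(self, sentence):
        self.entries.append((sentence, embed(sentence)))

    def search(self, query, k=10):
        q = embed(query)
        scored = [(cosine(q, vec), sent) for sent, vec in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Hypothetical sample documents, not taken from documents.csv.
documents = [
    "Paris is the capital of France. It lies on the Seine.",
    "Python is a programming language. It is widely used for data science.",
]

index = BruteForceIndex()
for doc in documents:
    for sentence in split_sentences(doc):
        index.add(sentence)

top_hits = index.search("capital of France", k=2)  # list of (score, sentence)
```

A real embedding model maps semantically similar text to nearby vectors even without shared words, and an approximate index such as usearch keeps search fast as the corpus grows; the brute-force version only shows the shape of the pipeline.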
  • Part 2: Evaluation of search engine

    • The search engine is evaluated on queries and function bodies from the CoSQA dataset.
    • The following evaluation metrics are used:
      • Recall@10 - measures the percentage of all relevant items that appear within the top 10 results.
      • MRR@10 - measures how high the first relevant item is ranked within the top 10 results (the reciprocal of its rank, averaged over all queries).
      • NDCG@10 - measures the quality of the search results by considering relevance score and position discounting.
    • Initial evaluation results on the CoSQA test set are approximately as follows:
      • Recall@10 : 0.946
      • MRR@10 : 0.758
      • NDCG@10 : 0.804
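
The three metrics listed in Part 2 can be written down compactly. The sketch below is a minimal per-query version that assumes binary relevance (an item is either relevant or not) and a ranked list without duplicates; it is not necessarily how the notebook computes them. The function names and the sample IDs are illustrative, and the reported scores would be the mean of these per-query values over all test queries.

```python
import math


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant item in the top k (0 if none)."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical query whose single relevant function is ranked second.
ranked = ["f7", "f3", "f9"]
relevant = {"f3"}
scores = (
    recall_at_k(ranked, relevant),  # 1.0: the relevant item is in the top 10
    mrr_at_k(ranked, relevant),     # 0.5: first relevant item at rank 2
    ndcg_at_k(ranked, relevant),    # 1 / log2(3), about 0.631
)
```

With a single relevant item per query, Recall@10 reduces to a hit rate, while MRR@10 and NDCG@10 additionally reward placing that item closer to the top.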
  • Part 3: Fine-tuning with code data and re-evaluation

    • To make the model more suitable for the code search task, the embedding model is trained further on the CoSQA training set and then evaluated again to measure the improvement.
    • MultipleNegativesRankingLoss from Sentence Transformers was used as the loss function.
    • The evaluation results on the CoSQA test set after fine-tuning are as follows:
      • Recall@10 : 0.988
      • MRR@10 : 0.811
      • NDCG@10 : 0.855
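
The idea behind a multiple-negatives ranking loss can be sketched in plain Python: within a batch of (query, code) pairs, each query's own code is the positive and every other code in the batch acts as a negative, and the loss is a cross-entropy over scaled cosine similarities. This is an illustration of the principle, not the Sentence Transformers implementation; the function name `in_batch_negatives_loss`, the `scale` value of 20, and the toy 2-dimensional vectors are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_vecs, code_vecs, scale=20.0):
    """Mean cross-entropy where query i's target is code i.

    All other code vectors in the batch serve as negatives for query i.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, c) for c in code_vecs]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: 'aligned' pairs each query with a similar code vector,
# 'swapped' pairs each query with the other query's code vector.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = in_batch_negatives_loss(queries, aligned)
loss_swapped = in_batch_negatives_loss(queries, swapped)
```

The loss is close to zero when each query is most similar to its own code and grows when another code in the batch scores higher, which is what pushes matching query and code embeddings together during fine-tuning.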
  • Appendix: An alternative loss function that was tried as part of the experiments

Running the notebook

To run the Report.ipynb notebook:

  • Make sure Python is installed.
  • Clone the repository:
    • git clone <URL>
  • Create a virtual environment (optional but recommended):
    • using venv:
      python -m venv venv
      source venv/bin/activate       # On macOS/Linux
      venv\Scripts\activate          # On Windows
    • alternatively, using conda:
      conda create -n myenv python=3.14 -y
      conda activate myenv
  • Install dependencies:
    pip install -r requirements.txt
  • Launch Jupyter Notebook:
    jupyter notebook
    When the notebook interface opens in your browser, navigate to the .ipynb file and open it.
