This repository implements and evaluates a simple embeddings-based code search engine. It consists of three parts and an appendix.
Part 1: Embeddings-based search engine - implementation
- In this part of the project, a sample dataset from Kaggle called the Single-Topic RAG Evaluation Dataset was used to initially check that the search engine works. This dataset contains 20 documents in documents.csv, which are indexed by our embeddings-based search engine.
- The NLTK library was used to tokenize the documents into sentences.
- The USearch library was used to store and search through the embeddings.
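The Part 1 pipeline (tokenize documents, embed them, index the vectors, rank by similarity) can be illustrated with a minimal pure-Python stand-in. Here hand-made vectors replace the real embedding model and brute-force cosine similarity replaces the USearch index; the document IDs and vectors are made up for the example:

```python
import math

# Toy stand-in for the real pipeline: in the repository the vectors come from
# an embedding model and live in a USearch index; here we use tiny hand-made
# vectors and brute-force cosine similarity to show the ranking step.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

corpus = {
    "doc0": [0.9, 0.1, 0.0],
    "doc1": [0.1, 0.8, 0.1],
    "doc2": [0.0, 0.2, 0.9],
}

def search(query_vec, k=2):
    # Score every document against the query and return the k best IDs.
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

print(search([1.0, 0.0, 0.0]))  # ['doc0', 'doc1'] — doc0 is most similar
```

An approximate-nearest-neighbour index such as USearch replaces the linear scan above with a sub-linear lookup, which matters once the corpus grows beyond a few thousand vectors.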
Part 2: Evaluation of search engine
- The search engine is evaluated on queries and function bodies from the CoSQA dataset.
- The following evaluation metrics are used:
- Recall@10 - measures the percentage of all relevant items that appear within the top 10 results.
- MRR@10 - measures how high the first relevant item is ranked within the top 10 results.
- NDCG@10 - measures the quality of the search results by considering relevance score and position discounting.
- Initial evaluation results on CoSQA test dataset are roughly as follows:
- Recall@10 : 0.946
- MRR@10 : 0.758
- NDCG@10 : 0.804
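The three metrics can be implemented in a few lines. The helpers below are illustrative, not the evaluation code from the repository; binary relevance is assumed (an item is either relevant to the query or not):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant items that appear in the top-k results."""
    top = set(ranked_ids[:k])
    return sum(1 for r in relevant_ids if r in top) / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant item in the top-k, else 0."""
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: position-discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked_ids[:k], start=1)
              if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7"]   # system output, best first
relevant = {"d1"}             # ground truth for this query
print(recall_at_k(ranked, relevant))  # 1.0 — the one relevant doc is in the top 10
print(mrr_at_k(ranked, relevant))     # 0.5 — first relevant item at rank 2
```

The reported scores are these per-query values averaged over all queries in the test set.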
Part 3: Fine-tuning with code data and re-evaluation
- To make the model more suitable for the code search task, the embedding model was trained further on the CoSQA training dataset and then re-evaluated to measure the improvement. MultipleNegativesRankingLoss from Sentence Transformers was used as the loss function.
- The evaluation results on the CoSQA test dataset after fine-tuning are as follows:
- Recall@10 : 0.988
- MRR@10 : 0.811
- NDCG@10 : 0.855
Appendix: Alternative loss function that was part of the experiment
To run the Report.ipynb notebook:
- Make sure Python is installed.
- Clone the repository:
git clone <URL>
- Create a virtual environment (optional but recommended):
- using venv:
python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate     # On Windows
- alternatively using conda:
conda create -n myenv python=3.14 -y
conda activate myenv
- Install dependencies:
pip install -r requirements.txt
- Launch Jupyter Notebook:
jupyter notebook
When the notebook interface opens in your browser, navigate to Report.ipynb and open it.