This repository implements and evaluates a simple embeddings-based code search engine. It consists of three parts and an appendix.
Part 1: Embeddings-based search engine - implementation
- In this part of the project, a sample dataset from Kaggle called the Single-Topic RAG Evaluation Dataset was used to initially check that the search engine works. This dataset contains 20 documents in documents.csv, which are indexed by our embeddings-based search engine.
- The NLTK library was used to tokenize the documents into sentences.
- The USearch library was used to store and search through the embeddings.
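The Part 1 pipeline (tokenize documents, embed them, index the vectors, rank by similarity) can be illustrated with a minimal pure-Python stand-in. Here hand-made vectors replace the real embedding model and brute-force cosine similarity replaces the USearch index; the document IDs and vectors are made up for the example:

```python
import math

# Toy stand-in for the real pipeline: in the repository the vectors come from
# an embedding model and live in a USearch index; here we use tiny hand-made
# vectors and brute-force cosine similarity to show the ranking step.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

corpus = {
    "doc0": [0.9, 0.1, 0.0],
    "doc1": [0.1, 0.8, 0.1],
    "doc2": [0.0, 0.2, 0.9],
}

def search(query_vec, k=2):
    # Score every document against the query and return the k best IDs.
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

print(search([1.0, 0.0, 0.0]))  # ['doc0', 'doc1'] — doc0 is most similar
```

An approximate-nearest-neighbour index such as USearch replaces the linear scan above with a sub-linear lookup, which matters once the corpus grows beyond a few thousand vectors.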
Part 2: Evaluation of search engine
- The search engine is evaluated on queries and function bodies from the CoSQA dataset.
- The following evaluation metrics are used:
- Recall@10 - measures the percentage of all relevant items that appear within the top 10 results.
- MRR@10 - measures how high the first relevant item is ranked within the top 10 results.
- NDCG@10 - measures the quality of the search results by considering relevance score and position discounting.
- Initial evaluation results on CoSQA test dataset are roughly as follows:
- Recall@10 : 0.946
- MRR@10 : 0.758
- NDCG@10 : 0.804
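The three metrics can be implemented in a few lines. The helpers below are illustrative, not the evaluation code from the repository; binary relevance is assumed (an item is either relevant to the query or not):

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant items that appear in the top-k results."""
    top = set(ranked_ids[:k])
    return sum(1 for r in relevant_ids if r in top) / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant item in the top-k, else 0."""
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: position-discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked_ids[:k], start=1)
              if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7"]   # system output, best first
relevant = {"d1"}             # ground truth for this query
print(recall_at_k(ranked, relevant))  # 1.0 — the one relevant doc is in the top 10
print(mrr_at_k(ranked, relevant))     # 0.5 — first relevant item at rank 2
```

The reported scores are these per-query values averaged over all queries in the test set.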
Part 3: Fine-tuning with code data and re-evaluation
- To make the model more suitable for the code search task, the embedding model was trained further on the CoSQA training dataset and then re-evaluated to measure the improvement. MultipleNegativesRankingLoss from Sentence Transformers was used as the loss function.
- The evaluation results on the CoSQA test dataset after fine-tuning are as follows:
- Recall@10 : 0.988
- MRR@10 : 0.811
- NDCG@10 : 0.855
Appendix: Alternative loss function that was part of the experiment
To run the Report.ipynb notebook:
- Make sure Python is installed.
- Clone the repository:
git clone <URL>
- Create a virtual environment (optional but recommended):
- using venv:
python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate     # On Windows
- alternatively using conda:
conda create -n myenv python=3.14 -y
conda activate myenv
- Install dependencies:
pip install -r requirements.txt
- Launch Jupyter Notebook:
jupyter notebook
When the notebook interface opens in your browser, navigate to Report.ipynb and open it.