GitHub - AmanSinghal927/RAG-prod

Information Retrieval - Approach

What do you think about your solution and how do you think it can be scaled up?

- Add a document classifier, depending on what information is available.
- Better chunking strategies: a segmentation model, or something else.
- Metrics for multi-document retrieval.
- Better labelled data to compute those metrics.
- Different kinds of indexes, e.g. HNSW or other ANN-based techniques, to reduce retrieval latency (so far I have only compared L2 and cosine distance).
What have you already done?

Problem exploration: First I looked at the data to see what kinds of questions were being asked. I found that some questions had no answer in the documents (e.g. "What are the standard rules of a golf game?") and that the majority of the questions could be answered from a single paragraph. I also noted that one of the PDFs had no raw text. Moreover, the PDFs were text-based rather than image-based. Then I looked at the raw text and decided to set up a basic pipeline as my baseline.

Based on this, I decided on the basic problems I would target:

0) Which metrics to use
1) Which db/index to use
2) Which encoder to use
3) Re-chunking: some questions needed context that spanned two chunks of the raw text, and the second chunk would not be retrieved without the context from the first
4) Tabular data: some questions were based on tables, and the tabular data was extracted pretty well from the documents
5) How to handle questions with no answer: would the LLM handle this? I think it would

Solutions (based only on paragraph-level data):

0a) Metrics: Recall@4. I looked at the results, as well as the paper "Lost in the Middle", to see how much context an LLM can handle without missing information (the paper's findings are for documents, but they are a good proxy). Recall@10 as well.

0b) Metrics: Earlier contexts are better (the paper argues that contexts at the very end are also used well), so I also decided to have a metric that captures rank: MRR. I give higher priority to Recall and then to MRR. If I had more questions that require multi-document retrieval, I would also have considered NDCG. With only one item to retrieve, though, NDCG is just a monotone function of the rank, 1/log2(1 + rank), so it adds little over MRR.
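The two metrics above can be sketched in a few lines. This is a minimal illustration and not the repository's evaluation code: the function names, the `(ranked_ids, relevant_ids)` input shape, and the paragraph ids in the usage note are all assumptions made for the example.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Run = Tuple[Sequence[Hashable], Sequence[Hashable]]  # (ranked ids, relevant ids)


def recall_at_k(ranked_ids: Sequence[Hashable],
                relevant_ids: Sequence[Hashable],
                k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[Hashable],
                    relevant_ids: Sequence[Hashable]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Run], k: int) -> Dict[str, float]:
    """Average Recall@k and MRR over all questions."""
    n = len(runs)
    if n == 0:
        return {"recall@k": 0.0, "mrr": 0.0}
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

For example, `evaluate([(["p3", "p1"], ["p1"]), (["p5", "p9"], ["p4"])], k=4)` averages one hit at rank 2 and one miss. Reporting Recall first and MRR second matches the priority described above: Recall says whether the right paragraph reached the LLM at all, MRR says how early it appeared.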

1) Index: Milvus support is excellent on Linux but not good on Windows, so I decided to use FAISS. I started with a flat index, using either L2 distance or cosine distance. It was extremely fast, so I decided not to move on to HNSW or other approximate-nearest-neighbor indexes.
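A flat index is an exhaustive search: the query is compared against every stored vector. The sketch below reproduces that idea in plain Python so the L2 and cosine options can be compared side by side. It is an illustration only, not the FAISS API and not the repository's code; in FAISS the corresponding exact indexes are `IndexFlatL2` and `IndexFlatIP` (inner product, which equals cosine similarity on unit-normalized vectors). The helper names here are made up for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance; smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; larger means closer."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def flat_search(query: Vector, vectors: Sequence[Vector],
                k: int, metric: str = "l2") -> List[int]:
    """Exact top-k search: score every stored vector, then sort."""
    indices = range(len(vectors))
    if metric == "l2":
        ranked = sorted(indices, key=lambda i: squared_l2(query, vectors[i]))
    elif metric == "cosine":
        ranked = sorted(indices,
                        key=lambda i: -cosine_similarity(query, vectors[i]))
    else:
        raise ValueError(f"unknown metric: {metric}")
    return ranked[:k]
```

On unit-normalized vectors the two metrics produce the same ranking, because the squared L2 distance equals 2 - 2 * cosine similarity. The choice only matters when embeddings are not normalized. The cost of a flat search grows linearly with the number of stored vectors, which is why ANN indexes such as HNSW become attractive at larger scale.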

2a) Sentence Transformers: They worked well on a previous problem of mine, so I used them as a starting point. They are trained with a Siamese-network approach, so I was aware that they are better suited to sentence-level representations than to paragraph-level representations.

~ 29%

Error analysis: 50% of the errors in paragraph-based retrieval were due to chunking, and 50% were due to fetching the wrong contexts.

Solution 1: Fix chunking with LayoutPDFReader

Solution 2: Use other encoders, such as TF-IDF and DRAGON. Reason why TF-IDF worked and DRAGON did not: dense retrievers are poor at exact phrase matching.
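To make the exact-matching point concrete, here is a minimal TF-IDF retriever in plain Python. It is a sketch under stated assumptions, not the code used in this repository: the tokenizer, the smoothed IDF formula, and all function names are choices made for the example. A term contributes to the score only when the same token appears in both the query and the chunk, which is why rare exact phrases are matched reliably.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence, Tuple

SparseVec = Dict[str, float]


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _normalize(vec: SparseVec) -> SparseVec:
    """Scale a sparse vector to unit length (empty if it has no weight)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(chunks: Sequence[str]) -> Tuple[Dict[str, float], List[SparseVec]]:
    """Return (idf table, one unit-length TF-IDF vector per chunk)."""
    tokenized = [tokenize(c) for c in chunks]
    n = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF: rare terms get a high weight, common terms a low one.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [
        _normalize({t: count * idf[t] for t, count in Counter(toks).items()})
        for toks in tokenized
    ]
    return idf, vectors


def search(query: str, idf: Dict[str, float],
           vectors: Sequence[SparseVec], k: int) -> List[int]:
    """Rank chunks by cosine similarity; chunks with no shared term are dropped."""
    counts = Counter(t for t in tokenize(query) if t in idf)
    q = _normalize({t: c * idf[t] for t, c in counts.items()})
    scored = [
        (sum(w * vec.get(t, 0.0) for t, w in q.items()), i)
        for i, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scored[:k] if score > 0]
```

Because `search` returns an empty list when the query shares no term with any chunk, a sparse retriever like this also gives a cheap signal for questions that have no answer in the documents. The flip side is that it cannot match paraphrases, which is where a dense encoder would be expected to help.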
