
Conversation

@Khan-Ramsha

Fixes for #20

Went through all the discussion in PR #17; here's what I got. Currently there are two endpoints, /ask and /ask-llm, and we're focusing on the former.
The /ask endpoint fetches document content related to the user query and uses that to answer the question. But there's a twist: it runs through two pipelines. The first one generates the answer, and the second one simplifies that answer for kids. The final simplified response is then returned to the user.

But here’s the problem:
The model gives different responses every time for the same exact query — super inconsistent. Also, it doesn’t seem to actually use the document content for answering the question. On top of that, it sometimes cuts off the response halfway.

And when I tested the current main branch code, the model was generating super vague answers, kind of like it was “talking to itself.” It kept repeating phrases and even echoed the user’s prompt.

Here's how I fixed all that: when I peeked into the code, it turned out the model wasn't even getting any document content tied to the user's query. So how could it use something it never got?
To fix that, I included the document content directly in the prompt, so now the model actually has context to work with. Next, to stop the "self-talking" behavior, I adjusted the prompt to explicitly tell the model not to repeat or talk to itself, which helped cut out the weirdness. Since the model's behavior was non-deterministic (random), I set do_sample = False to make the outputs consistent every time. I also set temperature, top_k, and top_p to None; these settings basically kill the randomness, and randomness is what we don't want here.
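For illustration, a minimal sketch of what passing the retrieved document content into the prompt can look like, assuming a LangChain ChatPromptTemplate; the prompt wording and example values here are made up, not the exact project code:

```python
# Minimal sketch (illustrative, not the exact project prompt): embed the
# retrieved document text in the prompt so the model has context to use.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Answer the question using the context below. "
    "Do not repeat the question or talk to yourself.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

question = "What is the Sugar desktop?"                  # example query
doc_content = "Sugar is a learning platform for kids."   # e.g. doc_result.page_content
messages = prompt.format_messages(context=doc_content, question=question)
```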

Then, I used return_full_text = False so we only get the generated output (not the full prompt + output). This stopped the model from repeating parts of the prompt. And repetition_penalty helps prevent repetitive phrases in the response.

And yep — added max_new_tokens to control response length. Without it, the model either cuts off early or overshoots and starts rambling. This keeps answers clean and to the point.
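Putting the settings from the last few paragraphs together, here is a minimal sketch assuming a Hugging Face transformers text-generation pipeline; the repetition_penalty and max_new_tokens values are illustrative, not necessarily what the PR uses:

```python
# Minimal sketch of the generation settings described above; values are
# illustrative, not the PR's exact configuration.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

prompt_text = "Context: ...retrieved document text...\nQuestion: What is Sugar?\nAnswer:"

outputs = generator(
    prompt_text,
    do_sample=False,         # greedy decoding, so outputs are deterministic
    temperature=None,        # sampling knobs unused when do_sample=False
    top_k=None,
    top_p=None,
    return_full_text=False,  # return only the generated text, not prompt + text
    repetition_penalty=1.2,  # discourage repeated phrases (illustrative value)
    max_new_tokens=256,      # bound response length (illustrative value)
)
answer = outputs[0]["generated_text"]
```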

Lastly, I added a trim_incomplete_sentences() function to clean up outputs that were being cut off mid-sentence — now responses end properly.
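The helper itself could be as small as the sketch below; this is a rough approximation, and the actual trim_incomplete_sentences() in the PR may differ:

```python
# Rough sketch of a trim helper; the PR's actual implementation may differ.
def trim_incomplete_sentences(text: str) -> str:
    """Drop a trailing fragment so the response ends on a complete sentence."""
    # Find the last sentence-ending punctuation mark.
    last_end = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
    if last_end == -1:
        return text.strip()  # no complete sentence found; return as-is
    return text[: last_end + 1].strip()
```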

All of this genuinely improved the model’s responses.
They now stick to the document content and the final output is clear, consistent, and simplified for kids.

@MostlyKIGuess
Member

There seem to be changes which aren't needed; all we had to do for that issue was re-write the prompt. Could you also share the responses to the questions, as discussed at the end of #17?

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 5, 2025

Results Before (i.e. current main branch):

The model had "talking to itself" behavior: it kept repeating the same phrases or echoing its own answer, and sometimes drifted off-topic or added unrelated explanations.

[screenshot]

[screenshot]

The model attempts to generate Mandarin Chinese out of nowhere:

[screenshot]

Results After (after my changes):

Prompt passed to the model for simplifying the response:
[screenshot]

Final Response:
[screenshot]

Prompt passed to the model for simplifying the response:

[screenshot]

Final Response:

[screenshot]

@Khan-Ramsha
Author

There seem to be changes which aren't needed; all we had to do for that issue was re-write the prompt. Could you also share the responses to the questions, as discussed at the end of #17?

Yep, the main issue was with the prompt. But I also found that tweaking some other parameters like temperature and repetition penalty helped improve the model's responses overall. It reduced repetition and made the answers simpler and more consistent, especially for the child-friendly prompt.

@Khan-Ramsha
Author

@chimosky have a look!

@MostlyKIGuess
Member

But here’s the problem: The model gives different responses every time for the same exact query — super inconsistent. Also, it doesn’t seem to actually use the document content for answering the question. On top of that, it sometimes cuts off the response halfway.

No, this is not a problem; the answer usually depends on the seed you are on. You can potentially set seed = -1 (prev) or 0 for consistent responses. That is how LLMs work. And it does retrieve information on /ask; we can set the embeddings to show later, but we don't need it right now. Also, even if it did not, I do not see what part of the change would make it work.
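For context, the exact seed convention (e.g. -1 vs 0) depends on the backend in use; with the Hugging Face transformers stack, fixing the seed for reproducible sampling can look like this minimal sketch (the seed value is arbitrary):

```python
# Minimal sketch: fixing the seed so sampled generations are reproducible.
from transformers import set_seed

set_seed(0)  # same seed + same prompt + same settings -> same sampled output
```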

And when I tested the current main branch code, the model was generating super vague answers, kind of like it was “talking to itself.” It kept repeating phrases and even echoed the user’s prompt.

Here's how I fixed all that: when I peeked into the code, it turned out the model wasn't even getting any document content tied to the user's query. So how could it use something it never got? To fix that, I included the document content directly in the prompt, so now the model actually has context to work with. Next, to stop the "self-talking" behavior, I adjusted the prompt to explicitly tell the model not to repeat or talk to itself, which helped cut out the weirdness. Since the model's behavior was non-deterministic (random), I set do_sample = False to make the outputs consistent every time. I also set temperature, top_k, and top_p to None; these settings basically kill the randomness, and randomness is what we don't want here.

Then, I used return_full_text = False so we only get the generated output (not the full prompt + output). This stopped the model from repeating parts of the prompt. And repetition_penalty helps prevent repetitive phrases in the response.

This can be useful, but we can decide on how we want to keep it. Also, that didn't stop your model from repeating; it was the max tokens that you reduced.

And yep — added max_new_tokens to control response length. Without it, the model either cuts off early or overshoots and starts rambling. This keeps answers clean and to the point.

This was discussed in #17 on what to keep, we ended up keeping 1024 because we need enough length to go through some topics.

Member

@MostlyKIGuess left a comment


Suggestions:
1. The code can be refactored to just changing the prompt.
2. As discussed in #17, the reason we wrote everything on a new line was because we want the code to be readable, so all the changes to functions aren't required.
3. Max_Token is something we need to discuss; we can keep repetition penalty and DoLa layers.

@Khan-Ramsha
Author

No, this is not a problem; the answer usually depends on the seed you are on. You can potentially set seed = -1 (prev) or 0 for consistent responses. That is how LLMs work.

Yes, the model can give different outputs for the same query depending on the seed, which is pretty normal for LLMs. But since we have do_sample=False, the model should already be generating deterministic responses.

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 5, 2025

And it does retrieve information on /ask; we can set the embeddings to show later, but we don't need it right now.

I agree, got it! The context is being passed directly in the prompt template, so the model has the information upfront. As for the embeddings, yes, they can be shown later if needed.

@Khan-Ramsha
Author

This can be useful, but we can decide on how we want to keep it. Also, that didn't stop your model from repeating; it was the max tokens that you reduced.

You are right about max tokens playing a role in reducing repetition, but other params like repetition_penalty also contribute to that.

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 5, 2025

This was discussed in #17 on what to keep, we ended up keeping 1024 because we need enough length to go through some topics.

Ohh, will increase it to 1024 then. But before that, we should figure out whether we're going with max_len or max_new_tokens.
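For reference, assuming these map onto Hugging Face transformers' max_length and max_new_tokens: max_length caps the prompt tokens plus the generated tokens combined, while max_new_tokens caps only the newly generated tokens, so it is usually the easier one for controlling response length. A minimal sketch:

```python
# Sketch: the two length controls when generating with transformers.
# max_length     caps prompt tokens + generated tokens combined.
# max_new_tokens caps generated tokens only, regardless of prompt length.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
out = generator("What is Sugar?", do_sample=False, max_new_tokens=1024)
```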

@Khan-Ramsha
Author

Suggestions: 1. The code can be refactored to just changing the prompt. 2. As discussed in #17, the reason we wrote everything on a new line was because we want the code to be readable, so all the changes to functions aren't required. 3. Max_Token is something we need to discuss; we can keep repetition penalty and DoLa layers.

Changes done.

Results:

[screenshot]

Response:

[screenshot]

[screenshot]

Response:

[screenshot]

All set, take a look!

Member

@MostlyKIGuess left a comment


Reviewed, yet to test.

rag_agent.py Outdated
first_response = first_chain.invoke({
"query": question,
"context": doc_result.page_content
"question": question,
Member


Again, I don't see how this change is required here (lines 245-292).

Author


Again, I don't see how this change is required here (lines 245-292).

ChatPromptTemplate expects the key to be 'question', not 'query', since in the prompt I am passing '{question}'.

Author


So if I change this key from 'question' to 'query' again, I need to make changes in the prompt too.
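For illustration, a small sketch of the key/placeholder matching, assuming LangChain's ChatPromptTemplate (the prompt text and example values are made up):

```python
# Sketch: the dict keys passed to invoke() must match the {placeholders}
# used in the template, so a '{question}' placeholder needs a "question" key.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

messages = prompt.invoke({
    "context": "...retrieved document text...",
    "question": "What is the Sugar desktop?",
})
# Passing {"query": ...} instead would fail with a missing-variable error
# unless the template itself is changed to use {query}.
```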

@chimosky
Member

chimosky commented Apr 7, 2025

Haven't tested or looked at it deeply, but I agree with @MostlyKIGuess.

There are way too many unnecessary changes.

Also see making commits to write better commit messages.

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 8, 2025

Haven't tested or looked at it deeply, but I agree with @MostlyKIGuess.

There are way too many unnecessary changes.

Also see making commits to write better commit messages.

Fair point! I agree with @MostlyKIGuess too. I went a bit overboard and ended up changing the whole codebase rather than just focusing on the prompts, which would have been enough to solve the issue. That said, my latest commit includes only prompt engineering and model param tweaking. Please take a moment to review it and let me know if you have any suggestions or changes.

@Khan-Ramsha
Author

Also see making commits to write better commit messages.

Sure

@chimosky
Member

chimosky commented Apr 8, 2025

You're yet to push said latest commits.

@Khan-Ramsha
Author

You're yet to push said latest commits.

I had already pushed the latest commits before asking @MostlyKIGuess for review. Let me know if anything's still missing and I will make the changes.

@chimosky
Member

chimosky commented Apr 9, 2025

Looking at the last changes you pushed, nothing has changed.

I'd made my comments after looking at those changes, I'm wondering why you're saying you pushed changes when I don't see any.

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 9, 2025

Looking at the last changes you pushed, nothing has changed.

I'd made my comments after looking at those changes, I'm wondering why you're saying you pushed changes when I don't see any.

Correct, I did not push anything after this commit: 7fd33c6

The only code I changed was refining the prompt by passing the fetched document content to it, and adding a few model parameters.
I’ve tested the updated prompt along with the model parameters, but I think the prompt already guides the model clearly enough, so the extra model parameters might be unnecessary.

Should I go ahead and remove them?
If you were pointing to any other unnecessary changes, could you please point them out? I’ll clean those up too.

Thank you for your guidance. I really appreciate it and want to make sure I'm aligned with what's expected.

@Khan-Ramsha
Author

so the extra model parameters might be unnecessary.

[screenshot]

These are the ones I meant, as I believe the prompt already guides the model well enough in that direction. Still, I'll test it once by removing these parameters to see if there's any noticeable difference.

@chimosky
Member

Should I go ahead and remove them?

I figured that was the unspoken agreement considering we'd said it was unnecessary.

@Khan-Ramsha
Author

Should I go ahead and remove them?

I figured that was the unspoken agreement considering we'd said it was unnecessary.

Got it, sir! Removing all unnecessary changes.

Removed unnecessary generation parameters (e.g. return_full_text, do_sample, temperature, top_p, top_k, repetition_penalty) from the model pipelines.

This change is part of ongoing prompt engineering and response refinement to improve model output quality and maintain cleaner configuration defaults.
- Removed threshold logic from get_relevant_document() because even though related documents were retrieved, they often scored below the 0.5 threshold, causing it to return None and break the flow.
- Now returning top 2 documents directly for better coverage and reliability.
- In run(), avoided using RunnablePassthrough and chain_input because the model wasn't getting full context, which led to hallucinated answers.
- Instead, formatting and passing the document context explicitly into the prompt.
Cleaned up an extra trailing comma after 'EleutherAI/gpt-neo-1.3B' in the --model choices list. No functional impact, just tidying up to avoid confusion.
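For context, the retrieval change described in the commit message above could look roughly like the sketch below, assuming a LangChain-style vector store; the function signatures and the vector_store parameter are illustrative, not the exact rag_agent.py code:

```python
# Rough sketch of the retrieval change described above (not the exact code):
# return the top 2 documents directly instead of filtering by a 0.5 score
# threshold, then format their text into the prompt as explicit context.
def get_relevant_document(vector_store, query: str):
    # similarity_search returns the k most similar documents, no score filter
    return vector_store.similarity_search(query, k=2)

def format_context(docs) -> str:
    # join the retrieved pages so they can be passed into the prompt directly
    return "\n\n".join(doc.page_content for doc in docs)
```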
@Khan-Ramsha
Author

Just noticed temperature and a few others didn’t show up properly in the commit message — removed do_sample, temperature, top_p, top_k, repetition_penalty, and return_full_text for cleaner defaults. Missed listing them all due to formatting, sorry about that!

@Khan-Ramsha
Author

Noticed the model still adds explanations and repeats content in responses. I'll tweak the prompts further and push updated changes within the next few hours.

…trol

Improves the prompt used for rewriting answers for children to be more direct, strict, and aligned with project tone.
Also sets the temperature parameter to better control the creativity of the model responses.
…nd top document

These help trace model behavior for prompt tuning and answer quality.
@Khan-Ramsha
Author

Done! I tried to improve the model responses by making only the necessary changes. Have a look, sir, and let me know if anything still needs to be updated.

rag_agent.py Outdated
Comment on lines 205 to 206
if not context_text.strip():
    return "I couldn't find an answer in the documents."
Member


Not all answers are supposed to be in the documents, the documents are there for Sugar specific context not as the sole source of answers.

Author


Totally agree, the model should still answer even if there's no doc context. That was a temp check I forgot to remove. Will fix it.

@chimosky
Member

Your commit message in 396edd4, I'm guessing you meant parsing and not passing.

Would also be nice if you didn't use all the width provided by your editor, makes the commit message easier to read.

Member

@chimosky left a comment


Debug prints shouldn't be part of your changes as they're not useful to anyone else but you while you were debugging, please don't include them.

@chimosky
Member

@MostlyKIGuess could you review and test this, I can't test at the moment.

@Khan-Ramsha
Author

Your commit message in 396edd4, I'm guessing you meant parsing and not passing.

I actually meant "passing". Just feeding the formatted context straight into the prompt

@Khan-Ramsha
Author

Would also be nice if you didn't use all the width provided by your editor, makes the commit message easier to read.

Alright, will keep that in mind for future commits

@Khan-Ramsha
Author

Debug prints shouldn't be part of your changes as they're not useful to anyone else but you while you were debugging, please don't include them.

Sure. I will make sure to remove those before I push my changes to the repository.

Removed the check that required answers to always come from documents.
This update allows the model to provide answers even when no relevant
context is found in the documents.