
Conversation

@Khan-Ramsha

Fixes for #20

Went through all the discussion in PR #17; here's what I got. Currently there are two endpoints, /ask and /ask-llm, and we're focusing on the former.
The /ask endpoint fetches document content related to the user query and uses that to answer the question. But there's a twist: it runs through two pipelines. The first one generates the answer, and the second one simplifies that answer for kids. The final simplified response is then returned to the user.

But here’s the problem:
The model gives different responses every time for the same exact query — super inconsistent. Also, it doesn’t seem to actually use the document content for answering the question. On top of that, it sometimes cuts off the response halfway.

And when I tested the current main branch code, the model was generating super vague answers, kind of like it was “talking to itself.” It kept repeating phrases and even echoed the user’s prompt.

Here's how I fixed all that: when I peeked into the code, it turned out the model wasn't even getting any document content tied to the user's query. So how could it use something it never got?
To fix that, I included the document content directly in the prompt, so now the model actually has context to work with. Next, to stop the "self-talking" behavior, I adjusted the prompt to explicitly tell the model not to repeat or talk to itself, which helped cut out the weirdness. Since the model's behavior was non-deterministic (random), I set do_sample = False to make the outputs consistent every time. I also set temperature, top_k, and top_p to None; these settings basically kill the randomness, and randomness is what we don't want here.
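For illustration, a minimal sketch of what passing the retrieved document content into the prompt can look like, assuming a LangChain ChatPromptTemplate; the prompt wording and example values here are made up, not the exact project code:

```python
# Minimal sketch (illustrative, not the exact project prompt): embed the
# retrieved document text in the prompt so the model has context to use.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Answer the question using the context below. "
    "Do not repeat the question or talk to yourself.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

question = "What is the Sugar desktop?"                  # example query
doc_content = "Sugar is a learning platform for kids."   # e.g. doc_result.page_content
messages = prompt.format_messages(context=doc_content, question=question)
```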

Then, I used return_full_text = False so we only get the generated output (not the full prompt + output). This stopped the model from repeating parts of the prompt. And repetition_penalty helps prevent repetitive phrases in the response.

And yep — added max_new_tokens to control response length. Without it, the model either cuts off early or overshoots and starts rambling. This keeps answers clean and to the point.
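Putting the settings from the last few paragraphs together, here is a minimal sketch assuming a Hugging Face transformers text-generation pipeline; the repetition_penalty and max_new_tokens values are illustrative, not necessarily what the PR uses:

```python
# Minimal sketch of the generation settings described above; values are
# illustrative, not the PR's exact configuration.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

prompt_text = "Context: ...retrieved document text...\nQuestion: What is Sugar?\nAnswer:"

outputs = generator(
    prompt_text,
    do_sample=False,         # greedy decoding, so outputs are deterministic
    temperature=None,        # sampling knobs unused when do_sample=False
    top_k=None,
    top_p=None,
    return_full_text=False,  # return only the generated text, not prompt + text
    repetition_penalty=1.2,  # discourage repeated phrases (illustrative value)
    max_new_tokens=256,      # bound response length (illustrative value)
)
answer = outputs[0]["generated_text"]
```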

Lastly, I added a trim_incomplete_sentences() function to clean up outputs that were being cut off mid-sentence — now responses end properly.
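The helper itself could be as small as the sketch below; this is a rough approximation, and the actual trim_incomplete_sentences() in the PR may differ:

```python
# Rough sketch of a trim helper; the PR's actual implementation may differ.
def trim_incomplete_sentences(text: str) -> str:
    """Drop a trailing fragment so the response ends on a complete sentence."""
    # Find the last sentence-ending punctuation mark.
    last_end = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
    if last_end == -1:
        return text.strip()  # no complete sentence found; return as-is
    return text[: last_end + 1].strip()
```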

All of this genuinely improved the model’s responses.
They now stick to the document content and the final output is clear, consistent, and simplified for kids.

@MostlyKIGuess
Member

There seem to be changes which aren't needed; all we had to do for that issue was re-write the prompt. Could you also share the responses to the questions, as discussed at the end of #17?

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 5, 2025

Results Before (i.e. current main branch):

The model had "talking to itself" behavior: it kept repeating the same phrases or echoing its own answer, and sometimes drifted off-topic or added unrelated explanations.

[screenshot]

[screenshot]

The model attempts to generate Mandarin Chinese out of nowhere:

[screenshot]

Results After (after my changes):

Prompt passed to the model for simplifying the response:
[screenshot]

Final Response:
[screenshot]

Prompt passed to the model for simplifying the response:

[screenshot]

Final Response:

[screenshot]

@Khan-Ramsha
Author

There seem to be changes which aren't needed; all we had to do for that issue was re-write the prompt. Could you also share the responses to the questions, as discussed at the end of #17?

Yep, the main issue was with the prompt. But I also found that tweaking some other parameters like temperature and repetition penalty helped improve the model's responses overall. It reduced repetition and made the answers simpler and more consistent, especially for the child-friendly prompt.

@Khan-Ramsha
Author

@chimosky have a look!

@MostlyKIGuess
Member

But here’s the problem: The model gives different responses every time for the same exact query — super inconsistent. Also, it doesn’t seem to actually use the document content for answering the question. On top of that, it sometimes cuts off the response halfway.

No, this is not a problem; the answer usually depends on the seed you are on. You can potentially set seed = -1 (prev) or 0 for consistent responses. That is how LLMs work. And it does retrieve information on /ask; we can set the embeddings to show later, but we don't need it right now. Also, even if it did not, I do not see what part of the change would make it work.
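For context, the exact seed convention (e.g. -1 vs 0) depends on the backend in use; with the Hugging Face transformers stack, fixing the seed for reproducible sampling can look like this minimal sketch (the seed value is arbitrary):

```python
# Minimal sketch: fixing the seed so sampled generations are reproducible.
from transformers import set_seed

set_seed(0)  # same seed + same prompt + same settings -> same sampled output
```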

And when I tested the current main branch code, the model was generating super vague answers, kind of like it was “talking to itself.” It kept repeating phrases and even echoed the user’s prompt.

Here's how I fixed all that: when I peeked into the code, it turned out the model wasn't even getting any document content tied to the user's query. So how could it use something it never got? To fix that, I included the document content directly in the prompt, so now the model actually has context to work with. Next, to stop the "self-talking" behavior, I adjusted the prompt to explicitly tell the model not to repeat or talk to itself, which helped cut out the weirdness. Since the model's behavior was non-deterministic (random), I set do_sample = False to make the outputs consistent every time. I also set temperature, top_k, and top_p to None; these settings basically kill the randomness, and randomness is what we don't want here.

Then, I used return_full_text = False so we only get the generated output (not the full prompt + output). This stopped the model from repeating parts of the prompt. And repetition_penalty helps prevent repetitive phrases in the response.

This can be useful, but we can decide on how we want to keep it. Also, that didn't stop your model from repeating; it was the max tokens that you reduced.

And yep — added max_new_tokens to control response length. Without it, the model either cuts off early or overshoots and starts rambling. This keeps answers clean and to the point.

This was discussed in #17 on what to keep, we ended up keeping 1024 because we need enough length to go through some topics.

Member

@MostlyKIGuess left a comment


Suggestions:
1. The code can be refactored to just changing the prompt.
2. As discussed in #17, the reason we wrote everything on a new line was because we want the code to be readable, so all the changes to functions aren't required.
3. Max_Token is something we need to discuss; we can keep repetition penalty and DoLa layers.

@Khan-Ramsha
Author

No, this is not a problem; the answer usually depends on the seed you are on. You can potentially set seed = -1 (prev) or 0 for consistent responses. That is how LLMs work.

Yes, the model can give different outputs for the same query depending on the seed, which is pretty normal for LLMs. But since we have do_sample=False, the model should already be generating deterministic responses.

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 5, 2025

And it does retrieve information on /ask; we can set the embeddings to show later, but we don't need it right now.

I agree, got it! The context is being passed directly in the prompt template, so the model has the information upfront. As for the embeddings, yes, they can be shown later if needed.

@Khan-Ramsha
Author

This can be useful, but we can decide on how we want to keep it. Also, that didn't stop your model from repeating; it was the max tokens that you reduced.

You are right about max tokens playing a role in reducing repetition, but other params like repetition_penalty also contribute to that.

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 5, 2025

This was discussed in #17 on what to keep, we ended up keeping 1024 because we need enough length to go through some topics.

Ohh, will increase it to 1024 then. But before that, we should figure out whether we're going with max_len or max_new_tokens.
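For reference, assuming these map onto Hugging Face transformers' max_length and max_new_tokens: max_length caps the prompt tokens plus the generated tokens combined, while max_new_tokens caps only the newly generated tokens, so it is usually the easier one for controlling response length. A minimal sketch:

```python
# Sketch: the two length controls when generating with transformers.
# max_length     caps prompt tokens + generated tokens combined.
# max_new_tokens caps generated tokens only, regardless of prompt length.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
out = generator("What is Sugar?", do_sample=False, max_new_tokens=1024)
```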

@Khan-Ramsha
Author

Suggestions: 1. The code can be refactored to just changing the prompt. 2. As discussed in #17, the reason we wrote everything on a new line was because we want the code to be readable, so all the changes to functions aren't required. 3. Max_Token is something we need to discuss; we can keep repetition penalty and DoLa layers.

Changes done.

Results:

[screenshot]

Response:

[screenshot]

[screenshot]

Response:

[screenshot]

All set, take a look!

Member

@MostlyKIGuess left a comment


Reviewed, yet to test.

rag_agent.py Outdated
first_response = first_chain.invoke({
"query": question,
"context": doc_result.page_content
"question": question,
Member


Again, I don't see how this change is required here (lines 245-292).

Author


Again, I don't see how this change is required here (lines 245-292).

ChatPromptTemplate expects the key to be 'question', not 'query', since in the prompt I am passing '{question}'.

Author


So if I change this key from 'question' to 'query' again, I need to make changes in the prompt too.
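For illustration, a small sketch of the key/placeholder matching, assuming LangChain's ChatPromptTemplate (the prompt text and example values are made up):

```python
# Sketch: the dict keys passed to invoke() must match the {placeholders}
# used in the template, so a '{question}' placeholder needs a "question" key.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

messages = prompt.invoke({
    "context": "...retrieved document text...",
    "question": "What is the Sugar desktop?",
})
# Passing {"query": ...} instead would fail with a missing-variable error
# unless the template itself is changed to use {query}.
```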

@chimosky
Member

chimosky commented Apr 7, 2025

Haven't tested or looked at it deeply, but I agree with @MostlyKIGuess.

There are way too many unnecessary changes.

Also see making commits to write better commit messages.

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 8, 2025

Haven't tested or looked at it deeply, but I agree with @MostlyKIGuess.

There are way too many unnecessary changes.

Also see making commits to write better commit messages.

Fair point! I agree with @MostlyKIGuess too. I went a bit overboard and ended up changing the whole codebase rather than just focusing on the prompts, which would have been enough to solve the issue. That said, my latest commit includes only prompt engineering and model param tweaking. Please take a moment to review it and let me know if you have any suggestions or changes.

@Khan-Ramsha
Author

Also see making commits to write better commit messages.

Sure

@chimosky
Member

chimosky commented Apr 8, 2025

You're yet to push said latest commits.

@Khan-Ramsha
Author

You're yet to push said latest commits.

I had already pushed the latest commits before asking @MostlyKIGuess for review. Let me know if anything's still missing and I will make the changes.

@chimosky
Member

chimosky commented Apr 9, 2025

Looking at the last changes you pushed, nothing has changed.

I'd made my comments after looking at those changes, I'm wondering why you're saying you pushed changes when I don't see any.

@Khan-Ramsha
Author

Khan-Ramsha commented Apr 9, 2025

Looking at the last changes you pushed, nothing has changed.

I'd made my comments after looking at those changes, I'm wondering why you're saying you pushed changes when I don't see any.

Correct, I did not push anything after this commit: 7fd33c6

The only code I changed was refining the prompt by passing the fetched document content to it, and adding a few model parameters.
I’ve tested the updated prompt along with the model parameters, but I think the prompt already guides the model clearly enough, so the extra model parameters might be unnecessary.

Should I go ahead and remove them?
If you were pointing to any other unnecessary changes, could you please point them out? I’ll clean those up too.

Thank you for your guidance. I really appreciate it and want to make sure I'm aligned with what's expected.

@Khan-Ramsha
Author

so the extra model parameters might be unnecessary.

[screenshot]

These are the ones I meant, as I believe the prompt already guides the model well enough in that direction. Still, I'll test it once by removing these parameters to see if there's any noticeable difference.

@chimosky
Member

Should I go ahead and remove them?

I figured that was the unspoken agreement considering we'd said it was unnecessary.

@Khan-Ramsha
Author

Should I go ahead and remove them?

I figured that was the unspoken agreement considering we'd said it was unnecessary.

Got it, sir! Removing all unnecessary changes.

Removed unnecessary generation parameters (e.g. return_full_text, do_sample, temperature, top_p, top_k, repetition_penalty) from the model pipelines.

This change is part of ongoing prompt engineering and response refinement to improve model output quality and maintain cleaner configuration defaults.
- Removed threshold logic from get_relevant_document() because even though related documents were retrieved, they often scored below the 0.5 threshold, causing it to return None and break the flow.
- Now returning top 2 documents directly for better coverage and reliability.
- In run(), avoided using RunnablePassthrough and chain_input because the model wasn't getting full context, which led to hallucinated answers.
- Instead, formatting and passing the document context explicitly into the prompt.
Cleaned up an extra trailing comma after 'EleutherAI/gpt-neo-1.3B' in the --model choices list. No functional impact, just tidying up to avoid confusion.
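For context, the retrieval change described in the commit message above could look roughly like the sketch below, assuming a LangChain-style vector store; the function signatures and the vector_store parameter are illustrative, not the exact rag_agent.py code:

```python
# Rough sketch of the retrieval change described above (not the exact code):
# return the top 2 documents directly instead of filtering by a 0.5 score
# threshold, then format their text into the prompt as explicit context.
def get_relevant_document(vector_store, query: str):
    # similarity_search returns the k most similar documents, no score filter
    return vector_store.similarity_search(query, k=2)

def format_context(docs) -> str:
    # join the retrieved pages so they can be passed into the prompt directly
    return "\n\n".join(doc.page_content for doc in docs)
```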
@Khan-Ramsha
Author

Just noticed temperature and a few others didn’t show up properly in the commit message — removed do_sample, temperature, top_p, top_k, repetition_penalty, and return_full_text for cleaner defaults. Missed listing them all due to formatting, sorry about that!

@Khan-Ramsha
Author

Noticed the model still adds explanations and repeats content in responses. I'll tweak the prompts further and push updated changes within the next few hours.

…trol

Improves the prompt used for rewriting answers for children to be more direct, strict, and aligned with project tone.
Also sets the temperature parameter to better control the creativity of the model responses.
…nd top document

These help trace model behavior for prompt tuning and answer quality.
@Khan-Ramsha
Author

Done! I tried to improve the model responses by making only the necessary changes. Have a look, sir, and let me know if anything still needs to be updated.

rag_agent.py Outdated
Comment on lines 205 to 206
if not context_text.strip():
    return "I couldn't find an answer in the documents."
Member


Not all answers are supposed to be in the documents, the documents are there for Sugar specific context not as the sole source of answers.

Author


Totally agree, the model should still answer even if there's no doc context. That was a temp check I forgot to remove. Will fix it.

@chimosky
Member

Your commit message in 396edd4, I'm guessing you meant parsing and not passing.

Would also be nice if you didn't use all the width provided by your editor, makes the commit message easier to read.

Member

@chimosky left a comment


Debug prints shouldn't be part of your changes as they're not useful to anyone else but you while you were debugging, please don't include them.

@chimosky
Member

@MostlyKIGuess could you review and test this, I can't test at the moment.

@Khan-Ramsha
Author

Your commit message in 396edd4, I'm guessing you meant parsing and not passing.

I actually meant "passing". Just feeding the formatted context straight into the prompt

@Khan-Ramsha
Author

Would also be nice if you didn't use all the width provided by your editor, makes the commit message easier to read.

Alright, will keep that in mind for future commits

@Khan-Ramsha
Author

Debug prints shouldn't be part of your changes as they're not useful to anyone else but you while you were debugging, please don't include them.

Sure. I will make sure to remove those before I push my changes to the repository.

Removed the check that required answers to always come from documents.
This update allows the model to provide answers even when no relevant
context is found in the documents.