fix: fix byod flow and update integrated vectorization to work with byod flow #1905
Purpose
This pull request introduces significant enhancements to the document chunking and embedding pipeline, primarily by adding a custom Azure Function and updating the Azure Cognitive Search skillset to use a more structured approach for handling document pages and their chunk numbers. It also improves the robustness of tests and adjusts authentication settings for Azure Search. The most important changes are grouped below:
1. Integrated Vectorization Pipeline Improvements
Added a custom Azure Function (combine_pages_and_chunknos.py) to combine page texts and chunk numbers into a single array of objects, exposed as a WebApiSkill endpoint for use in the Azure Cognitive Search skillset. [1] [2] [3]
Updated the skillset to use the resulting pages_with_chunks structure.
2. Test Robustness and Coverage
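For reference, the combine step described under "Integrated Vectorization Pipeline Improvements" might look roughly like the sketch below. It is not the code from combine_pages_and_chunknos.py: the input field names (pages, chunk_nos), the per-object keys (page_text, chunk_no), and both function names are assumptions made for illustration. Only the pages_with_chunks output name and the custom skill request/response envelope (values, recordId, data, errors, warnings) come from the PR description and the documented WebApiSkill contract.

```python
from typing import Any


def combine_record(data: dict[str, Any]) -> list[dict[str, Any]]:
    """Pair each page text with its chunk number (field names are assumed)."""
    pages = data.get("pages") or []
    chunk_nos = data.get("chunk_nos") or []
    if len(pages) != len(chunk_nos):
        raise ValueError(
            f"pages ({len(pages)}) and chunk_nos ({len(chunk_nos)}) differ in length"
        )
    return [
        {"page_text": page, "chunk_no": chunk_no}
        for page, chunk_no in zip(pages, chunk_nos)
    ]


def handle_skill_request(body: dict[str, Any]) -> dict[str, Any]:
    """Process a custom skill payload and return one result per input record."""
    results = []
    for record in body.get("values", []):
        entry: dict[str, Any] = {
            "recordId": record.get("recordId"),
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            combined = combine_record(record.get("data") or {})
            entry["data"] = {"pages_with_chunks": combined}
        except ValueError as exc:
            # Report the failure on this record only; other records still succeed.
            entry["errors"].append({"message": str(exc)})
        results.append(entry)
    return {"values": results}
```

Keeping the logic in a plain function that maps a dict to a dict makes it easy to unit test without the Azure Functions host. An HTTP-triggered function would only need to parse the request JSON, call handle_skill_request, and serialize the result.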
3. Application Logic and Authentication Adjustments
Adjusted the Azure Search authentication settings to use user_assigned_managed_identity and include the managed identity resource ID, improving security and flexibility.
Does this introduce a breaking change?
How to Test
What to Check
Verify that the following are valid
Other Information