Harsh-Microsoft commented Sep 19, 2025

Purpose

This pull request extends the document chunking and embedding pipeline. It adds a custom Azure Function and updates the Azure Cognitive Search skillset so that document pages are carried through the pipeline together with their chunk numbers in a single structured form. It also makes the tests more robust and adjusts the authentication settings for Azure Search. The most important changes are grouped below:

1. Integrated Vectorization Pipeline Improvements

  • Added a custom Azure Function (combine_pages_and_chunknos.py) to combine page texts and chunk numbers into a single array of objects, exposed as a WebApiSkill endpoint for use in the Azure Cognitive Search skillset. [1] [2] [3]
  • Updated the integrated vectorization skillset to:
    • Include the new WebApiSkill for combining pages and chunk numbers.
    • Adjust the AzureOpenAIEmbeddingSkill and index projections to operate on the new pages_with_chunks structure.
    • Add a ShaperSkill to bundle metadata into a complex object for each chunk.
    • Update field mappings and skillset composition accordingly. [1] [2] [3] [4] [5] [6]
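
As a rough illustration of the combining step, the sketch below pairs page texts with chunk numbers and wraps the result in the `values` / `recordId` / `data` envelope that custom Web API skills exchange with the indexer. It is not the code from combine_pages_and_chunknos.py: the input field names `pages` and `chunk_numbers`, the per-item keys `page_text` and `chunk_no`, and both function names are assumptions made for this example; only the output name `pages_with_chunks` comes from the description above.

```python
from typing import Any


def combine_pages_and_chunk_numbers(
    pages: list[str], chunk_numbers: list[int]
) -> list[dict[str, Any]]:
    """Pair each page text with its chunk number (hypothetical key names)."""
    if len(pages) != len(chunk_numbers):
        raise ValueError("pages and chunk_numbers must have the same length")
    return [
        {"page_text": text, "chunk_no": number}
        for text, number in zip(pages, chunk_numbers)
    ]


def handle_skill_request(body: dict[str, Any]) -> dict[str, Any]:
    """Process a custom Web API skill payload of the form {"values": [...]}.

    Each input record is answered with a record carrying the same recordId,
    so the indexer can match outputs back to inputs.
    """
    results = []
    for record in body.get("values", []):
        data = record.get("data", {})
        try:
            combined = combine_pages_and_chunk_numbers(
                data.get("pages", []), data.get("chunk_numbers", [])
            )
            results.append(
                {
                    "recordId": record["recordId"],
                    "data": {"pages_with_chunks": combined},
                    "errors": None,
                    "warnings": None,
                }
            )
        except ValueError as exc:
            # Report the failure for this record only; other records still succeed.
            results.append(
                {
                    "recordId": record["recordId"],
                    "data": {},
                    "errors": [{"message": str(exc)}],
                    "warnings": None,
                }
            )
    return {"values": results}
```

In an Azure Function, a handler of this shape would typically be called with the parsed JSON body of the HTTP request and its return value serialized back as the response.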

2. Test Robustness and Coverage

  • Refactored functional tests to validate response structure and key fields rather than exact citation content, making tests more resilient to dynamic data (e.g., SAS tokens).
  • Improved request validation in tests by checking the presence and structure of key request fields instead of matching the entire payload.
  • Updated skillset creation tests to verify the presence and configuration of the new WebApiSkill and related outputs.
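
The structure-based assertions described above might look roughly like the helpers below. This is a sketch under assumptions, not the tests from this PR: the helper names, the citation keys `title` and `url`, and the example field names are all hypothetical. The idea is to compare URLs with the query string (where a SAS token would live) removed, and to check request fields by presence and type instead of comparing whole payloads.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def assert_citation_structure(citation: dict, expected_base_url: str) -> None:
    """Check key fields of a citation without depending on the SAS token."""
    for key in ("title", "url"):  # hypothetical required keys
        assert key in citation, f"missing key: {key}"
    assert strip_query(citation["url"]) == expected_base_url


def assert_request_has_fields(payload: dict, required: dict[str, type]) -> None:
    """Check that each required field is present and has the expected type."""
    for name, expected_type in required.items():
        assert name in payload, f"missing field: {name}"
        assert isinstance(payload[name], expected_type), f"wrong type for {name}"
```

A test would then call, for example, `assert_request_has_fields(request_json, {"messages": list})`, which keeps passing when unrelated fields are added to the payload.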

3. Application Logic and Authentication Adjustments

  • Changed the Azure Search authentication type to use user_assigned_managed_identity and include the managed identity resource ID, improving security and flexibility.
  • Fixed citation URL generation to ensure a SAS token placeholder is always present, preventing issues with missing tokens.
  • Cleaned up unused field mappings in the search configuration dictionary for clarity and maintainability.
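
To make the citation URL fix concrete, here is a minimal sketch of guaranteeing that a placeholder is present and substituting a real token later. The placeholder string, the function names, and the choice to append the placeholder at the end of the URL are assumptions for illustration; the PR description does not state how the actual code does this.

```python
# Assumed placeholder value; the real constant in the codebase may differ.
SAS_TOKEN_PLACEHOLDER = "_SAS_TOKEN_PLACEHOLDER_"


def ensure_sas_placeholder(url: str) -> str:
    """Append the placeholder unless the URL already contains it."""
    if SAS_TOKEN_PLACEHOLDER in url:
        return url
    return f"{url}{SAS_TOKEN_PLACEHOLDER}"


def resolve_sas_placeholder(url: str, sas_token: str) -> str:
    """Swap the placeholder for a query string carrying the given token."""
    return url.replace(SAS_TOKEN_PLACEHOLDER, f"?{sas_token}")
```

Because `ensure_sas_placeholder` is idempotent, it can be applied to every citation URL regardless of whether an earlier step already added the placeholder.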

Does this introduce a breaking change?

  • Yes
  • No

How to Test

  • Get the code
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install
  • Test the code

What to Check

Verify that the following are valid

  • ...

Other Information

Prajwal-Microsoft merged commit 28e0a1e into Azure-Samples:dev on Sep 30, 2025
5 checks passed