Skip to content

Add Compliance tests for GB18030-2022 #118075

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

jozkee
Copy link
Member

@jozkee jozkee commented Jul 25, 2025

No description provided.

@jozkee jozkee requested a review from tarekgh July 25, 2025 22:41
@jozkee jozkee self-assigned this Jul 25, 2025
@Copilot Copilot AI review requested due to automatic review settings July 25, 2025 22:41
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces comprehensive compliance testing for the GB18030-2022 Chinese character encoding standard by adding a new test project with thorough test coverage across multiple .NET API surfaces.

Key changes:

  • Creates a complete test suite for GB18030 encoding compliance validation
  • Implements tests for string operations, file I/O, directory operations, console I/O, and encoding roundtrips
  • Adds test data file with Chinese character test cases for both short and extended length scenarios

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

Show a summary per file
File Description
Compliance.Tests.csproj New test project configuration with GB18030 test files and dependencies
TestHelper.cs Central test helper providing GB18030 encoding, test data parsing, and culture/comparison configurations
StringTests.cs Comprehensive string operation tests including comparison, search, equality, and manipulation methods
FileTests.cs File I/O operation tests for reading/writing text with GB18030 encoding
FileTestBase.cs Abstract base class for file system operation tests with GB18030 filenames
FileInfoTests.cs FileInfo-specific implementation of file system tests
DirectoryTests.cs Directory operation tests using GB18030 directory names
DirectoryTestBase.cs Abstract base class for directory operation tests with nested directory support
DirectoryInfoTests.cs DirectoryInfo-specific implementation with enumeration tests
ConsoleTests.cs Console I/O tests for standard input/output/error streams with GB18030 encoding
EncodingTests.cs Basic encoding roundtrip validation tests
Level3+Amendment_Test_Data_for_Mid_to_High_Volume_cases.txt Test data file containing GB18030 character sequences
Common.Tests.slnx Solution file update to include the new compliance test project
Comments suppressed due to low confidence (1)

src/libraries/Common/tests/ComplianceTests/GB18030/Tests/StringTests.cs:13

  • [nitpick] The constant name 'Dummy' is ambiguous and doesn't clearly indicate its purpose as a replacement character for testing. Consider renaming to 'ReplacementChar' or 'TestReplacementCharacter'.
    private const string Dummy = "\uFFFF";

Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-meta
See info in area-owners.md if you want to be subscribed.

[Theory]
[MemberData(nameof(TestHelper.DecodedTestData), MemberType = typeof(TestHelper))]
public void CreateSubdirectory(string decoded)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do clean up after executing every test in this file?

Copy link
Member

@tarekgh tarekgh Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This apply on other directory and file tests if it is not already doing that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be cleaning up, TempDirectory gets deleted on Dispose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not seeing the dispose is explicitly called here. Do you mean the temp directory has a finalizer that eventually likely calling the dispose? should we do teh following instead?

using var subDirInfo = TempDirectory.CreateSubdirectory(gb18030Line);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dispose is in the base class:

@tarekgh
Copy link
Member

tarekgh commented Jul 28, 2025

@jozkee I'll wait your changes specifically in the string tests and then will take another look then.

Also, it would be good if you could add a specific net core app test that tests Char, CharUnicodeInfo, and Regex with GB18030 too.

@jozkee
Copy link
Member Author

jozkee commented Aug 8, 2025

CharUnicodeInfo

@tarekgh This test looks like is matching all codepoints with their category in UnicodeData.txt, does that suffice?

@tarekgh
Copy link
Member

tarekgh commented Aug 8, 2025

This test looks like is matching all codepoints with their category in UnicodeData.txt, does that suffice?

This test currently verifies whatever data is taken from the Unicode file UnicodeData.txt. If we happened to use an older version that doesn’t include the GB18030 characters, the test would still pass. While this is unlikely, I suggest adding checks in your new tests for some of the newly added GB18030 characters. That way, regardless of the Unicode version in use, we can be confident that GB18030 is covered.

@jozkee jozkee requested a review from tarekgh August 11, 2025 17:16
@jozkee
Copy link
Member Author

jozkee commented Aug 11, 2025

Tests for Regex, Char and CharUnicodeInfo are adding a considerable chunk of runtime in net481 (2.5 mins in my PC), it finishes in ~8s otherwise. For comparison, it takes ~30s in net10.

Time increase is because I made theories for all characters in GB18030 ranges which are ~9k.

We could skip them in net481 as you suggested, @tarekgh.

Test run for D:\runtime\artifacts\bin\Compliance.Tests\Debug\net481\Compliance.Tests.dll (.NETFramework,Version=v4.8.1)
A total of 1 test files matched the specified pattern.
Results File: D:\runtime\artifacts\bin\Compliance.Tests\Debug\net481\TestResults\dacantu_DESKTOP_2025-08-11_12_12_29.trx
Html test results file : D:\runtime\artifacts\bin\Compliance.Tests\Debug\net481\TestResults\TestResult_dacantu_DESKTOP_20250811_121511.html

Passed!  - Failed:     0, Passed: 60107, Skipped:     0, Total: 60107, Duration: 2 m 38 s - Compliance.Tests.dll (net481)

{
string utf32String = testCase.Utf32CodeValue;
Assert.True(char.IsLetter(utf32String, 0));
Assert.True(char.IsLetterOrDigit(utf32String, 0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does all added characters are letters? no other Unicode categories we should check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I discovered this because I didn't want to write code to handle all categories.

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jozkee!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants