Commit db33e23

feat: add Simplified Chinese documentation and bump version to 0.1.2
- Add README_zh-hans.md (Simplified Chinese documentation)
- Reformat code for readability
1 parent dc6c1df commit db33e23

File tree

8 files changed

+198
-45
lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -5,3 +5,6 @@
 # Avoid committing pubspec.lock for library packages; see
 # https://dart.dev/guides/libraries/private-files#pubspeclock.
 pubspec.lock
+
+.idea/
+.vscode/

CHANGELOG.md

Lines changed: 5 additions & 2 deletions
@@ -3,5 +3,8 @@
 - Initial version.
 
 ## 0.1.1
-s
-- Update LICENSE.
+
+- Update LICENSE.
+
+## 0.1.2
+- add README_zh-hans.md and format the dart code.

README.md

Lines changed: 15 additions & 7 deletions
@@ -1,5 +1,13 @@
+English | [简体中文](README_zh-hans.md)
 # text_counter
 
+<p align="center">
+<a href="https://github.com/hexwarrior6/text_counter"><img alt="GitHub repo" src="https://img.shields.io/github/last-commit/hexwarrior6/text_counter?logo=github"></a>
+<a href="https://gitee.com/HexWarrior6/text_counter"><img alt="Gitee repo" src="https://img.shields.io/badge/Gitee-repo-red?logo=gitee"></a>
+<a href="https://pub.dev/packages/text_counter"><img alt="pub version" src="https://img.shields.io/pub/v/text_counter?logo=dart"></a>
+<a href="https://github.com/hexwarrior6/text_counter/blob/master/LICENSE"><img alt="LICENSE" src="https://img.shields.io/github/license/hexwarrior6/text_counter.svg?color=blue"></a>
+</p>
+
 A lightweight Dart utility for accurately counting characters and words in **over 100 languages**, including CJK (Chinese, Japanese, Korean), RTL (Right-to-Left) scripts like Arabic and Hebrew, and mixed-language texts.
 
 `text_counter` uses **Microsoft Word-compatible word counting logic**, ensuring consistent and familiar results across different writing systems. This makes it ideal for applications requiring accurate text metrics — such as content editors, writing tools, and input validation systems.
@@ -23,7 +31,7 @@ Add this to your package's `pubspec.yaml`:
 
 ```yaml
 dependencies:
-  text_counter: ^0.1.0
+  text_counter: ^0.1.2
 ```
 
 Then run:
@@ -54,11 +62,11 @@ void main() {
 
 ## 🗺️ Supported Languages
 
-| Script Type | Language Codes |
-| ------------------------- | ------------------------------------------------------------ |
-| **CJK (Character-based)** | `zh`, `yue`, `ja`, `ko`, `th`, `hi`, `bn`, `ta`, `te`, `kn`,`ml`, `si`, `km`, `my`, `lo`, `tl`, `jw`, `su`, `bo`, `dz` |
-| **RTL (Word-based)** | `ml`, `si`, `km`, `my`, `lo`, `tl`, `jw`, `su`, `bo`, `dz` |
-| **Latin (Word-based)** | All other ISO 639-1 language codes not listed above, including: `en`,`de`,`es`,`fr`,`it`,`pt`,`nl`,`tr`,`pl`,`ca`,`sv`,`id`,`fi`,`vi`,`hi`,`uk`,`el`,`ms`,`cs`,`ro`,`da`,`hu`,`no`,`th`... |
+| Script Type | Language Codes |
+|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **CJK (Character-based)** | `zh`, `yue`, `ja`, `ko`, `th`, `hi`, `bn`, `ta`, `te`, `kn`,`ml`, `si`, `km`, `my`, `lo`, `tl`, `jw`, `su`, `bo`, `dz` |
+| **RTL (Word-based)** | `ml`, `si`, `km`, `my`, `lo`, `tl`, `jw`, `su`, `bo`, `dz` |
+| **Latin (Word-based)** | All other ISO 639-1 language codes not listed above, including: `en`, `de`, `es`, `fr`, `it`, `pt`, `nl`, `tr`, `pl`, `ca`, `sv`, `id`, `fi`, `vi`, `hi`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `no`, `th` ... |
 
 > If no `languageCode` is provided, the library automatically detects script types and applies appropriate counting rules.
 
@@ -80,7 +88,7 @@ void main() {
 ## 📚 API Reference
 
 ```dart
-int TextCounter.count(String text, {String? languageCode});
+int TextCounter.count(String text, {String? languageCode})
 ```
 
 - `text`: The input string to be analyzed.
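
How a `languageCode` might map to a counting strategy can be sketched as follows. The two code sets are copied from `lib/text_counter.dart` in this commit (which lists `ar`, `he`, `fa`, `ur`, `ps`, `ug`, `sd` as the RTL codes); the `CountingStrategy` enum, the `strategyFor` helper, and the order of the checks are assumptions made for illustration, since the body of `TextCounter.count` is not part of this diff.

```dart
/// Hypothetical strategy labels; the package does not expose this enum.
enum CountingStrategy { characterBased, rtlWordBased, wordBased, autoDetect }

// Code sets copied from lib/text_counter.dart in this commit.
const characterBasedLanguages = {
  'zh', 'yue', 'ja', 'ko', 'th', 'hi', 'bn', 'ta', 'te', 'kn',
  'ml', 'si', 'km', 'my', 'lo', 'tl', 'jw', 'su', 'bo', 'dz',
};
const rtlLanguages = {'ar', 'he', 'fa', 'ur', 'ps', 'ug', 'sd'};

/// Assumed dispatch order: no code means auto-detection, then the two
/// sets are checked, and every other code falls back to word counting.
CountingStrategy strategyFor(String? languageCode) {
  if (languageCode == null) return CountingStrategy.autoDetect;
  if (characterBasedLanguages.contains(languageCode)) {
    return CountingStrategy.characterBased;
  }
  if (rtlLanguages.contains(languageCode)) {
    return CountingStrategy.rtlWordBased;
  }
  return CountingStrategy.wordBased;
}

void main() {
  for (final code in ['zh', 'ar', 'en', null]) {
    print('$code -> ${strategyFor(code)}');
  }
}
```

Under this assumed order, `hi` and `th` resolve to character-based counting because they are members of the first set.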

README_zh-hans.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+[English](README.md) | 简体中文
+# text_counter
+
+<p align="center">
+<a href="https://github.com/hexwarrior6/text_counter"><img alt="GitHub repo" src="https://img.shields.io/github/last-commit/hexwarrior6/text_counter?logo=github"></a>
+<a href="https://gitee.com/HexWarrior6/text_counter"><img alt="Gitee repo" src="https://img.shields.io/badge/Gitee-repo-red?logo=gitee"></a>
+<a href="https://pub.dev/packages/text_counter"><img alt="pub version" src="https://img.shields.io/pub/v/text_counter?logo=dart"></a>
+<a href="https://github.com/hexwarrior6/text_counter/blob/master/LICENSE"><img alt="License" src="https://img.shields.io/github/license/hexwarrior6/text_counter.svg?color=blue"></a>
+</p>
+
+A lightweight Dart utility for accurately counting characters and words in **over 100 languages**, including CJK (Chinese, Japanese, Korean), right-to-left (RTL) scripts such as Arabic and Hebrew, and mixed-language text.
+
+`text_counter` uses **Microsoft Word-compatible word counting logic**, ensuring consistent and familiar results across writing systems. This makes it well suited to applications that need accurate text metrics, such as content editors, writing tools, and input validation systems.
+
+## ✨ Features
+
+- ✅ Follows Microsoft Word's word-counting rules:
+  - Words are separated by spaces and common punctuation.
+  - Hyphenated words (e.g. "state-of-the-art") count as a single word.
+  - Numbers and symbols are handled according to context.
+
+- ✅ Language-aware counting strategies:
+  - **CJK (character-based)**: each character is counted individually (for Chinese, Japanese, Korean, and similar).
+  - **Latin and RTL scripts (word-based)**: standard word counting with appropriate delimiters and tokenization rules.
+
+- 🔍 **Automatic language/script detection for mixed text**
+
+- **Lightweight and dependency-free**: no external libraries required.
+
+- 🌐 **Supports over 100 languages out of the box**
+
+## 📦 Installation
+
+Add this to your project's `pubspec.yaml`:
+
+```yaml
+dependencies:
+  text_counter: ^0.1.2
+```
+
+Then run:
+
+```bash
+dart pub get
+```
+
+## 🧪 Usage
+
+### Basic example
+
+```dart
+import 'package:text_counter/text_counter.dart';
+
+void main() {
+  print('Chinese: ${TextCounter.count("你好,世界", languageCode: "zh")}'); // 5
+  print('Japanese: ${TextCounter.count("こんにちは世界", languageCode: "ja")}'); // 7
+  print('Korean: ${TextCounter.count("안녕하세요 세상", languageCode: "ko")}'); // 7
+  print('Arabic: ${TextCounter.count("مرحبا بالعالم", languageCode: "ar")}'); // 2
+  print('Hebrew: ${TextCounter.count("שלום עולם", languageCode: "he")}'); // 2
+  print('English: ${TextCounter.count("Hello world", languageCode: "en")}'); // 2
+
+  const mixed = "Hello 你好 مرحبا こんにちは";
+  print('Mixed text "$mixed": ${TextCounter.count(mixed)}'); // 9
+}
+```
+
+## 🗺️ Supported Languages
+
+| Script Type | Language Codes |
+|----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **CJK (character-based)** | `zh`, `yue`, `ja`, `ko`, `th`, `hi`, `bn`, `ta`, `te`, `kn`,`ml`, `si`, `km`, `my`, `lo`, `tl`, `jw`, `su`, `bo`, `dz` |
+| **RTL (word-based)** | `ml`, `si`, `km`, `my`, `lo`, `tl`, `jw`, `su`, `bo`, `dz` |
+| **Latin (word-based)** | All other ISO 639-1 language codes not listed above, including: `en`, `de`, `es`, `fr`, `it`, `pt`, `nl`, `tr`, `pl`, `ca`, `sv`, `id`, `fi`, `vi`, `hi`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `no`, `th` ... |
+
+> If no `languageCode` is provided, the library automatically detects script types and applies appropriate counting rules.
+
+## 🛠️ How It Works
+
+- For **CJK languages**, each ideographic or logographic character is counted individually.
+- For **Latin and RTL scripts**, word boundaries are detected with whitespace and punctuation patterns similar to Microsoft Word's.
+- In **mixed-language text**, the counter dynamically switches counting methods according to the script in use.
+
+## 🧩 Use Cases
+
+- Content management systems
+- Rich text editors
+- Writing apps with word limits
+- Language learning platforms
+- Analytics dashboards
+- Form validation tools
+
+## 📚 API Reference
+
+```dart
+int TextCounter.count(String text, {String? languageCode})
+```
+
+- `text`: The input string to be analyzed.
+- `languageCode`: Optional BCP 47 language code (e.g., `"en"` for English, `"zh"` for Chinese). If omitted, auto-detection is used.
+
+## 📎 License
+
+MIT License - see [LICENSE](https://yuanbao.tencent.com/chat/naQivTmsDa/LICENSE) for details

example/text_counter_example.dart

Lines changed: 4 additions & 2 deletions
@@ -4,9 +4,11 @@ void main() {
   print('Chinese: ${TextCounter.count("你好,世界", languageCode: "zh")}'); // 5
   print('Japanese: ${TextCounter.count("こんにちは世界", languageCode: "ja")}'); // 7
   print('Korean: ${TextCounter.count("안녕하세요 세상", languageCode: "ko")}'); // 7
-  print('Arabic: ${TextCounter.count("مرحبا بالعالم", languageCode: "ar")}'); // 2
+  print(
+      'Arabic: ${TextCounter.count("مرحبا بالعالم", languageCode: "ar")}'); // 2
   print('Hebrew: ${TextCounter.count("שלום עולם", languageCode: "he")}'); // 2
-  print('English: ${TextCounter.count("Hello world", languageCode: "en")}'); // 2
+  print(
+      'English: ${TextCounter.count("Hello world", languageCode: "en")}'); // 2
 
   const mixed = "Hello 你好 مرحبا こんにちは";
   print('Mixed Text "$mixed": ${TextCounter.count(mixed)}'); // 9
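
The character-based method behind the `zh`, `ja`, and `ko` results is not shown in this commit. One rule that reproduces the three expected values in the example is to count every non-whitespace code point, punctuation included. The sketch below assumes that rule; `countCharacters` is a hypothetical name, not the package's API.

```dart
/// Hypothetical helper: counts every non-whitespace code point,
/// so punctuation such as the full-width comma is included.
int countCharacters(String text) =>
    text.replaceAll(RegExp(r'\s+'), '').runes.length;

void main() {
  print(countCharacters("你好,世界")); // 5
  print(countCharacters("こんにちは世界")); // 7
  print(countCharacters("안녕하세요 세상")); // 7
}
```

Counting `runes` rather than `length` keeps characters outside the Basic Multilingual Plane from being counted twice.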

lib/text_counter.dart

Lines changed: 47 additions & 23 deletions
@@ -2,13 +2,37 @@
 class TextCounter {
   // Languages that are counted by characters (e.g., Chinese, Japanese, Korean, Thai)
   static final Set<String> _characterBasedLanguages = {
-    'zh', 'yue', 'ja', 'ko', 'th', 'hi', 'bn', 'ta', 'te', 'kn',
-    'ml', 'si', 'km', 'my', 'lo', 'tl', 'jw', 'su', 'bo', 'dz'
+    'zh',
+    'yue',
+    'ja',
+    'ko',
+    'th',
+    'hi',
+    'bn',
+    'ta',
+    'te',
+    'kn',
+    'ml',
+    'si',
+    'km',
+    'my',
+    'lo',
+    'tl',
+    'jw',
+    'su',
+    'bo',
+    'dz'
   };
 
   // RTL languages requiring special tokenization (Arabic/Hebrew family)
   static final Set<String> _rtlLanguages = {
-    'ar', 'he', 'fa', 'ur', 'ps', 'ug', 'sd'
+    'ar',
+    'he',
+    'fa',
+    'ur',
+    'ps',
+    'ug',
+    'sd'
   };
 
   /// Main counting method
@@ -40,41 +64,41 @@ class TextCounter {
   static int _countRtlWords(String text) {
     // Remove all punctuation (keep Arabic and Hebrew characters)
     final cleaned = text.replaceAllMapped(
-      RegExp(r'[^\u0600-\u06FF\u0590-\u05FF\s]'),
-      (match) => ''
-    );
-
+        RegExp(r'[^\u0600-\u06FF\u0590-\u05FF\s]'), (match) => '');
+
     if (cleaned.trim().isEmpty) return 0;
-
+
     // Split by whitespace, Arabic tatdeel (ـ), and Hebrew maqaf (־)
-    return cleaned.split(RegExp(r'[\s\u0640\u05BE]+'))
-        .where((word) => word.isNotEmpty)
-        .length;
+    return cleaned
+        .split(RegExp(r'[\s\u0640\u05BE]+'))
+        .where((word) => word.isNotEmpty)
+        .length;
   }
 
   /// Count mixed-language text (automatically identifies different parts)
   static int _countMixed(String text) {
     // Match CJK characters (Chinese, Japanese, Korean)
-    final cjkChars = RegExp(
-      r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]'
-    ).allMatches(text).length;
+    final cjkChars =
+        RegExp(r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]')
+            .allMatches(text)
+            .length;
 
     // Match RTL text (Arabic, Hebrew, etc.)
     final rtlText = text.replaceAllMapped(
-      RegExp(r'[^\u0600-\u06FF\u0590-\u05FF\s]'),
-      (match) => ''
-    );
+        RegExp(r'[^\u0600-\u06FF\u0590-\u05FF\s]'), (match) => '');
     final rtlWords = _countRtlWords(rtlText);
 
     // Process remaining text (mainly Latin-based)
     final remainingText = text
-        .replaceAll(RegExp(r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]'), ' ')
-        .replaceAll(RegExp(r'[\u0600-\u06FF\u0590-\u05FF]'), ' ');
-
+        .replaceAll(
+            RegExp(r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]'),
+            ' ')
+        .replaceAll(RegExp(r'[\u0600-\u06FF\u0590-\u05FF]'), ' ');
+
     final otherWords = remainingText.trim().isEmpty
-      ? 0
-      : remainingText.trim().split(RegExp(r'\s+')).length;
+        ? 0
+        : remainingText.trim().split(RegExp(r'\s+')).length;
 
     return cjkChars + rtlWords + otherWords;
   }
-}
+}
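
The two private methods in this hunk can be tried outside the class with a standalone sketch. The character ranges and split pattern are copied from the diff; the top-level names `countRtlWords` and `countMixed` are hypothetical, and `replaceAll` stands in for `replaceAllMapped` with an empty replacement, which should produce the same string.

```dart
// Character classes copied from the regular expressions in this commit.
final _cjk = RegExp(r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]');
final _rtl = RegExp(r'[\u0600-\u06FF\u0590-\u05FF]');
final _notRtlOrSpace = RegExp(r'[^\u0600-\u06FF\u0590-\u05FF\s]');

/// Counts Arabic/Hebrew words, following the logic of `_countRtlWords`.
int countRtlWords(String text) {
  // Drop everything except Arabic/Hebrew characters and whitespace.
  final cleaned = text.replaceAll(_notRtlOrSpace, '');
  if (cleaned.trim().isEmpty) return 0;
  // Split on whitespace, tatweel (U+0640), and maqaf (U+05BE).
  return cleaned
      .split(RegExp(r'[\s\u0640\u05BE]+'))
      .where((word) => word.isNotEmpty)
      .length;
}

/// CJK characters + RTL words + remaining whitespace-separated tokens,
/// following the logic of `_countMixed`.
int countMixed(String text) {
  final cjkChars = _cjk.allMatches(text).length;
  final rtlWords = countRtlWords(text);
  final remaining = text.replaceAll(_cjk, ' ').replaceAll(_rtl, ' ').trim();
  final otherWords =
      remaining.isEmpty ? 0 : remaining.split(RegExp(r'\s+')).length;
  return cjkChars + rtlWords + otherWords;
}

void main() {
  // 1 Latin word + 2 Han characters + 1 Arabic word + 5 hiragana.
  print(countMixed("Hello 你好 مرحبا こんにちは")); // 9
  print(countRtlWords("שלום־עולם")); // 2
}
```

Note that the remaining-text step splits only on whitespace, so a run of punctuation left between spaces counts as one token.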

pubspec.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 name: text_counter
 description: A lightweight Dart utility for counting characters and words in multiple languages including CJK, RTL, and mixed texts.
-version: 0.1.1
+version: 0.1.2
 repository: https://github.com/hexwarrior6/text_counter
 environment:
   sdk: '>=3.0.0 <4.0.0'

test/text_counter_test.dart

Lines changed: 20 additions & 10 deletions
@@ -26,29 +26,38 @@ void main() {
     // --- English counting ---
     test('English (en)', () {
       expect(TextCounter.count("Hello world", languageCode: "en"), equals(2));
-      expect(TextCounter.count("This is a test.", languageCode: "en"), equals(4));
-      expect(TextCounter.count("One multiple spaces", languageCode: "en"), equals(3));
+      expect(
+          TextCounter.count("This is a test.", languageCode: "en"), equals(4));
+      expect(TextCounter.count("One multiple spaces", languageCode: "en"),
+          equals(3));
     });
 
     // --- Arabic counting ---
     test('Arabic (ar)', () {
       expect(TextCounter.count("مرحبا بالعالم", languageCode: "ar"), equals(2));
-      expect(TextCounter.count("كيف حالك اليوم؟", languageCode: "ar"), equals(3));
-      expect(TextCounter.count("السلام عليكم ورحمة الله", languageCode: "ar"), equals(4));
+      expect(
+          TextCounter.count("كيف حالك اليوم؟", languageCode: "ar"), equals(3));
+      expect(TextCounter.count("السلام عليكم ورحمة الله", languageCode: "ar"),
+          equals(4));
     });
 
     // --- Hebrew counting ---
     test('Hebrew (he)', () {
       expect(TextCounter.count("שלום עולם", languageCode: "he"), equals(2));
-      expect(TextCounter.count("מה שלומך היום?", languageCode: "he"), equals(3));
+      expect(
+          TextCounter.count("מה שלומך היום?", languageCode: "he"), equals(3));
       expect(TextCounter.count("תודה רבה לך", languageCode: "he"), equals(3));
     });
 
     // --- Automatic detection of mixed text ---
     test('Mixed text detection', () {
-      expect(TextCounter.count("Hello 你好 مرحبا こんにちは"), equals(9)); // 1 + 2 + 1 + 5
-      expect(TextCounter.count("The quick brown fox jumps over the lazy dog. 你好吗"), equals(12));
-      expect(TextCounter.count("مرحبا Hello كيف الحال?こんにちは"), equals(10)); // ar + en + ar + ja
+      expect(TextCounter.count("Hello 你好 مرحبا こんにちは"),
+          equals(9)); // 1 + 2 + 1 + 5
+      expect(
+          TextCounter.count("The quick brown fox jumps over the lazy dog. 你好吗"),
+          equals(12));
+      expect(TextCounter.count("مرحبا Hello كيف الحال?こんにちは"),
+          equals(10)); // ar + en + ar + ja
     });
 
     // --- Edge cases ---
@@ -59,7 +68,8 @@ void main() {
       expect(TextCounter.count(".,!@#\$% ^&*()"), equals(2)); // punctuation with a space
       expect(TextCounter.count(" Hello world "), equals(2)); // leading and trailing spaces
       expect(TextCounter.count("你好,, 世界!!"), equals(6)); // Chinese mixed with punctuation
-      expect(TextCounter.count("שלום־עולם", languageCode: "he"), equals(2)); // Hebrew hyphen (maqaf)
+      expect(TextCounter.count("שלום־עולם", languageCode: "he"),
+          equals(2)); // Hebrew hyphen (maqaf)
     });
   });
-}
+}
