
Commit cd31d82

AntonEliatra, kolchfa-aws, and natebower authored
Add Classic token filter docs (opensearch-project#7918)
* adding classic token filter docs opensearch-project#7876
* Updating details as per comments
* Update classic.md
* Update classic.md
* Update classic.md
* Update _analyzers/token-filters/classic.md
* Apply suggestions from code review

Signed-off-by: AntonEliatra <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
1 parent 5f53f5b commit cd31d82

File tree

2 files changed (+94, -1 lines)


_analyzers/token-filters/classic.md

Lines changed: 93 additions & 0 deletions
---
layout: default
title: Classic
parent: Token filters
nav_order: 50
---
# Classic token filter

The classic token filter is designed to work with the classic tokenizer. It processes tokens by applying the following common transformations, which aid in text analysis and search:
- Removal of possessive endings such as *'s*. For example, *John's* becomes *John*.
- Removal of periods from acronyms. For example, *D.A.R.P.A.* becomes *DARPA*.
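
You can preview these transformations without creating an index by specifying the tokenizer and filter inline in an `_analyze` request (a quick sketch; the token metadata in the response may vary by version):

```json
POST /_analyze
{
  "tokenizer": "classic",
  "filter": ["classic"],
  "text": "John's"
}
```
{% include copy-curl.html %}

The response should contain the single token `John`, with the possessive ending removed.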

## Example

The following example request creates a new index named `custom_classic_filter` and configures an analyzer with the `classic` filter:

```json
PUT /custom_classic_filter
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_classic": {
          "type": "custom",
          "tokenizer": "classic",
          "filter": ["classic"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /custom_classic_filter/_analyze
{
  "analyzer": "custom_classic",
  "text": "John's co-operate was excellent."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "John",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<APOSTROPHE>",
      "position": 0
    },
    {
      "token": "co",
      "start_offset": 7,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "operate",
      "start_offset": 10,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "was",
      "start_offset": 18,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "excellent",
      "start_offset": 22,
      "end_offset": 31,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
```
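
To see the acronym transformation as well, you can analyze an acronym against the same index (a hypothetical follow-up request; the exact offsets and token type in the response may vary by version). The filter should strip the periods and produce the single token `DARPA`:

```json
POST /custom_classic_filter/_analyze
{
  "analyzer": "custom_classic",
  "text": "D.A.R.P.A."
}
```
{% include copy-curl.html %}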

_analyzers/token-filters/index.md

Lines changed: 1 addition & 1 deletion
Token filter | Underlying Lucene token filter | Description
[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into their equivalent basic Latin characters. <br> - Folds half-width katakana character variants into their equivalent kana characters.
- `classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
+ [`classic`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/classic) | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
`common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
`conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script.
`decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9).

0 commit comments
