Merged
5 changes: 5 additions & 0 deletions .gitattributes
@@ -1,2 +1,7 @@
testdata/* -text
maint/manifest-* -text
maint/ucptestdata -text
*.sh text eol=lf
pcre2-config.in text eol=lf
RunTest text eol=lf
RunGrepTest text eol=lf
54 changes: 54 additions & 0 deletions maint/FetchUcd.sh
@@ -0,0 +1,54 @@
#! /bin/sh

# Small helper script to fetch the Unicode Character Database files

VER=17.0.0

cd "$(dirname "$0")" || exit 1
pwd

rm -rf Unicode.tables/
mkdir Unicode.tables

fetch_file()
{
  url="$1"
  i="$2"

  echo "=== Downloading $i ==="
  # Download each file with curl and place into the Unicode.tables folder
  # Reject the download if there is an HTTP error
  if ! curl --fail -o "Unicode.tables/$i" -L "$url"; then
    echo "Error downloading $i" >&2
    rm -f "Unicode.tables/$i"
  fi
}

for i in BidiMirroring.txt \
CaseFolding.txt \
DerivedCoreProperties.txt \
PropertyAliases.txt \
PropertyValueAliases.txt \
PropList.txt \
ScriptExtensions.txt \
Scripts.txt \
UnicodeData.txt \
; do
fetch_file "https://www.unicode.org/Public/$VER/ucd/$i" "$i"
done

for i in DerivedBidiClass.txt \
DerivedGeneralCategory.txt \
; do
fetch_file "https://www.unicode.org/Public/$VER/ucd/extracted/$i" "$i"
done

for i in GraphemeBreakProperty.txt \
; do
fetch_file "https://www.unicode.org/Public/$VER/ucd/auxiliary/$i" "$i"
done

for i in emoji-data.txt \
; do
fetch_file "https://www.unicode.org/Public/$VER/ucd/emoji/$i" "$i"
done
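
After a fetch, it can be useful to confirm that each downloaded file really belongs to the requested release. The sketch below is not part of this change. It assumes that a UCD file names its version on its first line in the form `# Name-17.0.0.txt`, as BidiMirroring.txt and CaseFolding.txt do in this diff. The function name `header_version` is hypothetical, and files with a different header layout simply yield `None`.

```python
import re

# Assumed header form (seen in this diff): "# BidiMirroring-17.0.0.txt"
_HEADER = re.compile(r"^#\s*[A-Za-z]+-(\d+\.\d+\.\d+)\.txt\s*$")


def header_version(first_line):
    """Return the version string from a UCD header line, or None.

    Hypothetical helper: only the "# Name-X.Y.Z.txt" layout is recognised.
    """
    match = _HEADER.match(first_line)
    return match.group(1) if match else None


print(header_version("# BidiMirroring-17.0.0.txt"))  # 17.0.0
```

Comparing the result against the `VER` value used by the script would flag a stale or mismatched file before the generator scripts are run.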
2 changes: 1 addition & 1 deletion maint/GenerateCommon.py
@@ -348,7 +348,7 @@ def open_output(default):
POSSIBILITY OF SUCH DAMAGE.
-----------------------------------------------------------------------------
*/
-\n""")
+\n\n""")
return file

# End of UcpCommon.py
3 changes: 3 additions & 0 deletions maint/GenerateUcd.py
@@ -788,10 +788,13 @@ def write_bitsets(list, item_size):
just one of these tables is actually needed. When compiling the library, some
headers are needed. */


#ifndef PCRE2_PCRE2TEST
#include "pcre2_internal.h"
#endif /* PCRE2_PCRE2TEST */



/* The tables herein are needed only when UCP support is built, and in PCRE2
that happens automatically with UTF support. This module should not be
referenced otherwise, so it should not matter whether it is compiled or not.
13 changes: 9 additions & 4 deletions maint/README
@@ -60,6 +60,10 @@ GenerateUcpTables.py
GenerateCommon.py and Unicode data files. The generated file contains tables
for looking up Unicode property names.

FetchUcd.sh
A shell script to download the UCD data from the Unicode website into
the Unicode.tables directory.

FilterCoverage.py
A small helper used by the RunCoverage script.

@@ -141,10 +145,11 @@ Updating to a new Unicode release
=================================

When there is a new release of Unicode, the files in Unicode.tables must be
-refreshed from the Unicode web site. Once that is done, the four Python scripts
-that generate files from the Unicode data can be run from within the "maint"
-directory. Note that the format used for those files is not stable, and
-therefore changes to the scripts might be needed to support new versions.
+refreshed from the Unicode web site, which can be done with the script
+FetchUcd.sh. Once that is done, the four Python scripts that generate files from
+the Unicode data can be run from within the "maint" directory. Note that the
+format used for those files is not stable, and therefore changes to the scripts
+might be needed to support new versions.

Note: Previously, it was necessary to update lists of scripts and their
abbreviations by hand before running the Python scripts. This is no longer
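
Because `fetch_file` in FetchUcd.sh deletes any file whose download failed, a missing file in Unicode.tables is the visible sign of a failed refresh. The following is a minimal sketch, not part of this change, of a check that could be run before the generator scripts. The names `EXPECTED` and `missing_files` are hypothetical; the file list is copied from the loops in FetchUcd.sh.

```python
from pathlib import Path

# File names taken from the four loops in FetchUcd.sh.
EXPECTED = [
    "BidiMirroring.txt", "CaseFolding.txt", "DerivedCoreProperties.txt",
    "PropertyAliases.txt", "PropertyValueAliases.txt", "PropList.txt",
    "ScriptExtensions.txt", "Scripts.txt", "UnicodeData.txt",
    "DerivedBidiClass.txt", "DerivedGeneralCategory.txt",
    "GraphemeBreakProperty.txt", "emoji-data.txt",
]


def missing_files(directory, expected=EXPECTED):
    """Return the expected file names that are absent or empty in directory."""
    base = Path(directory)
    return [
        name for name in expected
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    # Assumes the check is run from within the "maint" directory.
    for name in missing_files("Unicode.tables"):
        print("missing:", name)
```

An empty result means every file named in the script is present and non-empty; it says nothing about whether the contents are the right version.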
8 changes: 4 additions & 4 deletions maint/Unicode.tables/BidiMirroring.txt
@@ -1,6 +1,6 @@
-# BidiMirroring-16.0.0.txt
-# Date: 2024-01-30
-# © 2024 Unicode®, Inc.
+# BidiMirroring-17.0.0.txt
+# Date: 2025-08-01
+# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
#
@@ -16,7 +16,7 @@
# value, for which there is another Unicode character that typically has a glyph
# that is the mirror image of the original character's glyph.
#
-# The repertoire covered by the file is Unicode 16.0.0.
+# The repertoire covered by the file is Unicode 17.0.0.
#
# The file contains a list of lines with mappings from one code point
# to another one for character-based mirroring.
40 changes: 34 additions & 6 deletions maint/Unicode.tables/CaseFolding.txt
@@ -1,6 +1,6 @@
-# CaseFolding-16.0.0.txt
-# Date: 2024-04-30, 21:48:11 GMT
-# © 2024 Unicode®, Inc.
+# CaseFolding-17.0.0.txt
+# Date: 2025-07-30, 23:54:36 GMT
+# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
#
@@ -18,15 +18,15 @@
# The data supports both implementations that require simple case foldings
# (where string lengths don't change), and implementations that allow full case folding
# (where string lengths may grow). Note that where they can be supported, the
-# full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
+# full case foldings are superior: for example, they allow "FUSS" and "Fuß" to match.
#
# All code points not listed in this file map to themselves.
#
# NOTE: case folding does not preserve normalization formats!
#
# For information on case folding, including how to have case folding
-# preserve normalization formats, see Section 3.13 Default Case Algorithms in
-# The Unicode Standard.
+# preserve normalization formats, see the
+# "Conformance" / "Default Case Algorithms" section of the core specification.
#
# ================================================================================
# Format
@@ -1243,7 +1243,10 @@ A7C7; C; A7C8; # LATIN CAPITAL LETTER D WITH SHORT STROKE OVERLAY
A7C9; C; A7CA; # LATIN CAPITAL LETTER S WITH SHORT STROKE OVERLAY
A7CB; C; 0264; # LATIN CAPITAL LETTER RAMS HORN
A7CC; C; A7CD; # LATIN CAPITAL LETTER S WITH DIAGONAL STROKE
A7CE; C; A7CF; # LATIN CAPITAL LETTER PHARYNGEAL VOICED FRICATIVE
A7D0; C; A7D1; # LATIN CAPITAL LETTER CLOSED INSULAR G
A7D2; C; A7D3; # LATIN CAPITAL LETTER DOUBLE THORN
A7D4; C; A7D5; # LATIN CAPITAL LETTER DOUBLE WYNN
A7D6; C; A7D7; # LATIN CAPITAL LETTER MIDDLE SCOTS S
A7D8; C; A7D9; # LATIN CAPITAL LETTER SIGMOID S
A7DA; C; A7DB; # LATIN CAPITAL LETTER LAMBDA
@@ -1616,6 +1619,31 @@ FF3A; C; FF5A; # FULLWIDTH LATIN CAPITAL LETTER Z
16E5D; C; 16E7D; # MEDEFAIDRIN CAPITAL LETTER O
16E5E; C; 16E7E; # MEDEFAIDRIN CAPITAL LETTER AI
16E5F; C; 16E7F; # MEDEFAIDRIN CAPITAL LETTER Y
16EA0; C; 16EBB; # BERIA ERFE CAPITAL LETTER ARKAB
16EA1; C; 16EBC; # BERIA ERFE CAPITAL LETTER BASIGNA
16EA2; C; 16EBD; # BERIA ERFE CAPITAL LETTER DARBAI
16EA3; C; 16EBE; # BERIA ERFE CAPITAL LETTER EH
16EA4; C; 16EBF; # BERIA ERFE CAPITAL LETTER FITKO
16EA5; C; 16EC0; # BERIA ERFE CAPITAL LETTER GOWAY
16EA6; C; 16EC1; # BERIA ERFE CAPITAL LETTER HIRDEABO
16EA7; C; 16EC2; # BERIA ERFE CAPITAL LETTER I
16EA8; C; 16EC3; # BERIA ERFE CAPITAL LETTER DJAI
16EA9; C; 16EC4; # BERIA ERFE CAPITAL LETTER KOBO
16EAA; C; 16EC5; # BERIA ERFE CAPITAL LETTER LAKKO
16EAB; C; 16EC6; # BERIA ERFE CAPITAL LETTER MERI
16EAC; C; 16EC7; # BERIA ERFE CAPITAL LETTER NINI
16EAD; C; 16EC8; # BERIA ERFE CAPITAL LETTER GNA
16EAE; C; 16EC9; # BERIA ERFE CAPITAL LETTER NGAY
16EAF; C; 16ECA; # BERIA ERFE CAPITAL LETTER OI
16EB0; C; 16ECB; # BERIA ERFE CAPITAL LETTER PI
16EB1; C; 16ECC; # BERIA ERFE CAPITAL LETTER ERIGO
16EB2; C; 16ECD; # BERIA ERFE CAPITAL LETTER ERIGO TAMURA
16EB3; C; 16ECE; # BERIA ERFE CAPITAL LETTER SERI
16EB4; C; 16ECF; # BERIA ERFE CAPITAL LETTER SHEP
16EB5; C; 16ED0; # BERIA ERFE CAPITAL LETTER TATASOUE
16EB6; C; 16ED1; # BERIA ERFE CAPITAL LETTER UI
16EB7; C; 16ED2; # BERIA ERFE CAPITAL LETTER WASSE
16EB8; C; 16ED3; # BERIA ERFE CAPITAL LETTER AY
1E900; C; 1E922; # ADLAM CAPITAL LETTER ALIF
1E901; C; 1E923; # ADLAM CAPITAL LETTER DAALI
1E902; C; 1E924; # ADLAM CAPITAL LETTER LAAM
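
The data lines above follow the pattern `code; status; mapping; # name`. As an illustration of how such lines can be consumed, here is a minimal sketch that builds a simple case-folding map from them. It is not the code used by the maint scripts. The function name `parse_simple_folds` is hypothetical, and it keeps only entries with status `C` or `S` (one code point to one code point); only `C` entries appear in this excerpt.

```python
def parse_simple_folds(lines):
    """Map code point -> folded code point for status C and S entries."""
    folds = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # comment-only or blank line
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 3 or fields[1] not in ("C", "S"):
            continue  # skip multi-character (F) and special (T) foldings
        folds[int(fields[0], 16)] = int(fields[2], 16)
    return folds


sample = [
    "# CaseFolding-17.0.0.txt",
    "A7CE; C; A7CF; # LATIN CAPITAL LETTER PHARYNGEAL VOICED FRICATIVE",
    "16EA0; C; 16EBB; # BERIA ERFE CAPITAL LETTER ARKAB",
]
folds = parse_simple_folds(sample)
print(hex(folds[0x16EA0]))  # 0x16ebb
```

Code points that do not appear in the map fold to themselves, as the file header states, so a lookup such as `folds.get(cp, cp)` covers the general case.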