From 4e7fa6b8a8b2989b4ad7dbfbe1a25ab8a3d69945 Mon Sep 17 00:00:00 2001
From: Airton Lastori
Date: Tue, 17 Jun 2025 23:34:30 -0400
Subject: [PATCH 1/8] =?UTF-8?q?Clarify=20file-type=20rules=20and=20wildcar?=
 =?UTF-8?q?d=20limits:=20*=20Format=20detection=20based=20on=20file=20exte?=
 =?UTF-8?q?nsions=20(.csv,=20.sql,=20.parquet)=20for=20IMPORT=20INTO=20job?=
 =?UTF-8?q?s.=20*=20Wildcards=20are=20accepted=20in=20file=20paths,=20but?=
 =?UTF-8?q?=20require=20one=20per=20directory=20level=20(non-recursive).?=
 =?UTF-8?q?=20*=20Add=20explicit=20note=20that=20each=20IMPORT=20INTO=20jo?=
 =?UTF-8?q?b=20must=20target=20one=20file=20format;=20wildcards=20that=20m?=
 =?UTF-8?q?atch=20mixed=20extensions=20now=20fail=20the=20pre-check=20and?=
 =?UTF-8?q?=20require=20separate=20jobs.=20*=20Explain=20that=20compressio?=
 =?UTF-8?q?n=20suffixes=20(.gz,=20.zstd,=20.snappy,=20=E2=80=A6)=20are=20i?=
 =?UTF-8?q?gnored=20when=20TiDB=20infers=20CSV/SQL/PARQUET.=20*=20Minor=20?=
 =?UTF-8?q?wording=20clean-ups=20for=20consistency;=20no=20change=20to=20s?=
 =?UTF-8?q?yntax=20or=20behavior.?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 sql-statements/sql-statement-import-into.md | 33 ++++++++++++++++++---
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/sql-statements/sql-statement-import-into.md b/sql-statements/sql-statement-import-into.md
index 3e47231d2910f..86c4d003398ee 100644
--- a/sql-statements/sql-statement-import-into.md
+++ b/sql-statements/sql-statement-import-into.md
@@ -110,17 +110,17 @@ In the left side of the `SET` expression, you can only reference a column name t
 ### fileLocation
 
-It specifies the storage location of the data file, which can be an Amazon S3 or GCS URI path, or a TiDB local file path.
+It specifies where your data files are and which files to import. You may point to a single file or use wildcards to match many files.
 
 - Amazon S3 or GCS URI path: for URI configuration details, see [URI Formats of External Storage Services](/external-storage-uri.md).
 
-- TiDB local file path: it must be an absolute path, and the file extension must be `.csv`, `.sql`, or `.parquet`. Make sure that the files corresponding to this path are stored on the TiDB node connected by the current user, and the user has the `FILE` privilege.
+- TiDB local file path: it must be an absolute path. It is recommended to use a recognized extension (`.csv`, `.sql`, `.parquet`). If the file has no extension, TiDB treats it as CSV. Ensure the specified files exist on the TiDB node where your session is connected, and confirm you have the required `FILE` privilege.
 
 > **Note:**
 >
 > If [SEM](/system-variables.md#tidb_enable_enhanced_security) is enabled in the target cluster, the `fileLocation` cannot be specified as a local file path.
 
-In the `fileLocation` parameter, you can specify a single file, or use the `*` and `[]` wildcards to match multiple files for import. Note that the wildcard can only be used in the file name, because it does not match directories or recursively match files in subdirectories. Taking files stored on Amazon S3 as examples, you can configure the parameter as follows:
+In the `fileLocation` parameter, you can specify a single file or use wildcards (`*` and `[` `]`) to match multiple files. Wildcards can be used to match sub-path segments (e.g., a directory level) and filenames. 
For example, if your files are stored on Amazon S3, you can configure the parameter like this: - Import a single file: `s3:///path/to/data/foo.csv` - Import all files in a specified path: `s3:///path/to/data/*` @@ -128,10 +128,15 @@ In the `fileLocation` parameter, you can specify a single file, or use the `*` a - Import all files with the `foo` prefix in a specified path: `s3:///path/to/data/foo*` - Import all files with the `foo` prefix and the `.csv` suffix in a specified path: `s3:///path/to/data/foo*.csv` - Import `1.csv` and `2.csv` in a specified path: `s3:///path/to/data/[12].csv` +- Import `foo.csv` files from all immediate sub-paths: `s3:///path/to/*/foo.csv` (add another `*/` for each extra directory level, for example `path/to/*/*/foo.csv`) + +> **Note:** +> +> Use one format per import job. If a wildcard matches files with different extensions (for example, `.csv` and `.sql` in the same pattern), the pre-check fails. Import each format with its own `IMPORT INTO` statement. ### Format -The `IMPORT INTO` statement supports three data file formats: `CSV`, `SQL`, and `PARQUET`. If not specified, the default format is `CSV`. +The `IMPORT INTO` statement supports three data file formats: `CSV`, `SQL`, and `PARQUET`. If the `FORMAT` clause is omitted, TiDB automatically determines the format based on the file’s extension (`.csv`, `.sql`, `.parquet`). Compressed files are supported, and the compression suffix (`.gz`, `.gzip`, `.zstd`, `.zst`, `.snappy`) is ignored when detecting the file format. If the file does not have an extension, TiDB assumes that the file format is `CSV`. ### WithOptions @@ -183,6 +188,7 @@ For TiDB Self-Managed, `IMPORT INTO ... FROM FILE` supports importing data from > > - The Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported. > - Because TiDB Lightning cannot concurrently decompress a single large compressed file, the size of the compressed file affects the import speed. It is recommended that a source file is no greater than 256 MiB after decompression. +> - When `FORMAT` is omitted, TiDB first removes one compression suffix from the file name, then inspects the remaining extension to choose `CSV`, or `SQL`. ### Global Sort @@ -288,6 +294,25 @@ If you only need to import `file-01.csv` and `file-03.csv` into the target table IMPORT INTO t FROM '/path/to/file-0[13].csv'; ``` +#### Import data from a nested directory structure + +You can use wildcards to import data from a common directory structure, such as one organized by date. For example, assume your sales data is stored in S3 and organized by year and quarter: + +``` +/path/to/sales-data/2023/q1/data.csv +/path/to/sales-data/2023/q2/data.csv +/path/to/sales-data/2023/q3/data.csv +/path/to/sales-data/2023/q4/data.csv +/path/to/sales-data/2024/q1/data.csv +... 
+``` + +To import all `data.csv` files from all quarters across all years, you can use a wildcard for each directory level: + +```sql +IMPORT INTO sales FROM '/path/to/sales-data/*/*/data.csv'; +``` + #### Import data files from Amazon S3 or GCS - Import data files from Amazon S3: From 195b356f6c4891cce1b6e4fc074af201cef41bc1 Mon Sep 17 00:00:00 2001 From: Airton Lastori Date: Wed, 18 Jun 2025 00:11:37 -0400 Subject: [PATCH 2/8] Apply suggestions from code review Applying suggestions from @gemini-code-assist Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --- sql-statements/sql-statement-import-into.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/sql-statements/sql-statement-import-into.md b/sql-statements/sql-statement-import-into.md index 86c4d003398ee..1401642bded4e 100644 --- a/sql-statements/sql-statement-import-into.md +++ b/sql-statements/sql-statement-import-into.md @@ -110,17 +110,17 @@ In the left side of the `SET` expression, you can only reference a column name t ### fileLocation -It specifies where your data files are and which files to import. You may point to a single file or use wildcards to match many files. +It specifies where your data files are and which files to import. You can point to a single file or use wildcards to match many files. - Amazon S3 or GCS URI path: for URI configuration details, see [URI Formats of External Storage Services](/external-storage-uri.md). -- TiDB local file path: it must be an absolute path. It is recommended to use a recognized extension (`.csv`, `.sql`, `.parquet`). If the file has no extension, TiDB treats it as CSV. Ensure the specified files exist on the TiDB node where your session is connected, and confirm you have the required `FILE` privilege. +- TiDB local file path: The path must be absolute. Ensure the specified path and files exist on the TiDB node where your session is connected, and confirm you have the required `FILE` privilege. > **Note:** > > If [SEM](/system-variables.md#tidb_enable_enhanced_security) is enabled in the target cluster, the `fileLocation` cannot be specified as a local file path. -In the `fileLocation` parameter, you can specify a single file or use wildcards (`*` and `[` `]`) to match multiple files. Wildcards can be used to match sub-path segments (e.g., a directory level) and filenames. For example, if your files are stored on Amazon S3, you can configure the parameter like this: +In the `fileLocation` parameter, you can specify a single file or use wildcards (`*` and `[]`) to match multiple files. Wildcards can be used to match sub-path segments (e.g., a directory level) and filenames. For example, if your files are stored on Amazon S3, you can configure the parameter like this: - Import a single file: `s3:///path/to/data/foo.csv` - Import all files in a specified path: `s3:///path/to/data/*` @@ -136,7 +136,7 @@ In the `fileLocation` parameter, you can specify a single file or use wildcards ### Format -The `IMPORT INTO` statement supports three data file formats: `CSV`, `SQL`, and `PARQUET`. If the `FORMAT` clause is omitted, TiDB automatically determines the format based on the file’s extension (`.csv`, `.sql`, `.parquet`). Compressed files are supported, and the compression suffix (`.gz`, `.gzip`, `.zstd`, `.zst`, `.snappy`) is ignored when detecting the file format. If the file does not have an extension, TiDB assumes that the file format is `CSV`. 
+The `IMPORT INTO` statement supports three data file formats: `CSV`, `SQL`, and `PARQUET`. If the `FORMAT` clause is omitted, TiDB automatically determines the format based on the file's extension (`.csv`, `.sql`, `.parquet`). Compressed files are supported, and the compression suffix (`.gz`, `.gzip`, `.zstd`, `.zst`, `.snappy`) is ignored when detecting the file format. If the file does not have an extension, TiDB assumes that the file format is `CSV`. ### WithOptions @@ -188,7 +188,7 @@ For TiDB Self-Managed, `IMPORT INTO ... FROM FILE` supports importing data from > > - The Snappy compressed file must be in the [official Snappy format](https://github.com/google/snappy). Other variants of Snappy compression are not supported. > - Because TiDB Lightning cannot concurrently decompress a single large compressed file, the size of the compressed file affects the import speed. It is recommended that a source file is no greater than 256 MiB after decompression. -> - When `FORMAT` is omitted, TiDB first removes one compression suffix from the file name, then inspects the remaining extension to choose `CSV`, or `SQL`. +> - When `FORMAT` is omitted, TiDB first removes one compression suffix from the file name, then inspects the remaining extension to choose `CSV` or `SQL`. ### Global Sort From c3e5f2c36c775b199e422f6e7ec6a4c85b141563 Mon Sep 17 00:00:00 2001 From: Airton Lastori Date: Wed, 18 Jun 2025 00:21:11 -0400 Subject: [PATCH 3/8] Update sql-statements/sql-statement-import-into.md --- sql-statements/sql-statement-import-into.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql-statements/sql-statement-import-into.md b/sql-statements/sql-statement-import-into.md index 1401642bded4e..e04d91b24c6c5 100644 --- a/sql-statements/sql-statement-import-into.md +++ b/sql-statements/sql-statement-import-into.md @@ -112,7 +112,7 @@ In the left side of the `SET` expression, you can only reference a column name t It specifies where your data files are and which files to import. You can point to a single file or use wildcards to match many files. -- Amazon S3 or GCS URI path: for URI configuration details, see [URI Formats of External Storage Services](/external-storage-uri.md). +- Cloud storage (Amazon S3 or GCS): Provide the full object-storage URI, formatted as described in [URI Formats of External Storage Services](/external-storage-uri.md). - TiDB local file path: The path must be absolute. Ensure the specified path and files exist on the TiDB node where your session is connected, and confirm you have the required `FILE` privilege. From d165b7bd50f441dc1eacacf3b7f52c2a2964bb10 Mon Sep 17 00:00:00 2001 From: Airton Lastori Date: Mon, 23 Jun 2025 12:58:19 -0400 Subject: [PATCH 4/8] Reverting changes after clarification about wildcards and subdirs --- sql-statements/sql-statement-import-into.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql-statements/sql-statement-import-into.md b/sql-statements/sql-statement-import-into.md index e04d91b24c6c5..3ec1f1c81eb16 100644 --- a/sql-statements/sql-statement-import-into.md +++ b/sql-statements/sql-statement-import-into.md @@ -120,7 +120,7 @@ It specifies where your data files are and which files to import. You can point > > If [SEM](/system-variables.md#tidb_enable_enhanced_security) is enabled in the target cluster, the `fileLocation` cannot be specified as a local file path. -In the `fileLocation` parameter, you can specify a single file or use wildcards (`*` and `[]`) to match multiple files. 
Wildcards can be used to match sub-path segments (e.g., a directory level) and filenames. For example, if your files are stored on Amazon S3, you can configure the parameter like this: +In the `fileLocation` parameter, you can specify a single file, or use the `*` and `[]` wildcards to match multiple files for import. Note that the wildcard can only be used in the file name, because it does not match directories or recursively match files in subdirectories. Taking files stored on Amazon S3 as examples, you can configure the parameter as follows: - Import a single file: `s3:///path/to/data/foo.csv` - Import all files in a specified path: `s3:///path/to/data/*` From b591eb61b03b7323e900448fb41b5bb1baf4d0dc Mon Sep 17 00:00:00 2001 From: Airton Lastori Date: Mon, 23 Jun 2025 13:03:50 -0400 Subject: [PATCH 5/8] Revert changes after clarification about wildcards and subdirs --- sql-statements/sql-statement-import-into.md | 1 - 1 file changed, 1 deletion(-) diff --git a/sql-statements/sql-statement-import-into.md b/sql-statements/sql-statement-import-into.md index 3ec1f1c81eb16..1ccce72c3f1fc 100644 --- a/sql-statements/sql-statement-import-into.md +++ b/sql-statements/sql-statement-import-into.md @@ -128,7 +128,6 @@ In the `fileLocation` parameter, you can specify a single file, or use the `*` a - Import all files with the `foo` prefix in a specified path: `s3:///path/to/data/foo*` - Import all files with the `foo` prefix and the `.csv` suffix in a specified path: `s3:///path/to/data/foo*.csv` - Import `1.csv` and `2.csv` in a specified path: `s3:///path/to/data/[12].csv` -- Import `foo.csv` files from all immediate sub-paths: `s3:///path/to/*/foo.csv` (add another `*/` for each extra directory level, for example `path/to/*/*/foo.csv`) > **Note:** > From b0afab86ab7fed7c3fbff33c6e25a932d7a78abd Mon Sep 17 00:00:00 2001 From: Airton Lastori Date: Mon, 23 Jun 2025 13:05:38 -0400 Subject: [PATCH 6/8] Revert changes after clarification of wildcards and subdirs behavior --- sql-statements/sql-statement-import-into.md | 19 ------------------- 1 file changed, 19 deletions(-) diff --git a/sql-statements/sql-statement-import-into.md b/sql-statements/sql-statement-import-into.md index 1ccce72c3f1fc..8658418b7a3c1 100644 --- a/sql-statements/sql-statement-import-into.md +++ b/sql-statements/sql-statement-import-into.md @@ -293,25 +293,6 @@ If you only need to import `file-01.csv` and `file-03.csv` into the target table IMPORT INTO t FROM '/path/to/file-0[13].csv'; ``` -#### Import data from a nested directory structure - -You can use wildcards to import data from a common directory structure, such as one organized by date. For example, assume your sales data is stored in S3 and organized by year and quarter: - -``` -/path/to/sales-data/2023/q1/data.csv -/path/to/sales-data/2023/q2/data.csv -/path/to/sales-data/2023/q3/data.csv -/path/to/sales-data/2023/q4/data.csv -/path/to/sales-data/2024/q1/data.csv -... 
-``` - -To import all `data.csv` files from all quarters across all years, you can use a wildcard for each directory level: - -```sql -IMPORT INTO sales FROM '/path/to/sales-data/*/*/data.csv'; -``` - #### Import data files from Amazon S3 or GCS - Import data files from Amazon S3: From 535d7f7f784b03d771aa90bf91e7aef7b77f9a57 Mon Sep 17 00:00:00 2001 From: Airton Lastori Date: Mon, 23 Jun 2025 16:45:23 -0400 Subject: [PATCH 7/8] Added example using range pattern ([1-3].csv) --- sql-statements/sql-statement-import-into.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/sql-statements/sql-statement-import-into.md b/sql-statements/sql-statement-import-into.md index 8658418b7a3c1..d4563aec8e377 100644 --- a/sql-statements/sql-statement-import-into.md +++ b/sql-statements/sql-statement-import-into.md @@ -127,7 +127,8 @@ In the `fileLocation` parameter, you can specify a single file, or use the `*` a - Import all files with the `.csv` suffix in a specified path: `s3:///path/to/data/*.csv` - Import all files with the `foo` prefix in a specified path: `s3:///path/to/data/foo*` - Import all files with the `foo` prefix and the `.csv` suffix in a specified path: `s3:///path/to/data/foo*.csv` -- Import `1.csv` and `2.csv` in a specified path: `s3:///path/to/data/[12].csv` +- Import `1.csv` and `2.csv` in a specified path: `s3:///path/to/data/[12].csv`. This is useful for importing a specific, non-sequential set of files. +- Import `1.csv`, `2.csv`, and `3.csv` using a range: `s3:///path/to/data/[1-3].csv` > **Note:** > From 3e734203be6d1fd83fbb14d44cf43446e58d292e Mon Sep 17 00:00:00 2001 From: Airton Lastori Date: Mon, 23 Jun 2025 18:03:37 -0400 Subject: [PATCH 8/8] Adding the example with negation pattern --- sql-statements/sql-statement-import-into.md | 1 + 1 file changed, 1 insertion(+) diff --git a/sql-statements/sql-statement-import-into.md b/sql-statements/sql-statement-import-into.md index d4563aec8e377..f986e533bf71b 100644 --- a/sql-statements/sql-statement-import-into.md +++ b/sql-statements/sql-statement-import-into.md @@ -129,6 +129,7 @@ In the `fileLocation` parameter, you can specify a single file, or use the `*` a - Import all files with the `foo` prefix and the `.csv` suffix in a specified path: `s3:///path/to/data/foo*.csv` - Import `1.csv` and `2.csv` in a specified path: `s3:///path/to/data/[12].csv`. This is useful for importing a specific, non-sequential set of files. - Import `1.csv`, `2.csv`, and `3.csv` using a range: `s3:///path/to/data/[1-3].csv` +- Import files with a single character name, except `1.csv` or `2.csv` using `^` for negation: `s3:///path/to/data/[^12].csv` > **Note:** >