# Directory

Use `DirectoryLoader` to load `Document`s from multiple files in a directory.

## Overview

`DirectoryLoader` loads documents from a directory and lets you filter files with glob patterns, sample a subset of them, and customize both the per-file loaders and the metadata attached to each document. Out of the box it handles text, JSON, CSV, and TSV files.

## Basic Usage

```dart
// Load all text files from a directory recursively
final loader = DirectoryLoader(
  '/path/to/documents',
  glob: '*.txt',
  recursive: true,
);
final documents = await loader.load();
```

## Constructor Parameters

### `filePath` (required)
- Type: `String`
- Description: The path to the directory containing documents to load.

### `glob`
- Type: `String`
- Default: `'*'` (all files)
- Description: A glob pattern to match files. Only files matching this pattern will be loaded.
- Examples:
  ```dart
  // Load only JSON and text files
  DirectoryLoader('/path', glob: '*.{txt,json}');

  // Load files starting with 'report'
  DirectoryLoader('/path', glob: 'report*');
  ```

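To make the matching rules concrete, here is a minimal sketch that translates a simple glob into a regular expression. It is only an illustration of how patterns such as `*.{txt,json}` select file names; `globToRegExp` is a hypothetical helper, and the loader itself may match patterns differently.

```dart
/// Hypothetical helper: converts a simple glob into a RegExp.
/// Supports `*`, `?`, and one level of `{a,b}` alternation only.
RegExp globToRegExp(String glob) {
  final buffer = StringBuffer('^');
  var inBraces = false;
  for (final char in glob.split('')) {
    if (char == '*') {
      buffer.write('[^/]*'); // any run of characters within one path segment
    } else if (char == '?') {
      buffer.write('[^/]'); // exactly one character
    } else if (char == '{') {
      inBraces = true;
      buffer.write('(?:');
    } else if (char == '}') {
      inBraces = false;
      buffer.write(')');
    } else if (char == ',' && inBraces) {
      buffer.write('|'); // separator between alternatives
    } else {
      buffer.write(RegExp.escape(char)); // literal character
    }
  }
  buffer.write(r'$');
  return RegExp(buffer.toString());
}

void main() {
  final pattern = globToRegExp('*.{txt,json}');
  print(pattern.hasMatch('notes.txt')); // true
  print(pattern.hasMatch('data.json')); // true
  print(pattern.hasMatch('image.png')); // false
}
```
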
### `recursive`
- Type: `bool`
- Default: `true`
- Description: Whether to search recursively in subdirectories.

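The effect of a recursive search can be sketched with plain `dart:io`. The helper `listFiles` below is hypothetical and only shows the difference between a flat and a recursive directory listing; it is not taken from the loader.

```dart
import 'dart:io';

/// Hypothetical helper: lists regular files under [root], descending into
/// subdirectories only when [recursive] is true.
List<File> listFiles(String root, {bool recursive = true}) {
  return Directory(root)
      .listSync(recursive: recursive, followLinks: false)
      .whereType<File>()
      .toList();
}

void main() {
  // Build a small throwaway tree: one top-level file and one nested file.
  final root = Directory.systemTemp.createTempSync('dir_loader_demo');
  File('${root.path}/a.txt').writeAsStringSync('top level');
  Directory('${root.path}/sub').createSync();
  File('${root.path}/sub/b.txt').writeAsStringSync('nested');

  print(listFiles(root.path, recursive: false).length); // 1
  print(listFiles(root.path).length); // 2

  root.deleteSync(recursive: true);
}
```
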
### `exclude`
- Type: `List<String>`
- Default: `[]`
- Description: Glob patterns to exclude from loading.
- Example:
  ```dart
  DirectoryLoader(
    '/path',
    exclude: ['*.tmp', 'draft*'],
  );
  ```

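One way to picture how `glob` and `exclude` combine: a file is kept when it matches the include pattern and matches none of the exclude patterns. The sketch below assumes patterns are compared against the file name and supports only `*`; `matchesWildcard` and `shouldLoad` are hypothetical names, not part of the loader.

```dart
/// Hypothetical helper: matches [name] against a pattern where `*` stands
/// for any run of characters. Other glob syntax is not handled.
bool matchesWildcard(String pattern, String name) {
  final body = pattern.split('*').map(RegExp.escape).join('.*');
  return RegExp('^$body\$').hasMatch(name);
}

/// Keeps a file only if it matches [include] and none of [exclude].
bool shouldLoad(
  String name, {
  String include = '*',
  List<String> exclude = const [],
}) {
  if (!matchesWildcard(include, name)) return false;
  return !exclude.any((pattern) => matchesWildcard(pattern, name));
}

void main() {
  final names = ['report.txt', 'draft_v2.txt', 'cache.tmp'];
  final kept = names
      .where((name) => shouldLoad(name, exclude: ['*.tmp', 'draft*']))
      .toList();
  print(kept); // [report.txt]
}
```
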
### `loaderMap`
- Type: `Map<String, BaseDocumentLoader Function(String)>`
- Default: `DirectoryLoader.defaultLoaderMap`
- Description: A map from file extension to a function that creates the loader for files of that type.
- Default Supported Types:
  - `.txt`: TextLoader
  - `.json`: JsonLoader (with root schema)
  - `.csv` and `.tsv`: CsvLoader
- Example of extending loaders:
  ```dart
  final loader = DirectoryLoader(
    '/path/to/docs',
    loaderMap: {
      // Keep the default loaders
      ...DirectoryLoader.defaultLoaderMap,

      // Add a custom loader for XML files
      '.xml': (path) => CustomXmlLoader(path),
    },
  );
  ```

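The extension-to-loader lookup can be sketched as a plain map dispatch. Everything in this block is a hypothetical stand-in (`SketchLoader`, `LoaderFactory`, `resolveLoader`), and returning `null` for unmapped extensions is an assumption; the real loader may handle unknown file types differently.

```dart
/// Hypothetical stand-in for a document loader; not a LangChain.dart type.
class SketchLoader {
  const SketchLoader(this.kind, this.path);
  final String kind;
  final String path;
}

typedef LoaderFactory = SketchLoader Function(String path);

/// Returns the lower-cased extension including the dot, or '' if none.
String extensionOf(String path) {
  final name = path.split('/').last;
  final dot = name.lastIndexOf('.');
  return dot <= 0 ? '' : name.substring(dot).toLowerCase();
}

/// Looks up the factory registered for the file's extension.
SketchLoader? resolveLoader(String path, Map<String, LoaderFactory> loaderMap) {
  final factory = loaderMap[extensionOf(path)];
  return factory == null ? null : factory(path);
}

void main() {
  final defaults = <String, LoaderFactory>{
    '.txt': (path) => SketchLoader('text', path),
    '.json': (path) => SketchLoader('json', path),
  };
  // Entries listed after the spread override the defaults.
  final custom = <String, LoaderFactory>{
    ...defaults,
    '.xml': (path) => SketchLoader('xml', path),
    '.json': (path) => SketchLoader('custom-json', path),
  };

  print(resolveLoader('/docs/a.xml', custom)?.kind); // xml
  print(resolveLoader('/docs/b.json', custom)?.kind); // custom-json
  print(resolveLoader('/docs/c.png', custom)?.kind); // null
}
```

Because a map literal keeps the last value written for a key, placing the spread first lets custom entries replace a default loader for the same extension.
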
### `loadHidden`
- Type: `bool`
- Default: `false`
- Description: Whether to load hidden files.
- Platform Specific:
  - On Unix-like systems (Linux, macOS): Hidden files are identified by names starting with '.'
  - On Windows: May not work as expected, because Windows marks hidden files with a file attribute rather than a name prefix
  - If you need consistent hidden-file handling across operating systems, add your own platform-specific check
- Example of platform-aware hidden file checking:
  ```dart
  import 'dart:io' show File, Platform, Process;

  import 'package:path/path.dart' as path;

  bool isHiddenFile(File file) {
    if (Platform.isWindows) {
      // dart:io does not expose the Windows hidden attribute, and
      // FileStat.modeString() only reports permissions (e.g. 'rw-r--r--').
      // As a workaround, ask `attrib`, which prints attribute letters
      // (H = hidden) before the file path.
      final result = Process.runSync('attrib', [file.absolute.path]);
      final output = result.stdout.toString();
      final pathStart = output.indexOf(RegExp(r'[A-Za-z]:\\|\\\\'));
      final flags = pathStart >= 0 ? output.substring(0, pathStart) : '';
      return flags.contains('H');
    }
    // Unix-like systems: hidden files start with a dot.
    return path.basename(file.path).startsWith('.');
  }
  ```

### `sampleSize`
- Type: `int`
- Default: `0` (load all files)
- Description: Maximum number of files to load.
- Example:
  ```dart
  // Load only 10 files
  DirectoryLoader('/path', sampleSize: 10);
  ```

### `randomizeSample`
- Type: `bool`
- Default: `false`
- Description: Whether to randomize the sample of files.

### `sampleSeed`
- Type: `int?`
- Default: `null`
- Description: Seed for random sampling to ensure reproducibility.
- Example:
  ```dart
  // Consistent random sampling
  DirectoryLoader(
    '/path',
    sampleSize: 10,
    randomizeSample: true,
    sampleSeed: 42,
  );
  ```

### `metadataBuilder`
- Type: `Map<String, dynamic> Function(File file, Map<String, dynamic> defaultMetadata)?`
- Default: `null`
- Description: A custom function to build metadata for each document.
- Example:
  ```dart
  final loader = DirectoryLoader(
    '/path',
    metadataBuilder: (file, defaultMetadata) {
      return {
        ...defaultMetadata,
        'custom_tag': 'important_document',
        'processing_date': DateTime.now().toIso8601String(),
      };
    },
  );
  ```

## Default Metadata

By default, each document receives metadata including:
- `source`: Full file path
- `name`: Filename
- `extension`: File extension
- `size`: File size in bytes
- `lastModified`: Last modification timestamp (milliseconds since epoch)

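As a rough illustration of where these values can come from, the sketch below derives the same five fields with `dart:io`. The function `buildDefaultMetadata` is hypothetical, and details such as whether `extension` includes the leading dot are assumptions rather than documented behavior.

```dart
import 'dart:io';

/// Hypothetical helper: builds a metadata map with the fields listed above.
Map<String, dynamic> buildDefaultMetadata(File file) {
  final stat = file.statSync();
  final name = file.uri.pathSegments.last;
  final dot = name.lastIndexOf('.');
  return {
    'source': file.absolute.path,
    'name': name,
    // Assumption: the extension keeps its leading dot.
    'extension': dot > 0 ? name.substring(dot) : '',
    'size': stat.size,
    'lastModified': stat.modified.millisecondsSinceEpoch,
  };
}

void main() {
  final dir = Directory.systemTemp.createTempSync('metadata_demo');
  final file = File('${dir.path}/notes.txt')..writeAsStringSync('hello');

  final metadata = buildDefaultMetadata(file);
  print(metadata['name']); // notes.txt
  print(metadata['extension']); // .txt
  print(metadata['size']); // 5

  dir.deleteSync(recursive: true);
}
```
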
## Lazy Loading

The `DirectoryLoader` supports lazy loading through the `lazyLoad()` method, which returns a `Stream<Document>`. This is useful for processing large numbers of documents without loading everything into memory at once.

```dart
final loader = DirectoryLoader('/path/to/documents');
await for (final document in loader.lazyLoad()) {
  // Process each document as it's loaded
  print(document.pageContent);
}
```

## Error Handling

- Throws an `ArgumentError` if the glob pattern is empty

## Advanced Example

```dart
final loader = DirectoryLoader(
  '/path/to/documents',
  glob: '*.{txt,json,csv}', // Multiple file types
  recursive: true, // Search subdirectories
  exclude: ['temp*', '*.backup'], // Exclude temp and backup files
  loadHidden: false, // Ignore hidden files
  sampleSize: 50, // Load only 50 files
  randomizeSample: true, // Randomize the sample
  sampleSeed: 123, // Reproducible random sampling
  loaderMap: {
    // Keep the default loaders for .txt and .csv files
    ...DirectoryLoader.defaultLoaderMap,
    // Custom loader for a specific file type (overrides the default)
    '.json': (path) => CustomJsonLoader(path),
  },
  metadataBuilder: (file, defaultMetadata) {
    // Add custom metadata; _categorizeFile is a helper you define yourself
    return {
      ...defaultMetadata,
      'category': _categorizeFile(file),
    };
  },
);

final documents = await loader.load();
```

## Best Practices

- Use `lazyLoad()` for large directories to manage memory efficiently
- Provide specific glob patterns to reduce unnecessary file processing
- Customize loaders for specialized file types
- Use `metadataBuilder` to add context-specific information to documents

## Limitations

- Relies on file system access, so it only works where the directory can be read directly
- Loading time and memory use grow with the number and size of files, and `load()` returns all documents at once