Commit 4730f2a

feat(doc-loaders): Add support for DirectoryLoader (#620)
Co-authored-by: David Miguel <[email protected]>
1 parent 8283c00 commit 4730f2a

12 files changed: +861 -6 lines changed


docs/_sidebar.md

Lines changed: 1 addition & 0 deletions
@@ -80,6 +80,7 @@
 - [Text](/modules/retrieval/document_loaders/how_to/text.md)
 - [JSON](/modules/retrieval/document_loaders/how_to/json.md)
 - [Web page](/modules/retrieval/document_loaders/how_to/web.md)
+- [Directory](/modules/retrieval/document_loaders/how_to/directory.md)
 - [Document transformers](/modules/retrieval/document_transformers/document_transformers.md)
 - Text splitters
 - [Split by character](/modules/retrieval/document_transformers/text_splitters/character_text_splitter.md)
Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
# Directory

Use `DirectoryLoader` to load `Document`s from multiple files in a directory with extensive customization options.

## Overview

`DirectoryLoader` loads documents from a directory and lets you filter files with glob and exclusion patterns, sample a subset of files, map file extensions to specific loaders, and attach custom metadata. It handles `.txt`, `.json`, `.csv`, and `.tsv` files out of the box.

## Basic Usage

```dart
// Load all text files from a directory recursively
final loader = DirectoryLoader(
  '/path/to/documents',
  glob: '*.txt',
  recursive: true,
);
final documents = await loader.load();
```

## Constructor Parameters

### `filePath` (required)
- Type: `String`
- Description: The path to the directory containing documents to load.

### `glob`
- Type: `String`
- Default: `'*'` (all files)
- Description: A glob pattern to match files. Only files matching this pattern will be loaded.
- Examples:
  ```dart
  // Load only JSON and text files
  DirectoryLoader('/path', glob: '*.{txt,json}')

  // Load files starting with 'report'
  DirectoryLoader('/path', glob: 'report*')
  ```

### `recursive`
- Type: `bool`
- Default: `true`
- Description: Whether to search recursively in subdirectories.

### `exclude`
- Type: `List<String>`
- Default: `[]`
- Description: Glob patterns to exclude from loading.
- Example:
  ```dart
  DirectoryLoader(
    '/path',
    exclude: ['*.tmp', 'draft*'],
  )
  ```

### `loaderMap`
- Type: `Map<String, BaseDocumentLoader Function(String)>`
- Default: `DirectoryLoader.defaultLoaderMap`
- Description: A map to customize loaders for different file types.
- Default Supported Types:
  - `.txt`: TextLoader
  - `.json`: JsonLoader (with root schema)
  - `.csv` and `.tsv`: CsvLoader
- Example of extending loaders:
  ```dart
  final loader = DirectoryLoader(
    '/path/to/docs',
    loaderMap: {
      // Add a custom loader for XML files
      '.xml': (path) => CustomXmlLoader(path),

      // Combine with default loaders
      ...DirectoryLoader.defaultLoaderMap,
    },
  );
  ```
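
To illustrate how an extension-keyed map such as `loaderMap` can route files to loaders, here is a minimal, self-contained sketch. `FakeLoader`, `extensionOf`, and `pickLoader` are hypothetical stand-ins rather than library APIs, and the fallback to a text loader for unmapped extensions is an assumption, so check the `DirectoryLoader` source for its actual behavior.

```dart
// Hypothetical stand-in for a document loader; not the library's class.
class FakeLoader {
  FakeLoader(this.kind, this.path);
  final String kind;
  final String path;
}

typedef LoaderFactory = FakeLoader Function(String path);

// Extension-keyed factories, mirroring the shape of `loaderMap`.
final Map<String, LoaderFactory> defaults = {
  '.txt': (path) => FakeLoader('text', path),
  '.json': (path) => FakeLoader('json', path),
};

/// Returns the lowercased extension of [path] including the dot, or '' if none.
String extensionOf(String path) {
  final name = path.split('/').last;
  final dot = name.lastIndexOf('.');
  return dot <= 0 ? '' : name.substring(dot).toLowerCase();
}

/// Picks a loader by extension; falls back to a text loader (assumption).
FakeLoader pickLoader(String path, Map<String, LoaderFactory> loaders) {
  final factory = loaders[extensionOf(path)];
  return factory != null ? factory(path) : FakeLoader('text', path);
}

void main() {
  final Map<String, LoaderFactory> loaders = {
    ...defaults,
    // Entries added after the spread replace defaults with the same key.
    '.json': (path) => FakeLoader('custom-json', path),
    '.xml': (path) => FakeLoader('xml', path),
  };
  print(pickLoader('/docs/a.json', loaders).kind); // custom-json
  print(pickLoader('/docs/b.xml', loaders).kind); // xml
  print(pickLoader('/docs/c.md', loaders).kind); // text
}
```

Because map entries are added in order, the position of the spread decides which side wins when a custom key collides with a default one.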
78+
79+
### `loadHidden`
80+
- Type: `bool`
81+
- Default: `false`
82+
- Description: Whether to load hidden files.
83+
- Platform Specific:
84+
- On Unix-like systems (Linux, macOS): Identifies hidden files by names starting with '.'
85+
- On Windows: May not work as expected due to different hidden file conventions
86+
- Recommended to use platform-specific checks for comprehensive hidden file handling across different operating systems
87+
- Example of platform-aware hidden file checking:
88+
```dart
89+
import 'dart:io' show Platform;
90+
91+
bool isHiddenFile(File file) {
92+
if (Platform.isWindows) {
93+
// Windows-specific hidden file check
94+
return (File(file.path).statSync().modeString().startsWith('h'));
95+
} else {
96+
// Unix-like systems
97+
return path.basename(file.path).startsWith('.');
98+
}
99+
}
100+
```

### `sampleSize`
- Type: `int`
- Default: `0` (load all files)
- Description: Maximum number of files to load.
- Example:
  ```dart
  // Load only 10 files
  DirectoryLoader('/path', sampleSize: 10)
  ```

### `randomizeSample`
- Type: `bool`
- Default: `false`
- Description: Whether to randomize the sample of files.

### `sampleSeed`
- Type: `int?`
- Default: `null`
- Description: Seed for random sampling to ensure reproducibility.
- Example:
  ```dart
  // Consistent random sampling
  DirectoryLoader(
    '/path',
    sampleSize: 10,
    randomizeSample: true,
    sampleSeed: 42,
  )
  ```
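
The interplay of `sampleSize`, `randomizeSample`, and `sampleSeed` can be sketched with `dart:math`. This illustrates seeded sampling in general, assuming a shuffle-then-take strategy; the `sample` helper is hypothetical and the library's implementation may differ.

```dart
import 'dart:math';

/// Returns at most [sampleSize] items; 0 means "keep everything".
/// When [randomize] is true, items are shuffled first, optionally with [seed].
List<T> sample<T>(
  List<T> items, {
  int sampleSize = 0,
  bool randomize = false,
  int? seed,
}) {
  final pool = List<T>.of(items); // copy so the caller's list is untouched
  if (randomize) {
    pool.shuffle(Random(seed));
  }
  if (sampleSize <= 0 || sampleSize >= pool.length) return pool;
  return pool.sublist(0, sampleSize);
}

void main() {
  final files = [for (var i = 0; i < 20; i++) 'file_$i.txt'];
  final a = sample(files, sampleSize: 5, randomize: true, seed: 42);
  final b = sample(files, sampleSize: 5, randomize: true, seed: 42);
  print(a.join(',') == b.join(',')); // true: same seed, same sample
  print(sample(files, sampleSize: 3)); // the first three files, in order
}
```

Without a seed, `Random()` picks a different shuffle on each run, which is why a fixed seed is needed for reproducible samples.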
131+
132+
### `metadataBuilder`
133+
- Type: `Map<String, dynamic> Function(File file, Map<String, dynamic> defaultMetadata)?`
134+
- Default: `null`
135+
- Description: A custom function to build metadata for each document.
136+
- Example:
137+
```dart
138+
final loader = DirectoryLoader(
139+
'/path',
140+
metadataBuilder: (file, defaultMetadata) {
141+
return {
142+
...defaultMetadata,
143+
'custom_tag': 'important_document',
144+
'processing_date': DateTime.now().toIso8601String(),
145+
};
146+
},
147+
);
148+
```
149+
150+
## Default Metadata
151+
152+
By default, each document receives metadata including:
153+
- `source`: Full file path
154+
- `name`: Filename
155+
- `extension`: File extension
156+
- `size`: File size in bytes
157+
- `lastModified`: Last modification timestamp (milliseconds since epoch)
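
As a rough illustration of where these values can come from, the following sketch builds a map with the same keys using only `dart:io`. The `buildDefaultMetadata` helper is hypothetical, and details such as whether `extension` includes the leading dot are assumptions rather than documented behavior.

```dart
import 'dart:io';

/// Builds a metadata map with the keys listed above for [file].
/// A sketch only; the library's own logic may differ in details.
Map<String, dynamic> buildDefaultMetadata(File file) {
  final stat = file.statSync();
  final name = file.uri.pathSegments.last;
  final dot = name.lastIndexOf('.');
  return {
    'source': file.absolute.path,
    'name': name,
    // Assumption: the extension keeps its leading dot.
    'extension': dot > 0 ? name.substring(dot) : '',
    'size': stat.size,
    'lastModified': stat.modified.millisecondsSinceEpoch,
  };
}

void main() {
  final dir = Directory.systemTemp.createTempSync('metadata_demo');
  final file = File('${dir.path}/notes.txt')..writeAsStringSync('hello');
  final metadata = buildDefaultMetadata(file);
  print(metadata['name']); // notes.txt
  print(metadata['extension']); // .txt
  print(metadata['size']); // 5
  dir.deleteSync(recursive: true);
}
```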

## Lazy Loading

The `DirectoryLoader` supports lazy loading through the `lazyLoad()` method, which returns a `Stream<Document>`. This is useful for processing large numbers of documents without loading everything into memory at once.

```dart
final loader = DirectoryLoader('/path/to/documents');
await for (final document in loader.lazyLoad()) {
  // Process each document as it's loaded
  print(document.pageContent);
}
```
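
To show why a stream helps with memory, here is a small sketch of the general pattern using an `async*` generator over `dart:io`. The `lazyReadTextFiles` function is a hypothetical illustration of lazy, per-file reading, not the implementation behind `lazyLoad()`.

```dart
import 'dart:io';

/// Lazily yields the contents of each `.txt` file under [dir].
/// Each file is read only when the consumer asks for the next item.
Stream<String> lazyReadTextFiles(Directory dir) async* {
  await for (final entity in dir.list(recursive: true)) {
    if (entity is File && entity.path.endsWith('.txt')) {
      yield await entity.readAsString();
    }
  }
}

Future<void> main() async {
  final dir = Directory.systemTemp.createTempSync('lazy_demo');
  for (var i = 0; i < 3; i++) {
    File('${dir.path}/doc_$i.txt').writeAsStringSync('content $i');
  }
  // Stop after the first two items; the remaining file is never read.
  final firstTwo = await lazyReadTextFiles(dir).take(2).toList();
  print(firstTwo.length); // 2
  dir.deleteSync(recursive: true);
}
```

A consumer that stops early, as `take(2)` does here, cancels the stream, so work for the remaining files is skipped entirely.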
170+
171+
## Error Handling
172+
173+
- Throws an `ArgumentError` if the blob pattern is empty
174+
175+
## Advanced Example
176+
177+
```dart
178+
final loader = DirectoryLoader(
179+
'/path/to/documents',
180+
glob: '*.{txt,json,csv}', // Multiple file types
181+
recursive: true, // Search subdirectories
182+
exclude: ['temp*', '*.backup'], // Exclude temp and backup files
183+
loadHidden: false, // Ignore hidden files
184+
sampleSize: 50, // Load only 50 files
185+
randomizeSample: true, // Randomize the sample
186+
sampleSeed: 123, // Reproducible random sampling
187+
loaderMap: {
188+
// Custom loader for a specific file type
189+
'.json': (path) => CustomJsonLoader(path),
190+
},
191+
metadataBuilder: (file, defaultMetadata) {
192+
// Add custom metadata
193+
return {
194+
...defaultMetadata,
195+
'category': _categorizeFile(file),
196+
};
197+
},
198+
);
199+
200+
final documents = await loader.load();
201+
```

## Best Practices

- Use `lazyLoad()` for large directories to manage memory efficiently
- Provide specific glob patterns to reduce unnecessary file processing
- Customize loaders for specialized file types
- Use `metadataBuilder` to add context-specific information to documents

## Limitations

- Relies on file system access, so it is not usable on platforms without one (for example, the web, where a stub is exported instead)
- Performance may vary with large directories

examples/browser_summarizer/pubspec.lock

Lines changed: 10 additions & 2 deletions
@@ -171,6 +171,14 @@ packages:
       url: "https://pub.dev"
     source: hosted
     version: "2.4.4"
+  glob:
+    dependency: transitive
+    description:
+      name: glob
+      sha256: "0e7014b3b7d4dac1ca4d6114f82bf1782ee86745b9b42a92c9289c23d8a0ab63"
+      url: "https://pub.dev"
+    source: hosted
+    version: "2.1.2"
   html:
     dependency: transitive
     description:
@@ -330,10 +338,10 @@ packages:
     dependency: transitive
     description:
       name: path
-      sha256: "087ce49c3f0dc39180befefc60fdb4acd8f8620e5682fe2476afd0b3688bb4af"
+      sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5"
       url: "https://pub.dev"
     source: hosted
-    version: "1.9.0"
+    version: "1.9.1"
   path_provider_linux:
     dependency: transitive
     description:

examples/docs_examples/pubspec.lock

Lines changed: 18 additions & 2 deletions
@@ -119,6 +119,14 @@ packages:
       url: "https://pub.dev"
     source: hosted
     version: "2.1.3"
+  file:
+    dependency: transitive
+    description:
+      name: file
+      sha256: a3b4f84adafef897088c160faf7dfffb7696046cb13ae90b508c2cbc95d3b8d4
+      url: "https://pub.dev"
+    source: hosted
+    version: "7.0.1"
   fixnum:
     dependency: transitive
     description:
@@ -151,6 +159,14 @@ packages:
       url: "https://pub.dev"
     source: hosted
     version: "0.8.13"
+  glob:
+    dependency: transitive
+    description:
+      name: glob
+      sha256: "0e7014b3b7d4dac1ca4d6114f82bf1782ee86745b9b42a92c9289c23d8a0ab63"
+      url: "https://pub.dev"
+    source: hosted
+    version: "2.1.2"
   google_generative_ai:
     dependency: transitive
     description:
@@ -359,10 +375,10 @@ packages:
     dependency: transitive
     description:
       name: path
-      sha256: "087ce49c3f0dc39180befefc60fdb4acd8f8620e5682fe2476afd0b3688bb4af"
+      sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5"
       url: "https://pub.dev"
     source: hosted
-    version: "1.9.0"
+    version: "1.9.1"
   petitparser:
     dependency: transitive
     description:

examples/wikivoyage_eu/pubspec.lock

Lines changed: 18 additions & 2 deletions
@@ -89,6 +89,14 @@ packages:
       url: "https://pub.dev"
     source: hosted
     version: "2.1.3"
+  file:
+    dependency: transitive
+    description:
+      name: file
+      sha256: a3b4f84adafef897088c160faf7dfffb7696046cb13ae90b508c2cbc95d3b8d4
+      url: "https://pub.dev"
+    source: hosted
+    version: "7.0.1"
   fixnum:
     dependency: transitive
     description:
@@ -113,6 +121,14 @@ packages:
       url: "https://pub.dev"
     source: hosted
     version: "2.4.4"
+  glob:
+    dependency: transitive
+    description:
+      name: glob
+      sha256: "0e7014b3b7d4dac1ca4d6114f82bf1782ee86745b9b42a92c9289c23d8a0ab63"
+      url: "https://pub.dev"
+    source: hosted
+    version: "2.1.2"
   html:
     dependency: transitive
     description:
@@ -240,10 +256,10 @@ packages:
     dependency: transitive
     description:
       name: path
-      sha256: "087ce49c3f0dc39180befefc60fdb4acd8f8620e5682fe2476afd0b3688bb4af"
+      sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5"
       url: "https://pub.dev"
     source: hosted
-    version: "1.9.0"
+    version: "1.9.1"
   petitparser:
     dependency: transitive
     description:

melos.yaml

Lines changed: 2 additions & 0 deletions
@@ -41,6 +41,7 @@ command:
 flutter_markdown: ^0.7.3
 freezed_annotation: ^2.4.2
 gcloud: ^0.8.13
+glob: ^2.1.2
 google_generative_ai: ^0.4.6
 googleapis: ^13.0.0
 googleapis_auth: ^1.6.0
@@ -53,6 +54,7 @@ command:
 math_expressions: ^2.6.0
 meta: ^1.11.0
 objectbox: ^4.0.3
+path: ^1.9.1
 pinecone: ^0.7.2
 rxdart: ">=0.27.7 <0.29.0"
 shared_preferences: ^2.3.0
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+export 'directory_io.dart' if (dart.library.js_interop) 'directory_stub.dart';
