Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ AI-Driven Development (vibe coding) on Databricks just got a whole lot better. T


### Install in existing project
By default this will install at a project level rather than a user level. This is often a good fit, but requires you to run your client from the exact directory that was used for the install.
By default this will install at a project level rather than a user level. This is often a good fit, but requires you to run your client from the exact directory that was used for the install.
_Note: Project configuration files can be re-used in other projects. You find these configs under .claude, .cursor, or .gemini_

#### Mac / Linux
Expand Down Expand Up @@ -164,7 +164,7 @@ Works with LangChain, OpenAI Agents SDK, or any Python framework. See [databrick
|-----------|-------------|
| [`databricks-tools-core/`](databricks-tools-core/) | Python library with high-level Databricks functions |
| [`databricks-mcp-server/`](databricks-mcp-server/) | MCP server exposing 50+ tools for AI assistants |
| [`databricks-skills/`](databricks-skills/) | 19 markdown skills teaching Databricks patterns |
| [`databricks-skills/`](databricks-skills/) | 20 markdown skills teaching Databricks patterns |
| [`databricks-builder-app/`](databricks-builder-app/) | Full-stack web app with Claude Code integration |

---
Expand Down
2 changes: 1 addition & 1 deletion databricks-skills/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,10 +33,10 @@ cp -r ai-dev-kit/databricks-skills/databricks-agent-bricks .claude/skills/
## Available Skills

### 🤖 AI & Agents
- **databricks-ai-functions** - Built-in AI Functions (ai_classify, ai_extract, ai_summarize, ai_query, ai_forecast, ai_parse_document, and more) with SQL and PySpark patterns, function selection guidance, document processing pipelines, and custom RAG (parse → chunk → index → query)
- **databricks-agent-bricks** - Knowledge Assistants, Genie Spaces, Supervisor Agents
- **databricks-genie** - Genie Spaces: create, curate, and query via Conversation API
- **databricks-model-serving** - Deploy MLflow models and AI agents to endpoints
- **databricks-parsing** - Parse documents (PDF, DOCX, images) with ai_parse_document for custom RAG pipelines
- **databricks-unstructured-pdf-generation** - Generate synthetic PDFs for RAG
- **databricks-vector-search** - Vector similarity search for RAG and semantic search

Expand Down
348 changes: 348 additions & 0 deletions databricks-skills/databricks-ai-functions/1-task-functions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,348 @@
# Task-Specific AI Functions — Full Reference

These functions require no model endpoint selection. They call pre-configured Foundation Model APIs optimized for each task. All require DBR 15.1+ (15.4 ML LTS for batch); `ai_parse_document` requires DBR 17.1+.

---

## `ai_analyze_sentiment`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_analyze_sentiment

Returns one of: `positive`, `negative`, `neutral`, `mixed`, or `NULL`.

```sql
SELECT ai_analyze_sentiment(review_text) AS sentiment
FROM customer_reviews;
```

```python
from pyspark.sql.functions import expr
df = spark.table("customer_reviews")
df.withColumn("sentiment", expr("ai_analyze_sentiment(review_text)")).display()
```

---

## `ai_classify`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_classify

**Syntax:** `ai_classify(content, labels)`
- `content`: STRING — text to classify
- `labels`: ARRAY\<STRING\> — 2 to 20 mutually exclusive categories

Returns the matching label or `NULL`.

```sql
SELECT ticket_text,
ai_classify(ticket_text, ARRAY('urgent', 'not urgent', 'spam')) AS priority
FROM support_tickets;
```

```python
from pyspark.sql.functions import expr
df = spark.table("support_tickets")
df.withColumn(
"priority",
expr("ai_classify(ticket_text, array('urgent', 'not urgent', 'spam'))")
).display()
```

**Tips:**
- Fewer labels = more consistent results (2–5 is optimal)
- Labels should be mutually exclusive and clearly distinguishable
- Not suitable for multi-label classification — run multiple calls if needed

---

## `ai_extract`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_extract

**Syntax:** `ai_extract(content, labels)`
- `content`: STRING — source text
- `labels`: ARRAY\<STRING\> — entity types to extract

Returns a STRUCT where each field name matches a label. Fields are `NULL` if not found.

```sql
-- Extract and access fields directly
SELECT
entities.person,
entities.location,
entities.date
FROM (
SELECT ai_extract(
'John Doe called from New York on 2024-01-15.',
ARRAY('person', 'location', 'date')
) AS entities
FROM messages
);
```

```python
from pyspark.sql.functions import expr
df = spark.table("messages")
df = df.withColumn(
"entities",
expr("ai_extract(message, array('person', 'location', 'date'))")
)
df.select("entities.person", "entities.location", "entities.date").display()
```

**Use `ai_query` instead when:** the output has nested arrays or more than ~5 levels of hierarchy.

---

## `ai_fix_grammar`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_fix_grammar

**Syntax:** `ai_fix_grammar(content)` — Returns corrected STRING.

Optimized for English. Useful for cleaning user-generated content before downstream processing.

```sql
SELECT ai_fix_grammar(user_comment) AS corrected FROM user_feedback;
```

```python
from pyspark.sql.functions import expr
df = spark.table("user_feedback")
df.withColumn("corrected", expr("ai_fix_grammar(user_comment)")).display()
```

---

## `ai_gen`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_gen

**Syntax:** `ai_gen(prompt)` — Returns a generated STRING.

Use for free-form text generation where the output format doesn't need to be structured. For structured JSON output, use `ai_query` with `responseFormat`.

```sql
SELECT product_name,
ai_gen(CONCAT('Write a one-sentence marketing tagline for: ', product_name)) AS tagline
FROM products;
```

```python
from pyspark.sql.functions import expr
df = spark.table("products")
df.withColumn(
"tagline",
expr("ai_gen(concat('Write a one-sentence marketing tagline for: ', product_name))")
).display()
```

---

## `ai_mask`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_mask

**Syntax:** `ai_mask(content, labels)`
- `content`: STRING — text with sensitive data
- `labels`: ARRAY\<STRING\> — entity types to redact

Returns text with identified entities replaced by `[MASKED]`.

Common label values: `'person'`, `'email'`, `'phone'`, `'address'`, `'ssn'`, `'credit_card'`

```sql
SELECT ai_mask(
message_body,
ARRAY('person', 'email', 'phone', 'address')
) AS message_safe
FROM customer_messages;
```

```python
from pyspark.sql.functions import expr
df = spark.table("customer_messages")
df.withColumn(
"message_safe",
expr("ai_mask(message_body, array('person', 'email', 'phone'))")
).write.format("delta").mode("append").saveAsTable("catalog.schema.messages_safe")
```

---

## `ai_similarity`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_similarity

**Syntax:** `ai_similarity(expr1, expr2)` — Returns a FLOAT between 0.0 and 1.0.

Use for fuzzy deduplication, search result ranking, or item matching across datasets.

```sql
-- Deduplicate company names (similarity > 0.85 = likely duplicate)
SELECT a.id, b.id, a.name, b.name,
ai_similarity(a.name, b.name) AS score
FROM companies a
JOIN companies b ON a.id < b.id
WHERE ai_similarity(a.name, b.name) > 0.85
ORDER BY score DESC;
```

```python
from pyspark.sql.functions import expr
df = spark.table("product_search")
df.withColumn(
"match_score",
expr("ai_similarity(search_query, product_title)")
).orderBy("match_score", ascending=False).display()
```

---

## `ai_summarize`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_summarize

**Syntax:** `ai_summarize(content [, max_words])`
- `content`: STRING — text to summarize
- `max_words`: INTEGER (optional) — word limit; default 50; use `0` for uncapped

```sql
-- Default (50 words)
SELECT ai_summarize(article_body) AS summary FROM news_articles;

-- Custom word limit
SELECT ai_summarize(article_body, 20) AS brief FROM news_articles;
SELECT ai_summarize(article_body, 0) AS full FROM news_articles;
```

```python
from pyspark.sql.functions import expr
df = spark.table("news_articles")
df.withColumn("summary", expr("ai_summarize(article_body, 30)")).display()
```

---

## `ai_translate`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_translate

**Syntax:** `ai_translate(content, to_lang)`
- `content`: STRING — source text
- `to_lang`: STRING — target language code

**Supported languages:** `en`, `de`, `fr`, `it`, `pt`, `hi`, `es`, `th`

For unsupported languages, use `ai_query` with a multilingual model endpoint.

```sql
-- Single language
SELECT ai_translate(product_description, 'es') AS description_es FROM products;

-- Multi-language fanout
SELECT
description,
ai_translate(description, 'fr') AS description_fr,
ai_translate(description, 'de') AS description_de
FROM products;
```

```python
from pyspark.sql.functions import expr
df = spark.table("products")
df.withColumn(
"description_es",
expr("ai_translate(product_description, 'es')")
).display()
```

---

## `ai_parse_document`

**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document

**Requires:** DBR 17.1+

**Syntax:** `ai_parse_document(content [, options])`
- `content`: BINARY — document content loaded from `read_files()` or `spark.read.format("binaryFile")`
- `options`: MAP\<STRING, STRING\> (optional) — parsing configuration

**Supported formats:** PDF, JPG/JPEG, PNG, DOCX, PPTX

Returns a VARIANT with pages, elements (text paragraphs, tables, figures, headers, footers), bounding boxes, and error metadata.

**Options:**

| Key | Values | Description |
|-----|--------|-------------|
| `version` | `'2.0'` | Output schema version |
| `imageOutputPath` | Volume path | Save rendered page images |
| `descriptionElementTypes` | `''`, `'figure'`, `'*'` | AI-generated descriptions (default: `'*'` for all) |

**Output schema:**

```
document
├── pages[] -- page id, image_uri
└── elements[] -- extracted content
├── type -- "text", "table", "figure", etc.
├── content -- extracted text
├── bbox -- bounding box coordinates
└── description -- AI-generated description
metadata -- file info, schema version
error_status[] -- errors per page (if any)
```

```sql
-- Parse and extract text blocks
SELECT
path,
parsed:pages[*].elements[*].content AS text_blocks,
parsed:error AS parse_error
FROM (
SELECT path, ai_parse_document(content) AS parsed
FROM read_files('/Volumes/catalog/schema/landing/docs/', format => 'binaryFile')
);

-- Parse with options (image output + descriptions)
SELECT ai_parse_document(
content,
map(
'version', '2.0',
'imageOutputPath', '/Volumes/catalog/schema/volume/images/',
'descriptionElementTypes', '*'
)
) AS parsed
FROM read_files('/Volumes/catalog/schema/volume/invoices/', format => 'binaryFile');
```

```python
from pyspark.sql.functions import expr

df = (
spark.read.format("binaryFile")
.load("/Volumes/catalog/schema/landing/docs/")
.withColumn("parsed", expr("ai_parse_document(content)"))
.selectExpr(
"path",
"parsed:pages[*].elements[*].content AS text_blocks",
"parsed:error AS parse_error",
)
.filter("parse_error IS NULL")
)

# Chain with task-specific functions on the extracted text
df = (
df.withColumn("summary", expr("ai_summarize(text_blocks, 50)"))
.withColumn("entities", expr("ai_extract(text_blocks, array('date', 'amount', 'vendor'))"))
.withColumn("category", expr("ai_classify(text_blocks, array('invoice', 'contract', 'report'))"))
)
df.display()
```

**Limitations:**
- Processing is slow for dense or low-resolution documents
- Suboptimal for non-Latin alphabets and digitally signed PDFs
- Custom models not supported — always uses the built-in parsing model
Loading