chore: blob export reference #2771

Open
sumerman wants to merge 1 commit into main from
valeriy/blob-export-field-reference

sumerman (Contributor) commented Apr 2, 2026

No description provided.

vercel bot commented Apr 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| langfuse-docs | Ready | Preview, Comment | Apr 2, 2026 3:23pm |

dosubot bot added the `size:L` label (This PR changes 100-499 lines, ignoring generated files.) Apr 2, 2026

github-actions bot commented Apr 2, 2026

@claude review

dosubot bot added the `documentation` label (Improvements or additions to documentation) Apr 2, 2026

Comment on lines +81 to +91
| `name` | string | User-defined observation name. | Group/filter by name (e.g. function name, model call label). |
| `metadata` | object | User-supplied key-value metadata. | Arbitrary context. Extract keys relevant to your analytics. |
| `level` | string | Log level: `DEBUG`, `DEFAULT`, `WARNING`, `ERROR`. | Filter for errors or warnings. |
| `status_message` | string | Status or error message. | Inspect for debugging failed observations. |
| `version` | string | User-provided version string set via the SDK. | Informational. |
| `input` | string | Observation input payload. | For generations: the prompt/messages sent to the LLM. May be plain text or JSON; may be large. |
| `output` | string | Observation output payload. | For generations: the LLM response. May be plain text or JSON; may be large. |
| `provided_model_name` | string | Model name as provided by the user/SDK. | The raw model string (e.g. `gpt-4o`, `claude-sonnet-4-20250514`). This is what the API returns as `model`. |
| `model_parameters` | string | Model call parameters as a JSON-encoded string (e.g. `"{\"temperature\":0.7}"`). | Parse as JSON. Useful for analyzing how model settings affect quality/cost. |
| `usage_details` | object (string → integer) | Token usage breakdown by category. | Extract keys: `input` for input tokens, `output` for output tokens, `total` for total. May contain additional keys like `input_cached_tokens`, `reasoning_tokens`, etc. |
| `cost_details` | object (string → number) | Cost breakdown by category (USD). | Extract keys: `input` for input cost, `output` for output cost. |

🔴 Eight fields in the observations table are documented as non-nullable but will be null in practice: name, version, input, and output are optional SDK parameters that may be absent on any observation type; provided_model_name and model_parameters are GENERATION-specific and null for SPAN/EVENT observations; usage_details and cost_details are token/cost fields that are also null for non-GENERATION observations. All eight should be annotated with "or null" to match the document's own convention for adjacent nullable fields such as model_id, prompt_version, and end_time.

Extended reasoning...

What the bug is and how it manifests

The observations table in the new blob-storage-export-fields.mdx documents eight fields without a null annotation, implying they are non-nullable. However, all eight can and will be null in real exports, depending on the observation type and how the SDK was called. The document itself establishes a clear convention of "X or null" for nullable fields (e.g., parent_observation_id: string or null, end_time: string (timestamp) or null, model_id: string or null, prompt_version: integer or null), so omitting it on these eight fields is an internal inconsistency.

The specific code paths that trigger nulls

Fields name, version, input, output (lines 81-87): These are all optional parameters in the Langfuse SDK. A user can call langfuse.span(trace_id=..., start_time=..., end_time=...) or langfuse.event(trace_id=..., start_time=...) without providing any of these. EVENT observations in particular are point-in-time markers that often have no name, no input, and no output. The doc's own usage notes reinforce this: the input note says "For generations: the prompt/messages sent to the LLM" and the output note says "For generations: the LLM response" - both explicitly scope these fields to the GENERATION type, implying they are absent (null) for SPAN and EVENT rows. The inconsistency is visible within the same table: prompt_version is correctly typed integer or null, but the semantically equivalent version field is plain string.

Fields provided_model_name, model_parameters (lines 88-89): These are LLM/model-specific fields. The type field documentation lists three values: SPAN, GENERATION, or EVENT, with the note "Generations are LLM calls; spans are arbitrary operations; events are point-in-time markers." Only GENERATION observations involve an LLM call and therefore have a model name or call parameters. SPAN and EVENT rows will be null for both fields. The document already acknowledges this pattern for the derived field model_id, which is correctly typed string or null with the note "Null if no model definition matched" - if the resolved ID is nullable, provided_model_name is even more clearly nullable for non-GENERATION observations that never supply a model at all.

Fields usage_details, cost_details (lines 90-91): These track LLM token counts and monetary costs. The usage_details description says "Extract keys: input for input tokens, output for output tokens" - tokens only exist for GENERATION calls. A SPAN tracking a retrieval step or an EVENT marker has no token usage and no cost. All adjacent model-specific cost fields (input_price, output_price, total_price, usage_pricing_tier_name) are already correctly documented as or null. Documenting usage_details and cost_details as non-nullable while all their derived pricing fields are nullable is internally contradictory.
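
To make the consumer-side consequence concrete, here is a minimal sketch of null-safe access to these fields. The helper names (`token_counts`, `model_params`) and the sample rows are hypothetical, and defaulting missing token counts to 0 is a design choice for this illustration, not something the reference prescribes.

```python
import json
from typing import Any, Optional


def token_counts(row: dict[str, Any]) -> dict[str, int]:
    """Return input/output/total token counts, treating a null usage_details as empty."""
    usage = row.get("usage_details") or {}
    return {key: int(usage.get(key, 0)) for key in ("input", "output", "total")}


def model_params(row: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Decode the JSON-encoded model_parameters string, or return None when it is null."""
    raw = row.get("model_parameters")
    return json.loads(raw) if raw else None


# Hypothetical rows shaped like the fields discussed above.
generation = {
    "type": "GENERATION",
    "usage_details": {"input": 12, "output": 30, "total": 42},
    "model_parameters": "{\"temperature\":0.7}",
}
span = {"type": "SPAN", "usage_details": None, "model_parameters": None}

generation_tokens = token_counts(generation)
span_tokens = token_counts(span)
```

If a pipeline needs to distinguish "no usage recorded" from "zero tokens", returning `None` instead of zeros may be the better choice.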

Why existing documentation does not prevent it

This is a new documentation file added entirely by this PR. No automated linter or schema validator checks type accuracy in MDX prose tables. The inconsistency was introduced by applying "or null" correctly to many nullable fields but omitting it from these eight. The contrast is visible within the same table: prompt_version is correctly typed integer or null, but version is plain string.

Impact

Data engineers reading this reference to define warehouse schemas will reasonably treat the absence of "or null" as meaning the column is always populated. A data engineer who writes name TEXT NOT NULL, input TEXT NOT NULL, provided_model_name TEXT NOT NULL in their DDL, or who unconditionally dereferences usage_details keys in their pipeline, will encounter NOT NULL constraint violations or null-dereference errors on the first batch containing any non-GENERATION observation or any observation created without these optional fields.

Step-by-step proof

  1. A user instruments a RAG pipeline: SPAN observations for retrieval, GENERATION observations for LLM calls, EVENT observations for cache hits.
  2. In the export, every SPAN and EVENT row has null for provided_model_name, model_parameters, usage_details, and cost_details. Many rows also have null for name, input, output, and version since those were not provided.
  3. A data engineer reads the observations table and sees name: string, input: string, provided_model_name: string, usage_details: object - all without or null.
  4. They create a warehouse table with NOT NULL constraints on these columns and pipeline code that always accesses usage_details["input"].
  5. The first export load fails: NOT NULL violations on every non-GENERATION row; null-dereference errors when parsing usage_details on SPAN/EVENT rows.
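
The failure in steps 4 and 5 can be sketched with an in-memory SQLite table. This is an illustration only: the table layout, the two sample rows, and the `load` helper are hypothetical, and a real warehouse would report the violation in its own wording.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema written by someone who read "string" as "always populated".
conn.execute(
    "CREATE TABLE observations ("
    "id TEXT PRIMARY KEY, type TEXT NOT NULL, "
    "name TEXT NOT NULL, provided_model_name TEXT NOT NULL)"
)

# Hypothetical export rows: the SPAN has no model name.
rows = [
    ("obs-1", "GENERATION", "answer", "gpt-4o"),
    ("obs-2", "SPAN", "retrieval", None),
]


def load(connection, batch):
    """Insert rows one by one and collect (id, error message) for rejected rows."""
    rejected = []
    for row in batch:
        try:
            connection.execute("INSERT INTO observations VALUES (?, ?, ?, ?)", row)
        except sqlite3.IntegrityError as exc:
            rejected.append((row[0], str(exc)))
    return rejected


rejected = load(conn, rows)
for obs_id, message in rejected:
    print(obs_id, message)
```

Only the SPAN row is rejected, and the error names the `provided_model_name` column.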

How to fix

Change all eight fields to include "or null": name: string or null, version: string or null, input: string or null, output: string or null, provided_model_name: string or null, model_parameters: string or null, usage_details: object (string → integer) or null, cost_details: object (string → number) or null. Optionally add usage notes clarifying that provided_model_name, model_parameters, usage_details, and cost_details are null for non-GENERATION observations.
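
The nullability claim could also be checked against a real export before updating the table. A rough sketch, assuming the exported rows have already been loaded as Python dicts (the loading step, the `null_counts` helper, and the sample rows are all hypothetical):

```python
from collections import Counter
from typing import Any, Iterable

# The eight fields flagged in this review.
FLAGGED_FIELDS = (
    "name", "version", "input", "output",
    "provided_model_name", "model_parameters",
    "usage_details", "cost_details",
)


def null_counts(rows: Iterable[dict[str, Any]], fields: Iterable[str]) -> Counter:
    """Count, per field, how many rows have the field missing or set to null."""
    fields = tuple(fields)
    counts: Counter = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) is None:
                counts[field] += 1
    return counts


# Hypothetical sample: one fully populated GENERATION, one sparse EVENT.
sample = [
    {
        "type": "GENERATION", "name": "answer", "version": "v1",
        "input": "hi", "output": "hello",
        "provided_model_name": "gpt-4o", "model_parameters": "{}",
        "usage_details": {"input": 1, "output": 1, "total": 2},
        "cost_details": {"input": 0.0, "output": 0.0},
    },
    {"type": "EVENT", "name": "cache-hit"},
]

counts = null_counts(sample, FLAGGED_FIELDS)
```

Any field with a non-zero count in a real export needs the "or null" annotation.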
