Skip to content

Conversation

mjwolf
Copy link
Contributor

@mjwolf mjwolf commented Aug 14, 2025

This change introduces a new interactive prompt to the create data-stream command. When a user creates a data stream of type logs, they can now select from a list of common input types (e.g., aws-s3, gcp-pubsub, udp, tcp).

Based on the user's selection, the command now automatically populates the data_stream//manifest.yml with the corresponding input streams and their required variables. This streamlines the creation process for new log-based data streams by providing sensible defaults and reducing manual configuration.

This also updates the package README generation step to support a new function "inputDocs". When this is used in the readme template, all inputs used in the package will be found, and all will be listed in the readme. For inputs which have additional documentation added to elastic-package, the input documentation will also be inserted into the rendered readme.

Key changes include:

Added a multi-select prompt to the create data-stream command for choosing log input types.

Added internal/packages/_static/inputs/*.yml files, which contains information on each input type supported by the create command, with information on default name, title, description and variables.

Introduced internal/packages/archetype/data_stream_inputs.go to manage the logic for populating input variables from the new static inputs yml files.

Updated the dataStream-manifest.yml.tmpl to dynamically render streams and their associated variables.

Refactored the packages.DataStreamManifest to use a new packages.Stream struct for better organization.

Add "inputDocs" to the readme render functions, which will find all inputs used in the package, and render documentation for them.

Add docs/_static/inputs/*.yml for keeping input documentation (this is separate from archetype yml file, to avoid a circular dependancy between the packages).

How to test:

Create command
Run the elastic-package create data-stream command.

Select logs as the data stream type.

Select one or more input types from the list.

Verify that the generated data_stream//manifest.yml contains the correct streams configuration for the inputs you selected.

Build readme
Update a package readme to have the {{inputDocs}} function in the readme template

Build the package

Check the generated readme for the input documentation.

mjwolf added 6 commits August 12, 2025 17:15
With the create data-stream command, ask the user for inputs they will
use, and populate the created manifest with the selected values.
Render data stream manifests with the complete and valid input streams,
based on the data stream descriptor provided to the renderer.
Fix an error generating data stream manifest (by setting subojects
default to true) and fix system test_runner to use new Stream struct.
@mjwolf mjwolf added the enhancement New feature or request label Aug 14, 2025
The partition consumer waits up to a "receive count" or a "receive timeout", whichever comes first.

Default is `100` messages.
- cel:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- cel:
- name: cel

Prompt: &survey.MultiSelect{
Message: "Select input types which will be used in this data stream",
Options: []string{
"aws-cloudwatch",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UX question,
Do we want to present a more descriptive text to user?
That way we can add notes such as "deprecated" or additional context right when the user selects the input

ex:
etw -> Event Tracing for Windows
aws-s3 -> AWS S3 (S3 polling / SQS Notifications)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that too. It would be nice, but I don't think it's possible with the survey module we're using.

It might be nice to switch to another TUI module which can do more in the future, but that'll be a big job. For now I can add a link to the input documentation in the first prompt.

I think that's the best that can be done with this module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to switch to another TUI module which can do more in the future, but that'll be a big job. For now I can add a link to the input documentation in the first prompt.

Yep, actually the one we are using now is archived 😕 But we haven't considered yet the effort to migrate to another one.

I think that's the best that can be done with this module.

Multiselect accepts a Description function that can be used to give more details about each option. Not sure though how this is displayed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I didn't know about Description.

I've added that, and it shows the description next to each option:
Screenshot 2025-08-18 at 11 56 31 AM

Copy link
Contributor

@kgeller kgeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some general feedback/questions:

  1. It feels a bit awkward without the changes to the agent/stream template since the manifest will align with what's there. I know we have that slotted for a subtask, but I feel it would make more sense to include all together. Unless we can guarantee they'll be merged together around the same time and will definitely make it into the same release

  2. I appreciate that the user doesn't have to select an input, but then we get a pretty empty generated manifest file (and later no agent/stream files). I feel it could be beneficial if no input is selected, we fall back to what gets generated today with the currently existing logfile details

  3. Not sure how possible this is, but I think it could also be great if the manifest sections could be broken out into their own files. The one big massive file feels less ideal for maintaining

  4. Is there anything we can do sync up the package manifest with the new datastream details?

"gcp-pubsub",
"gcs",
"http_endpoint",
"httpjson",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we drop httpjson since cel is now the preferred?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken out httpjson

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see it when testing locally and in the PR?

@mjwolf mjwolf marked this pull request as draft August 15, 2025 18:41
@mjwolf
Copy link
Contributor Author

mjwolf commented Aug 15, 2025

Some general feedback/questions:

  1. It feels a bit awkward without the changes to the agent/stream template since the manifest will align with what's there. I know we have that slotted for a subtask, but I feel it would make more sense to include all together. Unless we can guarantee they'll be merged together around the same time and will definitely make it into the same release

I agree with that now. I was thinking that it'll be a big PR to have both changes in the same PR. But now that I'm working on the 2nd, a lot of the changes are closely related, and I think it makes sense to do them both at the same time.

So I converted this PR to draft, and I'll keep working on it to do them both at the same time.

Copy link
Member

@jsoriano jsoriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all the information on this PR we could create many input packages, that could be later reused as part of "composable" packages. Input packages would then become the source of truth for these settings and docs.

I am fine with adding something like this by now, till we have more input packages, and composable packages. But I am a bit concerned about the long-term maintenance of docs and variables.

Prompt: &survey.MultiSelect{
Message: "Select input types which will be used in this data stream",
Options: []string{
"aws-cloudwatch",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to switch to another TUI module which can do more in the future, but that'll be a big job. For now I can add a link to the input documentation in the first prompt.

Yep, actually the one we are using now is archived 😕 But we haven't considered yet the effort to migrate to another one.

I think that's the best that can be done with this module.

Multiselect accepts a Description function that can be used to give more details about each option. Not sure though how this is displayed.

"redis",
"tcp",
"udp",
"winlog",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if there is some source of truth that could allow us to keep this list automatically updated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know of a source of truth that could be used for this. The inputs here were selected because they are the most-used. I think the idea of composable inputs could be good to become the source of truth. It would be good if that had some concept of deprecated/discouraged inputs, since there's now some inputs that we don't want to add new usages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a plan to keep these docs updated in packages?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a plan to keep these list of variables updated with the actual inputs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about "composable" packages. I think that could be a better way to handle this. These files were added here mainly because there isn't currently a good way to get information on all inputs in a single way. If there was a common location to get information on all inputs that could be used instead.

Improving scalability of integration maintenance is something my team will work on more in the near future, and there's this ticket, which has a similar problem: https://github.com/elastic/integration-experience/issues/164. So we don't have a very solid plan to maintain this yet, but it's something we plan to improve very soon.

@mjwolf
Copy link
Contributor Author

mjwolf commented Aug 18, 2025

This is what the panw integration rendered README looks like in Kibana when using this markdown in the "References" section:

### Inputs Used

These inputs can be used with this integration:

{{ inputDocs }}

Uploading Screenshot 2025-08-21 at 4.26.06 PM.png…

(This isn't published in panw yet)

@mjwolf mjwolf changed the title Allow selecting inputs for log data streams in data stream create wizard Generate data stream input configuration and documentation Aug 18, 2025
@mjwolf mjwolf marked this pull request as ready for review August 18, 2025 23:49
Copy link
Contributor

@kgeller kgeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what the panw integration rendered README looks like in Kibana when using this markdown in the "References" section:

@mjwolf is it possible to take advantage of the collapsable sections here with md now that we have support for that in kibana? I worry these big sections at the bottom will create unnecessary noise for the users when browsing the page

Kibana supports collapsable in the standard md format per elastic/kibana#223916 and we could add support to the docs page to identify that and collapse them as well (they don't support the standard md collapse, only a dropdown directive that they insert


Can we also have the top-level package manifest get when we add the datastream?

@mjwolf
Copy link
Contributor Author

mjwolf commented Aug 21, 2025

This is what the panw integration rendered README looks like in Kibana when using this markdown in the "References" section:

@mjwolf is it possible to take advantage of the collapsable sections here with md now that we have support for that in kibana? I worry these big sections at the bottom will create unnecessary noise for the users when browsing the page

I've changed to use collapsible sections for each input's doc (and updated the screenshot above)

@mjwolf mjwolf force-pushed the create-datastream-updates branch from f411f9b to d320f9e Compare August 21, 2025 23:41
Comment on lines 157 to 192
switch value {
case "aws-cloudwatch":
return "AWS Cloudwatch"
case "aws-s3":
return "AWS S3"
case "azure-blob-storage":
return "Azure Blob Storage"
case "azure-eventhub":
return "Azure Eventhub"
case "cel":
return "Common Expression Language (CEL)"
case "entity-analytics":
return "Entity Analytics"
case "etw":
return "Event Tracing for Windows (ETW)"
case "filestream":
return "Filestream"
case "gcp-pubsub":
return "GCP PubSub"
case "gcs":
return "Google Cloud Storage (GCS)"
case "http_endpoint":
return "HTTP Endpoint"
case "journald":
return "Journald"
case "netflow":
return "Netflow"
case "redis":
return "Redis"
case "tcp":
return "TCP"
case "udp":
return "UDP"
case "winlog":
return "WinLogBeat"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ensure that we keep this list updated with the options of the multiselect, we could keep this info in a map, and use the ordered keys as options for the multiselect.

Suggested change
switch value {
case "aws-cloudwatch":
return "AWS Cloudwatch"
case "aws-s3":
return "AWS S3"
case "azure-blob-storage":
return "Azure Blob Storage"
case "azure-eventhub":
return "Azure Eventhub"
case "cel":
return "Common Expression Language (CEL)"
case "entity-analytics":
return "Entity Analytics"
case "etw":
return "Event Tracing for Windows (ETW)"
case "filestream":
return "Filestream"
case "gcp-pubsub":
return "GCP PubSub"
case "gcs":
return "Google Cloud Storage (GCS)"
case "http_endpoint":
return "HTTP Endpoint"
case "journald":
return "Journald"
case "netflow":
return "Netflow"
case "redis":
return "Redis"
case "tcp":
return "TCP"
case "udp":
return "UDP"
case "winlog":
return "WinLogBeat"
}
return inputOptions[value]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's a good idea to use a map. I've done that, so there's now just one map for the names and descriptions, instead of the duplicated lists.

inputTcp,
inputUdp,
inputWinlog,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list is duplicated in internal/packages/archetype/resources.go? Could we reuse it?

Copy link
Contributor Author

@mjwolf mjwolf Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this list completely, and switched to embed.FS. It will now walk all yaml files in the embedded dir to find inputs, so maintenance should be a bit less now.

I also moved all static input files into the docs module (it had to be in docs to avoid a circular dependency), so now we also won't have to maintain two sets of yaml files.


// Input definitions

//go:embed _static/inputs/aws-cloudwatch.yml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it also possible to embed a whole directory. In case this helps, maybe when adding new inputs in the future.

//go:embed _static/inputs
var inputResources embed.FS

mjwolf and others added 6 commits August 22, 2025 09:27
Changes are:
* Move all static input docs into `docs` module
* Use embed.FS instead of embedding individual files
* Remove list of inputs; instead the file dir will be walked, removing
  the need to maintain this list.
* In `cmd/create_data_stream.go`, use a map for input names and
  descriptions.
When a new data stream is created, add a simple policy to the main
package manifest, to help users get started with the manifest.
@mjwolf
Copy link
Contributor Author

mjwolf commented Aug 25, 2025

  1. Is there anything we can do sync up the package manifest with the new datastream details?

@kgeller, I had updated this to add a new policy_template to the package manifest when a new data stream is created, but it then caused some unit tests to fail, so I took it out. I think there's some interaction with the package-spec that makes it a bit more complicated then I thought. What do you think of leaving this out of this PR, since it's already pretty big.

We could look at updating the package spec later.

(This is what I did to update the package manifest when a data stream is added: 05acf87)

@mjwolf mjwolf requested a review from vinit-chauhan August 25, 2025 14:16
@kgeller
Copy link
Contributor

kgeller commented Aug 25, 2025

@mjwolf

I had updated this to add a new policy_template to the package manifest when a new data stream is created, but it then caused some unit tests to fail, so I took it out. I think there's some interaction with the package-spec that makes it a bit more complicated then I thought. What do you think of leaving this out of this PR, since it's already pretty big.

We could look at updating the package spec later.

(This is what I did to update the package manifest when a data stream is added: 05acf87)

Thanks for giving it a shot, I think it's fine to leave for now as you mentioned, especially since that is today's current behavior anyway. Definitely feel free to consider it for phase two 😉


I've changed to use collapsible sections for each input's doc (and updated the screenshot above)

Do we have any issue/sub-task to track the support being also added to the docs?

Comment on lines 62 to 70
lines := strings.Split(s, "\n")
indent := ""
for range numSpaces {
indent = indent + " "
}
for i, line := range lines {
lines[i] = indent + line
}
return strings.Join(lines, "\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. A scanner could be used to split the lines, and a string builder to append them. Something like this:

Suggested change
lines := strings.Split(s, "\n")
indent := ""
for range numSpaces {
indent = indent + " "
}
for i, line := range lines {
lines[i] = indent + line
}
return strings.Join(lines, "\n")
indent := strings.Repeat(" ", numSpaces)
var b strings.Builder
var s scanner.Scanner
s.Init(strings.NewReader(s))
for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
fmt.Fprintln(&b, indent, s.TokenText())
}
return b.String()

description: Client secret used for OAuth2 authentication
show_user: true
required: false
secret: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to create a data stream with the cel input, and some settings like secret are not preserved 🤔

      - name: oauth_secret
        type: password
        show_user: true
        description: |-
          Client secret used for OAuth2 authentication

In the case of secret this leads to errors like:

variable identified as possible secret, secret parameter required to be set to true or false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Secret field wasn't in the updated manifest template, so it wasn't added into the resulting manifest, even though it was in the inputs yaml definition.

I've added secrets, and other missing fields to the manifest template now.

Comment on lines +112 to +113
(Remove the spaces between curly braces when using)
{ { inputDocs } }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still have to take a look at how to avoid this, but this can go in a follow up.

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants