This changeset includes a few major changes to support testing multiple files on a device under test with a different workload running on each file. Such a workload is important for evaluating streaming and open channel SSDs. #46
Conversation
The test operator passes the following command line to test disk 2 using a multi-target workload:
$> StorScore.cmd --target=2 --recipe=multitarget.rcp
As with previous versions of StorScore, the command line defines the device under test, and the recipe file defines the workload. New with these changes, the recipe also defines the number of files to create and test. For example:
multitarget.rcp:
test(
description => "Multitarget Sample",
target_count => 2,
xml_profile => "sample.xml",
initialize => 0,
warmup_time => 5,
run_time => 60,
);
Unlike single-target tests, multi-target tests define the workload in a separate xml profile template. StorScore checks to make sure the recipe points to a profile file whenever the target count is greater than 1.
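For illustration, here is a minimal sketch of what such a template might look like, assuming diskspd's standard XML profile schema; the __DURATION__, __FILE1__, and __FILE2__ placeholders are hypothetical stand-ins for the values StorScore injects, and the per-target settings mirror the two-file example used later in this description:

sample.xml (sketch):

<Profile>
  <TimeSpans>
    <TimeSpan>
      <Duration>__DURATION__</Duration>
      <Targets>
        <Target>
          <Path>__FILE1__</Path>
          <BlockSize>4096</BlockSize>
          <Random>4096</Random>
          <WriteRatio>0</WriteRatio>
          <RequestCount>1</RequestCount>
        </Target>
        <Target>
          <Path>__FILE2__</Path>
          <BlockSize>8192</BlockSize>
          <Random>8192</Random>
          <WriteRatio>100</WriteRatio>
          <RequestCount>128</RequestCount>
        </Target>
      </Targets>
    </TimeSpan>
  </TimeSpans>
</Profile>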
During each test, StorScore wipes the disk, creates a partition and file system, and then divides the disk space evenly among the number of targets given by the recipe. It injects a few variables into the xml template, such as file names and run time, exports the xml document to the results directory, and then points diskspd at this profile (Diskspd.exe -Xmodified-sample.xml).
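A minimal sketch of that substitution step, assuming the hypothetical __NAME__ placeholders from the template sketch above (fill_template is illustrative, not StorScore's actual code):

# Sketch: inject generated values into the XML template and write the
# result next to the other test artifacts.
sub fill_template
{
    my ( $template_file, $output_file, %vars ) = @_;

    open( my $in, '<', $template_file ) or die "open $template_file: $!";
    my $xml = do { local $/; <$in> };    # slurp the whole template
    close $in;

    # Replace each __NAME__ placeholder with its generated value
    $xml =~ s/__\Q$_\E__/$vars{$_}/g foreach keys %vars;

    open( my $out, '>', $output_file ) or die "open $output_file: $!";
    print {$out} $xml;
    close $out;
}

# e.g. fill_template( 'sample.xml', 'modified-sample.xml',
#     FILE1 => 'E:\file1.dat', FILE2 => 'E:\file2.dat', DURATION => 60 );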
NB: In the future, I would like to generate the xml document from scratch to reduce user error and frustration. The XML profiles can be tedious to generate, and they can easily get out of sync with the recipe.
These changes also update the parsed data. Previously, the Excel file printed one row for each test:
Disk | Description | W Mix | IO Size | MB/s | Avg Latency
-------------------------------------------------------------------------------------
PM953 | 4k Rand Read | 0 | 4k | 502.45 | 0.092
PM953 | 4k Rand Write | 100 | 4k | 235.45 | 1.201
PM953 | 8k Rand Read | 0 | 8k | 252.45 | 0.181
Now, each row shows either per-target workload and results or the aggregate workload and results:
Disk | Description | Target | W Mix | IO Size | MB/s | Avg Latency
------------------------------------------------------------------------------------------------------
PM953 | 4k Rand Read | Total | 0 | 4k | 502.45 | 0.092
PM953 | Example Multi | Total | 50 | | 235.45 | 1.201
PM953 | Example Multi | E:\file1.dat | 0 | 4k | 285.12 | 0.181
PM953 | Example Multi | E:\file2.dat | 100 | 8k | 102.48 | 1.589
The outlier and score calculations are based only on the measurements listed in the aggregate ("Total") rows.
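In other words, a scoring pass over the parsed rows keeps only the aggregate entries; a one-line sketch, assuming rows keyed like the table above:

# Per-file rows are informational; only "Total" rows feed the score
my @scored_rows = grep { $_->{'Target'} eq 'Total' } @all_rows;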
To support this output format, I added a level to the parser's main data structure:
Before:
stats =
{
    "FW version" => "10010L00",
    …
    "Read Mix" => 100,
    "QD" => 1,
    …
    "MB/s Read" => 1175,
    "IOs Read" => 20398,
    …
}
After:
stats =
{
    "FW version" => "10010L00",
    …
    "Workloads" =>
    {
        "Total" =>
        {
            "Read Mix" => 50,
            "QD" => 129,
            …
        },
        "E:\file1.dat" =>
        {
            "Read Mix" => 100,
            "QD" => 1,
            …
        },
        "E:\file2.dat" =>
        {
            "Read Mix" => 0,
            "QD" => 128,
            …
        }
    },
    "Measurements" =>
    {
        "Total" =>
        {
            "MB/s Read" => 285.12,
            "IOs Read" => 10128,
            …
        },
        "E:\file1.dat" =>
        {
            "MB/s Read" => 285.12,
            "IOs Read" => 10128,
            …
        },
        "E:\file2.dat" =>
        {
            "MB/s Read" => 0,
            "IOs Read" => 0,
            …
        }
    }
}
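Consumers of the parsed data now walk one extra level. A minimal sketch in Perl, assuming the hash layout shown above:

# Sketch: "Total" is the aggregate entry; every other key is a file
foreach my $target ( sort keys %{ $stats{'Measurements'} } )
{
    my $meas = $stats{'Measurements'}{$target};
    my $work = $stats{'Workloads'}{$target};

    printf "%-16s read mix=%3d%%  %.2f MB/s read\n",
        $target, $work->{'Read Mix'}, $meas->{'MB/s Read'};
}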
Does this enable us to test multiple files on a single volume, or multiple volumes? The terminology is starting to get confusing. If the command line specifies --target=2 (a single target, disk number 2), how can the recipe then specify multiple targets?

I looked at one of the XML files and it doesn't seem to specify a file name, just a list of unnamed targets. I guess this works when you want StorScore.cmd to test multiple files on a single drive? Can you test what happens when you use a recipe like this and target an existing volume or file (--target=D:\ or --target=D:\myfile)? Both of those use cases are supported today.

Is it possible for the targets specified in the XML to conflict with the command line --target? What if the recipe "target_count" property doesn't match the XML file? Why not parse the XML and just count the targets, instead of requiring the user to do this themselves? That seems like an easy way to avoid a whole class of bugs.

Today we have some guard rails to prevent a user from accidentally destroying the wrong disk. Does that all still work with this?
marksantaniello left a comment:
I think overall my worry is that this is really a huge design change. The original design always assumed a target was a single file on a single device. If you want to change that assumption, I'd expect more of a re-design, but this feels more like a patchwork of hacks. I'd be shocked if it didn't break some existing functionality.
I think the cleaner way to do it would be:
- Support --targets=X,Y,Z on the command line, in addition to --target=X
- Simply launch multiple copies of the IO generator in parallel, one per target (see the sketch after this list)
There are a few nice things about this approach:
- It keeps the knowledge of the targets on the command line, where it feels like it belongs, rather than in the recipe.
- It enables testing multiple devices, not just multiple files on a single device
- It keeps the flexibility of StorScore to drive any IO generator, not just the ones that can target multiple files simultaneously
- We don't "sub out" from our nice recipe syntax to some ugly XML thing we don't control
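For what it's worth, a minimal sketch of that parallel-launch idea on Win32 Perl; the target paths and diskspd flags are illustrative, and system( 1, LIST ) is the Win32 Perl idiom for "spawn asynchronously and return the child PID":

# Sketch: one IO generator child per target, then wait for them all
my @targets = ( 'E:\test.dat', 'F:\test.dat' );
my @pids;

foreach my $target ( @targets )
{
    push @pids, system( 1, 'diskspd.exe', '-b4k', '-d60', '-r', $target );
}

waitpid( $_, 0 ) foreach @pids;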
$self->_active_range( 1 );
$self->_test_time_override( 5 );
$self->_warmup_time_override( 0 );
$self->_test_time_override( 10 ); # seconds
These changes seem unrelated to multiple targets
They are unrelated. I made the change so that demo mode would demo both the warmup and the test.
if( $cmd_line =~ /-Z.*?([^\\]*)\.bin/ )

#search for the format of Storscore's pre-generated entropy files
if( $cmd_line =~ /-Z.*?([^\\]*)_pct_comp\.bin/ )
I guess maybe you're intentionally breaking backward compatibility with really old results directories here. It's probably OK, but you should be aware of it. A lot of things in the parser like this were done in order to permit backward compatibility with results obtained before some storscore.cmd change was made.
My intent was to preserve backwards compatibility. I'm not sure why I made that change -- I reverted it and it works just fine.
my $test_ref   = $args{'test_ref'};
my $output_dir = $args{'output_dir'};

# preconditioner.exe does not support multiple files
If I understand correctly, PreconditionRunner::run_to_steady_state() no longer always calls precondition.exe and no longer runs to steady-state. This is going to feel like giant tech debt some day. Do words mean nothing? Today FooRunner literally always runs Foo, right?
Why not just run two copies of precondition.exe simultaneously?
Or, if you absolutely must use diskspd for this case, maybe it's better to do it one layer above in the caller?
This is one area I'm not totally happy with yet. I can't run multiple copies of preconditioner.exe, because I need them all to keep running until they all reach steady state. I put the functionality in precondition runner because this is conceptually a precondition step (to use SNIA's terminology).
One level up is in Target.pm. I want to give it some more thought, but I'm open to putting it there.
Interesting.... I think I understand what it means to drive a device to steady-state, but I'm not sure what it means to drive more than one file on a single device to steady-state. Hmm...
It's a new area, so I haven't seen any data for the effect. But I bet you can think of it like several independent drives. Say there are two, and you're driving a random write workload to each. Each needs to get fragmented before we enter the test phase. Stream A might fragment faster (maybe because it's doing 8kB writes and stream B is doing 4kB writes). If we used the preconditioner, it would exit and the drive might start cleaning up stream A's blocks while stream B is still preconditioning.
Something else is funny here. You don't want to launch multiple precondition.exe instances, because you want them to all reach steady-state before proceeding to the test, right? And yet, if I understand the DiskSpd alternative code, there's literally no dynamic detection of steady-state at all there -- you just run for an hour. It seems like you've given up on steady-state detection entirely.
One way to fix this would be to add an option to precondition.exe akin to "--run_forever_even_after_reaching_steady_state", except with a much better name. The preconditioner could print something to stdout like, "Steady state attained, send CTRL-C to terminate" or something.
StorScore would just have to launch multiple child processes of precondition.exe, wait for them to all reach steady-state, and then just kill them.
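A sketch of that orchestration in Perl; the --run_forever flag and the "Steady state attained" marker are hypothetical, per the proposal above:

# Sketch: launch one preconditioner per file, watch each child's stdout
# for the (hypothetical) steady-state marker, then stop them all
my @files = ( 'E:\file1.dat', 'E:\file2.dat' );
my @children;

foreach my $file ( @files )
{
    # a piped open returns the child's PID and hands us its stdout
    my $pid = open( my $fh, '-|', "precondition.exe --run_forever $file" )
        or die "failed to launch precondition.exe: $!";

    push @children, { pid => $pid, fh => $fh };
}

# Children keep running after printing the marker, so reading the pipes
# one at a time still terminates: every child eventually prints the line.
foreach my $child ( @children )
{
    my $fh = $child->{'fh'};

    while ( my $line = <$fh> )
    {
        last if $line =~ /Steady state attained/;
    }
}

# All targets are at steady state; stop the preconditioners
# (signal delivery on Windows is approximate)
kill 'KILL', $_->{'pid'} foreach @children;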
ooo, I like that idea. You're correct -- as-is the code just runs diskspd for an hour with the same XML profile that I use for the test. To implement your preconditioner-based solution, StorScore needs to understand each workload -- information which right now is locked in the XML file. I might need to wait for the next change, when I work out how the recipe should describe the workload for each "sub target."
That's really odd... GitHub lets me edit your comment, but doesn't let me reply directly to it. Hmmm... what embarrassing thing can I make Mark say on the internet... j/k :-) This is a big feature to add to StorScore, and I'd like to work out better terminology. But it's a useful feature for evaluating a new capability ("Streaming") that's coming to an SSD near you. I think I need to do a better job of explaining why StorScore should be able to test that feature. The top-level goals are the same as always.
The technology change that's motivating the change to StorScore is called streaming (there's some information about it here: https://www.usenix.org/system/files/conference/hotstorage14/hotstorage14-paper-kang.pdf). Basically, the streaming interface allows the host and drive to communicate about which data should be grouped together on the SSD. This is really helpful in preventing GC when you're, say, writing each of 4 files sequentially and simultaneously. In a normal SSD, the 4 streams mix together and create a high WAF. In a streaming drive, the controller places the data from each stream in its own set of blocks. Each stream can write at its own rate, the host writes will invalidate all the data in a block at roughly the same time, and WAF will be 1.

Now, enter StorScore. The goal is to be able to define a streaming test (like the 4-file example above) in the recipe, and to parse the results and compare them across drives. In the parsed data, I would like to be able to see the performance of each stream, the aggregate performance, and the WAF (of the whole drive) to validate that the streams are implemented properly.

Now, for some specific answers. The multi-stream/target/file test is still meant to evaluate a single volume or disk -- the volume or disk specified on the command line. Volume and disk targets are supported for this type of test, but file is not. When the user targets a file with this type of test, StorScore should give a useful error message.

The XML "template" file doesn't list file names (or a few other parameters, like the test's run time) because StorScore generates those and adds them to the final profile before running the test. StorScore gets the size of the target given on the command line, divides this space evenly among the number of files given in the test, and creates the files. This way StorScore can still purge between tests when the target is a disk, and the user doesn't have to know about the valid data length flag or do preconditioning manually.

If the number of files (currently "target_count" in the recipe, but I'm seriously thinking about renaming it) doesn't match the number of targets in the XML file, StorScore errors. I prefer to generate XML rather than parse it, because I want StorScore defining the workload rather than the manually generated (read: error prone) XML files. Ultimately, I would like StorScore to generate the entire XML file from scratch. I'm using the template idea temporarily until we get a better feel for the knobs we want in this type of test.

Regarding the guard rails: if you mean the initial "are you sure you want to destroy..." message, then yes, that is still functional for all types of tests.

This is a big design change, but from my perspective it fits well within our original goals. I'm surprised it didn't require more changes as well, but after working with it, I think that's just because it has a structure that matches our broader goals -- even the ones we hadn't defined yet. I tested with a conventional test (for StorScore changes) and a conventional set of data (for parser changes), so I'm as confident in backward compatibility as I can be without a regression test.
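The space-division step described above amounts to something like the following; the helper names are hypothetical, for illustration only:

# Sketch: split the target's capacity evenly among N test files
my $target_count = 2;
my $total_bytes  = get_target_size_bytes( $target );   # hypothetical
my $file_bytes   = int( $total_bytes / $target_count );

foreach my $i ( 1 .. $target_count )
{
    create_file( "E:\\file$i.dat", $file_bytes );       # hypothetical
}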
I guess DiskSpd's XML file uses the word "target", which is adding to the confusion over terminology. Would things make more sense if we kind of ignored that and used the word "stream" in some places? So we still test one "target" with StorScore, but that "target" can potentially have many "streams"?
Stream is a fairly overloaded term that I try to avoid. It might be OK here, though, since StorScore isn't using the term for anything else. Whatever term we use, I think a clean separation from diskspd's terminology will help. Technically speaking, stream is the SSD-level name, and you may not always have a one-to-one mapping between stream and file. Even so, I think stream is still my favorite.
So... one more idea I'd like to suggest to you: if there's some tension between a "nice design" and "getting something working now", you could just fork StorScore and create StreamingStorScore for the latter, to break the dependency.

The fancier design could specify multiple streams directly in our recipe format, and delegate to the IO generator how to accomplish it. For DiskSPD or (someday, I hope) FIO, maybe you would have some code to construct a configuration input file (XML, in the case of DiskSPD). This could live in DiskSpdRunner/SqlioRunner/FioRunner because it's pretty analogous to the code that translates our recipe into the right command line flags. If the IO generator doesn't support multiple streams natively (like SQLIO) maybe you could simulate it by launching multiple child processes (similar to the idea of launching multiple precondition.exes).

By forking, you get something out today that vendors can use, etc., but avoid taking on a lot of tech debt that will get in the way later.