-
Notifications
You must be signed in to change notification settings - Fork 738
Cts clone gaters #8881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Cts clone gaters #8881
Conversation
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
Signed-off-by: arthurjolo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
| void findLongEdges( | ||
| stt::Tree& clkSteiner, | ||
| int driverID, | ||
| odb::Point driverPt, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: no header providing "odb::Point" is directly included [misc-include-cleaner]
src/cts/include/cts/TritonCTS.h:15:
+ #include "odb/geom.h"
src/cts/src/TritonCTS.cpp
Outdated
| int dx = block_->getDieAreaPolygon().dx(); | ||
| int dy = block_->getDieAreaPolygon().dy(); | ||
|
|
||
| int64_t threshold = std::max(dx, dy) / 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: no header providing "int64_t" is directly included [misc-include-cleaner]
src/cts/src/TritonCTS.cpp:9:
- #include <cstring>
+ #include <cstdint>
+ #include <cstring>Signed-off-by: arthurjolo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
| odb::Point branchPt = {branch.x, branch.y}; | ||
|
|
||
| odb::Point neighborPt = {neighbor->x, neighbor->y}; | ||
| int64_t dist = odb::Point::manhattanDistance(branchPt, neighborPt); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: no header providing "int64_t" is directly included [misc-include-cleaner]
src/cts/src/TritonCTS.cpp:9:
- #include <cstring>
+ #include <cstdint>
+ #include <cstring>|
Great to see this PR! Gate cloning is exceptionally important for clock gating to have reasonable performance.
I am curious - do you check the timing on the enable signal as well as the clock signal? Both the clock signal and the enable signal could be the critical path, so it could be that optimizing for the clock signal actually slows the enable signal down and makes the overall path worse.
Can you elaborate on this? If the die size is very small, would it do cloning because it is greater than 1/5 the die size? Is there a rollback if cloning the clock gate doesn't actually improve timing?
I think RSZ already calculates a "maximum wire length" for buffering. Perhaps this value can be used somehow? |
|
Have you tried this with the new cgt module. I'm curious how the two interact |
|
CI is showing very good results for ibex on sky130hs and sky130hd, but some degradation on ibex ng45 and gf180. One small metric failure that I will create a PR to fix.
We only focus on the clock signal. On the entire flow, if I am not mistaken timing paths don't consider the enable signal path during timing analysis.
Indeed this initial value is not ideal and it is preliminary, but it showed to have good results in our testcases. There is no rollback, the cloning is done pre clock tree building, it is hard to determine the improvement or not as we are still going to build the tree. The value for the resizer is one of the candidates I want to try. But I also wan't to have a way for CTS to determine this value (not only for clock gate cloning but for repair_clock_nets as well). So my idea was to create this initial PR that has good results for our testcases and work on the improvements next. |
No, I don't think we have any examples on the CI that uses ctg. Is there any examples using that use ctg and complete the flow? |
STA definitely considers (or should consider) the enable signal path. While it typically wouldn't be considered part of the clock tree, it does affect the timing of signals that pass through it. Thought experiment: Imagine an enable source on one side of the chip, and the clock and sink on the other side of the chip. If you move the clock gate close to the sink, now the enable source is very far away and will take longer than the clock. |
This PR implements 2 new features for CTS: moving and cloning clock gaters.
Moving clock gaters
Some clock gaters can be placed to far from the sinks it connects, creating a long wire to connect the clock gater to the center of its tree increasing the internal delay of the clock gater

eg.: sky130hs/ibex
The highlighted net is the connection between the clock gater and first buffer of its tree, and in the tree viewer we can see the large internal delay of this clock gater.
To fix this issue we enable CTS to move the clock gaters to the center of its output net bounding box, reducing the connection between the clock gater and the first buffer of the tree, reducing the internal delay of the clock gater and the overall latency of the gated sub-clock.

eg.: sky130hs/ibex
Cloning clock gaters
Still related to the placement, since clock gaters drive a small portion of the sinks they can some time be very spread out throw the design, causing some clusters of cells far from one and the other. This can lead to some big skew between this clusters.
To fix this this PR identifies these clusters and separates them into different trees, by creating a clone of the original clock gater for each new cluster. And in the final step the new trees are balanced by the
LatencyBalancer.The clusters are identified using a steiner tree, from the tree we find the long edges and separate the net into the clusters.
eg.: unity test gated_clock1
Original gated clock net:

Steiner tree and long edge separation:

New cloned gated clock and original gated clock:
