Skip to content

Conversation

@arthurjolo
Copy link
Contributor

This PR implements 2 new features for CTS: moving and cloning clock gaters.

Moving clock gaters

Some clock gaters can be placed to far from the sinks it connects, creating a long wire to connect the clock gater to the center of its tree increasing the internal delay of the clock gater
eg.: sky130hs/ibex
The highlighted net is the connection between the clock gater and first buffer of its tree, and in the tree viewer we can see the large internal delay of this clock gater.
image

To fix this issue we enable CTS to move the clock gaters to the center of its output net bounding box, reducing the connection between the clock gater and the first buffer of the tree, reducing the internal delay of the clock gater and the overall latency of the gated sub-clock.
eg.: sky130hs/ibex
image

Cloning clock gaters

Still related to the placement, since clock gaters drive a small portion of the sinks they can some time be very spread out throw the design, causing some clusters of cells far from one and the other. This can lead to some big skew between this clusters.
To fix this this PR identifies these clusters and separates them into different trees, by creating a clone of the original clock gater for each new cluster. And in the final step the new trees are balanced by the LatencyBalancer.
The clusters are identified using a steiner tree, from the tree we find the long edges and separate the net into the clusters.

  • In this PR the "long edge" is defined by a threshold value, that is currently determined as: the size of biggest size of the die / 5.
  • This is a provisional value and it should be refined as it does not take into account PDK data, but it is set to this value as we do not have many examples of cases where we need cloning.

eg.: unity test gated_clock1

Original gated clock net:
Captura de tela de 2025-11-09 14-35-50

Steiner tree and long edge separation:
image

New cloned gated clock and original gated clock:
Captura de tela de 2025-11-09 14-55-23

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

void findLongEdges(
stt::Tree& clkSteiner,
int driverID,
odb::Point driverPt,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: no header providing "odb::Point" is directly included [misc-include-cleaner]

src/cts/include/cts/TritonCTS.h:15:

+ #include "odb/geom.h"

int dx = block_->getDieAreaPolygon().dx();
int dy = block_->getDieAreaPolygon().dy();

int64_t threshold = std::max(dx, dy) / 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: no header providing "int64_t" is directly included [misc-include-cleaner]

src/cts/src/TritonCTS.cpp:9:

- #include <cstring>
+ #include <cstdint>
+ #include <cstring>

Signed-off-by: arthurjolo <[email protected]>
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

odb::Point branchPt = {branch.x, branch.y};

odb::Point neighborPt = {neighbor->x, neighbor->y};
int64_t dist = odb::Point::manhattanDistance(branchPt, neighborPt);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: no header providing "int64_t" is directly included [misc-include-cleaner]

src/cts/src/TritonCTS.cpp:9:

- #include <cstring>
+ #include <cstdint>
+ #include <cstring>

@rovinski
Copy link
Collaborator

Great to see this PR! Gate cloning is exceptionally important for clock gating to have reasonable performance.

To fix this issue we enable CTS to move the clock gaters to the center of its output net bounding box, reducing the connection between the clock gater and the first buffer of the tree, reducing the internal delay of the clock gater and the overall latency of the gated sub-clock.

I am curious - do you check the timing on the enable signal as well as the clock signal? Both the clock signal and the enable signal could be the critical path, so it could be that optimizing for the clock signal actually slows the enable signal down and makes the overall path worse.

the size of biggest size of the die / 5.

Can you elaborate on this? If the die size is very small, would it do cloning because it is greater than 1/5 the die size? Is there a rollback if cloning the clock gate doesn't actually improve timing?

it does not take into account PDK data

I think RSZ already calculates a "maximum wire length" for buffering. Perhaps this value can be used somehow?

@QuantamHD
Copy link
Collaborator

Have you tried this with the new cgt module. I'm curious how the two interact

@arthurjolo
Copy link
Contributor Author

CI is showing very good results for ibex on sky130hs and sky130hd, but some degradation on ibex ng45 and gf180. One small metric failure that I will create a PR to fix.

Great to see this PR! Gate cloning is exceptionally important for clock gating to have reasonable performance.

To fix this issue we enable CTS to move the clock gaters to the center of its output net bounding box, reducing the connection between the clock gater and the first buffer of the tree, reducing the internal delay of the clock gater and the overall latency of the gated sub-clock.

I am curious - do you check the timing on the enable signal as well as the clock signal? Both the clock signal and the enable signal could be the critical path, so it could be that optimizing for the clock signal actually slows the enable signal down and makes the overall path worse.

We only focus on the clock signal. On the entire flow, if I am not mistaken timing paths don't consider the enable signal path during timing analysis.

the size of biggest size of the die / 5.

Can you elaborate on this? If the die size is very small, would it do cloning because it is greater than 1/5 the die size? Is there a rollback if cloning the clock gate doesn't actually improve timing?

it does not take into account PDK data

I think RSZ already calculates a "maximum wire length" for buffering. Perhaps this value can be used somehow?

Indeed this initial value is not ideal and it is preliminary, but it showed to have good results in our testcases. There is no rollback, the cloning is done pre clock tree building, it is hard to determine the improvement or not as we are still going to build the tree.

The value for the resizer is one of the candidates I want to try. But I also wan't to have a way for CTS to determine this value (not only for clock gate cloning but for repair_clock_nets as well). So my idea was to create this initial PR that has good results for our testcases and work on the improvements next.

@arthurjolo
Copy link
Contributor Author

arthurjolo commented Nov 20, 2025

Have you tried this with the new cgt module. I'm curious how the two interact

No, I don't think we have any examples on the CI that uses ctg. Is there any examples using that use ctg and complete the flow?

@rovinski
Copy link
Collaborator

We only focus on the clock signal. On the entire flow, if I am not mistaken timing paths don't consider the enable signal path during timing analysis.

STA definitely considers (or should consider) the enable signal path. While it typically wouldn't be considered part of the clock tree, it does affect the timing of signals that pass through it.

Thought experiment: Imagine an enable source on one side of the chip, and the clock and sink on the other side of the chip. If you move the clock gate close to the sink, now the enable source is very far away and will take longer than the clock.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants