# Modeling and Optimization of Low Power Resonant Clock Mesh

Wulong Liu<sup>†</sup>, Guoqing Chen <sup>‡</sup>, Yu Wang<sup>†</sup>, Huazhong Yang<sup>†</sup>

<sup>†</sup>Department of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China;

<sup>‡</sup>Advanced Micro Devices, Inc., Beijing, China;

<sup>†</sup>Email: liuwl10@mails.tsinghua.edu.cn

Abstract-Power consumption is becoming more critical in modern integrated circuit (IC) designs, and the clock network is one of the major contributors to on-chip power. Resonant clocking has been investigated as a potential solution to reduce the power consumption of the clock network by recycling energy with on-chip inductors. Most previous resonant clock work focuses on H-tree structures. In this work, we propose a modeling and optimization method for the mesh structure, which suffers from high power consumption more seriously than the tree structure. Closed-form expressions for the transfer function, skew, and power are derived. Based on these expressions, the impacts of design factors, such as the buffer size, LC tank location, grid size, wire width, and the sparsity of buffers and LC tanks, are explored to make trade-offs among power, skew, and area, which can be used as design guidelines for top-level resonant clock meshes in early design stages. The exploration is also extended to 3D ICs, and different mesh structures are evaluated. A Matlab-based implementation of the proposed simplified circuit model achieves over $10^5$ times speedup compared to SPICE-based simulation.

## I. INTRODUCTION

With technology scaling, power consumption has become one of the major concerns in modern high-performance processor designs. Specifically, the clock distribution network (CDN) consumes a significant portion of chip power due to its high switching activity, large-scale distribution, and large number of clock loads. The power consumption of the on-chip CDN can range from 30% to 70% of the total chip power [1], [2], [3]. Many techniques have been proposed to reduce the power consumption of the on-chip CDN, such as reducing the switching activity [4], optimizing the number of buffers and the length of interconnects [5], and dynamically adjusting the voltage swings and frequencies [6]. However, these efforts on CDN power saving are still insufficient to meet strict power requirements [2].

Resonant clocking provides a promising technique to reduce the CDN power. The distributed LC tanks attached to a CDN can recycle energy by converting it between electrical and magnetic forms. In this approach, when the CDN resonates at the desired frequency, only a small amount of input power is needed to sustain the full-swing clock signal, thereby reducing the overall power consumption. Resonant clocking has drawn much attention in both academia and industry (e.g., a 4.6 GHz H-tree based resonant CDN was reported by Chan *et al.*), and the resonant clock mesh structure was first introduced in a production processor by AMD in the *Piledriver* core [7]. In the recently released IBM $POWER8^{TM}$, the resonant clock mesh results in about 30% power reduction in the top-level CDN [8], [9].

Most previous research on resonant clocking focuses on H-tree based CDNs [1], [10]. The mesh structure, however, is also widely used, especially in the top-level clock network of very large scale microprocessors [7], [8], [9], due to its small skew and high reliability. Compared with the H-tree structure, one of the main drawbacks of the mesh structure is its high power consumption due to the large wire loads. Since the mesh structure has multiple drivers, it is also more difficult to analyze analytically than the tree structure. Hu *et al.* [2], [3] presented an algorithm to synthesize the LC resonant clock grid; however, the optimization iteration is based on time-consuming SPICE simulations. In addition, the parasitic inductance of the interconnect is neglected, which can cause inaccurate resonant frequency calculation [11]. Furthermore, the starting point of this synthesis is an existing non-resonant grid (mesh), so the impact of grid characteristics such as the grid size, buffer location, and wire width on the power and skew of the resonant clock cannot be explored. Several major problems therefore remain to be solved for the resonant clock mesh. For example: 1) How can the skew and power be evaluated much faster than with time-consuming SPICE simulations? 2) How do the locations of LC tanks and buffers affect the power efficiency and skew? 3) How should the grid size, the wire width, and the sparsity of buffers and LC tanks be chosen to trade off skew, power, and area?

Consequently, in this paper, we provide an equivalent circuit model that simplifies the resonant clock mesh structure by converting the multi-driver system into a transmission line system driven by a single driver, and we provide closed-form expressions to evaluate the power efficiency and clock skew. Based on this model, a number of design factors, such as the grid size, the wire width, and the sparsity of buffers and LC tanks, are optimized to minimize the skew, power, and area overhead. With on-chip process-voltage-temperature (PVT) variations and practical floorplan constraints, the mesh structure may not be perfectly symmetric, and the drivers of the mesh may have nonzero skews. We also investigate the impact of these non-ideal factors on the power and skew of the resonant clock mesh and extend the analysis to 3D IC design cases. In summary, the main contribution of this work is to provide a convenient way for designers to determine a proper resonant clock mesh structure in the early design stage.

<sup>1</sup>This work was supported by 973 project 2013CB329000, National Natural Science Foundation of China (No. 61261160501, 61373026), The Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions, and Tsinghua University Initiative Scientific Research Program.

## II. BACKGROUND AND PROBLEM FORMULATION

### A. Resonant Clock Theory

In a resonant clock structure, the energy alternates between the electrical form in the capacitors and the magnetic form in the inductors. The inductors and the capacitors are in parallel. At a specific frequency, the resonant frequency, the reactance of the inductors equals that of the capacitors [2], and the total impedance of the clock structure reaches its maximum, resulting in a minimum input current (i.e., minimum power). The *resonant frequency* is obtained as follows:

$$f_0 = \frac{1}{2\pi\sqrt{L_r C_r}},\tag{1}$$

where  $L_r$ ,  $C_r$  are the equivalent resonant inductance and capacitance of the system, respectively.  $C_r$  is formed by the wire capacitance and buffer capacitance of the clock network.  $L_r$  is formed by the inserted inductors and parasitic wire inductance. In practice, a large decoupling capacitor  $C_{de}$  is needed in series with the inserted inductor to generate the AC ground and eliminate the static leakage power.  $C_{de}$  needs to be sufficiently large such that the series resonance formed by  $L_r$  and  $C_{de}$  will not affect the parallel resonance:

$$\frac{1}{2\pi\sqrt{L_rC_{de}}} \ll \frac{1}{2\pi\sqrt{L_rC_r}}.\tag{2}$$
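As a quick illustration of expressions (1) and (2), the following minimal Python sketch inverts (1) to size the inductance for a target frequency and reports how far apart the series and parallel resonances are. The function names and the numeric values ($C_r = 2$ pF, $C_{de} = 200$ pF) are hypothetical and chosen only for illustration; they are not taken from the experiments in this paper.

```python
import math


def resonant_frequency(L_r, C_r):
    """Expression (1): parallel resonant frequency in Hz (L_r in H, C_r in F)."""
    return 1.0 / (2.0 * math.pi * math.sqrt(L_r * C_r))


def inductance_for_target(f0, C_r):
    """Invert expression (1): inductance that resonates with C_r at f0."""
    return 1.0 / ((2.0 * math.pi * f0) ** 2 * C_r)


def decap_margin(C_de, C_r):
    """Ratio of the parallel to the series resonant frequency.

    Expression (2) asks for this ratio, sqrt(C_de / C_r), to be much larger than 1.
    """
    return math.sqrt(C_de / C_r)


# Hypothetical example values (assumptions, not data from the paper).
C_r = 2.0e-12     # equivalent clock capacitance, 2 pF
C_de = 200.0e-12  # decoupling capacitance, 200 pF
f0 = 3.0e9        # target clock frequency, 3 GHz

L_r = inductance_for_target(f0, C_r)
print(f"L_r = {L_r * 1e9:.3f} nH")
print(f"parallel/series frequency ratio = {decap_margin(C_de, C_r):.1f}")
```

Since $L_r$ cancels in (2), the constraint reduces to $C_{de} \gg C_r$, which is why the margin depends only on the capacitance ratio.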

### B. Transmission Line Theory

For a transmission line depicted in Fig. 1, l is the length, and R, L, and C are the parasitic resistance, inductance, and capacitance per unit length, respectively.  $R_d$ ,  $C_d$  are the equivalent resistance and capacitance of the driver.  $Z_L$  is the impedance of the load. From the ABCD parameters in the transmission line theory [12], the input impedance seen from the near end, and the transfer function from the near end to the far end of a transmission line are [13]

$$Z_{in} = Z_c \frac{Z_L + Z_c \tanh \theta}{Z_c + Z_L \tanh \theta}\tag{3}$$

$$H_{line}(s) = \frac{1}{\cosh \theta + \frac{Z_c}{Z_L} \sinh \theta}\tag{4}$$

where  $\theta = l\sqrt{(R+sL)sC}$  and  $Z_c = \sqrt{(R+sL)/sC}$ . In order to achieve a full swing clock signal on the wire, the magnitude of the transfer function from the input to any point of the interconnect should be equal to or greater than 0.9 [10], i.e.

$$|H(j\omega)| \ge 0.9. \tag{5}$$



Fig. 1. Equivalent circuit model of a transmission line
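Expressions (3) to (5) can be evaluated numerically at $s = j\omega$. The sketch below is one possible Python rendering using only the standard library; the function names are ours, the per-unit-length values are those listed later for the 4 μm wire, and the 50 fF capacitive load is a hypothetical value used purely for demonstration.

```python
import cmath


def line_constants(s, length, R, L, C):
    """Return (theta, Z_c) of a uniform RLC line, as defined below expression (4)."""
    series = R + s * L   # series impedance per unit length
    shunt = s * C        # shunt admittance per unit length
    theta = length * cmath.sqrt(series * shunt)
    Z_c = cmath.sqrt(series / shunt)
    return theta, Z_c


def input_impedance(s, length, R, L, C, Z_load):
    """Expression (3): impedance seen from the near end of the line."""
    theta, Z_c = line_constants(s, length, R, L, C)
    t = cmath.tanh(theta)
    return Z_c * (Z_load + Z_c * t) / (Z_c + Z_load * t)


def line_transfer(s, length, R, L, C, Z_load):
    """Expression (4): transfer function from the near end to the far end."""
    theta, Z_c = line_constants(s, length, R, L, C)
    return 1.0 / (cmath.cosh(theta) + (Z_c / Z_load) * cmath.sinh(theta))


def is_full_swing(h, threshold=0.9):
    """Expression (5): full-swing criterion on the transfer function magnitude."""
    return abs(h) >= threshold


# Per-um parasitics of the 4 um wide wire; lengths are therefore given in um.
R_U, L_U, C_U = 5.346e-3, 0.252e-12, 0.592e-15  # ohm/um, H/um, F/um
s = 2j * cmath.pi * 3.0e9                        # evaluate at 3 GHz
Z_load = 1.0 / (s * 50e-15)                      # hypothetical 50 fF load

h = line_transfer(s, 500.0, R_U, L_U, C_U, Z_load)
print(abs(h), is_full_swing(h))
```

A convenient sanity check is the matched case $Z_L = Z_c$, for which (3) returns $Z_c$ and (4) collapses to $e^{-\theta}$.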

### C. Problem Formulation

For the resonant clock mesh structure shown in Fig. 2, the design problem can be formulated as follows: under the constraints of expressions (1), (2), and (5), minimize the skew, power, and LC tank area by optimizing the numbers and locations of buffers and LC tanks, the grid size, and the wire width. The optimization objective can be stated as follows:

$$\begin{aligned} \text{Minimize: } & f(N_b, N_{LC}, l_{B\_LC}, w_{grid}, l_{grid}) \\ = {} & \alpha f_{skew}(N_b, N_{LC}, l_{B\_LC}, w_{grid}, l_{grid}) \\ & + \beta f_{power}(N_b, N_{LC}, l_{B\_LC}, w_{grid}, l_{grid}) \\ & + \gamma f_{area}(N_{LC}), \end{aligned}\tag{6}$$

where $N_b$, $N_{LC}$, $l_{B\_LC}$, $w_{grid}$, and $l_{grid}$ are the number of buffers, the number of LC tanks, the relative distance between the buffer and the LC tank, the width of the grid wire, and the length of the grid wire, respectively; $f_{skew}$, $f_{power}$, and $f_{area}$ are the functions of skew, power consumption, and area overhead of LC tanks; and $\alpha$, $\beta$, and $\gamma$ are user-defined weighting coefficients used to trade off skew, power, and area overhead.



Fig. 2. An example with different LC tank and buffer locations in the mesh structure. Each black triangle represents a buffer and each red star represents an LC tank.

## III. SIMPLIFIED CIRCUIT MODEL FOR RESONANT MESH NETWORK

The global-mesh local-tree clock structure has been widely used in industry for large-scale microprocessors since it combines the low skew and high reliability of the mesh with the low power of the tree. In such a structure, the global mesh delivers the clock signal to local clock root buffers, and a local tree is built from each root buffer to drive the final clock sinks. The local clock root buffers are usually evenly distributed along the mesh wires. These loads are much smaller than the mesh wires and are ignored in our analysis. In this paper, the per-unit-length wire parasitic values and the equivalent driver resistance and capacitance are extracted from an industrial 28 nm process technology.

Based on the above assumptions, the mesh structure is symmetric. This symmetry is maintained during the buffer and LC tank placement. For a resonant clock mesh, there are several symmetric schemes with different placements of the buffers and LC tanks. An example scheme is shown in Fig. 2, where the buffers are placed such that the buffer distance is twice the grid pitch, and the LC tanks are interleaved with the buffers at the nodes marked with the number '1'. If the buffer location is fixed, there are also two other symmetric schemes where the LC tanks are placed at the locations marked with '2' or '3'. These three schemes are referred to as case 1, case 2, and case 3, respectively.



Fig. 3. Equivalent circuit model of the case 1 resonant mesh structure

For case 1 in Fig. 2, the region that each buffer drives is marked with the blue dotted line. In this region, the buffer drives four grid lines in four directions, and each grid line is series-connected with two other grid lines, which are shared with the neighboring buffer regions. By the symmetry of the structure, the four nodes where the LC tanks are inserted have the same voltage level and can be virtually shorted [14]. The same assumption can be applied to other symmetric nodes. Because the eight peripheral grid lines with characteristics $(l, R, C, L)$ in this region are shared with the neighboring buffer regions, each of them is equivalent to two parallel-connected lines with characteristics $(l, 2R, C/2, 2L)$. In addition, each LC tank is shared by four neighboring buffer regions, so an LC tank with $(L_{re}, R_{re}, C_{de})$ is equivalent to four parallel-connected LC tanks with $(4L_{re}, 4R_{re}, C_{de}/4)$, where $R_{re}$ is the parasitic resistance of the resonant inductor. $R_{re}$ is related to $L_{re}$ by the Q factor, and Q is assumed to be 30 in this paper. By symmetry, no current flows across the buffer region boundaries. In this way, the equivalent circuit model for each buffer region can be obtained, as shown in Fig. 3. The model can be further reduced to a distributed *RLC* transmission line driven by a buffer and loaded with an LC tank. In the same manner, the other two cases with different LC tank locations can be modeled, and the simplified models are summarized in Fig. 4. Note that, for case 1 and case 3, the simplified circuit models remain the same even if the grid is not an ideal square due to design limitations.
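The two splitting rules used in this paragraph can be checked from the per-unit-length series impedance $z$ and shunt admittance $y$ of a line, and from the branch impedance of a tank. The short derivation below uses only symbols already defined; the last line is our reading of how the per-region line parameters $R/4$, $L/4$, and $4C$ arise and should be taken as a consistency check rather than a statement about Fig. 3.

$$\text{two lines } \left(l, 2R, \tfrac{C}{2}, 2L\right) \text{ in parallel:}\quad z = \frac{2R + s\,2L}{2} = R + sL, \qquad y = 2 \cdot s\frac{C}{2} = sC,$$

$$\text{four tanks } \left(4L_{re}, 4R_{re}, \tfrac{C_{de}}{4}\right) \text{ in parallel:}\quad Z = \frac{1}{4}\left(4R_{re} + s\,4L_{re} + \frac{4}{sC_{de}}\right) = R_{re} + sL_{re} + \frac{1}{sC_{de}},$$

$$\text{four lines } (l, R, C, L) \text{ in parallel:}\quad z = \frac{R + sL}{4}, \qquad y = 4sC.$$

Eight half-lines $(l, 2R, C/2, 2L)$ in parallel give the same $z$ and $y$ as the last line, so the inner and peripheral segments of a region can be treated as one uniform line of length $2l$.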

For simplicity, we take case 1 as an example to derive the expressions of the transfer function, skew, and power. According to the transmission line theory presented in Section II, the transfer function from the input voltage source to the node with the LC tank is

$$H(s) = H_1(s)H_2(s)\tag{7}$$

in which  $H_1(s)$  is the transfer function from the voltage source to the buffer output node,

$$H_1(s) = \frac{Z_{in1}}{R_d(1 + Z_{in1}sC_d) + Z_{in1}};\tag{8}$$

 $H_2(s)$  is the transfer function from the buffer output node to the LC tank node,



Fig. 4. Equivalent circuit models with different LC tank placement.

$$H_2(s) = \frac{1}{\cosh \theta' + \frac{Z'_c}{Z_L} \sinh \theta'},\tag{9}$$

where the equivalent parameters of the transmission lines are

$$\theta' = 2l\sqrt{\left(\frac{R}{4} + s\frac{L}{4}\right)s4C} = 2l\sqrt{(R+Ls)Cs},\tag{10}$$

$$Z'_{c} = \sqrt{\frac{\frac{R}{4} + s\frac{L}{4}}{s4C}} = \frac{1}{4}\sqrt{\frac{R+Ls}{Cs}}.\tag{11}$$

The load impedance at the node with LC tank is

$$Z_L = R_{re} + sL_{re} + \frac{1}{sC_{de}}.\tag{12}$$

The input impedance for the equivalent transmission line is

$$Z_{in1} = Z'_c \frac{Z_L + Z'_c \tanh \theta'}{Z'_c + Z_L \tanh \theta'}.\tag{13}$$
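Expressions (7) to (13) chain together into a single evaluation of $H(j\omega)$ for a case 1 buffer region. The following Python sketch shows one way to do so; it is not the Matlab implementation used in this work. The function name is ours, the wire and driver values are those reported for case 1 with a 500 μm grid and 4 μm wire, and $C_{de} = 100$ pF is an assumed value because the decoupling capacitance is not listed numerically.

```python
import cmath


def case1_transfer(s, l, R, L, C, R_d, C_d, L_re, R_re, C_de):
    """Return (H, Z_in1) for the case 1 buffer region.

    l is the grid length; R, L, C are per-unit-length wire parasitics.
    """
    theta = 2.0 * l * cmath.sqrt((R + s * L) * s * C)   # expression (10)
    Z_c = 0.25 * cmath.sqrt((R + s * L) / (s * C))      # expression (11)
    Z_L = R_re + s * L_re + 1.0 / (s * C_de)            # expression (12)
    t = cmath.tanh(theta)
    Z_in1 = Z_c * (Z_L + Z_c * t) / (Z_c + Z_L * t)     # expression (13)
    H1 = Z_in1 / (R_d * (1.0 + Z_in1 * s * C_d) + Z_in1)                 # expression (8)
    H2 = 1.0 / (cmath.cosh(theta) + (Z_c / Z_L) * cmath.sinh(theta))     # expression (9)
    return H1 * H2, Z_in1                               # expression (7)


params = dict(
    l=500.0,                               # grid length in um
    R=5.346e-3, L=0.252e-12, C=0.592e-15,  # per-um parasitics, 4 um wire
    R_d=21.2, C_d=0.297e-12,               # case 1 driver
    L_re=1.65e-9, R_re=1.0456,             # case 1 resonant inductor
    C_de=100e-12,                          # ASSUMED decoupling capacitance
)

s = 2j * cmath.pi * 3.0e9                  # 3 GHz
H, Z_in1 = case1_transfer(s, **params)
print(f"|H| = {abs(H):.3f}, |Z_in1| = {abs(Z_in1):.2f} ohm")
```

Sweeping `s` over a frequency range with this function gives the magnitude response against which constraint (5) is checked; the result depends on the assumed `C_de`.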

For a resonant clock working at the resonant frequency, the base frequency component is maintained while the higher-order frequency components are suppressed. The Fourier series method presented by Chen *et al.* [13] can be used to analyze these circuit models, and two or three harmonics are sufficient to obtain accurate results. The skew can be obtained by following the delay calculation method in [13]. For a clock signal with period $T$ and transition time $\tau$, the total power consumption of a buffer region is the sum of the power of the different frequency components and can be obtained as

$$P = \frac{1}{2} \sum_{m=1,3,\dots} \Re\left(\frac{1}{Z_{in0}(jm\omega_0)}\right) A_m^2,\tag{14}$$

where $A_m = \frac{2TV_{dd}}{\tau m^2 \pi^2}$ is the magnitude of the *m*-th order harmonic of the input signal, which is explained in detail in [13], and $\omega_0 = 2\pi/T$. $Z_{in0}(jm\omega_0)$ is the impedance seen by the equivalent voltage source in the buffer at the *m*-th order angular frequency, and can be obtained by replacing *s* with $jm\omega_0$ in the following expression,

$$Z_{in0}(s) = R_d + \frac{Z_{in1}}{1 + Z_{in1}C_d s}.\tag{15}$$
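The harmonic sum in (14) is straightforward to code once $Z_{in1}(s)$ is available. The sketch below is a minimal Python version under stated assumptions: `z_in1_of_s` is a caller-supplied function (for instance, an implementation of expression (13)), the harmonic amplitude follows the expression for $A_m$ quoted above, and the purely resistive load in the usage example is hypothetical.

```python
import math


def harmonic_amplitude(m, T, tau, V_dd):
    """A_m as quoted in the text; see [13] for its derivation."""
    return 2.0 * T * V_dd / (tau * m ** 2 * math.pi ** 2)


def source_impedance(s, Z_in1, R_d, C_d):
    """Expression (15): impedance seen by the equivalent voltage source."""
    return R_d + Z_in1 / (1.0 + Z_in1 * C_d * s)


def region_power(z_in1_of_s, R_d, C_d, T, tau, V_dd, harmonics=(1, 3, 5)):
    """Expression (14), truncated to the given odd harmonics."""
    omega0 = 2.0 * math.pi / T
    total = 0.0
    for m in harmonics:
        s = 1j * m * omega0
        Z_in0 = source_impedance(s, z_in1_of_s(s), R_d, C_d)
        A_m = harmonic_amplitude(m, T, tau, V_dd)
        total += 0.5 * (1.0 / Z_in0).real * A_m ** 2
    return total


# Hypothetical usage: a frequency-independent 50 ohm load behind a 50 ohm driver.
p = region_power(lambda s: 50.0, R_d=50.0, C_d=0.0,
                 T=1e-9, tau=1e-10, V_dd=1.0, harmonics=(1,))
print(p)
```

Passing the line model as a function keeps the power routine independent of which LC tank placement (case 1, 2, or 3) is being evaluated.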

Meanwhile, the "large" area overhead of the on-chip inductor and the decoupling capacitor should be carefully evaluated. In this work, we assume the area of the on-chip inductor (capacitor) is proportional to its inductance (capacitance) value. For easy comparison, we normalize the inductor area and capacitor area as  $A_L$  and  $A_C$ , respectively.

## IV. OPTIMIZATION OF THE RESONANT CLOCK MESH

To solve the optimization problem in expression (6), three main design factors need to be carefully explored: 1) the method to determine the most efficient inductor/buffer size under the constraints defined in expressions (1), (2), and (5); 2) the most efficient relative placement of buffers and LC tanks; and 3) the impact of buffer sparsity, wire width, and grid size. In this work, we implement the simplified model of the resonant clock mesh in Matlab and verify the accuracy of the proposed model against SPICE, as shown in Fig. 5. The runtime is on the order of milliseconds, which is over $10^5$ times faster than SPICE simulation. The error of the power and delay of our simplified model is within 10% compared to SPICE simulation, which is similar to the work proposed by Rosenfeld *et al.* [10] and is sufficient for early-stage design exploration. Note that in the following exploration, we implement a top-level mesh structure with 20×20 grids.



Fig. 5. Comparison of our proposed model with SPICE for the case 1 configuration in Fig. 2 with a grid size of 500 μm, wire width of 4 μm, and frequency of 3 GHz. (a) The current waveform of the input voltage source; (b) The voltage waveform at the LC tank node.

### A. Determine the inductor/buffer size

For a given mesh structure, the inductor size determines the resonant frequency, which should match the input clock frequency. After the inductor size is determined, the buffer size is tuned to satisfy the constraint in expression (5). Note that when the buffer size changes, the buffer output capacitance slightly shifts the resonant frequency. Our experiments show that this resonant frequency shift is minor and can be ignored. The impact of the buffer size on the transfer function and power consumption is shown in Fig. 6. In the rest of this paper, the power value refers to the power consumed in each buffer region. As the figure shows, the power consumption is proportional to the buffer size; therefore, the buffer should be sized such that the maximum magnitude of the transfer function equals 0.9, which results in the minimum power consumption while maintaining the full clock swing. In this case, the buffer size can be obtained by following the method proposed by Rosenfeld *et al.* [10].



Fig. 6. The impact of buffer size on the power and maximum magnitude of the transfer function. The equivalent resistance and capacitance of the default 1X buffer are 100 Ω and 0.063 pF, respectively.
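The sizing rule can be sketched as a one-dimensional search over the buffer size. The Python sketch below makes two assumptions that should be read as ours: the driver resistance scales as $1/\text{size}$ and the driver capacitance scales linearly with size from the 1X values (this reproduces the $R_d$, $C_d$ pairs reported in the tables), and the peak magnitude of the transfer function increases monotonically with size over the search interval. The `peak_magnitude` callable is a hypothetical hook for whichever circuit model is used; this is a plain bisection, not the method of [10].

```python
def buffer_rc(size, r_1x=100.0, c_1x=0.063e-12):
    """Driver (R_d in ohm, C_d in F) of a size-X buffer under linear scaling."""
    return r_1x / size, c_1x * size


def smallest_full_swing_size(peak_magnitude, lo=0.1, hi=100.0, target=0.9, iters=60):
    """Smallest size with peak_magnitude(size) >= target, found by bisection.

    Assumes peak_magnitude is monotonically increasing on [lo, hi].
    """
    if peak_magnitude(lo) >= target:
        return lo
    if peak_magnitude(hi) < target:
        raise ValueError("target magnitude not reachable within [lo, hi]")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if peak_magnitude(mid) >= target:
            hi = mid   # mid already meets the swing constraint
        else:
            lo = mid   # mid is too weak, search larger sizes
    return hi


# Scaling check against a reported buffer size of 4.7168X.
r_d, c_d = buffer_rc(4.7168)
print(f"R_d = {r_d:.2f} ohm, C_d = {c_d * 1e12:.4f} pF")

# Toy stand-in for the circuit model (hypothetical): magnitude = k / (k + 1).
print(smallest_full_swing_size(lambda k: k / (k + 1.0)))
```

Because power grows with buffer size, the smallest size that meets the 0.9 threshold is also the minimum-power choice.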

### B. Optimize the location of LC tanks

Based on the above guidelines for determining the inductor/buffer size, the configuration cases shown in Fig. 2 and Fig. 4 are used to evaluate the impact of the relative placement of the LC tanks and buffers on the power, skew, and area overhead of the mesh structure. The experimental results are listed in Table I. Case 2 is the most power efficient but suffers from a severe LC tank area overhead. Case 1 has higher power consumption but the lowest skew and area overhead. Placing the LC tanks and buffers in an interleaved mode (case 1) benefits the skew at the cost of power, which agrees with the structure adopted in *Piledriver* [7]. Generally, clock skew is more critical than power in a resonant clock, since the power has already been reduced significantly. Therefore, we set the weighting coefficients $\alpha$, $\beta$, and $\gamma$ to 100, 200, and 100, respectively, to calculate the optimization objective in expression (6) (*Min_Obj*). The normalized inductor area and capacitor area, $A_L$ and $A_C$, are calculated according to [15] and [16], respectively. In addition, the experiment demonstrates that the resonant clock mesh structure can save more than 80% of the power consumption compared to the case without the resonant LC tanks.

TABLE I. Power, skew, and area overhead with different LC tank locations.

| Case                                 | Case 1        | Case 2        | Case 3        |
|--------------------------------------|---------------|---------------|---------------|
| Power                                | 1.5 mW        | 1.0 mW        | 1.2 mW        |
| Skew                                 | 1.0 ps        | 1.3 ps        | 3.0 ps        |
| Area Overhead                        | $1A_L + 1A_C$ | $4A_L + 1A_C$ | $1A_L + 1A_C$ |
| $R_d$                                | 21.20 Ω       | 49.37 Ω       | 32.57 Ω       |
| $C_d$                                | 0.2970 pF     | 0.1276 pF     | 0.1934 pF     |
| $L_{re}$                             | 1.65 nH       | 3.3 nH        | 1.65 nH       |
| $R_{re}$                             | 1.0456 Ω      | 1.0456 Ω      | 1.0456 Ω      |
| Grid Length                          | 500 μm        | 500 μm        | 500 μm        |
| Non-resonant Dynamic Power ($CV^2f$) | 7.995 mW      | 7.487 mW      | 7.684 mW      |
| Power Reduction                      | 81.24%        | 86.64%        | 84.38%        |
| Min_Obj                              | 577           | 1096          | 947           |
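The non-resonant reference power and the power reduction in Table I can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that the switched capacitance of a buffer region is the wire capacitance of the equivalent line ($8\,l\,C_u$, with $C_u = 0.592$ fF/μm for the 4 μm wire) plus the driver capacitance $C_d$, that $f = 3$ GHz, and that $V_{dd} = 1$ V. The supply voltage is our assumption; it is the value for which the tabulated numbers for cases 2 and 3 are reproduced.

```python
def nonresonant_power_mw(c_d_pf, grid_um=500.0, c_u_ff_per_um=0.592,
                         v_dd=1.0, f_ghz=3.0):
    """C * V^2 * f of one buffer region, in mW (pF * V^2 * GHz = mW)."""
    c_wire_pf = 8.0 * grid_um * c_u_ff_per_um * 1e-3  # fF -> pF
    return (c_wire_pf + c_d_pf) * v_dd ** 2 * f_ghz


def power_reduction(p_resonant_mw, p_nonresonant_mw):
    """Fraction of power saved relative to the non-resonant mesh."""
    return 1.0 - p_resonant_mw / p_nonresonant_mw


# (C_d in pF, resonant power in mW) for cases 2 and 3 of Table I.
for name, c_d_pf, p_res in [("case 2", 0.1276, 1.0), ("case 3", 0.1934, 1.2)]:
    p_non = nonresonant_power_mw(c_d_pf)
    print(f"{name}: CV^2f = {p_non:.3f} mW, "
          f"reduction = {100.0 * power_reduction(p_res, p_non):.2f}%")
```

The wire capacitance dominates the total in every case, which is why the non-resonant power differs only slightly between LC tank placements.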

### C. Explore the impact of buffer sparsity

By varying the buffer sparsity in a given mesh structure, the wire length and the number of LC tanks in a buffer region change. In case 1 shown in Fig. 2, the buffer pitch is two grids. We also define another two symmetric cases in which the buffer pitches are one grid and $\sqrt{2}$ grids, named cases 4 and 5, respectively. For case 4, LC tanks are placed at the midpoint of each grid line. For case 5, LC tanks are placed at the nodes that do not have buffers. The experimental results in Table II demonstrate that, with denser buffers, the power and skew become lower; however, the area overhead increases significantly due to the large size and high density of the LC tanks. Note that the power values are multiplied by the density factor for a fair comparison, as shown in Table II.

TABLE II. Power, skew, and area overhead with different buffer sparsity.

| Buffer pitch  | Case 4 (1 grid)         | Case 5 ($\sqrt{2}$ grids) | Case 1 (2 grids)     |
|---------------|-------------------------|---------------------------|----------------------|
| Power         | 0.168 mW × 4 = 0.672 mW | 0.375 mW × 2 = 0.75 mW    | 1.5 mW × 1 = 1.5 mW  |
| Skew          | 0.04 ps                 | 0.14 ps                   | 1.00 ps              |
| Area Overhead | $48.48A_L + 1A_C$       | $3.03A_L + 1A_C$          | $1A_L + 1A_C$        |
| $R_d$         | 231.6 Ω                 | 100.2 Ω                   | 21.2 Ω               |
| $C_d$         | 0.0272 pF               | 0.0628 pF                 | 0.2972 pF            |
| $L_{re}$      | 10.0 nH                 | 2.5 nH                    | 1.65 nH              |
| $R_{re}$      | 3.14 Ω                  | 1.57 Ω                    | 1.05 Ω               |
| Grid Length   | 500 μm                  | 500 μm                    | 500 μm               |
| Min_Obj       | 8673                    | 908                       | 577                  |
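The density factor and the inductor area coefficients in Table II follow from simple geometry. The Python sketch below is our reading of the normalization, not a description of the original scripts: the reference is one case 1 buffer region (2 × 2 grids), the density factor is the number of buffer regions that fit in that reference area, and the inductor area scales with the number of LC tanks in the reference area times $L_{re}$. The tank counts (8, 2, and 1) are inferred from the placement rules stated above and are an assumption.

```python
import math

REFERENCE_PITCH = 2.0     # case 1 buffer pitch, in grids
REFERENCE_L_RE_NH = 1.65  # case 1 resonant inductance


def density_factor(pitch_in_grids):
    """Number of buffer regions per case 1 region area."""
    return (REFERENCE_PITCH / pitch_in_grids) ** 2


def inductor_area(l_re_nh, tanks_per_reference_region):
    """Inductor area in units of A_L, assuming area proportional to inductance."""
    return tanks_per_reference_region * l_re_nh / REFERENCE_L_RE_NH


# (name, pitch in grids, power per region in mW, L_re in nH, assumed tank count)
cases = [
    ("case 4", 1.0, 0.168, 10.0, 8),
    ("case 5", math.sqrt(2.0), 0.375, 2.5, 2),
    ("case 1", 2.0, 1.5, 1.65, 1),
]

for name, pitch, p_mw, l_re, tanks in cases:
    print(f"{name}: power = {density_factor(pitch) * p_mw:.3f} mW, "
          f"inductor area = {inductor_area(l_re, tanks):.2f} A_L")
```

Under these assumptions the sketch reproduces the 0.672 mW and 0.75 mW power entries and the 48.48 and 3.03 area coefficients.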

### D. The impact of grid size and wire width

In this work, we also vary the grid size and the wire width for the case 1 configuration to explore the parasitic effect of the interconnect on the performance of the resonant clock. The impacts of the wire width on the skew, power, and resonant inductance are listed in Table III. As shown in the table, when the wire width increases, the wire capacitance becomes larger, requiring a smaller resonant inductor; however, the power consumption increases. The impact of the grid size on the resonant clock mesh is shown in Table IV. With increasing grid size, the skew increases and the normalized area overhead decreases. The normalized power density first decreases and then increases, resulting in an optimal power at a grid size of 400 μm. When the grid size is further increased to 900 μm, a full-swing clock can no longer be obtained due to the resistive degradation on the wires.

TABLE III. The impact of different wire widths. Grid size is 500 μm.

| Width (μm) | $C_u$ (fF/μm) | $R_u$ (mΩ/μm) | $L_u$ (pH/μm) | Power (mW) | $L_{re}$ (nH) | Skew (ps) | Buffer Size |
|------------|---------------|---------------|---------------|------------|---------------|-----------|-------------|
| 1          | 0.320         | 20.735        | 0.237         | 0.877      | 2.68          | 1.60      | 2.5857X     |
| 1.5        | 0.359         | 14.304        | 0.238         | 0.882      | 2.45          | 1.20      | 2.5574X     |
| 2          | 0.405         | 10.601        | 0.243         | 0.947      | 2.25          | 0.95      | 2.7237X     |
| 2.5        | 0.451         | 8.478         | 0.246         | 1.000      | 2.10          | 0.70      | 2.9745X     |
| 3          | 0.498         | 7.084         | 0.249         | 1.200      | 1.90          | 0.60      | 3.4937X     |
| 3.5        | 0.543         | 6.090         | 0.251         | 1.200      | 1.80          | 0.70      | 3.6353X     |
| 4          | 0.592         | 5.346         | 0.252         | 1.500      | 1.65          | 1.00      | 4.7168X     |

TABLE IV. The impact of different grid sizes. Wire width is 4 μm.

| Grid (μm) | Power (mW) | Normalized Power Density | $L_{re}$ (nH) | Normalized Area Overhead   | Skew (ps) | Buffer Size |
|-----------|------------|--------------------------|---------------|----------------------------|-----------|-------------|
| 100       | 0.18       | 1.000                    | 5.70          | $A_L + A_C$                | 0.04      | 0.4296X     |
| 200       | 0.29       | 0.403                    | 3.40          | $0.149A_L + \frac{A_C}{2}$ | 0.14      | 0.7617X     |
| 300       | 0.54       | 0.333                    | 2.50          | $0.049A_L + \frac{A_C}{3}$ | 0.25      | 1.4203X     |
| 400       | 0.89       | 0.309                    | 2.00          | $0.022A_L + \frac{A_C}{4}$ | 0.40      | 2.4884X     |
| 500       | 1.50       | 0.333                    | 1.65          | $0.012A_L + \frac{A_C}{5}$ | 1.00      | 4.7168X     |
| 600       | 2.30       | 0.355                    | 1.51          | $0.007A_L + \frac{A_C}{6}$ | 1.60      | 7.7556X     |
| 700       | 3.60       | 0.408                    | 1.40          | $0.005A_L + \frac{A_C}{7}$ | 3.00      | 13.7740X    |
| 800       | 6.90       | 0.599                    | 1.28          | $0.004A_L + \frac{A_C}{8}$ | 8.00      | 40.0700X    |
| 900       | X          | X                        | X             | X                          | X         | X           |
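The normalized columns of Table IV can be reproduced from the raw power and $L_{re}$ columns. The Python sketch below reflects our interpretation that both the power density and the inductor area are divided by the buffer region area (proportional to the square of the grid size) and then normalized to the 100 μm row. The variable and function names are ours.

```python
grid_um = [100, 200, 300, 400, 500, 600, 700, 800]
power_mw = [0.18, 0.29, 0.54, 0.89, 1.50, 2.30, 3.60, 6.90]
l_re_nh = [5.70, 3.40, 2.50, 2.00, 1.65, 1.51, 1.40, 1.28]


def normalized_density(values, grids, ref=0):
    """Per-area density of `values`, normalized to the entry at index `ref`.

    The region area is taken as proportional to grid size squared.
    """
    ref_density = values[ref] / grids[ref] ** 2
    return [v / g ** 2 / ref_density for v, g in zip(values, grids)]


power_density = normalized_density(power_mw, grid_um)
inductor_area = normalized_density(l_re_nh, grid_um)  # coefficient of A_L

# Grid size with the lowest normalized power density.
best_index = min(range(len(grid_um)), key=lambda i: power_density[i])

for g, pd, ia in zip(grid_um, power_density, inductor_area):
    print(f"{g:4d} um: power density = {pd:.3f}, inductor area = {ia:.3f} A_L")
print("lowest power density at", grid_um[best_index], "um")
```

The minimum of the power density column falls at 400 μm, matching the optimum discussed in the text.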

### E. The impact of input signal skew and grid line variation

The mesh structure is usually driven by a buffer tree. Due to design limitations and process variations, there are skews at the mesh driver buffers. We take case 1 in Fig. 2 as an example to explore the effect of input clock skews on the resonant clock performance. We vary the clock latency at one of the mesh buffers to create an input skew between this buffer and the other buffers. Since input skew breaks the symmetry of the mesh, the input skew experiment is implemented with SPICE. Fig. 7 shows the impact of the input skew on the power and output skew. In the experiments, the impact of the input skew is limited to the direct neighbors of the skewed buffer; therefore, the total power consumption in the figure is the sum of the power of the skewed buffer and its four neighboring buffers. When the input clock skew is within $\pm 10$ ps, the output clock skew is in the range of 1 ps to 1.5 ps, which demonstrates the averaging property of the mesh structure. When the input clock skew is beyond $\pm 10$ ps, which exceeds the mesh *averaging* ability, the output clock skew increases dramatically. When the input skew is negative, the power is proportional to the input skew. This behavior occurs because the skewed buffer switches earlier than all other buffers, such that it drives the whole clock network during the out-of-phase time period, which consumes more power than when the input clock skew is positive. Within a practical range (<10 ps), the impact of the input skew on the skew and power of the mesh is minor.



Fig. 7. The impact of input clock skews on the power and output skew.

In addition to the input signal skew, design limitations and process variations also lead to a non-ideal, asymmetric mesh structure in real designs. To evaluate the impact of grid length mismatch between the horizontal and vertical directions, we change the ratio between the vertical grid length and the horizontal grid length while keeping the total wire length fixed. As shown in Fig. 8 (a), the power consumption is not affected by the grid length ratio, and the clock skew fluctuates only slightly. Meanwhile, process variation can lead to non-uniform grid width. To evaluate this impact, we vary the vertical wire width. As shown in Fig. 8 (b), a $\pm 5\%$ process variation on the grid line width has little impact on the power and skew. These results demonstrate that our proposed simplified model can also be used to determine an efficient resonant mesh structure in the early design stage, even if the symmetry of the mesh is later broken by design limitations and process variations.





## V. RESONANT CLOCK DESIGN FOR 3D ICS

3D stacking technology is regarded as one of the most promising methods to continue Moore's Law. 3D clock network design is much more challenging than the 2D case due to the increased design space, the more severe power consumption, and reliability issues. While most previous work focuses on 3D clock tree design, we explore efficient resonant clock mesh structures for the 3D clock network.



Fig. 9. Two 3D-stacked resonant mesh structures. (a) Case 1 for both dies. (b) Hybrid of case 1 and case 3.

For pre-bond testability, each die should have a fully connected clock mesh structure, and different dies are connected by through-silicon vias (TSVs). Two different 3D resonant clock mesh structures are explored, as shown in Fig. 9. The first, named 3DCase1, stacks two dies with the same resonant clock mesh structure (case 1 in Fig. 2), while the second, named 3DCase2, stacks two dies with different resonant clock mesh structures (the top die uses case 1 and the bottom die uses case 3 in Fig. 2).



Fig. 10. Output skew and power for two 3D-stacked mesh structures with different input clock skews.

Due to different process variations on different dies, we apply an input skew between the two layers. The power consumption and output clock skew for different input skews are shown in Fig. 10. The power values are the sum of the power of one buffer region on each layer. As shown in the figure, compared with 3DCase1, the "hybrid" 3D architecture, 3DCase2, is much more power efficient, due to the reduced short-circuit power of vertically adjacent buffers and the higher power efficiency of the case 3 structure shown in Table I. In addition, although the case 3 structure suffers from a much more severe skew issue than case 1, as shown in Table I, the 3D integration compensates for this drawback and results in a small skew in 3DCase2, similar to 3DCase1.

The 3D resonant clock meshes shown in Fig. 9 are fully connected by TSVs (i.e., TSVs are inserted at each mesh node, which is area consuming). For the power-efficient 3D resonant clock mesh structure discussed above (3DCase2), we remove the TSVs inserted at the mesh buffer locations, which saves half of the TSVs. Compared with the fully connected case, the TSV-reduced case can save up to 5.6% of the power consumption, as shown in Fig. 11, with only minor degradation of the clock skew.



Fig. 11. For the 3DCase2 with full TSVs and reduced TSVs, the relationship of output skew and power with different input clock skews.

## VI. CONCLUSION

In this work, we propose a simplified modeling and optimization method for resonant clock mesh structure. The buffer size, LC tank location, grid size, wire width, and the sparsity of buffers and LC tanks are fully explored to make trade-offs among power, skew, and area. The analysis is extended to 3D mesh structures to further explore the advantages of resonant clock. The analysis can be used to guide the top level resonant clock mesh designs in microprocessors in early design stages.

## REFERENCES

- [1] M. R. Guthaus, "Distributed LC resonant clock tree synthesis," in *ISCAS*, 2011, pp. 1215–1218.
- [2] X. Hu *et al.*, "Distributed resonant clock grid synthesis (ROCKS)," in *DAC*, 2011, pp. 516–521.
- [3] X. Hu *et al.*, "Distributed LC resonant clock grid synthesis," *TCAS-I*, vol. 59, no. 11, pp. 2749–2760, 2012.
- [4] Q. Wu *et al.*, "Clock-gating and its application to low power design of sequential circuits," *TCAS-I*, vol. 47, no. 3, pp. 415–420, 2000.
- [5] J. G. Xi *et al.*, "Buffer insertion and sizing under process variations for low power clock distribution," in *DAC*, 1995, pp. 491–496.
- [6] J. Pangjun *et al.*, "Low-power clock distribution using multiple voltages and reduced swings," *TVLSI*, vol. 10, no. 3, pp. 309–318, 2002.
- [7] V. S. Sathe *et al.*, "Resonant-clock design for a power-efficient, high-volume x86-64 microprocessor," *JSSC*, vol. 48, no. 1, pp. 140–149, 2013.
- [8] P. Restle *et al.*, "Wide-frequency-range resonant clock with on-the-fly mode changing for the POWER8<sup>TM</sup> microprocessor," in *ISSCC*, 2014, pp. 100–101.
- [9] E. Fluhr *et al.*, "The 12-core POWER8<sup>TM</sup> processor with 7.6 Tb/s IO bandwidth, integrated voltage regulation, and resonant clocking," in *ISSCC*, vol. PP, no. 99, 2014, pp. 1–14.
- [10] J. Rosenfeld *et al.*, "Design methodology for global resonant H-tree clock distribution networks," *TVLSI*, vol. 15, no. 2, pp. 135–148, 2007.
- [11] S. Rahimian *et al.*, "Design of resonant clock distribution networks for 3-D integrated circuits," in *Integrated Circuit and System Design. Power and Timing Modeling, Optimization, and Simulation*. Springer Berlin Heidelberg, 2011, vol. 6951, pp. 267–277.
- [12] L. N. Dworsky, *Modern transmission line theory and applications*, 1979, vol. 260.
- [13] G. Chen *et al.*, "An RLC interconnect model based on Fourier analysis," *TCAD*, vol. 24, no. 2, pp. 170–183, 2005.
- [14] H. B. Bakoglu *et al.*, *Circuits, interconnections, and packaging for VLSI*. Addison-Wesley Reading, MA, 1990, vol. 6.
- [15] A. Hatzopoulos *et al.*, "Analysis of coil parameter extraction methods for on-chip inductor design," in *ECCTD*, 2005, pp. 39–42.
- [16] C. Zhen *et al.*, "A study of MIMIM on-chip capacitor using Cu/SiO2 interconnect technology," in *Microwave and Wireless Components Letters, IEEE*, vol. 12, no. 7, 2002, pp. 246–248.