12.12  Optimization of the Viterbi Decoder

Returning to the Viterbi decoder example (from Section 12.4), we first set the environment for the design using the following worst-case conditions: a die temperature of 25 C (fastest logic) to 120 C (slowest logic); a power supply voltage of V DD = 5.5 V (fastest logic) to V DD = 4.5 V (slowest logic); and worst process (slowest logic) to best process (fastest logic). Assume that this ASIC should run at a clock frequency of at least 33 MHz (clock period of 30 ns). An initial synthesis run gives a critical path delay at nominal conditions (the default setting) of about 25 ns and nearly 35 ns under worst-case conditions using a high-density 0.6 m m standard-cell target library.

Estimates (using simulation and calculation) show that data arrives at the input pins 5 ns (worst-case) after the rising edge of the clock. The reset signal arrives 10 ns (worst-case) after the rising edge of the clock. The outputs of the Viterbi decoder must be stable at least 4 ns before the rising edge of the clock. This allows these signals to be driven to another ASIC in time to be clocked. These timing constraints are particularly devastating. Together they effectively reduce the clock period that is available for use by 9 ns. However, these figures are typical for board-level delays.

The initial synthesis runs reveal the critical path is through the following six modules:

subset_decode -> compute_metric ->
compare_select -> reduce -> metric -> output_decision

The logic synthesizer can do little or no optimization across these module boundaries. The next step, then, is to rearrange the design hierarchy for synthesis. Flattening ( merging or ungrouping) the six modules into a new cell, called critical , allows the synthesizer to reduce the critical path delay by optimizing one large module.

At present the last module in the critical path is output_decision . This combinational logic adds 2–3 ns to the output delay requirement of 4 ns (this means the outputs of the module metric must be stable 6–7 ns before the rising clock edge). Registering the output reduces this overhead and removes the module output_decision from the critical path. The disadvantage is an increase in latency by one clock cycle, but the latency is already 12 clock cycles in this design. If registering the output decreases the critical path delay by more than a factor of 12 / 13, performance will still improve.

To register the output, alter the code (on pages 575–576) as follows:

module viterbi_ASIC


wire [2:0] Out, Out_r; // Change: add Out_r.


asPadOut #(3,"30,31,32") u30 (padOut, Out_r); // Change: Out_r.

Outreg o_1 (Out, Out_r, Clk, Res); // Change: add output register.



module Outreg (Out, Out_r, Clk, Res); // Change: add this module.

input [2:0] Out; input Clk, Rst; output [2:0] Out_r;

dff #(3) reg1(Out, Out_r, Clk, Res);


These changes move the performance closer to the target. Prelayout estimates indicate the die perimeter required for the I/O pads will allow more than enough area to hold the core logic. Since there is unused area in the core, it makes sense to switch to a high-performance standard-cell library with a slightly larger cell height (96 l versus 72 l ). This cell library is less dense, but faster.

Typically, at this point, the design is improved by altering the HDL, the hierarchy, and the synthesis controls in an iterative manner until the desired performance is achieved. However, remember there is still no information from the layout. The best that can be done is to estimate the contribution of the interconnect using wire-load models. As soon as possible the netlist should be passed to the floorplanner (or the place-and-route software in the absence of a floorplanner) to generate better estimates of interconnect delays.

TABLE 12.13  Critical-path timing report for the Viterbi decoder.

Instance name

Delay information 1


















inPin --> outPin incr arrival trs rampDel cap(pF) cell


CP --> QN 1.65 1.65 F .20 .10 dfctnb

A1 --> ZN .63 2.27 R .14 .08 ao01d1

B --> ZN .84 3.12 F .15 .08 ao04d1

B2 --> ZN .91 4.03 F .35 .17 fn03d1

I --> ZN .39 4.43 R .23 .12 in01d1

S --> Z .91 5.33 F .34 .17 mx21d1

B0 --> CO 2.20 7.54 F .24 .14 ad02d1


... 28 other cell instances omitted ...


B0 --> CO 2.25 23.17 F .23 .13 ad02d1

CI --> CO .53 23.70 F .21 .09 ad01d1

A1 --> Z .69 24.39 R .19 .07 xo02d1

setup: D --> CP .17 24.56 R .00 .00 dfctnb

slack: MET .44

Table 12.13 is a timing report for the Viterbi decoder, which shows the critical path starts at a sequential logic cell (a D flip-flop in the present example), ends at a sequential logic cell (another D flip-flop), with 37 other combinational logic cells in-between. The first delay is the clock-to-Q delay of the first flip-flop. The last delay is the setup time of the last flip-flop. The critical path delay is 24.56 ns, which gives a slack of 0.44 ns from the constraint of 25 ns (reduced from 30 ns to give an extra margin). We have met the timing constraint (otherwise we say it is violated ).

In Table 12.13 all instances in the critical path are inside instance v_1.u100 . Instance name u100 is the new cell (cell name critical ) formed by merging six blocks in module viterbi (instance name v_1 ).

The second column in Table 12.13 shows the timing arc of the cell involved on the critical path. For example, CP --> QN represents the path from the clock pin, CP , to the flip-flop output pin, QN , of a D flip-flop (cell name dfctnb ). The pin names and their functions come from the library data book. Each company adopts a different naming convention (in this case CP represents a positive clock edge, for example). The conventions are not always explicitly shown in the data books but are normally easy to discover by looking at examples. As another example, B0 --> CO represents the path from the B input to the carry output of a 2-bit full adder (cell name ad02d1 ).

The third column ( incr ) represents the incremental delay contribution of the logic cell to the critical path.

The fourth column ( arrival ) shows the arrival time of the signal at the output pin of the logic cell. This is the cumulative delay to that point on the critical path.

The fifth column ( trs ) describes whether the transition at the output node is rising ( R ) or falling ( F ). The timing analyzer examines each possible combination of rising and falling delays to find the critical path.

The sixth column ( rampDel ) is a measure of the input slope (ramp delay, or slew rate). In submicron ASIC design this is an important contribution to delay.

The seventh column ( Cap ) is the capacitance at the output node of the logic cell. This determines the logic cell delay and also the signal slew rate at the node.

The last column ( cell ) is the cell name (from the cell-library data book). In this library suffix 'd1' represents normal drive strength with 'd0' , 'd2 ', and 'd5' being the other available strengths.

1. See the text for explanations of the column headings.

Chapter start ] [ Previous page ] [ Next page ]

S2C: FPGA Base prototyping- Download white paper

Internet Business Systems © 2016 Internet Business Systems, Inc.
595 Millich Dr., Suite 216, Campbell, CA 95008
+1 (408)-337-6870 — Contact Us, or visit our other sites:
AECCafe - Architectural Design and Engineering TechJobsCafe - Technical Jobs and Resumes GISCafe - Geographical Information Services  MCADCafe - Mechanical Design and Engineering ShareCG - Share Computer Graphic (CG) Animation, 3D Art and 3D Models
  Privacy Policy