Posted on:
|
Hi everyone. I'm currently designing a systolic array in Verilog HDL. A systolic array consists of an array of processing elements (PEs). The design is synthesised using Xilinx ISE. Below is my previous design for 2 PEs.
module SystolicArray(Clk,Rst,SubSec,QueSec,Gap,Score);

parameter ComputeDataWidth = 15;
parameter PE = 2; //LENGTH

localparam N_A = 3'b000, //nucleotide "A"
           N_C = 3'b001, //nucleotide "C"
           N_G = 3'b010, //nucleotide "G"
           N_T = 3'b011, //nucleotide "T"
           G_P = 3'b100; //gap '-'

input Clk,Rst;
input [2:0] SubSec,Gap;
input wire [PE*3-1:0] QueSec;
output signed [ComputeDataWidth-1:0] Score;

wire [ComputeDataWidth-1:0] d [PE-1:0];
wire [ComputeDataWidth-1:0] t [PE-1:0];
wire [ComputeDataWidth-1:0] l [PE-1:0];
wire [ComputeDataWidth-1:0] dl [PE-1:0];
wire [ComputeDataWidth-1:0] score [PE-1:0];
wire [PE*3-1:0] SSout,GAPout;

assign Score = score[PE-1] ;

genvar i;
generate
for (i=0; i < PE; i = i + 1)
begin : pe_block
  if (i == 0) //first processing element in auto-generated chain
  begin:pe
    ProcessingElement pe0 (
      .Clk (Clk), .Rst (Rst),
      .SS (SubSec[2:0]), .QC (QueSec[2:0]), .Gap (Gap[2:0]),
      .DDiag (15'b0), .DTop (15'b0), .DLeft (15'b0),
      .LDiag (15'b0), .LLeft (15'b0), .Zero (15'b0),
      .SCORE_PE_Previous (15'b0),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SCORE_Left_D (dl[i]),
      .SS_Out (SSout[2:0]), .Gap_Out (GAPout[2:0]),
      .Best_So_Far_PE_SCORE (score[i])
    );
  end
  else //processing elements other than first one
  begin:pe
    ProcessingElement pe1 (
      .Clk (Clk), .Rst (Rst),
      .SS (SSout[(3*i)-1:(3*i)-3]), .QC (QueSec[(3*i)+2:3*i]), .Gap (GAPout[(3*i)-1:(3*i)-3]),
      .DDiag (d[i-1]), .DTop (t[i-1]), .DLeft (dl[i-1]),
      .LDiag (d[i-1]), .LLeft (l[i-1]), .Zero (15'b0),
      .SCORE_PE_Previous (score[i-1]),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SCORE_Left_D (dl[i]),
      .SS_Out (SSout[(3*i)+2:3*i]), .Gap_Out (GAPout[(3*i)+2:3*i]),
      .Best_So_Far_PE_SCORE (score[i])
    );
  end
end
endgenerate

endmodule
The problem with this design is that the query input is wide (3 bits times the number of PEs). Generating 100 PEs is impossible, because the available I/O of the chosen device can only fit up to 75 PEs (shown in the attached image). So I came up with another solution, shown below.
module SA_Test1(Clk,Rst,SubSec,Gap,Score);

parameter ComputeDataWidth = 8;
parameter PE = 8; //LENGTH

localparam N_A = 3'b000, //nucleotide "A"
           N_C = 3'b001, //nucleotide "C"
           N_G = 3'b010, //nucleotide "G"
           N_T = 3'b011, //nucleotide "T"
           G_P = 3'b100; //gap '-'

input Clk,Rst;
input [2:0] SubSec,Gap;
output signed [ComputeDataWidth-1:0] Score;

wire [ComputeDataWidth-1:0] d [PE-1:0];
wire [ComputeDataWidth-1:0] t [PE-1:0];
wire [ComputeDataWidth-1:0] l [PE-1:0];
wire [ComputeDataWidth-1:0] score [PE-1:0];
wire [PE*3-1:0] SSout,GAPout;

assign Score=score[PE-1];

genvar i;
generate
for (i=0; i < PE; i = i + 1)
begin : pe_block
  if (i == 0) //first processing element in auto-generated chain
  begin:pe
    ProcessingElement pe0 (
      .Clk (Clk), .Rst (Rst),
      .SS (SubSec), .QC ({N_T}), .Gap (Gap),
      .DDiag (8'b0), .DTop (8'b0), .DLeft (8'b0),
      .LDiag (8'b0), .LLeft (8'b0), .Zero (8'b0),
      .SCORE_PE_Previous (8'b0),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SS_Out (SSout[2:0]), .Gap_Out (GAPout[2:0]),
      .Max_PE_Score (score[0])
    );
  end
  if (i == 1)
  begin:pe
    ProcessingElement pe1 (
      .Clk (Clk), .Rst (Rst),
      .SS (SSout[(3*i)-1:(3*i)-3]), .QC ({N_G}), .Gap (GAPout[(3*i)-1:(3*i)-3]),
      .DDiag (d[i-1]), .DTop (t[i-1]), .DLeft (l[i-1]),
      .LDiag (d[i-1]), .LLeft (l[i-1]), .Zero (8'b0),
      .SCORE_PE_Previous (score[i-1]),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SS_Out (SSout[(3*i)+2:3*i]), .Gap_Out (GAPout[(3*i)+2:3*i]),
      .Max_PE_Score (score[i])
    );
  end
  if (i == 2)
  begin:pe
    ProcessingElement pe2 (
      .Clk (Clk), .Rst (Rst),
      .SS (SSout[(3*i)-1:(3*i)-3]), .QC ({N_C}), .Gap (GAPout[(3*i)-1:(3*i)-3]),
      .DDiag (d[i-1]), .DTop (t[i-1]), .DLeft (l[i-1]),
      .LDiag (d[i-1]), .LLeft (l[i-1]), .Zero (8'b0),
      .SCORE_PE_Previous (score[i-1]),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SS_Out (SSout[(3*i)+2:3*i]), .Gap_Out (GAPout[(3*i)+2:3*i]),
      .Max_PE_Score (score[i])
    );
  end
  if (i == 3)
  begin:pe
    ProcessingElement pe3 (
      .Clk (Clk), .Rst (Rst),
      .SS (SSout[(3*i)-1:(3*i)-3]), .QC ({N_T}), .Gap (GAPout[(3*i)-1:(3*i)-3]),
      .DDiag (d[i-1]), .DTop (t[i-1]), .DLeft (l[i-1]),
      .LDiag (d[i-1]), .LLeft (l[i-1]), .Zero (8'b0),
      .SCORE_PE_Previous (score[i-1]),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SS_Out (SSout[(3*i)+2:3*i]), .Gap_Out (GAPout[(3*i)+2:3*i]),
      .Max_PE_Score (score[i])
    );
  end
  if (i == 4)
  begin:pe
    ProcessingElement pe4 (
      .Clk (Clk), .Rst (Rst),
      .SS (SSout[(3*i)-1:(3*i)-3]), .QC ({N_C}), .Gap (GAPout[(3*i)-1:(3*i)-3]),
      .DDiag (d[i-1]), .DTop (t[i-1]), .DLeft (l[i-1]),
      .LDiag (d[i-1]), .LLeft (l[i-1]), .Zero (8'b0),
      .SCORE_PE_Previous (score[i-1]),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SS_Out (SSout[(3*i)+2:3*i]), .Gap_Out (GAPout[(3*i)+2:3*i]),
      .Max_PE_Score (score[i])
    );
  end
  if (i == 5)
  begin:pe
    ProcessingElement pe5 (
      .Clk (Clk), .Rst (Rst),
      .SS (SSout[(3*i)-1:(3*i)-3]), .QC ({N_G}), .Gap (GAPout[(3*i)-1:(3*i)-3]),
      .DDiag (d[i-1]), .DTop (t[i-1]), .DLeft (l[i-1]),
      .LDiag (d[i-1]), .LLeft (l[i-1]), .Zero (8'b0),
      .SCORE_PE_Previous (score[i-1]),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SS_Out (SSout[(3*i)+2:3*i]), .Gap_Out (GAPout[(3*i)+2:3*i]),
      .Max_PE_Score (score[i])
    );
  end
  if (i == 6)
  begin:pe
    ProcessingElement pe6 (
      .Clk (Clk), .Rst (Rst),
      .SS (SSout[(3*i)-1:(3*i)-3]), .QC ({N_T}), .Gap (GAPout[(3*i)-1:(3*i)-3]),
      .DDiag (d[i-1]), .DTop (t[i-1]), .DLeft (l[i-1]),
      .LDiag (d[i-1]), .LLeft (l[i-1]), .Zero (8'b0),
      .SCORE_PE_Previous (score[i-1]),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SS_Out (SSout[(3*i)+2:3*i]), .Gap_Out (GAPout[(3*i)+2:3*i]),
      .Max_PE_Score (score[i])
    );
  end
  if (i == 7)
  begin:pe
    ProcessingElement pe7 (
      .Clk (Clk), .Rst (Rst),
      .SS (SSout[(3*i)-1:(3*i)-3]), .QC ({N_A}), .Gap (GAPout[(3*i)-1:(3*i)-3]),
      .DDiag (d[i-1]), .DTop (t[i-1]), .DLeft (l[i-1]),
      .LDiag (d[i-1]), .LLeft (l[i-1]), .Zero (8'b0),
      .SCORE_PE_Previous (score[i-1]),
      .DiagOut (d[i]), .LeftOut (l[i]), .TopOut (t[i]),
      .SS_Out (SSout[(3*i)+2:3*i]), .Gap_Out (GAPout[(3*i)+2:3*i]),
      .Max_PE_Score (score[i])
    );
  end
end
endgenerate

endmodule
However, I found it hard to copy all this for 100 PEs and change the DNA character at the QC input of each one. Can anyone give me some ideas on how to improve this design?
Posted on:
|
Why don't you define an array of length 100 holding the respective nucleotide code for each PE index, exactly as for the other signals?
Posted on:
|
@Vancouver I'm sorry, I don't really understand what you are trying to say.
Posted on:
|
Dayana S. wrote:
> However, I found it hard to copy all this for 100 PEs and change the DNA
> character at QC input.

If I understand you correctly, you want to assign a constant to the QC input of each PE individually, so you have to write each PE instance individually, as shown in your second example. This is necessary because you have defined the nucleotide codes as individual localparams.

It would be much easier to generate all PEs in a loop and connect the ports to signals that are indexed by the loop index (as you have done for the other signals). To do so, you need to define an array holding all nucleotide codes that are to be assigned to the QC inputs, and then address this array with the loop index as well. Of course you have to specify an array with as many entries as there are PEs, but this is much less work than writing each PE instance by hand.

If this is not your problem... sorry, then I do not understand what your problem is.
Posted on:
|
Vancouver wrote:
> For doing so, you need define an array holding
> all nucleotide codes that need to be assigned to the QC input and then
> address this array with the loop index as well.

Do you mean this type of array?

wire [ComputeDataWidth-1:0] d [PE-1:0];
Posted on:
|
I mean something like this (just an outline in pseudo-Verilog; you have to use the correct syntax):

wire [2:0] nucleotide [99:0];

initial begin
  nucleotide[0] = 3'b011; // N_T goes into PE0
  nucleotide[1] = 3'b010; // N_G goes into PE1
  ...
  nucleotide[7] = 3'b000; // N_A goes into PE7
  ...
  nucleotide[99] = 3'bwhatever; // N_x goes into PE99
end

// generate loop for the PEs:
for (i=0; i<100; i=i+1) begin
  ProcessingElement pe (
    .Clk (Clk),
    .Rst (Rst),
    .SS (SSout[(3*i)-1:(3*i)-3]),
    .QC (nucleotide[i]), // HERE!!!!
    .Gap (GAPout[(3*i)-1:(3*i)-3]),
    ...
  );
end

Did you get the point?
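One way to turn the outline above into synthesizable Verilog is sketched below. It is only an illustration under assumptions: the module name QueryConstants, the packed parameter QUERY and the output QC_flat are hypothetical and not from the thread. The eight codes are the ones used in the SA_Test1 example (PE0 = N_T up to PE7 = N_A), packed three bits per PE, and the generate loop picks the slice for PE i with an indexed part-select. The actual ProcessingElement instance is left out so that the block stands on its own.

```verilog
// Sketch: per-PE query constants taken from one packed parameter.
// QueryConstants, QUERY and QC_flat are hypothetical names.
module QueryConstants #(
    parameter PE = 8,
    // 3 bits per PE, PE0 in bits [2:0]; listed here as {PE7, ..., PE0}
    parameter [3*PE-1:0] QUERY = {3'b000, 3'b011, 3'b010, 3'b001,
                                  3'b011, 3'b001, 3'b010, 3'b011}
) (
    output wire [3*PE-1:0] QC_flat
);
    genvar i;
    generate
        for (i = 0; i < PE; i = i + 1) begin : pe_block
            // Constant 3-bit code for PE number i
            wire [2:0] qc = QUERY[3*i +: 3];

            // In the real design the PE instance would go here, with
            //   .QC (qc)
            // in place of the hand-written {N_T}, {N_G}, ... constants.
            assign QC_flat[3*i +: 3] = qc;
        end
    endgenerate
endmodule
```

With this arrangement, changing the query means overriding one parameter at instantiation instead of editing every PE instance, although a new synthesis run is still needed for each new query.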
Posted on:
|
Btw, does the Query input (QueSec) in your first example change completely in every clock cycle? If not, you could load the QueSec vector sequentially (e.g. one 32-bit word at a time) into an internal shift register before starting the computation. This will save resources if your platform is IO-constrained.
Posted on:
|
Vancouver wrote:
> Btw, does the Query input (QueSec) in your first example change
> completely in every clock cycle?

No, the input is not clocked. The first attached waveform is from example 2 and shows the output that I want.

> If not, you could load the QueSec
> vector sequentially (e.g. 32-bit word sequentially) into an internal
> shift register before starting computation. This will save resources if
> your platform is IO-constrained.

I tried the above-mentioned method too, but the resulting waveform (second attachment) is not the same as the first waveform. I do not really know why. The schematic diagram is shown in the third attachment. This is my shift register Verilog code:

`timescale 1ns / 1ps
module ShiftReg (Clk,En,In,Out);

input Clk,En;
input [2:0] In;
output [2:0] Out;

reg [5:0] sr ; //3 x 2QC = 6 bits

always@(posedge Clk)
begin
  if (En)
  begin
    sr <= 1;
  end
  else
  begin
    sr <= {sr[2:0],In}; //shift by 3 and concatenate Input
  end
end

assign Out = sr[5:3]; //top 3 bits to output

endmodule
Posted on:
|
Hm, currently it is not clear to me what you are doing here. Let me recapitulate.

You have a number of processing elements, let's say 8, and each PE has a 3-bit wide QC input. Before starting the array processing, a certain value must be applied to the QC input of each PE, and it remains constant for a certain time. Is that correct? That is what I can see in your first timing diagram.

Furthermore, you do not have enough FPGA pins to apply the QC values directly from FPGA ports if you have a larger number of PEs, let's say 100. So you changed your design such that the QC values are hard-coded in the FPGA design, as you did in example 2. This clearly saves FPGA pins, but the downside is that you have to make a new FPGA design each time the constants need to be changed. Correct?

My suggestion for n PEs is as follows. To keep it as simple as possible, build a shift register that is 3*n bits long and 1 bit wide. This register can be loaded bit-serially in a similar way as in your code in https://embdev.net/topic/441489#5252091. The output of the shift register, however, needs to be parallel. Bits 0..2 go into the QC input of PE0, bits 3..5 go into PE1 and so on. Generally, bits 3k...3k+2 go into PE[k], k = 0...numberPE-1.

Here is an example code (not tested, please check by yourself):
module ShiftReg (clk, reset, load, SIn, POut);

input clk, reset, load, SIn;
output [3*8-1:0] POut;

reg [3*8-1:0] sr ; //3 x 8QC = 24 bits

always@(posedge clk)
begin
  if (reset) // reset shift register to all-zero
  begin
    sr <= 0;
  end
  else if (load == 1) // enable serial load of the shift register
  begin
    sr <= {sr[3*8-2:0], SIn}; //shift by 1 and concatenate Input
  end
end

assign POut = sr; // complete shiftreg is parallel output

endmodule
Posted on:
|
Vancouver wrote:
Yes ... what you mentioned is true.
I've tried to connect the SIPO and the SA.
As you can see in the attachment, the nucleotides are a bit off.
I think I need a signal that selects the valid data (circled) for all the QC inputs.
What do you think?
Posted on:
|
Sorry I am only occasionally online these days.

Of course, while the shift register is being loaded serially, the parallel outputs show garbage. If this is a problem for the PE, you either need an additional register stage (3*n bits wide) at the output of the SIPO, which is clocked only once after the SIPO contents have become stable. The other way would be just to stop the PE chain while loading the SIPO, i.e. an enable input on each PE which keeps the PE from working (which is the preferred method).

What do you mean by a signal that selects the valid data? After loading, the SIPO should contain only valid data at the right output for each PE. This is the purpose of the SIPO. Only during loading is the SIPO inconsistent, as said above.
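The first option, a holding register behind the SIPO, could look roughly like the sketch below. It is an untested outline under assumptions: the module name QueryLoader, the parameter N and the ports Capture and QC_flat are hypothetical, and the Capture pulse is expected to come from whatever logic knows that all 3*N bits have been shifted in. The shift direction matches the ShiftReg example earlier in the thread.

```verilog
// Sketch: serial-in shift register followed by a holding register.
// QueryLoader, N, Capture and QC_flat are hypothetical names.
module QueryLoader #(
    parameter N = 8                  // number of PEs
) (
    input  wire           Clk,
    input  wire           Rst,       // synchronous reset
    input  wire           Load,      // shift in one bit per clock while high
    input  wire           Capture,   // one-clock pulse after loading is complete
    input  wire           SIn,
    output reg  [3*N-1:0] QC_flat    // bits [3k+2:3k] feed the QC input of PE k
);
    reg [3*N-1:0] sr;                // SIPO, changes while loading

    always @(posedge Clk) begin
        if (Rst) begin
            sr      <= {3*N{1'b0}};
            QC_flat <= {3*N{1'b0}};
        end else begin
            if (Load)
                sr <= {sr[3*N-2:0], SIn};   // shift by 1, append input bit
            if (Capture)
                QC_flat <= sr;              // PEs only ever see this stable copy
        end
    end
endmodule
```

Because QC_flat changes only on a Capture pulse, the PE chain never sees the intermediate contents of the shift register, and the next query can already be shifted in while the current one is in use.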
Posted on:
|
Vancouver wrote:
> Sorry I am only occasionally online these days.

It's OK ... it's the festive season. :)

> If this is a problem for the PE, you either need an additional register
> stage (3*n bits wide) at the output of the SIPO, which is clocked only
> once after the SIPO contents have become stable.

Do you mean a register with an enable? Or a signal that passes the stable SIPO content to the SA?

> The other way would be just to stop the PE chain while loading the SIPO,
> i.e. an enable input on each PE which keeps the PE from working (which
> is the preferred method).

I don't have an enable input on each PE, so I prefer the first method.
Posted on:
|
Hi, happy new year. I've tried your suggestion. Attached are the schematic diagram and the waveform. It's working as I wanted. However, I have one small problem. In the waveform, the nucleotides at all the QC inputs last 40 ns. Previously, the Subject Sequence (SS) lasted 40 ns for each nucleotide. Should I reduce the time for SS to fit into 40 ns?
Posted on:
|
Hi, my best wishes to you for the new year as well! I am happy to see that you succeeded with your design. Yes, I meant a register that passes the SIPO contents to the PE chain as soon as the SIPO is fully loaded.

About the timing: a delay of 40 ns is incredibly large. Are these diagrams from a simulation? If so, the timing specs in the source code are irrelevant; they are ignored by the synthesis tools. The real timing in the final FPGA design results from the placement and routing and the number of logic levels. Timing in the FPGA can be controlled by giving timing constraints to the synthesis tools (you must read the timing constraints guide of your tools). In many cases, specifying a clock frequency is sufficient. The tools will try to make a design that can operate at the specified clock. If this is not possible, you will get a timing error in the tool reports.

Are you using Vivado or ISE? What clock frequency are you planning to use in your real FPGA?
Posted on:
|
I see. I didn't notice that 40 ns is too fast. Yes, that is a timing diagram from a functional simulation. I'm using ISE.

I'm not really sure what clock frequency I should use. Actually, I tried any clock frequency that did not cause hold or setup violations during timing analysis. I used the clock frequency reported during synthesis as a benchmark to find the fastest clock frequency. Is that wrong?

I have another problem. In the previously attached functional waveform, my QC lasts 40 ns since the output is updated every clock cycle (squared). If I reduce the period, the QC period will also be reduced. I need the QC input to last longer than 40 ns. Should I add more flip-flops after the SIPO to make it longer?
Posted on:
|
Hi, you should use the following strategy:

1. Do a pure functional simulation without any timing considerations. Just take care that your design provides correct results. You can use any clock frequency for simulation, since all gate delays are assumed to be zero for now. Only count clock cycles here, not nanoseconds.

2. Specify the performance that you would like to get from your design. How many computation results per second do you want to obtain? "As fast as possible" is not a meaningful guide. From the functional simulation you can see how many clock cycles per result are required, and specifying a clock period gives you an absolute time. If you have no idea, assume 100 MHz for the beginning, which is a frequency that you can reach on most modern FPGA architectures without too much effort. A good starting point may also be to compare against a pure software implementation.

3. Decide what hardware platform you will use. What FPGA type is planned? Is it an evaluation board or a custom-designed platform? The performance of your design is heavily affected by the interfaces that you will use. How do you get data into the FPGA and how do you get the results out of it? Possibly all this hardware stuff is not your job and you just want to design an IP core that will be used by somebody else. Good for you. Proceed with 4) :-)

4. Once you have fixed a clock frequency, create an FPGA bitfile and see whether you can reach the clock rate. There are many ways to influence the effort of the design tools. If you cannot reach the desired clock rate, you need to modify your design, e.g. insert additional register stages.

However: never start with this step. Your QC issue is not a timing problem. It is a functional problem that you need to solve in (1).

About your QC problem: if I understand correctly, you need the QC input to stay unchanged for more than one clock cycle. This can be achieved by a register stage with an enable input. The enable is asserted only when the register content is allowed to change. For the enable you will need a state machine that counts clock cycles and sets the enable every 40 clock cycles, or whatever you need.
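A register stage with an enable that is driven by a cycle counter might be sketched as follows. This is an assumption-laden outline, not the design from the thread: the module name HoldEvery, the parameters WIDTH and PERIOD and the 16-bit counter width are all hypothetical, and a plain wrap-around counter stands in for the state machine mentioned above.

```verilog
// Sketch: register that takes a new value only every PERIOD clock cycles.
// HoldEvery, WIDTH, PERIOD and the 16-bit counter are hypothetical choices.
module HoldEvery #(
    parameter WIDTH  = 3,
    parameter PERIOD = 40            // clock cycles between two updates
) (
    input  wire             Clk,
    input  wire             Rst,     // synchronous reset
    input  wire [WIDTH-1:0] D,
    output reg  [WIDTH-1:0] Q
);
    reg  [15:0] count;               // wide enough for PERIOD up to 65536
    wire        enable = (count == PERIOD - 1);

    always @(posedge Clk) begin
        if (Rst) begin
            count <= 16'd0;
            Q     <= {WIDTH{1'b0}};
        end else begin
            // Count 0 .. PERIOD-1, then wrap around
            count <= enable ? 16'd0 : count + 16'd1;
            // Q is allowed to change only in the cycle where enable is high
            if (enable)
                Q <= D;
        end
    end
endmodule
```

Counting in clock cycles rather than nanoseconds keeps the hold time independent of the clock frequency that is chosen later, which fits step 1 of the strategy.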
Posted on:
|
Hi :) The QC input works. Thank you for teaching me, and thanks for the strategies that you shared. Can I ask for your opinion on my design? Attached is a design with 2 adders and 1 comparator. As you can see, I have one loop connected from the output of the comparator to an input of one of the adders. The functional simulation (attached) doesn't show the desired function. I'm aware that a functional loop is bad for a design. How can I redesign it? Thank you.
Posted on:
|
You're welcome. Functional loops are very common; every state machine and every counter contains one. I think what you mean are combinatorial loops. These are problematic, but there is no combinatorial loop in your design since all operators are clocked. In order to tell you what is wrong with your design, I need to know its purpose. Please explain what you would expect from it if it worked correctly. And please post the HDL code instead of the schematics; it is easier to understand (at least for me).
Posted on:
|
Hi :) My design consists of 3 modules (Figure 1): ShiftReg, Dff and SA. As I've mentioned, SA consists of an array of PEs. Each PE has 3 modules (Figure 2): LUT, SW_Affine and MAXScore. SW_Affine is the direct implementation of the Smith-Waterman algorithm with affine gap penalty (Figure 3). The submodules of SW_Affine are Diagonal, Top, Left and a 4-input comparator (Figure 4).

For now I'm having a computation problem in the Top submodule (Figure 5). The computation starts when the stabilized QC and SubSec are compared at 340 ns (Figure 6). Thus I've delayed SubSec until the data is compared with QC. The adder whose input comes from the feedback loop doesn't give any output.

I have a lot of submodules and I'm not sure which ones to give. I'll give the code for the Top module.

`timescale 1ns / 1ps
module Top(Clk,Rst,Diag,GAP_EXTEND_TOP,SCORE_Top,SCORE_Top_Out);

parameter ComputeDataWidth = 8;

input Clk,Rst;
input signed [ComputeDataWidth-1:0] Diag,GAP_EXTEND_TOP;
output signed [ComputeDataWidth-1:0] SCORE_Top,SCORE_Top_Out;

wire signed [ComputeDataWidth-1:0] TopM,TopIx,In;

Sync_Rst_Three_Input_Adder #( .ComputeDataWidth (ComputeDataWidth))
Mi1j (
  .Clk (Clk),
  .Rst (Rst),
  .D (Diag),
  .E (GAP_EXTEND_TOP),
  .AdditionOUT (TopM)
);

Sync_Rst_TWO_Input_Adder #( .ComputeDataWidth (ComputeDataWidth))
Ixi1j (
  .Clk (Clk),
  .Rst (Rst),
  .A (In),
  .B (GAP_EXTEND_TOP),
  .AdditionOUT (TopIx)
);

Sync_Rst_CompTop #( .ComputeDataWidth (ComputeDataWidth))
TopComp (
  .Clk (Clk),
  .Rst (Rst),
  .M_i_1_j (TopM),
  .Ix_i_1_j (TopIx),
  .SCORE_Top (SCORE_Top),
  .SCORE_Top_In (In),
  .SCORE_Top_Out (SCORE_Top_Out)
);

endmodule

Three input adder

`timescale 1ns / 1ps
module Sync_Rst_Three_Input_Adder(Clk,Rst,D,E,AdditionOUT);

//Parameters must come here now until may
parameter ComputeDataWidth = 8;
parameter F = -12;

input Clk,Rst;
input signed [ComputeDataWidth-1:0] D,E;
output reg signed [ComputeDataWidth-1:0] AdditionOUT;

wire [ComputeDataWidth-1:0] PostAdd;

//Perform the addition
generate
begin : Adding
  assign PostAdd = D + E + F;
end
endgenerate

always@(posedge Clk)
begin
  if(Rst)
    AdditionOUT <= 8'b0;
  else
    AdditionOUT <= PostAdd;
end

endmodule

Two input adder

`timescale 1ns / 1ps
module Sync_Rst_TWO_Input_Adder(Clk,Rst,A,B,AdditionOUT);

//Parameters must come here now until may
parameter ComputeDataWidth = 8;

input Clk,Rst;
input signed [ComputeDataWidth-1:0] A,B;
output reg signed [ComputeDataWidth-1:0] AdditionOUT;

wire [ComputeDataWidth-1:0] PostAdd;

//Perform the addition
generate
begin : Adding
  assign PostAdd = A + B;
end
endgenerate

always@(posedge Clk)
begin
  if(Rst)
    AdditionOUT <= 8'b0;
  else
    AdditionOUT <= PostAdd;
end

endmodule

Comparator

`timescale 1ns / 1ps
module Sync_Rst_CompTop(Clk,Rst,M_i_1_j,Ix_i_1_j,SCORE_Top,SCORE_Top_In,SCORE_Top_Out);

//Parameters must come here now until may
parameter ComputeDataWidth = 8;

input Clk,Rst;
input signed [ComputeDataWidth-1:0] M_i_1_j,Ix_i_1_j;
output reg signed [ComputeDataWidth-1:0] SCORE_Top,SCORE_Top_In,SCORE_Top_Out;

//Decode the results
always@(posedge Clk)
begin
  if(Rst)
  begin
    SCORE_Top <= 8'b0;
  end
  else if (M_i_1_j > Ix_i_1_j)
  begin
    SCORE_Top <= M_i_1_j;
  end
  else
  begin
    SCORE_Top <= Ix_i_1_j;
  end
end

always@*
begin
  if(M_i_1_j > Ix_i_1_j)
  begin
    SCORE_Top_In <= M_i_1_j;
  end
  else
  begin
    SCORE_Top_In <= Ix_i_1_j;
  end
end

always@*
begin
  if(M_i_1_j > Ix_i_1_j)
  begin
    SCORE_Top_Out <= M_i_1_j;
  end
  else
  begin
    SCORE_Top_Out <= Ix_i_1_j;
  end
end

endmodule
Posted on:
|
Hi again :-)

As I can see from the timing diagram, the output of the 2-input adder is always 'X', even after the reset becomes active. In the 2-input adder you have used a synchronous reset, i.e. the reset input is evaluated only at a rising clock edge. In your simulation, however, there is no rising clock edge while reset=1. Due to the feedback loop, the undefined state is kept forever. Try to keep reset=1 for at least one full clock cycle in the simulation. That is what I can see from the code. I hope this solves the problem; otherwise I will have to run the simulation here.

The other way would be to use an asynchronous reset (add Rst to the sensitivity list of the always@ block). This way, the reset is evaluated immediately whenever it occurs. For FPGAs, however, a synchronous reset is better for timing, unless you have a safety-critical system that must be resettable even if the clock stops oscillating because of some failure. I think this is not the case for you, so stay with the synchronous reset.

Why do you encapsulate the addition in an extra generate environment? This is not wrong, but it is not necessary. Generates are used only for conditional code generation or in generate loops (as for the PEs, as we discussed at the beginning). Your code can be simplified:

...
always@(posedge Clk)
begin
  if(Rst)
    AdditionOUT <= 8'b0;
  else
    AdditionOUT <= D + E + F;
end
...

Regards
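The reset advice can be tried out in isolation with a small simulation like the one sketched below. FeedbackReg is a hypothetical stand-in for the adder in the feedback path, not a module from the thread, and the clock period and delays are arbitrary. The testbench holds Rst high across two rising clock edges before releasing it, so the register has a defined value before the feedback starts.

```verilog
`timescale 1ns / 1ps
// Sketch: a register fed back into its own adder, with synchronous reset.
// FeedbackReg is a hypothetical stand-in for the feedback path in Top.
module FeedbackReg (
    input  wire              Clk,
    input  wire              Rst,
    input  wire signed [7:0] B,
    output reg  signed [7:0] Q
);
    always @(posedge Clk) begin
        if (Rst) Q <= 8'sd0;     // takes effect only on a rising clock edge
        else     Q <= Q + B;     // X + B stays X if Q was never reset
    end
endmodule

module FeedbackReg_tb;
    reg Clk = 0;
    reg Rst = 1;
    reg signed [7:0] B = 8'sd1;
    wire signed [7:0] Q;

    FeedbackReg dut (.Clk(Clk), .Rst(Rst), .B(B), .Q(Q));

    always #5 Clk = ~Clk;        // rising edges at 5 ns, 15 ns, 25 ns, ...

    initial begin
        // Keep reset asserted across the rising edges at 5 ns and 15 ns.
        #22 Rst = 0;
        #40 $display("Q = %d", Q);
        $finish;
    end
endmodule
```

If Rst were instead released before the first rising edge (for example after 3 ns), Q would never be cleared and the simulator would keep showing X, which matches the behaviour described above.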
Posted on:
|
BTW, I noticed that the HDL code for the 2- and 3-input adders is almost identical except for the addition of the constant parameter F. You could also use the 3-input adder module for 2-input addition by setting the F parameter to 0 at instantiation:
Sync_Rst_Three_Input_Adder #(
  .ComputeDataWidth (ComputeDataWidth),
  .F (0)) // with F = 0 it acts as a 2-input adder
Ixi1j (
  .Clk (Clk),
  .Rst (Rst),
  .D (In),
  .E (GAP_EXTEND_TOP),
  .AdditionOUT (TopIx)
);