Hi,

I have been given a SystemVerilog design. My task is to optimize (reduce) the size of the design in terms of FPGA synthesis, and I also need to increase the speed of the design. The target FPGA is a Xilinx Artix and the IDE is Vivado. I want to know:

1) What coding approaches are there, i.e. what can I do right in the code to achieve higher speed and smaller size? (Just general guidelines.)
2) What options are available in Vivado to achieve the same?

All answers, links, and guides are warmly welcome. Looking forward to hearing from the experienced guys.

Best regards
Cheeku
Muhammad Tahir R. wrote:
> My task is to optimize (reduce) the size of the design in terms of FPGA
> synthesis, and I need to increase the speed

How much? Just a few percent smaller and 10 MHz faster? Or double the speed and half the resources?

> what things I can make right in order to achieve higher speeds and
> smaller sizes

That depends on what is wrong at the moment. The first question is: who wrote the code? Was it an experienced hardware specialist? Or was it a software programmer urged to do some FPGA "programming"? In the first case: forget it, the design is already perfect. In the second case: you have a reasonable chance of a significant improvement in size as well as in speed.

To speed up a design you have to run an STA (static timing analysis) to find the critical paths. If there are only a few, check whether you can split them up by adding flip-flops to those paths.

To reduce resources you must know very well how much the design actually needs. If you can't figure that out, you can't do resource optimization, because you won't be able to tell when a design already fits perfectly.
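To illustrate why splitting a critical path with a flip-flop helps, here is a rough back-of-the-envelope sketch in Python. The delay values are made up for illustration, and the model is a deliberate simplification: it ignores setup time, clock-to-output delay, clock skew, and the fact that routing changes after re-placement. The function names `fmax_mhz` and `best_split` are hypothetical helpers, not part of any tool.

```python
def fmax_mhz(path_delay_ns: float) -> float:
    """Maximum clock frequency (MHz) if the clock period equals the path delay."""
    return 1000.0 / path_delay_ns


def best_split(delays_ns):
    """Find the best position for ONE extra register on a path.

    delays_ns lists the delay of each logic element along the path.
    Returns (index, worst_segment_delay_ns), where the register is
    placed in front of element `index`.
    """
    best = None
    for i in range(1, len(delays_ns)):
        worst = max(sum(delays_ns[:i]), sum(delays_ns[i:]))
        if best is None or worst < best[1]:
            best = (i, worst)
    return best


# Hypothetical critical path reported by STA: four logic elements in series.
critical_path = [2.0, 3.0, 1.0, 2.0]  # ns, invented numbers

before = fmax_mhz(sum(critical_path))
index, worst = best_split(critical_path)
after = fmax_mhz(worst)

print(f"unpipelined: {before:.0f} MHz")
print(f"register before element {index}: {after:.0f} MHz")
```

Note that the extra register costs a flip-flop per bit of the path and adds one cycle of latency, so the surrounding logic has to tolerate that.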
General approaches:
- Identify the bottlenecks (aka critical paths) of your design by using STA.
- Compare the estimated logic levels (stages of LUTs/slices between flip-flops) with the real ones (map log file).
- Know how to replace the different types of resources with each other, e.g. if you run out of flip-flops, know how to use distributed RAM instead.
- Know why the use of one feature blocks the use of another (e.g. https://www.xilinx.com/support/documentation/white_papers/wp275.pdf).
- Use dedicated hardware macros instead of LUT-based logic, e.g. SRL32, FIFO, MUX7 (https://www.xilinx.com/support/documentation/application_notes/xapp466.pdf).
Muhammad Tahir R. wrote:
> I am given a systemverilog design. My task is to optimize (reduce) the
> size of the design in terms of FPGA synthesis, and I need to increase
> the speed of Design.

As others stated previously, it really depends on the skills of your predecessor.

If he was an expert, there's probably not that much you can do. If the basic design is already reasonable, he was likely also smart enough to enable/implement the commonly known optimizations.

If he wasn't, it's pretty likely that the basic design/architecture itself is faulty. It might not even be worth looking into the implementation; you might rather start from scratch from the initial specs (provided they are available).

If the design appears to be reasonable, area can be traded for performance (pipelining, wider buses, ...) and vice versa, but you generally cannot have both. Nothing comes from nothing...

In the unlikely event that your predecessor was an expert but didn't put any emphasis on an efficient implementation, I would start by looking at the timing analysis (to estimate how much margin you have) and just try to increase the clock frequency. This should show you where the bottlenecks are.
Muhammad Tahir R. wrote:
> My task is to optimize (reduce) the
> size of the design in terms of FPGA synthesis, and I need to increase
> the speed of Design.

These are contrary targets - you probably cannot achieve both!
Mampf wrote:
>> (reduce) size of the design in terms of FPGA synthesis, and I need to increase
>> the speed of Design.
>
> These are contrary targets - you probably cannot achieve both!

Consider this scenario:
- FPGA architecture with 4-input LUTs
- an FSM with 2**4 + 1 states
- binary encoding

Eliminate one state and the FSM is about 50% smaller and 50+% faster, because the number of logic levels drops from 2 to 1.
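The arithmetic behind this scenario can be sketched in a few lines of Python. With binary encoding, 17 states need 5 state bits while 16 states need only 4, and a function of 4 signals fits into a single 4-input LUT. The sketch assumes that the next-state logic depends only on the current state bits, and the level estimate is a crude tree-depth model, not what a synthesis tool actually reports. The names `state_bits` and `lut_levels` are hypothetical helpers.

```python
def state_bits(num_states: int) -> int:
    """Number of register bits needed for a binary-encoded FSM."""
    return max(1, (num_states - 1).bit_length())


def lut_levels(num_inputs: int, lut_size: int = 4) -> int:
    """Crude estimate of LUT levels needed to combine num_inputs signals.

    Models the logic as a tree of lut_size-input LUTs; real mapping
    results depend on the actual function and on the tool.
    """
    levels, reach = 1, lut_size
    while reach < num_inputs:
        reach *= lut_size
        levels += 1
    return levels


for states in (2**4 + 1, 2**4):
    bits = state_bits(states)
    print(f"{states} states: {bits} state bits, "
          f"about {lut_levels(bits)} LUT level(s) per next-state bit")
```

In a real design the FSM inputs add to the LUT input count, so the saving is usually less clear-cut than in this idealized case.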