Hardware co-simulation of 1024-point FFT and its Implementation, in Simulink, Xilinx Vivado IDE on Zynq-7000 FPGA

In this paper, a 1024-point FFT algorithm is implemented on a Zynq-7000 FPGA device. The design uses hardware co-simulation in the Simulink and Xilinx Vivado environments, with a Zynq-7000 FPGA evaluation board as the target, connected through a JTAG setup. The power consumed by the FFT IP core configured for 1024 points and by the signal-source DDS block is estimated. The DDS with both sine and cosine outputs enabled consumes 0.277 W, whereas the 1024-point FFT core consumes 0.044 W. When two DDSs are instanced to generate orthogonal sine and cosine sources for OFDM signals of the same frequency, 1 MHz each, a total of 0.277 W is consumed. When a single DDS core is configured for sine-only or cosine-only output and instanced with a 1024-point FFT core, the total power consumed is 0.268 W and 0.267 W, respectively, the sine case being 1 mW higher than the cosine case. The power of the 1024-point FFT core alone is found to be 0.044 W (44 mW). When a single DDS is instanced for OFDM signal generation with both the sine and cosine outputs selected, it consumes a total of 0.233 W, saving 0.044 W (44 mW) through re-use of the sine or cosine data from the LUT ROM of the DDS. This is a significant power saving. The Xilinx System Generator tool (Sysgen) is used in this hardware co-simulation process. The implementation is coded in Verilog HDL and verified on the Xilinx Vivado platform on the Zynq-7000 FPGA device. The Zynq-7000 supports hardware co-simulation, hence the 1024-point FFT has been implemented on this device. The simulation results are captured on the Xilinx signal viewer.


I. INTRODUCTION
The mobility of devices and the presence of obstacles make the communication channel time-variant, and this time-variant nature becomes an additional source of bit errors, raising the bit error rate (BER). The 1024-point FFT simulated and implemented in hardware in this paper has important applications in characterizing such time-variant channels, since it captures their frequency response whether the channel is slow or fast fading. Communication may take place over a point-to-point channel or over a multipath fading channel, as in cellular telephony [1,2]. Understanding the nature of the channel through FFT analysis provides a basis for error-correcting mechanisms and thus enhances transmission efficiency and throughput [1].
The Doppler shift of the transmitted signal frequency can be determined by observing the shift of the prominent spectral line. Multipath fading channels have a direct impact on the BER versus SNR characteristics. Real-time (RT) capture of signals through FPGA I/O pins requires the virtual input/output (VIO) feature, because sampling the signals of a 1024-point transform with a limited number of physical I/O pins is a challenge; otherwise, either sequential bit streaming or burst capture modes have to be adopted. With the integrated logic analyser (ILA) in the Zynq-7000 FPGA, no external logic analyser is needed to monitor digital signals at various nodes of the implemented FFT datapath. The Doppler shift due to the time-variant multipath impulse response is determined simply from the time-averaged spectrum of Figs. 5 and 6 over an appropriate time window. The FFT hardware implemented here performs the FFT calculation fast enough within that time window to give a visual perception of such frequency shifts.
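The idea of reading a frequency shift from the position of the prominent spectral line can be illustrated with a minimal Python sketch. Everything numeric here except the 1024-point length and the 1 MHz tone is an assumption made for illustration: the sample rate `FS` is chosen so that each bin is exactly 100 kHz wide, the 1.2 MHz received tone is a deliberately exaggerated hypothetical shift, and a direct DFT stands in for the FFT core for brevity.

```python
import cmath
import math

N = 1024        # transform length used in the paper
FS = 102.4e6    # assumed sample rate (Hz): gives a bin spacing of exactly 100 kHz


def dft_magnitudes(samples):
    """Magnitudes of the first N/2 bins of a direct DFT (sketch, not the FFT core)."""
    n = len(samples)
    twiddle = [cmath.exp(-2j * math.pi * i / n) for i in range(n)]
    mags = []
    for k in range(n // 2):
        acc = 0j
        for i, x in enumerate(samples):
            acc += x * twiddle[(k * i) % n]
        mags.append(abs(acc))
    return mags


def peak_frequency(samples, fs=FS):
    """Frequency (Hz) of the strongest spectral line."""
    mags = dft_magnitudes(samples)
    k_peak = max(range(len(mags)), key=mags.__getitem__)
    return k_peak * fs / len(samples)


def tone(freq_hz, fs=FS, n=N):
    """Real cosine test tone of n samples."""
    return [math.cos(2 * math.pi * freq_hz * i / fs) for i in range(n)]


f_tx = 1.0e6    # transmitted tone, 1 MHz as in the paper
f_rx = 1.2e6    # hypothetical received tone (exaggerated shift for visibility)
shift_hz = peak_frequency(tone(f_rx)) - peak_frequency(tone(f_tx))
print(f"estimated shift: {shift_hz / 1e3:.0f} kHz")
```

With these assumed values the two tones fall exactly on bins 10 and 12, so the shift is resolved without leakage. A real Doppler shift is usually far smaller than one bin and would need a longer observation window or interpolation between bins.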
Hardware realization of the FFT algorithm [3] poses practical difficulties [4]. The first is the high implementation cost of arithmetic processing elements (PEs) such as multipliers, adders and shifters, or butterfly PEs [5,6,7,8,9]. The second is the memory required to store intermediate results, which has serious implications for real-time processing that demands high-speed and/or low-power (LP) architectures [10]. Many VLSI architectures have been proposed to meet the requirements of real-time (RT) communication and processing. They differ in the required number of multipliers, adders, memory registers and on-chip memory [10]. These requirements increase power consumption, which pushes the designer towards efficient, power-aware hardware architectures for RT scenarios. The use of programmable logic to meet varying demand within the market window is gaining importance with the broad commercial availability of FPGAs for reconfigurability and faster prototyping. Advances in FPGA-based technologies [11] have made them suitable for DSP applications in wireless communication [12,13].
A proven way to design a complex system is the bottom-up approach. In digital system design it is followed extensively because standard leaf cells (library cells) are available at various levels of abstraction: device (transistor) level, circuit level and sub-system level [5]. Another motivation for this approach is to meet the time-to-market constraint while transforming the prototype into a working model of the given design. Reconfigurability in semi-custom devices lets the designer implement the design under consideration with the freedom to vary parameters in each iteration [14]. The iterations refer to back annotation in digital system design, which allows the designer to tweak the system parameters to meet the desired constraints and expected response [14]. Reconfigurability also leads to the concept of design re-use. In the subsequent sections of this paper, a bottom-up approach to realizing the sub-blocks of a complex 1024-point FFT is discussed. Mainly, the library cells available in the Xilinx blocksets are used to build the 1024-point FFT algorithm [4].
Hardware co-simulation of 1024-point FFT and its Implementation, in Simulink, Xilinx Vivado IDE on Zynq-7000 FPGA. K. S. Shashidhara and H. C. Srinivasaiah
Fig. 1 depicts the design flow adopted for the hardware co-simulation process. The design under consideration has been modelled in the Simulink/Sysgen environment. The hardware co-simulation option was chosen to investigate and evaluate the design because the 'hardware in the loop' (HIL) approach is simple, mature and methodical. With HIL and the Simulink environment, back annotation becomes easy, so that the modifications needed to correct design errors found anywhere from simulation through emulation can be incorporated.
The rich set of optimized Xilinx digital signal processing blocksets enables the designer to develop and verify the prototype model so as to meet the time-to-market constraint. The FPGA device used to implement the design is a Zynq-7000 device with high performance in terms of power and area [15]. Hardware co-simulation is initiated by adding the System Generator token to the Simulink model under consideration, as shown in Fig. 1. The hardware configuration, i.e., the selection of the target FPGA device, synthesis strategy, stimulus generation, and compilation target such as IP catalog, HDL netlist or hardware co-simulation, is set in this flow. The required blocks are picked from the Simulink library; they are mainly the Fast Fourier Transform (FFT), DDS compiler, complex multiplier, constants, reinterpret blocks for data compatibility, and registers to capture the sampled signals passing from the input source blocks to the FFT processor. Once the design is integrated in pick, plug and play fashion, it is simulated using the built-in options of the Simulink environment. The "Run simulation" option lets the designer verify the functionality at the first level. Further, the signal viewer from the Xilinx tool suite, embedded into each of the blocksets, lets the designer view the waveforms on the fly during simulation to verify the functionality. The target FPGA device is programmed by generating the hardware co-simulation block with the JTAG option. The generated hwcosim block is then placed into the Sysgen editor window to enable the emulation process. Again, the "Run" command is used to implement the design on the target FPGA device.
Successful implementation of the design is indicated on the FPGA board by the 'DONE' LED, which glows continuously as shown in Fig. 14.

II. REALIZATION OF 1024-POINT FFT IN SYSTEM GENERATOR ENVIRONMENT
The 1024-point FFT IP core with its input source has been modelled in the MATLAB Simulink System Generator environment as shown in Fig. 2. The input source, viewed as a subsystem, consists of a DDS compiler, registers and reinterpret blocks. The source signals and the output signals, i.e., real and imaginary, are probed and exported to the MATLAB environment through "Gateway Out" and "To workspace" blocks. The source blocks are interfaced to the FFT sub-blocks through intermediate registers, and appropriate constant values, either zero or one, are applied to the control signals. A simple interface is configured, although an Advanced eXtensible Interface (AXI) is available in the FFT IP core. Terminators are connected to the outputs of the FFT IP core that do not drive any other block. With the complete module designed, a System Generator token is placed in the Simulink editor window to initialize the hardware co-simulation process. The detailed design integration is shown in Fig. 2; the major groups are the input source, the configured Fast Fourier Transform processor/core, and the data capture unit to the MATLAB environment.

A. Input source.
In order to generate the complex-valued data for the Processor block of Fig. 2, a Direct Digital Synthesizer (DDS) compiler is used.

B. FFT IP Cores.
The FFT processors/cores are configured to a transform length of 1024 with a target throughput of 50 MPS (millions of floating-point operations per second). AXI-compatible ports are handled by using constant blocks.
The input ports are connected to the input source block through appropriate blocks, keeping data and signal integrity in mind. The output ports of the FFT processor are terminated, and only the output data, namely the real part, the imaginary part and the index value (k), are captured to the MATLAB environment for further analysis as shown in Fig. 2.
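As a software reference for what the core computes, the following is a minimal Python sketch of an iterative radix-2 decimation-in-time FFT that returns the same three quantities captured from the core: the index k and the real and imaginary parts of X(k). It is a generic textbook formulation, not the internal architecture of the Xilinx FFT IP core, and the test bin `K0` is a hypothetical choice.

```python
import cmath
import math


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    # Reorder the input into bit-reversed order.
    out = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * math.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for j in range(half):
                a = out[start + j]
                b = out[start + j + half] * w
                out[start + j] = a + b           # butterfly upper output
                out[start + j + half] = a - b    # butterfly lower output
                w *= w_step
        size *= 2
    return out


N = 1024   # transform length used in the paper
K0 = 10    # hypothetical test bin
x = [cmath.exp(2j * math.pi * K0 * n / N) for n in range(N)]  # cosine + j*sine input
X = fft_radix2(x)
rows = [(k, X[k].real, X[k].imag) for k in range(N)]  # (k, real, imag) triples
print(rows[K0])
```

For this complex exponential input all the energy lands in bin `K0`, which makes the sketch convenient for cross-checking values exported to the MATLAB workspace, under the assumption that the core is run without additional scaling.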

C. Data capturing unit.
The "Gateway Out" and "To workspace" sink blocks are used to capture the data for further analysis. The Verilog hardware description language option is chosen, which generates the Verilog code along with the top module. A target directory on a local path is chosen for debugging and for implementing the design in the Vivado environment. The configuration wizard pops up when the System Generator token is opened by clicking on it. The Sysgen token is a unique block in the Simulink library (Xilinx Blockset library) that holds the configuration parameters, as shown in Fig. 9 [17]. Basically, Sysgen captures information about the model under consideration through the wizard of Fig. 9. The parameters are distributed over various tabs of the wizard, which are described in [17]. The Zynq-7000 option is chosen as the compilation target, as highlighted in Fig. 9. Test bench/stimulus creation for a target project is optional and is useful when a different target FPGA board is used. The synthesis strategy and implementation strategy are left at their defaults in this work.
The RTL schematic in Fig. 10 gives a symbolic representation of the component blocks that make up the 1024-point FFT processor implemented on this FPGA. It shows the interconnections between the elements with the usual tool-generated notation. I/O buffers are appended for proper signal propagation, to overcome signal integrity problems. The nodes listed in the RTL netlist appear prominently in the schematic, which enables the designer to debug or trace the signals of interest. The schematic after synthesis, shown in Fig. 11, consists of the component blocks that are part of the 1024-point FFT implemented on this FPGA. Analyses such as timing closure, placement and routing (P&R) analysis and static timing analysis (STA) [4] can be performed on the synthesized netlist. Critical paths can be analysed by tracing them with the help of the initial settings and the constraint file. Further, proper synthesis attributes give optimal results. In the Vivado environment, the synthesis tool gives the user control through directives/attributes, which can be set in the RTL and/or the Xilinx Design Constraints (XDC) file to modify or fine-tune the result. Sometimes the default synthesis mapping is retained because it gives optimized results; this is context-dependent, not generic. Fig. 13 shows the device utilization through proper placement and routing after implementation. Detailed utilization and timing summary reports are also generated and available as log files for further analysis. Provision for manual tweaking is available to obtain the best performance.

IV. TESTBED SETUP
The complete test bed setup for hardware co-simulation is shown in Fig. 14. As depicted in this figure, the target FPGA board (ZedBoard) is connected to the host computer through a USB port. Burst data transfer mode is chosen to speed up JTAG hardware co-simulation. Through the hardware co-simulation methodology, the design is loaded into the target FPGA device. The host system feeds test vectors to the designed module programmed into the FPGA through the hardware co-simulation interface (JTAG or point-to-point Ethernet), and the responses of the system are observed by post-processing them. These responses are observed through the signal viewer tool of Xilinx Vivado. Hardware co-simulation is valuable for verifying that the functionality gives the desired response and for reducing simulation time when verifying the model in a hardware co-verification scenario. Alternatively, the point-to-point Ethernet [2] interface option provides data transfer at Gbps rates. The FPGA is powered by an external source through the regulated adapter shown in Fig. 14.
Initially, once the model is built using Simulink blocksets, simulation begins by pressing the "Run Simulation" button in the Simulink window. At this point, it is not necessary to connect the FPGA hardware to the host computer. Simulation results are displayed in the Xilinx signal viewer (configured by default in System Generator). To implement the design on the target FPGA board, a hardware co-simulation block called the hwcosim wrapper is generated using the generate option of the Sysgen configuration wizard. As shown on the host computer screen, incorporating this generated block in the model (in the Simulink editor window) enables the complete hardware co-simulation process with the target FPGA device, the Zynq-7000, in the loop. Clicking the "Run" option then programs the active target device specified in the project configuration, and successful programming is indicated by the green LED turning on, as shown in Fig. 14.

V. CONCLUSION
In this paper, a 1024-point FFT has been implemented in the System Generator (Sysgen) environment. The hardware co-simulation approach speeds up the FFT calculation, and the availability of the DDS compiler as a signal source made it possible to explore the concept of data re-use for OFDM design. The power saving due to data re-use is found to be 44 mW; the focus of this work is to appreciate data re-use for power saving. When a single DDS is used to obtain quadrature component signals through the sine-cosine option, this power saving would have been lost. This shows the importance of data re-use in DSP chips. On the contrary, with two separate DDSs for the in-phase and quadrature signal components of OFDM, the reported power with the FFT IP core (hard IP) has shown lower power consumption upon reconfiguration as a 1024-point FFT by Sysgen-assisted Verilog coding and hardware co-simulation. This work provides a basis for real-time FFT computation of real-time signals using FPGAs, enabled by advances in VLSI technology. In this work, the total power of the hardware co-simulated FFT hard IP for a transform length of 1024 points is estimated to be 0.044 W, obtained through the data re-use technique.
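The 44 mW figure can be reproduced from the totals quoted in the abstract. The short Python sketch below takes those totals at face value (0.277 W with two DDSs, 0.233 W with a single DDS providing both outputs); the variable names and the percentage calculation are illustrative additions, not results reported in the paper.

```python
# Power totals quoted in the abstract, expressed in milliwatts to keep the
# arithmetic exact.
TWO_DDS_TOTAL_MW = 277      # two DDSs instanced for the sine and cosine sources
SINGLE_DDS_TOTAL_MW = 233   # one DDS with both sine and cosine outputs selected
FFT_CORE_MW = 44            # 1024-point FFT core alone

saving_mw = TWO_DDS_TOTAL_MW - SINGLE_DDS_TOTAL_MW
saving_pct = 100 * saving_mw / TWO_DDS_TOTAL_MW   # derived here, not in the paper

print(f"saving: {saving_mw} mW ({saving_pct:.1f}% of the two-DDS total)")
```

The computed saving coincides numerically with the power reported for the FFT core alone, which is worth keeping in mind when interpreting the two figures.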
A Direct Digital Synthesizer (DDS) compiler is used as the input source; it provides the option of choosing the desired output frequency fout with an allowable spurious-free dynamic range (SFDR), for which the 60 dB option is used in this research. Complex multiplier blocks are used to combine the signals from the DDS compiler, which are complex in nature. AXI (Advanced eXtensible Interface, derived from the AMBA bus architecture) compatible signals are handled appropriately by using constant blocks. The various signals, i.e., sine-cosine signals, sine signals and cosine signals, are generated through configuration as required at different levels of integration. The output from the input source block is captured using "Gateway Out" blocks to the MATLAB environment for further analysis.
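How a DDS can deliver both sine and cosine from one look-up table can be sketched in Python as a generic phase-accumulator model. This is not the internal design of the Xilinx DDS Compiler: the accumulator width, the LUT depth and the 100 MHz clock are all assumptions made for illustration, and the 60 dB SFDR setting is not modelled.

```python
import math

PHASE_BITS = 16   # assumed phase accumulator width
LUT_BITS = 10     # assumed LUT address width (1024 entries)
LUT = [math.sin(2 * math.pi * i / (1 << LUT_BITS)) for i in range(1 << LUT_BITS)]


def phase_increment(f_out, f_clk, phase_bits=PHASE_BITS):
    """Tuning word: f_out is approximately f_clk * increment / 2**phase_bits."""
    return round(f_out * (1 << phase_bits) / f_clk)


def dds(f_out, f_clk, n_samples):
    """Yield (sine, cosine) pairs, both read from the single shared sine LUT."""
    inc = phase_increment(f_out, f_clk)
    quarter = 1 << (LUT_BITS - 2)          # 90 degree offset in LUT addresses
    addr_mask = (1 << LUT_BITS) - 1
    phase_mask = (1 << PHASE_BITS) - 1
    phase = 0
    for _ in range(n_samples):
        addr = phase >> (PHASE_BITS - LUT_BITS)   # truncate phase to a LUT address
        # cos(theta) = sin(theta + 90 degrees): same table, shifted address.
        yield LUT[addr], LUT[(addr + quarter) & addr_mask]
        phase = (phase + inc) & phase_mask


F_CLK = 100.0e6   # assumed system clock (Hz)
F_OUT = 1.0e6     # 1 MHz output, as in the paper
pairs = list(dds(F_OUT, F_CLK, 256))
print(phase_increment(F_OUT, F_CLK), pairs[0])
```

Because the cosine is read from the same table at an address offset by a quarter period, no second ROM is needed, which is the kind of data re-use the paper associates with the power saving.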

Fig. 3: Part of FFT waveform generated from Out1 to Out8 ports.
As depicted in Fig. 3, the simulation results of the 1024-point FFT algorithm are shown along with control signals such as data_transmit_ready and data_transmit_valid. The input complex signals (2nd and 3rd waveforms), sine and cosine, have a quadrature offset between them. Generation of the complex signals starts in line with the t_data_ready signal (1st waveform). The output signals, mainly the real and imaginary parts along with the k index of X(k), can be observed as the 7th and 8th waveforms from the top. The FFT core has been configured to function as an FFT (not as an IFFT) by feeding a constant value of '1' to the config_t_data_fwd_inv control signal.

Fig. 4: Part of FFT waveform generated from Out9 to Out13 ports.
The Out9 to Out13 ports correspond to the control signals shown in Fig. 4. These signals are not used for further interfacing but are still observed as part of the functional verification of the FFT core. The output control signals are used to analyse the latency and throughput of the FFT core. Note that the latency observed in Fig. 4 is 2178 time units (clock periods T) [17].

Fig. 5: Expanded view of imaginary output signal.
Fig. 5 shows the expanded version of the output imaginary signal. The captured waveform covers one complete cycle. The individual values X(k), for k ranging from 0 to 1023, can be located by placing a marker on the waveform and can be verified in the MATLAB environment. The portion preceding the spectrum corresponds to the latency of the FFT processing pipeline stages and appears as NaN (Not a Number) in the MATLAB environment. A latency of 2178 time units has been observed in this implementation.
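The latency can be read from an exported capture by counting the leading NaN samples. The Python sketch below builds a hypothetical capture with the reported 2178-cycle NaN prefix followed by one 1024-sample frame; the 100 MHz clock used to convert cycles to time is an assumption, since the clock period is not stated here.

```python
import math

LATENCY_CYCLES = 2178   # latency reported in the paper, in clock periods
N = 1024                # transform length
F_CLK_HZ = 100.0e6      # assumed clock frequency, for illustration only


def first_valid_index(samples):
    """Index of the first sample that is not NaN, or None if every sample is NaN."""
    for i, value in enumerate(samples):
        if not math.isnan(value):
            return i
    return None


def latency_seconds(cycles, f_clk_hz):
    """Convert a latency in clock cycles to seconds."""
    return cycles / f_clk_hz


# Hypothetical capture: NaN while the pipeline fills, then one output frame.
capture = [math.nan] * LATENCY_CYCLES + [0.0] * N

start = first_valid_index(capture)
frame = capture[start:start + N]     # X(k) for k = 0 .. 1023
print(start, len(frame), latency_seconds(start, F_CLK_HZ))
```

At the assumed 100 MHz clock, 2178 cycles correspond to about 21.78 microseconds; with a different clock the time scales inversely with the frequency.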

Fig. 6: Expanded view of real output signal with a yellow begin marker.

Fig. 6 shows the expanded version of the output real signal. The captured waveform covers one complete cycle. The individual values X(k), for k ranging from 0 to 1023, can be located by placing a marker on the waveform and can be verified in the MATLAB environment. The portion preceding the spectrum is again indicated as NaN.

Fig. 7: The Sysgen spectrum of 1024-point FFT for the two orthogonal 1 MHz signals.
Fig. 7 shows the spectrum plot for the real and imaginary values obtained from the simulation results of the Simulink model. These data are first ported into the MATLAB workspace from the Simulink sink block and are then used for plotting the magnitude spectrum. The signal under consideration is at 1 MHz. Fig. 7 consists of plot for real part, plot for

Fig. 8: Configuration of Target FPGA Board for Hardware Co-simulation Implementation.

Fig. 9: Configuration of Target FPGA Board for Hardware Co-simulation Implementation.

Fig. 10: RTL Schematic generated for the Sysgen model in Xilinx Vivado after simulation.

Fig. 12: Target Device I/O Layout view showing used physical I/O's after Implementation (unlike VIO).
The device package view of the implemented design is shown in Fig. 12. It shows the pin layout configured for the design under consideration. The package is split into several banks designated Bank 13, 32, 33, 34, 35, etc., depending on the voltage level, categorized as high range and GND. The lower banks except Bank 0 are user inaccessible [15]. The portions with a white background indicate the GND pins. The clock regions are each associated with a specific bank. The I/O pins indicated in the figure are likewise specific to certain I/O banks.