The purpose of this section is to give a basic overview of how the FFT was implemented. Hopefully it will make the source code easier to follow.

The source consists of three files, `dit.v`, `butterfly.v` and
`twiddlefactors_N.v` where `N` is the length of the fourier
transform. `dit.v` contains a module `dit` which implements most
of the algorithm. `butterfly.v` contains a module `butterfly`
which takes inputs `A`, `B`, and `T` and outputs `A+TB` and
`A-TB`, to take care of equations 3 and 4. Finally
`twiddlefactors_N.v` contains previously calculated values of
\(T_{n}\).

The implementation uses 5 buffers large enough to hold N complex
numbers. Two of these are input buffers, two are working buffers, and
one is an output buffer. The module takes inputs `x_in` and
`nd_in`, where `x_in` is wide enough to hold a complex number and
`nd_in` indicates whether new data is on `x_in`. The outputs are
`x_out` and `nd_out` which act in a similar fashion along with
`overflow` which is set to 1 if the module cannot keep up with the
input sample stream.

When data arrives in `x_in` (`nd_in` == 1), it is copied into the
active input buffer. When that input buffer is full, and the previous
FFT is finished, the contents on the input buffer are used and stage 1
of the FFT is generated and written to a working buffer. The data
then moves back and forth between the two working buffers as
sequential stages of the FFT are processed. Finally the last stage of
the FFT is written onto the output buffer and then streamed out in
`x_out`. The second input buffer is used to collect new input data
while the first input buffer is being read during the calculation of the
first FFT stage. In this way, the input buffers swap roles back and
forth.

The real number-crunching is the calculation of each stage from the previous one using equations 3 and 4 which were,

\begin{equation*}
\begin{aligned}
P_{s,kS+j} & = P_{s-1,2kS+j} + T_{kS} P_{s-1,2kS+S+j} \\
P_{s,kS+j+N/2} & = P_{s-1,2kS+j} - T_{kS} P_{s-1,2kS+S+j}
\end{aligned}
\end{equation*}

To do this we step \(i\) from 0 to \(N/2\) and let
\(i=kS+j\). On each step we calculate the input addresses
(\(2kS+j\), and \(2kS+S+j\)), output addresses(\(kS+j\),
and \(kS+j+N/2\)), and twiddle factor address (\(2kS\)).
Calculation of the addresses is helped by realising that the integer
\(i\) is \(log_2(N)\) bits long, and that smallest
\(log_2(S)\) bits give \(j\) while the remainder give
\(kS\). Calculating \(2kS+j\) is then done by shifting the
high \(log_2(N/S)\) bits up one and leaving the lower bits alone.
Similarly calculating \(2kS\) to get the address of the twiddle
factors is similar but we set the lowers bits to 0. \(kS+j+N/2\)
is found by setting the high bit of \(i\) to 1. \(2kS+S+j\)
is calculated by adding \(S\) to \(2kS+j\). Once we have these
addresses we can assign the inputs and outputs of the `butterfly`
module appropriately to update the outputs.

My first attempt at verification was done using verilog, however using a hardware-definition language (HDL) like verilog for verification was painful to me, so I looked around and discoved the python package MyHDL.

MyHDL allows one to interface standard python code with a HDL like verilog so that the standard python unittesting framework can be used to test verilog code via the opensource iverilog simulator.