I’m currently working on an ALU in VHDL. To get warmed up, I tried my hand at some basic arithmetic, which I’m going to discuss in this blog post. Here’s the diagram of the ALU:

A and B are the two inputs and F will be the result. Each input and output is a 4-bit number. S represents the function selected. For now I’m going to use the following:

- 1 = Add
- 2 = Subtract
- 3 = Multiply
- 4 = Divide

I can add new functions as I need them. I also have the option of expanding the number of bits to work with. For now, I’ll just keep this simple.
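One way to keep the selector codes readable as the list grows is to give them names. This is only a sketch: the package name `alu_ops` and the `OP_*` constants are hypothetical and not part of the original design.

```vhdl
-- Hypothetical package; the names are illustrative only.
package alu_ops is
  constant OP_ADD : integer := 1;  -- 1 = Add
  constant OP_SUB : integer := 2;  -- 2 = Subtract
  constant OP_MUL : integer := 3;  -- 3 = Multiply
  constant OP_DIV : integer := 4;  -- 4 = Divide
end package alu_ops;
```

With `use work.alu_ops.all;` in place, the selector can be compared against `OP_ADD` and friends instead of bare numbers.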

I’m a software developer, so I look at this problem and I think “switch/case statement”. As it turns out, VHDL has a case statement. Some searching on Bing turned up this website: VHDL-Online, which I found to be very clear and easy to read. I learn quicker from examples, so this will be my go-to site for looking up VHDL syntax. As I looked over the syntax examples, I noticed that the case statement doesn’t work outside a process block. I just wrapped my case statement in a generic process block and came up with this block of code:

    library IEEE;
    use IEEE.std_logic_1164.ALL;
    use IEEE.numeric_std.ALL;  -- signed type and arithmetic operators

    entity smallalu is
      port (
        S:   in  integer range 0 to 15;
        A,B: in  signed(3 DOWNTO 0);
        F:   out signed(3 DOWNTO 0)
      );
    end smallalu;

    architecture Behavioral of smallalu is
    begin
      process (S,A,B)
      begin
        case S is
          when 1 => F <= A+B;
          when 2 => F <= A-B;
          when 3 => F <= RESIZE(A*B,4);
          when 4 => F <= A/B;
          when others => null;  -- F is not assigned for other selectors
        end case;
      end process;
    end Behavioral;

You’ll have to include “use IEEE.numeric_std.ALL;” at the top in order to use the math functions (and the “signed” data type). Most of the code is pretty obvious: when the selector is set to “1”, add the two inputs and assign the result to the output, and so on. The multiply was a bit of a challenge. My simulation was showing “U” for the outputs of a multiply. I did some investigating and discovered (or rather, rediscovered) that multiplying two 4-bit numbers results in an 8-bit number. At one time I knew that, but it’s been a while. So I did some research and discovered the “resize” function, which allowed me to take the 8-bit result and resize it to a 4-bit result. For signed values, resize keeps the sign bit and the low-order bits, so the answer is only correct when the product fits in the 4-bit signed range (-8 to 7). In practice that means I can’t really multiply much more than 2 bits from A by 2 bits from B, otherwise it’ll overflow. I’ll need to figure out a solution to that issue in the future, when I decide to expand the data path width.
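To make the overflow behaviour concrete, here is a small simulation-only sketch. The entity name `resize_demo` is hypothetical, and it assumes the standard numeric_std behaviour where resizing a signed value down keeps the sign bit plus the rightmost bits.

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

-- Hypothetical testbench; not part of the ALU itself.
entity resize_demo is
end resize_demo;

architecture sim of resize_demo is
begin
  process
    variable a, b : signed(3 downto 0);
    variable p    : signed(7 downto 0);  -- 4 bits * 4 bits = 8 bits
  begin
    -- 2 * 3 = 6 fits in the signed 4-bit range (-8 to 7).
    a := to_signed(2, 4);
    b := to_signed(3, 4);
    p := a * b;
    assert resize(p, 4) = to_signed(6, 4)
      report "2 * 3 should resize to 6" severity error;

    -- -2 * 3 = -6 also fits, and the sign survives the resize.
    a := to_signed(-2, 4);
    p := a * b;
    assert resize(p, 4) = to_signed(-6, 4)
      report "-2 * 3 should resize to -6" severity error;

    -- 3 * 4 = 12 does not fit: the 8-bit product is 00001100, and
    -- resize keeps the sign bit (0) and the low three bits (100).
    a := to_signed(3, 4);
    b := to_signed(4, 4);
    p := a * b;
    assert resize(p, 4) = to_signed(4, 4)
      report "3 * 4 should wrap to 4 after resize" severity error;

    wait;  -- stop the process
  end process;
end sim;
```

If the full product matters, one option is to make the output port 8 bits wide so no resize is needed, at the cost of a wider data path.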

There is also an unsigned data type. If you change your inputs to unsigned, you must also change F to unsigned. Everything will work correctly (except you’ll be working with non-negative numbers only, 0 to 15 for 4 bits).
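The difference between the two types is only in how the same bits are interpreted. A minimal simulation sketch (the entity name `sign_demo` is hypothetical) shows one 4-bit pattern read both ways:

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

-- Hypothetical testbench for illustration only.
entity sign_demo is
end sign_demo;

architecture sim of sign_demo is
begin
  process
    constant bits : std_logic_vector(3 downto 0) := "1111";
  begin
    -- Two's complement reading: 1111 is -1.
    assert to_integer(signed(bits)) = -1
      report "signed 1111 should be -1" severity error;
    -- Plain binary reading: 1111 is 15.
    assert to_integer(unsigned(bits)) = 15
      report "unsigned 1111 should be 15" severity error;
    wait;
  end process;
end sim;
```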

Next, I wanted to add a reset or clear function. Technically, it’s just a constant zero output because there are no latches inside this ALU. This code is pure logic. Here is what the code looks like after the change:

    library IEEE;
    use IEEE.std_logic_1164.ALL;
    use IEEE.numeric_std.ALL;  -- signed type and arithmetic operators

    entity smallalu is
      port (
        S:   in  integer range 0 to 15;
        A,B: in  signed(3 DOWNTO 0);
        F:   out signed(3 DOWNTO 0)
      );
    end smallalu;

    architecture Behavioral of smallalu is
    begin
      process (S,A,B)
      begin
        case S is
          when 0 => F <= to_signed(0,4);  -- clear
          when 1 => F <= A+B;
          when 2 => F <= A-B;
          when 3 => F <= RESIZE(A*B,4);
          when 4 => F <= A/B;
          when others => null;  -- F is not assigned for other selectors
        end case;
      end process;
    end Behavioral;

As you can see, assigning a zero to F is not just a matter of using an assignment. A constant zero is assumed to be an integer data type. The “to_signed()” function can be used to convert it into a signed data type. This function requires the number of bits, so I put in a 4. The simulation looks like this:
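For what it’s worth, there are a couple of other common ways to write a zero for a signed vector. The sketch below (entity and signal names are hypothetical) checks in simulation that all three forms give the same value:

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

-- Hypothetical testbench comparing three ways to write zero.
entity zero_demo is
end zero_demo;

architecture sim of zero_demo is
  signal f1, f2, f3 : signed(3 downto 0);
begin
  f1 <= to_signed(0, 4);   -- integer converted to a 4-bit signed
  f2 <= (others => '0');   -- aggregate: every bit set to '0'
  f3 <= "0000";            -- bit-string literal, width written out

  process
  begin
    wait for 1 ns;  -- let the concurrent assignments settle
    assert f1 = f2 and f2 = f3
      report "all three zero forms should match" severity error;
    wait;
  end process;
end sim;
```

The aggregate form has the small advantage that it does not mention the width, so it keeps working if the vector is widened later.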

The first block up to 10ns is just the clear. From 10ns to 20ns is “add”, from 20-30 is “subtract”, 30-40 is multiply and finally 40-50 is divide (as you can see I’m dividing 4 by 2).
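A testbench along the following lines would produce that kind of 10 ns-per-operation waveform. This is a sketch, not the stimulus I actually used: the name `smallalu_tb` is hypothetical, and the operand values are assumptions apart from the 4 divided by 2 case.

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

-- Hypothetical testbench; expects smallalu compiled into library work.
entity smallalu_tb is
end smallalu_tb;

architecture sim of smallalu_tb is
  signal S    : integer range 0 to 15 := 0;
  signal A, B : signed(3 downto 0) := (others => '0');
  signal F    : signed(3 downto 0);
begin
  uut : entity work.smallalu
    port map (S => S, A => A, B => B, F => F);

  stimulus : process
  begin
    -- 0-10 ns: clear (operand values are assumed)
    S <= 0; A <= to_signed(3, 4); B <= to_signed(2, 4);
    wait for 10 ns;
    assert F = 0 report "clear failed" severity error;

    -- 10-20 ns: add, 3 + 2 = 5
    S <= 1;
    wait for 10 ns;
    assert F = 5 report "add failed" severity error;

    -- 20-30 ns: subtract, 3 - 2 = 1
    S <= 2;
    wait for 10 ns;
    assert F = 1 report "subtract failed" severity error;

    -- 30-40 ns: multiply, 3 * 2 = 6 (still fits in 4 signed bits)
    S <= 3;
    wait for 10 ns;
    assert F = 6 report "multiply failed" severity error;

    -- 40-50 ns: divide, 4 / 2 = 2
    S <= 4; A <= to_signed(4, 4);
    wait for 10 ns;
    assert F = 2 report "divide failed" severity error;

    wait;  -- end of stimulus
  end process;
end sim;
```

B is kept non-zero throughout, since numeric_std reports an error in simulation on a divide by zero.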

One last test: I decided to compile this code for the Mimas board, just to see what kind of resources it would occupy on my FPGA. I didn’t map any inputs and outputs, and I didn’t transfer this to the board since I don’t have enough DIP switches to represent S, A and B (though I’m sure I could get creative and use the push buttons for “B” inputs or something). Anyway, here is the result:

    Slice Logic Utilization:
      Number of Slice Registers: 0 out of 11,440 0%
      Number of Slice LUTs: 47 out of 5,720 1%
        Number used as logic: 47 out of 5,720 1%
          Number using O6 output only: 38
          Number using O5 output only: 0
          Number using O5 and O6: 9
          Number used as ROM: 0
        Number used as Memory: 0 out of 1,440 0%
    Slice Logic Distribution:
      Number of occupied Slices: 19 out of 1,430 1%
      Number of MUXCYs used: 8 out of 2,860 1%
      Number of LUT Flip Flop pairs used: 47
        Number with an unused Flip Flop: 47 out of 47 100%
        Number with an unused LUT: 0 out of 47 0%
        Number of fully used LUT-FF pairs: 0 out of 47 0%
      Number of slice register sites lost to control set restrictions: 0 out of 11,440 0%

As you can see, 47 LUTs are used for the logic, spread over 19 slices. This represents about 1% of the chip resources. Not bad. I’m betting that a multiplier scales up exponentially, so an 8-bit ALU is going to take up more than double the resources. Let’s find out…

    Slice Logic Utilization:
      Number of Slice Registers: 0 out of 11,440 0%
      Number of Slice LUTs: 112 out of 5,720 1%
        Number used as logic: 112 out of 5,720 1%
          Number using O6 output only: 101
          Number using O5 output only: 0
          Number using O5 and O6: 11
          Number used as ROM: 0
        Number used as Memory: 0 out of 1,440 0%
    Slice Logic Distribution:
      Number of occupied Slices: 42 out of 1,430 2%
      Number of MUXCYs used: 32 out of 2,860 1%
      Number of LUT Flip Flop pairs used: 112
        Number with an unused Flip Flop: 112 out of 112 100%
        Number with an unused LUT: 0 out of 112 0%
        Number of fully used LUT-FF pairs: 0 out of 112 0%
      Number of slice register sites lost to control set restrictions: 0 out of 11,440 0%

Hmmm… Only a little over double (112 LUTs versus 47, or about 2.38x). Time to set up a multiply-only design and see what resources it takes to multiply two numbers together. Here’s my basic code:

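    library IEEE;
    use IEEE.std_logic_1164.ALL;
    use IEEE.numeric_std.ALL;  -- signed type and arithmetic operators

    entity multiplier is
      port (
        A,B: in  signed(3 DOWNTO 0);
        Y:   out signed(3 DOWNTO 0)
      );
    end multiplier;

    architecture Behavioral of multiplier is
    begin
      Y <= RESIZE(A*B,4);  -- 8-bit product resized to 4 bits
    end Behavioral;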

    Slice Logic Utilization:
      Number of Slice Registers: 0 out of 11,440 0%
      Number of Slice LUTs: 15 out of 5,720 1%
        Number used as logic: 15 out of 5,720 1%
          Number using O6 output only: 10
          Number using O5 output only: 0
          Number using O5 and O6: 5
          Number used as ROM: 0
        Number used as Memory: 0 out of 1,440 0%

That’s 15 LUTs to multiply two 4-bit numbers together. Now for 8-bit numbers:

    Device Utilization Summary:
    Slice Logic Utilization:
      Number of Slice Registers: 0 out of 11,440 0%
      Number of Slice LUTs: 0 out of 5,720 0%
    Slice Logic Distribution:
      Number of occupied Slices: 0 out of 1,430 0%
      Number of MUXCYs used: 0 out of 2,860 0%
      Number of LUT Flip Flop pairs used: 0
    IO Utilization:
      Number of bonded IOBs: 24 out of 200 12%
    Specific Feature Utilization:
      Number of RAMB16BWERs: 0 out of 32 0%
      Number of RAMB8BWERs: 0 out of 64 0%
      Number of BUFIO2/BUFIO2_2CLKs: 0 out of 32 0%
      Number of BUFIO2FB/BUFIO2FB_2CLKs: 0 out of 32 0%
      Number of BUFG/BUFGMUXs: 0 out of 16 0%
      Number of DCM/DCM_CLKGENs: 0 out of 4 0%
      Number of ILOGIC2/ISERDES2s: 0 out of 200 0%
      Number of IODELAY2/IODRP2/IODRP2_MCBs: 0 out of 200 0%
      Number of OLOGIC2/OSERDES2s: 0 out of 200 0%
      Number of BSCANs: 0 out of 4 0%
      Number of BUFHs: 0 out of 128 0%
      Number of BUFPLLs: 0 out of 8 0%
      Number of BUFPLL_MCBs: 0 out of 4 0%
      Number of DSP48A1s: 1 out of 16 6%
      Number of ICAPs: 0 out of 1 0%
      Number of MCBs: 0 out of 2 0%
      Number of PCILOGICSEs: 0 out of 2 0%
      Number of PLL_ADVs: 0 out of 2 0%
      Number of PMVs: 0 out of 1 0%
      Number of STARTUPs: 0 out of 1 0%
      Number of SUSPEND_SYNCs: 0 out of 1 0%

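Well, that’s interesting. Apparently, there are 16 DSP modules (DSP48A1s) and one of those was used for the 8-bit multiplier, with no LUTs at all. I got the same result from a 16-bit multiplier, which makes sense because each DSP48A1 contains an 18 x 18 multiplier. Let’s push it a little. Here’s a 32-bit multiplier: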

    Specific Feature Utilization:
      Number of RAMB16BWERs: 0 out of 32 0%
      Number of RAMB8BWERs: 0 out of 64 0%
      Number of BUFIO2/BUFIO2_2CLKs: 0 out of 32 0%
      Number of BUFIO2FB/BUFIO2FB_2CLKs: 0 out of 32 0%
      Number of BUFG/BUFGMUXs: 0 out of 16 0%
      Number of DCM/DCM_CLKGENs: 0 out of 4 0%
      Number of ILOGIC2/ISERDES2s: 0 out of 200 0%
      Number of IODELAY2/IODRP2/IODRP2_MCBs: 0 out of 200 0%
      Number of OLOGIC2/OSERDES2s: 0 out of 200 0%
      Number of BSCANs: 0 out of 4 0%
      Number of BUFHs: 0 out of 128 0%
      Number of BUFPLLs: 0 out of 8 0%
      Number of BUFPLL_MCBs: 0 out of 4 0%
      Number of DSP48A1s: 4 out of 16 25%
      Number of ICAPs: 0 out of 1 0%
      Number of MCBs: 0 out of 2 0%
      Number of PCILOGICSEs: 0 out of 2 0%
      Number of PLL_ADVs: 0 out of 2 0%
      Number of PMVs: 0 out of 1 0%
      Number of STARTUPs: 0 out of 1 0%
      Number of SUSPEND_SYNCs: 0 out of 1 0%

No LUTs were used, but 4 DSPs were used. For a 64-bit multiplier:

ERROR:Place:543 – This design does not fit into the number of slices available

Darn! I had high hopes. Oh well. Now we know a limit of the Spartan-6 XC6SLX9 FPGA chip.
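Changing the width by hand for each of these experiments gets tedious. A generic makes the sweep a one-number change. This is only a sketch of how that could look: the entity name `multiplier_n` and the generic `WIDTH` are hypothetical, and I make no claim about what resources any particular width will use.

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

-- Hypothetical width-parameterized version of the multiplier.
entity multiplier_n is
  generic (
    WIDTH : positive := 4  -- operand and result width in bits
  );
  port (
    A, B : in  signed(WIDTH-1 downto 0);
    Y    : out signed(WIDTH-1 downto 0)
  );
end multiplier_n;

architecture Behavioral of multiplier_n is
begin
  -- The product is 2*WIDTH bits wide; resize brings it back to
  -- WIDTH bits, with the same overflow caveat as the 4-bit version.
  Y <= resize(A * B, WIDTH);
end Behavioral;
```

Setting the default of `WIDTH` to 8, 16 or 32, or overriding it at instantiation with a generic map, gives the other test cases without touching the architecture.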

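One other arithmetic function available is modulo (mod), which gives the remainder. For signed operands the result of mod takes the sign of B; there is also a rem operator whose result takes the sign of A. Let’s add mod to the ALU: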

    library IEEE;
    use IEEE.std_logic_1164.ALL;
    use IEEE.numeric_std.ALL;  -- signed type and arithmetic operators

    entity smallalu is
      port (
        S:   in  integer range 0 to 15;
        A,B: in  signed(7 DOWNTO 0);
        F:   out signed(7 DOWNTO 0)
      );
    end smallalu;

    architecture Behavioral of smallalu is
    begin
      process (S,A,B)
      begin
        case S is
          when 0 => F <= to_signed(0,8);  -- clear
          when 1 => F <= A+B;
          when 2 => F <= A-B;
          when 3 => F <= RESIZE(A*B,8);
          when 4 => F <= A/B;
          when 5 => F <= A mod B;
          when others => null;  -- F is not assigned for other selectors
        end case;
      end process;
    end Behavioral;

As you can see from the simulation, 7/2 gives a remainder of 1:
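With positive operands like 7 and 2, mod and rem agree. They only part ways once a negative number is involved, which a short simulation sketch can show. The entity name `modrem_demo` is hypothetical, and the checks assume the standard numeric_std definitions of “/”, mod and rem.

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;

-- Hypothetical testbench comparing mod and rem on signed values.
entity modrem_demo is
end modrem_demo;

architecture sim of modrem_demo is
begin
  process
    variable a, b : signed(7 downto 0);
  begin
    -- 7 and 2: quotient 3, and both operators give 1.
    a := to_signed(7, 8);
    b := to_signed(2, 8);
    assert a / b   = to_signed(3, 8) report "7 / 2"   severity error;
    assert a mod b = to_signed(1, 8) report "7 mod 2" severity error;
    assert a rem b = to_signed(1, 8) report "7 rem 2" severity error;

    -- -7 and 2: division truncates toward zero, rem follows the
    -- sign of the left operand, mod follows the sign of the right.
    a := to_signed(-7, 8);
    assert a / b   = to_signed(-3, 8) report "-7 / 2"   severity error;
    assert a rem b = to_signed(-1, 8) report "-7 rem 2" severity error;
    assert a mod b = to_signed(1, 8)  report "-7 mod 2" severity error;

    -- 7 and -2: the signs of the two results swap.
    a := to_signed(7, 8);
    b := to_signed(-2, 8);
    assert a rem b = to_signed(1, 8)  report "7 rem -2" severity error;
    assert a mod b = to_signed(-1, 8) report "7 mod -2" severity error;

    wait;
  end process;
end sim;
```

If the ALU is meant to return the remainder that pairs with the “/” result, rem may be the closer match for signed inputs.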