I’m currently working on an ALU in VHDL.  To get warmed up, I tried my hand at some basic arithmetic, which I’m going to discuss in this blog post.  Here’s the diagram of the ALU:

A and B are the two inputs and F will be the result.  Each input and output is a 4-bit number.  S represents the function selected.  For now I’m going to use the following:

• 1 = Add
• 2 = Subtract
• 3 = Multiply
• 4 = Divide

I can add new functions as I need them.  I also have the option of expanding the number of bits to work with.  For now, I’ll just keep this simple.

I’m a software developer, so I look at this problem and I think “switch/case statement”.  As it turns out, VHDL has a case statement.  Some searching on Bing turned up this website: VHDL-Online, which I found to be very clear and easy to read.  I learn quicker from examples, so this will be my go-to site for looking up VHDL syntax.  As I looked over the syntax examples, I noticed that the case statement doesn’t work outside a process block (it is a sequential statement, so it has to live inside a process).  I wrapped my case statement in a process that is sensitive to all three inputs and came up with this block of code:

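```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;  -- signed type, arithmetic operators and RESIZE

entity smallalu is port (
    S: in integer range 0 to 15;  -- function select
    A,B: in signed(3 DOWNTO 0);   -- operands
    F: out signed(3 DOWNTO 0)     -- result
);
end smallalu;

architecture Behavioral of smallalu is
begin
    process (S,A,B)
    begin
        case S is
            when 1 =>
                F <= A+B;
            when 2 =>
                F <= A-B;
            when 3 =>
                -- A*B is 8 bits wide; RESIZE truncates it to 4 bits
                F <= RESIZE(A*B,4);
            when 4 =>
                F <= A/B;
            when others =>
                null;  -- F is not assigned for any other selector value
        end case;
    end process;
end Behavioral;
```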

You’ll have to include “use IEEE.numeric_std.ALL;” at the top in order to use the math functions (and the “signed” data type).  Most of the code is pretty obvious: when the selector is set to 1, add the two inputs and assign the sum to the output, and so on.  The multiply was a bit of a challenge.  My simulation was showing “U” for the outputs of a multiply.  I did some investigating and discovered (or rather, rediscovered) that multiplying two 4-bit numbers results in an 8-bit number.  At one time I knew that, but it’s been a while.  So I did some research and discovered the “resize” function, which let me take the 8-bit result and resize it to a 4-bit result.  The catch is that the product only fits when the operands are small: a signed 4-bit result holds -8 to 7, so I can’t really multiply more than about 2 bits from A with 2 bits from B, otherwise it’ll overflow.  I’ll need to figure out a solution to that issue in the future, when I decide to expand the data path width.
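To make the overflow concrete, here is a minimal simulation-only sketch of what resize does to a product that is too big.  The entity name resize_demo and the operand values are made up for illustration; the only library behaviour it relies on is that numeric_std’s resize, when shrinking a signed value, keeps the sign bit plus the low-order bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical test entity: no ports, not part of the ALU itself.
entity resize_demo is
end resize_demo;

architecture sim of resize_demo is
begin
  process
    variable a, b : signed(3 downto 0);
    variable p    : signed(7 downto 0);  -- full-width product
  begin
    -- 2 * 3 = 6 fits in a signed 4-bit value (-8 to 7), so nothing is lost.
    a := to_signed(2, 4);
    b := to_signed(3, 4);
    p := a * b;
    assert resize(p, 4) = to_signed(6, 4)
      report "2 * 3 should survive the resize" severity error;

    -- 3 * 3 = 9 does not fit. The product is "00001001"; resize keeps the
    -- sign bit (0) and the low three bits (001), which leaves "0001".
    a := to_signed(3, 4);
    p := a * b;
    assert resize(p, 4) = to_signed(1, 4)
      report "3 * 3 should truncate to 1" severity error;

    -- One way to detect the overflow: widen the truncated value back to
    -- 8 bits and compare it with the full product.
    assert resize(resize(p, 4), 8) /= p
      report "overflow should have been detected" severity error;

    report "resize demo finished";
    wait;
  end process;
end sim;
```

The last check is the interesting one for a wider ALU: the same comparison could drive an overflow flag instead of silently dropping bits.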

There is also an unsigned data type.  If you change your inputs to unsigned, you must also change F to unsigned.  Everything will work correctly (except you’ll be working with non-negative numbers only).
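A side note on the process block: the case statement needs one because it is a sequential statement.  VHDL also has a concurrent counterpart, the selected signal assignment (with/select), which sits directly in the architecture body.  The sketch below is an alternative way to write the same idea, not what I used; the names smallalu_sel and Dataflow are made up, and unlike my version it drives zero for every selector value that isn’t listed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical variant of the ALU written without a process.
entity smallalu_sel is
  port (
    S    : in  integer range 0 to 15;
    A, B : in  signed(3 downto 0);
    F    : out signed(3 downto 0)
  );
end smallalu_sel;

architecture Dataflow of smallalu_sel is
begin
  -- Concurrent selected signal assignment: one waveform per selector value.
  with S select
    F <= A + B            when 1,
         A - B            when 2,
         resize(A * B, 4) when 3,
         A / B            when 4,
         to_signed(0, 4)  when others;  -- assumption: unused codes output zero
end Dataflow;
```

Because every choice assigns F, there is no selector value for which the output has to hold its previous value.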

Next, I wanted to add a reset or clear function.  Technically, it’s just a constant zero output because there are no latches inside this ALU.  This code is pure logic.  Here is what the code looks like after the change:

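```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;  -- signed type, arithmetic operators, RESIZE, to_signed

entity smallalu is port (
    S: in integer range 0 to 15;  -- function select
    A,B: in signed(3 DOWNTO 0);   -- operands
    F: out signed(3 DOWNTO 0)     -- result
);
end smallalu;

architecture Behavioral of smallalu is
begin
    process (S,A,B)
    begin
        case S is
            when 0 =>
                -- clear: drive a 4-bit signed zero
                F <= to_signed(0,4);
            when 1 =>
                F <= A+B;
            when 2 =>
                F <= A-B;
            when 3 =>
                -- A*B is 8 bits wide; RESIZE truncates it to 4 bits
                F <= RESIZE(A*B,4);
            when 4 =>
                F <= A/B;
            when others =>
                null;  -- F is not assigned for any other selector value
        end case;
    end process;
end Behavioral;
```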

As you can see, assigning a zero to F is not just a matter of using an assignment.  A constant zero is assumed to be an integer data type.  The “to_signed()” function can be used to convert it into a signed data type.  This function requires the number of bits, so I put in a 4.  The simulation looks like this:

The first block, up to 10ns, is just the clear.  From 10ns to 20ns is “add”, from 20ns to 30ns is “subtract”, from 30ns to 40ns is “multiply” and finally from 40ns to 50ns is “divide” (as you can see, I’m dividing 4 by 2).
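For anyone who wants to reproduce a run like this, here is a rough testbench sketch that steps the selector every 10ns in the same order.  It assumes the smallalu entity with the clear function has been compiled into library work.  Apart from the 4 divided by 2, the operand values are my own picks for this sketch, and the name smallalu_tb is made up.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench; expects entity work.smallalu (the version with clear).
entity smallalu_tb is
end smallalu_tb;

architecture sim of smallalu_tb is
  signal S : integer range 0 to 15 := 0;
  signal A : signed(3 downto 0) := to_signed(3, 4);  -- illustrative operand
  signal B : signed(3 downto 0) := to_signed(2, 4);  -- illustrative operand
  signal F : signed(3 downto 0);
begin
  uut : entity work.smallalu
    port map (S => S, A => A, B => B, F => F);

  stimulus : process
  begin
    S <= 0; wait for 10 ns;              -- 0ns to 10ns: clear
    assert F = to_signed(0, 4) report "clear failed" severity error;

    S <= 1; wait for 10 ns;              -- 10ns to 20ns: 3 + 2
    assert F = to_signed(5, 4) report "add failed" severity error;

    S <= 2; wait for 10 ns;              -- 20ns to 30ns: 3 - 2
    assert F = to_signed(1, 4) report "subtract failed" severity error;

    S <= 3; wait for 10 ns;              -- 30ns to 40ns: 3 * 2
    assert F = to_signed(6, 4) report "multiply failed" severity error;

    A <= to_signed(4, 4);
    S <= 4; wait for 10 ns;              -- 40ns to 50ns: 4 / 2
    assert F = to_signed(2, 4) report "divide failed" severity error;

    report "smallalu testbench finished";
    wait;                                -- stop the stimulus process
  end process;
end sim;
```

The sketch never sets B to zero while divide is selected; dividing by zero is a separate case that would need its own handling.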

One last test: I decided to compile this code for the Mimas board, just to see what kind of resources it would occupy on my FPGA.  I didn’t map any inputs and outputs, and I didn’t transfer this to the board since I don’t have enough DIP switches to represent S, A and B (though I’m sure I could get creative and use the push buttons for the “B” inputs or something).  Anyway, here is the result:

```
Slice Logic Utilization:
  Number of Slice Registers:                     0 out of  11,440    0%
  Number of Slice LUTs:                         47 out of   5,720    1%
    Number used as logic:                       47 out of   5,720    1%
      Number using O6 output only:              38
      Number using O5 output only:               0
      Number using O5 and O6:                    9
      Number used as ROM:                        0
    Number used as Memory:                       0 out of   1,440    0%

Slice Logic Distribution:
  Number of occupied Slices:                    19 out of   1,430    1%
  Number of MUXCYs used:                         8 out of   2,860    1%
  Number of LUT Flip Flop pairs used:           47
    Number with an unused Flip Flop:            47 out of      47  100%
    Number with an unused LUT:                   0 out of      47    0%
    Number of fully used LUT-FF pairs:           0 out of      47    0%
  Number of slice register sites lost
    to control set restrictions:               0 out of  11,440    0%
```

As you can see, 47 LUTs are used for the logic, spread over 19 slices.  This represents about 1% of the chip resources.  Not bad.  I’m betting that a multiplier scales up faster than linearly, so an 8-bit ALU is going to take up more than double the resources.  Let’s find out…

```
Slice Logic Utilization:
  Number of Slice Registers:                     0 out of  11,440    0%
  Number of Slice LUTs:                        112 out of   5,720    1%
    Number used as logic:                      112 out of   5,720    1%
      Number using O6 output only:             101
      Number using O5 output only:               0
      Number using O5 and O6:                   11
      Number used as ROM:                        0
    Number used as Memory:                       0 out of   1,440    0%

Slice Logic Distribution:
  Number of occupied Slices:                    42 out of   1,430    2%
  Number of MUXCYs used:                        32 out of   2,860    1%
  Number of LUT Flip Flop pairs used:          112
    Number with an unused Flip Flop:           112 out of     112  100%
    Number with an unused LUT:                   0 out of     112    0%
    Number of fully used LUT-FF pairs:           0 out of     112    0%
  Number of slice register sites lost
    to control set restrictions:               0 out of  11,440    0%
```

Hmmm… Only a little over double (2.38x).  Time to set up a multiply-only design and see what resources it takes to multiply two numbers together.  Here’s my basic code:

```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;  -- signed type, "*" operator and RESIZE

entity multiplier is port (
    A,B: in signed(3 DOWNTO 0);  -- operands
    Y: out signed(3 DOWNTO 0)    -- truncated product
);
end multiplier;

architecture Behavioral of multiplier is

begin
    -- A*B is 8 bits wide; RESIZE truncates it to 4 bits
    Y <= RESIZE(A*B,4);
end Behavioral;
```

```
Slice Logic Utilization:
  Number of Slice Registers:                     0 out of  11,440    0%
  Number of Slice LUTs:                         15 out of   5,720    1%
    Number used as logic:                       15 out of   5,720    1%
      Number using O6 output only:              10
      Number using O5 output only:               0
      Number using O5 and O6:                    5
      Number used as ROM:                        0
    Number used as Memory:                       0 out of   1,440    0%
```

That’s 15 LUTs to multiply two 4-bit numbers together.  Now for 8-bit numbers:

```
Device Utilization Summary:

Slice Logic Utilization:
  Number of Slice Registers:                     0 out of  11,440    0%
  Number of Slice LUTs:                          0 out of   5,720    0%

Slice Logic Distribution:
  Number of occupied Slices:                     0 out of   1,430    0%
  Number of MUXCYs used:                         0 out of   2,860    0%
  Number of LUT Flip Flop pairs used:            0

IO Utilization:
  Number of bonded IOBs:                        24 out of     200   12%

Specific Feature Utilization:
  Number of RAMB16BWERs:                         0 out of      32    0%
  Number of RAMB8BWERs:                          0 out of      64    0%
  Number of BUFIO2/BUFIO2_2CLKs:                 0 out of      32    0%
  Number of BUFIO2FB/BUFIO2FB_2CLKs:             0 out of      32    0%
  Number of BUFG/BUFGMUXs:                       0 out of      16    0%
  Number of DCM/DCM_CLKGENs:                     0 out of       4    0%
  Number of ILOGIC2/ISERDES2s:                   0 out of     200    0%
  Number of IODELAY2/IODRP2/IODRP2_MCBs:         0 out of     200    0%
  Number of OLOGIC2/OSERDES2s:                   0 out of     200    0%
  Number of BSCANs:                              0 out of       4    0%
  Number of BUFHs:                               0 out of     128    0%
  Number of BUFPLLs:                             0 out of       8    0%
  Number of BUFPLL_MCBs:                         0 out of       4    0%
  Number of DSP48A1s:                            1 out of      16    6%
  Number of ICAPs:                               0 out of       1    0%
  Number of MCBs:                                0 out of       2    0%
  Number of PCILOGICSEs:                         0 out of       2    0%
  Number of PLL_ADVs:                            0 out of       2    0%
  Number of PMVs:                                0 out of       1    0%
  Number of STARTUPs:                            0 out of       1    0%
  Number of SUSPEND_SYNCs:                       0 out of       1    0%
```

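Well, that’s interesting.  Apparently, there are 16 DSP modules on the chip and one of those was used for the 8-bit multiplier, with no LUTs at all.  I got the same result from a 16-bit multiplier, which makes sense since each DSP48A1 contains an 18 x 18-bit multiplier.  Let’s push it a little.  Here’s a 32-bit multiplier: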

```
Specific Feature Utilization:
  Number of RAMB16BWERs:                         0 out of      32    0%
  Number of RAMB8BWERs:                          0 out of      64    0%
  Number of BUFIO2/BUFIO2_2CLKs:                 0 out of      32    0%
  Number of BUFIO2FB/BUFIO2FB_2CLKs:             0 out of      32    0%
  Number of BUFG/BUFGMUXs:                       0 out of      16    0%
  Number of DCM/DCM_CLKGENs:                     0 out of       4    0%
  Number of ILOGIC2/ISERDES2s:                   0 out of     200    0%
  Number of IODELAY2/IODRP2/IODRP2_MCBs:         0 out of     200    0%
  Number of OLOGIC2/OSERDES2s:                   0 out of     200    0%
  Number of BSCANs:                              0 out of       4    0%
  Number of BUFHs:                               0 out of     128    0%
  Number of BUFPLLs:                             0 out of       8    0%
  Number of BUFPLL_MCBs:                         0 out of       4    0%
  Number of DSP48A1s:                            4 out of      16   25%
  Number of ICAPs:                               0 out of       1    0%
  Number of MCBs:                                0 out of       2    0%
  Number of PCILOGICSEs:                         0 out of       2    0%
  Number of PLL_ADVs:                            0 out of       2    0%
  Number of PMVs:                                0 out of       1    0%
  Number of STARTUPs:                            0 out of       1    0%
  Number of SUSPEND_SYNCs:                       0 out of       1    0%
```

No LUTs were used, but 4 DSPs were used.  For a 64-bit multiplier:

ERROR:Place:543 – This design does not fit into the number of slices available

Darn!  I had high hopes.  Oh well.  Now we know one limit of the Spartan-6 XC6SLX9 FPGA chip.
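Changing the vector ranges by hand for each of these width experiments gets tedious.  One way to avoid that, sketched here as a suggestion rather than what I actually compiled, is to make the width a generic.  The entity name multiplier_n and the generic WIDTH are names I made up for this sketch.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical width-parameterised version of the multiplier.
entity multiplier_n is
  generic (
    WIDTH : positive := 4  -- operand and result width in bits
  );
  port (
    A, B : in  signed(WIDTH - 1 downto 0);
    Y    : out signed(WIDTH - 1 downto 0)
  );
end multiplier_n;

architecture Behavioral of multiplier_n is
begin
  -- A * B is 2 * WIDTH bits wide; resize truncates it back to WIDTH bits,
  -- so the same overflow caveat applies at every width.
  Y <= resize(A * B, WIDTH);
end Behavioral;
```

An instantiation can override the default with something like “generic map (WIDTH => 8)”, which keeps the 4, 8, 16 and 32-bit variants in a single source file.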

One other arithmetic function available is the modulo (mod), which gives the remainder of a division.  Let’s add that to the ALU:

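```vhdl
library IEEE;
use IEEE.std_logic_1164.ALL;
use IEEE.numeric_std.ALL;  -- signed type, arithmetic operators, RESIZE, to_signed

entity smallalu is port (
    S: in integer range 0 to 15;  -- function select
    A,B: in signed(7 DOWNTO 0);   -- operands
    F: out signed(7 DOWNTO 0)     -- result
);
end smallalu;

architecture Behavioral of smallalu is
begin
    process (S,A,B)
    begin
        case S is
            when 0 =>
                -- clear: drive an 8-bit signed zero
                F <= to_signed(0,8);
            when 1 =>
                F <= A+B;
            when 2 =>
                F <= A-B;
            when 3 =>
                -- A*B is 16 bits wide; RESIZE truncates it to 8 bits
                F <= RESIZE(A*B,8);
            when 4 =>
                F <= A/B;
            when 5 =>
                F <= A mod B;
            when others =>
                null;  -- F is not assigned for any other selector value
        end case;
    end process;
end Behavioral;
```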

As you can see from the simulation, 7/2 gives a remainder of 1:

Finally, here’s 8/2:
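One thing to keep in mind with signed operands: VHDL has both mod and rem, and they only agree when the two operands have the same sign.  With rem the result takes the sign of the dividend (A); with mod it takes the sign of the divisor (B).  The small simulation-only sketch below checks this with 7 and -7 divided by 2.  The entity name mod_rem_demo is made up, and the negative operand is my own addition to the example above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical test entity: no ports, not part of the ALU itself.
entity mod_rem_demo is
end mod_rem_demo;

architecture sim of mod_rem_demo is
begin
  process
    variable a, b : signed(7 downto 0);
  begin
    -- Both operands positive: mod and rem give the same answer.
    a := to_signed(7, 8);
    b := to_signed(2, 8);
    assert (a mod b) = to_signed(1, 8) report "7 mod 2" severity error;
    assert (a rem b) = to_signed(1, 8) report "7 rem 2" severity error;

    -- Negative dividend: division truncates toward zero, rem follows the
    -- sign of a, and mod follows the sign of b.
    a := to_signed(-7, 8);
    assert (a / b)   = to_signed(-3, 8) report "-7 / 2"   severity error;
    assert (a rem b) = to_signed(-1, 8) report "-7 rem 2" severity error;
    assert (a mod b) = to_signed(1, 8)  report "-7 mod 2" severity error;

    report "mod/rem demo finished";
    wait;
  end process;
end sim;
```

If the ALU is meant to return the remainder that pairs with the “/” result (so that quotient times B plus remainder gives back A), rem is the operator that does that for negative inputs.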