I’m currently working on an ALU in VHDL.  To get warmed up, I tried my hand at some basic arithmetic, which I’m going to discuss in this blog post.  Here’s the diagram of the ALU:

A and B are the two inputs and F will be the result.  Each input and output is a 4-bit number.  S represents the function selected.  For now I’m going to use the following:

• 1 = Add
• 2 = Subtract
• 3 = Multiply
• 4 = Divide

I can add new functions as I need them.  I also have the option of expanding the number of bits to work with.  For now, I’ll just keep this simple.

I’m a software developer, so I look at this problem and I think “switch/case statement”.  As it turns out, there is a case statement for VHDL.  Some searching on Bing turned up this website: VHDL-Online, which I found to be very clear and easy to read.  I learn quicker from examples, so this will be my go-to site for looking up VHDL syntax.  As I looked over the syntax examples, I noticed that the case statement is a sequential statement, so it doesn’t work outside of a process block.  I just wrapped my case statement in a generic process block and came up with this block of code:

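```
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity smallalu is port (
    S: in integer range 0 to 15;   -- function select
    A,B: in signed(3 DOWNTO 0);    -- operands
    F: out signed(3 DOWNTO 0)      -- result
);
end smallalu;

architecture Behavioral of smallalu is
begin
    process (S,A,B)
    begin
        case S is
            when 1 =>
                F <= A+B;
            when 2 =>
                F <= A-B;
            when 3 =>
                F <= RESIZE(A*B,4);   -- the product is 8 bits wide
            when 4 =>
                F <= A/B;
            when others =>
                null;                 -- unused selector values: F is not assigned
        end case;
    end process;
end Behavioral;
```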

You’ll need “use IEEE.numeric_std.ALL;” at the top in order to use the math functions (and the “signed” data type).  Most of the code is pretty obvious: when the selector is set to “1”, add the two inputs and assign the sum to the output, and so on.  The multiply was a bit of a challenge.  My simulation was showing “U” for the outputs of a multiply.  I did some investigating and discovered (or rather, rediscovered) that multiplying two 4-bit numbers results in an 8-bit number.  At one time, I knew that, but it’s been a while.  So I did some research and discovered the “resize” function, which allowed me to take the 8-bit result and resize it to a 4-bit result.  The upshot is that I can’t really multiply more than about 2 bits from A by 2 bits from B; otherwise, the 4-bit result will overflow.  So I’ll need to figure out a solution to that issue in the future, when I decide to expand the data path width.
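To make the overflow concrete, here is a small self-checking simulation sketch.  The entity name `resize_demo` and the operand values are my own choices for illustration, not part of the ALU.  It relies on the way `numeric_std` defines `resize` for `signed`: when narrowing, the sign bit is kept together with the rightmost bits, so an out-of-range product does not come back as a simple truncation.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical simulation-only entity (no ports), used to poke at RESIZE.
entity resize_demo is
end resize_demo;

architecture sim of resize_demo is
begin
    process
        variable a, b   : signed(3 downto 0);
        variable full   : signed(7 downto 0);  -- 4 bits * 4 bits = 8 bits
        variable narrow : signed(3 downto 0);
    begin
        -- 2 * 3 = 6 fits in a 4-bit signed value (-8 to 7)
        a := to_signed(2, 4);
        b := to_signed(3, 4);
        full   := a * b;
        narrow := resize(full, 4);
        assert to_integer(full) = 6   report "expected full product 6" severity error;
        assert to_integer(narrow) = 6 report "expected resized product 6" severity error;

        -- 3 * 3 = 9 = "00001001" does not fit: resize keeps the sign bit ('0')
        -- and the three rightmost bits ("001"), giving "0001" = 1
        a := to_signed(3, 4);
        b := to_signed(3, 4);
        full   := a * b;
        narrow := resize(full, 4);
        assert to_integer(full) = 9   report "expected full product 9" severity error;
        assert to_integer(narrow) = 1 report "expected resized product 1" severity error;

        report "resize_demo finished";
        wait;
    end process;
end sim;
```

Running this in the simulator should finish without any assertion errors, which is a quick way to confirm how far the 4-bit multiply can be trusted.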

There is also an unsigned data type.  If you change your inputs to unsigned, you must also change F to unsigned.  Everything will work correctly (except you’ll be working with positive numbers only).
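As a quick illustration of the difference, the sketch below interprets the same four bits both ways.  The entity name `signed_vs_unsigned_demo` is made up for this example.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical simulation-only entity: one bit pattern, two interpretations.
entity signed_vs_unsigned_demo is
end signed_vs_unsigned_demo;

architecture sim of signed_vs_unsigned_demo is
begin
    process
        constant bits : std_logic_vector(3 downto 0) := "1111";
    begin
        -- two's complement: 4-bit signed covers -8 to 7
        assert to_integer(signed(bits)) = -1
            report "expected -1 as signed" severity error;
        -- plain binary: 4-bit unsigned covers 0 to 15
        assert to_integer(unsigned(bits)) = 15
            report "expected 15 as unsigned" severity error;
        wait;
    end process;
end sim;
```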

Next, I wanted to add a reset or clear function.  Technically, it’s just a constant zero output because there are no latches inside this ALU.  This code is pure logic.  Here is what the code looks like after the change:

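```
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity smallalu is port (
    S: in integer range 0 to 15;   -- function select
    A,B: in signed(3 DOWNTO 0);    -- operands
    F: out signed(3 DOWNTO 0)      -- result
);
end smallalu;

architecture Behavioral of smallalu is
begin
    process (S,A,B)
    begin
        case S is
            when 0 =>
                F <= to_signed(0,4);  -- clear: constant zero, 4 bits wide
            when 1 =>
                F <= A+B;
            when 2 =>
                F <= A-B;
            when 3 =>
                F <= RESIZE(A*B,4);
            when 4 =>
                F <= A/B;
            when others =>
                null;                 -- unused selector values: F is not assigned
        end case;
    end process;
end Behavioral;
```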

As you can see, assigning a zero to F is not just a matter of using an assignment.  A constant zero is assumed to be an integer data type.  The “to_signed()” function can be used to convert it into a signed data type.  This function requires the number of bits, so I put in a 4.  The simulation looks like this:

The first block up to 10ns is just the clear.  From 10ns to 20ns is “add”, from 20-30 is “subtract”, 30-40 is multiply and finally 40-50 is divide (as you can see I’m dividing 4 by 2).
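The test bench for the ALU isn’t shown, so here is a minimal sketch of one that would step through the same 10ns timeline.  The entity name `smallalutest` and the add, subtract and multiply operands (3 and 2) are assumptions of mine; only the 4 divided by 2 case comes from the description above.  It expects the `smallalu` entity with the clear function to be compiled into the `work` library.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical test bench for smallalu (4-bit version with clear on S = 0).
entity smallalutest is
end smallalutest;

architecture behavior of smallalutest is
    signal S : integer range 0 to 15 := 0;
    signal A : signed(3 downto 0) := to_signed(3, 4);
    signal B : signed(3 downto 0) := to_signed(2, 4);
    signal F : signed(3 downto 0);
begin
    -- direct instantiation, so no component declaration is needed
    uut: entity work.smallalu port map (S => S, A => A, B => B, F => F);

    stim_proc: process
    begin
        S <= 0;                        -- 0-10ns: clear
        wait for 10 ns;
        assert to_integer(F) = 0 report "clear failed" severity error;

        S <= 1;                        -- 10-20ns: 3 + 2
        wait for 10 ns;
        assert to_integer(F) = 5 report "add failed" severity error;

        S <= 2;                        -- 20-30ns: 3 - 2
        wait for 10 ns;
        assert to_integer(F) = 1 report "subtract failed" severity error;

        S <= 3;                        -- 30-40ns: 3 * 2 (fits in 4 bits)
        wait for 10 ns;
        assert to_integer(F) = 6 report "multiply failed" severity error;

        A <= to_signed(4, 4);          -- 40-50ns: 4 / 2
        S <= 4;
        wait for 10 ns;
        assert to_integer(F) = 2 report "divide failed" severity error;

        wait;
    end process;
end behavior;
```

B is kept non-zero the whole time, since a divide by zero would stop the simulation with an error as soon as S selects the divide.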

One last test: I decided to compile this code for the Mimas board, just to see what kind of resources it would occupy on my FPGA.  I didn’t map any inputs and outputs, and I didn’t transfer this to the board since I don’t have enough DIP switches to represent S, A and B (though I’m sure I could get creative and use the push buttons for the “B” inputs or something).  Anyway, here is the result:

```
Slice Logic Utilization:
Number of Slice Registers:                     0 out of  11,440    0%
Number of Slice LUTs:                         47 out of   5,720    1%
Number used as logic:                       47 out of   5,720    1%
Number using O6 output only:              38
Number using O5 output only:               0
Number using O5 and O6:                    9
Number used as ROM:                        0
Number used as Memory:                       0 out of   1,440    0%

Slice Logic Distribution:
Number of occupied Slices:                    19 out of   1,430    1%
Number of MUXCYs used:                         8 out of   2,860    1%
Number of LUT Flip Flop pairs used:           47
Number with an unused Flip Flop:            47 out of      47  100%
Number with an unused LUT:                   0 out of      47    0%
Number of fully used LUT-FF pairs:           0 out of      47    0%
Number of slice register sites lost
to control set restrictions:               0 out of  11,440    0%
```

As you can see, 47 LUTs are used for the logic, spread across 19 slices.  This represents about 1% of the chip resources.  Not bad.  I’m betting that a multiplier scales up exponentially, so an 8-bit ALU is going to take up more than double the resources.  Let’s find out…

```
Slice Logic Utilization:
Number of Slice Registers:                     0 out of  11,440    0%
Number of Slice LUTs:                        112 out of   5,720    1%
Number used as logic:                      112 out of   5,720    1%
Number using O6 output only:             101
Number using O5 output only:               0
Number using O5 and O6:                   11
Number used as ROM:                        0
Number used as Memory:                       0 out of   1,440    0%

Slice Logic Distribution:
Number of occupied Slices:                    42 out of   1,430    2%
Number of MUXCYs used:                        32 out of   2,860    1%
Number of LUT Flip Flop pairs used:          112
Number with an unused Flip Flop:           112 out of     112  100%
Number with an unused LUT:                   0 out of     112    0%
Number of fully used LUT-FF pairs:           0 out of     112    0%
Number of slice register sites lost
to control set restrictions:               0 out of  11,440    0%
```

Hmmm…. Only a little over double (112 LUTs versus 47, or about 2.38×).  Time to set up a multiply-only design and see what resources it takes to multiply two numbers together.  Here’s my basic code:

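```
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity multiplier is port (
    A,B: in signed(3 DOWNTO 0);
    Y: out signed(3 DOWNTO 0)
);
end multiplier;

architecture Behavioral of multiplier is

begin
    -- concurrent assignment: no process needed for a single expression
    Y <= RESIZE(A*B,4);
end Behavioral;
```

```
Slice Logic Utilization: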
Number of Slice Registers:                     0 out of  11,440    0%
Number of Slice LUTs:                         15 out of   5,720    1%
Number used as logic:                       15 out of   5,720    1%
Number using O6 output only:              10
Number using O5 output only:               0
Number using O5 and O6:                    5
Number used as ROM:                        0
Number used as Memory:                       0 out of   1,440    0%
```

That’s 15 LUTs to multiply two 4-bit numbers together.  Here’s the result for 8-bit numbers:

```
Device Utilization Summary:

Slice Logic Utilization:
Number of Slice Registers:                     0 out of  11,440    0%
Number of Slice LUTs:                          0 out of   5,720    0%

Slice Logic Distribution:
Number of occupied Slices:                     0 out of   1,430    0%
Number of MUXCYs used:                         0 out of   2,860    0%
Number of LUT Flip Flop pairs used:            0

IO Utilization:
Number of bonded IOBs:                        24 out of     200   12%

Specific Feature Utilization:
Number of RAMB16BWERs:                         0 out of      32    0%
Number of RAMB8BWERs:                          0 out of      64    0%
Number of BUFIO2/BUFIO2_2CLKs:                 0 out of      32    0%
Number of BUFIO2FB/BUFIO2FB_2CLKs:             0 out of      32    0%
Number of BUFG/BUFGMUXs:                       0 out of      16    0%
Number of DCM/DCM_CLKGENs:                     0 out of       4    0%
Number of ILOGIC2/ISERDES2s:                   0 out of     200    0%
Number of IODELAY2/IODRP2/IODRP2_MCBs:         0 out of     200    0%
Number of OLOGIC2/OSERDES2s:                   0 out of     200    0%
Number of BSCANs:                              0 out of       4    0%
Number of BUFHs:                               0 out of     128    0%
Number of BUFPLLs:                             0 out of       8    0%
Number of BUFPLL_MCBs:                         0 out of       4    0%
Number of DSP48A1s:                            1 out of      16    6%
Number of ICAPs:                               0 out of       1    0%
Number of MCBs:                                0 out of       2    0%
Number of PCILOGICSEs:                         0 out of       2    0%
Number of PLL_ADVs:                            0 out of       2    0%
Number of PMVs:                                0 out of       1    0%
Number of STARTUPs:                            0 out of       1    0%
Number of SUSPEND_SYNCs:                       0 out of       1    0%
```

Well, that’s interesting.  Apparently, there are 16 DSP modules on this chip, and one of them was used for the 8-bit multiplier instead of any LUTs.  I got the same result from a 16-bit multiplier.  Let’s push it a little.  Here’s a 32-bit multiplier:

```
Specific Feature Utilization:
Number of RAMB16BWERs:                         0 out of      32    0%
Number of RAMB8BWERs:                          0 out of      64    0%
Number of BUFIO2/BUFIO2_2CLKs:                 0 out of      32    0%
Number of BUFIO2FB/BUFIO2FB_2CLKs:             0 out of      32    0%
Number of BUFG/BUFGMUXs:                       0 out of      16    0%
Number of DCM/DCM_CLKGENs:                     0 out of       4    0%
Number of ILOGIC2/ISERDES2s:                   0 out of     200    0%
Number of IODELAY2/IODRP2/IODRP2_MCBs:         0 out of     200    0%
Number of OLOGIC2/OSERDES2s:                   0 out of     200    0%
Number of BSCANs:                              0 out of       4    0%
Number of BUFHs:                               0 out of     128    0%
Number of BUFPLLs:                             0 out of       8    0%
Number of BUFPLL_MCBs:                         0 out of       4    0%
Number of DSP48A1s:                            4 out of      16   25%
Number of ICAPs:                               0 out of       1    0%
Number of MCBs:                                0 out of       2    0%
Number of PCILOGICSEs:                         0 out of       2    0%
Number of PLL_ADVs:                            0 out of       2    0%
Number of PMVs:                                0 out of       1    0%
Number of STARTUPs:                            0 out of       1    0%
Number of SUSPEND_SYNCs:                       0 out of       1    0%
```

No LUTs were used, but 4 DSPs were used.  For a 64-bit multiplier:

ERROR:Place:543 – This design does not fit into the number of slices available

Darn!  I had high hopes.  Oh well.  Now we know a limit of the Spartan-6 XC6SLX9 FPGA chip.
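For width experiments like these, it can be handy to make the width a generic instead of editing the port declarations each time.  The following is only a sketch: the entity name `multiplier_n` and the generic `WIDTH` are hypothetical names, not what was used for the reports above.  As background, each DSP48A1 slice on the Spartan-6 contains an 18 x 18 multiplier, which would be consistent with the 8-bit and 16-bit versions fitting in one DSP and the 32-bit version needing four.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical generic-width version of the multiplier.
entity multiplier_n is
    generic (
        WIDTH : positive := 8          -- operand and result width in bits
    );
    port (
        A, B : in  signed(WIDTH-1 downto 0);
        Y    : out signed(WIDTH-1 downto 0)
    );
end multiplier_n;

architecture Behavioral of multiplier_n is
begin
    -- A*B is 2*WIDTH bits wide; resize narrows it back to WIDTH bits
    Y <= resize(A * B, WIDTH);
end Behavioral;
```

When this is the top-level entity, the default of 8 applies unless the tool is told to override the generic; as a sub-module, it can be set per instance with a `generic map (WIDTH => 16)`.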

One other arithmetic function available is the modulo (mod), which gives the remainder.  Let’s add that to the ALU:

```
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity smallalu is port (
    S: in integer range 0 to 15;   -- function select
    A,B: in signed(7 DOWNTO 0);    -- operands, now 8 bits
    F: out signed(7 DOWNTO 0)      -- result
);
end smallalu;

architecture Behavioral of smallalu is
begin
    process (S,A,B)
    begin
        case S is
            when 0 =>
                F <= to_signed(0,8);
            when 1 =>
                F <= A+B;
            when 2 =>
                F <= A-B;
            when 3 =>
                F <= RESIZE(A*B,8);
            when 4 =>
                F <= A/B;
            when 5 =>
                F <= A mod B;         -- remainder
            when others =>
                null;                 -- unused selector values: F is not assigned
        end case;
    end process;
end Behavioral;
```

As you can see from the simulation, 7/2 gives a remainder of 1:

Finally, here’s 8/2:
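One caveat with signed operands: VHDL has both `mod` and `rem`, and they only agree when both operands have the same sign.  With `mod` the result takes the sign of the divisor, with `rem` it takes the sign of the dividend, and `/` truncates toward zero.  The self-checking sketch below shows the difference; the entity name `modrem_demo` and the test values (other than 7 and 2) are my own.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical simulation-only entity comparing /, mod and rem on signed values.
entity modrem_demo is
end modrem_demo;

architecture sim of modrem_demo is
begin
    process
        variable a, b : signed(7 downto 0);
    begin
        -- both positive: mod and rem agree
        a := to_signed(7, 8);
        b := to_signed(2, 8);
        assert to_integer(a / b) = 3   report "7 / 2 should be 3" severity error;
        assert to_integer(a mod b) = 1 report "7 mod 2 should be 1" severity error;
        assert to_integer(a rem b) = 1 report "7 rem 2 should be 1" severity error;

        -- negative dividend: the two operators differ
        a := to_signed(-7, 8);
        assert to_integer(a / b) = -3   report "-7 / 2 should be -3" severity error;
        assert to_integer(a mod b) = 1  report "-7 mod 2 should be 1" severity error;
        assert to_integer(a rem b) = -1 report "-7 rem 2 should be -1" severity error;

        report "modrem_demo finished";
        wait;
    end process;
end sim;
```

If the ALU is meant to return the classic "remainder after division" for negative inputs, `rem` may be the operator to use in place of `mod`.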

My latest hobby is to learn VHDL and apply it to the Mimas V2 FPGA board.  As with any language, reading about the language is all well and good, but attempting a real application is where the rubber hits the road.  I read through several tutorials and some introductory material to familiarize myself with some of the syntax.  I copied some code from example articles and observed how the code worked in simulation mode.  Then I finally decided to try to build a circuit without the code from a tutorial.  Keeping it somewhat simple, I chose to implement a shift register.  My goal was to create a 4-bit shift register that had one input that I could feed a “1” or a “0” per clock cycle.  I also wanted to see all 4 outputs to observe the data bits shifting down the line.  Here’s the circuit:

My first attempt was to just shift the outputs as though they were storage locations.  That caused an error: “Cannot read from ‘out’ object output ; use ‘buffer’ or ‘inout’”.  I discovered that I needed some sort of storage inside my object (the flip-flops that represent my last state).  So I set up a “signal”:

`signal dflipflops: STD_LOGIC_VECTOR(3 downto 0):="0000";`

You can give your signal any name; I just called it dflipflops because that’s what popped into my head.  As you can see, the signal can be initialized to some value (in double quotes because it’s a vector).

Next, I coded the reset.  I didn’t really need a reset for the simulation since I set dflipflops to all zeros when the object is initialized.  However, if I decided to use this as a real circuit, I’d have to have a way to reset it at any time.  So I coded my reset as simply as possible:

```
if (reset = '1') then
    dflipflops <= "0000";
else
    -- shift logic goes here
end if;
```

Next, I hard-coded the logic of shifting bits (this is just the inner logic):

```
if (clock='1' and clock'event) then
    dflipflops(3) <= dflipflops(2);
    dflipflops(2) <= dflipflops(1);
    dflipflops(1) <= dflipflops(0);
    dflipflops(0) <= datain;
end if;
```

I did this because I wanted to see if the simulation worked, and it didn’t.  I ended up with unknown outputs:

Yup, forgot to translate my signal back out to the outputs:

```
output(0) <= dflipflops(0);
output(1) <= dflipflops(1);
output(2) <= dflipflops(2);
output(3) <= dflipflops(3);
```

That worked:

Next, I converted to “for” loops and here’s the final code:

```
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- 4-bit shift register
entity shiftregister is port (
    datain,clock,reset: in STD_LOGIC;
    output: out STD_LOGIC_VECTOR(3 DOWNTO 0)
);
end shiftregister;

architecture Behavioral of shiftregister is

    -- internal storage for the four flip-flops
    signal dflipflops: STD_LOGIC_VECTOR(3 downto 0):="0000";

begin
    process (datain,clock,reset)
    begin
        if (reset = '1') then
            dflipflops <= "0000";
        else
            if (clock='1' and clock'event) then
                -- shift each bit up one position and load the new bit into bit 0
                for i in 2 downto 0 loop
                    dflipflops(i+1) <= dflipflops(i);
                end loop;
                dflipflops(0) <= datain;
            end if;
        end if;

        -- copy the internal state to the output port
        for k in 3 downto 0 loop
            output(k) <= dflipflops(k);
        end loop;

    end process;
end Behavioral;
```

The test bench code is here:

```
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

ENTITY shiftregistertest IS
END shiftregistertest;

ARCHITECTURE behavior OF shiftregistertest IS

-- Component Declaration for the Unit Under Test (UUT)

COMPONENT shiftregister
PORT(
datain : IN  std_logic;
clock : IN  std_logic;
reset : IN std_logic;
output : OUT  std_logic_vector(3 downto 0)
);
END COMPONENT;

--Inputs
signal datain : std_logic := '0';
signal clock : std_logic := '0';
signal reset : std_logic := '1';

--Outputs
signal output : std_logic_vector(3 downto 0);

-- Clock period definitions
constant clock_period : time := 10 ns;

BEGIN

-- Instantiate the Unit Under Test (UUT)
uut: shiftregister PORT MAP (
datain => datain,
clock => clock,
reset => reset,
output => output
);

-- Clock process definitions
clock_process :process
begin
clock <= '0';
wait for clock_period/2;
clock <= '1';
wait for clock_period/2;
end process;

-- Stimulus process
stim_proc: process
begin
-- hold reset state for 100 ns.
wait for 100 ns;

reset <= '0';

wait for clock_period*6;

-- test 1
datain <= '1';
wait for clock_period;

datain <= '0';
wait for clock_period*6;

-- test 2
datain <= '1';
wait for clock_period*2;

datain <= '0';
wait for clock_period*6;

wait;
end process;

END;
```

For the test bench, I first defaulted the reset to “1” to force a reset at the beginning.  Then I set the reset back to “0” before testing data inputs.  The first test (test 1) feeds a “1” into datain for one clock cycle, then sets datain back to “0”.  Then I shift 6 times to make the “1” shift all the way out of the shift register.  In the next test (test 2), I set datain to “1” and hold it for two clock cycles, which shifts two “1”s into the shift register.  Then I set datain back to “0” and shift for 6 clock cycles to watch the two bits shift all the way through the shift register.  Here’s the simulation output:

The first test starts at 170ns and ends around 210ns.  The second test starts at 240ns and ends at 290ns.
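One detail worth noting about those times: the rising clock edges in the test bench fall at 5ns, 15ns, 25ns and so on, yet the tests show up on the output at 170ns and 240ns, which are falling edges.  That would be consistent with the output copy sitting inside the process while dflipflops is not in the sensitivity list, so in simulation the output only refreshes the next time datain, clock or reset changes.  Below is a sketch of an alternative, not the code used for the results above.  It moves the output copy to a concurrent assignment and uses `rising_edge` and a slice with `&` in place of the loop; the architecture body is my own variation, while the entity is unchanged so it works with the same test bench.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- 4-bit shift register, variant with the output driven outside the process
entity shiftregister is port (
    datain, clock, reset : in  STD_LOGIC;
    output               : out STD_LOGIC_VECTOR(3 downto 0)
);
end shiftregister;

architecture Behavioral of shiftregister is
    signal dflipflops : STD_LOGIC_VECTOR(3 downto 0) := "0000";
begin
    -- only clock and reset need to be in the sensitivity list
    process (clock, reset)
    begin
        if reset = '1' then
            dflipflops <= "0000";
        elsif rising_edge(clock) then
            -- drop bit 3, move bits 2..0 up one place, load datain into bit 0
            dflipflops <= dflipflops(2 downto 0) & datain;
        end if;
    end process;

    -- concurrent assignment: output follows dflipflops whenever it changes
    output <= dflipflops;
end Behavioral;
```

With the same test bench, I would expect the first “1” to appear on the output at the 165ns rising edge rather than at 170ns, with the rest of the waveform shifted by the same half clock period.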

One thing I noticed about the editor is that you must select the test source file before double-clicking on “Simulate Behavioral Model”:

Otherwise, you’ll get a result like this:

You also need to close the ISim window before you can run another simulation; otherwise, you’ll get an error like:

ERROR:Simulator:904 – Unable to remove previous simulation file isim/shiftregistertest_isim_beh.exe.sim/shiftregistertest_isim_beh.exe. Please check if you have another instance of this simulation running on your system, terminate it and then recompile your design. System Error Message: boost::filesystem::remove: Access is denied: “isim\shiftregistertest_isim_beh.exe.sim\shiftregistertest_isim_beh.exe”
ERROR:Simulator:861 – Failed to link the design

Once this error occurs, you’ll need to close the ISim window, then right-click on “Simulate Behavioral Model” and select “rerun all”.  Double-clicking just gives this error:

INFO:ProjectMgmt – The selected process was not run because a prior process failed.

One other thing I find annoying about the editor is that there is no file name change capability.  I’ve attempted to change the name of a file and ended up with a mess.  There is a lot of smart linking that goes on between the project and the files that belong to it.  My quick fix is to create a new file with the new name and scrape the code from the old source and paste into the new source.  Then I delete the old file.  It’s dumb and dirty, but it’s also pretty quick.

Other than the few quirks that I’ve worked around, I am happy that the editor is similar to Visual Studio in commands and syntax highlighting.