BML FPGA Design Tutorial Part-1ofN

2024.05.18 : So what is digital logic design exactly?

This is Part-1 of my series “BML FPGA Design Tutorial” which begins here.

The D Flip-Flop is the primary building block of digital logic design. A digital logic design deals with binary bits and the simple D Flip-Flop is the keeper of those bits.
From Wikipedia, “The D flip-flop is widely used, and known as a ‘data’ flip-flop. The D flip-flop captures the value of the D-input at a definite portion of the clock cycle (such as the rising edge of the clock). That captured value becomes the Q output. At other times, the output Q does not change. The D flip-flop can be viewed as a memory cell, a zero-order hold, or a delay line.”

Digital logic designs can do truly amazing things. They can implement complex state machines with latency of only a few nanoseconds. They can perform parallel digital signal processing at rates of multiple DVD movies per second. They can electrically interconnect “dumb” semiconductor chips together into a working product. Unlike CPUs, digital logic designs can do everything in parallel ( all tasks at the same time ) rather than sequentially ( one task after the other ).

Ignoring combinatorial logic for now, the simplest thing to build with a bunch of digital D Flops is a shift-register delay line. For that reason, Part-1 and Part-2 of this introductory FPGA tutorial will dive into a very simple four-tap shift-register design.
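To make the delay-line idea concrete before any real hardware shows up, here is a minimal behavioral sketch in Python. It is not the Verilog from Part-2, just a software model. The class and function names ( DFlipFlop, ShiftRegister, run ) are my own inventions for illustration, and one call to clock() stands in for one rising clock edge.

```python
class DFlipFlop:
    """Rising-edge D flip-flop model: Q only changes when clock() is called."""

    def __init__(self, q=0):
        self.q = q

    def clock(self, d):
        """Model one rising clock edge: capture D and return the new Q."""
        self.q = d & 1
        return self.q


class ShiftRegister:
    """Chain of D flip-flops where each D input is the previous flop's Q."""

    def __init__(self, taps=4):
        self.flops = [DFlipFlop() for _ in range(taps)]

    def clock(self, din):
        # Sample every D input before updating any flop, because in
        # hardware all flops capture on the same clock edge.
        d_inputs = [din] + [flop.q for flop in self.flops[:-1]]
        for flop, d in zip(self.flops, d_inputs):
            flop.clock(d)
        return self.flops[-1].q


def run(bits, taps=4):
    """Clock a bit stream through a fresh shift register, one bit per edge."""
    shift_register = ShiftRegister(taps)
    return [shift_register.clock(bit) for bit in bits]


if __name__ == "__main__":
    # A single 1 followed by zeros appears at the output on the 4th edge.
    print(run([1, 0, 0, 0, 0, 0]))
```

A bit presented at the input shows up at the output four clock edges later, which is all a four-tap delay line does.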

First, a little history and a comparison of full custom silicon, Gate Arrays and Field Programmable Gate Arrays.

At the beginning of time, the Big-Bang of semiconductor electronics happened in 1959 when Robert Noyce at Fairchild Semiconductor invented the first monolithic silicon “Integrated Circuit”. Instead of just a single transistor, Noyce’s IC had multiple transistors. From here fully custom circuits could be built.
Bob Noyce is a Rock God. A giant whose shoulders I stand upon.

For this tutorial, a fully custom silicon chip for a very simple four-tap shift register could be designed and built using CMOS D Flops and it would be both small and fast. Ignoring power, ground and clock – this simple device has only two pins, an input and an output. The logic gates are as close together as possible and metal routing is as short as possible with no wasted silicon. It’s a fully optimized chip design. It would be a little smaller than a 4-bit 74HC93 counter. This smallness and fastness comes at a cost however. That cost is NRE ( Non-Recurring Engineering ). NRE is both the engineering cost ( in time and salary ) and tooling costs ( reticles or masks ) for producing fully custom silicon. Paying the NRE gets you the smallest ( and cheapest ) silicon. Could there be alternatives to high NREs though?

Enter the Gate-Array in the early 1980’s. With Gate Arrays, a “generic” base wafer design is created that will potentially be used by multiple customers for completely different designs. The NRE for this base wafer can then be amortized down to very little. Think of Gate Arrays like a PCB fully stuffed with 7474 Flip-Flops and 7400-NAND ICs where at the very end, a customer gets to add two additional routing layers, unique to their design, that connect all the chips. That’s a Gate Array.

Starting with this base wafer, the end customer designs and pays for custom interconnect for this generic array of gates. Typically this only requires custom reticles for one or two metal layers – making gate arrays very affordable. It can lower the price of entry to custom silicon design by 90%. Really. Gate-Arrays were that revolutionary in the 1980’s and 1990’s. I’ve done tape-outs that were the cost of my house and ones that were the cost of my car. I definitely prefer the latter.

This lower Gate-Array NRE does come at a cost. The size of the silicon for this chip is considerably larger than a fully custom design. Why? The gates for the base-wafer are deliberately spaced far apart to allow for varying routing demands of multiple customer designs. Reaching full gate utilization in a gate-array design is also unlikely. For example, a vendor might offer base-wafers of 10,000 or 20,000 gates. If your design needs 11,000 gates – you pay for the larger 20,000 gate base-wafer area. The same goes for package size and pin counts. So although your NRE may be 10% of full-custom, your piece price may be 200% or 300%. So even though the NRE may be nearly free, nothing is ever free. Also, everything ends Tony.

By the early 1990’s, LSI Logic ASICs pretty much killed the Gate-Array industry. While this slow death was happening a new technology called the FPGA ( Field Programmable Gate Array ) entered the market. In 1984, a guy named Ross Freeman came up with this radical idea of making a gate array that was fully programmable in-field rather than configured with metal masks at the foundry. With his radical new idea began Xilinx ( now part of AMD ).

Just like a traditional Gate-Array, an FPGA has a base-wafer like design with gates that are not connected to each other and are spaced far apart. What differs is that the FPGA is already fully metalized with a giant matrix of metal routing that allows any flip-flop to connect to any other flip-flop across the chip. It’s a bit like a freeway system where all the on and off ramps are draw-bridges that become either open or closed depending on the end user’s design.

These connections are done just after powerup using pass-transistors (switches) called “Programmable Interconnect Points” or PIPs. It’s really a crazy concept. An off-chip EEPROM stores dozens of PIP configuration bits for each user accessible flip-flop in an FPGA. On powerup, the EEPROM contents are fed into a giant shift-register of non-user flip-flops which then either open or close each PIP. This configuration can take 100’s of milliseconds – which is forever in the digital realm.
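The powerup sequence above can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the PIP names, the eight-entry PIP list and the function names are all hypothetical, and a real FPGA configuration chain is vastly longer and vendor specific. It just shows stored bits being shifted in one per clock and each resulting bit closing or opening one PIP.

```python
# Hypothetical PIPs for a toy fabric; position i in the config chain drives PIP i.
PIP_NAMES = [
    "din->ff0.d", "ff0.q->ff1.d", "ff0.q->ff2.d", "ff1.q->ff2.d",
    "ff1.q->ff3.d", "ff2.q->ff3.d", "ff3.q->dout", "ff0.q->dout",
]


def load_configuration(eeprom_bits):
    """Shift the stored bits, one per clock, into a chain of config flops."""
    chain = [0] * len(eeprom_bits)
    for bit in eeprom_bits:
        # Each clock edge moves every bit one flop deeper into the chain
        # and brings the next stored bit in at the front.
        chain = [bit] + chain[:-1]
    return chain


def closed_pips(chain):
    """Return the names of the PIPs whose configuration bit is 1 (closed)."""
    return [name for name, bit in zip(PIP_NAMES, chain) if bit]


# Desired final chain state for a 4-tap shift register on this toy fabric.
WANTED = [1, 1, 0, 1, 0, 1, 1, 0]

# The first bit shifted in ends up deepest in the chain, so the stored
# image is simply the desired state in reverse order.
EEPROM_IMAGE = WANTED[::-1]

if __name__ == "__main__":
    for name in closed_pips(load_configuration(EEPROM_IMAGE)):
        print("closed:", name)
```

Only five of the eight toy PIPs end up closed. The other three still take up configuration bits even though the design never uses them.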

At a bitstream level, designing an FPGA is akin to designing a printed circuit board. The designer is deciding how to connect things electrically but within the chip itself. Thankfully, with millions to billions of PIPs to decide upon, today’s EDA tools offer higher levels of abstraction ( RTL ) which I will start to explain in Part-2.

My example super-simple FPGA below has only 4 user flip-flops, but more than 400 PIPs. These PIPs are not drawn to scale ( they are much larger than a single silicon via ). The Artix-7 XC7A35T FPGA used later in this tutorial has 40,000 flip-flops and requires 1,600,000 PIP configuration bits. Thankfully flash memory is dirt cheap these days ( it wasn’t in the 1980’s and 1990’s ). A $100 FPGA might have an external flash EEPROM costing only $1 or $2. Ross Freeman’s brilliance wasn’t skating to the puck, but skating to where the puck was going.

Not counting diffused IP like multipliers and RAMs, the actual overhead for an FPGA versus custom silicon is about 10x. This means a purely digital logic design will consume about 10x the CMOS area in an FPGA as it would in an ASIC. FPGAs cost more per chip than custom silicon, but custom silicon NREs are now more than $1 million USD – so there’s a market for FPGAs in low volume designs.
Would a company be better off spending $1M NRE and $10 per ASIC or $0 NRE and $100 per FPGA?
Ay, there’s the rub.
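The answer depends entirely on volume, and the arithmetic is easy to sketch in Python. The dollar figures are the round numbers from the question above, not real quotes, and the function names are mine.

```python
def total_cost(nre, unit_price, units):
    """Total spend: one-time NRE plus the piece price times the volume."""
    return nre + unit_price * units


def break_even_units(nre_a, price_a, nre_b, price_b):
    """Volume at which option A and option B cost exactly the same."""
    return (nre_a - nre_b) / (price_b - price_a)


# Round numbers from the text: ASIC = $1M NRE and $10 each, FPGA = $0 NRE and $100 each.
ASIC_NRE, ASIC_PRICE = 1_000_000, 10
FPGA_NRE, FPGA_PRICE = 0, 100

if __name__ == "__main__":
    crossover = break_even_units(ASIC_NRE, ASIC_PRICE, FPGA_NRE, FPGA_PRICE)
    print(f"break-even at about {crossover:,.0f} units")
    for units in (1_000, 10_000, 100_000):
        asic = total_cost(ASIC_NRE, ASIC_PRICE, units)
        fpga = total_cost(FPGA_NRE, FPGA_PRICE, units)
        print(f"{units:>7,} units: ASIC ${asic:,}  FPGA ${fpga:,}")
```

With these numbers the crossover lands at roughly 11,000 units. Below that the FPGA is the cheaper route, and above it the ASIC NRE starts paying for itself.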

The original 4-tap shift register design implemented in the above FPGA would look like this:

What is noteworthy is that the vast majority of the metal routing channels go unused. Very few PIPs are actually closed. They are all still consuming die area and must be paid for both in wafer test time and silicon area. The routing is also quite slow compared to full custom silicon. Not only are the routes much longer ( die is 10x the size, so more capacitance ) but each PIP also adds parasitic resistance and capacitance, compared with the near-zero parasitics of metal vias in a fully custom ASIC, ASSP or even Gate-Array design. These PIP-connected long Manhattan routes take nanoseconds instead of 100’s of picoseconds. Early generation FPGAs at 350nm were slow, running at only 20 to 40 MHz. Today’s 14nm FPGAs run comfortably at 200 to 300 MHz for many designs.

So what does this all mean? FPGAs will never compete with full custom silicon for things like 3 GHz CPUs and GPUs. For many applications though, 300 MHz is “good enough” and million dollar NREs are just way too much.

This ends Part-1 of the BML FPGA Tutorial. In Part-2 I show how to implement this simple 4-tap shift-register in structural Verilog and then Synthesize, Map and Place and Route into an AMD/Xilinx Artix-7 FPGA bitstream and flash some LEDs on the Digilent BASYS3 board.

Future Part-3 of this tutorial will explain Inferring gates using Verilog-RTL and how FPGA LUTs are used to implement generic combinatorial logic. I hope that you enjoyed this introduction to FPGAs and found it informative.
EOF
