# Generative AI for the Design of Digital Neuromorphic Spiking Neural Networks

#### Andreas G. Andreou

<u>http://andreoulab.net/</u> @andreoulab

*Electrical and Computer Engineering, Center for Language and Speech Processing* Johns Hopkins University

Contributions by:



Michael Tomlinson, Joe Li, Paola Vitolo, George Psaltakis











### Outline

- Introduction
  - Historical perspective
  - Neuromorphic systems for embodied AI
- Natural-language-based digital Spiking Neural Network chip design (a.k.a. LLM-assisted Verilog generation)
- Conclusions



#### 2023: The Year of Generative-AI: Large Language Models and Stable Diffusion



A 1923 comic for *New York World* by cartoonist H. T. Webster (1885-1952)



06/22/2024

# 65 years of chip technology

Jack Kilby, Texas Instruments, Phase Shift Oscillator (1958)

Robert Noyce, Fairchild/Intel, Integrated Circuit (1959)

Feature size: mm

Google Doodle

December 12, 2011

NVIDIA GB200 Grace Blackwell Superchip (2024)







36 × GB200 → GB200-NVL72

1 mm = 1000000 nm

Feature size: nm

#### ASICs – Application Specific Integrated Circuits





#### Global ASIC Market [USD Billion] [1]



 [1] "Application specific integrated circuit market full report," market.us, accessed: 3 June 2024. [Online].
 Available: https://market.us/report/application-specific-integrated-circuit-market.

Image generated by ChatGPT 4.0

# Challenges in Hardware Design





**Designing, verifying,** and **fabricating** a chip is a **complex and expensive process**, often spanning several years and costing hundreds of millions of dollars.



**Challenge:** meeting the increasing demand for ASICs while maintaining high levels of reliability and keeping costs low.

#### Natural Language for Architecture Exploration, Design and Verification




#### Design Methodology for Managing Complexity of Computational Systems



Lynn Ann Conway (January 2, 1938 – June 9, 2024)

https://www.invent.org/inductees/lynn-conway

2023 induction to the National Inventors Hall of Fame



1980





Carver Mead (May 1, 1934 - )

1989

2022 Kyoto Prize in Advanced Technology (<u>https://www.kyotoprize.org/en/2022/</u>)

#### Moore's Law, CAD Tools and Foundry Services

# Cramming more components onto integrated circuits

With unit cost falling as the number of components per circuit rises, by 1975 economics may dictate squeezing as many as 65,000 components on a single silicon chip

By Gordon E. Moore

Electronics, Volume 38, Number 8, April 19, 1965





# **SYNOPSYS**<sup>®</sup>

Predictable Success



MOSIS Foundry

University EDA tools: MAGIC, RSIM

Advances in technology create advances in computers and CAD tools that in turn accelerate advances in technology

06/22/2024 DCgAA 2024: International Workshop on DL-Hardware Co-Design for Generative AI Acceleration

**JOHNS HOPKINS** 



# Neuromorphic Electronic Systems

### Early 90s



in your phone.





#### Compute environment:

Pentium Pro: 0.5 µm, 5.5 M transistors, 150 MHz, FSB 66 MHz, 3.3 V; on-package 256 KB L2 cache, 3.3 V; 35 W; 64 MB RAM; 5 GB HD

DEC Alpha 21064A: 0.5 µm, 2.85 M transistors, 200 MHz, 256 KB Bcache, 3.3 V, 30 W; 64 MB RAM; 5 GB HD



K. A. Boahen and A. G. Andreou, "A Contrast Sensitive Silicon Retina with Reciprocal Synapses," Neural Information Processing Systems 3, 1990, vol. 4, pp. 764–772.



# WAM Chip: Neuromorphic Processor In Memory (PIM), Compute In Memory (CIM), In Memory Computing (IMC)

Winner-Takes-All Associative Memory: A Hamming Distance Vector Quantizer

**PHILIPPE O. POULIQUEN**,<sup>1</sup> **ANDREAS G. ANDREOU**<sup>1</sup>, **AND KIM STROHBEHN**<sup>2</sup> Analog Integrated Circuits and Signal Processing, 13, 211–222 (1997)





#### exploiting problem statistics!

…pose processor.) In a DEC-Alpha based general purpose computer it takes 10000 cycles to do a single pattern matching computation and thus it takes a total of 20 $\mu$s per classification. (Power dissipation is 30 W at 500 MHz and therefore the energy per classification is 600 $\mu$J.) The Pentium-Pro is worse, because it requires 30 W at 150 MHz and more than 10000 cycles for a single pattern matching. In contrast, the total current in the WAM is: (124×116×10) nA continuous bias current for the memory cells at 5 V. Computation time is approximately 70 $\mu$s for a total energy per classification in …
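The figures in the excerpt above follow from energy = power × time, with time = cycles / clock frequency. The sketch below redoes that arithmetic. The function and variable names are illustrative, and the computation time is taken to be 70 µs. The WAM value is derived here only from the stated bias current and supply voltage, so it estimates the bias-current contribution and is not necessarily the total that the truncated excerpt goes on to report.

```python
def energy_per_classification(power_w, cycles, clock_hz):
    """Energy in joules: power (W) times time (s), where time = cycles / clock."""
    return power_w * cycles / clock_hz

# DEC Alpha figures quoted in the excerpt: 30 W, 10000 cycles, 500 MHz.
alpha_time_s = 10_000 / 500e6
alpha_energy_j = energy_per_classification(30.0, 10_000, 500e6)

# WAM: (124 x 116 x 10) nA of continuous bias current at 5 V.
wam_current_a = 124 * 116 * 10 * 1e-9
wam_power_w = wam_current_a * 5.0
# Assumption: computation time of 70 microseconds.
wam_energy_j = wam_power_w * 70e-6

print(f"Alpha: {alpha_time_s * 1e6:.0f} us, {alpha_energy_j * 1e6:.0f} uJ")
print(f"WAM bias: {wam_power_w * 1e3:.3f} mW, {wam_energy_j * 1e9:.1f} nJ")
```

Under these assumptions the bias-current energy comes out in the tens of nanojoules, roughly four orders of magnitude below the 600 µJ quoted for the general-purpose processor.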



1. Memory and processing are integrated in a single structure; this is analogous to the synapse in biology.
2. The system has an internal model that is related to the problem to be solved (*prior* knowledge). This is the template set of patterns to be classified.
3. The system is capable of learning, i.e. templates can be changed to adapt to a different character set (a different problem). This is done at the expense of storage capacity (we use a RAM-based cell instead of a more compact ROM cell).
4. The system processes information in a parallel and hierarchical fashion in a variable-precision architecture, i.e. given the statistics of the problem, most of the computation is carried out with low-precision (three or four bit) analog hardware. Yet arbitrary-precision computation is possible through recursive processing that exploits a programmable WTA (the capability to mask specific bits in the winner-takes-all circuitry).
5. The system is fault tolerant and degrades gracefully. The same structures that are used in the *precision-on-demand* architecture can also be used to reconfigure the system for defects in the fabrication process. The components of the chip that are worse matched can be disabled during operation.

#### 2 µm CMOS technology, 5 V


#### Neural AI accelerators for Edge

A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Phil. Trans. R. Soc. A, vol. 378, no. 2164, p. 20190155, Feb. 2020.



SpiNNaker2 172 M4 FP cores DNN accelerators



C. Mayr, S. Hoeppner, and S. Furber, "SpiNNaker 2: http://arxiv.org/abs/1911.02385


[Figure: chip floorplan with blocks labeled Sensor Periph, System Mem., Cld Cores, Caches]

| Metric               | Value                   |
|----------------------|-------------------------|
| Clock Speed          | 100 MHz                 |
| Die Dimensions       | 7 mm × 4 mm             |
| Max Event Readout    | 17 Meps                 |
| CiM Cell Density     | $1.55\ M/mm^2$          |
| 1-bit MAC Throughput | $2.1 \times 10^{11}$ Op/s |
| 1-bit efficiency     | 45 fJ/Op                |
| 8-bit MAC Throughput | $3.3 \times 10^{9}$ Op/s  |
| 8-bit efficiency     | 1.47 pJ/Op              |
| Power Consumption    | 30 mW                   |

D. R. Mendat, A.G. Andreou *et al.*, "A RISC-V Neuromorphic Micro-Controller Unit (vMCU) with Event-Based Physical Interface and Computational Memory for Low-Latency Machine Perception and Intelligence at the Edge," in ISCAS-2023.



[Figure: chip block diagram with Test, Struct, and AUX CUs, UART/SPI interfaces, and NoC nodes]

| Specification                   | Value                                      |
|---------------------------------|--------------------------------------------|
| # CUs with CIM Arrays           | 77                                         |
| # CIM Cells per CIM Array       | 262K                                       |
| # Total 1-bit CIM Cells on Chip | 20.1 M                                     |
| CIM Cell Density                | $4.66\ M/mm^2$                             |
| # Auxiliary CUs                 | 9                                          |
| # RISC-V Processors             | SiFive S7 and SiFive S21                   |
| I/O Interfaces                  | HSI (DDR), GPIO, UART (x2), 4-bit SPI (x2) |
| HSI Data Rate                   | 25.6 Gbps (64-bit DDR @ 200 MHz)           |
| GPIO Data Rate                  | 0.64 Gbps (32-bit @ 100 MHz)               |
| CU Internal Clock Speed         | 100 MHz                                    |
| 1-bit MAC Throughput            | 6.7 TOPS                                   |
| Energy per 1-bit MAC Op         | 28.6 fJ/Op (8-bit output)                  |
| 8-bit MAC Throughput            | 104.6 GOPS                                 |
| Energy per 8-bit MAC Op         | 1.83 pJ/Op                                 |
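Several entries in the specification table can be cross-checked against each other. The sketch below is one such check, not a statement from the authors: it assumes "262K" means 2^18 cells and that DDR transfers data on both clock edges. The power figures it prints are derived here as throughput × energy per operation and do not appear in the table.

```python
cus_with_cim = 77
cells_per_array = 2**18  # assumption: "262K" read as 262,144
total_cells = cus_with_cim * cells_per_array

# HSI: 64-bit bus, 200 MHz clock, two transfers per cycle (DDR).
hsi_gbps = 64 * 200e6 * 2 / 1e9

# Derived (not tabulated): power implied at peak throughput.
power_1bit_w = 6.7e12 * 28.6e-15    # 1-bit MACs
power_8bit_w = 104.6e9 * 1.83e-12   # 8-bit MACs

print(f"total CIM cells: {total_cells / 1e6:.2f} M")
print(f"HSI data rate:   {hsi_gbps:.1f} Gbps")
print(f"implied power:   {power_1bit_w * 1e3:.1f} mW (1-bit), "
      f"{power_8bit_w * 1e3:.1f} mW (8-bit)")
```

The cell count lands close to the tabulated 20.1 M, the HSI rate matches 25.6 Gbps, and the two implied power values agree with each other to within a fraction of a milliwatt.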

~5 MiBit/sq. mm SRAM


J. L. Molin and A. G. Andreou et al., "Edge Super Compute (ESC) 2.0 A Low Power System-on-Chip AI Accelerator for the Edge," in *Proceedings GOMACTech-2024*, 20 March 2024.


Today: AI/ML in embodied systems at the EDGE and 3<sup>rd</sup> Wave of AI






### AI Makes Neuromorphic AI Chips (2023- )

Spiking Neural Networks (SNNs) designed with LLMs – ChatGPT



#### **Example 1: Spiking Neural Network**

[Figure: first layer neurons and second layer neurons; inputs spikes\_in[0], spikes\_in[1], spikes\_in[2]; outputs spikes\_out[0], spikes\_out[1], spikes\_out[2]; first layer weight, second layer weight]

Prompt to ChatGPT: "Hello, could you please provide me a Verilog module describing Spiking Neural Network"

Is it even possible?

**Topic Search Results from Github**

| Search Term          | Language | Number of Results |
|----------------------|----------|-------------------|
| "integrate and fire" | Verilog  | 3                 |
| "integrate and fire" | Any      | 750               |
| "snn"                | Verilog  | 26                |
| "snn"                | Any      | 2.4k              |
| "spiking neuron"     | Verilog  | 6                 |
| "spiking neuron"     | Any      | 478               |
| "spiking"            | Verilog  | 29                |
| "spiking"            | Any      | 8.7k              |
| "alu"                | Verilog  | 1.1k              |
| "accelerator"        | Verilog  | 339               |
| "cnn"                | Verilog  | 198               |
| "spi"                | Verilog  | 530               |




#### The Neuron: Prompt

#### Anonymous

Can you write a verilog module for a spiking leaky integrate and fire neuron? Please include a multibit current input and a refactory period.

#### ChatGPT

Sure, I'll provide a basic Verilog module for a spiking leaky integrate-and-fire neuron.

#### Here's a simple version:

#### This is a fairly simple model:

- When the neuron receives current, it increases its `membrane\_potential`.
- If the `membrane\_potential` reaches or exceeds a threshold (`THRESHOLD`), it sends out a spike (sets the `spike` signal to 1), and then resets its potential.
- After firing, the neuron goes into a refractory period during which it cannot fire. This is controlled by the `in\_refrac` flag and the `refrac\_counter`.
- The neuron also "leaks" current every cycle, decrementing the `membrane\_potential` by `LEAK\_RATE`.

You may need to tweak the parameters and values depending on your actual requirements. This is a high-level abstract representation, and real biological neurons have a lot more complexity and variability than what's represented here.
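The behaviour described in the bullets above can be sketched as a small Python reference model, for example to compare against a Verilog simulation trace. This is a minimal sketch, not the authors' model: the class and method names are illustrative, the defaults are placeholder values, and two choices are assumptions made here (input is ignored during the refractory period, and the potential is clamped at zero).

```python
class LIFNeuron:
    """Leaky integrate-and-fire neuron with a refractory period (behavioural sketch)."""

    def __init__(self, threshold=255, leak_rate=1, refrac_period=32):
        self.threshold = threshold
        self.leak_rate = leak_rate
        self.refrac_period = refrac_period
        self.membrane_potential = 0
        self.refrac_counter = 0

    def step(self, current):
        """Advance one clock cycle; return True if the neuron spikes."""
        if self.refrac_counter > 0:
            # Refractory: count down and ignore the input (assumption).
            self.refrac_counter -= 1
            return False
        v = self.membrane_potential + current - self.leak_rate
        if v >= self.threshold:
            # Fire, reset the potential, and enter the refractory period.
            self.membrane_potential = 0
            self.refrac_counter = self.refrac_period
            return True
        # Clamp at zero so the leak cannot drive the potential negative.
        self.membrane_potential = max(v, 0)
        return False


neuron = LIFNeuron(threshold=10, leak_rate=1, refrac_period=3)
spikes = [neuron.step(4) for _ in range(10)]
print(spikes)
```

With a constant input of 4 the potential rises by 3 per cycle, so the neuron fires on the fourth cycle, stays silent for three refractory cycles, and then starts integrating again.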

```verilog
module leaky_integrate_fire_neuron (
    input clk,            // clock input
    input reset,          // asynchronous reset
    input [7:0] current,  // 8-bit current input
    output spike          // output spike signal
);

    // Parameters
    parameter THRESHOLD = 8'd255;     // spiking threshold
    parameter LEAK_RATE = 8'd1;       // decrease by 1 per cycle
    parameter REFRAC_PERIOD = 32;     // refractory period in cycles

    // Internal states
    reg [7:0] membrane_potential;     // current value of the membrane potential
    reg [5:0] refrac_counter = 6'd0;  // refractory period counter
    reg in_refrac = 0;                // refractory period flag

    // On every clock cycle
    always @(posedge clk or posedge reset) begin
        if (reset) begin
            membrane_potential <= 8'd0;
            refrac_counter <= 6'd0;
            in_refrac <= 0;
        end else begin
            if (in_refrac) begin
                refrac_counter <= refrac_counter - 1'b1;
                if (refrac_counter == 6'd0) in_refrac <= 0;
            end else begin
                // update membrane potential based on current input and leak rate
                if (membrane_potential + current > THRESHOLD) begin
                    membrane_potential <= 8'd0;
                    spike <= 1;
                    in_refrac <= 1;
                    refrac_counter <= REFRAC_PERIOD;
                end else if (membrane_potential + current - LEAK_RATE < 8'd0) begin
                    membrane_potential <= 8'd0;
                end else begin
                    membrane_potential <= membrane_potential + current - LEAK_RATE;
                end
            end
        end
    end

    // Reset spike signal after one cycle
    always @(posedge clk or posedge reset) begin
        if (reset) spike <= 0;
        else if (spike) spike <= 0;
    end

endmodule
```
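One way to see why the arithmetic in the listing above needs attention is to emulate it in Python. The sketch assumes standard Verilog sizing, in which a comparison is evaluated at the width of its widest operand (8 bits here), so sums wrap modulo 256. The function names are hypothetical. `update_widened` uses Python's unbounded integers as a stand-in for a wider datapath and is not the fix arrived at in the talk.

```python
MASK8 = 0xFF  # 8-bit unsigned registers


def update_as_written(v, current, leak=1, threshold=255):
    """Non-refractory branch with every intermediate result truncated to 8 bits."""
    if ((v + current) & MASK8) > threshold:  # cannot be true when threshold == 255
        return 0, True
    diff = (v + current - leak) & MASK8      # wraps around instead of going negative
    if diff < 0:                             # never true for an unsigned value
        return 0, False
    return diff, False


def update_widened(v, current, leak=1, threshold=255):
    """Same update with enough bits to hold the sum and the sign."""
    total = v + current
    if total > threshold:
        return 0, True
    return max(total - leak, 0), False


print(update_as_written(0, 0))      # (255, False): the leak underflows to 255
print(update_as_written(200, 100))  # (43, False): the sum overflows, no spike
print(update_widened(0, 0))         # (0, False)
print(update_widened(200, 100))     # (0, True)
```

In the 8-bit emulation an idle neuron jumps to 255 on the first leak step, and a sum above the threshold wraps to a small value without producing a spike.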



#### Neuron Verilog Code





# Neuron Verilog Code - Analysis






# Fixing the errors using Natural Language (I)





# Fixing the errors using Natural Language (II)







# Fixing the errors using Natural Language (III)




# Arithmetic Problem - Underflow Issue (I)







#### **Previous Output**

New Output

# Arithmetic Problem - Underflow Issue (II)






### Arithmetic Problem - Underflow Issue (III)







#### Arithmetic Problem - Underflow Issue Solved!




### Making the Neuron Programmable





# Building the rest of the system





# Physical Layout and submission to fabrication



- Fully LLM-written, SPI-programmable Spiking Neuron Array
- Generated Verilog verified with functional simulation
- Design was submitted on Tiny Tapeout 5 using an open source RTL → GDS flow
- 33% density in a 320 µm × 200 µm tile
- All code, scripts, and ChatGPT transcripts are available on GitHub
- One of the first fully AI-described ASICs
- Not an easy process, but a fully Natural Language → Verilog flow is possible with extensive manual (human) verification



Layout for LLM Spiking Neuron Array

Spiking Neural Network chip: Telluride Workshop on Neuromorphic Engineering, July 2023; ChatGPT 4; Efabless Tiny Tapeout 5, 4 November 2023; chips and boards back June 2024 (now)



#### Recurrent Spiking Neural Network chip: Andreou Lab, 2024; ChatGPT 4, April-May 2024; Efabless Tiny Tapeout 6, 15 May 2024

# **Recurrent Neural Network Results**

- Fully LLM-written, SPI-programmable Recurrent Spiking Neuron Array
- Test benches generated by ChatGPT
- Generated Verilog verified with functional simulation and mapping onto an FPGA
- Network parameters trained and tested for the XOR and IRIS classification tasks
- Chip design submitted on Efabless Tiny Tapeout 6 using an open source RTL → GDS flow
- All code, scripts, and ChatGPT transcripts are available on GitHub



| Module Name    | Chats (#) | Iterations (#) | Line count | Improvements |
|----------------|-----------|----------------|------------|--------------|
| LIF Neuron     | 2         | 38 (24+14)     | 33         | Overflow/underflow management (bit-width adjustments + sign extension)<br>Verilog syntax + best practices<br>Adding proper inputs |
| RLIF Neuron    | 1         | 6              | 19         | Overflow/underflow management (bit-width adjustments + sign extension) |
| RLIF Layer     | 2         | 17 (10+7)      | 63         | Clarification of requirements (format of input data, parameter sharing among neurons, behaviour of the module)<br>Verilog syntax<br>Overflow/underflow management |
| RSNN           | 1         | 8              | 37         | Clarification on module behaviour<br>Verilog best practices |
| FIPO Memory    | 1         | 12             | 21         | Clarification on module behaviour<br>Adding control input signals |
| RegN           | 1         | 5              | 6          | Module parameterizability<br>Adding control input signals<br>Change in reset behaviour |
| Control Memory | 1         | 9              | 55         | Clarification on model behaviour<br>Verilog syntax |
| Top Module     | 1         | 22             | 103        | Clarification on model behaviour<br>Refining connections |
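A quick tally of the table gives a sense of the overall prompting effort. The per-module numbers are copied from the table; the totals are derived here and are not reported on the slide.

```python
# (iterations, lines of Verilog) per module, copied from the table above
modules = {
    "LIF Neuron": (38, 33),
    "RLIF Neuron": (6, 19),
    "RLIF Layer": (17, 63),
    "RSNN": (8, 37),
    "FIPO Memory": (12, 21),
    "RegN": (5, 6),
    "Control Memory": (9, 55),
    "Top Module": (22, 103),
}

total_iterations = sum(it for it, _ in modules.values())
total_lines = sum(ln for _, ln in modules.values())
most_iterated = max(modules, key=lambda name: modules[name][0])

print(total_iterations, total_lines, most_iterated)  # 117 337 LIF Neuron
```

By this count the smallest arithmetic-heavy block, the LIF neuron, took the most iterations, which fits the overflow/underflow fixes listed for it.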

### Bibliography

#### 2023: The year of LLM based EDA



Kaiyan Chang, Ying Wang, Haimeng Ren, Mengdi Wang, Shengwen Liang, Yinhe Han, Huawei Li, and Xiaowei Li. 2023. ChipGPT: How far are we from natural language hardware design. http://arxiv.org/abs/2305.14019 arXiv:2305.14019 [cs]

Blocklove, Jason, Siddharth Garg, Ramesh Karri, and Hammond Pearce. "Chip-Chat: Challenges and Opportunities in Conversational Hardware Design." In *2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD)*, 1–6, 2023. https://doi.org/10.1109/MLCAD58807.2023.10299874.

Meech, James T. "Leveraging High-Level Synthesis and Large Language Models to Generate, Simulate, and Deploy a Uniform Random Number Generator Hardware Design." *arXiv:2311.03489* [*Cs*], November 2023. <u>http://arxiv.org/abs/2311.03489</u>.

Thakur, Shailja, Jason Blocklove, Hammond Pearce, Benjamin Tan, Siddharth Garg, and Ramesh Karri. "AutoChip: Automating HDL Generation Using LLM Feedback." arXiv, November 8, 2023. <u>http://arxiv.org/abs/2311.04887</u>.

Tomlinson, Michael A, Joe Li, and Andreas G. Andreou. "Designing Silicon Brains Using LLMs: Leveraging ChatGPT for Automated Description of a Spiking Neuron Array." In Proceedings 2024 Argentine Conference on Electronics, 154–59. IEEE, March 2024.

Vitolo, Paola, George Psaltakis, Michael Tomlinson, Gian Domenico Licciardo, and Andreas G. Andreou. "Natural Language to Verilog: Design of a Recurrent Spiking Neural Network Using Large Language Models and ChatGPT." arXiv:2405.01419, May 2, 2024. http://arxiv.org/abs/2405.01419.



### Generative AI acceleration: Neuromorphic Computational Memory Systems for Accelerating LLMs

The role of memory architecture in AI and the challenge of a 10 GiBits per square mm computational memory for Generative AI



#### 2.5D and 3D Integration impact in AI Systems






https://semiengineering.com/knowledge_centers/memory/volatile-memory/dynamic-random-access-memory/high-bandwidth-memory/

Cassidy, Andrew S., and Andreas G. Andreou. "Beyond Amdahl's Law: An Objective Function That Links Multiprocessor Performance Gains to Delay and Energy." *IEEE Transactions on Computers* 61, no. 8 (August 2012): 1110–26. <u>https://doi.org/10.1109/TC.2011.169</u>.



# 3D Memory in 2.5D GPU design **AMD Fiji**



https://www.techpowerup.com/gpu-specs/amd-fiji.g774#

#### First Commercial Product: June 2015 AMD Radeon R9 Fury X

#### April 2016: Nvidia Tesla P100; June 2016: Intel Xeon Phi

Jensen Huang: "HBM is a technology miracle ..."

Sanni, Kayode A., and Andreas G. Andreou. "A Historical Perspective on Hardware AI Inference, Charge-Based Computational Circuits and an 8bit Charge-Based Multiply-Add Core in 16nm FinFET CMOS." *IEEE Journal on Emerging and Selected Topics in Circuits and Systems* 9, no. 3 (September 2019): 532–43. <u>https://doi.org/10.1109/JETCAS.2019.2933795</u>.



#### Why 3D?



M. A. Marwick and A. G. Andreou, "Retinomorphic system design in three dimensional SOI-CMOS," 2006 IEEE International Symposium on Circuits and Systems.

A. S. Cassidy and A. G. Andreou, "Beyond Amdahl's Law: an objective function that links multiprocessor performance gains to delay and energy," IEEE Trans. Comput., vol. 61, no. 8, pp. 1110–1126, Aug. 2012.



### Concluding Remarks

- Results from "vanilla" LLMs (OpenAI ChatGPT 4) for natural-language-to-Verilog generation are promising
- A hierarchical and modular conception of the design by a human is necessary
- Reduce design time
- Reduce time for documentation of the design
- "Aligning" LLMs for HDL generation
- Opportunity: LLMs for end-to-end A2C (Application to Chip), for example the Efabless KWS challenge
- How about LLM-assisted analog design?

#### **Telluride Neuromorphic Engineering Workshop**







The core of Telluride is broken into separate **Topic Areas**. Each Topic Area is guided by a group of experts who will provide tutorials, lectures, and hands-on project ideas. Participants should expect to be exposed to all of the Topic Areas, but will generally focus on one or two to work on during the event; see <u>past workshops</u> for examples.

There is also a series of (mostly) neuroscience talks during the 2nd week; see below.

- <u>NIC24</u>: Neuromorphic integrated circuits
- <u>CNS24</u>: Computational Neuroscience Talks
- <u>AUD24</u>: Understanding the auditory brain with neural networks
- <u>L&T24</u>: Language and thought
- <u>SPA24</u>: Neuromorphic systems for space applications



#### 30th year anniversary: Join for 3 weeks of fun @ 9000 feet

#### Variations of our team picture generated by Dall-E





https://andreoulab.net