Show HN: Lightweight Llama3 Inference Engine – CUDA C

Hey, I recently took inspiration from llama.cpp, ollama, and similar tools that run LLMs locally, and I just finished building an inference engine for the Llama 3 8B model in CUDA C.

I wanted to explore my newfound interest in CUDA programming alongside my passion for machine learning. The project uses only the native CUDA runtime API and cuda_fp16. Inference runs in fp16, so it requires around 17-18GB of VRAM (~16GB for the model parameters, plus some more for intermediate caches).
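To make the VRAM figure concrete, here is a back-of-the-envelope sketch. The shape constants are the published Llama 3 8B values. The cache layout (one fp16 K and one fp16 V entry per layer, KV head, and token) and the function names are assumptions for illustration, not taken from the project.

```cuda
#include <stdint.h>

// Published Llama 3 8B shape values.
#define N_PARAMS    8030000000ULL  // ~8.03B parameters
#define N_LAYERS    32
#define N_KV_HEADS  8
#define HEAD_DIM    128
#define BYTES_FP16  2

// Bytes needed to hold every weight in fp16.
static uint64_t weight_bytes(void) {
    return N_PARAMS * BYTES_FP16;
}

// Bytes for a K cache plus a V cache covering `ctx_len` tokens.
// Assumed layout: one fp16 value per (layer, kv head, head dim, token).
static uint64_t kv_cache_bytes(uint64_t ctx_len) {
    return 2ULL * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16 * ctx_len;
}
```

With a hypothetical context length of 8192 tokens, this works out to about 16.06 GB of weights plus about 1.07 GB of KV cache, which lands in the 17-18GB range once activations and other buffers are included.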

It doesn’t use cuBLAS or any similar libraries, since I wanted to work with as little abstraction as possible. As a result, it isn’t as optimized as a cuBLAS-based implementation or the inference engines that inspired the project.

## *A brief overview of the implementation*

The engine is written in CUDA C. It reads the model’s .safetensors file, which you can pull from Hugging Face. The kernels for normalization, skip connections, RoPE, and the activation function (SiLU) are fairly straightforward.
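For readers who have not written kernels like these, here is a minimal sketch of what an fp16 RMSNorm kernel and a SiLU kernel can look like using only the CUDA runtime and cuda_fp16. It is an illustration and not the project's actual code: the names `rmsnorm_fp16`, `silu_fp16`, and `rmsnorm_host` are hypothetical, and the host helper uses managed memory purely to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

#define RMS_THREADS 256  // threads per block; must be a power of two

// One block normalizes one row of length `dim`:
//   out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
// Must be launched with exactly RMS_THREADS threads per block.
__global__ void rmsnorm_fp16(__half *out, const __half *x,
                             const __half *weight, int dim, float eps) {
    __shared__ float partial[RMS_THREADS];
    const __half *row = x + (size_t)blockIdx.x * dim;
    __half *out_row = out + (size_t)blockIdx.x * dim;

    // Each thread accumulates a strided partial sum of squares in fp32.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < dim; i += blockDim.x) {
        float v = __half2float(row[i]);
        sum += v * v;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory; the total ends up in partial[0].
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    float inv_rms = rsqrtf(partial[0] / dim + eps);

    for (int i = threadIdx.x; i < dim; i += blockDim.x) {
        float v = __half2float(row[i]) * inv_rms * __half2float(weight[i]);
        out_row[i] = __float2half(v);
    }
}

// Elementwise SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)).
__global__ void silu_fp16(__half *out, const __half *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __half2float(x[i]);
        out[i] = __float2half(v / (1.0f + expf(-v)));
    }
}

// Hypothetical host helper: runs rmsnorm_fp16 on a single row of floats.
void rmsnorm_host(float *out, const float *x, const float *w,
                  int dim, float eps) {
    __half *d_x, *d_w, *d_out;
    size_t bytes = (size_t)dim * sizeof(__half);
    cudaMallocManaged(&d_x, bytes);
    cudaMallocManaged(&d_w, bytes);
    cudaMallocManaged(&d_out, bytes);
    for (int i = 0; i < dim; i++) {
        d_x[i] = __float2half(x[i]);
        d_w[i] = __float2half(w[i]);
    }
    rmsnorm_fp16<<<1, RMS_THREADS>>>(d_out, d_x, d_w, dim, eps);
    cudaDeviceSynchronize();
    for (int i = 0; i < dim; i++) out[i] = __half2float(d_out[i]);
    cudaFree(d_x);
    cudaFree(d_w);
    cudaFree(d_out);
}
```

Accumulating the sum of squares in fp32 and only converting back to fp16 on the final store is a common way to limit rounding error when the inputs are half precision.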

For GEMM, I got as far as implementing tiled matrix multiplication with vectorized loads in each thread. The GEMM kernel is also written so that the second matrix does not need to be pre-transposed, while still achieving coalesced global memory access.
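The general idea can be sketched as follows. This is a simplified illustration and not the project's kernel: it uses fp32 and `float4` loads instead of fp16, assumes M, N, and K are multiples of the tile size, and the names `gemm_tiled_vec4` and `gemm_max_abs_error` are hypothetical. B is read row by row in its original row-major layout, so threads adjacent in x fetch adjacent 16-byte chunks and no transpose is needed.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define TILE 32  // tile edge; M, N and K are assumed to be multiples of TILE

// C[M x N] = A[M x K] * B[K x N], all row-major, B not transposed.
// Launch with block = (TILE / 4, TILE) and grid = (N / TILE, M / TILE).
// Each thread loads one float4 per tile and computes 4 adjacent outputs.
__global__ void gemm_tiled_vec4(float *C, const float *A, const float *B,
                                int N, int K) {
    __shared__ __align__(16) float As[TILE][TILE];
    __shared__ __align__(16) float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;      // output row for this thread
    int col = blockIdx.x * TILE + 4 * tx;  // first of 4 output columns
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Vectorized, coalesced loads: threads adjacent in x read
        // adjacent 16-byte chunks of the same row of A and of B.
        *(float4 *)&As[ty][4 * tx] =
            *(const float4 *)&A[(size_t)row * K + k0 + 4 * tx];
        *(float4 *)&Bs[ty][4 * tx] =
            *(const float4 *)&B[(size_t)(k0 + ty) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; k++) {
            float a = As[ty][k];
            float4 b = *(const float4 *)&Bs[k][4 * tx];
            acc[0] += a * b.x;
            acc[1] += a * b.y;
            acc[2] += a * b.z;
            acc[3] += a * b.w;
        }
        __syncthreads();  // finish with this tile before overwriting it
    }
    *(float4 *)&C[(size_t)row * N + col] =
        make_float4(acc[0], acc[1], acc[2], acc[3]);
}

// Hypothetical host helper: runs the kernel on small integer-valued inputs
// and returns the largest absolute difference from a CPU reference.
float gemm_max_abs_error(int M, int N, int K) {
    float *A, *B, *C;
    cudaMallocManaged(&A, (size_t)M * K * sizeof(float));
    cudaMallocManaged(&B, (size_t)K * N * sizeof(float));
    cudaMallocManaged(&C, (size_t)M * N * sizeof(float));
    for (int i = 0; i < M * K; i++) A[i] = (float)(i % 7) - 3.0f;
    for (int i = 0; i < K * N; i++) B[i] = (float)(i % 5) - 2.0f;

    dim3 block(TILE / 4, TILE);
    dim3 grid(N / TILE, M / TILE);
    gemm_tiled_vec4<<<grid, block>>>(C, A, B, N, K);
    cudaDeviceSynchronize();

    float max_err = 0.0f;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float ref = 0.0f;
            for (int k = 0; k < K; k++) ref += A[i * K + k] * B[k * N + j];
            max_err = fmaxf(max_err, fabsf(ref - C[i * N + j]));
        }
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return max_err;
}
```

The sketch favors readability over speed: it does nothing about shared-memory bank conflicts, and a tuned kernel would typically also keep a small block of outputs per thread in registers.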

Feel free to have a look at the project repo and try it out if you’re interested. If you like what you see, feel free to star the repo too!

I’d really appreciate any feedback, positive or critical.

github.com

12 points by abhisheknair10 6 days ago | 0 comments