Personal archive of coding experiments, research threads, and draft ideas.
My research interests revolve around compilers, high-performance computing, and deep learning systems.
Specifically, I am interested in intermediate representations, JIT compilation, and code generation for deep
learning and HPC workloads targeting CPUs, GPUs, and custom accelerators such as FPGAs and TPUs.
Having spent considerable time manually optimizing SIMD kernels and memory layouts, I have developed
a strong appreciation for both the performance gains achievable through hardware-aware optimization and
the difficulty of scaling such efforts across architectures.
My experience has made clear both the potential and limits of manual optimization. I can hand-write kernels
that utilize hardware effectively, but doing so for every operator, data type, and architecture is not scalable.
I am interested in how functional IRs and rewrite systems can be applied to deep learning compilation, how machine
learning can guide search through optimization spaces, and how memory-aware IR design enables efficient
accelerator code generation.
I am developing cppDL, a general-purpose deep learning inference library written entirely in C++ with no external ML dependencies.
The system includes a custom memory allocator, a safetensors parser, a BPE tokenizer, and high-performance implementations of core neural-network and tensor operations.
I validated the library by running GPT-2 inference:
cppDL achieves approximately 35 tok/s compared to PyTorch's 19 tok/s on the same hardware, roughly an 86% improvement in end-to-end throughput without KV caching on the CPU.
With KV caching enabled, cppDL achieves over 110 tok/s compared to PyTorch's 60 tok/s on the CPU.
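The caching idea can be sketched as a per-layer, append-only buffer of key/value vectors. This is an illustrative sketch, not cppDL's actual API: the struct name, layout, and `scores` helper are placeholders chosen for clarity.

```cpp
#include <cstddef>
#include <vector>

// Minimal KV-cache sketch for one attention head: each decode step
// appends the new token's key/value vectors instead of recomputing
// projections for the whole sequence.
struct KVCache {
    std::size_t head_dim;
    std::vector<float> keys;    // [seq_len x head_dim], grows one row per step
    std::vector<float> values;  // same layout as keys

    explicit KVCache(std::size_t d) : head_dim(d) {}

    std::size_t seq_len() const { return keys.size() / head_dim; }

    // O(head_dim) append per decode step, versus O(seq_len * head_dim)
    // to recompute every past key/value from scratch.
    void append(const float* k, const float* v) {
        keys.insert(keys.end(), k, k + head_dim);
        values.insert(values.end(), v, v + head_dim);
    }

    // Unnormalized attention scores of a new query against all cached keys.
    std::vector<float> scores(const float* q) const {
        std::vector<float> s(seq_len(), 0.0f);
        for (std::size_t t = 0; t < seq_len(); ++t)
            for (std::size_t d = 0; d < head_dim; ++d)
                s[t] += q[d] * keys[t * head_dim + d];
        return s;
    }
};
```

Reusing cached keys and values turns each decode step's cost from quadratic to linear in sequence length, which is where the bulk of the cached-path speedup comes from.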
This advantage does not come from faster matrix multiplication. PyTorch uses highly optimized BLAS
backends that exceed my implementation in raw GEMM throughput. The gains come from system-level
optimizations: static memory arenas eliminating allocation overhead, cache-friendly data layouts reducing
memory traffic, and a simplified execution path avoiding dispatch overhead. This highlighted an important
insight: end-to-end performance depends on the entire system, and there is significant room for compiler-driven optimization at the graph level.
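The static-arena idea behind the allocation savings can be sketched as a bump-pointer allocator: one upfront buffer, O(1) "allocation", and a whole-arena reset between inference steps. This is a minimal sketch, not cppDL's actual allocator; the class name and 64-byte default alignment are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump-pointer arena: a single heap allocation up front, then
// allocation is just an aligned pointer bump. No per-tensor free,
// no heap traffic on the hot path.
class Arena {
public:
    explicit Arena(std::size_t capacity) : buf_(capacity), offset_(0) {}

    // Returns a pointer aligned for SIMD loads (64 bytes = one cache line),
    // or nullptr if the arena is exhausted.
    void* alloc(std::size_t bytes, std::size_t align = 64) {
        std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buf_.data());
        std::uintptr_t p =
            (base + offset_ + align - 1) & ~static_cast<std::uintptr_t>(align - 1);
        std::size_t new_off = static_cast<std::size_t>(p - base) + bytes;
        if (new_off > buf_.size()) return nullptr;
        offset_ = new_off;
        return reinterpret_cast<void*>(p);
    }

    // Recycle the whole buffer at once, e.g. between decode steps.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::vector<std::uint8_t> buf_;
    std::size_t offset_;
};
```

Because tensor lifetimes in a fixed inference graph are known ahead of time, a reset-per-step arena like this replaces thousands of malloc/free pairs with a single pointer assignment.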
The backend for cppDL is CRUSHBLAS, a custom linear algebra library using AVX2 and AVX-512 intrinsics.
I implemented GEMM kernels with register blocking, loop tiling, and cache-aware memory access
patterns. Benchmarked on large square matrices (4096, 8192, and 16384), my kernels sustain
approximately 825 GFLOP/s with OpenMP parallelization. I am currently extending this
work to GPUs, writing custom drivers for direct kernel dispatch to understand the execution model below
the CUDA/ROCm abstraction layer. Building these systems has taught me the micro-architectural details
compilers must reason about: port contention, latency hiding, cache utilization, and prefetching behavior. I
see this manual optimization experience as preparation for researching how to automate such decisions.
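The loop-tiling structure described above can be illustrated with a scalar sketch. The tile sizes here are placeholders rather than CRUSHBLAS's tuned values, and the AVX2/AVX-512 intrinsics and register blocking of the real kernels are omitted; the point is the cache-blocked loop nest itself.

```cpp
#include <algorithm>
#include <cstddef>

// Cache-blocked GEMM sketch: C += A * B, all matrices row-major.
// Tiling over (i, k, j) keeps a TM x TK panel of A and a TK x TN
// panel of B resident in cache while the inner loops stream through them.
void gemm_tiled(const float* A, const float* B, float* C,
                std::size_t M, std::size_t N, std::size_t K) {
    const std::size_t TM = 64, TN = 64, TK = 64;  // illustrative tile sizes
    for (std::size_t i0 = 0; i0 < M; i0 += TM)
        for (std::size_t k0 = 0; k0 < K; k0 += TK)
            for (std::size_t j0 = 0; j0 < N; j0 += TN)
                for (std::size_t i = i0; i < std::min(i0 + TM, M); ++i)
                    for (std::size_t k = k0; k < std::min(k0 + TK, K); ++k) {
                        const float a = A[i * K + k];  // reused across the j loop
                        // Innermost loop walks B and C contiguously,
                        // the access pattern a vectorizer (or intrinsics) wants.
                        for (std::size_t j = j0; j < std::min(j0 + TN, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

In the real kernels this inner loop becomes a register-blocked micro-kernel of FMA intrinsics, but the surrounding tiling logic, and the cache-residency reasoning behind it, is the same.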