Personal archive of coding experiments, research threads, and draft ideas.
My research interests revolve around compilers, high-performance computing, and deep learning systems.
Specifically, I am interested in intermediate representations, JIT compilation, and code generation for deep
learning and HPC workloads targeting CPUs, GPUs, and custom accelerators such as FPGAs and TPUs.
Having spent considerable time manually optimizing SIMD kernels and memory layouts, I have developed
a strong appreciation for both the performance gains achievable through hardware-aware optimization and
the difficulty of scaling such efforts across architectures.
My experience has made clear both the potential and limits of manual optimization. I can hand-write kernels
that utilize hardware effectively, but doing so for every operator, data type, and architecture is not scalable.
I am interested in how functional IRs and rewrite systems can be applied to deep learning compilation, how machine
learning can guide search through optimization spaces, and how memory-aware IR design enables efficient
accelerator code generation.
I am developing cppDL, a general-purpose deep learning inference library written entirely in C++ with no external ML dependencies.
The system includes a custom memory allocator, a safetensors parser, a BPE tokenizer, and high-performance implementations of core neural-network and tensor operations.
I validated the library by running GPT-2 inference:
cppDL achieves approximately 35 tok/s compared to PyTorch's 19 tok/s on the same hardware, roughly an 86% improvement in end-to-end throughput without KV caching on the CPU.
With KV caching enabled, cppDL achieves over 110 tok/s compared to PyTorch's 60 tok/s on the CPU.
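The caching idea can be sketched as a per-layer, append-only buffer of key/value vectors. This is an illustrative sketch, not cppDL's actual API: the struct name, layout, and `scores` helper are placeholders chosen for clarity.

```cpp
#include <cstddef>
#include <vector>

// Minimal KV-cache sketch for one attention head: each decode step
// appends the new token's key/value vectors instead of recomputing
// projections for the whole sequence.
struct KVCache {
    std::size_t head_dim;
    std::vector<float> keys;    // [seq_len x head_dim], grows one row per step
    std::vector<float> values;  // same layout as keys

    explicit KVCache(std::size_t d) : head_dim(d) {}

    std::size_t seq_len() const { return keys.size() / head_dim; }

    // O(head_dim) append per decode step, versus O(seq_len * head_dim)
    // to recompute every past key/value from scratch.
    void append(const float* k, const float* v) {
        keys.insert(keys.end(), k, k + head_dim);
        values.insert(values.end(), v, v + head_dim);
    }

    // Unnormalized attention scores of a new query against all cached keys.
    std::vector<float> scores(const float* q) const {
        std::vector<float> s(seq_len(), 0.0f);
        for (std::size_t t = 0; t < seq_len(); ++t)
            for (std::size_t d = 0; d < head_dim; ++d)
                s[t] += q[d] * keys[t * head_dim + d];
        return s;
    }
};
```

Reusing cached keys and values turns each decode step's cost from quadratic to linear in sequence length, which is where the bulk of the cached-path speedup comes from.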
This advantage does not come from faster matrix multiplication. PyTorch uses highly optimized BLAS
backends that exceed my implementation in raw GEMM throughput. The gains come from system-level
optimizations: static memory arenas eliminating allocation overhead, cache-friendly data layouts reducing
memory traffic, and a simplified execution path avoiding dispatch overhead. This highlighted an important
insight: end-to-end performance depends on the entire system, and there is significant room for compiler-driven optimization at the graph level.
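The static-arena idea behind the allocation savings can be sketched as a bump-pointer allocator: one upfront buffer, O(1) "allocation", and a whole-arena reset between inference steps. This is a minimal sketch, not cppDL's actual allocator; the class name and 64-byte default alignment are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump-pointer arena: a single heap allocation up front, then
// allocation is just an aligned pointer bump. No per-tensor free,
// no heap traffic on the hot path.
class Arena {
public:
    explicit Arena(std::size_t capacity) : buf_(capacity), offset_(0) {}

    // Returns a pointer aligned for SIMD loads (64 bytes = one cache line),
    // or nullptr if the arena is exhausted.
    void* alloc(std::size_t bytes, std::size_t align = 64) {
        std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buf_.data());
        std::uintptr_t p =
            (base + offset_ + align - 1) & ~static_cast<std::uintptr_t>(align - 1);
        std::size_t new_off = static_cast<std::size_t>(p - base) + bytes;
        if (new_off > buf_.size()) return nullptr;
        offset_ = new_off;
        return reinterpret_cast<void*>(p);
    }

    // Recycle the whole buffer at once, e.g. between decode steps.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::vector<std::uint8_t> buf_;
    std::size_t offset_;
};
```

Because tensor lifetimes in a fixed inference graph are known ahead of time, a reset-per-step arena like this replaces thousands of malloc/free pairs with a single pointer assignment.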
The backend for cppDL is CRUSHBLAS, a custom linear algebra library using AVX2 and AVX-512 intrinsics.
I implemented GEMM kernels with register blocking, loop tiling, and cache-aware memory access
patterns. Benchmarked on large square matrices (4096, 8192, and 16384), my kernels sustain
approximately 825 GFLOP/s with OpenMP parallelization. I am currently extending this
work to GPUs, writing custom drivers for direct kernel dispatch to understand the execution model below
the CUDA/ROCm abstraction layer. Building these systems has taught me the micro-architectural details
compilers must reason about: port contention, latency hiding, cache utilization, and prefetching behavior. I
see this manual optimization experience as preparation for researching how to automate such decisions.
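The loop-tiling structure described above can be illustrated with a scalar sketch. The tile sizes here are placeholders rather than CRUSHBLAS's tuned values, and the AVX2/AVX-512 intrinsics and register blocking of the real kernels are omitted; the point is the cache-blocked loop nest itself.

```cpp
#include <algorithm>
#include <cstddef>

// Cache-blocked GEMM sketch: C += A * B, all matrices row-major.
// Tiling over (i, k, j) keeps a TM x TK panel of A and a TK x TN
// panel of B resident in cache while the inner loops stream through them.
void gemm_tiled(const float* A, const float* B, float* C,
                std::size_t M, std::size_t N, std::size_t K) {
    const std::size_t TM = 64, TN = 64, TK = 64;  // illustrative tile sizes
    for (std::size_t i0 = 0; i0 < M; i0 += TM)
        for (std::size_t k0 = 0; k0 < K; k0 += TK)
            for (std::size_t j0 = 0; j0 < N; j0 += TN)
                for (std::size_t i = i0; i < std::min(i0 + TM, M); ++i)
                    for (std::size_t k = k0; k < std::min(k0 + TK, K); ++k) {
                        const float a = A[i * K + k];  // reused across the j loop
                        // Innermost loop walks B and C contiguously,
                        // the access pattern a vectorizer (or intrinsics) wants.
                        for (std::size_t j = j0; j < std::min(j0 + TN, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

In the real kernels this inner loop becomes a register-blocked micro-kernel of FMA intrinsics, but the surrounding tiling logic, and the cache-residency reasoning behind it, is the same.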