At NVIDIA, I serve as a Tech Lead for AI GPU performance engineering. Key responsibilities include:
- Lead CUDA, Triton, CUTLASS, CuTe, cuTile, and cuDNN kernel optimization for Hopper-, Blackwell-, and Rubin-class GPUs.
- Improve AI and LLM training/inference performance by eliminating bottlenecks, fusing kernels, and optimizing memory-bandwidth utilization.
- Enable PyTorch and JAX integration through XLA, MLIR, TorchInductor, and torch.compile.
- Use Nsight Compute and Nsight Systems for profiling and performance analysis.
- Work on fused GEMMs, attention kernels, distributed training, and inference optimization.
- Drive co-design feedback for future GPU software and hardware.
At AMD, I worked as a GPU performance engineer and technical lead for AI workloads. Key responsibilities included:
- Optimized training and inference on the MI300- and MI200-series platforms.
- Improved throughput, memory efficiency, and scaling for transformer workloads.
- Collaborated with framework, architecture, and distributed systems teams.
- Contributed to Composable Kernel and ROCm-based AI libraries.
- Optimized attention, GEMM, batching, KV-cache, and multi-GPU communication.
- Helped establish and lead the AMD Center of Excellence in AI at UW Seattle.
"Nothing in life is to be feared, it is only to be understood." - Marie Curie