Dr. Fu Li (Leo)
AI Infrastructure Architect
Summary
AI Infrastructure Architect with 15+ years of experience across high-performance computing, distributed systems, advanced networking, and large-scale AI infrastructure.
Focused on large-scale AI systems across GPU clusters, RDMA fabrics, NVLink environments, and cloud superclusters, with hands-on experience in CUDA, MPI, FPGA, PCIe, and CXL-based architectures.
Currently develops Guided AI Engineering practices that combine progressive context and goal setting, harness engineering, and deterministic automated workflows for real-world engineering problems.
Core Skills
- AI infrastructure: GPU clusters, superclusters, heterogeneous computing, performance benchmarking
- GPU computing: CUDA, MPI, vLLM, llama.cpp, MoE inference, KV cache optimization
- Interconnects: RDMA, NVLink, PCIe, CXL, UCIe, FPGA-based systems
- Engineering: Guided AI Engineering, problem solving, systems innovation, deterministic workflows
Focus Areas
- Disaggregated AI infrastructure: remote PCIe subsystems, CXL switch systems, memory pooling, accelerator composition
- Memory-constrained inference: MoE expert placement, bounded GPU residency, CPU memory as model storage, KV-cache-preserving runtime behavior
- Cloud GPU systems: RDMA networking, NVLink systems, GPU cluster optimization, hybrid and heterogeneous AI/GPU computing
- Deterministic AI engineering: progressive context, harness engineering, goal setting, automated workflows
Blogs
- Zettascale OSU and NCCL benchmark for H100 AI workloads
- Zettascale in practice: Scaling beyond limits
- Run Nextflow with heterogeneous computing on OCI
- Cut Nextflow costs by 70% with OCI
- The Story of OpenClaw: Learning to Collaborate
- Human in the Loop: Guiding AI to Break the Bus/Network Architecture Barrier
- Human-in-the-Loop Engineering and Vibe Coding
- AI Coding BucketFS: A Transactional FUSE Filesystem for Object Storage
- AI Networking: How Fractal Scales Beyond Limits
- Inside AI Infrastructure, Series II: Benchmarking
- A New Architecture Shift: NVIDIA, Enfabrica, and Intel
Selected Projects
- vLLM MoE CPU Offload
GPU-native MoE offload using CPU host memory as the expert-weight bank. A 26B MoE model can run on a 16GB GPU without out-of-memory failure while preserving vLLM KV cache behavior.
The design keeps CPU host memory as the source of truth for expert weights and uses GPU memory only to stage the experts needed for the current computation.
Active experts are copied into GPU memory on demand and released once the computation completes, reducing GPU expert-residency pressure while keeping the CPU as storage and the GPU as the execution path.
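A minimal sketch of the staging idea in PyTorch-style Python; the expert bank, tensor shapes, and function names here are illustrative assumptions, not the project's actual API:

```python
import torch

# Hypothetical illustration: CPU host memory holds every expert's weights
# (the source of truth); GPU memory only stages the experts the router
# actually selects for the current step.

NUM_EXPERTS, HIDDEN = 64, 4096

# Pinned host tensors act as the expert-weight bank (assumed layout).
cpu_expert_bank = {
    eid: torch.randn(HIDDEN, HIDDEN, dtype=torch.float16).pin_memory()
    for eid in range(NUM_EXPERTS)
}

def run_moe_layer(x: torch.Tensor, routed_expert_ids: list[int]) -> torch.Tensor:
    """x: fp16 CUDA tensor. Stage only the routed experts, compute, then release."""
    out = torch.zeros_like(x)
    for eid in routed_expert_ids:
        # Host->device copy on demand; pinned memory allows an async transfer.
        w = cpu_expert_bank[eid].to("cuda", non_blocking=True)
        out += x @ w          # expert computation runs on the GPU
        del w                 # drop the staged copy; the CPU bank stays authoritative
    return out
```

Because the CPU bank is never mutated, staged copies can be dropped freely after each layer, which is what keeps GPU expert residency bounded.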
- vLLM MoE GPU Prefetch
Bounded GPU expert slots for 26B MoE inference, improving throughput under request locality from 20.43 to 53.29 tokens/s in the 20GB case and from 60.31 to 81.68 tokens/s in the 38GB case.
The project treats expert weights like managed pages in a bounded GPU-resident working set, similar in spirit to virtual memory page management in an operating system.
At runtime, a fixed number of GPU expert slots are actively managed. Experts can be prefetched, replaced, or reused according to demand, which helps balance limited GPU memory with inference throughput when request locality creates repeated expert access patterns.
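A minimal sketch of the bounded-slot idea; the class name, `load_fn` hook, and LRU replacement policy are illustrative assumptions rather than the project's exact mechanism:

```python
from collections import OrderedDict

class ExpertSlotManager:
    """A fixed number of GPU expert slots managed like page frames:
    hits are reused, misses evict the least recently used expert."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.slots = OrderedDict()            # expert_id -> GPU weight handle

    def acquire(self, expert_id, load_fn):
        """Return GPU-resident weights, loading (and evicting) as needed."""
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id) # hit: mark most recently used
            return self.slots[expert_id]
        if len(self.slots) >= self.num_slots:
            self.slots.popitem(last=False)    # miss: evict the LRU expert
        self.slots[expert_id] = load_fn(expert_id)  # host->device copy
        return self.slots[expert_id]

    def prefetch(self, expert_ids, load_fn):
        """Warm free slots for experts predicted to be routed next."""
        for eid in expert_ids:
            if eid not in self.slots and len(self.slots) < self.num_slots:
                self.slots[eid] = load_fn(eid)
```

Under repeated expert access patterns, most `acquire` calls become hits, which is where the locality-driven throughput gains come from.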
- llama.cpp-MoE
Router-aware GPU expert slots for local MoE inference under constrained GPU memory.
This applies the MoE expert-slot idea to llama.cpp so sparse MoE execution can work under constrained GPU memory without relying only on traditional layer-level GPU offload.
The implementation direction is to keep selected experts in GPU-resident slots while remaining compatible with local inference workflows, making llama.cpp a practical testbed for MoE memory-management ideas on smaller hardware.
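Continuing under the same assumptions, a hypothetical decode step illustrates how router output could drive an `ExpertSlotManager`-style object like the one sketched earlier; the router interface and shapes are illustrative:

```python
import torch

def decode_step(hidden, router, slot_mgr, load_fn, top_k=2):
    """hidden: fp16 CUDA activation; router scores every expert per token."""
    logits = router(hidden)                      # shape: (num_experts,)
    top = torch.topk(logits, top_k)              # router-selected experts
    probs = torch.softmax(top.values, dim=-1)

    out = torch.zeros_like(hidden)
    for weight, eid in zip(probs.tolist(), top.indices.tolist()):
        w = slot_mgr.acquire(eid, load_fn)       # slot hit, or on-demand load
        out += weight * (hidden @ w)             # weighted expert contribution
    return out
```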
- GPU-native Scheduler for GPU Computing
Patent-pending scheduling approach for GPU resource management.
This project studies GPU resource scheduling from the perspective of GPU-native execution.
The goal is to coordinate GPU compute, memory movement, and runtime placement decisions so large AI workloads can run more efficiently under limited accelerator resources.
- SuperKernel for SuperPod GPU Clusters
Patent-pending Jupyter kernel architecture for large GPU fabric execution.
The goal is to let notebook-driven workflows reach large GPU clusters or SuperPod-style systems without exposing the user directly to the complexity of distributed infrastructure.
- Nextflow IaC Plugin for Heterogeneous HPC Orchestration
Infrastructure-as-code orchestration for pipelines across Arm, GPU, and x86 infrastructure.
This project extends Nextflow workflows with infrastructure-as-code orchestration for heterogeneous compute, allowing workload placement to match the compute profile of each pipeline stage.
The practical goal is lower-cost heterogeneous HPC and AI workflow execution, especially on cloud infrastructure where different instance families can be combined for better cost-performance.
- PCI Subsystem over Network
Remote PCIe virtualization that allows remote PCIe subsystems to be accessed as local devices.
The project explores how a remote PCIe subsystem can appear as a local device, allowing hardware resources to be composed across system boundaries.
The work fits a broader direction of disaggregated AI and HPC infrastructure, where accelerators, storage, and network resources connect through high-speed fabrics rather than being fixed inside one server boundary.
- Distributed MCP Protocol for AI-native CDN Architecture
Distributed protocol design for AI-native content delivery and agent coordination.
The core idea is to move beyond request-response APIs toward persistent message-driven systems that can route work across distributed AI agents or services, connecting CDN-like distribution, mailbox-style agent protocols, and asynchronous coordination.
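A toy sketch of the mailbox idea in Python asyncio; agent names, message fields, and topology are illustrative assumptions, not a protocol specification:

```python
import asyncio

mailboxes: dict[str, asyncio.Queue] = {}

def mailbox(agent_id: str) -> asyncio.Queue:
    """Each participant owns a persistent queue addressed by name."""
    return mailboxes.setdefault(agent_id, asyncio.Queue())

async def send(to: str, message: dict) -> None:
    await mailbox(to).put(message)            # asynchronous, fire-and-forget

async def worker(agent_id: str) -> None:
    box = mailbox(agent_id)
    while (msg := await box.get())["kind"] != "stop":
        # Coordinate by forwarding results onward, not by replying inline.
        await send(msg["reply_to"], {"kind": "result",
                                     "payload": msg["payload"] * 2})

async def collector(agent_id: str) -> None:
    msg = await mailbox(agent_id).get()       # consume one routed result
    print("collected:", msg["payload"])

async def main() -> None:
    tasks = [asyncio.create_task(worker("edge-worker")),
             asyncio.create_task(collector("sink"))]
    await send("edge-worker", {"kind": "task", "payload": 21, "reply_to": "sink"})
    await send("edge-worker", {"kind": "stop"})
    await asyncio.gather(*tasks)

asyncio.run(main())
```

The sender never blocks on a response; results are routed to whichever mailbox the message names, which is the asynchronous, address-based coordination the project generalizes to distributed agents.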
- PCIe-Net and RDMA over PCIe/CXL
TCP/IP-over-PCIe/CXL and RDMA-style data movement across high-speed interconnect fabrics.
PCIe-Net studies TCP/IP-over-PCIe/CXL fabric concepts for extending network-style communication semantics across high-speed interconnects.
RDMA over PCIe/CXL studies high-performance data movement based on RDMA principles across PCIe and CXL environments. The broader goal is to make bus and memory-fabric technologies behave more like scalable networks for AI and HPC systems.
- CXL Switch SoC and Cluster-on-Board
Switch-chip and multi-CPU board-level architectures for scalable AI and HPC systems.
CXL Switch SoC studies a switch-chip architecture as an alternative interconnect model for scalable AI infrastructure, while Cluster-on-Board studies tightly interconnected multi-CPU board-level designs, including 8x-CPU-over-PCIe SoC concepts.
- Multi-rail HPC Computing System for Rendering Applications
Multi-rail HPC computing architecture for production rendering workloads.
This project built and optimized a multi-rail HPC system for production rendering, covering system architecture, high-performance rendering execution, multi-rail network and storage design, and operational support for production-scale media pipelines.
Education and Affiliations
- Ph.D., University of Wisconsin-Madison
- M.S., University of Wisconsin-Madison
- B.S., University of Science and Technology of China
- Voting member of Linux Foundation Edge and Akraino Project, 2022 and 2023
- Vice Director of the Film Advanced Technology Committee of CSMPTE, 2018
- Industry Professor, Jiangnan University, 2017-2021
Patents
- US 20120290696: Method and System for Longest Prefix Matching of Variable-Sized Hierarchical Names by Treelets
- CN 20150605: Method and System for File Transfer based on Named Data Networking Caching Algorithm
- CN114827151A: Method and Device for Heterogeneous Clustered Devices and Servers Based on PCIe, CXL, and UCIe Physical Links
- CN114745325A: Method and Device for MAC-in-MAC Network Encoding Based on PCIe, CXL, and UCIe Physical Links
- CN110891081A: Method and Device for Packet Sending, Routing, Broadcasting and Receiving
- CN 20150605CN: Method and Device for Vehicular Networking Based on Content-Centric Networking
- CN111027396A: Method and System for Assisted Driving, Apparatus, Onboard Terminal and Cloud Server
- CN110929087A: Method and System for Audio Classification, Apparatus, Electronic Device, and Storage Medium
- CN106708749A: Method and System for Fast Searching based on Fractional Algorithms
- CN109688204A: Method and System for File Download Based on Named Data Networking, Node, and Terminal
- CN109448684A: Method and System for Intelligent Music Composition
- CN111209098A: Method and System for Intelligent Rendering Scheduling, Server, Management Node and Storage Medium
- CN110955515A: Method and System for Processing Files, Apparatus, Electronic Devices, and Storage Media
- CN111178151A: Method and System for Recognizing Micro-Expressions in Facial Changes Based on AI Technology
- CN106095996B: Method and System for Text and Content Classification
- CN110944034A: Method and System for Web-based Resumable Transmission, Device, Electronic Device and Storage Medium
- CN111125045A: Method and System for a Lightweight ETL Processing Platform
- 2017211770803: Method and System for a Portable Mobile Video Content Accelerated Transmission Device
- CN107819704A: Method and System of Scalable Wireless Media Application for Edge Computing