A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
Updated Nov 18, 2024
Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 150+ CUDA kernels, HGEMM (achieving cuBLAS-level performance), 100+ LLM/CUDA blogs.
Toy Flash Attention implementation in torch
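For illustration, a minimal sketch of what such a toy implementation might look like: a tiled attention loop with an online softmax, so the full (N x N) attention matrix is never materialized. The function name, block size, and tensor shapes here are assumptions for this example, not code from any listed repository.

```python
import math
import torch

def toy_flash_attention(q, k, v, block_size=64):
    """Tiled attention over the key/value sequence dimension.

    q, k, v: (batch, heads, seq_len, head_dim) tensors.
    Returns the same result as softmax(q @ k^T / sqrt(d)) @ v,
    but processes keys/values one block at a time (FlashAttention-style
    online softmax), never forming the full attention matrix.
    """
    *_, seq_len, head_dim = k.shape
    scale = 1.0 / math.sqrt(head_dim)

    # Running statistics of the online softmax.
    out = torch.zeros_like(q)                                  # weighted sum of values
    row_max = torch.full(q.shape[:-1] + (1,), float("-inf"),
                         dtype=q.dtype, device=q.device)       # running max of scores
    row_sum = torch.zeros_like(row_max)                        # running softmax denominator

    for start in range(0, seq_len, block_size):
        k_blk = k[..., start:start + block_size, :]
        v_blk = v[..., start:start + block_size, :]

        scores = (q @ k_blk.transpose(-2, -1)) * scale         # (B, H, N, block)

        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)

        # Rescale previously accumulated statistics to the new running max.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)                         # unnormalized block probabilities

        out = out * correction + p @ v_blk
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum


if __name__ == "__main__":
    q = torch.randn(2, 4, 256, 32)
    k = torch.randn(2, 4, 256, 32)
    v = torch.randn(2, 4, 256, 32)
    ref = torch.softmax((q @ k.transpose(-2, -1)) / math.sqrt(32), dim=-1) @ v
    assert torch.allclose(toy_flash_attention(q, k, v), ref, atol=1e-4)
```

This sketch only captures the algorithmic idea (blockwise softmax with running max/sum); real FlashAttention kernels additionally tile the query dimension and fuse everything into a single CUDA kernel to avoid HBM round-trips.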