LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar
Updated Apr 23, 2026 - Python
RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.
Turn any computer or edge device into a command center for your computer vision projects.
The simplest way to serve AI/ML models in production
Python + Inference - Model Deployment library in Python. Simplest model inference server ever.
This is a repository for a no-code object detection inference API using the YOLOv3 and YOLOv4 Darknet framework.
This is a repository for a no-code object detection inference API using YOLOv4 and YOLOv3 with OpenCV.
This is a repository for an object detection inference API using the Tensorflow framework.
Deploy DL/ML inference pipelines with minimal extra code.
[⛔️ DEPRECATED] Friendli: the fastest serving engine for generative AI
Advanced inference pipeline using NVIDIA Triton Inference Server for CRAFT text detection (PyTorch). Includes a converter from PyTorch -> ONNX -> TensorRT and inference pipelines (TensorRT, Triton server - multi-format). Supported model formats for Triton inference: TensorRT engine, TorchScript, ONNX.
Benchmark for machine learning model online serving (LLM, embedding, Stable-Diffusion, Whisper)
Inference Server Implementation from Scratch for Machine Learning Models
Roboflow's inference server to analyze video streams. This project extracts insights from video frames at defined intervals and generates informative visualizations and CSV outputs.
Session Based Real-time Hotel Recommendation Web Application
A networked inference server for Whisper speech recognition
Vision and vision-multi-modal components for geniusrise framework
Serve PyTorch inference requests using batching with Redis for faster performance.
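The entry above pairs Redis with request batching; a minimal sketch of that pattern, not the repository's actual code. The queue name, request fields, and `BatchWorker` class are all hypothetical, and any object exposing `lpop` (such as a `redis-py` client) can serve as the queue backend.

```python
# Illustrative sketch: batching inference requests through a Redis list
# so a worker can process them in groups. Names are hypothetical.
import json


class BatchWorker:
    """Pops up to `max_batch` queued requests and runs them together."""

    def __init__(self, client, queue="inference:requests", max_batch=8):
        self.client = client        # any object with lpop(key) -> bytes/str/None
        self.queue = queue
        self.max_batch = max_batch

    def drain_batch(self):
        # Collect whatever is queued, up to the batch-size cap.
        batch = []
        for _ in range(self.max_batch):
            raw = self.client.lpop(self.queue)
            if raw is None:
                break
            batch.append(json.loads(raw))
        return batch

    def run_once(self, model):
        batch = self.drain_batch()
        if not batch:
            return []
        # One forward pass over the whole batch amortizes per-call overhead.
        inputs = [req["input"] for req in batch]
        return model(inputs)
```

In practice the worker would loop with a blocking pop (e.g. `blpop` with a timeout) and write results back keyed by request id; the sketch keeps only the batching step.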
🤖 Optimize LLM inference on Mac with continuous batching and SSD caching managed from your menu bar for efficient performance.
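Two entries in this list mention continuous batching. A minimal sketch of the idea, under assumptions: new requests join the in-flight batch as soon as earlier ones finish, rather than waiting for the whole batch to drain. The `step` callback and request fields here are illustrative, not any listed project's API.

```python
# Hypothetical sketch of continuous batching for LLM decoding:
# requests are admitted into the active batch at every decode step.
from collections import deque


def continuous_batching(requests, step, max_batch=4):
    """requests: iterable of per-request dicts.
    step(active) decodes one token for every active request and
    returns the subset of requests that are now finished."""
    pending = deque(requests)
    active, results = [], []
    while pending or active:
        # Refill free batch slots immediately (the "continuous" part).
        while pending and len(active) < max_batch:
            active.append(pending.popleft())
        for req in step(active):
            active.remove(req)
            results.append(req)
    return results
```

The design point is throughput: because slots are refilled per step, short generations free capacity for queued requests instead of idling until the longest sequence in the batch completes.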
Text components powering LLMs & SLMs for geniusrise framework