Local LLM Inference:
  1. Ollama: small memory footprint LLM inference (see the REST API sketch after this list).
  2. LM Studio
  3. Nexa SDK
  4. Transformer Lab
  5. Clean UI
  6. bitnet.cpp - running 1-bit LLMs on CPU
  7. ...
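Most of these tools expose a simple local HTTP endpoint. As a rough sketch of how little code is needed, the snippet below calls Ollama's local REST API; it assumes Ollama is running on its default port 11434 and that a model such as llama3 has already been pulled (the model name is a placeholder).

```python
import json
import urllib.request

# Minimal call to Ollama's local REST API (assumes `ollama serve` is running
# on the default port 11434 and the "llama3" model has been pulled).
payload = {
    "model": "llama3",   # placeholder: any locally pulled model name works here
    "prompt": "Explain what a vector database is in one sentence.",
    "stream": False,     # return a single JSON response instead of a stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])  # the generated text
```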
Most popular Vector DBs:
  1. Chroma (see the query sketch after this list)
  2. Milvus
  3. Cassandra (supports vector search)
  4. Weaviate
  5. Pgvector (extension)
  6. Oracle (vector search support, announced on September 13th)
  7. ...
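To illustrate the typical workflow with these stores, here is a small sketch using Chroma's Python client. The collection name and documents are made up for the example, and it relies on Chroma's built-in default embedding function since none is supplied explicitly.

```python
import chromadb

# In-memory Chroma client; documents are embedded with Chroma's default
# embedding function unless one is passed in explicitly.
client = chromadb.Client()
collection = client.create_collection("demo_docs")  # hypothetical collection name

collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "Ollama runs large language models locally.",
        "pgvector adds vector similarity search to PostgreSQL.",
        "Haystack is a framework for building RAG pipelines.",
    ],
)

# Nearest-neighbour search over the stored embeddings.
results = collection.query(
    query_texts=["How do I search vectors in Postgres?"],
    n_results=2,
)
print(results["documents"][0])
```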
RAG frameworks (the retrieve-then-generate pattern they share is sketched after this list):
  1. LangChain
  2. LlamaIndex
  3. RAGFlow
  4. Haystack
  5. ...
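These frameworks differ in ergonomics, but all of them implement the same retrieve-then-generate loop. The sketch below shows that loop with no framework at all, using a deliberately naive keyword-overlap retriever and a locally running Ollama instance for generation; the model name and documents are placeholders, and real frameworks would retrieve by embedding similarity instead.

```python
import json
import urllib.request

# Toy corpus standing in for a real document store.
DOCS = [
    "pgvector is a PostgreSQL extension for vector similarity search.",
    "Chroma is an open-source embedding database for LLM applications.",
    "vLLM serves models with high throughput using PagedAttention.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; frameworks use embeddings instead."""
    q_words = set(question.lower().split())
    scored = sorted(
        DOCS,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate(prompt: str) -> str:
    """Call a locally running Ollama instance (assumed on the default port)."""
    payload = {"model": "llama3", "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

question = "What does pgvector do?"
context = "\n".join(retrieve(question))
answer = generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```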
Local LLM inference tools
Below is a detailed comparison of popular tools for running and managing LLMs locally.
| Tool | Interface | Ease of Use | Customization & Flexibility | Performance & Optimization | OS Support | Unique Features |
|---|---|---|---|---|---|---|
| Ollama | CLI with REST API support | Simple and beginner-friendly with straightforward commands | Flexible via Modelfiles; supports multiple LLM backends (CPU, GPU, etc.) | Automatic hardware detection; leverages GPU acceleration | Cross-platform | Experimental OpenAI API compatibility; multimodal input support (e.g., images) |
| llama.cpp | CLI | Requires compilation and command-line usage; more technical | Highly flexible; extensive parameter tuning; open source and community-maintained | Lightweight, with efficient CPU performance and a low memory footprint | Cross-platform | Supports various quantization techniques; highly community-driven |
| RamaLama | CLI | Designed to streamline model switching and ease repetitive tasks | Focuses on auto-unload and robust memory management for frequent model swapping | Optimized for constrained systems with rapid model loading and low memory use | Cross-platform | Emphasizes automated unloading and memory optimization |
| LM Studio | GUI-based | Extremely user-friendly; ideal for newcomers | Extensive configuration options via a polished GUI | Solid local performance plus a local inference server; high memory footprint | Windows, macOS, Linux (beta) | Polished chat interface; built-in model discovery; integrated local inference server |
| Nexa SDK | SDK/CLI toolkit | Developer-focused; requires technical integration | Offers advanced quantization options and optimization parameters | Reduces model file size and RAM usage while preserving accuracy (e.g., NexaQuant) | Cross-platform | Specializes in quantization/optimization; works well alongside tools like Ollama or LM Studio |
| vLLM | CLI with OpenAI-compatible API server (see the client sketch below the table) | Developer-friendly; integrates seamlessly with Hugging Face models | Supports various decoding algorithms, tensor and pipeline parallelism, and quantization | High throughput with efficient memory management via PagedAttention; supports GPUs and CPUs | Cross-platform | PagedAttention for efficient KV cache management; supports speculative decoding and structured outputs |
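Because vLLM exposes an OpenAI-compatible server, an existing OpenAI client can be pointed at it by changing only the base URL. A rough sketch, assuming the server was started with something like `vllm serve <model>` on the default port 8000 and that the `openai` Python package is installed; the model name is a placeholder and must match whatever model the server is actually serving.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: use the served model's name
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```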