Managing Instances¶
Learn how to effectively manage your llama.cpp, MLX, and vLLM instances with Llamactl through both the Web UI and API.
Overview¶
Llamactl provides two ways to manage instances:
- Web UI: Accessible at http://localhost:8080 with an intuitive dashboard
- REST API: Programmatic access for automation and integration
Authentication¶
If authentication is enabled:

1. Navigate to the web UI
2. Enter your credentials
3. Bearer token is stored for the session
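For API access, the token can be supplied as a bearer token in the Authorization header. A minimal sketch, assuming a list endpoint at /api/instances and a placeholder key value:

# List instances with a bearer token (header usage and endpoint are assumptions)
curl http://localhost:8080/api/instances \
  -H "Authorization: Bearer your-api-key"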
Theme Support¶
- Switch between light and dark themes
- Setting is remembered across sessions
Instance Cards¶
Each instance is displayed as a card showing:
- Instance name
- Health status badge (unknown, ready, error, failed)
- Action buttons (start, stop, edit, logs, delete)
Create Instance¶
Via Web UI¶
- Click the "Create Instance" button on the dashboard
- Enter a unique Name for your instance (only required field)
- Choose Backend Type:
- llama.cpp: For GGUF models using llama-server
- MLX: For MLX-optimized models (macOS only)
- vLLM: For distributed serving and high-throughput inference
- Configure model source:
- For llama.cpp: GGUF model path or HuggingFace repo
- For MLX: MLX model path or identifier (e.g., mlx-community/Mistral-7B-Instruct-v0.3-4bit)
- For vLLM: HuggingFace model identifier (e.g., microsoft/DialoGPT-medium)
- Configure optional instance management settings:
- Auto Restart: Automatically restart instance on failure
- Max Restarts: Maximum number of restart attempts
- Restart Delay: Delay in seconds between restart attempts
- On Demand Start: Start the instance automatically when a request arrives at the OpenAI-compatible endpoint
- Idle Timeout: Minutes before stopping idle instance (set to 0 to disable)
- Configure backend-specific options:
- llama.cpp: Threads, context size, GPU layers, port, etc.
- MLX: Temperature, top-p, adapter path, Python environment, etc.
- vLLM: Tensor parallel size, GPU memory utilization, quantization, etc.
- Click "Create" to save the instance
Via API¶
# Create llama.cpp instance with local model file
curl -X POST http://localhost:8080/api/instances/my-llama-instance \
-H "Content-Type: application/json" \
-d '{
"backend_type": "llama_cpp",
"backend_options": {
"model": "/path/to/model.gguf",
"threads": 8,
"ctx_size": 4096,
"gpu_layers": 32
}
}'
# Create MLX instance (macOS only)
curl -X POST http://localhost:8080/api/instances/my-mlx-instance \
-H "Content-Type: application/json" \
-d '{
"backend_type": "mlx_lm",
"backend_options": {
"model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
"temp": 0.7,
"top_p": 0.9,
"max_tokens": 2048
},
"auto_restart": true,
"max_restarts": 3
}'
# Create vLLM instance
curl -X POST http://localhost:8080/api/instances/my-vllm-instance \
-H "Content-Type: application/json" \
-d '{
"backend_type": "vllm",
"backend_options": {
"model": "microsoft/DialoGPT-medium",
"tensor_parallel_size": 2,
"gpu_memory_utilization": 0.9
},
"auto_restart": true,
"on_demand_start": true
}'
# Create llama.cpp instance with HuggingFace model
curl -X POST http://localhost:8080/api/instances/gemma-3-27b \
-H "Content-Type: application/json" \
-d '{
"backend_type": "llama_cpp",
"backend_options": {
"hf_repo": "unsloth/gemma-3-27b-it-GGUF",
"hf_file": "gemma-3-27b-it-GGUF.gguf",
"gpu_layers": 32
}
}'
Start Instance¶
Via Web UI¶
- Click the "Start" button on an instance card
- The status badge changes to "Unknown" while the backend starts
- Monitor progress in the logs
- The status badge changes to "Ready" once the instance is up
Via API¶
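A minimal sketch, assuming start is exposed as an action under the instance route shown earlier:

# Start an instance (endpoint path is an assumption)
curl -X POST http://localhost:8080/api/instances/{name}/start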
Stop Instance¶
Via Web UI¶
- Click the "Stop" button on an instance card
- Instance gracefully shuts down
Via API¶
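A minimal sketch, assuming stop mirrors the start action:

# Stop an instance (endpoint path is an assumption)
curl -X POST http://localhost:8080/api/instances/{name}/stop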
Edit Instance¶
Via Web UI¶
- Click the "Edit" button on an instance card
- Modify settings in the configuration dialog
- Changes require instance restart to take effect
- Click "Update & Restart" to apply changes
Via API¶
Modify instance settings:
curl -X PUT http://localhost:8080/api/instances/{name} \
-H "Content-Type: application/json" \
-d '{
"backend_options": {
"threads": 8,
"context_size": 4096
}
}'
Note
Configuration changes require restarting the instance to take effect.
View Logs¶
Via Web UI¶
- Click the "Logs" button on any instance card
- Real-time log viewer opens
Via API¶
Fetch the logs for an instance:
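The logs route below is an assumption modeled on the /api/instances/{name} pattern used throughout:

# Retrieve instance logs (endpoint path is an assumption)
curl http://localhost:8080/api/instances/{name}/logs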
Delete Instance¶
Via Web UI¶
- Click the "Delete" button on an instance card
- Only stopped instances can be deleted
- Confirm deletion in the dialog
Via API¶
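A minimal sketch, assuming deletion uses the standard REST verb on the instance route (stop the instance first):

# Delete a stopped instance (endpoint pattern is an assumption)
curl -X DELETE http://localhost:8080/api/instances/{name}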
Instance Proxy¶
Llamactl proxies all requests to the underlying backend instances (llama-server, MLX, or vLLM).
All backends provide OpenAI-compatible endpoints. Check the respective documentation:

- llama-server docs
- MLX-LM docs
- vLLM docs
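As an illustration, a chat completion request routed through Llamactl might look like the sketch below; the /v1/chat/completions route and the use of the instance name as the model are assumptions:

# Send an OpenAI-compatible chat request through the proxy
# (route and model-name convention are assumptions)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llama-instance",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'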
Instance Health¶
Via Web UI¶
- The health status badge is displayed on each instance card
Via API¶
Check the health status of your instances:
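A minimal sketch, assuming instance details (including the health status shown on the card) are returned by a GET on the instance route:

# Inspect an instance, including its health status (endpoint and fields are assumptions)
curl http://localhost:8080/api/instances/{name}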