API Reference¶
Complete reference for the Llamactl REST API.
Base URL¶
All API endpoints are relative to the base URL:
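In the examples on this page, the server runs locally on port 8080, so management endpoints are rooted at:

http://localhost:8080/api/v1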
Authentication¶
Llamactl supports API key authentication. If authentication is enabled, include the API key in the Authorization header:
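For example, passing a management key as a Bearer token (mirroring the lifecycle examples later on this page):

curl -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model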
The server supports two types of API keys:
- Management API Keys: Required for instance management operations (CRUD operations on instances)
- Inference API Keys: Required for OpenAI-compatible inference endpoints
System Endpoints¶
Get Llamactl Version¶
Get the version information of the llamactl server.
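A minimal sketch of the request, assuming the version endpoint lives under the same base path as the other management endpoints (the exact path may differ):

curl -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/version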
Response:
Get Llama Server Help¶
Get help text for the llama-server command.
Response: Plain text help output from llama-server --help
Get Llama Server Version¶
Get version information of the llama-server binary.
Response: Plain text version output from llama-server --version
List Available Devices¶
List available devices for llama-server.
Response: Plain text device list from llama-server --list-devices
Instances¶
List All Instances¶
Get a list of all instances.
Response:
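The response is a JSON array of instance objects; a hedged sketch (field names assumed, exact shape may differ) might look like:

[
  {
    "name": "llama2-7b",
    "status": "running"
  },
  {
    "name": "my-model",
    "status": "stopped"
  }
]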
Get Instance Details¶
Get detailed information about a specific instance.
Response:
Create Instance¶
Create and start a new instance.
Request Body: JSON object with instance configuration. See Managing Instances for available configuration options.
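For example, a minimal body (matching the lifecycle example at the end of this page) might specify only the model path:

{
  "model": "/models/llama-2-7b.gguf"
}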
Response:
Update Instance¶
Update an existing instance configuration. See Managing Instances for available configuration options.
Request Body: JSON object with configuration fields to update.
Response:
Delete Instance¶
Stop and remove an instance.
Response: 204 No Content
Instance Operations¶
Start Instance¶
Start a stopped instance.
Response:
Error Responses:
- 409 Conflict: Maximum number of running instances reached
- 500 Internal Server Error: Failed to start instance
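For example, assuming the start operation is exposed as a start sub-path alongside the stop endpoint used later on this page:

curl -X POST -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model/start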
Stop Instance¶
Stop a running instance.
Response:
Restart Instance¶
Restart an instance (stop then start).
Response:
Get Instance Logs¶
Retrieve instance logs.
Query Parameters:
- lines: Number of lines to return (default: all lines; use -1 for all)
Response: Plain text log output
Example:
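Fetch the last 50 lines for an instance named my-model:

curl -H "Authorization: Bearer your-api-key" \
  "http://localhost:8080/api/v1/instances/my-model/logs?lines=50"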
Proxy to Instance¶
Proxy HTTP requests directly to the llama-server instance.
This endpoint forwards all requests to the underlying llama-server instance running on its configured port. The proxy strips the /api/v1/instances/{name}/proxy prefix and forwards the remaining path to the instance.
Example - Check Instance Health:
curl -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model/proxy/health
This forwards the request to http://instance-host:instance-port/health on the actual llama-server instance.
Error Responses:
- 503 Service Unavailable: Instance is not running
OpenAI-Compatible API¶
Llamactl provides OpenAI-compatible endpoints for inference operations.
List Models¶
List all instances in OpenAI-compatible format.
Response:
{
"object": "list",
"data": [
{
"id": "llama2-7b",
"object": "model",
"created": 1705312200,
"owned_by": "llamactl"
}
]
}
Chat Completions, Completions, Embeddings¶
All OpenAI-compatible inference endpoints are available:
POST /v1/chat/completions
POST /v1/completions
POST /v1/embeddings
POST /v1/rerank
POST /v1/reranking
Request Body: Standard OpenAI format with the model field specifying the instance name.
Example:
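For example, a chat completion request routed to the instance named my-model:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-inference-api-key" \
  -d '{
    "model": "my-model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100
  }'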
The server routes requests to the appropriate instance based on the model field in the request body. Instances with on-demand starting enabled will be automatically started if not running. For configuration details, see Managing Instances.
Error Responses:
- 400 Bad Request: Invalid request body or missing instance name
- 503 Service Unavailable: Instance is not running and on-demand start is disabled
- 409 Conflict: Cannot start instance due to maximum instances limit
Instance Status Values¶
Instances can have the following status values:
- stopped: Instance is not running
- running: Instance is running and ready to accept requests
- failed: Instance failed to start or crashed
Error Responses¶
All endpoints may return error responses in the following format:
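The exact body depends on the handler, but a typical JSON error payload might look like this (field name assumed):

{
  "error": "instance not found"
}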
Common HTTP Status Codes¶
- 200: Success
- 201: Created
- 204: No Content (successful deletion)
- 400: Bad Request (invalid parameters or request body)
- 401: Unauthorized (missing or invalid API key)
- 403: Forbidden (insufficient permissions)
- 404: Not Found (instance not found)
- 409: Conflict (instance already exists, max instances reached)
- 500: Internal Server Error
- 503: Service Unavailable (instance not running)
Examples¶
Complete Instance Lifecycle¶
# Create and start instance
curl -X POST http://localhost:8080/api/v1/instances/my-model \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "/models/llama-2-7b.gguf"
}'
# Check instance status
curl -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model
# Get instance logs
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:8080/api/v1/instances/my-model/logs?lines=50"
# Use OpenAI-compatible chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-inference-api-key" \
-d '{
"model": "my-model",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 100
}'
# Stop instance
curl -X POST -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model/stop
# Delete instance
curl -X DELETE -H "Authorization: Bearer your-api-key" \
http://localhost:8080/api/v1/instances/my-model
Using the Proxy Endpoint¶
You can also proxy requests directly to the llama-server instance:
# Direct proxy to instance (bypasses OpenAI compatibility layer)
curl -X POST http://localhost:8080/api/v1/instances/my-model/proxy/completion \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"prompt": "Hello, world!",
"n_predict": 50
}'
Backend-Specific Endpoints¶
Parse Commands¶
Llamactl provides endpoints to parse command strings from different backends into instance configuration options.
Parse Llama.cpp Command¶
Parse a llama-server command string into instance options.
Request Body:
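A sketch of the request, assuming the raw command string is passed in a command field (the option values mirror the response below):

{
  "command": "llama-server --model /path/to/model.gguf --ctx-size 2048 --port 8080"
}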
Response:
{
"backend_type": "llama_cpp",
"llama_server_options": {
"model": "/path/to/model.gguf",
"ctx_size": 2048,
"port": 8080
}
}
Parse MLX-LM Command¶
Parse an MLX-LM server command string into instance options.
Request Body:
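By analogy with the llama.cpp parser above, the body would carry the raw command string (the field name and command shown here are assumptions):

{
  "command": "mlx_lm.server --model /path/to/model --port 8080"
}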
Response:
Parse vLLM Command¶
Parse a vLLM serve command string into instance options.
Request Body:
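Again by analogy, a hedged sketch of the request body (field name assumed):

{
  "command": "vllm serve /path/to/model --port 8080"
}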
Response:
Error Responses for Parse Commands:
- 400 Bad Request: Invalid request body, empty command, or parse error
- 500 Internal Server Error: Encoding error
Auto-Generated Documentation¶
The API documentation is automatically generated from code annotations using Swagger/OpenAPI. To regenerate the documentation:
- Install the swag tool:
go install github.com/swaggo/swag/cmd/swag@latest
- Generate docs:
swag init -g cmd/server/main.go -o apidocs
Swagger Documentation¶
If Swagger documentation is enabled in the server configuration, you can access the interactive API documentation at:
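For a default local deployment, this is typically a path such as the following (assumed):

http://localhost:8080/swagger/index.html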
This provides a complete interactive interface for testing all API endpoints.