API Reference

Complete reference for the Llamactl REST API.

Base URL

All API endpoints are relative to the base URL:

http://localhost:8080/api/v1

Authentication

Llamactl supports API key authentication. If authentication is enabled, include the API key in the Authorization header:

curl -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/instances

The server supports two types of API keys:

  • Management API Keys: Required for instance management operations (CRUD operations on instances)
  • Inference API Keys: Required for OpenAI-compatible inference endpoints
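
For example, a management key authorizes the instance management endpoints, while an inference key authorizes the OpenAI-compatible endpoints (the key values below are placeholders):

# Manage instances with a management API key
curl -H "Authorization: Bearer <management-api-key>" \
  http://localhost:8080/api/v1/instances

# Call OpenAI-compatible endpoints with an inference API key
curl -H "Authorization: Bearer <inference-api-key>" \
  http://localhost:8080/v1/models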

System Endpoints

Get Llamactl Version

Get the version information of the llamactl server.

GET /api/v1/version

Response:

Version: 1.0.0
Commit: abc123
Build Time: 2024-01-15T10:00:00Z

Get Llama Server Help

Get help text for the llama-server command.

GET /api/v1/server/help

Response: Plain text help output from llama-server --help

Get Llama Server Version

Get version information of the llama-server binary.

GET /api/v1/server/version

Response: Plain text version output from llama-server --version

List Available Devices

List available devices for llama-server.

GET /api/v1/server/devices

Response: Plain text device list from llama-server --list-devices
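
Example - Query System Endpoints (add an Authorization header as shown in the Authentication section if authentication is enabled):

curl http://localhost:8080/api/v1/version
curl http://localhost:8080/api/v1/server/help
curl http://localhost:8080/api/v1/server/devices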

Instances

List All Instances

Get a list of all instances.

GET /api/v1/instances

Response:

[
  {
    "name": "llama2-7b",
    "status": "running",
    "created": 1705312200
  }
]
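
For example, to list only the instance names (assumes jq is installed; the API key is a placeholder):

curl -s -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/instances | jq '.[].name'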

Get Instance Details

Get detailed information about a specific instance.

GET /api/v1/instances/{name}

Response:

{
  "name": "llama2-7b",
  "status": "running",
  "created": 1705312200
}

Create Instance

Create and start a new instance.

POST /api/v1/instances/{name}

Request Body: JSON object with instance configuration. See Managing Instances for available configuration options.

Response:

{
  "name": "llama2-7b",
  "status": "running",
  "created": 1705312200
}
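
Example (the configuration shown is illustrative; the available fields depend on the backend, see Managing Instances):

curl -X POST http://localhost:8080/api/v1/instances/llama2-7b \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{
    "model": "/models/llama-2-7b.gguf"
  }'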

Update Instance

Update an existing instance configuration. See Managing Instances for available configuration options.

PUT /api/v1/instances/{name}

Request Body: JSON object with configuration fields to update.

Response:

{
  "name": "llama2-7b",
  "status": "running",
  "created": 1705312200
}
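
Example (the field shown is illustrative; see Managing Instances for the options that can be changed):

curl -X PUT http://localhost:8080/api/v1/instances/llama2-7b \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{
    "model": "/models/llama-2-7b-chat.gguf"
  }'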

Delete Instance

Stop and remove an instance.

DELETE /api/v1/instances/{name}

Response: 204 No Content

Instance Operations

Start Instance

Start a stopped instance.

POST /api/v1/instances/{name}/start

Response:

{
  "name": "llama2-7b",
  "status": "running",
  "created": 1705312200
}
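
Example (instance name and API key are placeholders):

curl -X POST -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/instances/llama2-7b/start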

Error Responses:

  • 409 Conflict: Maximum number of running instances reached
  • 500 Internal Server Error: Failed to start instance

Stop Instance

Stop a running instance.

POST /api/v1/instances/{name}/stop

Response:

{
  "name": "llama2-7b",
  "status": "stopped",
  "created": 1705312200
}

Restart Instance

Restart an instance (stop then start).

POST /api/v1/instances/{name}/restart

Response:

{
  "name": "llama2-7b",
  "status": "running",
  "created": 1705312200
}

Get Instance Logs

Retrieve instance logs.

GET /api/v1/instances/{name}/logs

Query Parameters:

  • lines: Number of lines to return (default: all lines; use -1 for all)

Response: Plain text log output

Example:

curl "http://localhost:8080/api/v1/instances/my-instance/logs?lines=100"

Proxy to Instance

Proxy HTTP requests directly to the llama-server instance.

GET /api/v1/instances/{name}/proxy/*
POST /api/v1/instances/{name}/proxy/*

This endpoint forwards all requests to the underlying llama-server instance running on its configured port. The proxy strips the /api/v1/instances/{name}/proxy prefix and forwards the remaining path to the instance.

Example - Check Instance Health:

curl -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model/proxy/health

This forwards the request to http://instance-host:instance-port/health on the actual llama-server instance.

Error Responses:

  • 503 Service Unavailable: Instance is not running

OpenAI-Compatible API

Llamactl provides OpenAI-compatible endpoints for inference operations.

List Models

List all instances in OpenAI-compatible format.

GET /v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama2-7b",
      "object": "model",
      "created": 1705312200,
      "owned_by": "llamactl"
    }
  ]
}
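
For example, to list the model IDs that can be used in inference requests (assumes jq is installed; the key is a placeholder):

curl -s -H "Authorization: Bearer <your-inference-api-key>" \
  http://localhost:8080/v1/models | jq '.data[].id'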

Chat Completions, Completions, Embeddings, Reranking

All OpenAI-compatible inference endpoints are available:

POST /v1/chat/completions
POST /v1/completions
POST /v1/embeddings
POST /v1/rerank
POST /v1/reranking

Request Body: Standard OpenAI format with model field specifying the instance name

Example:

{
  "model": "llama2-7b",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ]
}

The server routes requests to the appropriate instance based on the model field in the request body. Instances with on-demand starting enabled will be automatically started if not running. For configuration details, see Managing Instances.

Error Responses:

  • 400 Bad Request: Invalid request body or missing instance name
  • 503 Service Unavailable: Instance is not running and on-demand start is disabled
  • 409 Conflict: Cannot start instance due to maximum instances limit

Instance Status Values

Instances can have the following status values:

  • stopped: Instance is not running
  • running: Instance is running and ready to accept requests
  • failed: Instance failed to start or crashed

Error Responses

All endpoints may return error responses in the following format:

{
  "error": "Error message description"
}

Common HTTP Status Codes

  • 200: Success
  • 201: Created
  • 204: No Content (successful deletion)
  • 400: Bad Request (invalid parameters or request body)
  • 401: Unauthorized (missing or invalid API key)
  • 403: Forbidden (insufficient permissions)
  • 404: Not Found (instance not found)
  • 409: Conflict (instance already exists, max instances reached)
  • 500: Internal Server Error
  • 503: Service Unavailable (instance not running)
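
A quick way to see which status code a call returns is to print only the HTTP status (standard curl options; the instance name and key are placeholders):

curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer <your-api-key>" \
  http://localhost:8080/api/v1/instances/llama2-7b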

Examples

Complete Instance Lifecycle

# Create and start instance
curl -X POST http://localhost:8080/api/v1/instances/my-model \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "/models/llama-2-7b.gguf"
  }'

# Check instance status
curl -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model

# Get instance logs
curl -H "Authorization: Bearer your-api-key" \
  "http://localhost:8080/api/v1/instances/my-model/logs?lines=50"

# Use OpenAI-compatible chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-inference-api-key" \
  -d '{
    "model": "my-model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100
  }'

# Stop instance
curl -X POST -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model/stop

# Delete instance
curl -X DELETE -H "Authorization: Bearer your-api-key" \
  http://localhost:8080/api/v1/instances/my-model

Using the Proxy Endpoint

You can also send requests directly to the underlying llama-server instance through the proxy endpoint:

# Direct proxy to instance (bypasses OpenAI compatibility layer)
curl -X POST http://localhost:8080/api/v1/instances/my-model/proxy/completion \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "prompt": "Hello, world!",
    "n_predict": 50
  }'

Backend-Specific Endpoints

Parse Commands

Llamactl provides endpoints to parse command strings from different backends into instance configuration options.

Parse Llama.cpp Command

Parse a llama-server command string into instance options.

POST /api/v1/backends/llama-cpp/parse-command

Request Body:

{
  "command": "llama-server -m /path/to/model.gguf -c 2048 --port 8080"
}

Response:

{
  "backend_type": "llama_cpp",
  "llama_server_options": {
    "model": "/path/to/model.gguf",
    "ctx_size": 2048,
    "port": 8080
  }
}
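
Example (the command string is illustrative):

curl -X POST http://localhost:8080/api/v1/backends/llama-cpp/parse-command \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{
    "command": "llama-server -m /path/to/model.gguf -c 2048 --port 8080"
  }'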

Parse MLX-LM Command

Parse an MLX-LM server command string into instance options.

POST /api/v1/backends/mlx/parse-command

Request Body:

{
  "command": "mlx_lm.server --model /path/to/model --port 8080"
}

Response:

{
  "backend_type": "mlx_lm",
  "mlx_server_options": {
    "model": "/path/to/model",
    "port": 8080
  }
}

Parse vLLM Command

Parse a vLLM serve command string into instance options.

POST /api/v1/backends/vllm/parse-command

Request Body:

{
  "command": "vllm serve /path/to/model --port 8080"
}

Response:

{
  "backend_type": "vllm",
  "vllm_server_options": {
    "model": "/path/to/model",
    "port": 8080
  }
}

Error Responses for Parse Commands:

  • 400 Bad Request: Invalid request body, empty command, or parse error
  • 500 Internal Server Error: Encoding error

Auto-Generated Documentation

The API documentation is automatically generated from code annotations using Swagger/OpenAPI. To regenerate the documentation:

  1. Install the swag tool: go install github.com/swaggo/swag/cmd/swag@latest
  2. Generate docs: swag init -g cmd/server/main.go -o apidocs

Swagger Documentation

If Swagger documentation is enabled in the server configuration, you can access the interactive API documentation at:

http://localhost:8080/swagger/

This provides a complete interactive interface for testing all API endpoints.