uncloseai.
How We Run Inference
This section is optional. Read it only if you want to contribute idle GPU time to the project or reproduce the full setup in your own cluster.
vLLM Setup
We use vLLM to serve models, generally from full FP16 safetensors. The dependencies live in a virtualenv to keep them isolated from the system Python.
Note: For Hermes, we use an FP8 quant by adamo1139 (adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic), which is optimized for 4090 and 3090 GPUs.
Tool Calling: The --enable-auto-tool-choice and --tool-call-parser hermes flags are required for Hermes 3 to support function calling via the OpenAI-compatible API. Without these flags, tool calling requests will fail.
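As a sketch, a tool-calling request against the Hermes endpoint looks like the following; the get_weather function here is a hypothetical illustration, not a tool we host:

curl https://hermes.ai.unturf.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
        "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'

When the model decides to call the function, the response carries a tool_calls entry instead of plain assistant content, following the OpenAI chat completions format.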
We are considering supporting Ollama for broader quantization support.
To contribute or to reproduce the setup, stand up a replica cluster on a new domain:
sudo apt-get install gcc python3.12-dev python3.12-venv
cd ~
python3 -m venv env
source env/bin/activate
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic \
    --host 0.0.0.0 --port 18888 \
    --max-model-len 82000 \
    --enable-auto-tool-choice --tool-call-parser hermes
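Once the server is up, you can sanity-check it locally before exposing it behind the proxy:

curl http://localhost:18888/v1/models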
Proxy Setup
If you want to see how we set up the proxy, here is our /etc/caddy/Caddyfile:
ai.unturf.com {
    root * /opt/www
    file_server
    log {
        output file /var/log/caddy/ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

hermes.ai.unturf.com {
    reverse_proxy <removed>:18888
    log {
        output file /var/log/caddy/hermes.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

speech.ai.unturf.com {
    reverse_proxy <removed>:8000
    log {
        output file /var/log/caddy/speech.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}
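After editing the Caddyfile, reload Caddy to apply the changes. This assumes Caddy runs under systemd, as the standard Debian/Ubuntu packages set up:

sudo systemctl reload caddy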
Model Discovery
vLLM provides an OpenAI-compatible API with built-in documentation. You can discover available models and explore the full API using these endpoints:
Swagger Documentation
Access the interactive API docs at the /docs endpoint:
- hermes.ai.unturf.com/docs - Hermes endpoint Swagger docs
The Swagger UI lets you explore all available endpoints, see request/response schemas, and test API calls directly in your browser.
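The same server also exposes the raw OpenAPI schema; vLLM's API server is built on FastAPI, which serves it at /openapi.json by default:

curl https://hermes.ai.unturf.com/openapi.json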
Listing Models
To get the current model ID being hosted, query the /v1/models endpoint:
- hermes.ai.unturf.com/v1/models - Hermes models
- qwen.ai.unturf.com/v1/models - Qwen models
Or via curl:
curl https://hermes.ai.unturf.com/v1/models
curl https://qwen.ai.unturf.com/v1/models
Example response:
{
  "object": "list",
  "data": [
    {
      "id": "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
      "object": "model",
      "created": 1735689600,
      "owned_by": "vllm",
      "root": "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
      "max_model_len": 82000
    }
  ]
}
The id field contains the model name you should use in your API calls. The max_model_len field tells you the maximum context length supported.
Tip: Always query /v1/models programmatically rather than hardcoding model names. This ensures your code works even when models are updated or swapped.
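For example, a small script can resolve the current model ID and then use it in a completion call. This sketch assumes jq is installed for parsing the JSON:

MODEL=$(curl -s https://hermes.ai.unturf.com/v1/models | jq -r '.data[0].id')
curl https://hermes.ai.unturf.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}"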
Rate Limiting
Rate limiting is configured based on client IP address: 3 requests per second per IP per endpoint.
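If you exceed the limit, back off and retry. Assuming the limiter answers with HTTP 429 when the limit is hit, a simple client-side retry works because curl treats 429 as a transient error:

curl --retry 3 --retry-delay 1 https://hermes.ai.unturf.com/v1/models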
Next Steps
Ready to add text-to-speech to your application?