uncloseai.
How We Run Inference
This section is optional. Read it only if you want to contribute idle GPU time to the project or reproduce the full setup in your own cluster.
vLLM Setup
We use vLLM to serve models, generally from full FP16 safetensors. The dependencies live in a virtualenv to keep them isolated from the system Python.
Note: For Hermes, we use an FP8 quant by adamo1139 (adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic), which is optimized for 4090 and 3090 GPUs.
Tool Calling: The --enable-auto-tool-choice and --tool-call-parser hermes flags are required for Hermes 3 to support function calling via the OpenAI-compatible API. Without these flags, tool calling requests will fail.
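As a sketch, a tool-calling request against the Hermes endpoint looks like the following; the get_weather function here is a hypothetical illustration, not a tool we host:

curl https://hermes.ai.unturf.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
        "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'

When the model decides to call the function, the response carries a tool_calls entry instead of plain assistant content, following the OpenAI chat completions format.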
We are considering supporting Ollama for broader quantization support.
To contribute or to reproduce the setup, stand up a replica cluster on a new domain:
sudo apt-get install gcc python3.12-dev python3.12-venv
cd ~
python3 -m venv env
source env/bin/activate
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic \
    --host 0.0.0.0 --port 18888 \
    --max-model-len 82000 \
    --enable-auto-tool-choice --tool-call-parser hermes
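Once the server is up, you can sanity-check it locally before exposing it behind the proxy:

curl http://localhost:18888/v1/models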
Proxy Setup
If you want to see how we set up the proxy, here is our /etc/caddy/Caddyfile:
ai.unturf.com {
    root * /opt/www
    file_server
    log {
        output file /var/log/caddy/ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

hermes.ai.unturf.com {
    reverse_proxy <removed>:18888
    log {
        output file /var/log/caddy/hermes.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}

speech.ai.unturf.com {
    reverse_proxy <removed>:8000
    log {
        output file /var/log/caddy/speech.ai.unturf.com.log {
            roll_size 50mb
            roll_keep 5
        }
    }
    tls {
        on_demand
    }
}
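After editing the Caddyfile, reload Caddy to apply the changes. This assumes Caddy runs under systemd, as the standard Debian/Ubuntu packages set up:

sudo systemctl reload caddy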
Model Discovery
vLLM provides an OpenAI-compatible API with built-in documentation. You can discover available models and explore the full API using these endpoints:
Swagger Documentation
Access the interactive API docs at the /docs endpoint:
- hermes.ai.unturf.com/docs - Hermes endpoint Swagger docs
The Swagger UI lets you explore all available endpoints, see request/response schemas, and test API calls directly in your browser.
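The same server also exposes the raw OpenAPI schema; vLLM's API server is built on FastAPI, which serves it at /openapi.json by default:

curl https://hermes.ai.unturf.com/openapi.json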
Listing Models
To get the current model ID being hosted, query the /v1/models endpoint:
- hermes.ai.unturf.com/v1/models - Hermes models
- qwen.ai.unturf.com/v1/models - Qwen models
Or via curl:
curl https://hermes.ai.unturf.com/v1/models
curl https://qwen.ai.unturf.com/v1/models
Example response:
{
  "object": "list",
  "data": [
    {
      "id": "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
      "object": "model",
      "created": 1735689600,
      "owned_by": "vllm",
      "root": "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
      "max_model_len": 82000
    }
  ]
}
The id field contains the model name you should use in your API calls. The max_model_len field tells you the maximum context length supported.
Tip: Always query /v1/models programmatically rather than hardcoding model names. This ensures your code works even when models are updated or swapped.
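For example, a small script can resolve the current model ID and then use it in a completion call. This sketch assumes jq is installed for parsing the JSON:

MODEL=$(curl -s https://hermes.ai.unturf.com/v1/models | jq -r '.data[0].id')
curl https://hermes.ai.unturf.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}"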
Rate Limiting
Rate limiting is configured based on client IP address: 3 requests per second per IP per endpoint.
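If you exceed the limit, back off and retry. Assuming the limiter answers with HTTP 429 when the limit is hit, a simple client-side retry works because curl treats 429 as a transient error:

curl --retry 3 --retry-delay 1 https://hermes.ai.unturf.com/v1/models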
Next Steps
Ready to add text-to-speech to your application?