This server lets you send requests to OpenAI’s models (like `gpt-3.5-turbo`) and get back streamed responses in real time. It’s perfect if you want to build a chatbot, automate customer support, or add AI to your app.
The whole point? Take a structured request, process it with OpenAI’s API, and send back the result in bite-sized chunks so you can display or process it as it’s happening.
Here’s the main piece of code:
```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import NOT_GIVEN, AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


@app.post("/chat/completions")
async def chat_completion_stream(vapi_payload: ChatRequest):
    # ChatRequest is the Pydantic request model (a sketch is shown below).
    try:
        # Forward the request to OpenAI with streaming enabled.
        response = await client.chat.completions.create(
            model=vapi_payload.model,
            messages=vapi_payload.messages,
            temperature=vapi_payload.temperature,
            # Omit tools entirely when none are supplied; the API rejects an empty list.
            tools=vapi_payload.tools or NOT_GIVEN,
            stream=True,
        )

        async def event_stream():
            try:
                # Relay each chunk to the client as a Server-Sent Event.
                async for chunk in response:
                    yield f"data: {json.dumps(chunk.model_dump())}\n\n"
                yield "data: [DONE]\n\n"
            except Exception as e:
                print(f"Error during response streaming: {e}")
                yield f"data: {json.dumps({'error': str(e)})}\n\n"

        return StreamingResponse(event_stream(), media_type="text/event-stream")
    except Exception as e:
        # If the OpenAI request fails outright, return the error as a single SSE event.
        return StreamingResponse(
            iter([f"data: {json.dumps({'error': str(e)})}\n\n"]),
            media_type="text/event-stream",
        )
```
This code sets up a FastAPI endpoint at `/chat/completions`. It streams OpenAI’s response back to the client, wrapping each chunk in Server-Sent Events (SSE). If anything goes wrong, it catches the error and sends that back as an event too.
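The endpoint expects a `ChatRequest` body. Its definition isn’t shown above, but based on the fields described in the next section, a minimal sketch might look like this (the field names and the loose `Dict[str, Any]` types are assumptions; tighten them to match your payload):

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class ChatRequest(BaseModel):
    # Core OpenAI parameters
    model: str
    messages: List[Dict[str, Any]]
    temperature: float = 0.7
    tools: Optional[List[Dict[str, Any]]] = None
    stream: bool = True

    # Extra metadata carried alongside the request (names assumed from the example payload)
    call: Optional[Dict[str, Any]] = None
    phoneNumber: Optional[Dict[str, Any]] = None
    customer: Optional[Dict[str, Any]] = None
    metadata: Optional[Dict[str, Any]] = None
```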
Send a `POST` request to `/chat/completions` with a JSON payload. Here’s the structure of what you need to send:
| Field | Description |
|---|---|
| `model` | The OpenAI model you want to use (e.g., `gpt-3.5-turbo`). |
| `messages` | A list of conversation messages (e.g., user messages + assistant replies). |
| `temperature` | Controls creativity (higher = more random, lower = more focused). |
| `tools` | Optional. Functions the AI can call during the chat. |
| `stream` | Set to `true` for streamed responses. |
| `call` | Metadata about the call (session details). |
| `phoneNumber` | Optional. Phone metadata if relevant. |
| `customer` | Optional. Info about the customer. |
| `metadata` | Any extra context. |
Here’s a concrete example:
```json
{
  "model": "gpt-3.5-turbo",
  "messages": [{"role": "user", "content": "Tell me a joke"}],
  "temperature": 0.7,
  "tools": [],
  "stream": true,
  "call": {
    "id": "12345",
    "orgId": "org001",
    "createdAt": "2024-11-29T10:00:00Z",
    "updatedAt": "2024-11-29T10:00:00Z",
    "type": "query",
    "status": "pending",
    "assistantId": "assist001"
  },
  "metadata": {
    "context": "chat_query"
  }
}
```
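To try it from the client side, any HTTP library that supports streamed reads will do. Here’s a rough sketch using `requests` (the `localhost:8000` URL is an assumption about where you run the server):

```python
import requests

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "temperature": 0.7,
    "stream": True,
}

# Assumes the FastAPI app is running locally, e.g. `uvicorn main:app --port 8000`.
with requests.post(
    "http://localhost:8000/chat/completions", json=payload, stream=True
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line:  # skip the blank lines that separate SSE events
            print(line)
```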
The response comes back in chunks (real-time streaming). Each `data:` event carries a JSON chunk like the one below. Intermediate chunks have `finish_reason` set to `null`; the final chunk sets it to `"stop"`, and the stream ends with `data: [DONE]`.

```json
{
  "id": "chatcmpl-1234",
  "object": "chat.completion.chunk",
  "created": 1732291256,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "delta": {"content": "Why did the chicken cross the road?"},
      "finish_reason": null
    }
  ]
}
```
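If you want the assembled text rather than raw events, strip the `data: ` prefix, stop at `[DONE]`, and concatenate each chunk’s `delta.content`. A small helper along these lines (a sketch, not part of the server code) would work:

```python
import json
from typing import Iterable


def collect_reply(sse_lines: Iterable[str]) -> str:
    """Concatenate the delta.content pieces from a stream of SSE lines."""
    reply = ""
    for line in sse_lines:
        if not line or not line.startswith("data: "):
            continue  # skip blank separators between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel emitted by the server
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            reply += (choice.get("delta") or {}).get("content") or ""
    return reply


# Usage with the streaming request above:
# print(collect_reply(resp.iter_lines(decode_unicode=True)))
```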