This server lets you send requests to OpenAI’s models (like `gpt-3.5-turbo`) and get back streamed responses in real time. It’s perfect if you want to build a chatbot, automate customer support, or add AI to your app.
The whole point? Take a structured request, process it with OpenAI’s API, and send back the result in bite-sized chunks so you can display or process it as it’s happening.
Here’s the main piece of code:
```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import NOT_GIVEN, AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


@app.post("/chat/completions")
async def chat_completion_stream(vapi_payload: ChatRequest):
    # ChatRequest is the Pydantic request model (a sketch is shown below).
    try:
        # Forward the request to OpenAI with streaming enabled.
        response = await client.chat.completions.create(
            model=vapi_payload.model,
            messages=vapi_payload.messages,
            temperature=vapi_payload.temperature,
            # Omit tools entirely when none are supplied; the API rejects an empty list.
            tools=vapi_payload.tools or NOT_GIVEN,
            stream=True,
        )

        async def event_stream():
            try:
                # Relay each chunk to the client as a Server-Sent Event.
                async for chunk in response:
                    yield f"data: {json.dumps(chunk.model_dump())}\n\n"
                yield "data: [DONE]\n\n"
            except Exception as e:
                print(f"Error during response streaming: {e}")
                yield f"data: {json.dumps({'error': str(e)})}\n\n"

        return StreamingResponse(event_stream(), media_type="text/event-stream")
    except Exception as e:
        # If the OpenAI request fails outright, return the error as a single SSE event.
        return StreamingResponse(
            iter([f"data: {json.dumps({'error': str(e)})}\n\n"]),
            media_type="text/event-stream",
        )
```
This code sets up a FastAPI endpoint at `/chat/completions`. It streams OpenAI’s response back to the client, wrapping each chunk in Server-Sent Events (SSE). If anything goes wrong, it catches the error and sends that back as an event too.
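The endpoint expects a `ChatRequest` body. Its definition isn’t shown above, but based on the fields described in the next section, a minimal sketch might look like this (the field names and the loose `Dict[str, Any]` types are assumptions; tighten them to match your payload):

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class ChatRequest(BaseModel):
    # Core OpenAI parameters
    model: str
    messages: List[Dict[str, Any]]
    temperature: float = 0.7
    tools: Optional[List[Dict[str, Any]]] = None
    stream: bool = True

    # Extra metadata carried alongside the request (names assumed from the example payload)
    call: Optional[Dict[str, Any]] = None
    phoneNumber: Optional[Dict[str, Any]] = None
    customer: Optional[Dict[str, Any]] = None
    metadata: Optional[Dict[str, Any]] = None
```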
Send a `POST` request to `/chat/completions` with a JSON payload. Here’s the structure of what you need to send:
| Field | Description |
|---|---|
| `model` | The OpenAI model you want to use (e.g., `gpt-3.5-turbo`). |
| `messages` | A list of conversation messages (e.g., user messages + assistant replies). |
| `temperature` | Controls creativity (higher = more random, lower = more focused). |
| `tools` | Optional. Functions the AI can call during the chat. |
| `stream` | Set to `true` for streamed responses. |
| `call` | Metadata about the call (session details). |
| `phoneNumber` | Optional. Phone metadata if relevant. |
| `customer` | Optional. Info about the customer. |
| `metadata` | Any extra context. |
Here’s a concrete example:
```json
{
  "model": "gpt-3.5-turbo",
  "messages": [{"role": "user", "content": "Tell me a joke"}],
  "temperature": 0.7,
  "tools": [],
  "stream": true,
  "call": {
    "id": "12345",
    "orgId": "org001",
    "createdAt": "2024-11-29T10:00:00Z",
    "updatedAt": "2024-11-29T10:00:00Z",
    "type": "query",
    "status": "pending",
    "assistantId": "assist001"
  },
  "metadata": {
    "context": "chat_query"
  }
}
```
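To try it from the client side, any HTTP library that supports streamed reads will do. Here’s a rough sketch using `requests` (the `localhost:8000` URL is an assumption about where you run the server):

```python
import requests

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "temperature": 0.7,
    "stream": True,
}

# Assumes the FastAPI app is running locally, e.g. `uvicorn main:app --port 8000`.
with requests.post(
    "http://localhost:8000/chat/completions", json=payload, stream=True
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line:  # skip the blank lines that separate SSE events
            print(line)
```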
The response comes back in chunks (real-time streaming). Each `data:` event carries a JSON chunk like the one below. Intermediate chunks have `finish_reason` set to `null`; the final chunk sets it to `"stop"`, and the stream ends with `data: [DONE]`.

```json
{
  "id": "chatcmpl-1234",
  "object": "chat.completion.chunk",
  "created": 1732291256,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "delta": {"content": "Why did the chicken cross the road?"},
      "finish_reason": null
    }
  ]
}
```
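If you want the assembled text rather than raw events, strip the `data: ` prefix, stop at `[DONE]`, and concatenate each chunk’s `delta.content`. A small helper along these lines (a sketch, not part of the server code) would work:

```python
import json
from typing import Iterable


def collect_reply(sse_lines: Iterable[str]) -> str:
    """Concatenate the delta.content pieces from a stream of SSE lines."""
    reply = ""
    for line in sse_lines:
        if not line or not line.startswith("data: "):
            continue  # skip blank separators between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel emitted by the server
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            reply += (choice.get("delta") or {}).get("content") or ""
    return reply


# Usage with the streaming request above:
# print(collect_reply(resp.iter_lines(decode_unicode=True)))
```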