Chat
Our models can be accessed through the chat API. This page gives an introduction to using the chat API via the Python SDK.
Single Turn Prompting
A simple single turn request can be made as follows:
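The snippet below is a minimal sketch. The import path, client class, and chat method are illustrative assumptions rather than the SDK's exact names; see the API reference for the real ones.

```python
# Hypothetical SDK names: the import path, ChatClient, and chat() are
# assumptions for illustration; consult the SDK reference for the exact API.
from chat_sdk import ChatClient  # hypothetical import path

client = ChatClient(api_key="YOUR_API_KEY")

response = client.chat(
    model="model-name",  # see Available Models for valid names
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response)
```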
This will return a response like:
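(The response shape below is illustrative; the exact fields are defined by the SDK.)

```
ChatResponse(
    message=ChatMessage(role="assistant", content="The capital of France is Paris."),
    finish_reason="stop",
)
```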
See Available Models for details on valid model names.
Multiple Turn Conversations
You can request a response to a multiple turn conversation by adding more messages in the history. For example:
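Reusing the hypothetical client from above, a three-message history might look like this (the role/content message format is likewise an assumption):

```python
# Append earlier turns to the history; the last message is the new user turn.
response = client.chat(
    model="model-name",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "What is its population?"},
    ],
)
print(response)
```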
This will return a response like:
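```
ChatResponse(
    message=ChatMessage(role="assistant", content="Paris has a population of roughly 2.1 million."),
    finish_reason="stop",
)
```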
Assistant Completions
We support guiding the assistant's output (e.g. prompting it to produce a structured JSON response) by allowing the developer to specify how the assistant response should start. This is done by adding a partial assistant response as the last message:
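A sketch, again with hypothetical client and method names:

```python
# The partial assistant message at the end of the history constrains how
# the reply begins; here it nudges the model toward a JSON object.
response = client.chat(
    model="model-name",
    messages=[
        {"role": "user", "content": "List the three largest cities in France as JSON."},
        {"role": "assistant", "content": '{"cities": ['},  # partial assistant response
    ],
)
print(response.message.content)
```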
This will output:
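(Illustrative; the generated content will vary, and whether the returned text repeats the supplied prefix depends on the API.)

```
{"cities": ["Paris", "Marseille", "Lyon"]}
```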
Useful Parameters
The parameters of the chat API are fully documented in the API reference, but some particularly useful parameters are listed below:
- temperature: Typically between 0 and 1. Values close to 0 will result in less varied generations, and higher values will result in more variation and creativity.
- max_tokens: The maximum number of tokens that should be returned. Increase this if generations are being truncated, i.e. the finish_reason in the response is "length".
- stop: A list of strings that should stop the generation. This can be used to stop after generating a code block, when reaching a certain number in a list, etc. All three parameters are combined in the sketch below.
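As before, the client and method names in this sketch are assumptions; the three parameters are the ones described above.

```python
# Low temperature for predictable output, a modest token budget, and a
# stop string that ends generation once the count reaches "10".
response = client.chat(
    model="model-name",
    messages=[{"role": "user", "content": "Count upwards from 1, one number per line."}],
    temperature=0.2,   # close to 0: less varied generations
    max_tokens=256,    # raise this if finish_reason is "length"
    stop=["10"],       # generation halts when this string is produced
)
```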
Streaming
The chat API supports streaming with the chat_stream function in the Python SDK, or by setting stream to true in the HTTP API. Below is an example of streaming in Python:
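A sketch using the SDK's chat_stream function; the client construction and the chunk's field names are illustrative assumptions.

```python
# chat_stream yields the response incrementally as it is generated.
# The chunk object's fields (here .content) are illustrative assumptions.
for chunk in client.chat_stream(
    model="model-name",
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
):
    print(chunk.content, end="", flush=True)
print()
```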
Async
The Python SDK also exports an async client so that you can make non-blocking calls to our API. This can be useful for making batch requests. The following code illustrates how to batch calls to the API by creating a list of async tasks and gathering them with asyncio.gather. The Semaphore limits the number of concurrent requests to the API.
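A sketch of this pattern (the async client class and its chat method are illustrative assumptions; asyncio.gather and Semaphore are from the standard library):

```python
import asyncio

from chat_sdk import AsyncChatClient  # hypothetical import path

async def main() -> None:
    # Hypothetical async client; class and method names are assumptions.
    client = AsyncChatClient(api_key="YOUR_API_KEY")
    semaphore = asyncio.Semaphore(5)  # at most 5 requests in flight at once

    async def ask(question: str):
        async with semaphore:
            return await client.chat(
                model="model-name",
                messages=[{"role": "user", "content": question}],
            )

    # One task per question; gather waits for all of them to complete.
    questions = [f"What is {n} squared?" for n in range(1, 11)]
    responses = await asyncio.gather(*(ask(q) for q in questions))
    for response in responses:
        print(response)

asyncio.run(main())
```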