Guide: Retrieval

Custom knowledge can be injected into Reka models using our built-in retrieval feature. Retrieval augments the chat API with relevant knowledge retrieved from text datasets that you have uploaded and indexed in advance. We have jointly optimized the models, retrieval algorithms, and prompting strategy to effectively use the knowledge present in the dataset when formulating responses.

How it works:

  1. Upload a dataset using reka.add_dataset()

  2. Prepare retrieval using reka.prepare_retrieval()

  3. Use reka.chat() with the retrieval_dataset argument

Internally, the chat call uses hybrid vector and keyword search to retrieve relevant snippets from the dataset and injects them into the model’s input.
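To make the idea of hybrid retrieval concrete, here is a toy sketch. It is not Reka's implementation: the bag-of-words vectors stand in for learned embeddings, and the function names, the `alpha` weighting, and the prompt layout are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_overlap(query_tokens, chunk_tokens):
    """Fraction of distinct query tokens that also appear in the chunk."""
    unique_query = set(query_tokens)
    if not unique_query:
        return 0.0
    return len(unique_query & set(chunk_tokens)) / len(unique_query)


def hybrid_retrieve(query, chunks, top_k=2, alpha=0.5):
    """Score each chunk with a weighted mix of vector and keyword scores.

    alpha is a hypothetical weighting parameter: 1.0 means vector score
    only, 0.0 means keyword score only.
    """
    query_tokens = tokenize(query)
    query_vector = Counter(query_tokens)
    scored = []
    for chunk in chunks:
        chunk_tokens = tokenize(chunk)
        vector_score = cosine_similarity(query_vector, Counter(chunk_tokens))
        keyword_score = keyword_overlap(query_tokens, chunk_tokens)
        combined = alpha * vector_score + (1 - alpha) * keyword_score
        scored.append((combined, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, retrieved):
    """Place the retrieved snippets ahead of the question (assumed layout)."""
    context = "\n\n".join(chunk for _, chunk in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

A production system would typically use dense embeddings and a proper keyword index, but the flow is the same: score, rank, then inject the top snippets into the model input.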

Dataset format

We support datasets in UTF-8 encoded text format. A dataset can be either:

  • a single text file

  • a zip file containing multiple text files (and only text files) in any directory structure

In the case of a zip file, the internal file names may be presented to the model along with the retrieved snippets, so choose them carefully. For example, if each text file is a document from Wikipedia, you could name each file after its page title. This helps contextualize a snippet once it has been pulled out of its document.
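One way to produce such a zip file is with Python's standard `zipfile` module. The sketch below is only an illustration: `build_dataset_zip` and its arguments are hypothetical names, and replacing path separators in titles is an assumption made to keep each document at the top level of the archive.

```python
import zipfile


def build_dataset_zip(documents, zip_path):
    """Write each (title, text) pair as a UTF-8 .txt file inside a zip.

    documents: dict mapping a descriptive title to the document text.
    zip_path:  where to write the archive.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for title, text in documents.items():
            # Slashes in a title would otherwise create nested directories.
            safe_title = title.replace("/", "_").replace("\\", "_")
            # Encode explicitly so the stored bytes are always UTF-8.
            archive.writestr(f"{safe_title}.txt", text.encode("utf-8"))
    return zip_path
```

The resulting path can then be passed as the `filepath` argument of `reka.add_dataset()`.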

End-to-end guide

This guide demonstrates how to upload a dataset, prepare it for retrieval, and then retrieve from the dataset with the chat endpoint.

Upload dataset

For this guide, we will retrieve over the full text of Alice’s Adventures in Wonderland. This guide will assume you have alice_in_wonderland.txt downloaded and accessible in the working directory.
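Because datasets must be UTF-8 encoded text, it can be worth checking the file locally before uploading. The helper below is a minimal sketch using only the standard library; `check_utf8` is a hypothetical name and is not part of the Reka client.

```python
from pathlib import Path


def check_utf8(path):
    """Return (True, "") if the file decodes as UTF-8, else (False, reason)."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that failed to decode.
        return False, f"invalid UTF-8 at byte offset {exc.start}"
    return True, ""
```

For example, `check_utf8("alice_in_wonderland.txt")` returns a flag and a short reason that you can inspect before calling the upload step.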

Upload the dataset using reka.add_dataset():

import reka

reka.API_KEY = "yourapikey"

add_dataset_response = reka.add_dataset(
    filepath="alice_in_wonderland.txt",
    name="alice-in-wonderland",
    description="Alice's Adventures in Wonderland",
)

print(add_dataset_response)  # {'name': 'alice-in-wonderland', 'ok': True, 'info': ''}
dataset_name = add_dataset_response["name"]

At any time, you can view your currently available datasets using reka.list_datasets():

print(reka.list_datasets())  # ['alice-in-wonderland']

Prepare retrieval

Before we can use the dataset for retrieval, we must submit a prepare retrieval job using reka.prepare_retrieval(). This splits the dataset into meaningful chunks, and indexes them in a hybrid vector / keyword database. For typical datasets, this should take less than a minute:

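import time

# Submit the preparation job; the returned ID is used to poll for progress.
job_id = reka.prepare_retrieval(dataset_name=dataset_name)
print(job_id)

# Poll every 5 seconds until the job reports that it is done.
while True:
    status = reka.retrieval_job_status(job_id)
    if status.is_done():
        break

    if status.is_running():
        print("Job is running...")
    else:
        print("Job is still pending...")

    time.sleep(5)

# Done does not necessarily mean successful, so check the final status.
print(f"Job finished with status: {status.job_status.name}")  # Job finished with status: COMPLETE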
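The loop above polls indefinitely. If you would rather give up after a fixed time, the polling can be wrapped in a helper with a deadline. This is a sketch under assumptions: `wait_for_job` and its parameters are hypothetical names, and it relies only on the status object exposing `is_done()` as in the example above.

```python
import time


def wait_for_job(get_status, timeout_s=300.0, poll_interval_s=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the job is done or the timeout expires.

    get_status: zero-argument callable returning an object with is_done().
    sleep, clock: injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status.is_done():
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job not done after {timeout_s} seconds")
        sleep(poll_interval_s)


# Usage with the client from this guide (not executed here):
# status = wait_for_job(lambda: reka.retrieval_job_status(job_id))
# print(status.job_status.name)
```

Passing the status lookup as a callable keeps the helper independent of the client library, which also makes it easy to test with a stub.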

Use the retrieval dataset

You can use the retrieval_dataset argument in reka.chat() or reka.vlm_chat() to retrieve from the dataset during response generation. The example below asks a question that the model may not answer reliably without retrieval:

response = reka.chat(
    "Who is 'Dinah' in Alice in Wonderland?",
    conversation_history=[],
    retrieval_dataset=dataset_name,
    temperature=0.6,
)
print(response["text"])

This should output something like:

In Lewis Carroll’s “Alice’s Adventures in Wonderland,” Dinah is the name of Alice’s pet cat. Dinah is a white cat with golden eyes and a long tail. She is described as being very good at catching mice and birds. In the story, Dinah is often mentioned by Alice and is a beloved pet of hers.

The full response contains information about the chunks that were retrieved when generating the response:

{
  "type": "model",
  "text": "In Lewis Carroll's \"Alice's Adventures in Wonderland,\" Dinah is the name of Alice's pet cat. Dinah is a white cat with golden eyes and a long tail. She is described as being very good at catching mice and birds. In the story, Dinah is often mentioned by Alice and is a beloved pet of hers.",
  "finish_reason": "stop",
  "retrieved_chunks": [
    {
      "text": "`And who is Dinah, if I might venture to ask the question?' said the Lory.\n\nAlice replied eagerly, for she was always ready to talk about her pet:  `Dinah's our cat.  And she's such a capital one for catching mice you can't think!  And oh, I wish you could see her after the birds!  Why, she'll eat a little bird as soon as look at it!'\n\n",
      "sourceDocument": "alice-in-wonderland",
      "chunkIndex": 102,
      "isNegative": false,
      "score": 0.89663236
    },
    {
      "text": "`How queer it seems,' Alice said to herself, `to be going messages for a rabbit!  I suppose Dinah'll be sending me on messages next!'  And she began fancying the sort of thing that would happen:  `\"Miss Alice!  Come here directly, and get ready for your walk!\" \"Coming in a minute, nurse!  But I've got to see that the mouse doesn't get out.\"  Only I don't think,' Alice went on, `that they'd let Dinah stop in the house if it began ordering people about like that!'\n\n",
      "sourceDocument": "alice-in-wonderland",
      "chunkIndex": 111,
      "isNegative": false,
      "score": 0.83404374
    },
    {
      "text": "So Alice began telling them her adventures from the time when she first saw the White Rabbit.  She was a little nervous about it just at first, the two creatures got so close to her, one on each side, and opened their eyes and mouths so VERY wide, but she gained courage as she went on.  Her listeners were perfectly quiet till she got to the part about her repeating `YOU ARE OLD, FATHER WILLIAM,' to the Caterpillar, and the words all coming different, and then the Mock Turtle drew a long breath, and said `That's very curious.'\n\n",
      "sourceDocument": "alice-in-wonderland",
      "chunkIndex": 454,
      "isNegative": false,
      "score": 0.82579088
    }
  ]
}
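The retrieved chunks can be inspected to see which passages supported an answer. The sketch below assumes the response is a dict shaped like the sample above, using the field names `retrieved_chunks`, `score`, `sourceDocument`, `chunkIndex`, and `text`. The helper names and the `min_score` threshold are hypothetical, and the sample texts are trimmed.

```python
def top_chunks(response, min_score=0.0):
    """Return retrieved chunks with score >= min_score, best first."""
    chunks = response.get("retrieved_chunks", [])
    kept = [c for c in chunks if c.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda c: c.get("score", 0.0), reverse=True)


def format_sources(chunks, preview_chars=40):
    """Build one summary line per chunk: source, index, score, preview."""
    lines = []
    for chunk in chunks:
        # Collapse newlines and repeated spaces before truncating.
        preview = " ".join(chunk["text"].split())[:preview_chars]
        lines.append(
            f'{chunk["sourceDocument"]} [chunk {chunk["chunkIndex"]}] '
            f'score={chunk["score"]:.3f}: {preview}'
        )
    return lines


# Trimmed copy of the sample response shown above.
sample_response = {
    "text": "Dinah is the name of Alice's pet cat.",
    "retrieved_chunks": [
        {"text": "`How queer it seems,' Alice said to herself",
         "sourceDocument": "alice-in-wonderland", "chunkIndex": 111,
         "isNegative": False, "score": 0.83404374},
        {"text": "`And who is Dinah, if I might venture to ask",
         "sourceDocument": "alice-in-wonderland", "chunkIndex": 102,
         "isNegative": False, "score": 0.89663236},
        {"text": "So Alice began telling them her adventures",
         "sourceDocument": "alice-in-wonderland", "chunkIndex": 454,
         "isNegative": False, "score": 0.82579088},
    ],
}

for line in format_sources(top_chunks(sample_response, min_score=0.83)):
    print(line)
```

In the sample, the third chunk does not mention Dinah at all, so a score threshold or a manual review of the sources can help when you want to show citations to users.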