Bagel's RAG - Vector Embedding

This tutorial demonstrates how to use the Bagel library to create and query vector embeddings using the bagel-text and bagel-multi-modal models. You can also follow along step by step in Google Colab.

Prerequisites

  • Python 3.x

  • Bagel library (install via pip install bagelML)

  • API key from https://bakery.bagel.net/api-key

Client Setup

  1. Import the necessary libraries:

    import bagel
    import os
    from getpass import getpass
  2. Create a Bagel client instance:

    client = bagel.Client()
  3. Set the API key as an environment variable:

    API_KEY = getpass("Enter your API key: ")
    os.environ['BAGEL_API_KEY'] = API_KEY

Creating a Vector Text Embedding Asset

  1. Define the create_asset function for creating a Vector Text Embedding Asset:

    user_id = "<your_user_id>"  # your Bagel user ID, e.g., "345281182308180743453"

    def create_asset(client):
        title = "vector_asset_text"
        description = "Test vector asset created from python client"
        payload = {
            "title": title,
            "dataset_type": "VECTOR",
            "tags": [""],
            "category": "AI",
            "details": description,
            "user_id": user_id,
            "embedding_model": "bagel-text",
            "dimension": 768
        }
        dataset = client.create_asset(payload=payload)
        return dataset

title: The title or name for the dataset, e.g., "Apple".

dataset_type: Defines the type of dataset being created. This could be a RAW, VECTOR, or MODEL dataset. They will be treated differently depending on the type.

tags: A list of optional tags to further classify the asset for easier searching, e.g., ["AI", "DEMO", "TEST"].

category: Classifies the dataset into a specific category such as Cat1 or Cat2 to help with organization and retrieval.

details: Additional descriptive details or notes about the dataset.

user_id: The unique identifier for the user creating the asset, e.g., "345281182308180743453".

embedding_model: Specifies the embedding model to use; bagel-text is the default model for text embeddings.

dimension: The dimensionality of the embedding vectors; bagel-text produces 768-dimensional embeddings.
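
For illustration, a fully populated payload using the example values from the field descriptions above might look like the following sketch (the title, tags, category, and user_id are placeholders):

    payload = {
        "title": "Apple",                     # dataset name
        "dataset_type": "VECTOR",             # RAW, VECTOR, or MODEL
        "tags": ["AI", "DEMO", "TEST"],       # optional search tags
        "category": "Cat2",                   # organizational category
        "details": "Demo vector asset",       # free-form notes
        "user_id": "345281182308180743453",   # replace with your own ID
        "embedding_model": "bagel-text",
        "dimension": 768
    }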

  2. Call the create_asset function to create a new vector text embedding asset:

    asset_id = create_asset(client)
    print(asset_id)

Adding Text Embeddings

Add text embeddings to the asset:

    payload = {
        'source': "google", 
        'documents': "This is google"
    }
    client.add_text(asset_id, payload)

    payload = {
        'source': "facebook",
        'documents': "This is facebook"
    }
    client.add_text(asset_id, payload)

source: A label identifying where the text comes from, e.g., "google".

documents: The text content to embed, e.g., "This is google".
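
If you have several documents to embed, you can loop over source/text pairs and call add_text for each one. This sketch reuses the call shown above; the "amazon" entry is illustrative:

    # Source/text pairs to embed (the "amazon" pair is illustrative)
    documents = {
        "google": "This is google",
        "facebook": "This is facebook",
        "amazon": "This is amazon",
    }
    for source, text in documents.items():
        client.add_text(asset_id, {"source": source, "documents": text})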

Querying from Vector Embedding

Query the vector embedding asset:

    payload = {
        "where": {},
        "where_document": {},
        "n_results": 2,
        "include": [
            "metadatas",
            "documents", 
            "distances"
        ],
        "query_texts": ["google"],
        "padding": False
    }
    response = client.query_asset(asset_id, payload)
    print(response['documents'])

where: A filter to specify conditions on metadata. For example, this can limit the search to certain categories like {"category": "Cat2"}.

where_document: Specifies conditions related to the document itself, such as whether the document is published, e.g., {"is_published": True}.

n_results: The number of results to return, e.g., 2.

include: A list of fields to include in the result, such as metadatas, documents, or distances for contextual information related to each result.

query_texts: The search term(s) used to query the asset, e.g., ["google"].

padding: A Boolean flag indicating whether to pad the results to a fixed size. This could be useful for ensuring uniformity in certain result sets, e.g., False.
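
As an example, a query that filters on metadata might look like the sketch below. The category filter value is illustrative, and the result loop assumes the same nested-list response layout used in the image query later in this tutorial:

    payload = {
        "where": {"category": "Cat2"},   # only match items with this metadata
        "where_document": {},
        "n_results": 2,
        "include": ["metadatas", "documents", "distances"],
        "query_texts": ["search engines"],
        "padding": False
    }
    response = client.query_asset(asset_id, payload)
    # Print each matched document with its distance score
    for doc, dist in zip(response['documents'][0], response['distances'][0]):
        print(dist, doc)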

Creating a Vector Multi-Modal Embedding Asset

  1. Define the create_asset function with the bagel-multi-modal embedding model:

    def create_asset(client):
        title = "vector_asset_multi_modal"
        description = "Test vector asset created from python client"
        payload = {
            "title": title,
            "dataset_type": "VECTOR",
            "tags": [""],
            "category": "AI",
            "details": description,
            "user_id": user_id,  # the user ID defined earlier
            "embedding_model": "bagel-multi-modal",
            "dimension": 1408
        }
        dataset = client.create_asset(payload=payload)
        return dataset

title: The title or name for the dataset, e.g., "Apple".

dataset_type: Defines the type of dataset being created. This could be a RAW, VECTOR, or MODEL dataset. They will be treated differently depending on the type.

tags: A list of optional tags to further classify the asset for easier searching, e.g., ["AI", "DEMO", "TEST"].

category: Classifies the dataset into a specific category such as Cat1 or Cat2 to help with organization and retrieval.

details: Additional descriptive details or notes about the dataset.

user_id: The unique identifier for the user creating the asset, e.g., "345281182308180743453".

embedding_model: Specifies the embedding model to use; bagel-multi-modal handles both text and image inputs.

dimension: The dimensionality of the embedding vectors; bagel-multi-modal produces 1408-dimensional embeddings.

You can find the created VECTOR asset on the Bakery. Simply log in and search for its title under My Datasets.

  2. Call the create_asset function to create a new vector multi-modal embedding asset:

    asset_id = create_asset(client)
    print(asset_id)

Adding Image Embeddings

Add image embeddings to the asset:

    # image 01
    file_path = "/path/to/image1"
    client.add_image(asset_id, file_path)

    # image 02
    file_path = "/path/to/image2"
    client.add_image(asset_id, file_path)

file_path: This is the path to the image you want to embed.
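
To embed a whole folder of images, you can loop over the directory contents; the directory path below is a placeholder:

    import os

    image_dir = "/path/to/images"  # placeholder directory of images
    for name in os.listdir(image_dir):
        if name.lower().endswith((".png", ".jpg", ".jpeg")):
            client.add_image(asset_id, os.path.join(image_dir, name))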

Querying for Images

Query Search

  1. Query the vector multi-modal embedding asset:

    payload = {
        "where": {},
        "n_results": 1,
        "include": [
            "metadatas",
            "distances"
        ],
        "query_texts": ["cat"],
        "padding": False
    }
    response = client.query_asset(asset_id, payload)
    # The matched image is returned as a base64 string in the result metadata
    print(response['metadatas'][0][0]['bagel:image'])

Converting Base64 to Image Object

  1. Define the load_and_display_image_from_base64 function:

    import base64
    from io import BytesIO
    from PIL import Image
    from IPython.display import display
    
    def load_and_display_image_from_base64(base64_string):
        # Strip a "data:image/...;base64," prefix if present
        if ',' in base64_string:
            base64_string = base64_string.split(',')[1]
        # Decode the base64 payload and open it as a PIL image
        img_data = base64.b64decode(base64_string)
        img = Image.open(BytesIO(img_data))
        display(img)
        return img
  2. Convert the base64 image string to an image object:

    base64_image = response['metadatas'][0][0]['bagel:image']
    image = load_and_display_image_from_base64(base64_image)
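
Because the function returns a PIL Image object, you can also save the retrieved image to disk; the filename here is a placeholder:

    image.save("query_result.png")  # placeholder output filename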

This tutorial covered the basic steps for creating vector embedding assets, adding text and image embeddings, and querying those assets with the bagel-text and bagel-multi-modal models. Follow these steps to get started with RAG on Bagel and explore the capabilities of vector embeddings.

Need extra help or just want to connect with other developers? Join our community on Discord for additional support 👾
