EmbText

Overview

CapyDB uses vector embeddings to understand the meaning of text beyond simple keyword matching. By wrapping your text in EmbText, you unlock the ability to query documents based on conceptual and contextual relevance, rather than just literal keyword occurrences.

Key Points:

Automatic Chunking: Large text is split into smaller pieces (chunks) for more efficient embeddings and semantic searches.
Asynchronous Embedding: After the text is stored, embeddings are generated in the background without blocking your application.
Semantic Indexing: The final vector representations are indexed for fast semantic lookups.

Basic Usage

Below is the simplest way to use EmbText:

from capydb import EmbText

# Storing a single text field that you want to embed.
{
  "field_name": EmbText("Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.")
}

This snippet creates an EmbText object containing text to embed. By default, it uses the text-embedding-3-small model and sensible defaults for chunking and overlap.

Customized Usage with Parameters

If you have specific requirements (e.g., a different embedding model or particular chunking strategy), customize EmbText by specifying additional parameters:

from capydb import EmbText, EmbModels

{
    "field_name": EmbText(
        text="Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.",
        emb_model=EmbModels.TEXT_EMBEDDING_3_LARGE,  # Change the default model
        max_chunk_size=200,                          # Configure chunk sizes
        chunk_overlap=20,                            # Overlap between chunks
        is_separator_regex=False,                    # Are separators plain strings or regex?
        separators=[
            "\n\n",
            "\n",
        ],
        keep_separator=False,                        # Keep or remove the separator in chunks
    )
}

After Saving

When a document is saved, CapyDB splits each EmbText into chunks, embeds them, and indexes the resulting vectors while preserving each chunk's link to its source field. It also adds a chunks field to each EmbText so you can read the generated chunks directly.

{
    "field_name": EmbText(
        text="Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.",
        # Added by CapyDB after processing; the chunk boundaries shown here are illustrative.
        chunks=["Alice is a data scientist", "with expertise in AI", "and machine learning.", "She has led several projects", "in natural language processing."],
        emb_model=EmbModels.TEXT_EMBEDDING_3_LARGE,  # Change the default model
        max_chunk_size=200,                          # Configure chunk sizes
        chunk_overlap=20,                            # Overlap between chunks
        is_separator_regex=False,                    # Are separators plain strings or regex?
        separators=[
            "\n\n",
            "\n",
        ],
        keep_separator=False,                        # Keep or remove the separator in chunks
    )
}

Parameter Reference

Parameter	Description
text	The core content for `EmbText`. This text is automatically chunked and embedded for semantic search.
emb_model	Which embedding model to use. Defaults to `text-embedding-3-small`. You can choose from other supported models, such as `text-embedding-3-large`.
max_chunk_size	Maximum character length of each chunk. Larger chunks reduce the total chunk count but can make search results less precise, because each embedding has to represent more text.
chunk_overlap	Overlapping character count between consecutive chunks, useful for preserving context at chunk boundaries.
is_separator_regex	Whether to treat each separator in `separators` as a regular expression. Defaults to `False`.
separators	A list of separator strings (or regex patterns) used to split the text. For instance, `["\n\n", "\n"]` can split paragraphs or single lines.
keep_separator	If `True`, separators remain in the chunked text. If `False`, they are stripped out.
chunks	Auto-generated by the database after the text is processed. It is not set by the user, and is available only after embedding completes.
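
The way separators, is_separator_regex, and keep_separator interact can be illustrated with Python's standard re module. This is a minimal sketch of the general technique, not CapyDB's actual splitter: the helper name split_on_separators is hypothetical, and attaching a kept separator to the end of the preceding piece is an assumption made for illustration. It also assumes regex separators contain no capturing groups.

```python
import re


def split_on_separators(text, separators, is_separator_regex=False, keep_separator=False):
    """Hypothetical helper: split text wherever any separator matches."""
    # Plain-string separators are escaped so characters like "." match literally.
    patterns = separators if is_separator_regex else [re.escape(s) for s in separators]
    # Earlier separators take priority at a given position ("\n\n" before "\n").
    combined = "|".join(f"(?:{p})" for p in patterns)

    if keep_separator:
        # A capturing group makes re.split return the matched separators as well,
        # so the result alternates: piece, separator, piece, separator, ..., piece.
        parts = re.split(f"({combined})", text)
        pieces = []
        for i in range(0, len(parts), 2):
            separator = parts[i + 1] if i + 1 < len(parts) else ""
            pieces.append(parts[i] + separator)
    else:
        pieces = re.split(combined, text)

    # Drop empty pieces produced by leading, trailing, or adjacent separators.
    return [piece for piece in pieces if piece]


sample = "First paragraph.\n\nSecond line.\nThird line."
stripped = split_on_separators(sample, ["\n\n", "\n"])
kept = split_on_separators(sample, ["\n\n", "\n"], keep_separator=True)
by_regex = split_on_separators("a1b22c", [r"\d+"], is_separator_regex=True)
```

With is_separator_regex=False, a separator such as "." only matches a literal period; with is_separator_regex=True, the same string would match any character, which is why the flag matters.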

How It Works: Asynchronous Processing

Whenever you insert a document containing EmbText into CapyDB, three main steps happen asynchronously:

Chunking: The text is divided into chunks based on max_chunk_size, chunk_overlap, and any specified separators, so that each segment is small enough to embed on its own.
Embedding: Each chunk is transformed into a vector representation using the specified emb_model. This step captures the semantic content of the chunk.
Indexing: The embeddings are indexed for efficient semantic search.

Because these steps occur in the background, write operations return immediately, but newly written text may not be searchable until processing finishes.
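
To make the roles of max_chunk_size and chunk_overlap concrete, here is a minimal sketch of character-based chunking with overlap. It is not CapyDB's implementation: the function name chunk_text is hypothetical, and unlike the real chunking step it ignores separators and simply slides a fixed-size window over the text.

```python
def chunk_text(text, max_chunk_size=200, chunk_overlap=20):
    """Hypothetical sketch: fixed-size character windows that overlap."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= chunk_overlap < max_chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and less than max_chunk_size")

    # Each new chunk starts this many characters after the previous one,
    # so consecutive chunks share exactly chunk_overlap characters.
    step = max_chunk_size - chunk_overlap

    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


example_chunks = chunk_text("abcdefghij", max_chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

In this sketch, each chunk repeats the last chunk_overlap characters of the previous one, which is how overlap preserves context across chunk boundaries. Text shorter than max_chunk_size yields a single chunk.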

Querying

Once the embedding and indexing steps are complete, your EmbText fields become searchable with the query operation. See the Query page for details.
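
Conceptually, a semantic lookup compares the vector for the query against the vectors for the stored chunks and returns the closest matches. The sketch below illustrates that idea with cosine similarity over made-up three-dimensional vectors. It does not use or describe CapyDB's query API, and whether CapyDB ranks by cosine similarity is an assumption; the function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunk_vectors):
    """Return (chunk, score) pairs sorted from most to least similar."""
    scored = [
        (chunk, cosine_similarity(query_vector, vector))
        for chunk, vector in chunk_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings; the numbers are invented for illustration.
toy_index = {
    "with expertise in AI": [0.9, 0.1, 0.0],
    "She has led several projects": [0.1, 0.9, 0.2],
}
ranked = rank_chunks([1.0, 0.0, 0.0], toy_index)
best_chunk, best_score = ranked[0]
```

Real embeddings have hundreds or thousands of dimensions, and production systems typically use an approximate nearest-neighbor index rather than scoring every chunk.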

Examples

Accessing Generated Chunks

The chunks attribute is auto-added by the database after the text finishes embedding and indexing. For instance:

emb_text: EmbText  # Assume this EmbText has been inserted and processed

print(emb_text.text)
# "Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing."

print(emb_text.chunks)
# [
#   "Alice is a data scientist",
#   "with expertise in AI",
#   "and machine learning.",
#   "She has led several projects",
#   "in natural language processing."
# ]

Usage in Nested Fields

EmbText can be placed anywhere in your document, including inside nested objects:

{
  "profile": {
    "name": "Bob",
    "bio": EmbText(
      "Bob has over a decade of experience in AI, focusing on neural networks and deep learning."
    )
  }
}
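
When EmbText values sit at different depths, it can help to list where they are before inserting a document. The following is a minimal sketch of a recursive walk that returns dotted paths to every value of a given type. The helper find_field_paths and the FakeEmbText stand-in class are hypothetical and exist only so the example runs without the capydb package; in real code you would pass EmbText as the target type.

```python
from dataclasses import dataclass


@dataclass
class FakeEmbText:
    """Stand-in for capydb.EmbText so this sketch runs without the SDK."""
    text: str


def find_field_paths(value, target_type, prefix=""):
    """Return dotted paths to every value of target_type inside nested dicts and lists."""
    if isinstance(value, target_type):
        return [prefix]

    paths = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{prefix}.{key}" if prefix else str(key)
            paths.extend(find_field_paths(child, target_type, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            child_path = f"{prefix}.{index}" if prefix else str(index)
            paths.extend(find_field_paths(child, target_type, child_path))
    return paths


document = {
    "profile": {
        "name": "Bob",
        "bio": FakeEmbText("Bob has over a decade of experience in AI."),
    }
}
emb_paths = find_field_paths(document, FakeEmbText)
print(emb_paths)
```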
