EmbJSON
EmbText

EmbText

Overview

CapybaraDB uses vector embeddings to understand the meaning of text beyond simple keyword matching. By wrapping your text in EmbText, you unlock the ability to query documents based on conceptual and contextual relevance, rather than just literal keyword occurrences.

Key Points:

  • Automatic Chunking: Large text is split into smaller pieces (chunks) for more efficient embeddings and semantic searches.
  • Asynchronous Embedding: After the text is stored, embeddings are generated in the background without blocking your application.
  • Semantic Indexing: The final vector representations are indexed for fast semantic lookups.

Basic Usage

Below is the simplest way to use EmbText:

from capybaradb import EmbText
 
# Storing a single text field that you want to embed.
{
  "field_name": EmbText("Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.")
}

This snippet creates an EmbText object containing "text_to_embed". By default, it uses the text-embedding-3-small model and sensible defaults for chunking and overlap.


Customized Usage with Parameters

If you have specific requirements (e.g., a different embedding model or particular chunking strategy), customize EmbText by specifying additional parameters:

from capybaradb import EmbText, EmbModels
 
{
    "field_name": EmbText(
        text="Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.",
        emb_model=EmbModels.TEXT_EMBEDDING_3_LARGE,  # Change the default model
        max_chunk_size=200,                          # Configure chunk sizes
        chunk_overlap=20,                            # Overlap between chunks
        is_separator_regex=False,                    # Are separators plain strings or regex?
        separators=[
            "\n\n",
            "\n",
        ],
        keep_separator=False,                        # Keep or remove the separator in chunks
    )
}

After Saving

CapybaraDB saves data by splitting each EmbText into chunks, embedding them, and indexing while preserving their relationships with vector data. It also automatically adds a 'chunks' field to each EmbText for seamless access.

{
    "field_name": EmbText(
        text="Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.",
        chunks=["Alice is a data scientist", "with expertise in AI", "and machine learning.", "She has led several projects", "in natural language processing."],
        emb_model=EmbModels.TEXT_EMBEDDING_3_LARGE,  # Change the default model
        max_chunk_size=200,                          # Configure chunk sizes
        chunk_overlap=20,                            # Overlap between chunks
        is_separator_regex=False,                    # Are separators plain strings or regex?
        separators=[
            "\n\n",
            "\n",
        ],
        keep_separator=False,                        # Keep or remove the separator in chunks
    )
}

Parameter Reference

ParameterDescription
textThe core content for EmbText. This text is automatically chunked and embedded for semantic search.
emb_modelWhich embedding model to use. Defaults to text-embedding-3-small. You can choose from other supported models, such as text-embedding-3-large.
max_chunk_sizeMaximum character length of each chunk. Larger chunks reduce the total chunk count but may reduce search efficiency (due to bigger embeddings).
chunk_overlapOverlapping character count between consecutive chunks, useful for preserving context at chunk boundaries.
is_separator_regexWhether to treat each separator in separators as a regular expression. Defaults to False.
separatorsA list of separator strings (or regex patterns) used to split the text. For instance, ["\n\n", "\n"] can split paragraphs or single lines.
keep_separatorIf True, separators remain in the chunked text. If False, they are stripped out.
chunksAuto-generated by the database after the text is processed. It is not set by the user, and is available only after embedding completes.

How It Works: Asynchronous Processing

Whenever you insert a document containing EmbText into CapybaraDB, three main steps happen asynchronously:

  1. Chunking
    The text is divided into chunks based on max_chunk_size, chunk_overlap, and any specified separators. This ensures the text is broken down into optimally sized segments.

  2. Embedding
    Each chunk is transformed into a vector representation using the specified emb_model. This step captures the semantic essence of the text.

  3. Indexing
    The embeddings are indexed for efficient semantic search. Because these steps occur in the background, you get immediate responses to your write operations, but actual query availability may lag slightly behind the write.


Querying

Once the Embedding and Indexing steps are complete, your EmbText fields become searchable. To do this, use 'query' operation. Please also refer Query page

Examples

Accessing Generated Chunks

The chunks attribute is auto-added by the database after the text finishes embedding and indexing. For instance:

emb_text: EmbText  # Assume this EmbText has been inserted and processed
 
print(emb_text.text)
# "Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing."
 
print(emb_text.chunks)
# [
#   "Alice is a data scientist",
#   "with expertise in AI",
#   "and machine learning.",
#   "She has led several projects",
#   "in natural language processing."
# ]

Usage in Nested Fields

EmbText can be embedded anywhere in your document, including nested objects:

{
  "profile": {
    "name": "Bob",
    "bio": EmbText(
      "Bob has over a decade of experience in AI, focusing on neural networks and deep learning."
    )
  }
}

How can we improve this documentation?

Got question? Email us