EmbText
Overview
CapybaraDB uses vector embeddings to understand the meaning of text beyond simple keyword matching. By wrapping your text in EmbText
, you unlock the ability to query documents based on conceptual and contextual relevance, rather than just literal keyword occurrences.
Key Points:
- Automatic Chunking: Large text is split into smaller pieces (chunks) for more efficient embeddings and semantic searches.
- Asynchronous Embedding: After the text is stored, embeddings are generated in the background without blocking your application.
- Semantic Indexing: The final vector representations are indexed for fast semantic lookups.
Basic Usage
Below is the simplest way to use EmbText
:
from capybaradb import EmbText
# Storing a single text field that you want to embed.
{
"field_name": EmbText("Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.")
}
This snippet creates an EmbText
object containing "text_to_embed"
. By default, it uses the text-embedding-3-small
model and sensible defaults for chunking and overlap.
Customized Usage with Parameters
If you have specific requirements (e.g., a different embedding model or particular chunking strategy), customize EmbText
by specifying additional parameters:
from capybaradb import EmbText, EmbModels
{
"field_name": EmbText(
text="Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.",
emb_model=EmbModels.TEXT_EMBEDDING_3_LARGE, # Change the default model
max_chunk_size=200, # Configure chunk sizes
chunk_overlap=20, # Overlap between chunks
is_separator_regex=False, # Are separators plain strings or regex?
separators=[
"\n\n",
"\n",
],
keep_separator=False, # Keep or remove the separator in chunks
)
}
After Saving
CapybaraDB saves data by splitting each EmbText into chunks, embedding them, and indexing while preserving their relationships with vector data. It also automatically adds a 'chunks' field to each EmbText for seamless access.
{
"field_name": EmbText(
text="Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.",
chunks=["Alice is a data scientist", "with expertise in AI", "and machine learning.", "She has led several projects", "in natural language processing."],
emb_model=EmbModels.TEXT_EMBEDDING_3_LARGE, # Change the default model
max_chunk_size=200, # Configure chunk sizes
chunk_overlap=20, # Overlap between chunks
is_separator_regex=False, # Are separators plain strings or regex?
separators=[
"\n\n",
"\n",
],
keep_separator=False, # Keep or remove the separator in chunks
)
}
Parameter Reference
Parameter | Description |
---|---|
text | The core content for EmbText . This text is automatically chunked and embedded for semantic search. |
emb_model | Which embedding model to use. Defaults to text-embedding-3-small . You can choose from other supported models, such as text-embedding-3-large . |
max_chunk_size | Maximum character length of each chunk. Larger chunks reduce the total chunk count but may reduce search efficiency (due to bigger embeddings). |
chunk_overlap | Overlapping character count between consecutive chunks, useful for preserving context at chunk boundaries. |
is_separator_regex | Whether to treat each separator in separators as a regular expression. Defaults to False . |
separators | A list of separator strings (or regex patterns) used to split the text. For instance, ["\n\n", "\n"] can split paragraphs or single lines. |
keep_separator | If True , separators remain in the chunked text. If False , they are stripped out. |
chunks | Auto-generated by the database after the text is processed. It is not set by the user, and is available only after embedding completes. |
How It Works: Asynchronous Processing
Whenever you insert a document containing EmbText
into CapybaraDB, three main steps happen asynchronously:
-
Chunking
The text is divided into chunks based onmax_chunk_size
,chunk_overlap
, and any specifiedseparators
. This ensures the text is broken down into optimally sized segments. -
Embedding
Each chunk is transformed into a vector representation using the specifiedemb_model
. This step captures the semantic essence of the text. -
Indexing
The embeddings are indexed for efficient semantic search. Because these steps occur in the background, you get immediate responses to your write operations, but actual query availability may lag slightly behind the write.
Querying
Once the Embedding and Indexing steps are complete, your EmbText
fields become searchable. To do this, use 'query' operation. Please also refer Query page
Examples
Accessing Generated Chunks
The chunks
attribute is auto-added by the database after the text finishes embedding and indexing. For instance:
emb_text: EmbText # Assume this EmbText has been inserted and processed
print(emb_text.text)
# "Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing."
print(emb_text.chunks)
# [
# "Alice is a data scientist",
# "with expertise in AI",
# "and machine learning.",
# "She has led several projects",
# "in natural language processing."
# ]
Usage in Nested Fields
EmbText
can be embedded anywhere in your document, including nested objects:
{
"profile": {
"name": "Bob",
"bio": EmbText(
"Bob has over a decade of experience in AI, focusing on neural networks and deep learning."
)
}
}