EmbText
EmbText is one of the EmbJSON data types supported by CapybaraDB, designed for handling text that requires semantic embedding. It allows you to store text in a structured format where it can be automatically embedded and indexed for efficient semantic search.
Basic Structure of EmbText
The structure of EmbText is simple and standardized, ensuring that the text is properly embedded and indexed:
{
  "@embText": {
    "text": "text to embed",
    "emb_model": "text-embedding-3-small", // Optional: if not provided, the default model 'text-embedding-3-small' is used
    "chunks": ["text to", "embed"] // Auto-generated by the database
  }
}
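As a minimal sketch, the structure above can be built as a plain Python dictionary before a document is inserted. The helper name make_emb_text is hypothetical and is not part of any CapybaraDB SDK; it only mirrors the JSON shape shown above and leaves chunks out, since that field is written by the database.

```python
from typing import Any, Dict, Optional


def make_emb_text(text: str, emb_model: Optional[str] = None) -> Dict[str, Any]:
    """Build an EmbText value as a plain dict (hypothetical helper, not an SDK call)."""
    body: Dict[str, Any] = {"text": text}
    if emb_model is not None:
        # Only include emb_model when the caller chooses one explicitly;
        # otherwise the database default applies.
        body["emb_model"] = emb_model
    # "chunks" is deliberately omitted: the database adds it after processing.
    return {"@embText": body}


doc = {"name": "Alice", "bio": make_emb_text("Alice is a data scientist.")}
```

Leaving emb_model unset keeps the payload small and defers the model choice to the documented default.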
Key Components
Component | Description |
---|---|
text | The core content of the EmbText object that needs to be embedded and indexed. The text will be automatically split into smaller chunks for optimized semantic search and retrieval. |
emb_model | Specifies the embedding model to use. The default model is text-embedding-3-small, which is optimized for general-purpose embeddings, but you can choose from other supported models if needed. |
chunks | Added automatically by the database after asynchronous data processing. Unlike text and emb_model, which are specified by users, the chunks field is only added or modified by the system. It contains the smaller, processed parts of the original text used for optimized search. Example values: ["text to", "embed"]. |
No Custom Fields in EmbText
EmbText does not support custom fields. Only text and emb_model can be set or modified by users. Any additional fields will either be ignored or cause errors.
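Because extra fields may be ignored or rejected, it can help to check an EmbText value on the client before inserting it. The following is a rough sketch of such a check; the function unexpected_fields and the two constant sets are assumptions made for illustration, not CapybaraDB APIs.

```python
from typing import Any, Dict, List

USER_FIELDS = {"text", "emb_model"}  # user-settable, per this page
SYSTEM_FIELDS = {"chunks"}           # written by the database only


def unexpected_fields(emb_text: Dict[str, Any]) -> List[str]:
    """Return the sorted keys inside "@embText" that are neither user nor system fields."""
    body = emb_text["@embText"]
    return sorted(set(body) - USER_FIELDS - SYSTEM_FIELDS)


candidate = {"@embText": {"text": "hello", "lang": "en"}}
extra = unexpected_fields(candidate)  # the custom "lang" key is flagged
```

A non-empty result means the value carries a field that EmbText does not define, so it is safer to remove it before insertion.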
Asynchronous Processing
When an EmbText field is inserted into CapybaraDB, the following processes take place asynchronously:
1. Chunking
After the document is inserted into the database, each EmbText field's text is automatically divided into smaller, manageable chunks. This chunking process helps improve the efficiency and accuracy of semantic search operations.
2. Embedding
The text chunks are then embedded using the specified model. This embedding process transforms the chunks into vector representations that capture the semantic meaning of the text. This entire process is automated and performed by the system without user intervention.
3. Indexing
The embedded chunks are indexed asynchronously. Because this process is asynchronous, users receive immediate responses from the database when inserting data. However, it may take a few seconds for the data to be fully processed and available for query operations. You can continue to interact with the document during indexing, but query results may not reflect the new text until the process is complete.
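Since newly inserted text may not be queryable for a few seconds, client code that needs the result right away can poll with a timeout. The sketch below is generic: wait_for and fake_query are hypothetical names, and fake_query merely stands in for whatever query call your application uses.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(fetch: Callable[[], Optional[T]],
             timeout_s: float = 10.0,
             interval_s: float = 0.5) -> Optional[T]:
    """Call fetch() until it returns a non-None value or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None  # give up; processing may still be in progress
        time.sleep(interval_s)


# Stand-in for a real query: returns nothing until the third call.
attempts = {"count": 0}


def fake_query() -> Optional[str]:
    attempts["count"] += 1
    return "match" if attempts["count"] >= 3 else None


result = wait_for(fake_query, timeout_s=2.0, interval_s=0.01)
```

A bounded timeout keeps the caller from blocking indefinitely if processing takes longer than expected.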
Semantic Search Capability
After chunking, embedding, and indexing are complete, you can perform query operations on the EmbText field. The search will retrieve text chunks based on meaning, rather than just keyword matches.
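To illustrate what retrieving by meaning can look like, here is a toy ranking of chunks by cosine similarity between vectors. This is only an illustration under stated assumptions: the three-dimensional vectors are made up by hand, real embeddings have far more dimensions, and this page does not specify how CapybaraDB scores results internally.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query_vec: Sequence[float],
                chunk_vecs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (chunk, score) pairs sorted from most to least similar."""
    scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hand-written vectors for illustration only; not produced by any real model.
chunk_vecs = {
    "with expertise in AI": [0.9, 0.1, 0.0],
    "She has led several projects": [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a query about AI
ranked = rank_chunks(query_vec, chunk_vecs)
```

The chunk whose vector points in nearly the same direction as the query vector ranks first, even though the query text shares no keywords with it.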
Advantages of Using EmbText
- Default Embedding Model: If no emb_model is specified, the default text-embedding-3-small model will be used. You can change this by explicitly setting the emb_model field.
- Asynchronous Processing: Embedding and indexing are done asynchronously, which allows for immediate responses when inserting data, though full query availability may take a few seconds. Be mindful of this delay, especially when working with large datasets or needing immediate query results.
- Nested Fields: EmbText fields can be included within nested document structures. For example, a user’s biography can be stored in profile.bio and still be embedded and indexed correctly.
Example of What It Looks Like
Note: In CapybaraDB, the default document _id field is an auto-generated BSON ObjectId unless specified otherwise by the user. This is used in the examples below.
Here’s an example of what a document looks like after saving:
{
  "_id": { "$oid": "64d2f8f01234abcd5678ef90" },
  "name": "Alice",
  "bio": {
    "@embText": {
      "text": "Alice is a data scientist with expertise in AI and machine learning. She has led several projects in natural language processing.",
      "emb_model": "text-embedding-3-small", // Optional
      "chunks": [
        "Alice is a data scientist",
        "with expertise in AI",
        "and machine learning.",
        "She has led several projects",
        "in natural language processing."
      ] // Auto-generated by the database
    }
  }
}
In this example:
- The bio field uses the EmbText data type to embed and index Alice’s biography, making it searchable based on the semantic meaning of the text.
- The embedding model is text-embedding-3-small, but it can be adjusted as needed.
- The chunks field contains example values showing how the original text might be divided for optimized search.
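Because chunks is only written by the database after processing, its presence in a fetched document can serve as a rough signal that chunking has run. The sketch below assumes exactly that; is_processed is a hypothetical helper, and this page does not state whether embedding and indexing are also complete at the moment chunks appears.

```python
from typing import Any, Dict


def is_processed(field: Dict[str, Any]) -> bool:
    """True if the EmbText value already carries a non-empty "chunks" list."""
    chunks = field.get("@embText", {}).get("chunks")
    return isinstance(chunks, list) and len(chunks) > 0


saved = {"@embText": {"text": "Alice is a data scientist.",
                      "chunks": ["Alice is a data scientist."]}}
pending = {"@embText": {"text": "Bob has over a decade of experience in AI."}}

saved_ready = is_processed(saved)      # chunks present
pending_ready = is_processed(pending)  # chunks not yet written
```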
Example of EmbText in Nested Fields
{
  "profile": {
    "name": "Bob",
    "bio": {
      "@embText": {
        "text": "Bob has over a decade of experience in AI, focusing on neural networks and deep learning.",
        "emb_model": "text-embedding-3-small" // Optional
      }
    }
  }
}
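When EmbText values sit at different depths, it can be handy to list where they are in a document. The following sketch walks nested dictionaries and reports each EmbText value with a dotted path such as profile.bio. The function iter_emb_text is hypothetical and works on plain Python dicts only; it is not a CapybaraDB API.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_emb_text(node: Any, path: str = "") -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (dotted_path, emb_text_body) for every "@embText" value under node."""
    if not isinstance(node, dict):
        return  # scalars (and anything that is not a dict) hold no EmbText here
    if "@embText" in node:
        yield path, node["@embText"]
        return
    for key, value in node.items():
        child_path = f"{path}.{key}" if path else key
        yield from iter_emb_text(value, child_path)


doc = {
    "profile": {
        "name": "Bob",
        "bio": {"@embText": {"text": "Bob has over a decade of experience in AI."}},
    }
}
found = dict(iter_emb_text(doc))  # maps "profile.bio" to the EmbText body
```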
Use Cases for EmbText
- Semantic Search: EmbText is ideal for scenarios where retrieving documents based on meaning is more important than matching exact keywords. For example, you could query for biographies mentioning “AI research” and find relevant documents even if the exact phrase isn't used.
- Natural Language Processing: EmbText is useful in AI-driven applications where understanding the context and meaning of text is crucial, such as summarization, sentiment analysis, or question-answering systems.