Query

This guide explains how to perform semantic queries on documents in CapyDB. Semantic queries retrieve documents by matching the meaning of the provided query text with EmbJSONs in the database.

The query operation returns a list of matched chunks from EmbJSONs in the collection. Only EmbJSONs whose emb_model matches the one used for the query text are searched; EmbJSONs with a different emb_model are excluded.

Basic Query Operation

The simplest way to use the query operation is to just provide the query text. This offers an easy and intuitive way to search your data semantically without worrying about additional parameters.

Basic Example

# Simple query example
query_text = "Software engineer with expertise in AI"

response = collection.query(query_text)

# Process the results - response is a list of matches
for match in response:
    print(f"Match: {match['chunk']} (Score: {match['score']})")

Default Response

A successful query operation returns a JSON array of matches. By default, each match includes the matched text chunk, its location in the document (path and chunk_n), a similarity score, and the full document the chunk came from:

[
  {
    "chunk": "John is a software engineer with expertise in AI.",
    "path": "bio",
    "chunk_n": 0,
    "score": 0.95,
    "document": {
      "_id": ObjectId("64d2f8f01234abcd5678ef90")
      // All document fields are returned here (name, bio, skills, etc.)
    }
  },
  {
    "chunk": "Alice is a data scientist with a background in machine learning.",
    "path": "bio",
    "chunk_n": 1,
    "score": 0.89,
    "document": {
      "_id": ObjectId("64d2f8f01234abcd5678ef91")
      // Complete document data is returned by default
    }
  }
]

By default, the system will:

Use OpenAI's text-embedding-3-small as the embedding model
Return the top 10 matching results
Exclude the vector values from the response
Return the whole document data, not just minimal metadata
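Because the response is a list of chunks rather than a list of documents, more than one match may point at the same document. The sketch below keeps only the best-scoring chunk per document. It is a client-side helper, not part of the CapyDB API: the function name best_match_per_document is hypothetical, the sample data is made up to mirror the response shape shown above, and it assumes that a higher score means a closer match and that _id values can be compared via str().

```python
from typing import Any

# Made-up sample shaped like the default response above.
# The "_id" values are plain strings here for simplicity (an assumption).
sample_response: list[dict[str, Any]] = [
    {"chunk": "John is a software engineer with expertise in AI.",
     "path": "bio", "chunk_n": 0, "score": 0.95,
     "document": {"_id": "64d2f8f01234abcd5678ef90"}},
    {"chunk": "Alice is a data scientist with a background in machine learning.",
     "path": "bio", "chunk_n": 1, "score": 0.89,
     "document": {"_id": "64d2f8f01234abcd5678ef91"}},
    {"chunk": "He also mentors junior developers.",
     "path": "bio", "chunk_n": 2, "score": 0.72,
     "document": {"_id": "64d2f8f01234abcd5678ef90"}},
]


def best_match_per_document(matches: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the highest-scoring match for each document, best first."""
    best: dict[str, dict[str, Any]] = {}
    for match in matches:
        doc_id = str(match["document"]["_id"])
        # Replace the stored match only if this one scores higher.
        if doc_id not in best or match["score"] > best[doc_id]["score"]:
            best[doc_id] = match
    return sorted(best.values(), key=lambda m: m["score"], reverse=True)


for match in best_match_per_document(sample_response):
    print(match["document"]["_id"], match["score"])
```

In real use, the list returned by collection.query(...) would take the place of sample_response.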

Advanced Query Operation

For more control over your semantic searches, you can customize the query operation with additional parameters. These parameters allow you to fine-tune the search behavior, filter results, and specify what data to include in the response.

Advanced Example with Optional Parameters

# Advanced query with optional parameters
query_text = "Software engineer with expertise in AI"
emb_model = "text-embedding-3-small"  # Optional
top_k = 3  # Optional
include_values = True  # Optional
projection = {
    "mode": "include",
    "fields": ["name", "bio"]
}  # Optional

response = collection.query(
    query_text, 
    filter={"status": "active"}, 
    projection=projection,
    emb_model=emb_model, 
    top_k=top_k, 
    include_values=include_values
)

# Process the results - response is a list of matches
for match in response:
    print(f"Match: {match['chunk']} (Score: {match['score']})")
    print(f"From document: {match['document']['_id']}")

Detailed Response with Additional Parameters

When you customize the query with additional parameters like include_values or projection, the response array can include more detailed information:

[
  {
    "path": "bio",
    "chunk": "John is a software engineer with expertise in AI.",
    "chunk_n": 0,
    "score": 0.95,
    "values": [
      0.123, 0.456, 0.789, ...
    ],
    "document": {
      "_id": ObjectId("64d2f8f01234abcd5678ef90"),
      "name": "John Doe",
      "bio": EmbText("John is a software engineer with expertise in AI.")
    }
  },
  {
    "path": "bio",
    "chunk": "Alice is a data scientist with a background in machine learning.",
    "chunk_n": 1,
    "score": 0.89,
    "values": [
      0.234, 0.567, 0.890, ...
    ],
    "document": {
      "_id": ObjectId("64d2f8f01234abcd5678ef91"),
      "name": "Alice Smith",
      "bio": EmbText("Alice is a data scientist with a background in machine learning.")
    }
  }
]
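One possible use of the values array is comparing returned chunks with each other, for example to spot near-duplicate matches. The following is a minimal sketch in pure Python. It assumes values is a flat list of floats, as the response above suggests; the helper name cosine_similarity and the short toy vectors are illustrative only, and real embeddings are much longer.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; treat it as no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for match["values"] from two different matches.
values_a = [0.1, 0.2, 0.3]
values_b = [0.2, 0.4, 0.6]
print(f"Similarity between chunks: {cosine_similarity(values_a, values_b):.3f}")
```

Remember that values is only present when the query is made with include_values=True.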

Parameters for Query Operations

Parameter	Description
query	The text to be embedded and matched against stored EmbJSON fields. This parameter is required.
filter (optional)	MongoDB-style query filter to apply to documents before semantic search. This helps narrow down the document set before performing the semantic search.
projection (optional)	Specifies which fields to include or exclude in the returned documents. Format: `{"mode": "include", "fields": ["field1", "field2"]}` or `{"mode": "exclude", "fields": ["field3"]}`.
emb_model (optional)	The embedding model used to embed the query text. Defaults to OpenAI's text-embedding-3-small. Users can select from supported embedding models. Only EmbJSON fields stored with the same model are searched; fields that use a different model are skipped.
top_k (optional)	The maximum number of matches to return. Defaults to 10. Increase this value to get more results, decrease it to improve performance and reduce response size.
include_values (optional)	Whether to include the embedding vector values in the response. Defaults to false. Set to true if you need the raw vector data for further processing.
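The projection format in the table is easy to mistype, so it can help to build it through a small validating helper. The sketch below is client-side only: make_projection is a hypothetical name and not part of the CapyDB client, and the checks reflect just the two modes listed in the table.

```python
from typing import Iterable

VALID_MODES = ("include", "exclude")


def make_projection(mode: str, fields: Iterable[str]) -> dict:
    """Build a projection dict in the documented {"mode", "fields"} format."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    field_list = list(fields)
    if not field_list or not all(isinstance(f, str) for f in field_list):
        raise ValueError("fields must be a non-empty collection of strings")
    return {"mode": mode, "fields": field_list}


projection = make_projection("include", ["name", "bio"])
print(projection)
```

The resulting dict can be passed as the projection argument of collection.query, as in the advanced example.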

Common Use Cases

1. Simple Semantic Search

When you need to quickly search for documents related to a concept:

results = collection.query("climate change impact")

2. Filtered Semantic Search

When you need to search within a specific category or subset of documents:

results = collection.query(
    "renewable energy solutions",
    filter={"category": "science", "published": True}
)

3. Limited Result Set

When you only need the top few most relevant matches:

results = collection.query("machine learning techniques", top_k=3)
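top_k caps the number of matches but does not guarantee that every returned match is a strong one. If needed, weak matches can be dropped on the client afterwards. This is a minimal sketch: filter_by_score and the 0.8 cutoff are hypothetical, the sample matches are made up, and it assumes that higher scores indicate closer matches, as the example responses suggest.

```python
from typing import Any


def filter_by_score(matches: list[dict[str, Any]], min_score: float) -> list[dict[str, Any]]:
    """Return only the matches whose score is at least min_score."""
    return [m for m in matches if m["score"] >= min_score]


# Made-up matches standing in for the list returned by collection.query(...).
sample_matches = [
    {"chunk": "Gradient boosting overview", "score": 0.91},
    {"chunk": "Intro to neural networks", "score": 0.84},
    {"chunk": "Company picnic schedule", "score": 0.41},
]

strong = filter_by_score(sample_matches, min_score=0.8)
for match in strong:
    print(f"{match['chunk']} (Score: {match['score']})")
```

A suitable cutoff depends on the embedding model and the data, so it is worth tuning against real queries.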

4. Specific Fields Retrieval

When you only need specific fields in the returned documents:

projection = {"mode": "include", "fields": ["title", "abstract", "author"]}
results = collection.query("quantum computing", projection=projection)
