# Contextinator

The weapon of mass codebase context for agentic AI.

Turn any codebase into semantically aware, searchable knowledge for AI-powered workflows.

## 📖 Overview

Contextinator is a powerful tool that bridges the gap between static codebases and intelligent AI agents. It uses Abstract Syntax Tree (AST) parsing to extract semantic code chunks, generates embeddings for them, and stores them in a vector database, enabling AI agents to understand, navigate, and reason about codebases with precision.
## ✨ Key Features

- 🌳 **AST-Powered Chunking** - Extract functions, classes, and methods from 23+ programming languages
- 🔍 **Semantic Search** - Find relevant code using natural language queries
- 🚀 **Full Pipeline Automation** - One command to chunk, embed, and store
- 🎯 **Smart Deduplication** - Hash-based detection of duplicate code
- 📊 **Visual AST Explorer** - Debug and visualize code structure
- 📦 **TOON Format Export** - Token-efficient output format for LLM prompts (40-60% token savings)
- 🐳 **Docker-Ready** - ChromaDB server included
## 🎯 Use Cases

- **AI Code Assistants** - Give LLMs deep codebase understanding
- **Documentation Generation** - Auto-generate docs from code structure
- **Code Search & Discovery** - Find implementations across large projects
- **Refactoring Analysis** - Identify duplicate or similar code patterns
- **Onboarding Automation** - Help new developers navigate unfamiliar codebases
## 🚀 Quick Start

### Prerequisites

- Python 3.11 or higher
- Docker (for ChromaDB)
- OpenAI API key (for embeddings)
### Installation

**Step 1: Clone and setup**

```bash
git clone https://github.com/starthackHQ/Contextinator.git
cd Contextinator
```

**Step 2: Create virtual environment**

```bash
python -m venv .venv
```

**Step 3: Activate environment**

```bash
# Windows
.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate
```

**Step 4: Install dependencies**

```bash
pip install -r requirements.txt
```
**Step 5: Configure environment variables**

Copy the example environment file and add your OpenAI API key:

```bash
# Copy the example file
cp .env.example .env

# Edit .env and add your OpenAI API key
# OPENAI_API_KEY=sk-your-actual-key-here
```

**Important:** The .env file should contain:

```env
OPENAI_API_KEY=sk-your-openai-key-here
USE_CHROMA_SERVER=true
CHROMA_SERVER_URL=http://localhost:8000
```
**Step 6: Start ChromaDB**

```bash
docker-compose up -d
```
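To confirm the server is reachable before indexing, you can ping it from Python. This is a minimal sketch using the standard chromadb client against the default URL from .env; it is not part of Contextinator itself:

```python
# Quick health check for the ChromaDB server started by docker-compose.
# Assumes the default URL from .env (http://localhost:8000).
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
print(client.heartbeat())  # prints a timestamp if the server is up
```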
## Usage

Contextinator can be used in two ways: via the CLI or programmatically as a Python library.

### CLI Usage

After installation, you can use the `contextinator` command directly:

```bash
# Recommended: Use the installed command
contextinator <command> [options]

# Alternative: Use module execution
python -m contextinator <command> [options]
```

Development mode (before installation):

```bash
python -m src.contextinator <command> [options]
```
#### 1. Chunking

```bash
contextinator chunk --save --path <repo-path> --output <output-dir>
contextinator chunk --save --repo-url <github-url>
contextinator chunk --save-ast                 # Save AST trees for debugging
contextinator chunk --chunks-dir <custom-dir>  # Custom chunks directory
```
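Each chunk is stored with metadata describing where it came from. As a rough illustration (field names taken from the search output schema shown later in this README; the exact on-disk layout may differ), a chunk record looks like this:

```python
# Illustrative chunk record (assumed shape, based on the search-result
# metadata documented in the Export Formats section below).
chunk = {
    "id": "chunk_0_12345",
    "content": "def authenticate_user(username, password):\n    ...",
    "metadata": {
        "file_path": "src/auth.py",
        "language": "python",
        "node_type": "function_definition",
        "node_name": "authenticate_user",
        "start_line": 10,
        "end_line": 25,
    },
}
```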
#### 2. Embedding

Currently, only OpenAI embeddings are supported, so make sure your .env file is configured correctly (see Step 5 above).

```bash
contextinator embed --save --path <repo-path> --output <output-dir>
contextinator embed --save --repo-url <github-url>
contextinator embed --chunks-dir <custom-dir> --embeddings-dir <custom-dir>
```
#### 3. Storing in Vector Store

Note: Make sure the ChromaDB server is running (`docker-compose up -d`).

```bash
contextinator store-embeddings --path <repo-path> --output <output-dir>
contextinator store-embeddings --collection-name <custom-name>
contextinator store-embeddings --repo-name <repo-name> --collection-name <custom-name>
contextinator store-embeddings --embeddings-dir <custom-dir> --chromadb-dir <custom-dir>
```
#### 4. Search Commands

Contextinator provides multiple search methods for different use cases:

##### 4.1 Semantic Search (Natural Language)

Find code using natural language queries. Uses AI embeddings for semantic similarity.

```bash
# Basic semantic search
contextinator search "authentication logic" --collection MyRepo

# With filters
contextinator search "error handling" -c MyRepo --language python -n 10

# Include parent chunks (classes/modules) in results
contextinator search "database queries" -c MyRepo --include-parents

# Filter by file path
contextinator search "API endpoints" -c MyRepo --file "src/api/"

# Filter by node type
contextinator search "validation logic" -c MyRepo --type function_definition

# Export to JSON
contextinator search "authentication" -c MyRepo --json results.json

# Export to TOON format (40-60% token savings for LLMs)
contextinator search "authentication" -c MyRepo --toon results.json
```
**Options:**
- `-c, --collection` (required): Collection name
- `-n, --n-results`: Number of results (default: 5)
- `-l, --language`: Filter by programming language (e.g., python, javascript)
- `-f, --file`: Filter by file path (partial match)
- `-t, --type`: Filter by node type (e.g., function_definition, class_definition)
- `--include-parents`: Include parent chunks (classes/modules) in results
- `--json`: Export results to JSON file
- `--toon`: Export results to TOON file (compact format for LLMs)
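The same search is available from Python via `semantic_search` (documented in the Programmatic Usage section below). A minimal sketch; the CLI filter flags may map to additional keyword arguments not shown here:

```python
from contextinator import semantic_search

# Equivalent of: contextinator search "authentication logic" -c MyRepo -n 5
results = semantic_search(
    collection_name="MyRepo",
    query="authentication logic",
    n_results=5,
)
for r in results:
    print(r["metadata"]["file_path"], r.get("cosine_similarity"))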
##### 4.2 Symbol Search (Exact Name Match)

Find specific functions or classes by name.

```bash
# Find function by name
contextinator symbol authenticate_user --collection MyRepo

# Find class by name
contextinator symbol UserManager -c MyRepo --type class_definition

# Search in specific file
contextinator symbol get_user -c MyRepo --file "api/"

# Export results
contextinator symbol main -c MyRepo --json main_functions.json
```
**Options:**
- `-c, --collection` (required): Collection name
- `-t, --type`: Filter by node type
- `-f, --file`: Filter by file path
- `--limit`: Maximum results (default: 50)
- `--json`: Export to JSON
- `--toon`: Export to TOON format
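Programmatically, the same lookup goes through `symbol_search` (see Programmatic Usage below). A minimal sketch, assuming results carry the same metadata shape shown in the Export Formats section:

```python
from contextinator import symbol_search

# Equivalent of: contextinator symbol authenticate_user -c MyRepo
functions = symbol_search(
    collection_name="MyRepo",
    symbol_name="authenticate_user",
)
for f in functions:
    # metadata shape assumed to match the JSON export schema below
    print(f["metadata"]["file_path"], f["metadata"]["node_name"])
```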
##### 4.3 Pattern Search (Text/Regex)

Search for specific text patterns in code.

```bash
# Find TODOs
contextinator pattern "TODO" --collection MyRepo

# Find import statements
contextinator pattern "import requests" -c MyRepo --language python

# Find async functions
contextinator pattern "async def" -c MyRepo --file "api/"

# Find FIXMEs and export
contextinator pattern "FIXME" -c MyRepo --toon fixmes.json
```
**Options:**
- `-c, --collection` (required): Collection name
- `-l, --language`: Filter by programming language
- `-f, --file`: Filter by file path
- `-t, --type`: Filter by node type
- `--limit`: Maximum results (default: 50)
- `--json`: Export to JSON
- `--toon`: Export to TOON format
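Exported results are plain JSON, so they are easy to post-process. For example, a small script that tallies TODO hits per file, assuming you first ran `contextinator pattern "TODO" -c MyRepo --json todos.json` and that the export uses the structure shown in the Export Formats section:

```python
import json
from collections import Counter

# Count pattern-search hits per file from an exported results file.
with open("todos.json") as f:
    data = json.load(f)

counts = Counter(r["metadata"]["file_path"] for r in data["results"])
for path, n in counts.most_common():
    print(f"{n:3d}  {path}")
```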
##### 4.4 Advanced Search (Hybrid)

Combine semantic search, pattern matching, and filters for precise results.

```bash
# Semantic search with language filter
contextinator search-advanced -c MyRepo \
    --semantic "authentication" --language python

# Pattern search with file filter
contextinator search-advanced -c MyRepo \
    --pattern "TODO" --file "src/"

# Hybrid: semantic + pattern + type filter
contextinator search-advanced -c MyRepo \
    --semantic "error handling" --pattern "try" --type function_definition

# Multiple filters with export
contextinator search-advanced -c MyRepo \
    --semantic "API routes" --language python --file "api/" --toon api_routes.json
```
**Options:**
- `-c, --collection` (required): Collection name
- `-s, --semantic`: Semantic query (natural language)
- `-p, --pattern`: Text pattern to search for
- `-l, --language`: Filter by programming language
- `-f, --file`: Filter by file path
- `-t, --type`: Filter by node type
- `--limit`: Maximum results (default: 50)
- `--json`: Export to JSON
- `--toon`: Export to TOON format
##### 4.5 Read File (Reconstruct from Chunks)

Reconstruct and display a complete file from its chunks.

```bash
# Read complete file
contextinator read-file "src/auth.py" --collection MyRepo

# Show chunks separately (don't join)
contextinator read-file "src/api/routes.py" -c MyRepo --no-join

# Export to JSON
contextinator read-file "src/main.py" -c MyRepo --json main.json
```
**Options:**
- `-c, --collection` (required): Collection name
- `--no-join`: Show chunks separately instead of joining them
- `--json`: Export to JSON
- `--toon`: Export to TOON format
#### 5. Export Formats

All search commands support two export formats:

##### JSON Format (Standard)

```bash
contextinator search "authentication" -c MyRepo --json results.json
```

Output structure:

```json
{
  "query": "authentication",
  "collection": "MyRepo",
  "total_results": 5,
  "results": [
    {
      "id": "chunk_0_12345",
      "content": "def authenticate_user(username, password):\n    ...",
      "metadata": {
        "file_path": "src/auth.py",
        "language": "python",
        "node_type": "function_definition",
        "node_name": "authenticate_user",
        "start_line": 10,
        "end_line": 25
      },
      "cosine_similarity": 0.89
    }
  ]
}
```
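Since the export is self-describing, assembling LLM context from it takes only a few lines. A sketch that turns results.json into a prompt-ready snippet list (the variable names here are illustrative, not part of the Contextinator API):

```python
import json

# Turn exported search results into a prompt-ready context string.
with open("results.json") as f:
    data = json.load(f)

sections = []
for r in data["results"]:
    m = r["metadata"]
    header = f"# {m['file_path']}:{m['start_line']}-{m['end_line']} ({m['node_name']})"
    sections.append(header + "\n" + r["content"])

context = "\n\n".join(sections)
print(context)
```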
##### TOON Format (Token-Optimized)

Compact format designed for LLM prompts. Saves 40-60% tokens compared to JSON.

```bash
contextinator search "authentication" -c MyRepo --toon results.json
```

Perfect for:

- Feeding search results to LLMs
- Building RAG (Retrieval-Augmented Generation) systems
- Minimizing token usage in AI workflows
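If you want to verify the savings on your own data, you can compare token counts of the two exports directly. A minimal sketch using OpenAI's tiktoken tokenizer, assuming you exported the same query once with `--json` and once with `--toon` (file names below are placeholders for your own exports):

```python
import tiktoken

# Compare token counts of the same search results in JSON vs. TOON form.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return len(enc.encode(f.read()))

json_tokens = count_tokens("results_json_export.json")  # assumed filename
toon_tokens = count_tokens("results_toon_export.json")  # assumed filename
print(f"JSON: {json_tokens} tokens, TOON: {toon_tokens} tokens "
      f"({100 * (1 - toon_tokens / json_tokens):.0f}% savings)")
```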
#### 6. Database Management

```bash
# Show database statistics
contextinator db-info

# List all collections
contextinator db-list

# Show collection details with sample documents
contextinator db-show MyRepo --sample 3

# Delete a collection
contextinator db-clear MyRepo

# Use custom ChromaDB location
contextinator db-info --chromadb-dir <custom-dir>

# Use specific repo database
contextinator db-info --repo-name MyRepo
```
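Because the backing store is a standard ChromaDB server, you can also inspect it with the chromadb client directly. A sketch against the default server URL from .env; this bypasses Contextinator and uses the ChromaDB API as-is:

```python
import chromadb

# Inspect the vector store directly with the ChromaDB client.
client = chromadb.HttpClient(host="localhost", port=8000)

print(client.list_collections())  # collections in the store

col = client.get_collection("MyRepo")
print(f"MyRepo holds {col.count()} chunks")
```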
## 📦 Programmatic Usage (As a Library)

You can also import and use Contextinator directly in your Python code:

```python
from contextinator import (
    chunk_repository,
    embed_chunks,
    store_repository_embeddings,
    semantic_search,
    symbol_search,
    read_file,
)

# 1. Chunk a repository
chunks = chunk_repository(
    repo_path="./my-project",
    repo_name="MyProject",
    save=True,
    output_dir="./output"
)
print(f"Created {len(chunks)} chunks")

# 2. Generate embeddings
embeddings = embed_chunks(
    base_dir="./output",
    repo_name="MyProject",
    save=True
)

# 3. Store in vector database
stats = store_repository_embeddings(
    base_dir="./output",
    repo_name="MyProject",
    embedded_chunks=embeddings,
    collection_name="MyProject"
)
print(f"Stored {stats['stored_count']} embeddings")

# 4. Search semantically
results = semantic_search(
    collection_name="MyProject",
    query="authentication logic",
    n_results=5
)
for result in results:
    print(f"File: {result['metadata']['file_path']}")
    print(f"Code: {result['content'][:200]}...")
    print(f"Similarity: {result.get('cosine_similarity', 'N/A')}")
    print("-" * 80)

# 5. Search by symbol name
functions = symbol_search(
    collection_name="MyProject",
    symbol_name="authenticate_user"
)

# 6. Read entire file from chunks
file_content = read_file(
    collection_name="MyProject",
    file_path="src/auth.py"
)
print(file_content['content'])
```
## 📚 Quick Reference

### Common Workflows

**1. Index a GitHub repository:**

```bash
contextinator chunk-embed-store-embeddings \
    --repo-url https://github.com/user/repo \
    --save \
    --collection-name MyRepo
```

**2. Search for specific functionality:**

```bash
# Natural language search
contextinator search "how is authentication handled" -c MyRepo

# Find specific function
contextinator symbol authenticate_user -c MyRepo

# Find TODOs
contextinator pattern "TODO" -c MyRepo
```

**3. Advanced filtered search:**

```bash
contextinator search-advanced -c MyRepo \
    --semantic "error handling" \
    --language python \
    --file "src/" \
    --toon error_handling.json
```

**4. Export for LLM context:**

```bash
# Get authentication-related code in token-efficient format
contextinator search "authentication and authorization" \
    -c MyRepo \
    --include-parents \
    --toon auth_context.json
```
### All Commands

| Command | Purpose | Example |
|---|---|---|
| `chunk` | Extract semantic chunks from code | `chunk --repo-url <url> --save` |
| `embed` | Generate embeddings for chunks | `embed --repo-url <url> --save` |
| `store-embeddings` | Store embeddings in ChromaDB | `store-embeddings --repo-name MyRepo` |
| `chunk-embed-store-embeddings` | Full pipeline (all-in-one) | `chunk-embed-store-embeddings --repo-url <url> --save` |
| `search` | Semantic search (natural language) | `search "query" -c MyRepo` |
| `symbol` | Find functions/classes by name | `symbol function_name -c MyRepo` |
| `pattern` | Text/regex search | `pattern "TODO" -c MyRepo` |
| `search-advanced` | Hybrid search with filters | `search-advanced -c MyRepo --semantic "query" --language python` |
| `read-file` | Reconstruct file from chunks | `read-file "path/to/file.py" -c MyRepo` |
| `db-info` | Show database statistics | `db-info` |
| `db-list` | List all collections | `db-list` |
| `db-show` | Show collection details | `db-show MyRepo --sample 3` |
| `db-clear` | Delete a collection | `db-clear MyRepo` |