rag-train
Train the RAG (Retrieval-Augmented Generation) system to better understand your database. Add schema information, business documentation, and example queries to improve natural language SQL accuracy.
Syntax
%%calliope rag-train <datasource>
{
"ddl": [...],
"documentation": [...],
"question": [...],
"sql": [...]
}Why Train RAG?
When you add a database, the system automatically learns the schema. Training adds:
- Business context: What columns actually mean
- Query patterns: How your team queries data
- Terminology: Industry-specific terms
This makes ask-sql more accurate and relevant.
Training Types
DDL (Schema)
Provide schema definitions:
%%calliope rag-train sales_db
{
"ddl": [
"TABLE customers (id INT PRIMARY KEY, name VARCHAR(100), email VARCHAR(255), created_at TIMESTAMP)",
"TABLE orders (id INT PRIMARY KEY, customer_id INT REFERENCES customers(id), total DECIMAL(10,2), status VARCHAR(20))",
"TABLE products (id INT PRIMARY KEY, name VARCHAR(100), price DECIMAL(10,2), category VARCHAR(50))"
]
}Documentation
Add business context:
%%calliope rag-train sales_db
{
"documentation": [
"The 'customers' table stores all registered users. 'created_at' is when they signed up.",
"Order status can be: pending, processing, shipped, delivered, cancelled",
"Product categories are: Electronics, Clothing, Home, Sports, Other",
"Revenue is calculated as SUM(total) from orders where status = 'delivered'",
"ARR (Annual Recurring Revenue) comes from subscriptions table, not orders"
]
}Question-SQL Pairs
Teach with examples:
%%calliope rag-train sales_db
{
"question": [
"How many active customers do we have?",
"What's the average order value?",
"Which products are bestsellers?",
"Show me revenue by month"
],
"sql": [
"SELECT COUNT(DISTINCT customer_id) FROM orders WHERE created_at > CURRENT_DATE - INTERVAL '30 days'",
"SELECT AVG(total) FROM orders WHERE status = 'delivered'",
"SELECT p.name, COUNT(*) as orders FROM products p JOIN order_items oi ON p.id = oi.product_id GROUP BY p.id ORDER BY orders DESC LIMIT 10",
"SELECT DATE_TRUNC('month', created_at) as month, SUM(total) as revenue FROM orders WHERE status = 'delivered' GROUP BY month ORDER BY month"
]
}Complete Training Example
Combine all types in one cell:
%%calliope rag-train analytics_db
{
"ddl": [
"TABLE users (id INT, email VARCHAR, plan VARCHAR, created_at TIMESTAMP)",
"TABLE events (id INT, user_id INT, event_type VARCHAR, created_at TIMESTAMP)"
],
"documentation": [
"Plans are: free, pro, enterprise",
"DAU = Daily Active Users = distinct users with events today",
"MAU = Monthly Active Users = distinct users with events in last 30 days",
"Activation = user who completed onboarding (event_type = 'onboarding_complete')",
"Churn = pro/enterprise user with no events in 30 days"
],
"question": [
"What's our DAU?",
"Show activation rate by week",
"How many users churned this month?"
],
"sql": [
"SELECT COUNT(DISTINCT user_id) FROM events WHERE DATE(created_at) = CURRENT_DATE",
"SELECT DATE_TRUNC('week', u.created_at) as week, COUNT(DISTINCT CASE WHEN e.event_type = 'onboarding_complete' THEN u.id END)::float / COUNT(DISTINCT u.id) as activation_rate FROM users u LEFT JOIN events e ON u.id = e.user_id GROUP BY week ORDER BY week",
"SELECT COUNT(*) FROM users WHERE plan IN ('pro', 'enterprise') AND id NOT IN (SELECT DISTINCT user_id FROM events WHERE created_at > CURRENT_DATE - INTERVAL '30 days')"
]
}Automatic Training
When you add a database with %%calliope add-database, the system automatically:
- Introspects the schema
- Extracts DDL for all tables
- Trains RAG with the schema
You only need manual training to add:
- Business documentation
- Common query patterns
- Domain terminology
Best Practices
Document Business Logic
%%calliope rag-train mydb
{
"documentation": [
"Active user = logged in within 7 days",
"Revenue only counts completed orders (status='delivered')",
"Enterprise customers are in the 'enterprise_clients' table, not 'customers'"
]
}Include Common Queries
Train with queries your team frequently asks:
%%calliope rag-train mydb
{
"question": ["Weekly signups", "Conversion funnel", "Top referrers"],
"sql": [
"SELECT DATE_TRUNC('week', created_at), COUNT(*) FROM users GROUP BY 1",
"SELECT step, COUNT(*) FROM funnel_events GROUP BY step ORDER BY step",
"SELECT referrer, COUNT(*) FROM users GROUP BY referrer ORDER BY 2 DESC LIMIT 10"
]
}Use Consistent Terminology
If your team says “ARR” train the system to understand it:
%%calliope rag-train finance_db
{
"documentation": [
"ARR = Annual Recurring Revenue = SUM(monthly_amount * 12) from subscriptions"
],
"question": ["What's our ARR?"],
"sql": ["SELECT SUM(monthly_amount * 12) as arr FROM subscriptions WHERE status = 'active'"]
}Tips
- Train iteratively: add more examples as you find gaps
- Review generated SQL and add corrections as training pairs
- Document any non-obvious column meanings
- Include examples of complex queries your team runs
- Re-train after schema changes
Verifying Training
After training, test with ask-sql:
%%calliope ask-sql mydb
What's our ARR?If results aren’t right, add more training:
%%calliope rag-train mydb
{
"documentation": ["ARR calculation should exclude cancelled subscriptions"],
"question": ["What's our ARR?"],
"sql": ["SELECT SUM(monthly_amount * 12) FROM subscriptions WHERE status = 'active' AND cancelled_at IS NULL"]
}