rag-train

Train the RAG (Retrieval-Augmented Generation) system to better understand your database. Add schema information, business documentation, and example queries to improve natural language SQL accuracy.

Syntax

%%calliope rag-train <datasource>
{
    "ddl": [...],
    "documentation": [...],
    "question": [...],
    "sql": [...]
}

Why Train RAG?

When you add a database, the system automatically learns the schema. Training adds:

  • Business context: What columns actually mean
  • Query patterns: How your team queries data
  • Terminology: Industry-specific terms

This makes ask-sql more accurate and relevant.

Training Types

DDL (Schema)

Provide schema definitions:

%%calliope rag-train sales_db
{
    "ddl": [
        "TABLE customers (id INT PRIMARY KEY, name VARCHAR(100), email VARCHAR(255), created_at TIMESTAMP)",
        "TABLE orders (id INT PRIMARY KEY, customer_id INT REFERENCES customers(id), total DECIMAL(10,2), status VARCHAR(20))",
        "TABLE products (id INT PRIMARY KEY, name VARCHAR(100), price DECIMAL(10,2), category VARCHAR(50))"
    ]
}

Documentation

Add business context:

%%calliope rag-train sales_db
{
    "documentation": [
        "The 'customers' table stores all registered users. 'created_at' is when they signed up.",
        "Order status can be: pending, processing, shipped, delivered, cancelled",
        "Product categories are: Electronics, Clothing, Home, Sports, Other",
        "Revenue is calculated as SUM(total) from orders where status = 'delivered'",
        "ARR (Annual Recurring Revenue) comes from subscriptions table, not orders"
    ]
}

Question-SQL Pairs

Teach with examples:

%%calliope rag-train sales_db
{
    "question": [
        "How many active customers do we have?",
        "What's the average order value?",
        "Which products are bestsellers?",
        "Show me revenue by month"
    ],
    "sql": [
        "SELECT COUNT(DISTINCT customer_id) FROM orders WHERE created_at > CURRENT_DATE - INTERVAL '30 days'",
        "SELECT AVG(total) FROM orders WHERE status = 'delivered'",
        "SELECT p.name, COUNT(*) as orders FROM products p JOIN order_items oi ON p.id = oi.product_id GROUP BY p.id ORDER BY orders DESC LIMIT 10",
        "SELECT DATE_TRUNC('month', created_at) as month, SUM(total) as revenue FROM orders WHERE status = 'delivered' GROUP BY month ORDER BY month"
    ]
}

Complete Training Example

Combine all types in one cell:

%%calliope rag-train analytics_db
{
    "ddl": [
        "TABLE users (id INT, email VARCHAR, plan VARCHAR, created_at TIMESTAMP)",
        "TABLE events (id INT, user_id INT, event_type VARCHAR, created_at TIMESTAMP)"
    ],
    "documentation": [
        "Plans are: free, pro, enterprise",
        "DAU = Daily Active Users = distinct users with events today",
        "MAU = Monthly Active Users = distinct users with events in last 30 days",
        "Activation = user who completed onboarding (event_type = 'onboarding_complete')",
        "Churn = pro/enterprise user with no events in 30 days"
    ],
    "question": [
        "What's our DAU?",
        "Show activation rate by week",
        "How many users churned this month?"
    ],
    "sql": [
        "SELECT COUNT(DISTINCT user_id) FROM events WHERE DATE(created_at) = CURRENT_DATE",
        "SELECT DATE_TRUNC('week', u.created_at) as week, COUNT(DISTINCT CASE WHEN e.event_type = 'onboarding_complete' THEN u.id END)::float / COUNT(DISTINCT u.id) as activation_rate FROM users u LEFT JOIN events e ON u.id = e.user_id GROUP BY week ORDER BY week",
        "SELECT COUNT(*) FROM users WHERE plan IN ('pro', 'enterprise') AND id NOT IN (SELECT DISTINCT user_id FROM events WHERE created_at > CURRENT_DATE - INTERVAL '30 days')"
    ]
}

Automatic Training

When you add a database with %%calliope add-database, the system automatically:

  1. Introspects the schema
  2. Extracts DDL for all tables
  3. Trains RAG with the schema

You only need manual training to add:

  • Business documentation
  • Common query patterns
  • Domain terminology

Best Practices

Document Business Logic

%%calliope rag-train mydb
{
    "documentation": [
        "Active user = logged in within 7 days",
        "Revenue only counts completed orders (status='delivered')",
        "Enterprise customers are in the 'enterprise_clients' table, not 'customers'"
    ]
}

Include Common Queries

Train with queries your team frequently asks:

%%calliope rag-train mydb
{
    "question": ["Weekly signups", "Conversion funnel", "Top referrers"],
    "sql": [
        "SELECT DATE_TRUNC('week', created_at), COUNT(*) FROM users GROUP BY 1",
        "SELECT step, COUNT(*) FROM funnel_events GROUP BY step ORDER BY step",
        "SELECT referrer, COUNT(*) FROM users GROUP BY referrer ORDER BY 2 DESC LIMIT 10"
    ]
}

Use Consistent Terminology

If your team says “ARR” train the system to understand it:

%%calliope rag-train finance_db
{
    "documentation": [
        "ARR = Annual Recurring Revenue = SUM(monthly_amount * 12) from subscriptions"
    ],
    "question": ["What's our ARR?"],
    "sql": ["SELECT SUM(monthly_amount * 12) as arr FROM subscriptions WHERE status = 'active'"]
}

Tips

  • Train iteratively: add more examples as you find gaps
  • Review generated SQL and add corrections as training pairs
  • Document any non-obvious column meanings
  • Include examples of complex queries your team runs
  • Re-train after schema changes

Verifying Training

After training, test with ask-sql:

%%calliope ask-sql mydb
What's our ARR?

If results aren’t right, add more training:

%%calliope rag-train mydb
{
    "documentation": ["ARR calculation should exclude cancelled subscriptions"],
    "question": ["What's our ARR?"],
    "sql": ["SELECT SUM(monthly_amount * 12) FROM subscriptions WHERE status = 'active' AND cancelled_at IS NULL"]
}