Text-to-Data with Snowflake Cortex: Building a Natural Language Portal

There is a recurring bottleneck in every growth-stage company. A domain expert knows exactly what they need to know. The VP of Sales wants to understand why enterprise renewal rates dipped last quarter. They understand the margins, the outliers, and the business logic. But they do not know Python, and they do not know SQL. They sit there with the right question, completely unable to ask the data directly.
The Solution: You can build a governed natural language data portal using Snowflake Cortex and Streamlit. By passing clean dataframe schemas to Llama 3.1 hosted inside Snowflake, you can reliably translate English questions into executable Pandas code. The result is a system that cuts the turnaround on ad-hoc data requests from days of ticket queues to seconds, while keeping data inside your existing security boundary.
Why Snowflake Cortex for Secure LLM Inference
The primary advantage of Snowflake Cortex is that it runs inference directly adjacent to your data, eliminating the need to send sensitive financial info to external APIs. By using native SQL or Python functions to call Llama 3.1 70B, we maintain the existing Role-Based Access Control (RBAC) boundary and reduce network latency from ~1200ms (typical OpenAI API overhead) to roughly 450ms for local inference.
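From SQL, the model is a single function call. The sketch below is a hypothetical helper that composes the `SNOWFLAKE.CORTEX.COMPLETE` invocation as a query string and doubles single quotes so user text survives the SQL literal (`session` stands in for a Snowpark session you would already have):

```python
def build_cortex_query(model: str, prompt: str) -> str:
    """Compose a SNOWFLAKE.CORTEX.COMPLETE call as a SQL string.
    Single quotes are doubled so the prompt survives the SQL literal."""
    escaped = prompt.replace("'", "''")
    return f"SELECT SNOWFLAKE.CORTEX.COMPLETE('{model}', '{escaped}') AS response"

sql = build_cortex_query("llama3.1-70b", "Show me last quarter's renewals")
# Inside Snowflake: session.sql(sql).collect()
```

Because the call is just SQL, it inherits whatever role executed it, which is exactly how the RBAC boundary stays intact.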
I used this architecture to build an interface that translates plain English into filtered tables. Since the data never leaves the Snowflake account, security reviewers are satisfied, and the app runs as a native Streamlit component within the warehouse environment.
Data Engineering Requirements for Reliable Text-to-Data
Large language models require structured, high-context data to produce deterministic results; if the underlying schema is messy, the AI will hallucinate. Success depends entirely on strict data pipelines where columns are named descriptively (e.g., total_revenue_usd instead of money) and dates are stored as actual date objects rather than strings.
The model relies on column names, data types, and table comments to understand business context. If you point an LLM at a CSV with vague headers like col_1, the logic fails. Clean metadata is the prerequisite for AI-driven discovery.
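As a minimal sketch of what that cleanup looks like (the raw column names here are hypothetical), it amounts to renaming vague headers and coercing date strings before the schema is ever shown to the model:

```python
import pandas as pd

# Hypothetical raw extract: a vague header and dates stored as strings
raw = pd.DataFrame({
    "col_1": [1200.0, 850.5],
    "dt": ["2024-01-15", "2024-02-03"],
})

# Descriptive names and real date objects give the LLM usable context
clean = raw.rename(columns={"col_1": "total_revenue_usd", "dt": "order_date"})
clean["order_date"] = pd.to_datetime(clean["order_date"])
```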
Prompt Engineering for Deterministic Pandas Code
To get production-grade reliability, use few-shot prompting to embed the exact dataframe schema and expected syntax into the context window. A zero-shot prompt—providing only instructions—is too fragile for financial reporting. By providing 2-3 examples of the specific syntax required, you prevent the model from attempting to import external libraries or using deprecated Pandas methods.
```python
def get_dataframe_schema(df):
    """Render the dataframe's column names and dtypes as a bulleted string."""
    buffer = []
    for col in df.columns:
        dtype = str(df[col].dtype)
        buffer.append(f"- {col} ({dtype})")
    return "\n".join(buffer)

schema_context = get_dataframe_schema(df)
prompt = f"""
You are a Python Pandas expert.
Given a dataframe df with the following schema:
{schema_context}

Rules:
1. Use ONLY the columns listed above.
2. Return a single line of Python code.
3. No explanations, no markdown, no imports.

Example 1: Show me sales over 500 in the North -> df[(df['sales'] > 500) & (df['region'] == 'North')]
Example 2: Top 5 customers by revenue -> df.nlargest(5, 'revenue')

User Request: {user_natural_language_input}
"""
```

Building Trust with Explainable AI Outputs
The biggest risk in text-to-data tools is the silent failure: code that is syntactically correct but logically wrong. To build trust, you must add an intermediary verification step that displays the generated Pandas code and a natural language summary of the filter before the query executes.
In the Streamlit UI, we wrap this in an st.expander block. If a finance lead asks for profitable products but sees the code df[df['margin'] < 0], they can catch the error immediately. This turns the AI into a verifiable tool rather than an opaque black box, allowing users to audit the logic before making decisions based on the data.
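That natural language summary can be derived mechanically from the generated code rather than asked of the model again. The helper below is hypothetical and only covers simple `df['col'] <op> value` comparisons, but it shows the idea of parsing the expression with `ast` instead of trusting a second LLM call:

```python
import ast

OPS = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=",
       ast.Eq: "==", ast.NotEq: "!="}

def summarize_filter(code: str) -> str:
    """Turn a single-line Pandas filter into a readable summary.
    Hypothetical sketch: handles df['col'] <op> constant comparisons only."""
    tree = ast.parse(code, mode="eval")
    parts = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Compare):
            left, op, right = node.left, node.ops[0], node.comparators[0]
            col = (left.slice.value
                   if isinstance(left, ast.Subscript)
                   and isinstance(left.slice, ast.Constant) else "?")
            val = right.value if isinstance(right, ast.Constant) else "?"
            parts.append(f"{col} {OPS.get(type(op), '?')} {val!r}")
    return " AND ".join(parts) if parts else "no simple filters detected"
```

For the margin example above, `summarize_filter("df[df['margin'] < 0]")` yields `margin < 0`, which is exactly the line a finance lead would flag.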
The Security Risk of Executing AI-Generated Python
Taking a raw string from an AI and running it through Python's native exec function is a massive security liability that can lead to arbitrary code execution. Even with a restricted dictionary of local variables, a clever prompt injection can use Python's object model to traverse class inheritance, access the os module via __subclasses__(), and bypass disabled built-ins entirely.
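Even with real sandboxing as the hard boundary, a cheap static pre-check can reject obviously hostile strings before they reach execution at all. A sketch (the `df`-only whitelist is an assumption of this article's single-dataframe setup, and this is defense in depth, not the defense):

```python
import ast

ALLOWED_NAMES = {"df"}  # assumption: generated code may only reference the dataframe

def validate_generated_code(code: str) -> None:
    """Defense-in-depth pre-check; NOT a substitute for sandboxed execution.
    mode='eval' already rejects statements such as `import os`; the walk
    below blocks dunder traversal and unexpected names like __import__."""
    tree = ast.parse(code, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError(f"dunder access blocked: {node.attr}")
        if isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            raise ValueError(f"unexpected name: {node.id}")
```

A legitimate filter like `df[df['sales'] > 500]` passes, while the `__subclasses__()` traversal described above trips the dunder check immediately.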
For production, you need genuine sandbox isolation rather than in-process restrictions. The solution is wrapping execution in Snowflake Python UDFs. These run in a strict Anaconda-backed sandbox with zero network access and a locked-down file system. If the AI generates malicious code, the sandbox contains it before it can compromise the environment.
Scaling to Petabytes with Snowpark Pandas
Standard Streamlit apps in Snowflake have strict memory caps, typically crashing if a user tries to load more than 1-2GB of data into a local Pandas dataframe. To handle production volumes, swap standard Pandas for Snowpark Pandas, which provides the same syntax while pushing compute down to the Snowflake engine.
| Feature | Standard Pandas | Snowpark Pandas |
|---|---|---|
| Execution Engine | Local Streamlit container | Snowflake Virtual Warehouse |
| Evaluation Model | Eager (immediate load) | Lazy (SQL compilation) |
| Scale Limit | ~2GB (OOM Risk) | Petabyte-scale |
By using snowflake.snowpark.modin.pandas, the LLM doesn't need to learn a new dialect. The operations are translated into distributed SQL, and the data remains in the warehouse while only the final filtered result is returned to the UI.
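The swap can be guarded so the same snippet also runs outside the warehouse. The fallback below is purely illustrative (locally there is no pushdown, of course); the point is that the filtering line itself never changes:

```python
try:
    # Assumption: Snowpark pandas is installed; same syntax, warehouse pushdown
    import snowflake.snowpark.modin.plugin  # noqa: F401
    import modin.pandas as pd
except ImportError:
    import pandas as pd  # local fallback so the sketch runs anywhere

df = pd.DataFrame({"region": ["North", "South"], "sales": [600, 300]})
result = df[df["sales"] > 500]  # identical code on either engine
```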
The end result is a text box where a finance lead can type a question and get a formatted table in three seconds. They do not care about the Llama 3.1 parameters or the execution sandbox. They care that they did not have to file a Jira ticket to get their numbers. We turned a syntax problem into a UI solution, backed by strict engineering and locked-down execution.