Data Lineage Validation
This section documents the accuracy of GSP's DataFlowAnalyzer (DFA) by comparing its output against independent LLM-based lineage analysis across multiple database vendors and real-world SQL codebases.
Both approaches have distinct strengths and limitations. Together they provide a more complete picture than either alone.
Methodology
Each validation report follows a consistent process:
- LLM Analysis — An LLM reads the SQL files and identifies all data lineage relationships (views, INSERT...SELECT, CTAS, triggers, stored procedures, MERGE statements)
- DFA Analysis — The GSP
DataFlowAnalyzer processes the same SQL files programmatically
- Comparison — Results are compared and discrepancies categorized:
| Category |
Severity |
Description |
| Cat 1: DFA Crash |
Critical |
DFA threw an exception, lineage missed |
| Cat 2: Silent Miss |
High |
DFA produced no error but lineage not detected |
| Cat 3: DFA Extra |
Info |
DFA found lineage LLM missed (usually FK relationships) |
| Cat 4: Robustness |
Low |
DFA errors that don't affect lineage accuracy |
Reports by Vendor
DFA Strengths vs LLM Strengths
The two approaches are complementary. Each excels where the other struggles.
| Capability |
DFA |
LLM |
| Column-level lineage |
Precise column-to-column mappings (e.g., view.salary ← employees.SALARY) |
Table-level only; cannot reliably trace individual columns through complex joins |
| Multi-source computed columns |
Correctly identifies all contributing columns (e.g., TOTAL_COST ← unit_cost + quantity_sold) |
May miss secondary source columns in expressions |
| Foreign key relationships |
Automatically extracted from DDL constraints as ER lineage |
Often noted as structural info but not counted as lineage |
| Deterministic and reproducible |
Same input always produces same output |
Results vary between runs; may miss or hallucinate relationships |
| Scale |
Handles thousands of SQL files in batch |
Context window limits restrict to smaller file sets |
| COALESCE / multi-branch expressions |
Traces all branches to their source columns |
May oversimplify to the "primary" source |
| Indirect dataflow (fdr) |
Distinguishes direct (fdd) vs indirect (fdr) relationships (WHERE, JOIN ON, GROUP BY) |
Does not differentiate — reports all as "lineage" |
| Object views and type hierarchies |
Correctly handles Oracle object views (OF type), UNDER subviews, MAKE_REF |
May misidentify nested object-relational patterns |
| Capability |
LLM |
DFA |
| Novel SQL syntax |
Understands any SQL pattern from language knowledge, even vendor-specific extensions not yet in the parser |
Fails on unsupported syntax (parse error = zero lineage) |
| Semantic understanding |
Understands that DECODE(active, 1, 'active', '') means "the output depends on active" |
Treats DECODE arguments mechanically per function.properties; condition args are fdr by default |
| Template languages (Jinja/dbt) |
Can read through {{ ref('model') }} and understand the intended SQL |
Cannot parse non-SQL template syntax at all |
| Dynamic SQL |
Can reason about SQL constructed via string concatenation in procedures |
Cannot trace lineage through dynamic SQL strings |
| Cross-file context |
Understands that a procedure in file A writes to a table read by a view in file B |
Limited cross-file resolution without external metadata |
| Intent and business logic |
Can explain why data flows a certain way, not just that it does |
Produces structural relationships without semantic context |
| Graceful degradation |
Never crashes; always produces some analysis even on malformed SQL |
Crashes (NPE, OOM, IndexOutOfBounds) on certain SQL patterns |
| Aggregate function arguments |
Understands that GROUP_CONCAT(CONCAT(first_name, last_name)) depends on both name columns |
Does not trace arguments inside GROUP_CONCAT, LISTAGG, STRING_AGG to source columns |
| Trigger pseudo-records |
Correctly resolves NEW.column / OLD.column to the triggering table's columns |
May resolve NEW.col as self-reference to the target table |
| CREATE RULE / partition routing |
Understands PostgreSQL rules as data routing from parent to partition tables |
Partial or no detection of CREATE RULE lineage |
DFA Known Limitations
Patterns that consistently challenge the DataFlowAnalyzer across validations:
Resolved with Configuration
| Pattern |
Affected Vendors |
Resolution |
CASE WHEN condition columns not traced |
All |
Enable /showCaseWhenAsDirect |
Open Issues
| Pattern |
Affected Vendors |
Root Cause |
Oracle DECODE() first argument |
Oracle, GaussDB |
Treated as fdr in function.properties; /showCaseWhenAsDirect does not cover it |
GROUP_CONCAT / LISTAGG / STRING_AGG |
MySQL, PostgreSQL, SQL Server |
Aggregate function arguments not traced to source columns |
Trigger NEW pseudo-record resolution |
MySQL |
NEW.col resolves to target table (self) instead of source table |
JSON_TABLE with NESTED PATH |
Oracle |
NullPointerException in column model creation |
CAST(MULTISET(SELECT ...)) in VALUES |
Oracle |
IndexOutOfBoundsException in INSERT analysis |
PL/pgSQL ELSIF, RECORD, FOR..IN EXECUTE |
Redshift |
Parser does not support these PL/pgSQL constructs |
ARRAY_AGG(DISTINCT ... IGNORE NULLS) |
BigQuery |
NullPointerException in aggregate analysis |
BigQuery 2-arg TRIM(expr, chars) |
BigQuery |
Parser only supports standard TRIM(LEADING/TRAILING FROM ...) |
BigQuery scripting (FOR..IN..DO, BEGIN..EXCEPTION) |
BigQuery |
Scripting constructs not supported |
PostgreSQL CREATE RULE partition routing |
PostgreSQL |
Partially detected — column-level lineage incomplete |
| Function body table reads |
PostgreSQL |
SELECT INTO within functions not emitted in simple output mode |
Expected Limitations (Not Bugs)
| Pattern |
Reason |
Jinja/dbt templates ({{ ref('...') }}) |
Not SQL — requires template pre-processing |
Dynamic SQL (EXECUTE IMMEDIATE, string concatenation) |
SQL text constructed at runtime, unknowable at parse time |
External data loads (SQL*Loader, COPY FROM) |
Data source is outside SQL; no SQL-level lineage |
LLM Known Limitations
Patterns where LLM analysis is unreliable or incomplete:
Structural Limitations
| Limitation |
Impact |
Example |
| No column-level precision |
LLM identifies table-to-table flows but cannot reliably map individual columns through multi-table joins with aliases |
SELECT e.name, d.dept_name FROM emp e JOIN dept d ON ... — LLM knows both tables contribute but may not map which column goes where |
| Cannot distinguish fdd vs fdr |
All relationships reported as "lineage" without differentiating direct dataflow from indirect (WHERE/JOIN conditions) |
WHERE status = 'A' — LLM reports status as lineage, but it's fdr (indirect), not fdd (direct) |
| Hallucination risk |
May infer lineage that doesn't exist, especially with ambiguous column names or complex CTEs |
A column named id in a CTE may be incorrectly attributed to the wrong source table |
| Non-deterministic |
Different runs on the same SQL may produce slightly different results |
One run may catch a subtle trigger flow, the next may miss it |
| Context window limits |
Cannot process very large SQL files (>200K lines) or very many files simultaneously |
The 231K-line INSERT data files must be skipped entirely |
Analytical Limitations
| Limitation |
Impact |
| Misses FK/ER relationships |
Does not systematically extract foreign key constraints from DDL as lineage |
| Over-counts INSERT...VALUES |
May report literal INSERT statements as "lineage" when there's no source table |
| Inconsistent trigger analysis |
May miss INSTEAD OF triggers or misidentify which tables are source vs target |
| No incremental analysis |
Cannot efficiently re-analyze only changed files; must re-read everything |
| Cannot verify correctness |
LLM lineage cannot be programmatically validated — it's an opinion, not a proof |
When to Use Which Approach
| Scenario |
Recommended Approach |
| Production lineage for data governance |
DFA — deterministic, column-level, auditable |
| Quick assessment of unfamiliar SQL codebase |
LLM — handles any syntax, gives semantic context |
| Validating DFA accuracy on new SQL patterns |
Both — use LLM as oracle, DFA as system under test |
| Lineage through dbt/Jinja templates |
LLM — DFA cannot parse templates |
| Lineage for compliance/audit trails |
DFA — reproducible, machine-readable output |
| Understanding why data flows a certain way |
LLM — can explain business logic and intent |
| Processing 10,000+ SQL files |
DFA — scales to any size with batch/parallel processing |
| Debugging a specific query's lineage |
DFA with /showCaseWhenAsDirect and /env metadata.json — precise column-level detail |
Configuration Impact
Some DFA options significantly affect lineage completeness:
| Option |
Impact |
/showCaseWhenAsDirect |
Resolves all CASE WHEN condition column misses |
/showConstant |
Makes literal value flows visible |
/env metadata.json |
Improves orphan column resolution via external schema metadata |
/lof |
Links unresolved columns to first table (fallback heuristic) |
/treatArgumentsInCountFunctionAsDirectDataflow |
Treats aggregate function arguments as direct dataflow |
See Configuration Options for full details.