Quality Rules & Validation

Graphora’s quality validation system enforces data quality through format rules, business rules, and automated scoring, so that extracted data meets your standards before it is merged into your knowledge graph.

Quality System Overview

The quality validation system operates on multiple levels:

Format Rules

Pattern matching, length constraints, and case formatting validation

Business Rules

Domain-specific validation including allowed/forbidden values and ranges

Automated Scoring

0-100 scale scoring with letter grades (A-F) and auto-approval logic

Violation Reporting

Detailed reports with context, suggestions, and confidence scores

Adding Quality Rules to Properties

Quality rules are defined within property definitions using the quality section:
entities:
  Company:
    properties:
      name:
        type: string
        description: "Company legal name"
        required: true
        quality:                    # Quality rules section
          format:                   # Format validation rules
            pattern: "^[A-Z][a-zA-Z0-9\\s&.,-]+$"
            minLength: 2
            maxLength: 100
            caseFormat: "titleCase"
          business:                 # Business validation rules
            forbiddenValues: ["Unknown Company", "N/A", "TBD", ""]
            allowedValues: null    # Optional whitelist

Format Rules

Format rules validate the structure and formatting of text data.

Pattern Validation

Use regular expressions to enforce specific formats:
properties:
  email:
    type: string
    quality:
      format:
        pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
        
  phone:
    type: string
    quality:
      format:
        pattern: "^\\+?1?[-.\\s]?\\(?[0-9]{3}\\)?[-.\\s]?[0-9]{3}[-.\\s]?[0-9]{4}$"
        
  cik:
    type: string
    quality:
      format:
        pattern: "^[0-9]{10}$"   # Exactly 10 digits
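In double-quoted YAML strings the backslash is itself an escape character, so write every regex backslash twice (for example \\s or \\.). Single-quoted YAML strings take backslashes literally and need no doubling.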
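To see how these patterns behave before committing them to an ontology, you can try them locally. The following is a minimal sketch using Python's standard `re` module; the `PATTERNS` table and `matches_pattern` helper are hypothetical names for illustration, and the use of full-string matching is an assumption about how the validator applies patterns.

```python
import re

# Patterns copied from the YAML above, written as Python raw strings,
# so each regex backslash appears once instead of twice.
PATTERNS = {
    "email": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    "phone": r"^\+?1?[-.\s]?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$",
    "cik": r"^[0-9]{10}$",
}


def matches_pattern(prop: str, value: str) -> bool:
    """Return True if the whole value satisfies the pattern for `prop`.

    Hypothetical helper: full-string matching is an assumption here.
    """
    return re.fullmatch(PATTERNS[prop], value) is not None


if __name__ == "__main__":
    for prop, value in [
        ("email", "jane.doe@example.com"),
        ("phone", "+1 (555) 123-4567"),
        ("cik", "320193"),  # too short: the pattern requires 10 digits
    ]:
        print(prop, repr(value), matches_pattern(prop, value))
```

Running a handful of real and deliberately malformed values through a check like this is a quick way to spot patterns that are stricter or looser than intended.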

Length Constraints

Enforce minimum and maximum character limits:
properties:
  companyName:
    type: string
    quality:
      format:
        minLength: 2           # At least 2 characters
        maxLength: 100         # No more than 100 characters
        
  description:
    type: string
    quality:
      format:
        minLength: 10          # Meaningful descriptions
        maxLength: 500         # Prevent excessive length
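The length rules amount to a pair of bounds checks. As a rough sketch (the `check_length` helper is hypothetical, and counting length with Python's `len` is an assumption about how characters are counted):

```python
from typing import Optional


def check_length(
    value: str,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
) -> list[str]:
    """Return a list of length problems; an empty list means the value passes."""
    problems = []
    if min_length is not None and len(value) < min_length:
        problems.append(f"too short: {len(value)} < {min_length}")
    if max_length is not None and len(value) > max_length:
        problems.append(f"too long: {len(value)} > {max_length}")
    return problems


if __name__ == "__main__":
    print(check_length("A", min_length=2, max_length=100))
    print(check_length("Acme Corp", min_length=2, max_length=100))
```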

Case Format Validation

Enforce consistent text casing:
titleCase: first letter of each word capitalized
companyName:
  type: string
  quality:
    format:
      caseFormat: "titleCase"
upperCase: all letters uppercase
countryCode:
  type: string
  quality:
    format:
      caseFormat: "upperCase"
lowerCase: all letters lowercase
email:
  type: string
  quality:
    format:
      caseFormat: "lowerCase"
sentenceCase: first letter capitalized, rest lowercase
description:
  type: string
  quality:
    format:
      caseFormat: "sentenceCase"
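The four caseFormat values can be read as simple string predicates. The sketch below is one possible interpretation of the descriptions above, not the validator's actual logic; `matches_case` is a hypothetical helper, and ignoring words that start with a non-letter (such as "&") in titleCase is an assumption.

```python
def matches_case(value: str, case_format: str) -> bool:
    """Check a value against one of the four caseFormat names (illustrative)."""
    if case_format == "titleCase":
        # Every word that starts with a letter must start with an uppercase one.
        return all(w[0].isupper() for w in value.split() if w[0].isalpha())
    if case_format == "upperCase":
        return value == value.upper()
    if case_format == "lowerCase":
        return value == value.lower()
    if case_format == "sentenceCase":
        # First character uppercase, everything after it lowercase.
        return value[:1].isupper() and value[1:] == value[1:].lower()
    raise ValueError(f"unknown caseFormat: {case_format!r}")


if __name__ == "__main__":
    print(matches_case("Acme Corp", "titleCase"))
    print(matches_case("acme corp", "titleCase"))
    print(matches_case("A leading cloud provider", "sentenceCase"))
```

Note that a strict sentenceCase check rejects proper nouns inside a sentence ("A leading Berlin startup"), which is worth keeping in mind before applying it to free-text descriptions.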

Business Rules

Business rules enforce domain-specific constraints and data validity.

Forbidden Values

Prevent specific values that indicate poor data quality:
properties:
  companyName:
    type: string
    quality:
      business:
        forbiddenValues: 
          - "Unknown Company"
          - "N/A"
          - "TBD"
          - "Not Available"
          - ""
          - "NULL"
          
  industry:
    type: string
    quality:
      business:
        forbiddenValues:
          - "Unknown"
          - "Other"
          - "Miscellaneous"

Allowed Values (Whitelist)

Restrict values to a predefined set:
properties:
  companyStatus:
    type: string
    quality:
      business:
        allowedValues:
          - "Active"
          - "Inactive"
          - "Suspended"
          - "Dissolved"
          
  country:
    type: string
    quality:
      business:
        allowedValues:
          - "United States"
          - "Canada"
          - "United Kingdom"
          - "Germany"
          - "France"
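Forbidden and allowed values both reduce to membership tests. A minimal sketch follows, assuming exact, case-sensitive comparison (the source does not say whether matching is case-sensitive); `check_values` is a hypothetical helper.

```python
from typing import Optional, Sequence


def check_values(
    value: str,
    forbidden: Sequence[str] = (),
    allowed: Optional[Sequence[str]] = None,
) -> list[str]:
    """Return a list of business-rule problems; empty means the value passes."""
    problems = []
    if value in forbidden:
        problems.append(f"forbidden value: {value!r}")
    # allowed=None mirrors `allowedValues: null`, meaning no whitelist applies.
    if allowed is not None and value not in allowed:
        problems.append(f"not in allowed values: {value!r}")
    return problems


if __name__ == "__main__":
    statuses = ["Active", "Inactive", "Suspended", "Dissolved"]
    print(check_values("Active", allowed=statuses))
    print(check_values("active", allowed=statuses))
    print(check_values("N/A", forbidden=["Unknown Company", "N/A", "TBD", ""]))
```

Under this assumption "active" fails a whitelist that contains "Active", so a whitelist pairs naturally with a caseFormat rule on the same property.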

Numeric Range Validation

For integer and float properties, enforce value ranges:
properties:
  employeeCount:
    type: integer
    quality:
      business:
        minValue: 1            # At least 1 employee
        maxValue: 1000000      # Reasonable upper bound
        
  revenue:
    type: float
    quality:
      business:
        minValue: 0.0          # Non-negative revenue
        maxValue: 1000000.0    # 1,000,000 million USD (1 trillion) when revenue is in millions
        
  ownershipPercentage:
    type: float
    quality:
      business:
        minValue: 0.0
        maxValue: 100.0
        inclusiveMin: true     # Include 0%
        inclusiveMax: true     # Include 100%

Range Inclusivity

Control whether boundary values are included:
properties:
  age:
    type: integer
    quality:
      business:
        minValue: 0
        maxValue: 150
        inclusiveMin: true     # 0 is valid
        inclusiveMax: false    # 150 is not valid (149 max)
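The effect of the two inclusivity flags is easiest to see at the boundaries. A small sketch, assuming both flags default to inclusive when omitted (the source does not state the defaults); `in_range` is a hypothetical helper.

```python
from typing import Optional, Union

Number = Union[int, float]


def in_range(
    value: Number,
    min_value: Optional[Number] = None,
    max_value: Optional[Number] = None,
    inclusive_min: bool = True,   # assumed default
    inclusive_max: bool = True,   # assumed default
) -> bool:
    """Return True if value lies within the configured bounds."""
    if min_value is not None:
        if value < min_value or (value == min_value and not inclusive_min):
            return False
    if max_value is not None:
        if value > max_value or (value == max_value and not inclusive_max):
            return False
    return True


if __name__ == "__main__":
    # The `age` rule above: 0 is valid, 150 is not.
    for age in (0, 149, 150):
        print(age, in_range(age, 0, 150, inclusive_min=True, inclusive_max=False))
```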

Global Quality Configuration

Define global quality settings that apply to the entire ontology:
version: 1

# Global quality configuration
qualityConfig:
  overallScoreThreshold: 80     # Minimum score for auto-approval
  maxViolationsPerEntity: 5     # Maximum violations before failure
  confidenceThreshold: 0.7      # Minimum extraction confidence
  
  # Distribution-based validation
  distributionRules:
    maxDuplicateRatio: 0.3      # Max 30% duplicate values
    minUniqueRatio: 0.1         # Min 10% unique values
    
entities:
  # ... entity definitions

Quality Thresholds

Configure when data requires manual review:
qualityConfig:
  scoring:
    autoApproveThreshold: 90    # Score >= 90: auto-approve
    manualReviewThreshold: 70   # Score 70-89: manual review
    autoRejectThreshold: 50     # Score < 50: auto-reject
    
  violationLimits:
    maxErrors: 0                 # No critical errors allowed
    maxWarnings: 5               # Up to 5 warnings acceptable
    maxTotalViolations: 10       # Overall violation limit
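One way to read the scoring thresholds is as a routing function from score to decision. The sketch below is an interpretation only: `Thresholds` and `decide` are hypothetical names, and because the configuration comments above do not say what happens to scores from 50 to 69, the sketch reports that band as "undefined" rather than guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    auto_approve: int = 90    # autoApproveThreshold
    manual_review: int = 70   # manualReviewThreshold
    auto_reject: int = 50     # autoRejectThreshold


def decide(score: float, t: Thresholds = Thresholds()) -> str:
    """Map a 0-100 quality score to a routing decision (illustrative)."""
    if score >= t.auto_approve:
        return "auto_approve"
    if score >= t.manual_review:
        return "manual_review"
    if score < t.auto_reject:
        return "auto_reject"
    # Scores between auto_reject and manual_review are not covered
    # by the documented comments, so no decision is assumed here.
    return "undefined"


if __name__ == "__main__":
    for s in (95, 75, 60, 45):
        print(s, decide(s))
```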

Complete Quality-Aware Example

Here’s a comprehensive example showing all quality rule types:
version: 1

qualityConfig:
  overallScoreThreshold: 80
  maxViolationsPerEntity: 5
  confidenceThreshold: 0.7

entities:
  Company:
    properties:
      name:
        type: string
        description: "Official company legal name"
        required: true
        unique: true
        quality:
          format:
            pattern: "^[A-Z][a-zA-Z0-9\\s&.,-]+$"
            minLength: 2
            maxLength: 100
            caseFormat: "titleCase"
          business:
            forbiddenValues: ["Unknown Company", "N/A", "TBD", ""]
            
      cik:
        type: string
        description: "SEC Central Index Key"
        unique: true
        quality:
          format:
            pattern: "^[0-9]{10}$"
            
      industry:
        type: string
        description: "Primary industry classification"
        required: true
        quality:
          business:
            allowedValues:
              - "Technology"
              - "Healthcare"
              - "Financial Services"
              - "Manufacturing"
              - "Retail"
              - "Energy"
              - "Real Estate"
              
      employeeCount:
        type: integer
        description: "Total number of employees"
        quality:
          business:
            minValue: 1
            maxValue: 1000000
            
      revenue:
        type: float
        description: "Annual revenue in millions USD"
        quality:
          business:
            minValue: 0.0
            maxValue: 1000000.0
            
      foundedYear:
        type: integer
        description: "Year company was founded"
        quality:
          business:
            minValue: 1800
            maxValue: 2024
            
    relationships:
      HAS_SUBSIDIARY:
        target: Company
        cardinality: oneToMany
        properties:
          ownershipPercentage:
            type: float
            description: "Percentage ownership"
            required: true
            quality:
              business:
                minValue: 0.0
                maxValue: 100.0
                inclusiveMin: true
                inclusiveMax: true

Quality Scoring System

The quality system automatically scores extracted data:

Scoring Algorithm

  • Base Score: Starts at 100 points
  • Format Violations: -5 to -15 points each (based on severity)
  • Business Rule Violations: -10 to -25 points each
  • Missing Required Properties: -20 points each
  • Low Confidence Extractions: -5 to -10 points each
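The deductions above can be sketched as a simple subtraction from a base of 100. This is an illustration under stated assumptions, not the actual scoring code: severity is modeled as a weight between 0 and 1 that interpolates linearly within each documented penalty range, and the result is floored at 0 to stay on the 0-100 scale. All names are hypothetical.

```python
# Documented penalty ranges (min, max) per violation kind.
PENALTY_RANGES = {
    "format": (5, 15),
    "business": (10, 25),
    "low_confidence": (5, 10),
}
MISSING_REQUIRED_PENALTY = 20


def penalty(kind: str, severity: float) -> float:
    """Interpolate within the range for `kind`; severity is in [0, 1] (assumed)."""
    low, high = PENALTY_RANGES[kind]
    return low + (high - low) * severity


def quality_score(violations: list[tuple[str, float]], missing_required: int = 0) -> float:
    """Start at 100 and subtract each penalty; never drop below 0 (assumed floor)."""
    total = 100.0
    for kind, severity in violations:
        total -= penalty(kind, severity)
    total -= MISSING_REQUIRED_PENALTY * missing_required
    return max(0.0, total)


if __name__ == "__main__":
    # One mild format violation, one severe business violation,
    # and one missing required property: 100 - 5 - 25 - 20.
    print(quality_score([("format", 0.0), ("business", 1.0)], missing_required=1))
```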

Letter Grades

Grade A: Excellent Quality
  • Minimal or no violations
  • High extraction confidence
  • Auto-approval eligible
# Example: Score 95/100, Grade A
overall_score: 95
grade: "A"
auto_approve: true
Grade B: Good Quality
  • Minor violations only
  • Generally acceptable for production
  • May require brief review
# Example: Score 85/100, Grade B  
overall_score: 85
grade: "B"
requires_review: false
Grade C: Acceptable Quality
  • Some quality issues present
  • Manual review recommended
  • Fixable violations
# Example: Score 75/100, Grade C
overall_score: 75
grade: "C" 
requires_review: true
Grade D: Poor Quality
  • Significant quality issues
  • Manual review required
  • Consider re-extraction
# Example: Score 65/100, Grade D
overall_score: 65
grade: "D"
requires_review: true
Grade F: Failed Quality
  • Critical quality failures
  • Not suitable for production
  • Re-extraction strongly recommended
# Example: Score 45/100, Grade F
overall_score: 45
grade: "F"
requires_review: true
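The exact score boundaries for each grade are not listed here. The sketch below assumes conventional cutoffs of 90, 80, 70, and 60, which are consistent with the five example scores above but are an assumption; `letter_grade` is a hypothetical helper.

```python
# Assumed cutoffs: consistent with the examples (95=A, 85=B, 75=C, 65=D, 45=F),
# but not confirmed by the documentation.
GRADE_CUTOFFS = ((90, "A"), (80, "B"), (70, "C"), (60, "D"))


def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the assumed cutoffs."""
    for cutoff, grade in GRADE_CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"


if __name__ == "__main__":
    for s in (95, 85, 75, 65, 45):
        print(s, letter_grade(s))
```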

Violation Reporting

Quality violations include detailed context and suggestions:
{
  "rule_type": "format",
  "rule_name": "pattern",
  "severity": "error",
  "message": "Company name 'acme corp' does not match required pattern",
  "context": {
    "entity_type": "Company", 
    "property": "name",
    "extracted_value": "acme corp",
    "expected_pattern": "^[A-Z][a-zA-Z0-9\\s&.,-]+$"
  },
  "suggestion": "Capitalize first letter: 'Acme Corp'",
  "confidence": 0.85
}
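A violation report in this shape is straightforward to consume programmatically. The following sketch parses the example above with the standard `json` module and renders a one-line summary; the `Violation` dataclass and both helpers are hypothetical and assume the report always carries exactly the fields shown.

```python
import json
from dataclasses import dataclass


@dataclass
class Violation:
    rule_type: str
    rule_name: str
    severity: str
    message: str
    context: dict
    suggestion: str
    confidence: float


def parse_violation(raw: str) -> Violation:
    """Build a Violation from a JSON report with the fields shown above."""
    return Violation(**json.loads(raw))


def summarize(v: Violation) -> str:
    """Render a compact, human-readable line for logs or review queues."""
    where = f"{v.context['entity_type']}.{v.context['property']}"
    return f"[{v.severity}] {where}: {v.message} (suggestion: {v.suggestion})"


# Raw string so the JSON escape sequence \\s reaches the parser unchanged.
SAMPLE = r'''
{
  "rule_type": "format",
  "rule_name": "pattern",
  "severity": "error",
  "message": "Company name 'acme corp' does not match required pattern",
  "context": {
    "entity_type": "Company",
    "property": "name",
    "extracted_value": "acme corp",
    "expected_pattern": "^[A-Z][a-zA-Z0-9\\s&.,-]+$"
  },
  "suggestion": "Capitalize first letter: 'Acme Corp'",
  "confidence": 0.85
}
'''

if __name__ == "__main__":
    print(summarize(parse_violation(SAMPLE)))
```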

Best Practices

Start with Essential Rules: Begin with basic format and business rules, then add more sophisticated validation as needed.
Test with Real Data: Validate your quality rules against sample extractions to ensure they’re not too restrictive.
Avoid Over-Constraining: Too many strict rules can result in excessive false positives and manual review overhead.
Use Meaningful Thresholds: Set quality thresholds based on your specific use case and tolerance for data imperfection.

API Integration

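To run quality checks during extraction, enable validation when transforming a document against an ontology that defines quality rules:
# Assumes `client` is an initialized Graphora client and `ontology_id`
# identifies an ontology that contains quality rules.
# Transform with quality validation
response = client.transform_document(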
    file_path="document.pdf",
    ontology_id=ontology_id,
    enable_quality_validation=True  # Enable quality checking
)

# Check quality results
if response.quality_results:
    print(f"Quality Score: {response.quality_results.overall_score}")
    print(f"Grade: {response.quality_results.grade}")
    
    if response.quality_results.requires_review:
        print("Manual review required")
        for violation in response.quality_results.violations:
            print(f"- {violation.message}")

Next Steps

See Examples

Browse complete quality-aware ontology examples

Validation Testing

Learn how to test and validate your quality rules