Data Configuration

Configure your data sources and processing options for the ZettaQuant SEAL pipeline.

Data Source Options

ZettaQuant supports multiple data ingestion methods to fit your workflow:

Existing Snowflake Tables

Connect to your current document and sentence tables with minimal setup.

Prerequisites:

  • Document table with document_id column
  • Sentence table with document_id, sentence_id, and sentence_text columns
  • Proper access grants configured
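
The prerequisites above can be sketched as two minimal table definitions. This is illustrative only: table names and column types are assumptions, and the pipeline requires only the listed columns, not this exact schema.

```sql
-- Illustrative sketch; names and types are placeholders.
CREATE TABLE documents (
    document_id VARCHAR NOT NULL
    -- additional metadata columns are permitted
);

CREATE TABLE sentences (
    document_id   VARCHAR NOT NULL,  -- references documents.document_id
    sentence_id   VARCHAR NOT NULL,
    sentence_text VARCHAR
);
```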

PDF Ingestion

Upload and process PDF documents directly through the Streamlit interface.

Features:

  • Drag-and-drop interface for single PDFs or ZIP archives of PDFs
  • Automatic text extraction and table creation
  • Batch processing capabilities

Requirements:

  • CREATE TABLE permissions on your target schema
  • Sufficient data access grants, as outlined in the access grants documentation, if the target tables already exist
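
As a sketch, the CREATE TABLE requirement can be granted to the application's role like this; the database, schema, and role names here are placeholders, not names the pipeline requires.

```sql
-- Placeholders: my_db, my_schema, and my_role are assumptions.
GRANT USAGE ON DATABASE my_db TO ROLE my_role;
GRANT USAGE ON SCHEMA my_db.my_schema TO ROLE my_role;
GRANT CREATE TABLE ON SCHEMA my_db.my_schema TO ROLE my_role;
```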

Large File Processing

Staging from Local Environment

For PDF or ZIP files larger than 200 MB, stage them through SnowSQL instead of uploading through the interface:

  1. Stage the file using SnowSQL:

    PUT file://path/to/your/large-file @my_stage;
  2. Process staged files through the ZettaQuant application interface

Note: For detailed information on staging files from your local environment, refer to the Snowflake documentation on the PUT command.
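
The two steps above can be sketched as one SnowSQL session. The stage name is a placeholder, and creating the stage is only needed if it does not already exist.

```sql
-- my_stage is a placeholder name.
CREATE STAGE IF NOT EXISTS my_stage;

-- Upload the local file to the internal stage.
PUT file://path/to/your/large-file @my_stage;

-- Verify the upload before processing it in the application.
LIST @my_stage;
```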

Best Practices for Large Files:

  • Compress files before staging to reduce transfer time
  • Use internal stages for better performance
  • Consider splitting very large archives into smaller batches
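
The compression advice can be combined with PUT's own options; the values below are illustrative, not required settings.

```sql
-- AUTO_COMPRESS gzips files during upload (little benefit for ZIPs,
-- which are already compressed); PARALLEL raises the upload thread count.
PUT file://path/to/your/large-file @my_stage
    AUTO_COMPRESS = TRUE
    PARALLEL = 8;
```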

Troubleshooting

Common Issues

  • Permission Errors: Verify data access grants
  • Processing Failures: Check compute pool status
  • Format Errors: Validate input document structure
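
The first two checks above can be run directly in a Snowflake session; the schema, role, and compute pool names here are placeholders.

```sql
-- Permission errors: inspect what is granted on the schema and to the role.
SHOW GRANTS ON SCHEMA my_db.my_schema;
SHOW GRANTS TO ROLE my_role;

-- Processing failures: check the compute pool's state.
SHOW COMPUTE POOLS;
DESCRIBE COMPUTE POOL my_pool;
```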