πŸ“„ Doc Loader

Rust Python License: MIT GitHub Crates.io PyPI Documentation

A comprehensive Rust toolkit for extracting and processing documentation from multiple file formats into a universal JSON structure, optimized for vector stores and RAG (Retrieval-Augmented Generation) systems.

🎯 Project Status

Current Version: 0.3.1
Status: βœ… Production Ready
Python Bindings: βœ… Fully Functional
Documentation: βœ… Complete

πŸš€ Features

πŸ“¦ Installation

Prerequisites

  - A recent Rust toolchain (rustc and cargo)
  - Python 3.9+ and maturin (only if building the Python bindings)

Building from Source

git clone https://github.com/WillIsback/doc_loader.git
cd doc_loader
cargo build --release

Available Binaries

After building, you’ll have access to these CLI tools:

  - doc_loader: the universal processor for all supported formats
  - pdf_processor, txt_processor, json_processor, csv_processor, docx_processor: format-specific processors

πŸ”§ Usage

Universal Processor

Process any supported document type with the main binary:

# Basic usage
./target/release/doc_loader --input document.pdf

# With custom options
./target/release/doc_loader \
    --input document.pdf \
    --output result.json \
    --chunk-size 1500 \
    --chunk-overlap 150 \
    --detect-language \
    --pretty
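The --chunk-size and --chunk-overlap flags control how extracted text is split before export. As an illustration only (this is not the crate's actual algorithm), overlapping chunking can be sketched in a few lines of Python:

```python
def chunk_text(text, chunk_size=1500, overlap=150):
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share `overlap` characters (sketch only)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
    return chunks

chunks = chunk_text("x" * 4000, chunk_size=1500, overlap=150)
print(len(chunks))  # 3
```

The overlap means each chunk repeats the tail of the previous one, so a sentence falling on a chunk boundary is still seen whole by at least one chunk.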

Format-Specific Processors

Use specialized processors for specific formats:

# Process a PDF
./target/release/pdf_processor --input report.pdf --pretty

# Process a CSV with analysis
./target/release/csv_processor --input data.csv --output analysis.json

# Process a JSON document
./target/release/json_processor --input config.json --detect-language

Command Line Options

All processors support these common options:

  - --input <FILE>: path to the input document
  - --output <FILE>: path for the JSON output file
  - --chunk-size <N>: maximum chunk size
  - --chunk-overlap <N>: overlap between consecutive chunks
  - --detect-language: enable language detection
  - --pretty: pretty-print the JSON output

πŸ“‹ Output Format

All processors generate a standardized JSON structure:

{
  "document_metadata": {
    "filename": "document.pdf",
    "filepath": "/path/to/document.pdf", 
    "document_type": "PDF",
    "file_size": 1024000,
    "created_at": "2025-01-01T12:00:00Z",
    "modified_at": "2025-01-01T12:00:00Z",
    "title": "Document Title",
    "author": "Author Name",
    "format_metadata": {
      // Format-specific metadata
    }
  },
  "chunks": [
    {
      "id": "pdf_chunk_0",
      "content": "Extracted text content...",
      "chunk_index": 0,
      "position": {
        "page": 1,
        "line": 10,
        "start_offset": 0,
        "end_offset": 1000
      },
      "metadata": {
        "size": 1000,
        "language": "en",
        "confidence": 0.95,
        "format_specific": {
          // Chunk-specific metadata
        }
      }
    }
  ],
  "processing_info": {
    "processor": "PdfProcessor",
    "processor_version": "1.0.0",
    "processed_at": "2025-01-01T12:00:00Z",
    "processing_time_ms": 150,
    "total_chunks": 5,
    "total_content_size": 5000,
    "processing_params": {
      "max_chunk_size": 1000,
      "chunk_overlap": 100,
      "text_cleaning": true,
      "language_detection": true
    }
  }
}
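Because the output is plain JSON, downstream tooling in any language can consume it directly. A small Python sketch (field names taken from the sample above, abridged) that turns a result into records ready for indexing in a vector store:

```python
import json

# Abridged result following the schema above
raw = '''
{
  "document_metadata": {"filename": "document.pdf", "document_type": "PDF"},
  "chunks": [
    {"id": "pdf_chunk_0", "content": "Extracted text content...",
     "chunk_index": 0,
     "metadata": {"size": 1000, "language": "en", "confidence": 0.95}}
  ],
  "processing_info": {"total_chunks": 1}
}
'''
result = json.loads(raw)

# Pair each chunk's text with the metadata a retriever typically needs
records = [
    {
        "id": chunk["id"],
        "text": chunk["content"],
        "source": result["document_metadata"]["filename"],
        "language": chunk["metadata"].get("language"),
    }
    for chunk in result["chunks"]
]
print(records[0]["source"])  # document.pdf
```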

πŸ—οΈ Architecture

The project follows a modular architecture:

src/
β”œβ”€β”€ lib.rs              # Main library interface
β”œβ”€β”€ main.rs             # Universal CLI
β”œβ”€β”€ error.rs            # Error handling
β”œβ”€β”€ core/               # Core data structures
β”‚   └── mod.rs          # Universal output format
β”œβ”€β”€ utils/              # Utility functions
β”‚   └── mod.rs          # Text processing utilities
β”œβ”€β”€ processors/         # Document processors
β”‚   β”œβ”€β”€ mod.rs          # Common processor traits
β”‚   β”œβ”€β”€ pdf.rs          # PDF processor
β”‚   β”œβ”€β”€ txt.rs          # Text processor
β”‚   β”œβ”€β”€ json.rs         # JSON processor
β”‚   β”œβ”€β”€ csv.rs          # CSV processor
β”‚   └── docx.rs         # DOCX processor
└── bin/                # Individual CLI binaries
    β”œβ”€β”€ pdf_processor.rs
    β”œβ”€β”€ txt_processor.rs
    β”œβ”€β”€ json_processor.rs
    β”œβ”€β”€ csv_processor.rs
    └── docx_processor.rs

πŸ§ͺ Testing

Test the functionality with the provided sample files:

# Test text processing
./target/release/doc_loader --input test_sample.txt --pretty

# Test JSON processing
./target/release/json_processor --input test_sample.json --pretty

# Test CSV processing
./target/release/csv_processor --input test_sample.csv --pretty

πŸ“Š Format-Specific Features

PDF Processing

CSV Processing

JSON Processing

DOCX Processing

TXT Processing

πŸ”§ Library Usage

Use doc_loader as a library in your Rust projects:

use doc_loader::{UniversalProcessor, ProcessingParams};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let processor = UniversalProcessor::new();
    let params = ProcessingParams::default()
        .with_chunk_size(1500)
        .with_language_detection(true);
    
    let result = processor.process_file(
        Path::new("document.pdf"), 
        Some(params)
    )?;
    
    println!("Extracted {} chunks", result.chunks.len());
    Ok(())
}

πŸ“ˆ Performance

πŸ›£οΈ Roadmap

Immediate Improvements

Future Features

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

πŸ“„ License

This project is licensed under the MIT License.

πŸ› Issues & Support

Report issues on the project’s issue tracker.


Doc Loader - Making document processing simple, fast, and universal! πŸš€

🐍 Python Bindings βœ…

Doc Loader provides fully functional Python bindings through PyO3, offering the same performance as the native Rust library with a clean Python API.

Installation

# Via PyPI (recommended)
pip install extracteur-docs-rs

# Or build from source
# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install maturin build tool
pip install maturin

# Build and install Python bindings (Python 3.9+ supported)
venv/bin/maturin develop --features python --release

Usage

import extracteur_docs_rs as doc_loader

# Quick start - process any supported file format
result = doc_loader.process_file("document.pdf", chunk_size=500)

print(f"Chunks: {result.chunk_count()}")
print(f"Words: {result.total_word_count()}")
print(f"Supported formats: {doc_loader.supported_extensions()}")

# Advanced usage with custom parameters
processor = doc_loader.PyUniversalProcessor()
params = doc_loader.PyProcessingParams(
    chunk_size=400,
    overlap=60,
    clean_text=True,
    extract_metadata=True
)

result = processor.process_file("document.txt", params)

# Process text content directly
text_result = processor.process_text_content("Your text here...", params)

# Export to JSON
json_output = result.to_json()
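From here, the chunks can feed the retrieval step of a RAG pipeline. A toy sketch of that step, using simple word-overlap scoring as a stand-in for real embedding similarity (the chunk strings below are made up for illustration):

```python
def score(query, text):
    """Word-overlap score: fraction of query words present in the text.
    A stand-in for embedding-based cosine similarity."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / (len(q) or 1)

# In practice these would come from a processed document's chunks
chunks = [
    "Rust is a systems programming language focused on safety.",
    "The CSV processor analyzes column types and row counts.",
    "Python bindings are built with PyO3 and maturin.",
]
query = "how are the python bindings built"
best = max(chunks, key=lambda c: score(query, c))
print(best)  # Python bindings are built with PyO3 and maturin.
```

A real pipeline would replace score() with vector similarity over embeddings of each chunk's content field, but the control flow is the same: score every chunk against the query, then pass the top hits to the model.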

Python Integration Examples

Status: Production Ready πŸŽ‰

The Python bindings are fully tested and functional.

Run the demo: venv/bin/python python_demo.py

For complete Python documentation, see docs/python_usage.md.