A comprehensive Rust toolkit for extracting and processing documentation from multiple file formats into a universal JSON structure, optimized for vector stores and RAG (Retrieval-Augmented Generation) systems.
Current Version: 0.3.1
Status: Production Ready
Python Bindings: Fully Functional
Documentation: Complete
git clone https://github.com/WillIsback/doc_loader.git
cd doc_loader
cargo build --release
After building, you'll have access to these CLI tools:
- doc_loader - Universal document processor
- pdf_processor - PDF-specific processor
- txt_processor - Plain text processor
- json_processor - JSON document processor
- csv_processor - CSV file processor
- docx_processor - DOCX document processor
Process any supported document type with the main binary:
# Basic usage
./target/release/doc_loader --input document.pdf
# With custom options
./target/release/doc_loader \
--input document.pdf \
--output result.json \
--chunk-size 1500 \
--chunk-overlap 150 \
--detect-language \
--pretty
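Here --chunk-size caps each chunk at 1500 characters and --chunk-overlap repeats the last 150 characters of one chunk at the start of the next, so text that straddles a boundary stays retrievable. The sliding-window logic is roughly the following (an illustrative sketch, not the crate's actual implementation):

// Illustrative sliding-window chunker: chunks of at most `size` chars,
// with the last `overlap` chars of one chunk repeated in the next.
fn chunk_text(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < size, "overlap must be smaller than chunk size");
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        // Advance by size - overlap so the next chunk re-reads the tail.
        start = end - overlap;
    }
    chunks
}

With size 1500 and overlap 150, each window steps forward by 1350 characters.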
Use specialized processors for specific formats:
# Process a PDF
./target/release/pdf_processor --input report.pdf --pretty
# Process a CSV with analysis
./target/release/csv_processor --input data.csv --output analysis.json
# Process a JSON document
./target/release/json_processor --input config.json --detect-language
All processors support these common options:
- --input <FILE> - Input file path (required)
- --output <FILE> - Output JSON file (optional, defaults to stdout)
- --chunk-size <SIZE> - Maximum chunk size in characters (default: 1000)
- --chunk-overlap <SIZE> - Overlap between chunks (default: 100)
- --no-cleaning - Disable text cleaning
- --detect-language - Enable language detection
- --pretty - Pretty print JSON output
All processors generate a standardized JSON structure:
{
"document_metadata": {
"filename": "document.pdf",
"filepath": "/path/to/document.pdf",
"document_type": "PDF",
"file_size": 1024000,
"created_at": "2025-01-01T12:00:00Z",
"modified_at": "2025-01-01T12:00:00Z",
"title": "Document Title",
"author": "Author Name",
"format_metadata": {
// Format-specific metadata
}
},
"chunks": [
{
"id": "pdf_chunk_0",
"content": "Extracted text content...",
"chunk_index": 0,
"position": {
"page": 1,
"line": 10,
"start_offset": 0,
"end_offset": 1000
},
"metadata": {
"size": 1000,
"language": "en",
"confidence": 0.95,
"format_specific": {
// Chunk-specific metadata
}
}
}
],
"processing_info": {
"processor": "PdfProcessor",
"processor_version": "1.0.0",
"processed_at": "2025-01-01T12:00:00Z",
"processing_time_ms": 150,
"total_chunks": 5,
"total_content_size": 5000,
"processing_params": {
"max_chunk_size": 1000,
"chunk_overlap": 100,
"text_cleaning": true,
"language_detection": true
}
}
}
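Because the structure is stable across formats, downstream tools can deserialize it with serde. A minimal sketch using hypothetical mirror structs (field names are taken from the schema above; the crate's own Rust types may differ):

use serde::Deserialize;

// Hypothetical mirrors of the universal output, declaring only the
// fields this consumer needs; serde ignores the rest by default.
// Requires serde (with the "derive" feature) and serde_json in Cargo.toml.
#[derive(Deserialize)]
struct UniversalOutput {
    chunks: Vec<Chunk>,
}

#[derive(Deserialize)]
struct Chunk {
    id: String,
    content: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let json = std::fs::read_to_string("result.json")?;
    let doc: UniversalOutput = serde_json::from_str(&json)?;
    for chunk in &doc.chunks {
        println!("{} ({} chars)", chunk.id, chunk.content.len());
    }
    Ok(())
}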
The project follows a modular architecture:
src/
├── lib.rs            # Main library interface
├── main.rs           # Universal CLI
├── error.rs          # Error handling
├── core/             # Core data structures
│   └── mod.rs        # Universal output format
├── utils/            # Utility functions
│   └── mod.rs        # Text processing utilities
├── processors/       # Document processors
│   ├── mod.rs        # Common processor traits
│   ├── pdf.rs        # PDF processor
│   ├── txt.rs        # Text processor
│   ├── json.rs       # JSON processor
│   ├── csv.rs        # CSV processor
│   └── docx.rs       # DOCX processor
└── bin/              # Individual CLI binaries
    ├── pdf_processor.rs
    ├── txt_processor.rs
    ├── json_processor.rs
    ├── csv_processor.rs
    └── docx_processor.rs
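The universal CLI dispatches on file extension to whichever processor claims it; each format-specific processor implements the shared trait from processors/mod.rs. A hypothetical sketch of what that contract could look like (the real trait's names and signatures may differ):

use std::path::Path;

// Placeholder types standing in for the crate's real ones.
struct ProcessingParams;
struct UniversalOutput;
struct ProcessingError;

// Hypothetical sketch of the common contract in processors/mod.rs.
trait DocumentProcessor {
    /// File extensions this processor handles, e.g. ["pdf"].
    fn supported_extensions(&self) -> &[&'static str];

    /// Extract, clean, and chunk the input into the universal format.
    fn process(
        &self,
        path: &Path,
        params: &ProcessingParams,
    ) -> Result<UniversalOutput, ProcessingError>;
}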
Test the functionality with the provided sample files:
# Test text processing
./target/debug/doc_loader --input test_sample.txt --pretty
# Test JSON processing
./target/debug/json_processor --input test_sample.json --pretty
# Test CSV processing
./target/debug/csv_processor --input test_sample.csv --pretty
Use doc_loader as a library in your Rust projects:
use doc_loader::{UniversalProcessor, ProcessingParams};
use std::path::Path;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let processor = UniversalProcessor::new();
let params = ProcessingParams::default()
.with_chunk_size(1500)
.with_language_detection(true);
let result = processor.process_file(
Path::new("document.pdf"),
Some(params)
)?;
println!("Extracted {} chunks", result.chunks.len());
Ok(())
}
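From there, the body of main could be extended to inspect or persist the result. A sketch assuming the chunk fields mirror the JSON schema above (content, chunk_index) and that the result type implements serde's Serialize; neither is a documented guarantee:

// Inside main, after process_file succeeds (assumed field names):
for chunk in &result.chunks {
    println!("chunk {}: {} chars", chunk.chunk_index, chunk.content.len());
}
// Assumes the result type implements serde::Serialize.
let json = serde_json::to_string_pretty(&result)?;
std::fs::write("result.json", json)?;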
[Add your license information here]
Report issues on the project's issue tracker.
Doc Loader - Making document processing simple, fast, and universal!
Doc Loader provides fully functional Python bindings through PyO3, offering the same performance as the native Rust library with a clean Python API.
# Via PyPI (recommended)
pip install extracteur-docs-rs
# Or build from source:
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install maturin build tool
pip install maturin
# Build and install Python bindings (Python 3.9+ supported)
venv/bin/maturin develop --features python --release
import extracteur_docs_rs as doc_loader
# Quick start - process any supported file format
result = doc_loader.process_file("document.pdf", chunk_size=500)
print(f"Chunks: {result.chunk_count()}")
print(f"Words: {result.total_word_count()}")
print(f"Supported formats: {doc_loader.supported_extensions()}")
# Advanced usage with custom parameters
processor = doc_loader.PyUniversalProcessor()
params = doc_loader.PyProcessingParams(
chunk_size=400,
overlap=60,
clean_text=True,
extract_metadata=True
)
result = processor.process_file("document.txt", params)
# Process text content directly
text_result = processor.process_text_content("Your text here...", params)
# Export to JSON
json_output = result.to_json()
The Python bindings are fully tested and functional.
Run the demo: venv/bin/python python_demo.py
For complete Python documentation, see docs/python_usage.md.