Nexus — Document Ingestion Pipeline

Turn messy spreadsheets into clean data. In the browser.

Nexus is a six-stage document ingestion pipeline written in C with zero external dependencies. It extracts tabular data from XLSX, PDF, and CSV files, then transforms, validates, and emits structured JSON, GeoJSON, or CSV. Schema discovery is heuristic: no LLMs, no cloud APIs. The same pipeline runs natively, in WASM, or embedded in your application.

192 tests · 6-stage pipeline · zero deps · C11 · WASM · fuzz-tested · AGPLv3 + Trucking Exception
Why

Logistics runs on spreadsheets. Spreadsheets run on chaos.

Every trucking company, 3PL, and carrier has "the spreadsheet." The one emailed weekly with depot locations, customer addresses, delivery windows, or station data. Different columns every time. Merged cells. Continuation rows. Hungarian coordinates in a projection system from 1972. A PDF that's actually a scan of a table that was originally an Excel file.

The typical response: a Python script that breaks every time the format changes. Or a manual process with a human in the loop. Or a SaaS ETL tool that costs $50K/year and still can't parse Hungarian EOV coordinates or reconstruct tables from PDF text runs.

Nexus is the opposite of all three.

Schema-driven, not script-driven
Declarative JSON schemas define column mapping, transforms, and validation rules. When the spreadsheet changes, update the schema — not the code. Schemas are versioned, diffable, and machine-readable.
Heuristic discovery, not LLM guessing
Drop a file with no schema. Nexus infers types, detects lat/lon columns, identifies unique fields, spots continuation rows, and generates a schema draft. Deterministic, reproducible, zero API calls. A human reviews and refines.
Structured errors, not stack traces
Every issue has a stage, severity, row number, field name, error code, and human message. Thread the NxIssueList through all six stages. Export as JSON. No more grepping logs to find which row broke the pipeline.
Runs in the browser
The WASM demo runs the full pipeline client-side. Upload a spreadsheet, see extracted data, apply transforms, validate, export. Zero server round-trips. Sensitive data never leaves the machine.

The real problem. Document ingestion is 80% of the work in logistics data integration and 0% of the value. Nexus exists so engineers can stop writing XLSX parsers and start building route optimizers, ETAs, and fleet dashboards.

Pipeline

Six stages, each independently testable

Nexus decomposes document ingestion into six composable stages. Each stage reads JSON, produces JSON, and reports structured issues. You can run the full pipeline or invoke individual stages via CLI or C API.

  XLSX / PDF / CSV                                    JSON / GeoJSON / CSV
       │                                                      ▲
       ▼                                                      │
  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
  │ Stage A │──▶│ Stage M │──▶│ Stage B │──▶│ Stage X │──▶│ Stage D │
  │ Extract │   │  Merge  │   │Transform│   │Validate │   │  Emit   │
  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
       │              │              │              │              │
   nx_raw JSON    merged rows    nx_canonical   validated     GeoJSON
   (cells, rows)  (cont. rows   (typed fields,  (bounds,      CSV
                   stripped)     derived cols)   deduped)
                                     │
                                     ▼
                                ┌────────────┐
                                │  nx_diff   │  Change detection
                                │ (baseline) │  FNV-1a hashing
                                └────────────┘

  ┌──────────────────────────────────────────────────────────────┐
  │  NxIssueList: threaded through all stages                    │
  │  stage / severity / row / field / code / message             │
  └──────────────────────────────────────────────────────────────┘
6 pipeline stages · 192 tests · ~9.4K lines of C · 0 fuzz crashes · 0 external dependencies

Stage A — Extract

Reads XLSX (ZIP → XML → shared strings → cells), PDF (text-run JSON → Y-alignment clustering → X-gap column detection), or CSV (RFC 4180, auto-delimiter). Outputs nx_raw JSON: an array of rows with cell values and metadata.

  • XLSX — ZIP container extraction via miniz, shared string table, A1 notation → row/col
  • PDF — Y-alignment clustering within row tolerance, X-gap column detection, multi-page support
  • CSV — RFC 4180 with auto-delimiter detection (comma, semicolon, tab, pipe)

Stage M — Merge

Handles continuation rows in PDF tables where a single logical row spans multiple lines. Schema-driven: the row_merge config specifies which field identifies continuation and how values are concatenated. Also strips repeated header rows.

Stage B — Transform

Column mapping, type coercion (string/int/double/bool), derived fields (constants), and a five-type multi-transform pipeline: split (delimiter + index), merge (N fields + separator), regex (capture groups), compute (built-in functions), conditional (pattern → value). Row ID generation via slugified templates.

Stage X — Validate

Four validation rules: geo_bounds (lat/lon within bounding box), format (POSIX regex pattern match), unique (dedup by field, first kept), outlier (IQR-based detection with Q1 − 1.5×IQR / Q3 + 1.5×IQR bounds). Issues tagged with row, field, and reason.

Stage D — Emit

Output as RFC 7946 GeoJSON FeatureCollection (configurable lat/lon/id fields, precision) or RFC 4180 CSV (configurable delimiter). Both formats validated downstream by standard tools.
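
For reference, a minimal RFC 7946 FeatureCollection with one Point feature looks like the following. Note that GeoJSON orders coordinates as [lon, lat]; the property names here are illustrative, not prescribed by Nexus.

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [19.0402, 47.4979]},
      "properties": {"id": "depot-budapest-1051", "name": "Depot A"}
    }
  ]
}
```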

Change Detection (nx_diff)

Compare current output against a baseline. FNV-1a 64-bit hashing for record IDs and content. O(1) hashmap lookup per record. Output: added, removed, modified, unchanged counts with field-level diffs for modified records. Useful for incremental pipeline runs.

Features

What makes Nexus different

Schema Discovery
Feed raw extracted data to nx_discover_schema(). Infers column types (int/double/string), detects lat/lon fields (range + header matching), identifies unique/required fields, spots continuation rows. Outputs a draft JSON schema.
Zero LLMs, zero API calls
Compute Functions
Six built-in compute functions for Hungarian logistics: eov_to_wgs84, dms_to_dd, coalesce, phone_normalize, zip_to_region, opening_hours. Extensible registry via nx_compute_find().
6 built-in transforms
Multi-Transform Pipeline
Chain transforms on a single field: split a combined address, regex-extract a ZIP code, merge first + last name, compute coordinates from EOV, conditionally map values. Virtual columns appended after originals.
5 transform types, composable
PDF Table Reconstruction
Reconstructs tables from PDF text runs using Y-alignment clustering and X-gap column detection. Auto-detects row tolerance and column gaps from median text height. Handles rotation, scaling, multiple pages.
1,082 LOC, battle-tested
Structured Issue Tracking
NxIssueList is an uncapped, realloc-doubling list of structured issues. Each issue: stage, severity (info/warning/error), row, field, code, message. Thread through all stages. Export as JSON. Filter by stage or severity.
No more grepping logs
Production Hardened
Input limits enforced: 100K rows, 1K columns, 100 MB files, 50 MB ZIP entries. Fuzz-tested with libFuzzer under ASan/UBSan: 1.25M+ runs across XLSX, PDF, CSV parsers with zero crashes. Regex complexity guards. No unsafe string functions.
0 crashes in 1.25M fuzz runs

Compute function reference

Function          Input             Output            Use case
eov_to_wgs84      EOV Y, EOV X      lat, lon          Hungarian HD72/EOV projection → GPS coordinates
dms_to_dd         DMS string        decimal degrees   47°29'33"N → 47.4925
coalesce          N source fields   first non-empty   Fallback chain: city || town || village
phone_normalize   Hungarian phone   E.164 format      06-30-123-4567 → +36301234567
zip_to_region     ZIP code          region name       1052 → Budapest
opening_hours     Hungarian format  OSM format        H-P: 8-17 → Mo-Fr 08:00-17:00
Schema System

Declarative, versioned, machine-readable

Nexus schemas are JSON documents that define column mapping, type coercion, multi-transforms, validation rules, and output configuration. The schema is the single source of truth for how a file format maps to your data model.

{
  "nx_schema": 2,
  "name": "gls-hu-automata",
  "columns": [
    {"source": "Aut. neve",  "target": "name",    "type": "string"},
    {"source": "EOV Y",     "target": "eov_y",   "type": "double"},
    {"source": "EOV X",     "target": "eov_x",   "type": "double"}
  ],
  "multi_transforms": [
    {"type": "compute", "function": "eov_to_wgs84",
     "sources": ["eov_y", "eov_x"],
     "targets": ["lat", "lon"]}
  ],
  "validate": {
    "geo_bounds": {"lat": "lat", "lon": "lon",
                   "min_lat": 45.7, "max_lat": 48.6,
                   "min_lon": 16.1, "max_lon": 22.9}
  },
  "row_id": "{name}-{zip}"
}

Production schemas (GLS Hungary)

Nexus ships with 10 production schemas for GLS Hungary logistics data. These serve as reference implementations and test fixtures for the full pipeline; a representative subset:

Schema                   Entity                 Format  Features
gls-hu-automata-v2       Parcel lockers         XLSX    EOV compute, geo_bounds, regex, split
gls-hu-automata-pdf-v2   Parcel lockers         PDF     Row merge for continuation, same transforms
gls-hu-pudo-v2           Pickup/dropoff points  XLSX    Multi-transforms, phone normalize, opening hours
gls-hu-pudo-pdf-v2       Pickup/dropoff points  PDF     Row merge, same transforms as XLSX
gls-hu-depots-v2         Depot locations        XLSX    EOV compute, zip_to_region, merge transforms

Schema discovery

Don't have a schema? Feed raw extracted data to nx_discover_schema() and get a draft. The heuristics are deterministic and rule-based:

  • Type inference — Tries strtod/strtol on every cell; consensus wins
  • Lat/lon detection — Numeric columns in [-90, 90] or [-180, 180] with header name matching
  • Required fields — Non-empty in all rows → marked required
  • Uniqueness — All distinct values → candidate row ID
  • Continuation rows — Detects merge patterns, emits row_merge config

Honest assessment. Schema discovery is a starting point, not a finished product. It generates a correct schema ~70% of the time for clean XLSX files with clear headers. For messy PDFs with merged cells and multi-line values, human review is always needed. The point is to save 80% of the manual work, not eliminate it.

Comparison

Nexus vs. the alternatives

Nexus occupies a specific niche: schema-driven tabular extraction from messy documents in logistics. It's not a general ETL framework, not a BI tool, and not a data lake. Here's how it compares to what teams actually use.

                       Nexus           Python scripts    Tabula / Camelot   dbt + Pandas    Trifacta / Dataprep
XLSX parsing           Native          openpyxl          No                 Via Pandas      Yes
PDF table extraction   Native          Manual            Yes                No              Limited
CSV parsing            RFC 4180        csv module        No                 Via Pandas      Yes
Schema-driven          JSON schema     Hardcoded         No                 SQL models      Recipes
Schema discovery       Heuristic       No                No                 No              ML-based
Change detection       FNV-1a diff     Manual            No                 Snapshots       No
Structured errors      Per-row/field   Exceptions        Exceptions         Test failures   UI alerts
GeoJSON output         RFC 7946        Manual            No                 Plugin          No
Runs in browser        WASM            No                No                 No              Cloud only
Embeddable             C / WASM        Python runtime    Java runtime       Python + DB     SaaS
Dependencies           Zero            pip install ...   JRE + JAR          Python + DB     Cloud subscription
Cost                   Open source     Free              Free               Free + DB       $$$/year
Fuzz-tested            1.25M+ runs     Rarely            No                 No              Unknown

The key difference. Python scripts are flexible but fragile — they break when the spreadsheet changes. dbt/Pandas are powerful but require a Python runtime and a database. Trifacta/Dataprep are polished but SaaS-only and expensive. Nexus gives you a schema-driven, embeddable, fuzz-tested pipeline with zero dependencies that runs equally well native, in WASM, or embedded in a C application.

API

C functions, JSON in, JSON out

Each stage is a standalone function: takes JSON + optional schema, returns JSON + issues. The orchestrator nx_ingest() chains them automatically, or call stages individually for maximum control.

Orchestrator (full pipeline)

NxIssueList issues;
nx_issue_list_init(&issues);

char *raw, *canon;
size_t raw_len, canon_len;

NxIngestStatus s = nx_ingest(
    data, len,
    NX_FORMAT_XLSX, "depots.xlsx",
    schema, schema_len,
    &raw, &raw_len,
    &canon, &canon_len,
    &issues);

if (s == NX_INGEST_OK) {
    // canon = canonical JSON
    free(raw);
    free(canon);
}
nx_issue_list_free(&issues);

Individual stage

// Stage A: Extract XLSX
// ('limits' and 'issues' are assumed initialized,
// as in the orchestrator example above)
SHArena *arena = sh_arena_create(
    32 * 1024 * 1024);
char *raw_json;
size_t raw_len;

NxXlsxStatus s = nx_xlsx_parse(
    data, len,
    &limits, "input.xlsx",
    arena, &issues,
    &raw_json, &raw_len);

// Stage B: Transform
char *canon_json;
size_t canon_len;
nx_xform_apply(
    raw_json, raw_len,
    schema, schema_len,
    arena, &issues,
    &canon_json, &canon_len);

sh_arena_free(arena);

CLI

# Full pipeline: XLSX + schema → canonical JSON
./nx_pipeline input.xlsx --schema schemas/gls-hu-automata-v2.json

# Extract only (raw JSON)
./nx_pipeline input.xlsx --raw

# Emit GeoJSON
./nx_pipeline input.xlsx --schema s.json --emit geojson

# Change detection against baseline
./nx_pipeline input.xlsx --schema s.json --baseline previous.json

# Auto-discover schema from raw data
./nx_pipeline input.xlsx --raw -o raw.json
./nx_pipeline raw.json --discover

Error handling

Every stage returns a typed status enum. Every function accepts an optional NxIssueList*. Issues accumulate across stages — you get the full picture, not just the first failure.
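
For illustration, a single exported issue could look like the following. The six fields are the ones documented above; the exact JSON key spelling is an assumption, not taken from the Nexus source.

```json
{
  "stage": "validate",
  "severity": "error",
  "row": 412,
  "field": "lat",
  "code": "GEO_BOUNDS",
  "message": "lat 51.20 outside [45.7, 48.6]"
}
```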

Stage           Status enum        Error codes
Extract (XLSX)  NxXlsxStatus       NULL, ZIP, NO_SHEETS, XML, ARENA, LIMITS, EMPTY
Extract (PDF)   NxPdfStatus        NULL, JSON, NO_TEXT, ARENA
Extract (CSV)   NxCsvStatus        NULL, PARSE, NO_DATA, ARENA, EMPTY
Merge           NxMergeStatus      NULL, JSON, SCHEMA, NO_TABLE, ARENA
Transform       NxXformStatus      NULL, SCHEMA, RAW, NO_TABLE, ARENA
Validate        NxValidateStatus   NULL, JSON, ARENA
Emit            NxEmitStatus       NULL, JSON, NO_RECORDS, NO_LATLON, ALLOC
Discover        NxDiscoverStatus   NULL, JSON, NO_TABLE, NO_ROWS, ARENA
Diff            NxDiffStatus       NULL, PARSE, ARENA

Under the Hood

Memory safety and parser hardening

Nexus processes untrusted input: user-uploaded spreadsheets, PDFs, and CSVs. Every parser is fuzz-tested, input-bounded, and arena-allocated. No user-controlled data reaches malloc without bounds checking.

Input limits

Limit           XLSX          PDF             CSV
Max rows        100,000       Unlimited       100,000
Max columns     1,000         Auto-detected   1,000
Max file size   100 MB        Schema: 1 MB    No limit
Max sheets      100           N/A             N/A
ZIP bomb guard  50 MB/entry   N/A             N/A

Fuzz testing

Three libFuzzer harnesses, all running under ASan + UBSan:

Parser                        Fuzz runs    Crashes   Coverage
XLSX (nx_xlsx_parse)          364,000+     0         ZIP, XML, shared strings, cell parsing
PDF (nx_pdf_extract_tables)   484,000+     0         Y-clustering, X-gap detection, multi-page
CSV (nx_csv_parse)            405,000+     0         RFC 4180, auto-delimiter, edge cases
Total                         1,253,000+   0

Memory model

  • Arena allocation (SHArena) for all intermediate structures — 32 MB per stage
  • Realloc-doubling for dynamic lists (shared strings, issues)
  • No global state — all context passed through parameters
  • No unsafe string functions — snprintf, strnlen, bounds-checked everywhere

Codebase

~9.4K lines of C · 192 tests · 13 source files · 3 fuzz harnesses · 0 external dependencies

Quick Start

Three ways to try Nexus

1. Browser (zero install)

The API documentation page includes a live WASM demo. Drop an XLSX, PDF, or CSV file to run the full pipeline in your browser. No server, no upload, no data leaves your machine.

2. Build from source

# Clone and build
git clone https://github.com/ottofleet/otto.git
cd otto && make nexus

# Run tests (192 tests)
make test-nexus

# Build CLI tools
make -C nexus tools

3. CLI pipeline

# Extract raw data from XLSX
./nexus/nx_pipeline depots.xlsx --raw

# Full pipeline with schema
./nexus/nx_pipeline depots.xlsx \
  --schema nexus/schemas/gls-hu-depots-v2.json

# Discover schema automatically
./nexus/nx_pipeline depots.xlsx --raw -o raw.json
./nexus/nx_pipeline raw.json --discover -o schema-draft.json

# Emit as GeoJSON
./nexus/nx_pipeline depots.xlsx \
  --schema s.json --emit geojson -o depots.geojson

# Diff against last week's data
./nexus/nx_pipeline depots.xlsx \
  --schema s.json --baseline last_week.json
FAQ

Common questions

Why not just use Pandas?

Pandas is excellent for ad-hoc analysis. For production pipelines that process untrusted user uploads, you need input bounds enforcement, fuzz-tested parsers, structured error reporting, schema-driven transforms, and change detection. Nexus gives you all of that in a zero-dependency C library that compiles to WASM. Pandas requires a Python runtime and doesn't run in a browser.

Can Nexus handle any XLSX file?

Nexus handles standard XLSX files (Office Open XML) with up to 100K rows and 1K columns. It does not support: encrypted XLSX, password-protected files, files with VBA macros (macros are ignored), or legacy .xls format (binary, pre-2007). For those, pre-convert to .xlsx or .csv.

How does PDF table extraction work?

Nexus receives pre-extracted text runs (position + text) from a PDF parser. It reconstructs tables by Y-alignment clustering (grouping text into rows by vertical position) and X-gap column detection (identifying column boundaries by horizontal gaps). Row tolerance and column gaps are auto-detected from median text height. This works well for structured tables but not for free-form layouts or scanned documents.

What about OCR / scanned PDFs?

Nexus does not include OCR. It operates on text-based PDFs where text is extractable. For scanned documents, run OCR first (Tesseract, AWS Textract, etc.) and feed the text-run JSON output to Nexus. The pipeline is composable — plug in your preferred OCR frontend.

Can I add custom compute functions?

Yes. Compute functions are registered in nx_compute.c via a simple function pointer registry. Add a function matching the NxComputeFunc signature (takes source strings, returns output strings) and register it with a name. The schema references it by name in multi_transforms.

What license?

AGPLv3 with a Trucking Exception. If you're a trucking company using Nexus for your own data processing, the exception applies. If you're embedding Nexus in a SaaS product for resale, AGPL requires source disclosure — or contact us for a commercial license.