Nexus — Document Ingestion Pipeline

Turn messy spreadsheets into clean data. In the browser.

Nexus is a six-stage document ingestion pipeline written in C with zero external dependencies. It extracts tabular data from XLSX, PDF, and CSV files, then transforms, validates, and emits structured JSON, GeoJSON, or CSV. Schema discovery is heuristic: no LLMs, no cloud APIs. The same pipeline runs natively, in WASM, or embedded in your application.

192 tests · 6-stage pipeline · zero deps · C11 · WASM · fuzz-tested · AGPLv3 + Trucking Exception
Why

Logistics runs on spreadsheets. Spreadsheets run on chaos.

Every trucking company, 3PL, and carrier has "the spreadsheet." The one emailed weekly with depot locations, customer addresses, delivery windows, or station data. Different columns every time. Merged cells. Continuation rows. Hungarian coordinates in a projection system from 1972. A PDF that's actually a scan of a table that was originally an Excel file.

The typical response: a Python script that breaks every time the format changes. Or a manual process with a human in the loop. Or a SaaS ETL tool that costs $50K/year and still can't parse Hungarian EOV coordinates or reconstruct tables from PDF text runs.

Nexus is the opposite of all three.

Schema-driven, not script-driven
Declarative JSON schemas define column mapping, transforms, and validation rules. When the spreadsheet changes, update the schema — not the code. Schemas are versioned, diffable, and machine-readable.
Heuristic discovery, not LLM guessing
Drop a file with no schema. Nexus infers types, detects lat/lon columns, identifies unique fields, spots continuation rows, and generates a schema draft. Deterministic, reproducible, zero API calls. A human reviews and refines.
Structured errors, not stack traces
Every issue has a stage, severity, row number, field name, error code, and human message. Thread the NxIssueList through all six stages. Export as JSON. No more grepping logs to find which row broke the pipeline.
Runs in the browser
The WASM demo runs the full pipeline client-side. Upload a spreadsheet, see extracted data, apply transforms, validate, export. Zero server round-trips. Sensitive data never leaves the machine.

The real problem. Document ingestion is 80% of the work in logistics data integration and 0% of the value. Nexus exists so engineers can stop writing XLSX parsers and start building route optimizers, ETAs, and fleet dashboards.

Pipeline

Six stages, each independently testable

Nexus decomposes document ingestion into six composable stages. Each stage reads JSON, produces JSON, and reports structured issues. You can run the full pipeline or invoke individual stages via CLI or C API.

  XLSX / PDF / CSV                                    JSON / GeoJSON / CSV
       │                                                      ▲
       ▼                                                      │
  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
  │ Stage A │──▶│ Stage M │──▶│ Stage B │──▶│ Stage X │──▶│ Stage D │
  │ Extract │   │  Merge  │   │Transform│   │Validate │   │  Emit   │
  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
       │              │              │              │              │
   nx_raw JSON    merged rows    nx_canonical   validated     GeoJSON
   (cells, rows)  (cont. rows   (typed fields,  (bounds,      CSV
                   stripped)     derived cols)   deduped)
                                     │
                                     ▼
                                ┌────────────┐
                                │  nx_diff   │  Change detection
                                │ (baseline) │  FNV-1a hashing
                                └────────────┘

  ┌──────────────────────────────────────────────────────────────┐
  │  NxIssueList: threaded through all stages                    │
  │  stage / severity / row / field / code / message             │
  └──────────────────────────────────────────────────────────────┘
6 pipeline stages · 192 tests · ~9.4K lines of C · 0 fuzz crashes · 0 external dependencies

Stage A — Extract

Reads XLSX (ZIP → XML → shared strings → cells), PDF (text-run JSON → Y-alignment clustering → X-gap column detection), or CSV (RFC 4180, auto-delimiter). Outputs nx_raw JSON: an array of rows with cell values and metadata.

  • XLSX — ZIP container extraction via miniz, shared string table, A1 notation → row/col
  • PDF — Y-alignment clustering within row tolerance, X-gap column detection, multi-page support
  • CSV — RFC 4180 with auto-delimiter detection (comma, semicolon, tab, pipe)

Stage M — Merge

Handles continuation rows in PDF tables where a single logical row spans multiple lines. Schema-driven: the row_merge config specifies which field identifies continuation and how values are concatenated. Also strips repeated header rows.

Stage B — Transform

Column mapping, type coercion (string/int/double/bool), derived fields (constants), and a five-type multi-transform pipeline: split (delimiter + index), merge (N fields + separator), regex (capture groups), compute (built-in functions), conditional (pattern → value). Row ID generation via slugified templates.

Stage X — Validate

Four validation rules: geo_bounds (lat/lon within bounding box), format (POSIX regex pattern match), unique (dedup by field, first kept), outlier (IQR-based detection with Q1 − 1.5×IQR / Q3 + 1.5×IQR bounds). Issues tagged with row, field, and reason.

Stage D — Emit

Output as RFC 7946 GeoJSON FeatureCollection (configurable lat/lon/id fields, precision) or RFC 4180 CSV (configurable delimiter). Both formats validated downstream by standard tools.
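
For reference, a minimal RFC 7946 FeatureCollection with one Point feature looks like the following. Note that GeoJSON orders coordinates as [lon, lat]; the property names here are illustrative, not prescribed by Nexus.

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [19.0402, 47.4979]},
      "properties": {"id": "depot-budapest-1051", "name": "Depot A"}
    }
  ]
}
```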

Change Detection (nx_diff)

Compare current output against a baseline. FNV-1a 64-bit hashing for record IDs and content. O(1) hashmap lookup per record. Output: added, removed, modified, unchanged counts with field-level diffs for modified records. Useful for incremental pipeline runs.

Features

What makes Nexus different

Schema Discovery
Feed raw extracted data to nx_discover_schema(). Infers column types (int/double/string), detects lat/lon fields (range + header matching), identifies unique/required fields, spots continuation rows. Outputs a draft JSON schema.
Zero LLMs, zero API calls
Compute Functions
Six built-in compute functions for Hungarian logistics: eov_to_wgs84, dms_to_dd, coalesce, phone_normalize, zip_to_region, opening_hours. Extensible registry via nx_compute_find().
6 built-in transforms
Multi-Transform Pipeline
Chain transforms on a single field: split a combined address, regex-extract a ZIP code, merge first + last name, compute coordinates from EOV, conditionally map values. Virtual columns appended after originals.
5 transform types, composable
PDF Table Reconstruction
Reconstructs tables from PDF text runs using Y-alignment clustering and X-gap column detection. Auto-detects row tolerance and column gaps from median text height. Handles rotation, scaling, multiple pages.
1,082 LOC, battle-tested
Structured Issue Tracking
NxIssueList is an uncapped, realloc-doubling list of structured issues. Each issue: stage, severity (info/warning/error), row, field, code, message. Thread through all stages. Export as JSON. Filter by stage or severity.
No more grepping logs
Production Hardened
Input limits enforced: 100K rows, 1K columns, 100 MB files, 50 MB ZIP entries. Fuzz-tested with libFuzzer under ASan/UBSan: 1.25M+ runs across XLSX, PDF, CSV parsers with zero crashes. Regex complexity guards. No unsafe string functions.
0 crashes in 1.25M fuzz runs

Compute function reference

Function          Input             Output            Use case
eov_to_wgs84      EOV Y, EOV X      lat, lon          Hungarian HD72/EOV projection → GPS coordinates
dms_to_dd         DMS string        decimal degrees   47°29'33"N → 47.4925
coalesce          N source fields   first non-empty   Fallback chain: city || town || village
phone_normalize   Hungarian phone   E.164 format      06-30-123-4567 → +36301234567
zip_to_region     ZIP code          region name       1052 → Budapest
opening_hours     Hungarian format  OSM format        H-P: 8-17 → Mo-Fr 08:00-17:00
Schema System

Declarative, versioned, machine-readable

Nexus schemas are JSON documents that define column mapping, type coercion, multi-transforms, validation rules, and output configuration. The schema is the single source of truth for how a file format maps to your data model.

{
  "nx_schema": 2,
  "name": "gls-hu-automata",
  "columns": [
    {"source": "Aut. neve",  "target": "name",    "type": "string"},
    {"source": "EOV Y",     "target": "eov_y",   "type": "double"},
    {"source": "EOV X",     "target": "eov_x",   "type": "double"}
  ],
  "multi_transforms": [
    {"type": "compute", "function": "eov_to_wgs84",
     "sources": ["eov_y", "eov_x"],
     "targets": ["lat", "lon"]}
  ],
  "validate": {
    "geo_bounds": {"lat": "lat", "lon": "lon",
                   "min_lat": 45.7, "max_lat": 48.6,
                   "min_lon": 16.1, "max_lon": 22.9}
  },
  "row_id": "{name}-{zip}"
}

Production schemas (GLS Hungary)

Nexus ships with 10 production schemas for GLS Hungary logistics data. These serve as reference implementations and test fixtures for the full pipeline; a representative subset:

Schema                   Entity                 Format  Features
gls-hu-automata-v2       Parcel lockers         XLSX    EOV compute, geo_bounds, regex, split
gls-hu-automata-pdf-v2   Parcel lockers         PDF     Row merge for continuation, same transforms
gls-hu-pudo-v2           Pickup/dropoff points  XLSX    Multi-transforms, phone normalize, opening hours
gls-hu-pudo-pdf-v2       Pickup/dropoff points  PDF     Row merge, same transforms as XLSX
gls-hu-depots-v2         Depot locations        XLSX    EOV compute, zip_to_region, merge transforms

Schema discovery

Don't have a schema? Feed raw extracted data to nx_discover_schema() and get a draft. The heuristics are deterministic and rule-based:

  • Type inference — Tries strtod/strtol on every cell; consensus wins
  • Lat/lon detection — Numeric columns in [-90, 90] or [-180, 180] with header name matching
  • Required fields — Non-empty in all rows → marked required
  • Uniqueness — All distinct values → candidate row ID
  • Continuation rows — Detects merge patterns, emits row_merge config

Honest assessment. Schema discovery is a starting point, not a finished product. It generates a correct schema ~70% of the time for clean XLSX files with clear headers. For messy PDFs with merged cells and multi-line values, human review is always needed. The point is to save 80% of the manual work, not eliminate it.

Comparison

Nexus vs. the alternatives

Nexus occupies a specific niche: schema-driven tabular extraction from messy documents in logistics. It's not a general ETL framework, not a BI tool, and not a data lake. Here's how it compares to what teams actually use.

                       Nexus           Python scripts    Tabula / Camelot   dbt + Pandas    Trifacta / Dataprep
XLSX parsing           Native          openpyxl          No                 Via Pandas      Yes
PDF table extraction   Native          Manual            Yes                No              Limited
CSV parsing            RFC 4180        csv module        No                 Via Pandas      Yes
Schema-driven          JSON schema     Hardcoded         No                 SQL models      Recipes
Schema discovery       Heuristic       No                No                 No              ML-based
Change detection       FNV-1a diff     Manual            No                 Snapshots       No
Structured errors      Per-row/field   Exceptions        Exceptions         Test failures   UI alerts
GeoJSON output         RFC 7946        Manual            No                 Plugin          No
Runs in browser        WASM            No                No                 No              Cloud only
Embeddable             C / WASM        Python runtime    Java runtime       Python + DB     SaaS
Dependencies           Zero            pip install ...   JRE + JAR          Python + DB     Cloud subscription
Cost                   Open source     Free              Free               Free + DB       $$$/year
Fuzz-tested            1.25M+ runs     Rarely            No                 No              Unknown

The key difference. Python scripts are flexible but fragile — they break when the spreadsheet changes. dbt/Pandas are powerful but require a Python runtime and a database. Trifacta/Dataprep are polished but SaaS-only and expensive. Nexus gives you a schema-driven, embeddable, fuzz-tested pipeline with zero dependencies that runs equally well native, in WASM, or embedded in a C application.

API

C functions, JSON in, JSON out

Each stage is a standalone function: takes JSON + optional schema, returns JSON + issues. The orchestrator nx_ingest() chains them automatically, or call stages individually for maximum control.

Orchestrator (full pipeline)

NxIssueList issues;
nx_issue_list_init(&issues);

char *raw, *canon;
size_t raw_len, canon_len;

NxIngestStatus s = nx_ingest(
    data, len,
    NX_FORMAT_XLSX, "depots.xlsx",
    schema, schema_len,
    &raw, &raw_len,
    &canon, &canon_len,
    &issues);

if (s == NX_INGEST_OK) {
    // canon = canonical JSON
    free(raw);
    free(canon);
}
nx_issue_list_free(&issues);

Individual stage

// Stage A: Extract XLSX
// ('limits' and 'issues' are assumed initialized,
// as in the orchestrator example above)
SHArena *arena = sh_arena_create(
    32 * 1024 * 1024);
char *raw_json;
size_t raw_len;

NxXlsxStatus s = nx_xlsx_parse(
    data, len,
    &limits, "input.xlsx",
    arena, &issues,
    &raw_json, &raw_len);

// Stage B: Transform
char *canon_json;
size_t canon_len;
nx_xform_apply(
    raw_json, raw_len,
    schema, schema_len,
    arena, &issues,
    &canon_json, &canon_len);

sh_arena_free(arena);

CLI

# Full pipeline: XLSX + schema → canonical JSON
./nx_pipeline input.xlsx --schema schemas/gls-hu-automata-v2.json

# Extract only (raw JSON)
./nx_pipeline input.xlsx --raw

# Emit GeoJSON
./nx_pipeline input.xlsx --schema s.json --emit geojson

# Change detection against baseline
./nx_pipeline input.xlsx --schema s.json --baseline previous.json

# Auto-discover schema from raw data
./nx_pipeline input.xlsx --raw -o raw.json
./nx_pipeline raw.json --discover

Error handling

Every stage returns a typed status enum. Every function accepts an optional NxIssueList*. Issues accumulate across stages — you get the full picture, not just the first failure.
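
For illustration, a single exported issue could look like the following. The six fields are the ones documented above; the exact JSON key spelling is an assumption, not taken from the Nexus source.

```json
{
  "stage": "validate",
  "severity": "error",
  "row": 412,
  "field": "lat",
  "code": "GEO_BOUNDS",
  "message": "lat 51.20 outside [45.7, 48.6]"
}
```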

Stage           Status enum        Error codes
Extract (XLSX)  NxXlsxStatus       NULL, ZIP, NO_SHEETS, XML, ARENA, LIMITS, EMPTY
Extract (PDF)   NxPdfStatus        NULL, JSON, NO_TEXT, ARENA
Extract (CSV)   NxCsvStatus        NULL, PARSE, NO_DATA, ARENA, EMPTY
Merge           NxMergeStatus      NULL, JSON, SCHEMA, NO_TABLE, ARENA
Transform       NxXformStatus      NULL, SCHEMA, RAW, NO_TABLE, ARENA
Validate        NxValidateStatus   NULL, JSON, ARENA
Emit            NxEmitStatus       NULL, JSON, NO_RECORDS, NO_LATLON, ALLOC
Discover        NxDiscoverStatus   NULL, JSON, NO_TABLE, NO_ROWS, ARENA
Diff            NxDiffStatus       NULL, PARSE, ARENA

Under the Hood

Memory safety and parser hardening

Nexus processes untrusted input: user-uploaded spreadsheets, PDFs, and CSVs. Every parser is fuzz-tested, input-bounded, and arena-allocated. No user-controlled data reaches malloc without bounds checking.

Input limits

Limit           XLSX          PDF             CSV
Max rows        100,000       Unlimited       100,000
Max columns     1,000         Auto-detected   1,000
Max file size   100 MB        Schema: 1 MB    No limit
Max sheets      100           N/A             N/A
ZIP bomb guard  50 MB/entry   N/A             N/A

Fuzz testing

Three libFuzzer harnesses, all running under ASan + UBSan:

Parser                        Fuzz runs    Crashes   Coverage
XLSX (nx_xlsx_parse)          364,000+     0         ZIP, XML, shared strings, cell parsing
PDF (nx_pdf_extract_tables)   484,000+     0         Y-clustering, X-gap detection, multi-page
CSV (nx_csv_parse)            405,000+     0         RFC 4180, auto-delimiter, edge cases
Total                         1,253,000+   0

Memory model

  • Arena allocation (SHArena) for all intermediate structures — 32 MB per stage
  • Realloc-doubling for dynamic lists (shared strings, issues)
  • No global state — all context passed through parameters
  • No unsafe string functions — snprintf, strnlen, bounds-checked everywhere

Codebase

~9.4K lines of C · 192 tests · 13 source files · 3 fuzz harnesses · 0 external dependencies

Quick Start

Three ways to try Nexus

1. Browser (zero install)

The API documentation page includes a live WASM demo. Drop an XLSX, PDF, or CSV file to run the full pipeline in your browser. No server, no upload, no data leaves your machine.

2. Build from source

# Clone and build
git clone https://github.com/ottofleet/otto.git
cd otto && make nexus

# Run tests (192 tests)
make test-nexus

# Build CLI tools
make -C nexus tools

3. CLI pipeline

# Extract raw data from XLSX
./nexus/nx_pipeline depots.xlsx --raw

# Full pipeline with schema
./nexus/nx_pipeline depots.xlsx \
  --schema nexus/schemas/gls-hu-depots-v2.json

# Discover schema automatically
./nexus/nx_pipeline depots.xlsx --raw -o raw.json
./nexus/nx_pipeline raw.json --discover -o schema-draft.json

# Emit as GeoJSON
./nexus/nx_pipeline depots.xlsx \
  --schema s.json --emit geojson -o depots.geojson

# Diff against last week's data
./nexus/nx_pipeline depots.xlsx \
  --schema s.json --baseline last_week.json
FAQ

Common questions

Why not just use Pandas?

Pandas is excellent for ad-hoc analysis. For production pipelines that process untrusted user uploads, you need input bounds enforcement, fuzz-tested parsers, structured error reporting, schema-driven transforms, and change detection. Nexus gives you all of that in a zero-dependency C library that compiles to WASM. Pandas requires a Python runtime and doesn't run in a browser.

Can Nexus handle any XLSX file?

Nexus handles standard XLSX files (Office Open XML) with up to 100K rows and 1K columns. It does not support: encrypted XLSX, password-protected files, files with VBA macros (macros are ignored), or legacy .xls format (binary, pre-2007). For those, pre-convert to .xlsx or .csv.

How does PDF table extraction work?

Nexus receives pre-extracted text runs (position + text) from a PDF parser. It reconstructs tables by Y-alignment clustering (grouping text into rows by vertical position) and X-gap column detection (identifying column boundaries by horizontal gaps). Row tolerance and column gaps are auto-detected from median text height. This works well for structured tables but not for free-form layouts or scanned documents.

What about OCR / scanned PDFs?

Nexus does not include OCR. It operates on text-based PDFs where text is extractable. For scanned documents, run OCR first (Tesseract, AWS Textract, etc.) and feed the text-run JSON output to Nexus. The pipeline is composable — plug in your preferred OCR frontend.

Can I add custom compute functions?

Yes. Compute functions are registered in nx_compute.c via a simple function pointer registry. Add a function matching the NxComputeFunc signature (takes source strings, returns output strings) and register it with a name. The schema references it by name in multi_transforms.

What license?

AGPLv3 with a Trucking Exception. If you're a trucking company using Nexus for your own data processing, the exception applies. If you're embedding Nexus in a SaaS product for resale, AGPL requires source disclosure — or contact us for a commercial license.