Turn messy spreadsheets into clean data. In the browser.
Nexus is a six-stage document ingestion pipeline written in C with zero external dependencies. Extracts tabular data from XLSX, PDF, and CSV files. Transforms, validates, and emits structured JSON, GeoJSON, or CSV. Heuristic schema discovery — no LLMs, no cloud APIs. Same pipeline runs native, in WASM, or embedded in your application.
Logistics runs on spreadsheets. Spreadsheets run on chaos.
Every trucking company, 3PL, and carrier has "the spreadsheet." The one emailed weekly with depot locations, customer addresses, delivery windows, or station data. Different columns every time. Merged cells. Continuation rows. Hungarian coordinates in a projection system from 1972. A PDF that's actually a scan of a table that was originally an Excel file.
The typical response: a Python script that breaks every time the format changes. Or a manual process with a human in the loop. Or a SaaS ETL tool that costs $50K/year and still can't parse Hungarian EOV coordinates or reconstruct tables from PDF text runs.
Nexus is the opposite of all three.
NxIssueList threads through all six stages. Export the issues as JSON. No more
grepping logs to find which row broke the pipeline.
The real problem. Document ingestion is 80% of the work in logistics data integration and 0% of the value. Nexus exists so engineers can stop writing XLSX parsers and start building route optimizers, ETAs, and fleet dashboards.
Six stages, each independently testable
Nexus decomposes document ingestion into six composable stages. Each stage reads JSON, produces JSON, and reports structured issues. You can run the full pipeline or invoke individual stages via CLI or C API.
XLSX / PDF / CSV JSON / GeoJSON / CSV
│ ▲
▼ │
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Stage A │──▶│ Stage M │──▶│ Stage B │──▶│ Stage X │──▶│ Stage D │
│ Extract │ │ Merge │ │Transform│ │Validate │ │ Emit │
└─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘
│ │ │ │ │
nx_raw JSON merged rows nx_canonical validated GeoJSON
(cells, rows) (cont. rows (typed fields, (bounds, CSV
stripped) derived cols) deduped)
│
▼
┌────────────┐
│  nx_diff   │ Change detection
│ (baseline) │ FNV-1a hashing
└────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ NxIssueList threaded through all stages │
│ stage / severity / row / field / code / message │
└──────────────────────────────────────────────────────────────────┘
Stage A — Extract
Reads XLSX (ZIP → XML → shared strings → cells), PDF (text-run JSON →
Y-alignment clustering → X-gap column detection), or CSV (RFC 4180, auto-delimiter).
Outputs nx_raw JSON: an array of rows with cell values and metadata.
- XLSX — ZIP container extraction via miniz, shared string table, A1 notation → row/col
- PDF — Y-alignment clustering within row tolerance, X-gap column detection, multi-page support
- CSV — RFC 4180 with auto-delimiter detection (comma, semicolon, tab, pipe)
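The auto-delimiter step can be sketched in a few lines. This is an assumed approach (count each candidate delimiter on the first line, ignoring characters inside quoted fields, and pick the most frequent), not the actual nx_csv_parse internals:

```c
#include <stddef.h>

/* Hypothetical auto-delimiter detection sketch: count candidate
 * delimiters outside quoted fields on the first line; the winner
 * is the delimiter. Defaults to comma when nothing matches. */
static char guess_delimiter(const char *first_line) {
    const char candidates[] = {',', ';', '\t', '|'};
    size_t counts[4] = {0};
    int in_quotes = 0;
    for (const char *p = first_line; *p && *p != '\n'; p++) {
        if (*p == '"') { in_quotes = !in_quotes; continue; }
        if (in_quotes) continue;           /* delimiters in quotes don't count */
        for (size_t i = 0; i < 4; i++)
            if (*p == candidates[i]) counts[i]++;
    }
    size_t best = 0;
    for (size_t i = 1; i < 4; i++)
        if (counts[i] > counts[best]) best = i;
    return counts[best] ? candidates[best] : ',';
}
```

The quote tracking matters for European exports, where semicolon-delimited files routinely contain commas inside quoted address fields.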
Stage M — Merge
Handles continuation rows in PDF tables where a single logical row spans multiple lines.
Schema-driven: the row_merge config specifies which field identifies continuation
and how values are concatenated. Also strips repeated header rows.
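A minimal sketch of the continuation-row idea, assuming an empty identifying field marks a continuation and values are appended with a space separator; the real behavior is driven by the row_merge config, and the struct here is illustrative:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative row shape; the real pipeline works on nx_raw JSON. */
typedef struct { char id[32]; char addr[128]; } Row;

/* Merges continuation rows in place; returns the new row count.
 * A row with an empty id continues the previous logical row. */
static size_t merge_continuations(Row *rows, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && rows[i].id[0] == '\0') {   /* continuation row */
            Row *prev = &rows[out - 1];
            size_t len = strlen(prev->addr);
            snprintf(prev->addr + len, sizeof prev->addr - len,
                     " %s", rows[i].addr);      /* concatenate value */
        } else {
            rows[out++] = rows[i];              /* keep logical row */
        }
    }
    return out;
}
```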
Stage B — Transform
Column mapping, type coercion (string/int/double/bool), derived fields (constants), and a five-type multi-transform pipeline: split (delimiter + index), merge (N fields + separator), regex (capture groups), compute (built-in functions), conditional (pattern → value). Row ID generation via slugified templates.
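As an illustration of the row-ID step, here is a hypothetical slugifier (lowercase ASCII letters and digits kept, any other run of characters collapsed to a single '-'); the exact slug rules Nexus applies to templates like "{name}-{zip}" may differ:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical slugify sketch: lowercase alphanumerics, collapse
 * everything else into single dashes, no leading/trailing dash. */
static void slugify(const char *in, char *out, size_t cap) {
    size_t j = 0;
    int pending_dash = 0;
    for (const char *p = in; *p && j + 1 < cap; p++) {
        unsigned char c = (unsigned char)*p;
        if (isalnum(c)) {
            if (pending_dash && j > 0 && j + 2 < cap) out[j++] = '-';
            out[j++] = (char)tolower(c);
            pending_dash = 0;
        } else {
            pending_dash = 1;   /* defer the dash until next alnum */
        }
    }
    out[j] = '\0';
}
```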
Stage X — Validate
Four validation rules: geo_bounds (lat/lon within bounding box), format (POSIX regex pattern match), unique (dedup by field, first kept), outlier (IQR-based detection with Q1 − 1.5×IQR / Q3 + 1.5×IQR bounds). Issues tagged with row, field, and reason.
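The outlier rule's arithmetic can be sketched as follows, using nearest-rank quartiles (the quartile interpolation method is an assumption; the real validate stage may differ):

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* IQR outlier bounds: sort, take Q1/Q3 by nearest rank, then
 * [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as described above. */
static void iqr_bounds(double *v, size_t n, double *lo, double *hi) {
    qsort(v, n, sizeof *v, cmp_double);
    double q1 = v[n / 4];           /* nearest-rank quartiles */
    double q3 = v[(3 * n) / 4];
    double iqr = q3 - q1;
    *lo = q1 - 1.5 * iqr;
    *hi = q3 + 1.5 * iqr;
}
```

Any value outside [lo, hi] would be tagged as an outlier issue with its row and field.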
Stage D — Emit
Output as RFC 7946 GeoJSON FeatureCollection (configurable lat/lon/id fields, precision) or RFC 4180 CSV (configurable delimiter). Both formats validated downstream by standard tools.
Change Detection (nx_diff)
Compare current output against a baseline. FNV-1a 64-bit hashing for record IDs and content. O(1) hashmap lookup per record. Output: added, removed, modified, unchanged counts with field-level diffs for modified records. Useful for incremental pipeline runs.
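FNV-1a 64-bit is a public-domain hash with well-known constants; the core loop is a few lines, which is part of why it suits a zero-dependency codebase:

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a 64-bit: XOR each byte into the state, then multiply by
 * the FNV prime. Standard offset basis and prime values. */
static uint64_t fnv1a64(const void *data, size_t len) {
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}
```

Hashing both the record ID and the record content lets the diff stage classify a record as modified (same ID hash, different content hash) versus added or removed in a single hashmap lookup.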
What makes Nexus different
Heuristic schema discovery. nx_discover_schema() infers column types
(int/double/string), detects lat/lon fields (range + header matching), identifies
unique/required fields, and spots continuation rows. It outputs a draft JSON schema.
Domain-aware compute functions. Built-ins include eov_to_wgs84, dms_to_dd, coalesce,
phone_normalize, zip_to_region, and opening_hours, with an extensible registry
via nx_compute_find().
Structured issue reporting. NxIssueList is an uncapped, realloc-doubling list of structured issues.
Each issue carries stage, severity (info/warning/error), row, field, code, and message.
It threads through all stages, exports as JSON, and filters by stage or severity.
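A minimal sketch of such a realloc-doubling list — the field names here are illustrative, not the real NxIssue layout:

```c
#include <stdlib.h>

/* Illustrative issue record; the real NxIssue also carries a
 * field name and message string. */
typedef struct { int stage, severity, row; const char *code; } Issue;
typedef struct { Issue *items; size_t len, cap; } IssueList;

/* Append with doubling growth: amortized O(1) pushes, no cap. */
static int issue_push(IssueList *l, Issue is) {
    if (l->len == l->cap) {
        size_t ncap = l->cap ? l->cap * 2 : 16;
        Issue *p = realloc(l->items, ncap * sizeof *p);
        if (!p) return -1;        /* on failure the old list survives */
        l->items = p;
        l->cap = ncap;
    }
    l->items[l->len++] = is;
    return 0;
}
```

Assigning the realloc result to a temporary before overwriting the list pointer is what keeps the old data reachable if allocation fails.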
Compute function reference
| Function | Input | Output | Use Case |
|---|---|---|---|
| eov_to_wgs84 | EOV Y, EOV X | lat, lon | Hungarian HD72/EOV projection → GPS coordinates |
| dms_to_dd | DMS string | decimal degrees | 47°29'33"N → 47.4925 |
| coalesce | N source fields | first non-empty | Fallback chain: city || town || village |
| phone_normalize | Hungarian phone | E.164 format | 06-30-123-4567 → +36301234567 |
| zip_to_region | ZIP code | region name | 1052 → Budapest |
| opening_hours | Hungarian format | OSM format | H-P: 8-17 → Mo-Fr 08:00-17:00 |
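The dms_to_dd conversion, for example, reduces to dd = deg + min/60 + sec/3600, negated for the S/W hemispheres. A standalone sketch of just that arithmetic (the real compute function also parses the DMS string form):

```c
/* DMS → decimal degrees: 47°29'33"N → 47 + 29/60 + 33/3600 = 47.4925.
 * Hemisphere S or W flips the sign. */
static double dms_to_dd(int deg, int min, double sec, char hemi) {
    double dd = deg + min / 60.0 + sec / 3600.0;
    return (hemi == 'S' || hemi == 'W') ? -dd : dd;
}
```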
Declarative, versioned, machine-readable
Nexus schemas are JSON documents that define column mapping, type coercion, multi-transforms, validation rules, and output configuration. The schema is the single source of truth for how a file format maps to your data model.
{
"nx_schema": 2,
"name": "gls-hu-automata",
"columns": [
{"source": "Aut. neve", "target": "name", "type": "string"},
{"source": "EOV Y", "target": "eov_y", "type": "double"},
{"source": "EOV X", "target": "eov_x", "type": "double"}
],
"multi_transforms": [
{"type": "compute", "function": "eov_to_wgs84",
"sources": ["eov_y", "eov_x"],
"targets": ["lat", "lon"]}
],
"validate": {
"geo_bounds": {"lat": "lat", "lon": "lon",
"min_lat": 45.7, "max_lat": 48.6,
"min_lon": 16.1, "max_lon": 22.9}
},
"row_id": "{name}-{zip}"
}
Production schemas (GLS Hungary)
Nexus ships with 10 production schemas for GLS Hungary logistics data; the five below are representative. These serve as reference implementations and test fixtures for the full pipeline.
| Schema | Entity | Format | Features |
|---|---|---|---|
| gls-hu-automata-v2 | Parcel lockers | XLSX | EOV compute, geo_bounds, regex, split |
| gls-hu-automata-pdf-v2 | Parcel lockers | PDF | Row merge for continuation, same transforms |
| gls-hu-pudo-v2 | Pickup/dropoff points | XLSX | Multi-transforms, phone normalize, opening hours |
| gls-hu-pudo-pdf-v2 | Pickup/dropoff points | PDF | Row merge, same transforms as XLSX |
| gls-hu-depots-v2 | Depot locations | XLSX | EOV compute, zip_to_region, merge transforms |
Schema discovery
Don't have a schema? Feed raw extracted data to nx_discover_schema() and get
a draft. The heuristics are deterministic and rule-based:
- Type inference — tries strtol/strtod on every cell; consensus wins
- Lat/lon detection — numeric columns in [-90, 90] or [-180, 180] with header name matching
- Required fields — non-empty in all rows → marked required
- Uniqueness — all distinct values → candidate row ID
- Continuation rows — detects merge patterns, emits a row_merge config
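The type-inference heuristic can be sketched as a "widest type wins" consensus over cells — an assumed formulation that may differ in detail from nx_discover_schema():

```c
#include <stdlib.h>
#include <stddef.h>

/* Narrow to wide: every int parses as a double, every double
 * prints as a string, so the column takes the widest type seen. */
typedef enum { T_INT, T_DOUBLE, T_STRING } CellType;

static CellType classify_cell(const char *s) {
    char *end;
    (void)strtol(s, &end, 10);
    if (end != s && *end == '\0') return T_INT;     /* whole string is an int */
    (void)strtod(s, &end);
    if (end != s && *end == '\0') return T_DOUBLE;  /* whole string is a double */
    return T_STRING;
}

static CellType infer_column(const char **cells, size_t n) {
    CellType t = T_INT;
    for (size_t i = 0; i < n; i++) {
        if (cells[i][0] == '\0') continue;   /* empty cells don't vote */
        CellType c = classify_cell(cells[i]);
        if (c > t) t = c;                    /* widen when a cell disagrees */
    }
    return t;
}
```

Requiring `*end == '\0'` after the strto* call is what rejects values like "12a" that merely start with a number.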
Honest assessment. Schema discovery is a starting point, not a finished product. It generates a correct schema ~70% of the time for clean XLSX files with clear headers. For messy PDFs with merged cells and multi-line values, human review is always needed. The point is to save 80% of the manual work, not eliminate it.
Nexus vs. the alternatives
Nexus occupies a specific niche: schema-driven tabular extraction from messy documents in logistics. It's not a general ETL framework, not a BI tool, and not a data lake. Here's how it compares to what teams actually use.
| Capability | Nexus | Python scripts | Tabula / Camelot | dbt + Pandas | Trifacta / Dataprep |
|---|---|---|---|---|---|
| XLSX parsing | Native | openpyxl | N | Via Pandas | Y |
| PDF table extraction | Native | Manual | Y | N | Limited |
| CSV parsing | RFC 4180 | csv module | N | Via Pandas | Y |
| Schema-driven | JSON schema | Hardcoded | N | SQL models | Recipes |
| Schema discovery | Heuristic | N | N | N | ML-based |
| Change detection | FNV-1a diff | Manual | N | Snapshots | N |
| Structured errors | Per-row/field | Exceptions | Exceptions | Test failures | UI alerts |
| GeoJSON output | RFC 7946 | Manual | N | Plugin | N |
| Runs in browser | WASM | N | N | N | Cloud only |
| Embeddable | C / WASM | Python runtime | Java runtime | Python + DB | SaaS |
| Dependencies | Zero | pip install ... | JRE + JAR | Python + DB | Cloud subscription |
| Cost | Open source | Free | Free | Free + DB | $$$ /year |
| Fuzz-tested | 1.25M+ runs | Rarely | N | N | Unknown |
The key difference. Python scripts are flexible but fragile — they break when the spreadsheet changes. dbt/Pandas are powerful but require a Python runtime and a database. Trifacta/Dataprep are polished but SaaS-only and expensive. Nexus gives you a schema-driven, embeddable, fuzz-tested pipeline with zero dependencies that runs equally well native, in WASM, or embedded in a C application.
C functions, JSON in, JSON out
Each stage is a standalone function: takes JSON + optional schema, returns JSON + issues.
The orchestrator nx_ingest() chains them automatically, or call stages individually
for maximum control.
Orchestrator (full pipeline)
NxIssueList issues;
nx_issue_list_init(&issues);

char *raw, *canon;
size_t raw_len, canon_len;
NxIngestStatus s = nx_ingest(
    data, len, NX_FORMAT_XLSX, "depots.xlsx",
    schema, schema_len,
    &raw, &raw_len, &canon, &canon_len, &issues);

if (s == NX_INGEST_OK) {
    // canon = canonical JSON
    free(raw);
    free(canon);
}
nx_issue_list_free(&issues);
Individual stage
// Stage A: Extract XLSX
SHArena *arena = sh_arena_create(32 * 1024 * 1024);
char *raw_json;
size_t raw_len;
NxXlsxStatus s = nx_xlsx_parse(
    data, len, &limits, "input.xlsx",
    arena, &issues, &raw_json, &raw_len);

// Stage B: Transform
char *canon_json;
size_t canon_len;
nx_xform_apply(
    raw_json, raw_len, schema, schema_len,
    arena, &issues, &canon_json, &canon_len);

sh_arena_free(arena);
CLI
# Full pipeline: XLSX + schema → canonical JSON
./nx_pipeline input.xlsx --schema schemas/gls-hu-automata-v2.json

# Extract only (raw JSON)
./nx_pipeline input.xlsx --raw

# Emit GeoJSON
./nx_pipeline input.xlsx --schema s.json --emit geojson

# Change detection against baseline
./nx_pipeline input.xlsx --schema s.json --baseline previous.json

# Auto-discover schema from raw data
./nx_pipeline input.xlsx --raw -o raw.json
./nx_pipeline raw.json --discover
Error handling
Every stage returns a typed status enum. Every function accepts an optional NxIssueList*.
Issues accumulate across stages — you get the full picture, not just the first failure.
| Stage | Status Enum | Error Codes |
|---|---|---|
| Extract (XLSX) | NxXlsxStatus | NULL, ZIP, NO_SHEETS, XML, ARENA, LIMITS, EMPTY |
| Extract (PDF) | NxPdfStatus | NULL, JSON, NO_TEXT, ARENA |
| Extract (CSV) | NxCsvStatus | NULL, PARSE, NO_DATA, ARENA, EMPTY |
| Merge | NxMergeStatus | NULL, JSON, SCHEMA, NO_TABLE, ARENA |
| Transform | NxXformStatus | NULL, SCHEMA, RAW, NO_TABLE, ARENA |
| Validate | NxValidateStatus | NULL, JSON, ARENA |
| Emit | NxEmitStatus | NULL, JSON, NO_RECORDS, NO_LATLON, ALLOC |
| Discover | NxDiscoverStatus | NULL, JSON, NO_TABLE, NO_ROWS, ARENA |
| Diff | NxDiffStatus | NULL, PARSE, ARENA |
Memory safety and parser hardening
Nexus processes untrusted input: user-uploaded spreadsheets, PDFs, and CSVs. Every parser
is fuzz-tested, input-bounded, and arena-allocated. No user-controlled data reaches
malloc without bounds checking.
Input limits
| Limit | XLSX | PDF | CSV |
|---|---|---|---|
| Max rows | 100,000 | Unlimited | 100,000 |
| Max columns | 1,000 | Auto-detected | 1,000 |
| Max file size | 100 MB | Schema: 1 MB | No limit |
| Max sheets | 100 | N/A | N/A |
| ZIP bomb guard | 50 MB/entry | N/A | N/A |
Fuzz testing
Three libFuzzer harnesses, all running under ASan + UBSan:
| Parser | Fuzz Runs | Crashes | Coverage |
|---|---|---|---|
| XLSX (nx_xlsx_parse) | 364,000+ | 0 | ZIP, XML, shared strings, cell parsing |
| PDF (nx_pdf_extract_tables) | 484,000+ | 0 | Y-clustering, X-gap detection, multi-page |
| CSV (nx_csv_parse) | 405,000+ | 0 | RFC 4180, auto-delimiter, edge cases |
| Total | 1,253,000+ | 0 | |
Memory model
- Arena allocation (SHArena) for all intermediate structures — 32 MB per stage
- Realloc-doubling for dynamic lists (shared strings, issues)
- No global state — all context passed through parameters
- No unsafe string functions — snprintf, strnlen, bounds-checked everywhere
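A bump-pointer arena in the spirit described above — a generic sketch, not SHArena's actual layout:

```c
#include <stdlib.h>
#include <stddef.h>

/* One malloc up front, aligned bump allocation, one free at the
 * end: intermediate structures never hit malloc individually. */
typedef struct { unsigned char *base; size_t cap, used; } Arena;

static int arena_init(Arena *a, size_t cap) {
    a->base = malloc(cap);
    a->cap = cap;
    a->used = 0;
    return a->base ? 0 : -1;
}

static void *arena_alloc(Arena *a, size_t n) {
    size_t aligned = (a->used + 15) & ~(size_t)15;   /* 16-byte align */
    if (aligned > a->cap || n > a->cap - aligned)
        return NULL;                                 /* bounds check */
    a->used = aligned + n;
    return a->base + aligned;
}

static void arena_free(Arena *a) { free(a->base); a->base = NULL; }
```

The bounds check is written as a subtraction against the remaining capacity rather than `aligned + n > cap`, so an oversized `n` cannot overflow the comparison.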
Three ways to try Nexus
1. Browser (zero install)
The API documentation page includes a live WASM demo. Drop an XLSX, PDF, or CSV file to run the full pipeline in your browser. No server, no upload, no data leaves your machine.
2. Build from source
# Clone and build
git clone https://github.com/ottofleet/otto.git
cd otto && make nexus

# Run tests (192 tests)
make test-nexus

# Build CLI tools
make -C nexus tools
3. CLI pipeline
# Extract raw data from XLSX
./nexus/nx_pipeline depots.xlsx --raw

# Full pipeline with schema
./nexus/nx_pipeline depots.xlsx \
    --schema nexus/schemas/gls-hu-depots-v2.json

# Discover schema automatically
./nexus/nx_pipeline depots.xlsx --raw -o raw.json
./nexus/nx_pipeline raw.json --discover -o schema-draft.json

# Emit as GeoJSON
./nexus/nx_pipeline depots.xlsx \
    --schema s.json --emit geojson -o depots.geojson

# Diff against last week's data
./nexus/nx_pipeline depots.xlsx \
    --schema s.json --baseline last_week.json
Common questions
Why not just use Pandas?
Pandas is excellent for ad-hoc analysis. For production pipelines that process untrusted user uploads, you need input bounds enforcement, fuzz-tested parsers, structured error reporting, schema-driven transforms, and change detection. Nexus gives you all of that in a zero-dependency C library that compiles to WASM. Pandas requires a Python runtime and doesn't run in a browser.
Can Nexus handle any XLSX file?
Nexus handles standard XLSX files (Office Open XML) with up to 100K rows and 1K columns. It does not support: encrypted XLSX, password-protected files, files with VBA macros (macros are ignored), or legacy .xls format (binary, pre-2007). For those, pre-convert to .xlsx or .csv.
How does PDF table extraction work?
Nexus receives pre-extracted text runs (position + text) from a PDF parser. It reconstructs tables by Y-alignment clustering (grouping text into rows by vertical position) and X-gap column detection (identifying column boundaries by horizontal gaps). Row tolerance and column gaps are auto-detected from median text height. This works well for structured tables but not for free-form layouts or scanned documents.
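The Y-alignment step can be sketched as sort-then-split-on-gap; tolerance derivation from median text height and multi-page handling are omitted here:

```c
#include <stdlib.h>
#include <stddef.h>

/* Illustrative text run: position plus text, as produced by a
 * PDF text extractor. */
typedef struct { double x, y; const char *text; } Run;

static int cmp_run_y(const void *a, const void *b) {
    double d = ((const Run *)a)->y - ((const Run *)b)->y;
    return (d > 0) - (d < 0);
}

/* Sort runs by Y, then start a new row whenever the vertical gap
 * to the previous run exceeds the tolerance. Writes a row index
 * per run into row_of and returns the number of rows. */
static size_t cluster_rows(Run *runs, size_t n, double tol, int *row_of) {
    qsort(runs, n, sizeof *runs, cmp_run_y);
    size_t rows = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && runs[i].y - runs[i - 1].y > tol) rows++;
        row_of[i] = (int)rows;
    }
    return rows + 1;
}
```

Column detection then works the same way along X within each row: sort by X and split where the horizontal gap exceeds the column-gap threshold.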
What about OCR / scanned PDFs?
Nexus does not include OCR. It operates on text-based PDFs where text is extractable. For scanned documents, run OCR first (Tesseract, AWS Textract, etc.) and feed the text-run JSON output to Nexus. The pipeline is composable — plug in your preferred OCR frontend.
Can I add custom compute functions?
Yes. Compute functions are registered in nx_compute.c via a simple function
pointer registry. Add a function matching the NxComputeFunc signature
(takes source strings, returns output strings) and register it with a name. The schema
references it by name in multi_transforms.
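A sketch of such a registry, using a simplified stand-in for the NxComputeFunc signature (the real signature lives in nx_compute.c):

```c
#include <string.h>
#include <stdio.h>
#include <stddef.h>

/* Simplified stand-in for NxComputeFunc: N source strings in,
 * one output string out, 0 on success. */
typedef int (*ComputeFn)(const char **src, size_t nsrc,
                         char *out, size_t cap);

typedef struct { const char *name; ComputeFn fn; } ComputeEntry;

/* Example entry: first non-empty source wins, like coalesce. */
static int fn_coalesce(const char **src, size_t nsrc,
                       char *out, size_t cap) {
    for (size_t i = 0; i < nsrc; i++)
        if (src[i] && src[i][0]) {
            snprintf(out, cap, "%s", src[i]);
            return 0;
        }
    out[0] = '\0';
    return -1;   /* no non-empty source found */
}

static const ComputeEntry registry[] = {
    { "coalesce", fn_coalesce },
};

/* Linear name lookup, in the spirit of nx_compute_find(). */
static ComputeFn compute_find(const char *name) {
    for (size_t i = 0; i < sizeof registry / sizeof registry[0]; i++)
        if (strcmp(registry[i].name, name) == 0)
            return registry[i].fn;
    return NULL;
}
```

Schemas then reference the function purely by its registered name, so adding a transform never requires touching the pipeline code, only the registry and the schema.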
What license?
AGPLv3 with a Trucking Exception. If you're a trucking company using Nexus for your own data processing, the exception applies. If you're embedding Nexus in a SaaS product for resale, AGPL requires source disclosure — or contact us for a commercial license.