Locus Geocoding Engine

Pure C geocoding. Prefix trie autocomplete, trigram fuzzy search, spatial grid reverse geocoding. Zero-copy mmap index. Compiles to WASM.

68 tests ~6.5K LoC <20µs queries WASM-ready Zero deps

Why Locus

Most geocoding requires cloud APIs, internet connectivity, and per-request pricing. Locus runs entirely on your device—in the browser, on an edge server, or embedded in a truck’s ELD. The entire binary index loads via mmap with zero deserialization.

Three Search Algorithms

Prefix trie for instant autocomplete, trigram index for typo-tolerant fuzzy search, spatial grid for reverse geocoding. Each optimized for its task.

Zero-Copy Index

Binary v4 format (.lcx) loads via mmap—no deserialization, no parsing, no memory allocation. Shared between processes.

Full UTF-8 Pipeline

Diacritic removal, case folding, abbreviation expansion, address parsing. Handles Latin, Cyrillic, Greek, CJK. European and US address formats.

Edge-Native

Compiles to WASM for browser deployment. No cloud, no API keys, no per-request costs. Sub-20µs latency means real-time typeahead.

Search Architecture

Locus uses three complementary data structures, each specialized for a different query type. The search engine tries exact match first, falls back to prefix, then fuzzy.

  Query: "montec"                    Reverse: (43.73, 7.42)
     │                                   │
     ▼                                   ▼
  ┌─────────────────┐            ┌───────────────────┐
  │  lc_normalize()  │            │    Spatial Grid    │
  │  montec → montec │            │  0.01° cells (~1km)│
  └────────┬─────────┘            └─────────┬─────────┘
           │                                │
     ┌─────┴─────┐                    Grid Lookup
     ▼           ▼                    O(1) per cell
  ┌──────┐  ┌────────┐                     │
  │ Trie │  │ Prefix │                     ▼
  │Exact │  │ Match  │           ┌──────────────────┐
  └──┬───┘  └───┬────┘           │  K-NN + Distance  │
     │          │                │  Point-to-segment  │
  No match   "Monte Carlo"      └──────────────────┘
     │       "Monte Carlo Bay"         │
     ▼          │                      ▼
  ┌──────────┐  │               Street, Place, POI
  │ Trigrams  │ │               with admin hierarchy
  │  Fuzzy   │  │
  │ Jaccard  │  │
  └────┬─────┘  │
       ▼        ▼
    Ranked by similarity + importance
<11µs Exact Search
<19µs Autocomplete
<25µs Reverse Geocode
16 Feature Classes

Prefix Trie

37-slot alphabet (a–z, 0–9, space). Lazy child allocation. Entity ID lists only at matching nodes. Drives real-time autocomplete with O(m) lookup where m is query length.

Trigram Index

Decomposes names into 3-character n-grams. Ranks candidates by Jaccard similarity: |intersection| / |union|. Configurable threshold (default 0.3). Falls back here when exact and prefix fail—handles typos, partial names, reordered words.

Spatial Grid

Uniform grid at 0.01° resolution (~1km cells). O(1) point lookup, radius search for K-NN. Haversine-based point-to-segment distance for accurate street matching. Returns structured results: place, street, address, POI, admin hierarchy.

Text Normalization

StageExamplePurpose
Case folding“Budapest” → “budapest”UTF-8 aware lowercase
Diacritic removal“Österreich” → “osterreich”Latin-1 + Extended-A maps
Abbreviation expansion“St” → “Street”Common address abbreviations
Whitespace collapse“Foo   Bar” → “Foo Bar”Unicode space normalization
Punctuation removal“St.” → “St”Clean for index matching

Binary Index Format

The .lcx binary format (v4) is designed for zero-copy access via mmap. No deserialization, no memory allocation—the OS pages in data on demand.

Format
Magic:   0x4C4F4355 ("LOCU")
Version: 4

Header (96 bytes):
  entity_count      uint32
  string_pool_size  uint32
  trie_node_count   uint32
  ngram_count       uint32
  grid_cell_count   uint32
  geometry_count    uint32
  bounds            SHBBox
  cell_size         double

Sections:
  1. Entity records    (72B each)
  2. Alt name indices
  3. String pool       (interned)
  4. Trie nodes        (160B each)
  5. Entity ID lists
  6. Grid cells + entries
  7. N-gram entries
  8. Geometry offsets + points
Loading
/* Zero-copy mmap load */
LCIndex *idx = lc_index_mmap("region.lcx");

/* Or from memory (WASM) */
LCIndex *idx = lc_index_load_memory(
    data, len
);

/* Check format version */
if (lc_binary_version("region.lcx") < 4)
    rebuild_index();

/* Index stats */
uint32_t n = lc_index_entity_count(idx);
size_t mem = lc_index_memory_usage(idx);
SHBBox bounds = lc_index_bounds(idx);

/* Cleanup */
lc_index_free(idx);
Feature ClassExamplesImportance
LC_CLASS_COUNTRYMonaco, HungaryHighest
LC_CLASS_CITYBudapest, ViennaHigh
LC_CLASS_STREETAvenue de Monte-CarloMedium
LC_CLASS_POICasino de Monte-CarloMedium
LC_CLASS_ADDRESS12 Rue GrimaldiLow
LC_CLASS_WATERPort HerculeLow

API

Forward Search

C
#include "lc_index.h"

/* Build index from PBF */
LCIndex *idx = lc_index_create();
lc_index_build_from_pbf(idx, "monaco.osm.pbf", NULL);

/* Search with options */
LCSearchOptions opts = lc_search_options_default();
opts.limit = 10;
opts.lang  = "en";           /* prefer English names */

LCSearchResult result;
lc_search(idx, "monte carlo", &opts, &result);

for (int i = 0; i < result.count; i++) {
    const LCEntity *e = lc_search_get_entity(idx, &result.matches[i]);
    printf("%s (%.4f, %.4f) score=%.2f\n",
           e->name, e->centroid.lat, e->centroid.lon,
           result.matches[i].score);
}
lc_search_result_free(&result);

Autocomplete

C
/* Real-time typeahead */
LCSearchResult result;
lc_autocomplete(idx, "bud", 5, &result);

/* Returns: Budapest, Budaörs, Budakeszi, ... */
for (int i = 0; i < result.count; i++) {
    const LCEntity *e = lc_search_get_entity(idx, &result.matches[i]);
    printf("  %s (%s)\n", e->name, lc_class_string(e->fclass));
}
lc_search_result_free(&result);

Reverse Geocoding

C
/* Coordinate to address */
LCReverseOptions opts = lc_reverse_options_default();
opts.radius_m = 100;         /* search radius */

LCReverseResult rev;
SHCoord coord = { .lat = 43.7384, .lon = 7.4246 };
lc_reverse(idx, coord, &opts, &rev);

/* Format as address string */
char buf[256];
lc_format_address(&rev, buf, sizeof(buf));
printf("%s\n", buf);  /* "Avenue de Monte-Carlo, Monte Carlo, Monaco" */

lc_reverse_result_free(&rev);

REST API

EndpointDescription
GET /api/v1/search?q=monte+carloForward geocoding with fuzzy fallback
GET /api/v1/autocomplete?q=bud&limit=5Prefix-based typeahead
GET /api/v1/reverse?lat=43.73&lon=7.42Coordinate to structured address
GET /api/v1/healthHealth check
GET /api/v1/statsEntity count, memory, bounds, queue stats

Architecture

Address Parsing

Locus parses structured address queries in both European (“Kossuth utca 12”) and US (“123 Main Street”) formats. House number detection handles variants: ranges (“12-14”), letters (“3B”), and compound forms (“123/A”).

Language Support

ScriptCoverageFeatures
Latin (ASCII)Native trie alphabet37 slots: a-z, 0-9, space
Latin-1 Supplement64-entry diacritic mapÀ, É, Ö, ü, ñ, ç
Latin Extended-A128-entry mapĀ, ċ, Ė, Ğ, Ī
CyrillicUTF-8 awareFull case folding
GreekUTF-8 awareFull case folding

Performance Profile

OperationMonaco (2.7K entities)Hungary (1.1M entities)
Exact search5–11 µs2 µs
Autocomplete5–19 µs
Reverse geocode23–24 µs24.8 µs
Fuzzy search34.8 ms *
Index load (binary)<100 ms43.4 s *
PBF parse53.7 s *

Honest assessment. Fuzzy search and binary load at scale need optimization (roadmap items: O(n²) hit tracking, string deduplication, mmap for large indexes). Forward search and reverse geocoding already meet production targets. Memory usage (1.9GB for Hungary) needs reduction via interned string pools.

Server Architecture

Transport-agnostic handler pattern. Core logic in lc_api.c, thin wrappers for HTTP (Mongoose) and WASM. Rate limiting, work queue, CORS, structured stats—all via shared infrastructure.

Architecture
locus/
├── include/
│   ├── locus.h            # Top-level unified API
│   ├── lc_api.h           # Transport-agnostic handler
│   ├── lc_index.h         # Search/autocomplete/reverse
│   ├── lc_trie.h          # Prefix trie
│   ├── lc_ngram.h         # Trigram fuzzy search
│   ├── lc_spatial.h       # Spatial grid
│   ├── lc_normalize.h     # UTF-8 text pipeline
│   ├── lc_query.h         # Address parsing
│   ├── lc_types.h         # Entity, feature classes
│   ├── lc_serialize.h     # Save/load/mmap
│   └── lc_mmap.h          # Zero-copy access (v4)
├── src/
│   ├── lc_api.c           # REST handler (622 lines)
│   ├── lc_index.c         # Unified index (933 lines)
│   ├── lc_serialize.c     # Binary format (1,740 lines)
│   ├── lc_pbf.c           # OSM PBF parser (1,108 lines)
│   ├── lc_normalize.c     # Text normalization (603 lines)
│   ├── lc_ngram.c         # Fuzzy search (416 lines)
│   ├── lc_spatial.c       # Grid index (357 lines)
│   ├── lc_trie.c          # Prefix trie (234 lines)
│   ├── lc_types.c         # Entity/address (327 lines)
│   └── lc_query.c         # Address parsing (169 lines)
├── api/src/main.c         # Mongoose HTTP wrapper
├── wasm/                  # WASM wrapper
└── tests/test_locus.c     # 68 tests

Quick Start

Build
# Build library + tests
make locus

# Run tests
make test-locus         # 68 tests

# Build API server
make locus-api

# Start server
./locus/api/locus-geocoder data/monaco-latest.osm.pbf
# Or with pre-built index:
./locus/api/locus-geocoder data/monaco.lcx

# Build binary index for fast loading
./locus/api/locus-geocoder --build-only -S data/monaco.lcx data/monaco-latest.osm.pbf

# WASM build (requires Emscripten)
make locus-wasm
Query
# Forward search
curl "http://localhost:8083/api/v1/search?q=monte+carlo"

# Autocomplete
curl "http://localhost:8083/api/v1/autocomplete?q=bud&limit=5"

# Reverse geocode
curl "http://localhost:8083/api/v1/reverse?lat=43.7384&lon=7.4246"

# Health check
curl "http://localhost:8083/api/v1/health"

Developer FAQ

How does search ranking work?

Exact trie matches rank highest, then prefix matches (sorted by feature class importance: country > city > street > POI), then trigram fuzzy matches (by Jaccard similarity score). The lang option prefers localized names when available (e.g., name:en).

What’s the binary index format?

Version 4 .lcx files with magic 0x4C4F4355. Designed for mmap—all sections are offset-addressable with no pointer fixups needed. Entities are 72-byte fixed records, strings live in an interned pool. Use lc_index_mmap() for zero-copy access or lc_index_load_memory() for WASM.

How does reverse geocoding find the nearest street?

The spatial grid does O(1) cell lookup, then iterates entities in the cell. For streets, it computes Haversine-based point-to-segment distance using lc_point_to_linestring_distance(), not just centroid distance. This gives accurate results even for long, curved streets.

What about non-Latin scripts?

The normalization pipeline handles UTF-8 properly—multi-byte decode, case folding for Latin/Cyrillic/Greek, diacritic removal for Latin-1 and Extended-A ranges. CJK characters pass through as-is (the trie maps them to the space slot, but trigram search handles them natively). Extending to full Unicode case folding is a roadmap item.

Can I use it without OSM data?

Yes. Build an LCEntityStore manually with lc_entity_store_add(), then lc_index_build() from the store. The PBF parser is one data source—the index builder accepts any entity store. Useful for proprietary address databases.

Why three separate data structures instead of one?

Each query type has fundamentally different access patterns. A trie is optimal for prefix completion (O(m) where m is query length). A trigram index excels at fuzzy matching (set intersection). A spatial grid is optimal for geographic proximity (O(1) cell access). One data structure can’t do all three well. The total memory overhead is modest—most space is in the entity records and string pool shared by all three.