Pure C geocoding. Prefix trie autocomplete, trigram fuzzy search, spatial grid reverse geocoding. Zero-copy mmap index. Compiles to WASM.
Most geocoding requires cloud APIs, internet connectivity, and per-request pricing. Locus runs entirely on your device—in the browser, on an edge server, or embedded in a truck’s ELD. The entire binary index loads via mmap with zero deserialization.
Prefix trie for instant autocomplete, trigram index for typo-tolerant fuzzy search, spatial grid for reverse geocoding. Each optimized for its task.
Binary v4 format (.lcx) loads via mmap—no deserialization, no parsing, no memory allocation. Shared between processes.
Diacritic removal, case folding, abbreviation expansion, address parsing. Handles Latin, Cyrillic, Greek, CJK. European and US address formats.
Compiles to WASM for browser deployment. No cloud, no API keys, no per-request costs. Sub-20µs latency means real-time typeahead.
Locus uses three complementary data structures, each specialized for a different query type. The search engine tries exact match first, falls back to prefix, then fuzzy.
Query: "montec" Reverse: (43.73, 7.42)
│ │
▼ ▼
┌─────────────────┐ ┌───────────────────┐
│ lc_normalize() │ │ Spatial Grid │
│ montec → montec │ │ 0.01° cells (~1km)│
└────────┬─────────┘ └─────────┬─────────┘
│ │
┌─────┴─────┐ Grid Lookup
▼ ▼ O(1) per cell
┌──────┐ ┌────────┐ │
│ Trie │ │ Prefix │ ▼
│Exact │ │ Match │ ┌──────────────────┐
└──┬───┘ └───┬────┘ │ K-NN + Distance │
│ │ │ Point-to-segment │
No match "Monte Carlo" └──────────────────┘
│ "Monte Carlo Bay" │
▼ │ ▼
┌──────────┐ │ Street, Place, POI
│ Trigrams │ │ with admin hierarchy
│ Fuzzy │ │
│ Jaccard │ │
└────┬─────┘ │
▼ ▼
Ranked by similarity + importance37-slot alphabet (a–z, 0–9, space). Lazy child allocation. Entity ID lists only at matching nodes. Drives real-time autocomplete with O(m) lookup where m is query length.
Decomposes names into 3-character n-grams. Ranks candidates by Jaccard similarity: |intersection| / |union|. Configurable threshold (default 0.3). Falls back here when exact and prefix fail—handles typos, partial names, reordered words.
Uniform grid at 0.01° resolution (~1km cells). O(1) point lookup, radius search for K-NN. Haversine-based point-to-segment distance for accurate street matching. Returns structured results: place, street, address, POI, admin hierarchy.
| Stage | Example | Purpose |
|---|---|---|
| Case folding | “Budapest” → “budapest” | UTF-8 aware lowercase |
| Diacritic removal | “Österreich” → “osterreich” | Latin-1 + Extended-A maps |
| Abbreviation expansion | “St” → “Street” | Common address abbreviations |
| Whitespace collapse | “Foo Bar” → “Foo Bar” | Unicode space normalization |
| Punctuation removal | “St.” → “St” | Clean for index matching |
The .lcx binary format (v4) is designed for zero-copy access via mmap. No deserialization, no memory allocation—the OS pages in data on demand.
Magic: 0x4C4F4355 ("LOCU")
Version: 4
Header (96 bytes):
entity_count uint32
string_pool_size uint32
trie_node_count uint32
ngram_count uint32
grid_cell_count uint32
geometry_count uint32
bounds SHBBox
cell_size double
Sections:
1. Entity records (72B each)
2. Alt name indices
3. String pool (interned)
4. Trie nodes (160B each)
5. Entity ID lists
6. Grid cells + entries
7. N-gram entries
8. Geometry offsets + points
/* Zero-copy mmap load */
LCIndex *idx = lc_index_mmap("region.lcx");
/* Or from memory (WASM) */
LCIndex *idx = lc_index_load_memory(
data, len
);
/* Check format version */
if (lc_binary_version("region.lcx") < 4)
rebuild_index();
/* Index stats */
uint32_t n = lc_index_entity_count(idx);
size_t mem = lc_index_memory_usage(idx);
SHBBox bounds = lc_index_bounds(idx);
/* Cleanup */
lc_index_free(idx);| Feature Class | Examples | Importance |
|---|---|---|
LC_CLASS_COUNTRY | Monaco, Hungary | Highest |
LC_CLASS_CITY | Budapest, Vienna | High |
LC_CLASS_STREET | Avenue de Monte-Carlo | Medium |
LC_CLASS_POI | Casino de Monte-Carlo | Medium |
LC_CLASS_ADDRESS | 12 Rue Grimaldi | Low |
LC_CLASS_WATER | Port Hercule | Low |
#include "lc_index.h"
/* Build index from PBF */
LCIndex *idx = lc_index_create();
lc_index_build_from_pbf(idx, "monaco.osm.pbf", NULL);
/* Search with options */
LCSearchOptions opts = lc_search_options_default();
opts.limit = 10;
opts.lang = "en"; /* prefer English names */
LCSearchResult result;
lc_search(idx, "monte carlo", &opts, &result);
for (int i = 0; i < result.count; i++) {
const LCEntity *e = lc_search_get_entity(idx, &result.matches[i]);
printf("%s (%.4f, %.4f) score=%.2f\n",
e->name, e->centroid.lat, e->centroid.lon,
result.matches[i].score);
}
lc_search_result_free(&result);
/* Real-time typeahead */
LCSearchResult result;
lc_autocomplete(idx, "bud", 5, &result);
/* Returns: Budapest, Budaörs, Budakeszi, ... */
for (int i = 0; i < result.count; i++) {
const LCEntity *e = lc_search_get_entity(idx, &result.matches[i]);
printf(" %s (%s)\n", e->name, lc_class_string(e->fclass));
}
lc_search_result_free(&result);
/* Coordinate to address */
LCReverseOptions opts = lc_reverse_options_default();
opts.radius_m = 100; /* search radius */
LCReverseResult rev;
SHCoord coord = { .lat = 43.7384, .lon = 7.4246 };
lc_reverse(idx, coord, &opts, &rev);
/* Format as address string */
char buf[256];
lc_format_address(&rev, buf, sizeof(buf));
printf("%s\n", buf); /* "Avenue de Monte-Carlo, Monte Carlo, Monaco" */
lc_reverse_result_free(&rev);| Endpoint | Description |
|---|---|
GET /api/v1/search?q=monte+carlo | Forward geocoding with fuzzy fallback |
GET /api/v1/autocomplete?q=bud&limit=5 | Prefix-based typeahead |
GET /api/v1/reverse?lat=43.73&lon=7.42 | Coordinate to structured address |
GET /api/v1/health | Health check |
GET /api/v1/stats | Entity count, memory, bounds, queue stats |
Locus parses structured address queries in both European (“Kossuth utca 12”) and US (“123 Main Street”) formats. House number detection handles variants: ranges (“12-14”), letters (“3B”), and compound forms (“123/A”).
| Script | Coverage | Features |
|---|---|---|
| Latin (ASCII) | Native trie alphabet | 37 slots: a-z, 0-9, space |
| Latin-1 Supplement | 64-entry diacritic map | À, É, Ö, ü, ñ, ç |
| Latin Extended-A | 128-entry map | Ā, ċ, Ė, Ğ, Ī |
| Cyrillic | UTF-8 aware | Full case folding |
| Greek | UTF-8 aware | Full case folding |
| Operation | Monaco (2.7K entities) | Hungary (1.1M entities) |
|---|---|---|
| Exact search | 5–11 µs | 2 µs |
| Autocomplete | 5–19 µs | — |
| Reverse geocode | 23–24 µs | 24.8 µs |
| Fuzzy search | — | 34.8 ms * |
| Index load (binary) | <100 ms | 43.4 s * |
| PBF parse | — | 53.7 s * |
Honest assessment. Fuzzy search and binary load at scale need optimization (roadmap items: O(n²) hit tracking, string deduplication, mmap for large indexes). Forward search and reverse geocoding already meet production targets. Memory usage (1.9GB for Hungary) needs reduction via interned string pools.
Transport-agnostic handler pattern. Core logic in lc_api.c, thin wrappers for HTTP (Mongoose) and WASM. Rate limiting, work queue, CORS, structured stats—all via shared infrastructure.
locus/ ├── include/ │ ├── locus.h # Top-level unified API │ ├── lc_api.h # Transport-agnostic handler │ ├── lc_index.h # Search/autocomplete/reverse │ ├── lc_trie.h # Prefix trie │ ├── lc_ngram.h # Trigram fuzzy search │ ├── lc_spatial.h # Spatial grid │ ├── lc_normalize.h # UTF-8 text pipeline │ ├── lc_query.h # Address parsing │ ├── lc_types.h # Entity, feature classes │ ├── lc_serialize.h # Save/load/mmap │ └── lc_mmap.h # Zero-copy access (v4) ├── src/ │ ├── lc_api.c # REST handler (622 lines) │ ├── lc_index.c # Unified index (933 lines) │ ├── lc_serialize.c # Binary format (1,740 lines) │ ├── lc_pbf.c # OSM PBF parser (1,108 lines) │ ├── lc_normalize.c # Text normalization (603 lines) │ ├── lc_ngram.c # Fuzzy search (416 lines) │ ├── lc_spatial.c # Grid index (357 lines) │ ├── lc_trie.c # Prefix trie (234 lines) │ ├── lc_types.c # Entity/address (327 lines) │ └── lc_query.c # Address parsing (169 lines) ├── api/src/main.c # Mongoose HTTP wrapper ├── wasm/ # WASM wrapper └── tests/test_locus.c # 68 tests
# Build library + tests make locus # Run tests make test-locus # 68 tests # Build API server make locus-api # Start server ./locus/api/locus-geocoder data/monaco-latest.osm.pbf # Or with pre-built index: ./locus/api/locus-geocoder data/monaco.lcx # Build binary index for fast loading ./locus/api/locus-geocoder --build-only -S data/monaco.lcx data/monaco-latest.osm.pbf # WASM build (requires Emscripten) make locus-wasm
# Forward search curl "http://localhost:8083/api/v1/search?q=monte+carlo" # Autocomplete curl "http://localhost:8083/api/v1/autocomplete?q=bud&limit=5" # Reverse geocode curl "http://localhost:8083/api/v1/reverse?lat=43.7384&lon=7.4246" # Health check curl "http://localhost:8083/api/v1/health"
Exact trie matches rank highest, then prefix matches (sorted by feature class importance: country > city > street > POI), then trigram fuzzy matches (by Jaccard similarity score). The lang option prefers localized names when available (e.g., name:en).
Version 4 .lcx files with magic 0x4C4F4355. Designed for mmap—all sections are offset-addressable with no pointer fixups needed. Entities are 72-byte fixed records, strings live in an interned pool. Use lc_index_mmap() for zero-copy access or lc_index_load_memory() for WASM.
The spatial grid does O(1) cell lookup, then iterates entities in the cell. For streets, it computes Haversine-based point-to-segment distance using lc_point_to_linestring_distance(), not just centroid distance. This gives accurate results even for long, curved streets.
The normalization pipeline handles UTF-8 properly—multi-byte decode, case folding for Latin/Cyrillic/Greek, diacritic removal for Latin-1 and Extended-A ranges. CJK characters pass through as-is (the trie maps them to the space slot, but trigram search handles them natively). Extending to full Unicode case folding is a roadmap item.
Yes. Build an LCEntityStore manually with lc_entity_store_add(), then lc_index_build() from the store. The PBF parser is one data source—the index builder accepts any entity store. Useful for proprietary address databases.
Each query type has fundamentally different access patterns. A trie is optimal for prefix completion (O(m) where m is query length). A trigram index excels at fuzzy matching (set intersection). A spatial grid is optimal for geographic proximity (O(1) cell access). One data structure can’t do all three well. The total memory overhead is modest—most space is in the entity records and string pool shared by all three.