Explanation
Deep dives into the concepts and architecture of poiidx.
Overview
poiidx is designed to solve a common problem in location-based applications: efficiently finding nearby Points of Interest (POIs) from OpenStreetMap data. This section explains the design decisions, architecture, and concepts behind poiidx.
Automatic Schema Management and Data Lifecycle
poiidx implements automatic schema detection to ensure data integrity. On each initialization (init() or init_if_new()), poiidx:
- Computes a hash of the current database schema (from model definitions)
- Compares it with the stored schema hash in the database
- Checks if the filter configuration has changed
- Automatically drops and recreates all tables if any mismatch is detected
This means:
- Data in the database should be considered temporary and regeneratable
- Updating the poiidx library may trigger a full data recreation
- Changing your filter configuration will trigger a full data recreation
- You cannot add custom tables to the poiidx database reliably
- The database acts as a managed cache, not persistent storage
This design ensures that the data structure always matches the code, preventing schema migration issues and data corruption.
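The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `schema_hash` and `needs_rebuild` and the dictionary-based model description are hypothetical, not poiidx's actual internals.

```python
import hashlib
import json


def schema_hash(models: dict) -> str:
    """Return a stable SHA-256 digest of a {table: {column: type}} mapping."""
    # sort_keys makes the digest independent of dictionary ordering
    canonical = json.dumps(models, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_rebuild(stored_hash, current_hash, stored_filters, current_filters) -> bool:
    """True when either the schema or the filter configuration has changed."""
    return stored_hash != current_hash or stored_filters != current_filters


# Hypothetical model descriptions; the real library derives these from its ORM models.
models_v1 = {"poi": {"id": "serial", "name": "varchar", "rank": "integer"}}
models_v2 = {"poi": {"id": "serial", "name": "varchar", "rank": "integer", "symbol": "varchar"}}
filters = [{"symbol": "food", "filters": [{"amenity": "restaurant"}]}]

stored = schema_hash(models_v1)
print(needs_rebuild(stored, schema_hash(models_v1), filters, filters))  # False
print(needs_rebuild(stored, schema_hash(models_v2), filters, filters))  # True
```

Any change to a table, column, or filter entry produces a mismatch, which is why the database should be treated as a cache that can be rebuilt at any time.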
Architecture
System Components
┌─────────────────────────────────────────────────────────┐
│                 Your Python Application                 │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
                 ┌───────────────────────┐
                 │  poiidx.__init__      │  High-level API
                 │  (init, get_nearest   │  functions
                 │   get_admin_hierarchy)│
                 └───────────┬───────────┘
                             │
                             ▼
         ┌──────────────────────────────────────┐
         │           PoiIdx (Manager)           │
         │  ┌────────────────────────────────┐  │
         │  │ • connect/init_if_new          │  │
         │  │ • init_regions_by_shape        │  │
         │  │ • get_nearest_pois             │  │
         │  │ • get_administrative_hierarchy │  │
         │  └────────────────────────────────┘  │
         └──┬───────────┬───────────┬───────────┘
            │           │           │
            ▼           ▼           ▼
      ┌──────────┐ ┌──────────┐ ┌──────────────┐
      │  Region  │ │   PBF    │ │   Scanner    │
      │  Finder  │ │ Handler  │ │  (poi_scan,  │
      │          │ │          │ │  admin_scan) │
      └──────────┘ └────┬─────┘ └──────┬───────┘
                        │              │
                        ▼              │
               ┌─────────────────┐     │
               │    Geofabrik    │     │
               │   Downloader    │     │
               └────────┬────────┘     │
                        │              │
                        ▼              ▼
              ┌──────────────────────────────┐
              │       Local PBF Cache        │
              │      (~/.cache/poiidx/)      │
              └──────────────────────────────┘
                             │
                             │ Read PBF files
                             ▼
              ┌──────────────────────────────┐
              │        Osmium Library        │
              │      (OSM data parsing)      │
              └──────────────┬───────────────┘
                             │
                             ▼
         ┌──────────────────────────────────────┐
         │          Peewee ORM Models           │
         │  ┌────────┐ ┌────────┐ ┌──────────┐  │
         │  │  Poi   │ │ Admin  │ │ Country  │  │
         │  │        │ │Boundary│ │          │  │
         │  └────────┘ └────────┘ └──────────┘  │
         │            ┌────────────┐            │
         │            │   System   │            │
         │            └────────────┘            │
         └───────────────────┬──────────────────┘
                             │
                             ▼
         ┌──────────────────────────────────────┐
         │         PostgreSQL + PostGIS         │
         │  ┌────────────────────────────────┐  │
         │  │ Tables with Spatial Indexes    │  │
         │  │ • poi (SPGIST on coordinates)  │  │
         │  │ • administrativeboundary       │  │
         │  │   (GIST on coordinates)        │  │
         │  │ • country (GIST on geometry)   │  │
         │  │ • system (region index +       │  │
         │  │   filter config)               │  │
         │  └────────────────────────────────┘  │
         └──────────────────────────────────────┘
Data Source: OpenStreetMap via Geofabrik (downloads on-demand)
Key Components
- poiidx module (__init__.py)
- High-level API that wraps PoiIdx functionality and converts results to dictionaries for easy use.
- PoiIdx Manager
- Central orchestrator that manages database connections, schema initialization, region detection, and coordinates data loading.
- RegionFinder
- Spatial index for efficiently determining which Geofabrik regions contain a query point. Uses a serialized R-tree stored in the System table.
- Geofabrik Downloader
- Downloads regional OSM extracts (PBF files) from Geofabrik's servers on-demand when a region is first queried.
- PBF Handler
- Manages the local cache of PBF files in ~/.cache/poiidx/ and provides access for scanning.
- Scanner (poi_scan, administrative_scan)
- Uses Osmium library to parse PBF files, filter OSM data based on tags, and extract POIs and administrative boundaries.
- Peewee ORM Models
- Database models representing POIs, administrative boundaries, countries, and system configuration.
- System Table
- Stores filter configuration and a serialized region index (R-tree) for fast region lookups.
Spatial Indexing
Why Spatial Indexes Matter
Finding the nearest POI to a location is computationally expensive without proper indexing. A naive approach would calculate the distance from the query point to every POI in the database - for millions of POIs, this is impractical.
PostGIS Spatial Indexes
poiidx uses PostGIS spatial indexes (specifically SPGIST for points) to achieve efficient queries:
- SPGIST (Space-Partitioned GIST)
- Optimized for point geometries. Creates a spatial tree structure that allows the database to quickly eliminate irrelevant POIs without calculating distances.
- GIST (Generalized Search Tree)
- Used for administrative boundaries which are polygons. Supports efficient containment queries.
KNN Index Operator
The KNN (K-Nearest Neighbor) operator (<->) in PostGIS allows finding the K nearest geometries efficiently. A query that orders by this operator and limits the result to K rows runs in roughly logarithmic time O(log n) instead of linear time O(n), thanks to the spatial index.
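As a rough illustration, a KNN query against the poi table could be assembled from Python as shown here. The helper `build_knn_query` is hypothetical and the SQL is an assumption about how such a query might look; it is not taken from poiidx's source. The placeholders use the named-parameter style accepted by common PostgreSQL drivers.

```python
def build_knn_query(limit: int) -> str:
    """Build a parameterized KNN query ordered by the PostGIS <-> operator."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (
        "SELECT id, name, symbol "
        "FROM poi "
        # ST_MakePoint takes (x, y), i.e. (longitude, latitude)
        "ORDER BY coordinates <-> "
        "ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography "
        f"LIMIT {int(limit)}"
    )


params = {"lon": 13.405, "lat": 52.52}  # a point in Berlin
sql = build_knn_query(5)
print(sql)
```

The ORDER BY ... <-> clause combined with LIMIT is what lets the planner walk the spatial index instead of sorting every row by distance.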
Distance Filtering
Geography vs Geometry
poiidx uses PostGIS geography types for distance calculations:
- Geography
- Treats coordinates as points on an ellipsoid (Earth). Distance calculations account for the Earth's curvature, providing accurate results in meters.
- Geometry
- Treats coordinates as points in a flat Cartesian plane. Faster but less accurate for distance calculations.
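The difference between the two models can be seen with a small standalone calculation. The sketch below uses the haversine formula on a sphere, which is a simplification: PostGIS geography works on a spheroid, so its results differ slightly. The function names are illustrative only.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points (spherical model)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def planar_degrees(lat1, lon1, lat2, lon2):
    """Flat Cartesian 'distance' in degrees, as a geometry-style calculation would see it."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


# One degree of longitude at the equator and at 60 degrees north
at_equator = haversine_m(0.0, 0.0, 0.0, 1.0)
at_60_north = haversine_m(60.0, 0.0, 60.0, 1.0)

print(planar_degrees(0.0, 0.0, 0.0, 1.0), planar_degrees(60.0, 0.0, 60.0, 1.0))
print(round(at_equator), round(at_60_north))
```

The planar calculation reports the same value for both pairs, while the spherical one shows that the pair at 60 degrees north is only about half as far apart on the ground.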
The max_distance Parameter
When you specify max_distance, poiidx uses PostGIS's ST_DWithin function. This function:
1. Uses the spatial index to find candidates
2. Calculates exact geodesic distances
3. Returns only POIs within the threshold
The combination with KNN ordering ensures you get the nearest POIs that are also within your distance limit.
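A query combining both pieces might look like the sketch below. The helper name `build_nearest_within_query` and the exact SQL are assumptions made for illustration; only ST_DWithin, ST_Distance, and the <-> operator themselves are standard PostGIS.

```python
def build_nearest_within_query(limit: int) -> str:
    """Build a query for the nearest POIs within max_distance meters of a point."""
    point = "ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography"
    template = (
        "SELECT id, name, ST_Distance(coordinates, {p}) AS distance_m "
        "FROM poi "
        # For geography arguments, the ST_DWithin threshold is in meters
        "WHERE ST_DWithin(coordinates, {p}, %(max_distance)s) "
        "ORDER BY coordinates <-> {p} "
        "LIMIT {n}"
    )
    return template.format(p=point, n=int(limit))


params = {"lon": 13.405, "lat": 52.52, "max_distance": 500}
sql_within = build_nearest_within_query(3)
print(sql_within)
```

The WHERE clause discards everything beyond the threshold, and the ORDER BY clause ranks what remains by proximity.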
Region Management
Why Regions?
OpenStreetMap data is massive (hundreds of GB globally). poiidx uses a regional approach:
- On-demand download: Only downloads data for regions you query
- Efficient storage: Stores only the filtered POIs you care about
- Fast initialization: Subsequent queries to the same region are instant
Region Hierarchy
Regions follow the Geofabrik structure:
world
├── africa
│   ├── egypt
│   └── south-africa
├── europe
│   ├── germany
│   │   ├── berlin
│   │   └── bavaria
│   └── france
└── north-america
    └── us
        ├── california
        └── new-york
Region Initialization Process
When you query a location:
- Region Detection: Determine which region(s) contain the query point
- Download Check: If region data isn't in the database, download it
- PBF Processing: Extract POIs matching your filters from the PBF file
- Database Insert: Store POIs with spatial indexes
- Cache: Store PBF file locally for future use
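The steps above amount to a "load once, then reuse" pattern. The following sketch shows that control flow with stand-in callables; `ensure_regions_loaded`, `fake_find_regions`, and the in-memory set are hypothetical and only mimic what the region index, downloader, and database would do.

```python
from typing import Callable, Iterable, List, Set, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def ensure_regions_loaded(
    point: Point,
    find_regions: Callable[[Point], Iterable[str]],
    loaded: Set[str],
    load_region: Callable[[str], None],
) -> List[str]:
    """Load every region containing `point` that has not been loaded yet."""
    newly_loaded = []
    for region in find_regions(point):
        if region in loaded:
            continue  # data already present, nothing to download
        load_region(region)  # download, scan, and insert would happen here
        loaded.add(region)
        newly_loaded.append(region)
    return newly_loaded


# Stand-in for the real region index (hypothetical).
def fake_find_regions(point: Point) -> List[str]:
    return ["europe/germany"]


downloads: List[str] = []
loaded_regions: Set[str] = set()
berlin = (13.405, 52.52)

first = ensure_regions_loaded(berlin, fake_find_regions, loaded_regions, downloads.append)
second = ensure_regions_loaded(berlin, fake_find_regions, loaded_regions, downloads.append)
print(first)   # ['europe/germany']
print(second)  # []
```

Only the first call triggers the expensive load; the second finds the region already present and returns immediately.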
Filter Configuration
How Filters Work
Filters define which POIs to extract from OpenStreetMap data:
- symbol: food              # Category identifier
  description: Restaurants  # Human-readable description
  filters:                  # List of tag combinations
    - amenity: restaurant   # Match amenity=restaurant
    - amenity: cafe         # OR amenity=cafe
Filter Logic:
- Each item in filters is a dictionary of OSM tags
- All tags within a single dict must match (AND logic)
- Different dicts in the list are alternatives (OR logic)
Example - Simple OR filter:
Example - Complex AND filter:
filters:
  - public_transport: station  # Match BOTH of these
    train: 'yes'
  - railway: station           # OR match this
When processing OSM data, poiidx:
1. Scans all POI nodes and ways
2. Checks if their tags match any filter combination
3. Extracts matching features
4. Stores them with the specified symbol
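The AND/OR filter logic can be expressed compactly in Python. This is a minimal sketch that assumes exact string equality on tag values; the function name `matches` is illustrative, and the library's real matcher may behave differently (for example, with wildcards).

```python
from typing import Dict, List


def matches(tags: Dict[str, str], filters: List[Dict[str, str]]) -> bool:
    """True if the tags satisfy at least one filter combination.

    Within one combination every key/value pair must match (AND);
    the combinations themselves are alternatives (OR).
    """
    return any(
        all(tags.get(key) == value for key, value in combo.items())
        for combo in filters
    )


station_filters = [
    {"public_transport": "station", "train": "yes"},  # both tags required
    {"railway": "station"},                           # or this single tag
]

print(matches({"public_transport": "station", "train": "yes"}, station_filters))  # True
print(matches({"public_transport": "station"}, station_filters))                  # False
print(matches({"railway": "station", "name": "Hbf"}, station_filters))            # True
```

Note that in this sketch an empty combination would match every feature, so a real implementation would probably want to reject empty filter entries.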
Rank System
Ranks are calculated automatically based on:
- POI size: Larger areas get lower ranks (higher priority)
- Place tag: POIs with place=city rank higher than place=village
- Default rank: POIs without size/place info get maximum rank
The rank calculation ensures that important, larger POIs are prioritized in queries.
You can filter queries by rank range to focus on important POIs using the rank_range parameter.
Administrative Boundaries
OSM Admin Levels
OpenStreetMap uses standardized admin levels:
- Level 2: Countries
- Level 4: States/Provinces
- Level 6: Counties/Regions
- Level 8: Cities/Municipalities
- Level 9: City districts
- Level 10: Suburbs/Neighborhoods
Different countries may use levels slightly differently, but the general hierarchy is consistent.
Localization
Administrative boundaries in OSM often include localized names, stored in tags such as name:en or name:de alongside the default name tag.
poiidx's get_administrative_hierarchy_string() uses these tags to provide localized names when available.
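A name lookup with language fallback could work roughly like this. The helper `localized_name` is a hypothetical illustration of the idea, relying only on the OSM convention of name:<language> tags; it does not show how get_administrative_hierarchy_string() is actually implemented.

```python
from typing import Dict, List, Optional


def localized_name(tags: Dict[str, str], languages: List[str]) -> Optional[str]:
    """Return the first available name:<lang> value, falling back to the plain name tag."""
    for lang in languages:
        name = tags.get(f"name:{lang}")
        if name:
            return name
    return tags.get("name")


munich_tags = {"name": "München", "name:en": "Munich", "admin_level": "6"}

print(localized_name(munich_tags, ["en"]))        # Munich
print(localized_name(munich_tags, ["fr", "en"]))  # Munich
print(localized_name(munich_tags, ["fr"]))        # München
```

Passing the languages as an ordered list lets the caller express a preference chain before the default name is used.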
Country Resolution
When country information is not directly available in the OpenStreetMap administrative boundaries, poiidx uses Wikidata to resolve country names:
- Administrative boundaries in OSM often include a wikidata tag with the entity ID
- poiidx queries the Wikidata API to find the country (P17 property) for that entity
- Country names and localized labels are retrieved from Wikidata
- Results are cached in the database for future queries
This approach ensures comprehensive country information across different regions, even where OSM data may be incomplete.
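The lookup-and-cache flow can be sketched without any network access by working on an already fetched entity document. The sample below follows the general shape of Wikidata's entity JSON, trimmed to the fields used; the function names, the in-memory cache, and the `fake_fetch` stub are assumptions for illustration, since poiidx caches results in its database.

```python
from typing import Callable, Dict, Optional


def country_id_from_entity(entity: dict) -> Optional[str]:
    """Extract the first country (P17) item ID from a Wikidata entity document."""
    for claim in entity.get("claims", {}).get("P17", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]["id"]
    return None


_country_cache: Dict[str, Optional[str]] = {}  # stand-in for the database cache


def resolve_country(entity_id: str, fetch_entity: Callable[[str], dict]) -> Optional[str]:
    """Resolve the country of an entity, fetching it at most once."""
    if entity_id not in _country_cache:
        _country_cache[entity_id] = country_id_from_entity(fetch_entity(entity_id))
    return _country_cache[entity_id]


# Trimmed sample: Q64 (Berlin) with country Q183 (Germany).
SAMPLE = {
    "id": "Q64",
    "claims": {
        "P17": [
            {"mainsnak": {"snaktype": "value", "property": "P17",
                          "datavalue": {"type": "wikibase-entityid",
                                        "value": {"entity-type": "item", "id": "Q183"}}}}
        ]
    },
}

fetch_calls = []


def fake_fetch(entity_id: str) -> dict:
    fetch_calls.append(entity_id)  # record how often the "API" is hit
    return SAMPLE


print(resolve_country("Q64", fake_fetch))  # Q183
print(resolve_country("Q64", fake_fetch))  # Q183, served from the cache
print(len(fetch_calls))                    # 1
```

Injecting the fetch function keeps the parsing logic testable and makes it easy to swap in a real HTTP client later.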
Database Schema
POI Table
CREATE TABLE poi (
    id SERIAL PRIMARY KEY,
    osm_id VARCHAR(255),
    name VARCHAR(255),
    region VARCHAR(255),
    coordinates GEOGRAPHY(POINT, 4326),  -- WGS84
    filter_item VARCHAR(255),
    filter_expression VARCHAR(255),
    rank INTEGER,
    symbol VARCHAR(255)
);
CREATE INDEX poi_coordinates_idx
    ON poi USING SPGIST(coordinates);
Why SPGIST?
SPGIST is ideal for point data because:
- Faster for points: Outperforms GIST for point-only datasets
- Better KNN: Optimized for nearest neighbor queries
- Space efficient: Smaller index size than GIST
Administrative Boundary Table
CREATE TABLE administrativeboundary (
    id SERIAL PRIMARY KEY,
    osm_id VARCHAR(255),
    name VARCHAR(255),
    admin_level INTEGER,
    geometry GEOGRAPHY(GEOMETRY, 4326),
    tags JSONB
);
CREATE INDEX admin_geometry_idx
    ON administrativeboundary USING GIST(geometry);
GIST is used here because boundaries are polygons, not points.
Performance Considerations
Query Performance
Typical query times (these depend on database size and hardware):
- Nearest POI query: 1-10ms
- Admin hierarchy query: 5-20ms
- Region initialization (first time): Seconds to minutes
Optimization Strategies
- Use appropriate max_distance
- Smaller distances = faster queries. Don't query the entire world if you only need nearby POIs.
- Filter by region
- If you know the region, specify it to avoid unnecessary spatial filtering.
- Use rank filtering
- Limiting to high-priority POIs reduces the search space.
- Batch queries
- If querying multiple points, keep the connection open rather than reinitializing.
Index Maintenance
PostGIS indexes are maintained automatically, but you can keep them efficient with standard PostgreSQL maintenance commands such as VACUUM ANALYZE and REINDEX.
Data Flow
Initial Setup
User calls poiidx.init()
↓
Creates database tables
↓
Downloads region index from Geofabrik
↓
Stores filter configuration
First Query to a Region
User calls get_nearest_pois(berlin_point)
↓
Detect region (europe/germany)
↓
Check if region data exists (No)
↓
Download PBF file for germany
↓
Scan PBF for matching POIs
↓
Insert POIs into database
↓
Create spatial indexes
↓
Execute query
↓
Return results
Subsequent Queries
User calls get_nearest_pois(another_berlin_point)
↓
Detect region (europe/germany)
↓
Check if region data exists (Yes)
↓
Execute query (fast!)
↓
Return results
Design Decisions
Why PostgreSQL + PostGIS?
- PostgreSQL
- Industry-standard open-source database with excellent spatial support.
- PostGIS
- The most mature and feature-rich spatial database extension. It provides accurate geodesic calculations, sophisticated spatial indexes, a rich set of spatial functions, and active development and community.
Why Peewee ORM?
Peewee is a lightweight Python ORM that:
- Has excellent PostgreSQL support
- Supports spatial extensions
- Requires minimal boilerplate
- Performs well
- Is easy to use
Why Geofabrik?
Geofabrik provides:
- Regional extracts of OSM data
- Regular updates (daily/weekly)
- Reliable hosting
- Free for non-commercial use
- Standardized naming scheme
Why On-Demand Loading?
Pre-loading all OSM data globally would require:
- Hundreds of GB of disk space
- Hours of processing time
- Constant updates
On-demand loading provides:
- Fast startup
- Minimal storage
- Only download what you need
Limitations
Coverage
- Limited to regions available on Geofabrik
- Data freshness depends on Geofabrik update frequency
- Some regions may have incomplete OSM data
Accuracy
- Distance calculations assume WGS84 ellipsoid (accurate to ~1m)
- POI locations depend on OSM data quality
- Administrative boundaries may have political disputes
Performance
- First query to a new region requires download and processing
- Very large regions (e.g., entire continents) may be slow
- Database size grows with number of regions
Concurrency
- Database connections are not thread-safe by default
- Use connection pooling for multi-threaded applications
Future Directions
Potential enhancements:
- Real-time updates: Sync with OSM change feeds
- Custom data sources: Import POIs from other sources
- Routing integration: Calculate distances along roads
- Clustering: Group nearby POIs for visualization
- Caching layer: Redis/Memcached for frequently accessed queries
Related Technologies
- OpenStreetMap: Source of geographic data
- Geofabrik: OSM data distribution
- Wikidata: Knowledge base for country information when not available in OSM
- PostGIS: Spatial database extension
- Shapely: Python library for geometric objects
- Osmium: Fast OSM data processing
- Pyproj: Coordinate transformations