Spatial searches with Zend_Search_Lucene

The situation: I have a database of businesses, each with a name, product, and physical address.  The goal: Using Zend_Search_Lucene, I want to find all businesses selling a certain product within a certain distance from my current location.

In a hurry?  Skip the explanations and go right to the final code.

I need a few things to meet this goal:

  1. A geocoder, to translate addresses into latitude and longitude ("lat/lon").
  2. A Lucene index with the name, product, and lat/lon for each business.
  3. A way to get the current location as a lat/lon.
  4. An algorithm to limit hits to the spatial area I'm interested in.

In future articles, I'll show the code for the first three.  In this article, I'm focusing on what was the hardest part to implement.

The first version: filtering hits after a query

The obvious way is to perform your query, then iterating over the hits and throwing out those who are not in range.  The code would be something like this:

$query  = 'apples'; // what product I'm looking for
$center = new Position(40.9, -73.1);  // my current location, as a lat/lon
$radius = 5; // how far away I want to look, in miles

$index = Zend_Search_Lucene::open('/path/to/index');
foreach ($index->find($query) as $hit) {
    if ($hit->getPosition()->within($radius, $center)) {
        // this is a match!
    }
}

Obviously, I'm hand-waving the Position class and its within() method, but this is pretty easy to implement.  The Haversine formula is a good start along those lines.

Now imagine I have a very large database, say hundreds of thousands of businesses all internationally.  And, to keep performance good, I wanted to limit my results to the first 150 with Zend_Search_Lucene::setResultSetLimit().  With this algorithm, both results number 150 and number 151 match my query, but result 150 might NOT be in the range, while result 151 might be.  Whoops!

The second version: spatial ranges

Really what I want to do is add a spatial range to my query.  Hand-waving again, I want something like "+apples +within:5".  That is, I want the find() method to include only those businesses that sell apples and are within 5 miles of me.  To do that, I've got to index the latitude and longitude of the businesses, and I've got to use some Lucene expression to describe the area.

Enter Zend_Search_Lucene_Search_Query_Range.  The ranges I'm going to give it are latitude and longitude ranges corresponding to the area I want.  How do I convert the variables radial distance and center point to an area I can use in a range?  Look at this picture:

Animated picture showing algorithm to compute minimum bounding rectangle from a center latitude/longitude and radius.

First, the center point is defined and the radius drawn from it.  Then, a square is circumscribed around that circle.  That square is called a minimum bounding rectangle and gives enough information to provide a range of latitudes and longitudes to search over.  What I've got now is something like this: "+apples +latitude:[-120.01 TO -119.87] +longitude:[79.0881 TO 80.1024]", which I can build with this code:

class Application_Service_Search_Query {
    public static function addRadialDistanceFromPointToQuery(Zend_Search_Lucene_Search_Query $query, $latitude, $longitude, $radiusInMiles) {
        // convert radial distance in miles to degrees latitude
        // 1 deg lat = 69.170323 statue miles (pi/180*R_earth_in_miles)
        static $milesPerDegreeLatitude = 69.170323;
        $radiusInDegreesLatitude = $radiusInMiles/$milesPerDegreeLatitude;

        // get range of latitude by adding & subtracting radius in degrees latitude from geocoded latitude
        $lat1 = $latitude - $radiusInDegreesLatitude;
        $lat2 = $latitude + $radiusInDegreesLatitude;

        // get range of longitude by adding & subtracting radius in degrees longitude from geocoded longitude
        // NOTE: 1 deg long = 1 deg lat * cosine(lat)
        $lon1 = $longitude - ($radiusInDegreesLatitude * cos(deg2rad($lat1)));
        $lon2 = $longitude + ($radiusInDegreesLatitude * cos(deg2rad($lat2)));

        // normalize values for inclusion in query
        $minLat = Application_Service_Search_Normalize::latitude(min($lat1, $lat2));
        $maxLat = Application_Service_Search_Normalize::latitude(max($lat1, $lat2));
        $minLon = Application_Service_Search_Normalize::longitude(min($lon1, $lon2));
        $maxLon = Application_Service_Search_Normalize::longitude(max($lon1, $lon2));

        // add a range term for latitude
        $latRange = new Zend_Search_Lucene_Search_Query_Range(
                        new Zend_Search_Lucene_Index_Term($minLat, 'latitude'),
                        new Zend_Search_Lucene_Index_Term($maxLat, 'latitude'),
                        true // inclusive
                    );
        $query->addSubquery($latRange, true /* required */);

        // add a range term for longitude
        $lonRange = new Zend_Search_Lucene_Search_Query_Range(
                        new Zend_Search_Lucene_Index_Term($minLon, 'longitude'),
                        new Zend_Search_Lucene_Index_Term($maxLon, 'longitude'),
                        true // inclusive
                    );
        $query->addSubquery($lonRange, true /* required */);
    }
}

Notice the "normalization" calls?  In an ideal world, I wouldn't need them.  But the reality is that Lucene's query range uses lexigraphical comparison: values are compared character-by-character, like how strcmp() behaves.  That's why you see examples with date ranges in a canonical format of "20110807", which is YYYYMMDD format.

So every latitude and longitude must be normalized into a format that can be lexigraphically compared.  This is a rare case where it's easier done than said.  All we need do is convert the possible latitudes and longitudes into strings of equal length.  That means the minimum value should correspond to "0000" out to some desired precision (4 digits in that example, 8 digits below) and there should never be negatives.  The algorithms are:

class Application_Service_Search_Normalize {
    public static function latitude($latitude) {
        return sprintf('%08d', round((85.0511+$latitude)*100000));
    }
    public static function longitude($longitude) {
        return sprintf('%08d', round((180.0+$longitude)*100000));
    }
}

Beware normalization

The normalization needs to happen in both the query and the indexing.  For the query, we're adding terms in programmatically, so we do the normalization programmatically.  But, when you're building your index, you need to call the normalization methods for latitude and longitude.

Conclusion: Putting it all together

To search for documents within a certain radius of a point with Zend_Search_Lucene, you need:

  1. Normalized latitude and longitude for each document in your index
  2. To represent the spatial area using Zend_Search_Lucene_Search_Query_Range

The code below puts the pieces together:

class Application_Service_Search_Normalize {
    public static function latitude($latitude) {
        return sprintf('%08d', round((85.0511+$latitude)*100000));
    }
    public static function longitude($longitude) {
        return sprintf('%08d', round((180.0+$longitude)*100000));
    }
}
class Application_Service_Search_Query {
    public static function addRadialDistanceFromPointToQuery(Zend_Search_Lucene_Search_Query $query, $latitude, $longitude, $radiusInMiles) {
        static $milesPerDegreeLatitude = 69.170323;
        $radiusInDegreesLatitude = $radiusInMiles/$milesPerDegreeLatitude;

        $lat1 = $latitude - $radiusInDegreesLatitude;
        $lat2 = $latitude + $radiusInDegreesLatitude;
        $lon1 = $longitude - ($radiusInDegreesLatitude * cos(deg2rad($lat1)));
        $lon2 = $longitude + ($radiusInDegreesLatitude * cos(deg2rad($lat2)));

        $minLat = Application_Service_Search_Normalize::latitude(min($lat1, $lat2));
        $maxLat = Application_Service_Search_Normalize::latitude(max($lat1, $lat2));
        $minLon = Application_Service_Search_Normalize::longitude(min($lon1, $lon2));
        $maxLon = Application_Service_Search_Normalize::longitude(max($lon1, $lon2));

        $latRange = new Zend_Search_Lucene_Search_Query_Range(
                        new Zend_Search_Lucene_Index_Term($minLat, 'latitude'),
                        new Zend_Search_Lucene_Index_Term($maxLat, 'latitude'),
                        true
                    );
        $query->addSubquery($latRange, true);

        $lonRange = new Zend_Search_Lucene_Search_Query_Range(
                        new Zend_Search_Lucene_Index_Term($minLon, 'longitude'),
                        new Zend_Search_Lucene_Index_Term($maxLon, 'longitude'),
                        true
                    );
        $query->addSubquery($lonRange, true);
    }
}

$index = new Zend_Search_Lucene('/path/to/index');

// the inputs
$query = 'apples'; // probably from a form
$lat = 36.987;     // from a geocoder, the original location coming from a form
$lon = -78.242;    // more on geocoding in a future article
$radius = 5;       // probably from a form

// run the query
$parse = Zend_Search_Lucene_Search_QueryParser::parse($query);
Application_Service_Search_Query::addRadialDistanceFromPointToQuery($parse, $lat, $lon, $radius);

$index->find($parse); // only contains apples within 5 miles of (36.987,-78.242)

// remember to build your index with **normalized** latitude and longitude, like:
$doc = new Zend_Search_Lucene_Document();
$doc->addField(
  Zend_Search_Lucene_Field::UnStored('latitude', Application_Service_Search_Normalize::latitude(36.987))
); 
$doc->addField(
  Zend_Search_Lucene_Field::UnStored('longitude', Application_Service_Search_Normalize::longitude(-78.242))
);
$index->addDocument($doc);

If you're read this far, you deserve chocolate!

Trackback URL for this post:

http://www.ideacode.com/trackback/47

Comments

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Syntax highlight code surrounded by the {syntaxhighlighter SPEC}...{/syntaxhighlighter} tags, where SPEC is a Syntaxhighlighter options string or "class="OPTIONS" title="the title".
  • Lines and paragraphs break automatically.

More information about formatting options

CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.
Image CAPTCHA
Enter the characters shown in the image.

Quick Links