I have an ESRI shape file containing a library of about 150 contiguous geographic areas (polygons) that together comprise a geographic region. I also have a csv file containing 60,000 events which each have a unique set of x,y point coordinates. Each of the events in the csv file occurred within one (and only one) of the 150 polygons in the shape file, but I do not know which one of the polygons to associate with each record. Therefore, I need to write an algorithm that will find out the identity of the polygon within which each of the 60,000 events occured. The output from the algorithm that I need to write will enable me to subsequently compute statistics, such as the likelihood of specific types of events occurring within specific polygons(geographic areas).
(Also, of course, the shape file is not just one file. It is a directory containing 8 files with file extensions including .dbf, .prj, .qix, .sbn, .sbx, .shp, .xml, and .shx.)
I do not have an ArcGIS license. I have found the File Geodatabase API at http://resources.arcgis.com/content/geodatabases/10.0/file-gdb-api but I am not sure that it is the right toolset, and I am also having trouble finding sample code.
Can anyone show me some code for finding which polygon (from a shape file) each of a large number of points (from an external data source such as a csv file) falls within?
Also, I need to specify that I need to be able to add a specific code for the identity of the relevant polygon to the csv record for each event. So it is not enough for me to simply plot the points on the map to visualize which polygons contain the events. I do not need to visualize this data at all. Instead, what I need is to be able to tag a polygon id to each event record in the csv file so that I can do subsequent numerical analysis that is not visual in nature.
Links to articles, tutorials, and other resources on this topic are also appreciated. I imagine that this is a problem that people solve every day, so there must be an established code base out there, if a person knows how to look for it. I code in Java every day, so Java solutions are preferred. However, I can port something from another language if you have a good code sample written in a different language.
*EDIT: *
I tried the following R code based on Spacedman's suggestion, and I got the following error message:  
> myCSV <- read.csv(file="myCSVFile.csv",head=TRUE,sep=",")  
> pts = SpatialPoints(myCSV)  
> ZipCodes = readShapeSpatial("path/myshapefile.shp")  
> overlay(myCSV,ZipCodes)  
Error in function (classes, fdef, mtable)  : unable to find an inherited method for function "overlay", for signature "data.frame", "SpatialPolygonsDataFrame"  
> 
See my other comments below.
SECOND EDIT:
The R code that I ended up using is:
myCSV <- read.csv(file="myData.csv",head=TRUE,sep=",")  
pts = SpatialPoints(myCSV)  
ZipCodes = readShapeSpatial("myPath/ZipCodes.shp")  
write.csv(ZipCodes$ZIPCODE[overlay(pts,ZipCodes)], "ZipMatches.csv", quote=FALSE, row.names=FALSE)
Note: I had to use:
summary(ZipCodes)  
to find the name of the field that contained the encodings for the zip codes. Before I ran summary(ZipCodes), the script was just putting out the index of each zip code and not the zip code itself.
The java library for doing point/poly operations is the JTS:
http://www.vividsolutions.com/jts/JTSHome.htm
You probably need something else to read the shapefile, maybe this:
http://openmap.bbn.com/doc/api/com/bbn/openmap/layer/shape/ShapeFile.html
or maybe java bindings to OGR and GDAL:
http://gdal.org/java/
HOWEVER.. you might just be able to do this with an open source GIS package: Quantum GIS is my fave, but if you want a Java-based one there's a few - OpenJump, gvSIG (www.osgeo.org for everything on those things).
In R, if you have read your point coordinates into a matrix or data frame:
> xy
            [,1]       [,2]
 [1,]  15.224275  -0.840066
 [2,]  -1.207046   7.959644
 [3,]   9.397658  17.426323
 [4,]  28.242840 -29.581008
 [5,]  18.664603  15.361146
 [6,]   0.975846   8.534910
 [7,] -10.941825  10.438541
 [8,] -10.331097  20.260005
 [9,]   8.105570   9.595169
[10,] -14.468177  14.366980
Then, using the maptools package and its dependencies:
> require(maptools)
First make a SpatialPoints object from your coordinates:
> pts = SpatialPoints(xy)
Then read in your shapefile:
> africa=readShapeSpatial("/path/to/africa.shp")
Now overlay:
> overlay(pts,africa)
 [1] 12 20 39 27 10 55 22 33 40 45
that's a vector of row numbers in the shapefile. You can lookup attributes in the shapefile easy enough:
> africa$CNTRY_NAME[overlay(pts,africa)]
 [1] Congo      Ghana      Niger      Lesotho    Chad       Togo      
 [7] Guinea     Mauritania Nigeria    Senegal   
61 Levels: Algeria Angola Benin Botswana Burkina Faso Burundi ... Zimbabwe
Then if you want to write a vector to a CSV file,
> write.csv(africa$CNTRY_NAME[overlay(pts,africa)],file="out.csv")
produces:
"","x"
"1","Congo"
"2","Ghana"
"3","Niger"
"4","Lesotho"
"5","Chad"
"6","Togo"
"7","Guinea"
"8","Mauritania"
"9","Nigeria"
"10","Senegal"
Comma separated, quoted, with a header called 'x'. These things are tweakable with other options to write.csv.
If any of your points fall outside the polygons, the return overlay vector will be an 'NA' value, you may want to test for this:
> if(any(is.na(overlay(pts,africa)))){stop("splash!")}
Error: splash!
Nuff?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With