Modeling spatial relationships
This document provides additional information about tool parameters and introduces essential vocabulary and concepts that are important when you analyze your data using the Spatial Statistics tools.
Whenever distance is a component of your analysis, which is almost always the case with the Spatial Statistics tools, project your data using a projected coordinate system (rather than a geographic coordinate system based on degrees, minutes, and seconds).
Conceptualization of spatial relationships
An important difference between spatial and traditional (aspatial or nonspatial) statistics is that spatial statistics integrate space and spatial relationships directly into their mathematics. Consequently, many of the tools in the spatial statistics toolbox require the user to select a value for the Conceptualization of Spatial Relationships parameter prior to analysis. Common conceptualizations include inverse distance, travel time, fixed distance, K nearest neighbors, and contiguity. The conceptualization of spatial relationships you use will depend on what you're measuring. If you're measuring clustering of a particular species of seed-propagating plant, for example, inverse distance is probably most appropriate. However, if you are assessing the geographic distribution of a region's commuters, travel time or travel cost might be better choices for describing those spatial relationships. For some analyses, space and time might be less important than more abstract concepts like familiarity (the more familiar something is, the more functionally near it is) or spatial interaction (there are many more phone calls, for example, between Los Angeles and New York than between New York and a smaller town nearer to New York, like Poughkeepsie; some might argue that Los Angeles and New York are functionally closer).
Options for the Conceptualization of Spatial Relationships parameter are discussed below. The option you select determines neighbor relationships for tools that assess each feature within the context of neighboring features. These tools include the Spatial Autocorrelation (Global Moran's I), Hot Spot Analysis (Getis-Ord Gi*), and Cluster and Outlier Analysis (Anselin Local Moran's I) tools. Note that some of these options are only available if you use the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools.
Inverse distance, inverse distance squared (impedance)
With the Inverse Distance options, the conceptual model of spatial relationships is one of impedance, or distance decay. All features impact/influence all other features, but the farther away something is, the smaller the impact it has. You will generally want to specify a Distance Band or Threshold Distance value when you use an inverse distance conceptualization to reduce the number of required computations, especially with large datasets. When no distance band or threshold distance is specified, a default threshold value is computed for you. You can force all features to be a neighbor of all other features by setting Distance Band or Threshold Distance to zero.
Inverse Euclidean distance is appropriate for modeling continuous data like temperature variations, for example. Inverse Manhattan distance might work best when analyses involve the locations of hardware stores or other fixed urban facilities, in the case where road network data isn't available. The conceptual model when you use the Inverse Distance Squared option is the same as with Inverse Distance except the slope is sharper so neighbor influences drop off more quickly and only a target feature's closest neighbors will exert substantial influence in computations for that feature.
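The difference between the two decay curves can be illustrated with a minimal sketch. The function name `inverse_distance_weight` and its parameters are hypothetical and chosen for illustration only; this is not the tools' internal implementation, and the handling of coincident features (distance zero) is left out.

```python
def inverse_distance_weight(d, power=1, threshold=None):
    """Return an inverse-distance weight for a distance d > 0.

    power=1 models inverse distance; power=2 models inverse distance squared.
    threshold=None or 0 means no cutoff (every feature is a neighbor);
    otherwise features farther away than threshold get a weight of zero.
    """
    if d <= 0:
        raise ValueError("distance must be positive")
    if threshold and d > threshold:
        return 0.0
    return 1.0 / d ** power


# Compare how quickly influence drops off under each option.
for d in (1, 2, 4, 8):
    print(d, inverse_distance_weight(d), inverse_distance_weight(d, power=2))
```

At a distance of 4 units the squared option already weights a neighbor at one quarter of the plain inverse distance weight, which is why only the closest neighbors exert substantial influence.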
Distance band (sphere of influence)
For some tools, like Hot Spot Analysis, a fixed distance band is the default conceptualization of spatial relationships. With the Fixed Distance Band option, you impose a "sphere of influence" or moving window conceptual model of spatial interactions onto the data. Each feature is analyzed within the context of those neighboring features located within the distance you specify for Distance Band or Threshold Distance. Neighbors within the specified distance are weighted equally. Features outside the specified distance don't influence calculations (their weight is zero). Use the Fixed Distance Band method when you want to evaluate the statistical properties of your data at a particular (fixed) spatial scale. If you are studying commuting patterns and know that the average journey to work is 15 miles, for example, you may want to use a 15-mile fixed distance for your analysis. See Selecting a fixed distance band value.
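As a rough sketch of the moving window idea, the following finds the neighbors of each point within a fixed band. The function name `fixed_band_neighbors` and the sample coordinates are hypothetical, and treating a feature exactly at the band distance as a neighbor is an assumption.

```python
import math


def fixed_band_neighbors(points, band):
    """Map each point index to the indices of the points within `band` of it."""
    neighbors = {i: [] for i in range(len(points))}
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            # Every neighbor inside the band gets the same weight (1);
            # everything outside is simply left out (weight 0).
            if i != j and math.hypot(xi - xj, yi - yj) <= band:
                neighbors[i].append(j)
    return neighbors


# Illustrative coordinates in projected units (for example, miles).
points = [(0, 0), (3, 4), (10, 0), (30, 0)]
neighbors = fixed_band_neighbors(points, band=15)
print(neighbors)
```

The last point ends up with no neighbors at this band, which is the situation the guidelines on choosing a distance are meant to avoid.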
Zone of indifference
The Zone of Indifference option for the Conceptualization of Spatial Relationships parameter combines the Inverse Distance and Fixed Distance Band models. Features within the distance band or threshold distance are included in analyses for the target feature. Once the critical distance is exceeded, the level of influence (the weighting) quickly drops off. Suppose you're looking for a job and have the choice between a job 5 miles away and another job 6 miles away. You probably won't think much about distance in making a decision about which job to take. Now, suppose you have the choice between one job 5 miles away and another 20 miles away. In this case, distance becomes more of an impedance and may be factored into your decision making. Use this method when you want to hold the scale of analysis fixed but don't want to impose sharp boundaries on the neighboring features included in target feature computations.
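The shape of this model can be sketched as a weight that stays constant inside the threshold and decays beyond it. The decay form used here (`threshold / d`) is an assumption chosen for illustration because it is continuous at the threshold; the source does not state the formula the tools use.

```python
def zone_of_indifference_weight(d, threshold):
    """Weight 1 inside the threshold, decaying weight beyond it.

    The threshold / d decay is an illustrative assumption, not the
    documented formula.
    """
    if d < 0 or threshold <= 0:
        raise ValueError("d must be >= 0 and threshold must be > 0")
    if d <= threshold:
        return 1.0
    return threshold / d


# Job-search example: with a 10-mile threshold, 5 and 6 miles are
# equivalent, while 20 miles carries noticeably less weight.
for miles in (5, 6, 20):
    print(miles, zone_of_indifference_weight(miles, threshold=10))
```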
Polygon contiguity (first order)
For polygon feature classes, you can choose first order contiguity. Polygons that share an edge (have coincident boundaries) are included in computations for the target polygon. Polygons that do not share an edge are excluded from the target feature computations. This option is also referred to as Polygon Contiguity Edges Only. Polygon Contiguity Edges and Corners (available using the Generate Spatial Weights Matrix tool) constructs neighbors from polygons that share either a boundary (edge) or a corner (node). Use one of these contiguity conceptualizations with polygon features in cases where you are modeling some type of contagious process or are dealing with continuous data represented as polygons.
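The two contiguity rules can be sketched for polygons stored as vertex rings. This sketch assumes that coincident boundaries use identical vertices in both polygons, which real data does not guarantee; the helper names are hypothetical.

```python
def edges(polygon):
    """Return the set of undirected boundary segments of a vertex ring."""
    ring = list(polygon)
    return {
        frozenset((ring[i], ring[(i + 1) % len(ring)]))
        for i in range(len(ring))
    }


def share_edge(a, b):
    """Edges only: the polygons have at least one segment in common."""
    return bool(edges(a) & edges(b))


def share_edge_or_corner(a, b):
    """Edges and corners: the polygons have at least one vertex in common."""
    return bool(set(a) & set(b))


# Three unit squares: B is to the right of A, C is diagonal to A and above B.
A = [(0, 0), (1, 0), (1, 1), (0, 1)]
B = [(1, 0), (2, 0), (2, 1), (1, 1)]
C = [(1, 1), (2, 1), (2, 2), (1, 2)]

print(share_edge(A, B), share_edge(A, C), share_edge_or_corner(A, C))
```

Squares A and C touch only at a corner, so they are neighbors under the edges and corners rule but not under the edges only rule.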
K nearest neighbors
Neighbor relationships may also be constructed so that each feature is assessed within the spatial context of a fixed number of its closest neighbors. If K (the number of neighbors) is 8, then the eight closest neighbors to the target feature will be included in computations for that feature. In locations where feature density is high, the spatial context of the analysis will be smaller. Similarly, in locations where feature density is sparse, the spatial context for the analysis will be larger. An advantage to this model of spatial relationships is that it ensures there will be some neighbors for every target feature, even when feature densities vary widely across the study area. This method is available using the Generate Spatial Weights Matrix tool.
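A brute-force sketch of this idea follows; it also reports the distance to the K-th neighbor to show how the spatial context grows where features are sparse. The function name and sample coordinates are hypothetical, and ties are broken by feature index as an arbitrary choice.

```python
import math


def k_nearest_neighbors(points, k):
    """Map each point index to (neighbor indices, distance to k-th neighbor)."""
    result = {}
    for i, p in enumerate(points):
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        nearest = ranked[:k]
        result[i] = ([j for _, j in nearest], nearest[-1][0])
    return result


# Dense on the left, sparse on the right (illustrative projected units).
points = [(0, 0), (1, 0), (2, 0), (10, 0), (50, 0)]
knn = k_nearest_neighbors(points, k=2)
for i, (ids, reach) in knn.items():
    print(i, ids, reach)
```

Every feature gets exactly two neighbors, but the first point reaches only 2 units to find them while the last point reaches 48 units.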
Delaunay triangulation (natural neighbors)
The Delaunay Triangulation option constructs neighbors by creating Delaunay triangles from point features or from feature centroids such that each point/centroid is a triangle node. Nodes connected by a triangle edge are considered neighbors. Using Delaunay triangulation ensures every feature will have at least one neighbor even when data includes islands and/or widely varying feature densities. This method is available using the Generate Spatial Weights Matrix tool.
Get spatial weights from file (user-defined spatial relationships)
You can also provide a path to a formatted ASCII text file that defines your own custom conceptualization of spatial relationships (based on spatial interaction, for example). If you want to define spatial relationships using travel time or travel costs derived from a network dataset, create a spatial weights matrix file using the Generate Network Spatial Weights tool, then use the resultant .swm file for your analysis. You can also construct a spatial weights matrix file using the Generate Spatial Weights Matrix tool. If the spatial relationships for your features are defined in a table, you can use the Generate Spatial Weights Matrix tool to convert that table into a spatial weights matrix (.swm) file. Particular fields are required to convert a table to a .swm file.
Selecting a conceptualization of spatial relationships: Best practices
The more realistically you can model how features interact with each other in space, the more accurate your results will be. Your choice for the Conceptualization of Spatial Relationships parameter should reflect inherent relationships among the features you are analyzing. Sometimes your choice will also be influenced by characteristics of your data.
The inverse distance methods, for example, are most appropriate with continuous data or to model processes where the closer two features are in space, the more likely they are to interact/influence each other. With this spatial conceptualization, every feature is potentially a neighbor of every other feature, and with large datasets, the number of computations involved will be enormous. You should always try to include a Distance Band or Threshold Distance value when using the inverse distance conceptualizations. This is particularly important for large datasets. If you leave the Distance Band or Threshold Distance parameter blank, a threshold distance will be computed for you, but this may not be the most appropriate distance for your analysis; the default distance threshold will be the minimum distance that ensures every feature has at least one neighbor.
The fixed distance band method works well for polygon data where there is a large variation in polygon size (very large polygons at the edge of the study area and very small polygons at the center of the study area, for example). Fixed Distance Band is also recommended for point data when running Hot Spot Analysis (Getis-Ord Gi*). See Selecting a fixed distance band value below for strategies to help you determine an appropriate distance band for your analysis.
The zone of indifference conceptualization works well when Fixed Distance is appropriate but imposing sharp boundaries on neighborhood relationships is not an accurate representation of your data. Keep in mind that the Zone of Indifference conceptual model considers every feature to be a neighbor of every other feature. This option is not appropriate for large datasets since the Distance Band or Threshold Distance value supplied does not limit the number of neighbors but only specifies where the intensity of spatial relationships begins to wane.
The polygon contiguity conceptualizations are effective when polygons are similar in size and distribution and when spatial relationships are a function of polygon proximity (the idea that if two polygons share a boundary, spatial interaction between them increases). When you select a polygon contiguity conceptualization, you will almost always want to select row standardization for tools that have the Row Standardization parameter.
The K nearest neighbors option is effective when you want to ensure you have a minimum number of neighbors for your analysis. Especially when the values associated with your features are skewed (are not normally distributed), it is important that each feature is evaluated within the context of at least eight or so neighbors (this is a rule of thumb only). When the distribution of your data varies across your study area so that some features are far away from all other features, this method works well. Note, however, that the spatial context of your analysis changes depending on variations in the sparsity/density of your features. When fixing the scale of analysis is less important than fixing the number of neighbors, the K nearest neighbors method is appropriate.
Some analysts consider Delaunay triangulation a way to construct natural neighbors for a set of features. This method is a good option when your data includes island polygons (isolated polygons that do not share any boundaries with other polygons) or in cases where there is a very uneven spatial distribution of features. Similar to the K nearest neighbors method, Delaunay triangulation ensures every feature has at least one neighbor but uses the distribution of the data itself to determine how many neighbors each feature gets.
For some applications, spatial interaction is best modeled in terms of travel time or travel distance. If you are modeling accessibility to urban services, for example, or looking for urban crime hot spots, modeling spatial relationships in terms of a network is a good option. Use the Generate Network Spatial Weights tool to create a spatial weights matrix file (.swm) prior to analysis; select GET_SPATIAL_WEIGHTS_FROM_FILE for your Conceptualization of Spatial Relationships value; then, for the Weights Matrix File parameter, provide the full path to the .swm file you created.
ESRI Data & Maps, free to ArcGIS users, contains StreetMap data including a prebuilt network dataset in SDC format. The coverage for this dataset is the United States and Canada. These network datasets can be used directly by the Generate Network Spatial Weights tool.
If none of the options for the Conceptualization of Spatial Relationships parameter work well for your analysis, you can create an ASCII text file or table with the feature to feature relationships and use these to build a spatial weights matrix file. It is also possible to edit spatial weights matrix files.
Selecting a fixed distance band value
Think of the fixed distance band you select as a moving window that momentarily settles on top of each feature and looks at that feature within the context of its neighbors. There are several guidelines to help you identify an appropriate distance band for analysis:
- Select a distance based on what you know about the geographic extent of the spatial processes promoting clustering for the phenomena you are studying. Often, you won't know this, but if you do, you should use your knowledge to select a distance value. Suppose, for example, you know that the average journey-to-work commute distance is 15 miles. Using 15 miles for the distance band is a good strategy for analyzing commuting data.
- Use a distance band that is large enough to ensure all features will have at least one neighbor. Especially if the input data is skewed (does not create a nice bell curve when you plot the values as a histogram), you will want to make sure that your distance band is neither too small (most features have only one or two neighbors) nor too large (several features include all other features as neighbors), because that would make resultant z-scores less reliable. The z-scores are reliable (even with skewed data) as long as the distance band is large enough to ensure several neighbors (approximately 8) for each feature.
- Use a distance band that reflects maximum spatial autocorrelation. Whenever you see spatial clustering on the landscape, you are seeing evidence of underlying spatial processes at work. The distance band that exhibits maximum clustering, as measured by the Spatial Autocorrelation (Global Moran's I) or Multi-Distance Spatial Cluster Analysis (Ripley's k-function) tools, is the distance where those spatial processes are most "active" or most pronounced. Run the Spatial Autocorrelation tool at multiple distances (0.5, 1.0, 1.5 miles, and so forth) and note where the resulting z-score seems to peak. Use the distance associated with the peak value for your analysis. Alternatively, if you are working with incident data, run Multi-Distance Spatial Cluster Analysis (Ripley's k-function) on the unaggregated incidents for a range of distances and identify where the difference between the observed and expected K values peaks (the DiffK field). Use the distance associated with the largest difference for your analysis. Note: Distance values should be entered using the same units as specified by the geoprocessing environment output coordinate system.
- Every peak represents a distance where the processes promoting spatial clustering are pronounced. Multiple peaks are common. Generally the peaks associated with larger distances reflect broad trends (a broad east to west trend, for example, where the west is a giant hot spot and the east is a giant cold spot); generally you will be most interested in peaks associated with smaller distances.
- An inconspicuous peak often means there are many different spatial processes operating at a variety of spatial scales. You probably want to look for other criteria to determine which fixed distance to use for your analysis (perhaps the most effective distance for remediation).
- If the z-score never peaks (in other words, it just keeps increasing) and if you are using aggregated data (for example, counties), it usually means the aggregation scheme is too coarse; the spatial processes of interest are operating at a scale that is smaller than the scale of your aggregation units. If you can move to a smaller scale of analysis (moving from counties to tracts, for example), this may help find a peak distance.
- Try not to get stuck on the idea that there is only one correct distance band. Reality is never that simple. Most likely there are multiple/interacting spatial processes promoting observed clustering. Rather than thinking you need one distance band, think of the pattern analysis tools as effective methods for exploring spatial relationships at multiple spatial scales. Consider that when you change the scale (change the distance band value), you could be asking a different question. Suppose you are looking at income data. With small distance bands, you can examine neighborhood income patterns, middle scale distances might reflect community or city income patterns, and the largest distance bands would highlight broad regional income patterns.
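Once you have recorded a z-score for each distance you tried, picking out the peaks is a small bookkeeping task. The sketch below shows one way to do it; the z-scores are made-up illustrative numbers, not results from any real dataset, and `find_peaks` is a hypothetical helper.

```python
def find_peaks(distances, z_scores):
    """Return (distance, z) pairs where z is higher than both of its neighbors."""
    peaks = []
    for i in range(1, len(z_scores) - 1):
        if z_scores[i - 1] < z_scores[i] > z_scores[i + 1]:
            peaks.append((distances[i], z_scores[i]))
    return peaks


# Hypothetical z-scores recorded at 0.5-mile increments.
distances = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
z_scores = [2.1, 3.4, 4.8, 4.2, 4.5, 5.6, 5.1]

peaks = find_peaks(distances, z_scores)
print(peaks)
```

With these illustrative values there are two peaks; following the guidelines above, the one at the smaller distance is usually the more interesting starting point.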
Distance method
Many of the tools in the Spatial Statistics toolbox use distance in their calculations. These tools provide you with the choice of either Euclidean or Manhattan distance.
- Euclidean distance is calculated as
$$D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
where $(x_1, y_1)$ is the coordinate for point A, $(x_2, y_2)$ is the coordinate for point B, and $D$ is the straight-line distance between points A and B.
- Manhattan distance is calculated as
$$D = |x_1 - x_2| + |y_1 - y_2|$$
where $(x_1, y_1)$ is the coordinate for point A, $(x_2, y_2)$ is the coordinate for point B, and $D$ is the vertical plus horizontal difference between points A and B. It is the distance you must travel if you are restricted to north–south and east–west travel only. This method is generally more appropriate than Euclidean distance when travel is restricted to a street network and where actual street network travel costs are not available.
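Both distance formulas translate directly into code. The function names below are hypothetical; the coordinates are assumed to be in a projected coordinate system.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between points a and b, given as (x, y)."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def manhattan_distance(a, b):
    """Horizontal plus vertical difference between points a and b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


point_a = (0, 0)
point_b = (3, 4)
print(euclidean_distance(point_a, point_b))  # 5.0
print(manhattan_distance(point_a, point_b))  # 7
```

Manhattan distance is never shorter than Euclidean distance for the same pair of points, which matches the intuition that grid-constrained travel cannot beat a straight line.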
Self-potential (field giving intrazonal weight)
Several tools in the Spatial Statistics toolbox allow you to provide a field representing the weight to use for self-potential. Self-potential is the distance or weight between a feature and itself. Often this weight is zero, but in some cases, you may want to specify another fixed value or a different value for every feature. If your conceptualization of spatial relationships is based on distances traveled within and among census tracts, for example, you might decide to model self-potential to reflect average intrazonal travel costs based on polygon size:
$$d_{ii} = 0.5\sqrt{A_i / \pi}$$
where $d_{ii}$ is the travel cost associated with intrazonal travel for polygon feature $i$, and $A_i$ is the area associated with polygon feature $i$.
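A short sketch of this calculation follows. The tract names and areas are hypothetical; in practice you would write the computed values to the field you supply for self-potential.

```python
import math


def intrazonal_cost(area):
    """Half the radius of a circle whose area equals the polygon's area."""
    return 0.5 * math.sqrt(area / math.pi)


# Hypothetical polygon areas in squared projected units.
areas = {"tract_a": math.pi * 4.0, "tract_b": 100.0}
self_potential = {name: intrazonal_cost(a) for name, a in areas.items()}
print(self_potential)
```

For `tract_a` the equivalent circle has a radius of 2, so the self-potential is 1.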
Standardization
Row standardization is recommended whenever the distribution of your features is potentially biased due to sampling design or an imposed aggregation scheme. When row standardization is selected, each weight is divided by its row sum (the sum of the weights of all neighboring features). Row standardized weighting is often used with fixed distance neighborhoods and almost always used for neighborhoods based on polygon contiguity. This is to mitigate bias due to features having different numbers of neighbors. Row standardization will scale all weights so they are between 0 and 1, creating a relative, rather than absolute, weighting scheme. Anytime you are working with polygon features representing administrative boundaries, you will likely want to choose the Row Standardization option.
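Row standardization as described above can be sketched in a few lines. The nested dictionary layout and the feature IDs are assumptions made for illustration.

```python
def row_standardize(weights):
    """Divide each weight by the sum of the weights in its row.

    weights maps a feature ID to a dict of {neighbor ID: weight}.
    Rows that sum to zero are left with zero weights.
    """
    standardized = {}
    for feature, row in weights.items():
        total = sum(row.values())
        standardized[feature] = {
            neighbor: (w / total if total else 0.0)
            for neighbor, w in row.items()
        }
    return standardized


# Feature A has four contiguous neighbors; feature B has only one.
weights = {
    "A": {"B": 1, "C": 1, "D": 1, "E": 1},
    "B": {"A": 1},
}
standardized = row_standardize(weights)
print(standardized)
```

After standardization each row sums to 1, so a feature with four neighbors no longer contributes four times as much weight as a feature with one.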
Distance band or threshold distance
Distance Band or Threshold Distance sets the scale of analysis for most conceptualizations of spatial relationships (for example, Inverse Distance, Fixed Distance Band). It is a positive numeric value representing a cutoff distance. Features outside the specified cutoff for a target feature are ignored in the analysis for that feature. With Zone of Indifference, however, the influence of features outside the given distance is reduced in relation to proximity, while those inside the distance threshold are equally considered.
Choosing an appropriate distance is important. Some spatial statistics require each feature to have at least one neighbor for the analysis to be reliable. If the value you set for Distance Band or Threshold Distance is too small (so that some features have no neighbors), a warning message appears suggesting that you try again with a larger distance value. The Calculate Distance Band from Neighbor Count tool will evaluate minimum, average, and maximum distances for a specified number of neighbors and can help you determine an appropriate distance band value to use for analysis. See also Selecting a fixed distance band value for additional guidelines.
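To get a feel for the kind of summary such a neighbor-count evaluation produces, the sketch below computes the distance from each point to its K-th nearest neighbor and reports the minimum, average, and maximum. This is one reading of the description above, not the tool's implementation, and the function names and coordinates are hypothetical.

```python
import math


def kth_neighbor_distances(points, k):
    """Distance from each point to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        result.append(dists[k - 1])
    return result


def summarize(points, k):
    """Return (minimum, average, maximum) k-th neighbor distance."""
    d = kth_neighbor_distances(points, k)
    return min(d), sum(d) / len(d), max(d)


points = [(0, 0), (1, 0), (2, 0), (10, 0)]
minimum, average, maximum = summarize(points, k=1)
print(minimum, average, maximum)
```

With `k=1`, the maximum is the smallest distance band that guarantees every feature has at least one neighbor; any band below 8 units would leave the last point without neighbors.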
When no value is specified, a default threshold distance is computed. The table below indicates how different choices for the Conceptualization of Spatial Relationships parameter behave for each of three possible input types (negative values are not valid):
Distance Band or Threshold Distance value | Inverse Distance, Inverse Distance Squared | Fixed Distance Band, Zone of Indifference | Polygon Contiguity, Delaunay Triangulation, K Nearest Neighbors |
---|---|---|---|
0 | No threshold or cutoff is applied; every feature is a neighbor of every other feature. | Invalid. Runtime error will be generated. | Ignored. |
blank | A default distance will be computed. This default will be the minimum distance to ensure that every feature has at least one neighbor. | A default distance will be computed. This default will be the minimum distance to ensure that every feature has at least one neighbor. | Ignored. |
positive number | The nonzero, positive value specified will be used as a cutoff distance; neighbor relationships will only exist among features within this distance of each other. | For Fixed Distance Band, only features within this specified cutoff of each other will be neighbors. For Zone of Indifference, features within this specified cutoff of each other will be neighbors; features outside the cutoff are neighbors too, but they are assigned a smaller and smaller weight/influence as distance increases. | Ignored. |
Number of neighbors
Specify a positive integer to represent the number of neighbors to include in the analysis for each target feature. When the value chosen for conceptualization of spatial relationships is K Nearest Neighbors, each target feature will be evaluated within the context of the closest K features (where K is the number of neighbors specified). For Inverse Distance or Fixed Distance Band, specifying a value for the Number of Neighbors parameter will ensure that each feature has a minimum of K neighbors. For Polygon Contiguity, the value specified for Number of Neighbors is only applied to island polygons: the K nearest polygons to each target island polygon will be considered neighbors for analysis.
Weights matrix file
Several tools allow you to define spatial relationships among features by providing a path to a spatial weights matrix file. Spatial weights are numbers that reflect the distance, time, or other cost between each feature and every other feature in the dataset. The spatial weights matrix file may be created using the Generate Spatial Weights Matrix tool or Generate Network Spatial Weights tool, or it may be a simple ASCII file.
When the spatial weights matrix file is a simple ASCII text file, the first line should be the name of a unique ID field. This gives you the flexibility to use any suitable field in your dataset as the ID when generating this file; however, the ID field must be of type INTEGER and have unique values for every feature. After the first line, the spatial weights file should be formatted into three columns:
- From feature ID
- To feature ID
- Weight
For example, suppose you have three gas stations. The field you are using as the ID field is called StationID, and the feature IDs are 1, 2, and 3. You want to model spatial relationships among these three gas stations using travel time in minutes. You could create an ASCII file that might look like the following:
    StationID
    1 1 0
    1 2 1/10
    1 3 1/7
    2 1 1/10
    2 3 1/20
    3 1 1/6
    3 2 1/15
    3 3 0
Generally when weights represent distance or time, they are inverted (for example, 1/10 when the distance is 10 miles or 10 minutes) so that nearer features have a larger weight than features that are farther away. Notice from the weights above that gas station 1 is 10 minutes from gas station 2. Notice also that travel time is not symmetrical in this example (traveling from gas station 1 to gas station 3 is 7 minutes, but traveling from gas station 3 to gas station 1 is only 6 minutes). Notice that the weight between gas station 1 and itself is 0 and that there is no entry for gas station 2 to itself. Missing entries are assumed to have a weight of 0.
Typing the values for the spatial weights matrix file can be a tedious job at best, even for small datasets. A better approach is to use the Generate Spatial Weights Matrix tool or to write a quick Python script to perform this task for you.
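A quick script along these lines might look like the following sketch. It builds the file contents from a dictionary of travel times, inverting each time to get a weight. The helper name `weights_file_lines`, the output file name, the space delimiter, and the choice to write weights as decimals are all assumptions; self-to-self entries are omitted on the basis that missing entries are treated as a weight of 0.

```python
import os
import tempfile


def weights_file_lines(id_field, travel_minutes):
    """Build the lines of an ASCII weights file from travel times.

    travel_minutes maps (from ID, to ID) to a travel time in minutes.
    Each weight is the inverse of the travel time.
    """
    lines = [id_field]
    for (from_id, to_id), minutes in sorted(travel_minutes.items()):
        if minutes <= 0:
            continue  # skip entries that cannot be inverted
        lines.append(f"{from_id} {to_id} {1.0 / minutes:.6f}")
    return lines


# Travel times for the three gas stations in the example above.
travel_minutes = {
    (1, 2): 10, (1, 3): 7,
    (2, 1): 10, (2, 3): 20,
    (3, 1): 6, (3, 2): 15,
}
lines = weights_file_lines("StationID", travel_minutes)

# Write to a temporary folder here; use your own path in practice.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "station_weights.txt")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    with open(path) as f:
        written = f.read().splitlines()

print("\n".join(written))
```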
Spatial weights matrix file (.swm)
The Generate Spatial Weights Matrix or Generate Network Spatial Weights tool will create a binary spatial weights matrix file (.swm) defining the spatial relationships among all the features in your dataset based on the parameters you specify.
If you have a table defining the spatial relationships among features in a feature class, use the Generate Spatial Weights Matrix tool to convert the table to a spatial weights matrix file (.swm). The table will need the following fields:
Field name | Description |
---|---|
<Unique ID field name> | An integer field that exists in the input feature class with a unique ID for each feature. This is the from feature ID. |
NID | An integer field containing neighbor feature IDs. This is the to feature ID. |
WEIGHT | This is the numeric weight quantifying the spatial relationship between the from and to features. Larger values reflect bigger weights and stronger influence, or interaction, between two features. |
Sharing spatial weights matrix files
The output from the Generate Spatial Weights Matrix and Generate Network Spatial Weights tools is an .swm file. This file is tied to the input feature class, the unique ID field, and the output coordinate system settings when the .swm file was created. Other people can duplicate the spatial relationships you define for analysis by using both your .swm file and your input feature class. Especially if you plan to share your .swm files with others, try to avoid the situation where your output coordinate system differs from the spatial reference associated with your input feature class. A better strategy is to project the input feature class, then set the output coordinate system to Same as Input Feature Class prior to creating spatial weights matrix files.