我正在尝试为我处理的数据集选择一个合适的SQL Server 2008空间索引设置,感觉很有趣。
该数据集是表示全球轮廓线的多边形。表中有106000行,多边形存储在几何字段中。
我的问题是,许多多边形涵盖了地球的大部分区域。这似乎使得很难获得一个能够在主要过滤器中消除许多行的空间索引。例如,请看以下查询:
SELECT "ID","CODE","geom".STAsBinary() as "geom" FROM "dbo"."ContA"
WHERE "geom".Filter(
geometry::STGeomFromText('POLYGON ((-142.03193662573682 59.53396984952896,
-142.03193662573682 59.88928136451884,
-141.32743833481925 59.88928136451884,
-141.32743833481925 59.53396984952896,
-142.03193662573682 59.53396984952896))', 4326)
) = 1
这是查询与表中仅两个多边形相交的区域。无论我选择哪种空间索引设置,Filter()始终返回约60,000行数据。用STIntersects()替换Filter()当然只返回我想要的两个多边形,但需要更长时间(Filter()为6秒,STIntersects()为12秒)。是否有空间索引设置可以改善60,000行数据,或者我的数据集不适合SQL Server的空间索引?更多信息:如建议的那样,我使用全球4x4网格将多边形分割开来。我找不到使用QGIS的方法,所以我编写了自己的查询。首先,我定义了16个边界框,第一个看起来像这样:declare @box1 geometry = geometry::STGeomFromText('POLYGON ((
-180 90,
-90 90,
-90 45,
-180 45,
-180 90))', 4326)
然后我使用每个边界框来选择和截断与该框相交的多边形:
insert ContASplit
select CODE, geom.STIntersection(@box1), CODE_DESC from ContA
where geom.STIntersects(@box1) = 1
我显然对4x4网格中的所有16个边界框都这样做了。最终结果是我有了一个新表,大约有107,000行(这证实了我实际上没有很多巨大的多边形)。
我添加了一个空间索引,每个对象有1024个单元格,并且每个级别的单元格数量都非常低。
然而,非常奇怪的是,这个拆分多边形的新表的性能仍然与旧表相同。进行上面列出的.Filter()仍然返回大约60,000行。我真的完全不理解这一点,显然我不理解空间索引的工作原理。
矛盾的是,虽然.Filter()仍然返回大约60,000行,但它的性能已经得到了改善。.Filter()现在需要大约2秒钟,而不是6秒钟,.STIntersects()现在需要6秒钟,而不是12秒钟。
按要求,这里是索引的SQL示例:
CREATE SPATIAL INDEX [contasplit_sidx] ON [dbo].[ContASplit]
(
[geom]
)USING GEOMETRY_GRID
WITH (
BOUNDING_BOX =(-90, -180, 90, 180),
GRIDS =(LEVEL_1 = LOW,LEVEL_2 = LOW,LEVEL_3 = LOW,LEVEL_4 = LOW),
CELLS_PER_OBJECT = 1024,
PAD_INDEX = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
请记住,我已经尝试了各种网格和每个对象的单元格设置,但每次结果都相同。
这是运行sp_help_spatial_geometry_index的结果,在我的拆分数据集上,没有任何单个多边形占据了超过1/16的地球表面:
Base_Table_Rows 215138 Bounding_Box_xmin -90 Bounding_Box_ymin -180 Bounding_Box_xmax 90 Bounding_Box_ymax 180 Grid_Size_Level_1 64 Grid_Size_Level_2 64 Grid_Size_Level_3 64 Grid_Size_Level_4 64 Cells_Per_Object 16 Total_Primary_Index_Rows 378650 Total_Primary_Index_Pages 1129 Average_Number_Of_Index_Rows_Per_Base_Row 1 Total_Number_Of_ObjectCells_In_Level0_For_QuerySample 1 Total_Number_Of_ObjectCells_In_Level0_In_Index 60956 Total_Number_Of_ObjectCells_In_Level1_In_Index 361 Total_Number_Of_ObjectCells_In_Level2_In_Index 2935 Total_Number_Of_ObjectCells_In_Level3_In_Index 32420 Total_Number_Of_ObjectCells_In_Level4_In_Index 281978 Total_Number_Of_Interior_ObjectCells_In_Level2_In_Index 1 Total_Number_Of_Interior_ObjectCells_In_Level3_In_Index 49 Total_Number_Of_Interior_ObjectCells_In_Level4_In_Index 4236 Total_Number_Of_Intersecting_ObjectCells_In_Level1_In_Index 29 Total_Number_Of_Intersecting_ObjectCells_In_Level2_In_Index 1294 Total_Number_Of_Intersecting_ObjectCells_In_Level3_In_Index 29680 Total_Number_Of_Intersecting_ObjectCells_In_Level4_In_Index 251517 Total_Number_Of_Border_ObjectCells_In_Level0_For_QuerySample 1 Total_Number_Of_Border_ObjectCells_In_Level0_In_Index 60956 Total_Number_Of_Border_ObjectCells_In_Level1_In_Index 332 Total_Number_Of_Border_ObjectCells_In_Level2_In_Index 1640 Total_Number_Of_Border_ObjectCells_In_Level3_In_Index 2691 Total_Number_Of_Border_ObjectCells_In_Level4_In_Index 26225 Interior_To_Total_Cells_Normalized_To_Leaf_Grid_Percentage 0.004852925 Intersecting_To_Total_Cells_Normalized_To_Leaf_Grid_Percentage 0.288147586 Border_To_Total_Cells_Normalized_To_Leaf_Grid_Percentage 99.70699949 Average_Cells_Per_Object_Normalized_To_Leaf_Grid 405.7282349 Average_Objects_PerLeaf_GridCell 0.002464704 Number_Of_SRIDs_Found 1 Width_Of_Cell_In_Level1 2.8125 Width_Of_Cell_In_Level2 0.043945313 Width_Of_Cell_In_Level3 0.000686646 Width_Of_Cell_In_Level4 1.07E-05 Height_Of_Cell_In_Level1 5.625 Height_Of_Cell_In_Level2 0.087890625 Height_Of_Cell_In_Level3 0.001373291 Height_Of_Cell_In_Level4 2.15E-05 Area_Of_Cell_In_Level1 1012.5 Area_Of_Cell_In_Level2 15.8203125 Area_Of_Cell_In_Level3 0.247192383 Area_Of_Cell_In_Level4 0.003862381 CellArea_To_BoundingBoxArea_Percentage_In_Level1 1.5625 CellArea_To_BoundingBoxArea_Percentage_In_Level2 0.024414063 CellArea_To_BoundingBoxArea_Percentage_In_Level3 0.00038147 CellArea_To_BoundingBoxArea_Percentage_In_Level4 5.96E-06 Number_Of_Rows_Selected_By_Primary_Filter 60956 Number_Of_Rows_Selected_By_Internal_Filter 0 Number_Of_Times_Secondary_Filter_Is_Called 60956 Number_Of_Rows_Output 2 Percentage_Of_Rows_NotSelected_By_Primary_Filter 71.66655821 Percentage_Of_Primary_Filter_Rows_Selected_By_Internal_Filter 0 Internal_Filter_Efficiency 0 Primary_Filter_Efficiency 0.003281055"Base_Table_Rows 215138"这个信息对我来说并没有太多意义,因为表格里只有10.7万行,而不是21.5万行。
当渲染数据集时,它看起来像这样:
(来源:norman.cx)
进一步调查:
我仍然对此数据的主要过滤器性能表现感到困惑。因此,我进行了一项测试,以确定我的数据如何分割。使用原始未分割的特征,我向表格中添加了一个“单元格”列。我随后运行了16个查询,以计算每个特征跨越的4x4网格中有多少个单元格。所以,我为每个单元格运行了如下查询:
declare @box1 geometry = geometry::STGeomFromText('POLYGON ((
-180 90,
-90 90,
-90 45,
-180 45,
-180 90))', 4326)
update ContA set cells = cells + 1 where
geom.STIntersects(@box1) = 1
如果我查看表格中的“单元格”列,整个数据集中只有672个特征与4x4网格内超过1个单元格相交。那么在查看一个仅覆盖200英里宽矩形的查询时,这个主要过滤器怎么可能会返回60,000个特征呢?
此时,看起来我可以编写自己的索引方案,使其比SQL Server为这些特征执行的方案更加有效。