PROPOSAL

Efficient Spatial Data Loading for Data Science

Supervisor: Eleni Tzirita Zacharatou
Semester: Fall 2023
Tags: spatial data analysis, data science, data loading, GIS file formats, geospatial data

Geospatial data refers to information that is tied to specific geographic locations on the Earth’s surface. It includes both the location coordinates (such as latitude, longitude, and, potentially, altitude) and attribute data associated with those locations. Geospatial data falls into two main categories: vector and raster.

Vector data represents geographic features as points, lines, and polygons. Points represent individual locations (e.g., cities), lines represent linear features (e.g., roads, rivers), and polygons represent areas (e.g., countries, land parcels). Vector data is often used for accurate representation of discrete features.
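
As an illustration of the three vector geometry types, the following minimal sketch builds one feature of each kind as plain Python dictionaries in the GeoJSON layout. The coordinates and property values are made up for the example, and the variable names are hypothetical.

```python
import json

# Hypothetical example features. GeoJSON stores positions as [longitude, latitude].
city = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [13.405, 52.52]},
    "properties": {"name": "example city"},
}
road = {
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        "coordinates": [[13.40, 52.52], [13.45, 52.50]],
    },
    "properties": {"kind": "road"},
}
parcel = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # One outer ring; the first and last positions are identical to close it.
        "coordinates": [[
            [13.40, 52.50], [13.41, 52.50], [13.41, 52.51],
            [13.40, 52.51], [13.40, 52.50],
        ]],
    },
    "properties": {"kind": "land parcel"},
}

collection = {"type": "FeatureCollection", "features": [city, road, parcel]}

# Serialize to GeoJSON text and list the geometry types it contains.
geojson_text = json.dumps(collection)
geometry_types = [f["geometry"]["type"] for f in collection["features"]]
print(geometry_types)
```

Each feature pairs a geometry with its attribute data, which mirrors the split between location coordinates and attributes described above.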

Raster data organizes information into a grid of cells or pixels, where each cell holds a value that corresponds to a particular attribute. It’s commonly used for continuous data like satellite imagery or elevation models. Each pixel represents a specific area on the Earth’s surface.
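
The grid structure can be sketched in a few lines of Python. The example below assumes a hypothetical 3 x 4 elevation grid with square cells, an assumed top-left corner, and an assumed cell size; real raster formats store this georeferencing information in their metadata.

```python
# Hypothetical 3 x 4 elevation grid (values in metres); row 0 is the northernmost row.
elevation = [
    [120, 125, 130, 128],
    [118, 122, 127, 126],
    [115, 119, 123, 124],
]

# Assumed georeferencing: coordinates of the top-left corner and the cell edge length.
X_MIN, Y_MAX = 10.0, 50.0
CELL_SIZE = 0.5


def cell_index(x, y):
    """Map a coordinate (x, y) to the (row, col) of the cell that contains it."""
    col = int((x - X_MIN) // CELL_SIZE)
    row = int((Y_MAX - y) // CELL_SIZE)
    if not (0 <= row < len(elevation) and 0 <= col < len(elevation[0])):
        raise ValueError("coordinate lies outside the raster extent")
    return row, col


def value_at(x, y):
    """Return the attribute value stored in the cell covering (x, y)."""
    row, col = cell_index(x, y)
    return elevation[row][col]


print(value_at(10.6, 49.9))  # cell (row 0, col 1) -> 125
```

Every coordinate inside a cell yields the same value, which is why the cell size determines the spatial resolution of the raster.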

Spatial data can be stored in a variety of file formats (see https://en.wikipedia.org/wiki/GIS_file_format). Shapefile and GeoJSON are popular formats for vector data, while GeoTIFF and JPEG 2000 are commonly used for raster data.

The project’s goal is to analyze and compare the methods available for loading data stored in these spatial file formats into a data science environment such as Python. To that end, we will measure their runtime performance and memory usage, and identify bottlenecks in the loading process.
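
One possible starting point for such measurements is sketched below, using only the Python standard library. It is a minimal sketch, not the project's prescribed methodology: the helper names (make_sample_file, load_geojson_stdlib, measure_runtime, measure_peak_memory) are hypothetical, the loader simply parses GeoJSON with the json module, and the sample file is synthetic. Note that tracemalloc only tracks memory allocated through Python's allocator, so memory used inside native libraries would need a different tool.

```python
import json
import os
import tempfile
import time
import tracemalloc


def make_sample_file(n_points):
    """Write a synthetic GeoJSON file with n_points point features; return its path."""
    features = [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [i * 0.001, i * 0.001]},
            "properties": {"id": i},
        }
        for i in range(n_points)
    ]
    fd, path = tempfile.mkstemp(suffix=".geojson")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"type": "FeatureCollection", "features": features}, f)
    return path


def load_geojson_stdlib(path):
    """Example loader: parse the whole file into Python dicts and lists."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def measure_runtime(loader, path):
    """Return (result, elapsed wall-clock seconds) for a single call to loader(path)."""
    start = time.perf_counter()
    result = loader(path)
    return result, time.perf_counter() - start


def measure_peak_memory(loader, path):
    """Return the peak number of bytes traced by tracemalloc during loader(path)."""
    tracemalloc.start()
    try:
        loader(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


sample_path = make_sample_file(1000)
try:
    # Time and memory are measured in separate runs because tracing slows execution.
    data, seconds = measure_runtime(load_geojson_stdlib, sample_path)
    peak_bytes = measure_peak_memory(load_geojson_stdlib, sample_path)
finally:
    os.remove(sample_path)

print(f"features: {len(data['features'])}, time: {seconds:.4f} s, peak: {peak_bytes} bytes")
```

Because both measurement helpers accept any callable that takes a path, loaders from other libraries could be swapped in and compared on the same files. A single run is noisy, so repeated runs and larger inputs would be needed for meaningful comparisons.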

The scope of the project can be adjusted for either a single student or a team of two to three students.