This repository contains a collection of Python scripts created for processing, cleaning, and cross-referencing geospatial and demographic data from IBGE (Brazilian Institute of Geography and Statistics), SISMIGRA, and CRAI. The scripts were developed with a focus on automation, generating geographic meshes (GeoPackage files), and heat maps for the state of São Paulo and all of Brazil.
The project's data engineering relies on the GeoPandas library as the main engine for spatial operations, alongside the following methodologies:
- Official Data Extraction (APIs): Direct connections to IBGE V2 and V3 APIs to fetch territorial meshes (Shapefiles converted to GeoJSONs) of municipalities, states, and aggregated census data.
- Spatial and Tabular Joins (Inner Joins): Cross-referencing government databases (SISMIGRA, CRAI Databases) with maps using normalized municipality and district names.
- Graph Theory: Use of the NetworkX library with the Minimum Spanning Tree algorithm to link isolated train and subway stations in São Paulo with the smallest total connection length, from which the railway paths are drawn.
- Cleaning and Standardization: Cleaning algorithms to remove accents (Regex) and adjust text formatting using a base data dictionary, minimizing data loss when joining with the official IBGE database.
- Demographics by Categories: Cross-filters applied using the Pandas library to classify gender densities (Demographic Proportion) and isolate specific nationalities.
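The cleaning and join steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the `normalize_name` helper, the sample rows, and the use of `unicodedata` alongside a regular expression are assumptions made for the example.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Return an accent-free, upper-case, single-spaced version of a place name."""
    # Decompose accented characters (e.g. "ã" becomes "a" plus a combining tilde).
    decomposed = unicodedata.normalize("NFKD", name)
    # Remove the combining marks left behind by the decomposition.
    without_accents = re.sub(r"[\u0300-\u036f]", "", decomposed)
    # Collapse repeated whitespace, trim the ends, and unify the case.
    return re.sub(r"\s+", " ", without_accents).strip().upper()


# Hypothetical official table: normalized municipality name -> IBGE code.
official = {
    normalize_name("São Paulo"): "3550308",
    normalize_name("Campinas"): "3509502",
}

# Hypothetical source rows with inconsistent spelling: (name, record count).
records = [("SAO  PAULO", 120), (" campinas", 7), ("Cidade Inexistente", 3)]

# Inner join: keep only the rows whose normalized name exists in the official table.
joined = [
    (official[normalize_name(name)], count)
    for name, count in records
    if normalize_name(name) in official
]
```

Rows that still fail to match after normalization are dropped by the inner join, so listing them before the join is a simple way to measure how much data is lost.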
To run the scripts, it is recommended to install the following libraries:
pip install geopandas pandas requests networkx shapely

Tip: It is also possible to run these scripts natively using the Python environment attached to QGIS.
process_censo.py: Connects to the IBGE API, downloads the territorial limits (mesh) of São Paulo, and groups the total populations from the Demographic Census.
process_sismigra.py: The first basic SISMIGRA join, uniting municipality data with the official São Paulo mesh.
process_sismigra_historico.py: Reads demographic data, counts records by city (filling zero-count municipalities with -1), and generates a layer of absolute data.
process_sismigra_predominancia.py: Performs socio-demographic reading of the file to calculate proportions (Male Majority, Female Majority, or Balanced).
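The counting and predominance steps of the SISMIGRA scripts might look roughly like the sketch below. The column names, the sample rows, the 5-point tolerance, and the "No data" label are all assumptions for illustration; the actual scripts may use different thresholds and fields.

```python
import pandas as pd

# Hypothetical records: one row per registered person.
records = pd.DataFrame(
    {
        "municipality": ["SAO PAULO", "SAO PAULO", "SAO PAULO", "CAMPINAS", "CAMPINAS"],
        "sex": ["M", "M", "F", "F", "M"],
    }
)

# Hypothetical list of every municipality present in the mesh.
mesh_names = ["SAO PAULO", "CAMPINAS", "SANTOS"]

# Count records per municipality; municipalities without records receive -1.
counts = records.groupby("municipality").size().reindex(mesh_names, fill_value=-1)


def classify(male: int, female: int, tolerance: float = 0.05) -> str:
    """Label a municipality by its male share, with an assumed tolerance band."""
    total = male + female
    if total == 0:
        return "No data"  # assumed label, not taken from the repository
    male_share = male / total
    if male_share > 0.5 + tolerance:
        return "Male Majority"
    if male_share < 0.5 - tolerance:
        return "Female Majority"
    return "Balanced"


# Cross-tabulate municipality against sex, then classify each row.
by_sex = pd.crosstab(records["municipality"], records["sex"])
predominance = {
    city: classify(int(row.get("M", 0)), int(row.get("F", 0)))
    for city, row in by_sex.iterrows()
}
```

Using -1 instead of 0 for municipalities without records keeps them distinguishable when the layer is styled, for example by giving them a neutral color in a heat map.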
process_crai.py: Reads data from the CRAI Database, uses the Data Dictionary to standardize columns, and cross-references it with the official map of districts in the capital.
filter_bolivia.py: Filters the general CRAI layer, retaining exclusively data corresponding to Bolivian immigrants.
process_meis_nacional.py: Queries the IBGE National API, processes the extraction of state acronyms (UFs), and maps entrepreneurs across the entire Brazilian territory.
draw_rail_lines.py: Loads Shapefiles of isolated stations and draws the official connection line for Trains and Subways via the Minimum Spanning Tree algorithm.
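As a rough sketch of the Minimum Spanning Tree step in draw_rail_lines.py, the example below connects a few stations using NetworkX. The station names and coordinates are hypothetical, and the coordinates are assumed to be in a projected CRS (metres) so that Euclidean distance is meaningful.

```python
import math
from itertools import combinations

import networkx as nx

# Hypothetical station coordinates in a projected CRS (metres).
stations = {
    "A": (0.0, 0.0),
    "B": (1000.0, 0.0),
    "C": (2000.0, 0.0),
    "D": (1000.0, 1500.0),
}

# Build a complete graph whose edge weights are straight-line distances.
graph = nx.Graph()
for (name_a, point_a), (name_b, point_b) in combinations(stations.items(), 2):
    graph.add_edge(name_a, name_b, weight=math.dist(point_a, point_b))

# The minimum spanning tree links every station with the smallest total length.
tree = nx.minimum_spanning_tree(graph, weight="weight")

# Each tree edge is one segment to draw between two stations.
segments = sorted(tuple(sorted(edge)) for edge in tree.edges())
total_length = sum(data["weight"] for _, _, data in tree.edges(data=True))
```

Each pair in `segments` can then be turned into a line geometry from the two station points. A spanning tree minimizes total length only, so the result approximates the network and may differ from the real track layout wherever lines branch or cross.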
extract_csv.py: Cleans the heavy original CSV tables, extracting only the columns used in processing and adding blank checking columns.
test_agg.py: Support script used to test Pandas aggregation functions before inserting them into production pipelines.
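The column extraction performed by extract_csv.py could be approximated as shown below. The separator, the column names, and the names of the blank checking columns are assumptions for the example, and an in-memory string stands in for the original CSV file.

```python
import io

import pandas as pd

# Hypothetical raw table; a real run would pass a file path instead.
raw_csv = io.StringIO(
    "municipio;pais;sexo;observacao\n"
    "São Paulo;Bolívia;F;x\n"
    "Campinas;Haiti;M;y\n"
)

# Load only the columns needed downstream, which keeps memory use low.
keep = ["municipio", "pais"]
slim = pd.read_csv(raw_csv, sep=";", usecols=keep, dtype=str)

# Add empty columns to be filled in during manual checking (assumed names).
for column in ["conferido", "observacao_revisao"]:
    slim[column] = ""
```

Passing `usecols` at read time avoids loading the unused columns at all, which matters more for large tables than dropping them afterwards.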