neaRThings A spatial doodle

3. The Data - Spatial, Part 2

[A Data Science Doodle]

Shapes Meet Words

This stage is about bringing together the suburbs attribute data of 1. The Data and the spatial data from 2. The Data - Spatial.

I imported the service_request_2011 data to the geopackage, the data types at this stage are all text and the relation looks thus

Requests in GeoPackage

  • The fields subcouncil and wards are largely unpopulated and are deleted.

  • modified table in DB Browser for SQLite to reflect correct data type,

Field Name Data Type
notification_id (big)int
notification_date date
req_type text
suburb* varchar80
description text
work_center text
x_coord double
y_coord double

Getting The Point

The service request each have an (x_coord, y_coord) pair. However the majority have not been assigned or most likely where assigned to the suburb centroid. So to check on the ‘stacked’ points I ran the query

SELECT
    x_coord,y_coord, COUNT(*) AS a
FROM
    service_requests_2011
GROUP BY
    x_coord,y_coord
HAVING 
    COUNT(*) > 1
ORDER BY a DESC

The query revealed:

  • Coordinates to be in metres (likely UTM).
  • 903 043 records without coordinates.
  • The highest repeated coordinate pairs being 11, 7, 5 and the rest 4 and below.
  • 3127 records had unique pairs.[ COUNT(*) = 1 in the query above]

Now, identifying invalid x_coord and y_coord i.e coordinates not in the Southern Hemisphere, East of the Greenwich Meridian.

SELECT
    x_coord, COUNT(*) AS a
FROM
    service_requests_2011
WHERE x_coord >0;

(Query was repeated for y coordinate.).

For all the invalid and ‘out-of-place’ coordinate pairs, NULL was assigned.

UPDATE service_requests_2011
SET x_coord = null
WHERE x_coord = 'Not assigned';

(Query was repeated for y coordinate.).

Get the point right in sql

Coming From ?

How Many Requests In This Area

As realised while exploring the service request data, most records did not have the (x, y) coordinate pair, yet consistently had suburb associated with them.

The next step was to assign a coordinate pair to each service request.

The following procedure was adopted.

  • Count the records in each suburb (excluding suburb = INVALID). This resulted in a new table service_req_2011_suburb_count.
SELECT suburb, count(suburb) AS records
    FROM service_requests_2011
        WHERE suburb != 'INVALID'
    GROUP BY suburb
    ORDER BY records DESC;

For each of the 4 reference ‘suburb’ layers …

SELECT *
    FROM census_2011_sal AS a
    INNER JOIN service_req_2011_suburb_count AS b
     ON a.suburb = b.suburb

This is essentially a spatial query. ‘Show me all the suburbs from the census_2011_sal layer whose names appear in the *service_request_2011 data.*’ Executed and loaded in QGIS this looks like…

Get the point right in sql

Saved the resultant layer as census_2011_sal_a.

Now the subsequent query must exclude from the service_req_2011_suburb_count all the suburbs already used by census_2011_sal (Why? Some suburbs names are repeated across data sets. This is an ANTI JOIN and one of its variations was adopted).

To get the remainder of the suburbs

SELECT a.suburb, a.records
    FROM service_req_2011_suburb_count AS a
        LEFT JOIN census_2011_sal_a AS b
          ON a.suburb = b.suburb
          WHERE b.suburb IS NULL;

The above query creates a new layer service_req_2011_suburb_count_a.

Now to get the matching suburbs from ofc_suburbs

SELECT *
    FROM ofc_suburbs AS a
    INNER JOIN service_req_2011_suburb_count_a AS b
     ON a.suburb = b.suburb

The results being a new layer, ofc_suburbs_a

Using the same procedure, I ended up with

  • census_2011_sal_a

  • ofc_suburbs_a

  • informal_areas_a

  • suburb_extra_a

which I combined to have one suburbs data layer suburbs with 775 records. This compound suburbs layer is a bit ‘dirty’ (overlapping areas) for some uses but for the current use case, perfect (each area is identifiable by name).

GeoCoding The Service Requests

From the initial investigation there were 3127 records which had a valid x,y coordinate pair. This set is about 0.3% of the entire set. I decided to ignore these and bunch them together with the rest of the unassigned records. Of the 906 500 records of service request, 96 688 have status of INVALID for the suburb, leaving 809 812 to be geocoded.

To geocode the service request, I chose the approach to have point for each service request and these appropriately dotted within a polygon.

A Scatter of Points

Step 1 was to generate random points within each suburb corresponding to the number of requests in it. This is accessible in QGIS.

Get the point right in sql

I call this “points scattering.”

The next step was to assign suburbs to these points i.e. join attributes by location.

Join attributes by Location

Exported the data to geopackage as service_request_points_a with 809 812 points.

So I could do some advanced sql queries on the data. I loaded the data to PostGIS with EPSG 4326.

I then ended up with a set of points, 809 812, which were supposed to have appropriate notification_ids associated with them.

Geography to Requests

Getting the generate points to have notification ids proved to be a challenge. (Revealing a knowledge gap in my skillset). I resorted to tweeter without success.

I decided to ‘brute-force’ the stage for progress’ sake with the intention of reviewing this at a later stage.

So I sorted the service_request_2011 data by suburb, then exported the result (notification_id, suburb) to csv.

SELECT notification_id, suburb from service_requests_2011
WHERE suburb != 'INVALID'
ORDER BY suburb ASC;

The sorting procedure was repeated for notifification_points_no_ids (the random spatial points data created in A Scatter of Points .), exporting (fid, suburb) to csv.

In a spreadsheet program (LibreOpenCalc), I opened the two csv files, pasted the fields side by side (the suburbs aligned well since they were sorted in alphabetical order). The resultant table, with fields - notificaion_id and fid, was exported as csv, reimported into the database as a table.

To get the geocoded service_requests_2011

-- Assign ids of spatial points to service requests --- 
CREATE TABLE requests_2011_with_ids AS
SELECT a.*, b.fid
 FROM requests_2011 as a
 JOIN requests_notif_id_points_ids as b
 ON a.notification_id = b.notification_id;

--- Attach service requests to the points ---
CREATE TABLE requests_2011_geocoded AS
SELECT b.*, a.geom
 FROM requests_points as a
 JOIN requests_2011_with_ids as b
 ON a.fid = b.fid;

Next up…some eye candy! With a purpose.

#Postscript

  • Here’s the GeoPackage (Caution! 382MB file) with all the data up to this stage.

  • Using each of the ‘suburbs’ layers used, I created centroids - this can be used later for labelling a cluster of points to get context.

  • While generating points in polygon based on count, the QGIS tool bombed out at a stage. I iteratively ran the tool (taking groups of suburb polygons at a time) to isolate the problem polygon - JOE SLOVO (LANGA).I then manually edited the polygon, then generated the points for it. This error made me question the validity checking procedure I had performed on the data. Add to that, doing an intersect using data in a geographic coordinate system led to unexpected results - Q warns on this beforehand. One scenario was having more than the specified number of random points generated per polygon, e.g. 18 instead of a specified 6. (With the algorithm reading the value from attribute of the data.)

  • Running the spatial queries in the GUI (QGIS) took considerable time due to the sheer number of records. Intersecting the points and the polygons can be improved, taking a tip by Paul Ramsey from here for speed, with the data in PostGIS.

-- CHANGE STORAGE TYPE --
ALTER TABLE suburbs
 ALTER COLUMN geom
 SET STORAGE EXTERNAL;

--Force COLUMN TO REWRITE--
UPDATE suburbs
 SET geom = ST_SetSRID(geom, 4326)

--Create a new table of points with the correct suburbs--
CREATE TABLE points_with_suburbs AS
SELECT *
 FROM suburbs a
 JOIN points_in_polygons b
 ON ST_Intersects(a.geom, b.geom)