Week 4 - Educational Attainment of Young People in English Towns
(ns year-2024.week-4.analysis
  (:require
   [aerial.hanami.common :as hc]
   [aerial.hanami.templates :as ht]
   [clojure.string :as str]
   [scicloj.kindly.v4.kind :as kind]
   [scicloj.noj.v1.vis.hanami :as hanami]
   [tablecloth.api :as tc]))
This week’s data is about English students. We’ll try to answer the question posed by the example article: why do children and young people in smaller towns do better academically than those in larger towns?
Plotting the data as a scatterplot
The first graph in the article is a kind of scatterplot segmented by the size of the town. We’ll load the data first:
(def english-education
  (-> "data/year_2024/week_4/english_education.csv"
      (tc/dataset {:key-fn keyword})))
english-education
data/year_2024/week_4/english_education.csv [1104 31]:
:town11cd | :town11nm | :population_2011 | :size_flag | :rgn11nm | :coastal | :coastal_detailed | :ttwa11cd | :ttwa11nm | :ttwa_classification | :job_density_flag | :income_flag | :university_flag | :level4qual_residents35-64_2011 | :ks4_2012-2013_counts | :key_stage_2_attainment_school_year_2007_to_2008 | :key_stage_4_attainment_school_year_2012_to_2013 | :level_2_at_age_18 | :level_3_at_age_18 | :activity_at_age_19:_full-time_higher_education | :activity_at_age_19:_sustained_further_education | :activity_at_age_19:_appprenticeships | :activity_at_age_19:employment_with_earnings_above£0 | :activity_at_age_19:employment_with_earnings_above£10,000 | :activity_at_age_19:_out-of-work | :highest_level_qualification_achieved_by_age_22:_less_than_level_1 | :highest_level_qualification_achieved_by_age_22:_level_1_to_level_2 | :highest_level_qualification_achieved_by_age_22:_level_3_to_level_5 | :highest_level_qualification_achieved_by_age_22:_level_6_or_above | :highest_level_qualification_achieved_b_age_22:_average_score | :education_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
E34000007 | Carlton in Lindrick BUA | 5456.0 | Small Towns | East Midlands | Non-coastal | Smaller non-coastal town | E30000291 | Worksop and Retford | Majority town and city (small) | Residential | Higher deprivation towns | No university | Low | 65.0 | 65.00000000 | 70.76923077 | 72.30769231 | 50.76923077 | 30.76923076923077 | 21.53846153846154 | * | 52.307692307692314 | 36.92307692307693 | * | * | 34.9 | 39.7 | * | 3.32307692 | -0.53375042 |
E34000016 | Dorchester (West Dorset) BUA | 19060.0 | Small Towns | South West | Non-coastal | Smaller non-coastal town | E30000046 | Dorchester and Weymouth | Majority town and city (small) | Working | Mid deprivation towns | No university | Medium | 239.0 | 69.05829596 | 71.12970711 | 85.71428571 | 60.08403361 | 41.84100418410041 | 13.389121338912133 | 10.0418410041841 | 51.04602510460251 | 24.686192468619247 | 4.184100418410042 | * | 21.7 | 44.6 | 33.3 | 3.73221757 | 1.95201875 |
E34000020 | Ely BUA | 19090.0 | Small Towns | East of England | Non-coastal | Smaller non-coastal town | E30000186 | Cambridge | Majority town and city (Large Towns) | Working | Lower deprivation towns | No university | Medium | 155.0 | 71.42857143 | 56.12903226 | 83.87096774 | 45.80645161 | 35.483870967741936 | 10.967741935483872 | 7.741935483870968 | 57.41935483870968 | 27.741935483870968 | * | * | 34.4 | 31.2 | 32.5 | 3.54838710 | -1.04412799 |
E34000026 | Market Weighton BUA | 6429.0 | Small Towns | Yorkshire and The Humber | Non-coastal | Smaller non-coastal town | E30000220 | Hull | Majority town and city (Large Towns) | Residential | Lower deprivation towns | No university | Medium | 58.0 | 70.96774194 | 53.44827586 | 91.22807018 | 49.12280702 | 25.862068965517242 | 25.862068965517242 | * | 58.620689655172406 | 31.03448275862069 | * | * | * | 66.1 | * | 3.48275862 | -1.24926195 |
E34000027 | Downham Market BUA | 10884.0 | Small Towns | East of England | Non-coastal | Smaller non-coastal town | E30000225 | King’s Lynn | Majority rural | Mixed | Higher deprivation towns | No university | Low | 93.0 | 78.04878049 | 62.36559140 | 78.49462366 | 40.86021505 | 26.881720430107524 | 20.43010752688172 | 15.053763440860216 | 55.91397849462365 | 30.107526881720432 | 15.053763440860216 | * | 32.6 | 44.2 | * | 3.16129032 | -1.16907845 |
E34000039 | Penrith BUA | 15181.0 | Small Towns | North West | Non-coastal | Smaller non-coastal town | E30000106 | Penrith | Majority rural | Working | Lower deprivation towns | No university | Low | 158.0 | 73.85620915 | 74.05063291 | 91.13924051 | 46.20253165 | 31.645569620253166 | 16.455696202531644 | 12.658227848101266 | 62.0253164556962 | 37.34177215189873 | 7.59493670886076 | * | 29.4 | 40.0 | 28.1 | 3.47468354 | 0.84513109 |
E34000048 | Bolsover BUA | 11754.0 | Small Towns | East Midlands | Non-coastal | Smaller non-coastal town | E30000190 | Chesterfield | Majority town and city (Large Towns) | Residential | Higher deprivation towns | No university | Low | 118.0 | 70.90909091 | 50.00000000 | 76.27118644 | 37.28813559 | 21.1864406779661 | 27.11864406779661 | 20.33898305084746 | 50.847457627118644 | 26.27118644067797 | 11.864406779661017 | * | 34.5 | 47.9 | * | 3.10169492 | -3.71791969 |
E34000055 | March BUA | 21051.0 | Medium Towns | East of England | Non-coastal | Large non-coastal town | E30000287 | Wisbech | Majority town and city (small) | Residential | Higher deprivation towns | No university | Low | 235.0 | 71.06382979 | 57.44680851 | 84.25531915 | 40.42553191 | 25.53191489361702 | 17.4468085106383 | 10.212765957446807 | 47.65957446808511 | 26.80851063829787 | 6.808510638297872 | * | 37.8 | 37.8 | 23.6 | 3.28510638 | -2.17130716 |
E34000056 | Southam BUA | 6567.0 | Small Towns | West Midlands | Non-coastal | Smaller non-coastal town | E30000228 | Leamington Spa | Majority town and city (small) | Working | Lower deprivation towns | No university | Medium | 92.0 | 81.92771084 | 91.30434783 | 90.21739130 | 67.39130435 | 38.04347826086957 | 21.73913043478261 | 16.304347826086957 | 52.17391304347826 | 25.0 | * | * | 19.6 | 47.8 | 32.6 | 3.84782609 | 6.43766140 |
E34000067 | Royston BUA | 15781.0 | Small Towns | East of England | Non-coastal | Smaller non-coastal town | E30000186 | Cambridge | Majority town and city (Large Towns) | Working | Lower deprivation towns | No university | Medium | 169.0 | 73.03370787 | 65.08875740 | 82.24852071 | 53.84615385 | 23.668639053254438 | 21.893491124260358 | 14.201183431952662 | 57.98816568047337 | 39.053254437869825 | * | * | 23.7 | 50.9 | 23.7 | 3.38461538 | 0.29540337 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
E35001501 | Reading BUASD | 218705.0 | Large Towns | South East | Non-coastal | Large non-coastal town | E30000256 | Reading | Majority town and city (Large Towns) | Working | Mid deprivation towns | University | Medium | 2243.0 | 73.81263617 | 62.50557289 | 82.65715560 | 52.07311636 | 37.672759696834596 | 16.272848863129738 | 9.80829246544806 | 48.14979937583593 | 24.83281319661168 | 5.21622826571556 | 2.4 | 25.7 | 42.5 | 29.5 | 3.53811859 | 0.39981293 |
E35001502 | Hedge End BUASD | 25117.0 | Medium Towns | South East | Non-coastal | Large non-coastal town | E30000267 | Southampton | Majority town and city (Large Towns) | Mixed | Lower deprivation towns | No university | Medium | 307.0 | 77.63578275 | 70.68403909 | 84.03908795 | 58.30618893 | 34.20195439739413 | 19.218241042345277 | 14.65798045602606 | 60.91205211726385 | 35.50488599348534 | 5.537459283387622 | * | 18.1 | 51.5 | 29.1 | 3.63517915 | 2.48698423 |
E35001503 | Westergate BUASD | 9783.0 | Small Towns | South East | Non-coastal | Smaller non-coastal town | E30000191 | Chichester and Bognor Regis | Majority town and city (small) | Mixed | Lower deprivation towns | No university | Medium | 108.0 | 80.00000000 | 62.96296296 | 88.88888889 | 56.48148148 | 27.77777777777778 | 14.814814814814813 | * | 54.629629629629626 | 25.0 | * | * | 22.9 | 48.6 | 27.5 | 3.53703704 | 1.56425938 |
E35001507 | Loughton BUASD | 31106.0 | Medium Towns | East of England | Non-coastal | Large non-coastal town | E30000234 | London | Majority conurbation | Working | Mid deprivation towns | No university | Medium | 357.0 | 75.55555556 | 70.30812325 | 84.83146067 | 49.71910112 | 35.01400560224089 | 16.5266106442577 | 10.92436974789916 | 42.857142857142854 | 28.57142857142857 | 4.481792717086835 | * | 22.9 | 46.4 | 29.3 | 3.57422969 | 1.26655373 |
E35001516 | Milton Keynes BUASD | 171750.0 | Large Towns | South East | Non-coastal | Large non-coastal town | E30000243 | Milton Keynes | Majority town and city (Large Towns) | Working | Mid deprivation towns | University | Medium | 2106.0 | 70.43314501 | 62.67806268 | 85.99240266 | 52.08926876 | 42.26020892687559 | 13.437796771130103 | 6.837606837606838 | 43.49477682811016 | 21.225071225071225 | 6.172839506172839 | 1.9 | 26.4 | 38.9 | 32.8 | 3.65337132 | 0.34182404 |
E35001517 | Redditch BUASD | 81919.0 | Large Towns | West Midlands | Non-coastal | Large non-coastal town | E30000169 | Birmingham | Majority conurbation | Working | Higher deprivation towns | No university | Low | 1006.0 | 67.58349705 | 68.09145129 | 87.17693837 | 48.40954274 | 32.30616302186879 | 18.78727634194831 | 9.642147117296222 | 50.0 | 22.962226640159045 | 8.846918489065606 | 1.7 | 29.7 | 41.3 | 27.3 | 3.47117296 | -0.29414288 |
K06000004 | Chester BUASD | 86011.0 | Large Towns | North West | Non-coastal | Large non-coastal town | K01000011 | Chester | Majority town and city (Large Towns) | Working | Higher deprivation towns | University | Medium | 739.0 | 70.72192513 | 61.02841678 | 80.62330623 | 46.20596206 | 32.47631935047362 | 19.485791610284167 | 9.066305818673884 | 45.33152909336942 | 18.26792963464141 | 7.983761840324763 | 2.0 | 27.3 | 43.0 | 27.7 | 3.47496617 | -0.81093526 |
 | Inner London BUAs |  | Inner London BUA | London |  |  |  |  |  |  |  |  |  | 24630.0 | 69.91000918 | 63.21965083 | 84.88381540 | 53.47335067 | 49.26918392204628 | 14.023548518067397 | 5.952090946000812 | 30.07308160779537 | 10.117742590336988 | 7.182298010556232 | 1.9 | 19.0 | 44.0 | 35.1 | 3.78457166 | 0.06759137 |
 | Outer london BUAs |  | Outer london BUA | London |  |  |  |  |  |  |  |  |  | 52998.0 | 74.51446320 | 66.58364467 | 86.60700800 | 57.05331521 | 48.47352730291709 | 14.249594324314124 | 7.462545756443639 | 36.38816559115438 | 15.551530246424395 | 5.520963055209631 | 1.7 | 18.8 | 42.5 | 37.0 | 3.86403261 | 1.26241027 |
 | Other Small BUAs |  | Other Small BUAs |  |  |  |  |  |  |  |  |  |  | 59743.0 | 77.35164037 | 65.12562141 | 86.39697264 | 54.48745856 | 37.30144117302445 | 18.96623872252816 | 12.269219824916727 | 50.26530304805584 | 24.33757929799307 | 5.4985521316304835 | 1.7 | 23.3 | 44.9 | 30.1 | 3.65651206 | 1.22187679 |
 | Not BUA |  | Not BUA |  |  |  |  |  |  |  |  |  |  | 21765.0 | 78.23469971 | 67.50746612 | 87.83783784 | 57.17503218 | 38.98460831610384 | 18.111647139903514 | 11.504709395818976 | 48.86744773719274 | 23.06914771422008 | 4.741557546519641 | 1.5 | 21.2 | 45.1 | 32.2 | 3.70797151 | 1.80208942 |
We can plot the education scores sorted by town size like in the article to start:
(hanami/plot english-education
             ht/point-chart
             {:X :education_score
              :Y :size_flag
              :YTYPE "nominal"})
The points are all bunched up because every town in a size category shares the same y position. We can add jitter to this chart to spread the points out along the y axis:
(hanami/plot english-education
             ;; This is not ideal.. one of my many projects this year is to think up ways
             ;; to improve Clojure's dataviz wrappers
             (-> ht/point-chart
                 (assoc :encoding (assoc (hc/get-default :ENCODING)
                                         :yOffset :YOFFSET)))
             {:X :education_score
              :Y :size_flag
              :HEIGHT {:step 60}
              :TRANSFORM [{:calculate
                           ;; Generate Gaussian jitter with a Box-Muller transform, we could
                           ;; also use `random()`, but that jitters the points uniformly and
                           ;; looks worse
                           "sqrt(-2*log(random()))*cos(2*PI*random())"
                           :as "jitter"}]
              :YTYPE "nominal"
              :YOFFSET {:field "jitter" :type "quantitative"}})
This is slightly better, but it’s already obvious that we’re never going to be able to replicate the graph in the article this way – there’s no way to specify anything precise about how to offset the points along the y axis with the jittering that vega-lite supports. To proceed and try to replicate the graph from the article, we can make a beeswarm plot. For this we’ll have to use vega itself (as opposed to vega-lite). Vega-lite is awesome, but it’s an intentionally less complete (but as a tradeoff simpler) layer on top of vega. Every once in a while I come across a type of graph that is not well supported by vega-lite and drop into vega. Fortunately there are lots of solutions for rendering vega plots in Clojure notebooks (like Clay, which I’m using to build this book).
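As an aside on the jitter expression used above, here is a minimal sketch of the same Box-Muller idea in plain Clojure. The names `gaussian-jitter`, `sample-stats`, and `jitter-samples` are made up for this illustration, and I'm using a seeded `java.util.Random` only so the result is repeatable. The point is just to show that two uniform random numbers combine into one roughly standard-normal value.

```clojure
(import 'java.util.Random)

(defn gaussian-jitter
  "One approximately standard-normal sample via the Box-Muller transform,
  mirroring the vega expression sqrt(-2*log(random()))*cos(2*PI*random())."
  [^Random rng]
  (let [u1 (- 1.0 (.nextDouble rng)) ; shift into (0, 1] so we never take log(0)
        u2 (.nextDouble rng)]
    (* (Math/sqrt (* -2.0 (Math/log u1)))
       (Math/cos (* 2.0 Math/PI u2)))))

(defn sample-stats
  "Mean and (population) standard deviation of a sequence of numbers."
  [xs]
  (let [n        (count xs)
        mean     (/ (reduce + xs) n)
        variance (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs)) n)]
    {:mean mean :sd (Math/sqrt variance)}))

;; hypothetical sample: 10,000 jitter values from a seeded generator
(def jitter-samples
  (let [rng (Random. 42)]
    (vec (repeatedly 10000 #(gaussian-jitter rng)))))

(sample-stats jitter-samples)
```

The mean should land close to 0 and the standard deviation close to 1, which is why most points stay near the middle of their row with a few stragglers further out.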
Vega spec for a beeswarm graph
Getting a tablecloth dataset into vega
The first step is getting our data to render in a vega spec through Clojure. For this we can use Daniel Slutsky’s amazing library kindly. It tells our namespace in a notebook-agnostic way how to render different things. If we pass a valid vega spec to kind/vega, our notebook will render it properly as a graph. So cool.
So first, to just get any data rendering in our graph, we’ll read (or in Clojure, slurp) our data from the relevant file. We have to specify the format since csv is not the default. For now we’ll just make a simple scatterplot to get the data on the page.
(kind/vega
 {;; every dataset in vega needs to be named, this is how we reference the
  ;; data in the rest of the spec
  :data [{:name "english-education"
          :values (slurp "data/year_2024/week_4/english_education.csv")
          :format {:type "csv"}}]
  :width 500
  :height 500
  ;; the minimal vega spec includes scales, marks, and axes
  ;; scales map data values to visual values, defining the nature of visual encodings
  :scales [{:name "yscale"
            :type "band"
            :domain {:data "english-education" :field "size_flag"}
            :range "height"}
           {:name "xscale"
            :domain [-12 12]
            :range "width"}]
  ;; axes label and provide context for the scales
  :axes [{:orient "bottom" :scale "xscale"}
         {:orient "left" :scale "yscale"}]
  ;; marks map visual values to shapes on the screen
  :marks [{:type "symbol"
           :from {:data "english-education"}
           :encode {:enter {:y {:field "size_flag"
                                :scale "yscale"}
                            :x {:field "education_score"
                                :scale "xscale"}}}}]})
This is the minimal reproduction of the vega-lite scatterplot we made above. This is where we’ll dive deeper into vega to do some things vega-lite can’t do. In order to turn this into a beeswarm plot, we can add what vega calls forces.
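As an aside on loading the data: vega datasets can also take inline values instead of CSV text. Here is a small sketch, assuming we already have the rows as plain Clojure maps (with tablecloth I believe something like `(tc/rows ds :as-maps)` would produce them, but treat that as an assumption). The helper name `rows->vega-data` and the sample rows are made up for illustration; the two scores are copied from the table printed earlier.

```clojure
(defn rows->vega-data
  "Wrap a sequence of row maps as a named vega dataset, keeping only `fields`
  so the spec doesn't carry columns the chart never uses."
  [dataset-name fields rows]
  {:name   dataset-name
   :values (mapv #(select-keys % fields) rows)})

;; hypothetical sample rows, values taken from the dataset preview
(def vega-sample-rows
  [{:town11nm "Ely BUA" :size_flag "Small Towns" :education_score -1.04412799}
   {:town11nm "Reading BUASD" :size_flag "Large Towns" :education_score 0.39981293}])

(rows->vega-data "english-education"
                 [:size_flag :education_score]
                 vega-sample-rows)
```

The result can be dropped into the `:data` vector of a spec in place of the `:values`/`:format` pair used in this section. Whether that's nicer than slurping the CSV is mostly a matter of taste.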
Turning a scatterplot into a beeswarm graph in vega
Force transforms compute force-directed layouts, so we can use one to compute the placement of each point in our graph such that they cluster together based on their y value but don’t overlap. To compute a beeswarm layout we’ll tell vega to pull each point toward its target position (its education score on the x axis and the middle of its town-size band on the y axis) while preventing it from colliding with its neighbours.
(kind/vega
 {:data [{:name "english-education"
          :values (slurp "data/year_2024/week_4/english_education.csv")
          :format {:type "csv"}}]
  :width 800
  :height 800
  :scales [{:name "yscale"
            :type "band"
            :domain {:data "english-education" :field "size_flag"}
            :range "height"}
           {:name "xscale"
            :domain [-12 12]
            :range "width"}]
  :axes [{:orient "bottom" :scale "xscale"}
         {:orient "left" :scale "yscale"}]
  :marks [{:type "symbol"
           :from {:data "english-education"}
           :encode {:enter {;; we update the mark encoding to use the x and y values
                            ;; computed by the force transform
                            :yfocus {:field "size_flag"
                                     :scale "yscale"
                                     :band 0.5}
                            :xfocus {:field "education_score"
                                     :scale "xscale"
                                     :band 0.5}}}
           ;; this is the new part -- simulated forces pulling each point toward
           ;; its focus position while keeping the points from overlapping
           :transform [{:type "force"
                        :static true
                        :forces [{:force "x" :x "xfocus"}
                                 {:force "y" :y "yfocus"}
                                 {:force "collide" :radius 4}]}]}]})
This is pretty close to the graph we’re going for! We can add some more aesthetic details to make it look neater, like changing the definition of the points and adding a grid along the x axis:
(kind/vega
 {:data [{:name "english-education"
          :values (slurp "data/year_2024/week_4/english_education.csv")
          :format {:type "csv"}}]
  :width 800
  :height 800
  :scales [{:name "yscale"
            :type "band"
            :domain {:data "english-education" :field "size_flag"}
            :range "height"}
           {:name "xscale"
            :domain [-12 12]
            :range "width"}]
  :axes [{:orient "bottom" :scale "xscale" :grid true}
         {:orient "left" :scale "yscale"}]
  :marks [{:type "symbol"
           :from {:data "english-education"}
           :encode {:enter {:yfocus {:field "size_flag"
                                     :scale "yscale"
                                     :band 0.5}
                            :xfocus {:field "education_score"
                                     :scale "xscale"
                                     :band 0.5}
                            :fill {:value "skyblue"}
                            :stroke {:value "white"}
                            :strokeWidth {:value 1}
                            :zindex {:value 0}}}
           :transform [{:type "force"
                        :static true
                        :forces [{:force "x" :x "xfocus"}
                                 {:force "y" :y "yfocus"}
                                 {:force "collide" :radius 4}]}]}]})
Woohoo. Making progress. Next we can add the average lines. We’ll transform the data we already have into a new grouped dataset with the average computed per size_flag, and add another mark layer on top to show these lines:
(kind/vega
 {:data [{:name "english-education"
          :values (slurp "data/year_2024/week_4/english_education.csv")
          :format {:type "csv"}}
         ;; this is the new dataset, computed from the existing one
         {:name "averages"
          :source "english-education"
          :transform [{:type "aggregate",
                       :fields ["education_score"],
                       :groupby ["size_flag"],
                       :ops ["average"],
                       :as ["avg"]}]}]
  :width 800
  :height 800
  :scales [{:name "yscale"
            :type "band"
            :domain {:data "english-education" :field "size_flag"}
            :range "height"}
           {:name "xscale"
            :domain [-12 12]
            :range "width"}]
  :axes [{:orient "bottom" :scale "xscale" :grid true :title "Education score"}
         {:orient "left" :scale "yscale" :title "Town size"}]
  :marks [{:type "symbol"
           :from {:data "english-education"}
           :encode {:enter {:yfocus {:field "size_flag"
                                     :scale "yscale"
                                     :band 0.5}
                            :xfocus {:field "education_score"
                                     :scale "xscale"
                                     :band 0.5}
                            :fill {:value "skyblue"}
                            :stroke {:value "white"}
                            :strokeWidth {:value 1}
                            :zindex {:value 0}}}
           :transform [{:type "force"
                        :static true
                        :forces [{:force "x" :x "xfocus"}
                                 {:force "y" :y "yfocus"}
                                 {:force "collide" :radius 4}]}]}
          {;; this is the new layer with the lines -- it's second so that the lines show up on top
           :type "symbol"
           ;; we tell it to use the data from our new, computed dataset
           :from {:data "averages"}
           :encode {:enter {:x {:field "avg" :scale "xscale"}
                            :y {:field "size_flag" :scale "yscale" :band 0.5}
                            :shape {:value "stroke"}
                            :strokeWidth {:value 1.5}
                            :stroke {:value "black"}
                            :size {:value 14000}
                            :angle {:value 90}}}}]})
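For intuition, the aggregate transform above does roughly what the following plain-Clojure sketch does: group the rows by one field and average another within each group. The name `average-by` and the sample rows are hypothetical, and vega of course runs its own implementation in the browser rather than this code.

```clojure
(defn average-by
  "Map from each distinct value of `group-key` to the mean of `value-key`
  over the rows in that group."
  [group-key value-key rows]
  (->> rows
       (group-by group-key)
       (map (fn [[k group]]
              [k (/ (reduce + (map value-key group)) (count group))]))
       (into {})))

;; made-up scores, just to show the shape of the result
(average-by :size_flag :education_score
            [{:size_flag "Small Towns" :education_score 1.0}
             {:size_flag "Small Towns" :education_score 3.0}
             {:size_flag "Large Towns" :education_score -2.0}])
```

Each entry in the result corresponds to one row of the "averages" dataset, which is why the second mark layer draws exactly one line per town size.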
We can add the town selection dropdown like in the example graph using vega signals.
(let [town-name-values (-> english-education
                           (tc/select-columns [:town11nm])
                           tc/rows
                           flatten
                           sort
                           (conj ""))]
  (kind/vega
   {:data [{:name "english-education"
            :values (slurp "data/year_2024/week_4/english_education.csv")
            :format {:type "csv"}}
           {:name "averages"
            :source "english-education"
            :transform [{:type "aggregate",
                         :fields ["education_score"],
                         :groupby ["size_flag"],
                         :ops ["average"],
                         :as ["avg"]}]}]
    :width 800
    :height 800
    :scales [{:name "yscale"
              :type "band"
              :domain {:data "english-education" :field "size_flag"}
              :range "height"}
             {:name "xscale"
              :domain [-12 12]
              :range "width"}]
    :axes [{:orient "bottom" :scale "xscale" :grid true :title "Education score"}
           {:orient "left" :scale "yscale" :title "Town size"}]
    :signals [{:name "highlight"
               :value nil
               :bind {:input :select
                      :options town-name-values
                      :labels (map #(str/replace % #" BUAS?D?" "") town-name-values)}}]
    :marks [{:type "symbol"
             :from {:data "english-education"}
             :encode {:enter {:yfocus {:field "size_flag"
                                       :scale "yscale"
                                       :band 0.5}
                              :xfocus {:field "education_score"
                                       :scale "xscale"
                                       :band 0.5}
                              :fill {:value "skyblue"}
                              :stroke {:value "white"}
                              :strokeWidth {:value 1}
                              :zindex {:value 0}}
                      :update {:fill {:signal "datum['town11nm'] === highlight ? 'orange' : 'skyblue'"}
                               :stroke {:signal "datum['town11nm'] === highlight ? 'purple' : 'white'"}
                               :strokeWidth {:signal "datum['town11nm'] === highlight ? 2 : 1"}
                               :zindex {:signal "datum['town11nm'] === highlight ? 1 : 0"}}}
             :transform [{:type "force"
                          :static true
                          :forces [{:force "x" :x "xfocus"}
                                   {:force "y" :y "yfocus"}
                                   {:force "collide" :radius 4}]}]}
            {:type "symbol"
             :from {:data "averages"}
             :encode {:enter {:x {:field "avg" :scale "xscale"}
                              :y {:field "size_flag" :scale "yscale" :band 0.5}
                              :shape {:value "stroke"}
                              :strokeWidth {:value 1.5}
                              :stroke {:value "black"}
                              :size {:value 14000}
                              :angle {:value 90}}}}]}))
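A small note on the dropdown labels. The regex above is case-sensitive, so a name like "Inner London BUAs" (lowercase "s") only loses " BUA" and ends up as "Inner Londons". Here is a sketch of a slightly stricter variant, under the assumption that the only suffixes in the data are "BUA", "BUAs", and "BUASD". The function name `strip-bua-suffix` is hypothetical, not something the chart code depends on.

```clojure
(require '[clojure.string :as str])

(defn strip-bua-suffix
  "Remove a trailing ' BUA', ' BUAs' or ' BUASD' (any letter case) from a town name."
  [town-name]
  (str/replace town-name #"(?i) BUA(SD|s)?$" ""))

(mapv strip-bua-suffix
      ["Ely BUA" "Reading BUASD" "Inner London BUAs" "Dorchester (West Dorset) BUA"])
```

Since vega pairs options and labels by position, swapping this in for the label function would only change what is displayed, not which town gets highlighted.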
The select dropdown isn’t super nice or in a great spot. The way to style and position it better is with CSS, but I’m going to call that out of scope for this exercise. In theory, this dropdown would be above the chart and look much nicer. This is great, though. We have more or less re-created the main graphic from the article. We can see that smaller towns seem to have higher educational attainment on average, and check the value for any given town by selecting its value from a dropdown.
One of the many projects I’d like to tackle is making a wrapper for vega to make it intuitive to use from Clojure. Anyway, moving on to the next graph in the article.
Income deprivation and town size
Normalized, stacked bar chart
The next one is a simple normalized stacked bar chart exploring the relationship between town size and income deprivation. For this one we can drop back into vega-lite (and hanami):
(-> english-education
    (hanami/plot ht/bar-chart
                 {:XAGG "sum"
                  :XSTACK "normalize"
                  :XTITLE "Count"
                  :X :education_score
                  :YTYPE "nominal"
                  :Y :size_flag
                  :YTITLE "Town size"
                  :COLOR "income_flag"
                  :CTYPE "nominal"}))
The data from the “BUA”s and London is making this look less than ideal, so we’ll filter for only the categories of towns that are included in the article’s graph:
(-> english-education
    (tc/select-rows #(#{"Small Towns" "Medium Towns" "Large Towns" "City"}
                      (:size_flag %)))
    (hanami/plot ht/bar-chart
                 {:XAGG "sum"
                  :XSTACK "normalize"
                  :XTITLE "Count"
                  :X :education_score
                  :YTYPE "nominal"
                  :Y :size_flag
                  :YTITLE "Town size"
                  :COLOR "income_flag"
                  :CTYPE "nominal"}))
This seems to show that the incidence of income deprivation is higher the larger the town size. We can visualize this relationship in another way by making a scatterplot, like in the article.
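As a side note, the row filter used for that chart leans on the fact that a Clojure set can be called like a function: it returns the element if it is a member and nil otherwise. A minimal sketch on plain maps (the names `article-sizes` and `size-sample-rows` are made up for illustration):

```clojure
;; the town-size categories kept for the chart
(def article-sizes
  #{"Small Towns" "Medium Towns" "Large Towns" "City"})

;; hypothetical rows; only :size_flag matters here
(def size-sample-rows
  [{:town11nm "Ely BUA"           :size_flag "Small Towns"}
   {:town11nm "Inner London BUAs" :size_flag "Inner London BUA"}
   {:town11nm "Reading BUASD"     :size_flag "Large Towns"}
   {:town11nm "Not BUA"           :size_flag "Not BUA"}])

;; calling the set on a value is truthy only for members
(filterv #(article-sizes (:size_flag %)) size-sample-rows)
```

Only the Ely and Reading rows survive, which is the same thing the predicate passed to tc/select-rows is doing row by row.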
Scatterplot
For this graph we’ll use the dataset provided in the article (since it contains income deprivation scores, not just classifications).
(-> "data/year_2024/week_4/education-and-income-scores.csv"
    tc/dataset
    (hanami/plot ht/point-chart
                 {:X "Educational attainment score"
                  :Y "Income deprivation score"
                  :YSCALE {:zero false}}))
This reveals that income deprivation seems to be worse in larger towns, but this isn’t the whole story. The article also investigates regional differences. To see the effect of region on education scores, we can plot scores vs region and encode the income deprivation category in the color of the points. We’ll go back to our original dataset for this. The rgn11nm column contains the region we’re looking for.
Education score by region
(-> english-education
    (hanami/plot ht/point-chart
                 {:X :education_score
                  :Y :rgn11nm
                  :YTYPE "nominal"
                  :COLOR "income_flag"
                  :CTYPE "nominal"}))
This is pretty close, but we can take the average of all scores in a given region to reduce some of the noise, and make the points larger to make it easier to see them.
(-> english-education
    (hanami/plot ht/point-chart
                 {:X :education_score
                  :XAGG "mean"
                  :Y :rgn11nm
                  :YTYPE "nominal"
                  :MSIZE 100
                  :COLOR "income_flag"
                  :CTYPE "nominal"}))
This reveals an interesting relationship between region and education scores. The North West has the highest education scores at all levels of income deprivation.
Effect of being near the coast
The next graph is an interesting one showing that coastal towns have worse outcomes than non-coastal towns. The graph is similar to the previous one, but in order to get the data into a sensible shape to visualize I’m going to wrangle it a bit ahead of throwing it into our viz function. We have the coastal_detailed column that gives us, well, details about the coastal-ness of each town. Inspecting the distinct values in this column shows us it’s a bit of a mess, though.
(-> english-education :coastal_detailed vec distinct)
("Smaller non-coastal town"
 "Large non-coastal town"
 "Smaller seaside town"
 "Large seaside town"
 "Smaller other coastal town"
 "Large other coastal town"
 "Cities"
 nil)
This one column contains strings that describe a town as either “smaller” or “large”, and “non-coastal”, “seaside”, or “other coastal”. The first thing we’ll do is tidy up this data. Each variable should be in its own column, plus we’ll delete all the rows that don’t have a value for coastal_detailed. Then we’ll pass it to our viz function, taking the average of the education score for each category we’re plotting, just like above.
(-> english-education
    (tc/select-columns [:education_score :coastal_detailed])
    (tc/drop-missing :coastal_detailed)
    (tc/map-columns :town_size [:coastal_detailed]
                    (fn [v]
                      (when v
                        (if (str/includes? v "Smaller") "Small" "Large"))))
    (tc/map-columns :coastal_type [:coastal_detailed]
                    (fn [v]
                      (when v
                        (cond
                          (str/includes? v "non-coastal") "Non-coastal"
                          (str/includes? v "seaside") "Seaside"
                          :else "Other coastal"))))
    (hanami/plot ht/point-chart
                 {:X :education_score
                  :XAGG "mean"
                  :Y :town_size
                  :YTYPE "nominal"
                  :COLOR "coastal_type"
                  :CTYPE "nominal"
                  :MSIZE 100
                  :HEIGHT 100}))
This reveals an interesting relationship between the proximity of a town to the coast and education scores. For whatever reason, inland towns have better educational attainment.
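The tidying step used for that chart can be tried on its own, without a dataset. This is a sketch that folds the two mapping functions into one hypothetical helper, `parse-coastal-detail`, returning both derived variables at once. One thing it makes easy to spot is that the value "Cities" falls through to "Large" and "Other coastal", which may or may not be what we want.

```clojure
(require '[clojure.string :as str])

(defn parse-coastal-detail
  "Split a coastal_detailed string into a town size and a coastal type,
  using the same string checks as the pipeline above. Returns nil for nil."
  [v]
  (when v
    {:town_size    (if (str/includes? v "Smaller") "Small" "Large")
     :coastal_type (cond
                     (str/includes? v "non-coastal") "Non-coastal"
                     (str/includes? v "seaside")     "Seaside"
                     :else                           "Other coastal")}))

(mapv parse-coastal-detail
      ["Smaller seaside town" "Large non-coastal town" "Cities" nil])
```

The order of the `cond` branches matters: "non-coastal" has to be checked before anything that looks for "coastal", otherwise every town would match the wrong branch.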
Widening educational attainment gap over time
The article also observes that the gap in educational attainment between students from high vs low income deprivation areas widens over time. We can see this in the data by finding the average “key stage 2” (end of primary school in the UK) attainment in a given income deprivation category and comparing it to two later measures of educational attainment (key stage 4 and level 3).
I’ll define a function for computing the average of a sequence of numbers, for the sake of clarifying the data transformation.
(defn- average [vals]
  (/ (reduce + vals) (count vals)))
#'year-2024.week-4.analysis/average
We’ll also filter out rows that have “Cities” and nil as their income_flag because these aren’t comparable to the other income flag values.
(-> english-education
    (tc/drop-missing :income_flag)
    (tc/drop-rows #(= "Cities" (:income_flag %)))
    (tc/group-by [:income_flag])
    (tc/aggregate-columns [:key_stage_2_attainment_school_year_2007_to_2008
                           :key_stage_4_attainment_school_year_2012_to_2013
                           :level_3_at_age_18]
                          average))
_unnamed [3 4]:
:income_flag | :key_stage_2_attainment_school_year_2007_to_2008 | :key_stage_4_attainment_school_year_2012_to_2013 | :level_3_at_age_18 |
---|---|---|---|
Higher deprivation towns | 69.46241954 | 55.61679938 | 41.59478974 |
Mid deprivation towns | 72.95656558 | 59.34720275 | 47.82498013 |
Lower deprivation towns | 79.29026824 | 67.95547806 | 57.96412983 |
We’ll calculate the gap in attainment between the different deprivation categories. I’m going to rename the columns first to make them more succinct. Then we’ll add new columns that show the gap between each stage for a given income deprivation level (compared to the lowest deprivation category).
(let [ds (-> english-education
             (tc/drop-missing :income_flag)
             (tc/drop-rows #(= "Cities" (:income_flag %)))
             (tc/group-by [:income_flag])
             (tc/aggregate-columns [:key_stage_2_attainment_school_year_2007_to_2008
                                    :key_stage_4_attainment_school_year_2012_to_2013
                                    :level_3_at_age_18]
                                   average)
             (tc/rename-columns {:key_stage_2_attainment_school_year_2007_to_2008 :key_stage_2
                                 :key_stage_4_attainment_school_year_2012_to_2013 :key_stage_4
                                 :level_3_at_age_18 :level_3}))
      lowest-deprivation-vals (tc/select-rows ds #(str/includes? (:income_flag %) "Lower"))]
  (-> ds
      (tc/map-columns :key_stage_2_gap [:key_stage_2]
                      #(- % (first (:key_stage_2 lowest-deprivation-vals))))
      (tc/map-columns :key_stage_4_gap [:key_stage_4]
                      #(- % (first (:key_stage_4 lowest-deprivation-vals))))
      (tc/map-columns :level_3_gap [:level_3]
                      #(- % (first (:level_3 lowest-deprivation-vals))))))
_unnamed [3 7]:
:income_flag | :key_stage_2 | :key_stage_4 | :level_3 | :key_stage_2_gap | :key_stage_4_gap | :level_3_gap |
---|---|---|---|---|---|---|
Higher deprivation towns | 69.46241954 | 55.61679938 | 41.59478974 | -9.82784870 | -12.33867868 | -16.36934009 |
Mid deprivation towns | 72.95656558 | 59.34720275 | 47.82498013 | -6.33370266 | -8.60827531 | -10.13914970 |
Lower deprivation towns | 79.29026824 | 67.95547806 | 57.96412983 | 0.00000000 | 0.00000000 | 0.00000000 |
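As a quick check on the gap columns in the table above, here is the same subtraction on plain maps, using the averages copied from the earlier output. The names `stage-averages`, `lower-baseline`, and `add-gaps` are hypothetical helpers for this sketch only.

```clojure
(require '[clojure.string :as str])

;; averages copied from the aggregated table above
(def stage-averages
  [{:income_flag "Higher deprivation towns" :key_stage_2 69.46241954 :key_stage_4 55.61679938 :level_3 41.59478974}
   {:income_flag "Mid deprivation towns"    :key_stage_2 72.95656558 :key_stage_4 59.34720275 :level_3 47.82498013}
   {:income_flag "Lower deprivation towns"  :key_stage_2 79.29026824 :key_stage_4 67.95547806 :level_3 57.96412983}])

;; the row every other category is compared against
(def lower-baseline
  (first (filter #(str/includes? (:income_flag %) "Lower") stage-averages)))

(defn add-gaps
  "Add a <stage>_gap key to each row: the row's value minus the baseline's."
  [rows baseline stage-keys]
  (mapv (fn [row]
          (reduce (fn [acc k]
                    (assoc acc
                           (keyword (str (name k) "_gap"))
                           (- (get row k) (get baseline k))))
                  row
                  stage-keys))
        rows))

(add-gaps stage-averages lower-baseline [:key_stage_2 :key_stage_4 :level_3])
```

For the higher deprivation towns this gives gaps of about -9.83, -12.34, and -16.37, matching the table: the gap grows at each later stage.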
This data is already pretty easy to interpret just as a table. For the sake of the exercise, we can plot it in a similar way to the article. To do this we’ll tidy up the data in a different way, in all senses of the word. Right now this data is structured in a way that makes it easy to interpret when it’s printed as a table, but it’s hard to plot because some of the column names encode a variable in our dataset across multiple columns (the education attainment stage). To fix this, we’ll use tablecloth’s pivot->longer, so that each row represents one observation and each column represents a single variable. Then we can plot it.
(let [ds (-> english-education
             (tc/drop-missing :income_flag)
             (tc/drop-rows #(= "Cities" (:income_flag %)))
             (tc/group-by [:income_flag])
             (tc/aggregate-columns [:key_stage_2_attainment_school_year_2007_to_2008
                                    :key_stage_4_attainment_school_year_2012_to_2013
                                    :level_3_at_age_18]
                                   average)
             (tc/rename-columns {:key_stage_2_attainment_school_year_2007_to_2008 :key_stage_2
                                 :key_stage_4_attainment_school_year_2012_to_2013 :key_stage_4
                                 :level_3_at_age_18 :level_3}))
      lowest-deprivation-vals (tc/select-rows ds #(str/includes? (:income_flag %) "Lower"))]
  (-> ds
      (tc/map-columns :key_stage_2_gap [:key_stage_2]
                      #(- % (first (:key_stage_2 lowest-deprivation-vals))))
      (tc/map-columns :key_stage_4_gap [:key_stage_4]
                      #(- % (first (:key_stage_4 lowest-deprivation-vals))))
      (tc/map-columns :level_3_gap [:level_3]
                      #(- % (first (:level_3 lowest-deprivation-vals))))
      (tc/select-columns [:income_flag :key_stage_2_gap :key_stage_4_gap :level_3_gap])
      (tc/pivot->longer [:key_stage_2_gap :key_stage_4_gap :level_3_gap]
                        {:target-columns :gap
                         :value-column-name :value})
      (hanami/plot ht/point-chart
                   {:X :gap
                    :XTYPE "nominal"
                    :Y :value
                    :COLOR "income_flag"
                    :CTYPE "nominal"
                    :MSIZE 100})))
This could be tidied up a bit with nicer names and colours, but this is the idea. I think the main takeaway from this graph is that data should be tidy for the sake of visualizing it, and that naming columns is important.
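To make the wide-to-long reshaping a little more concrete, here is a plain-Clojure sketch of the idea behind the pivot used above. It is not tablecloth's implementation, just an illustration on maps; the function name `wide->long` and the rounded sample values are made up.

```clojure
(defn wide->long
  "Turn each wide row into one row per key in `value-keys`. The key itself goes
  under :gap and its value under :value; all other keys are kept as-is."
  [rows value-keys]
  (vec
   (for [row rows
         k   value-keys]
     (-> (apply dissoc row value-keys)
         (assoc :gap k :value (get row k))))))

;; two wide rows with rounded gaps, three measures each
(wide->long
 [{:income_flag "Higher deprivation towns"
   :key_stage_2_gap -9.83 :key_stage_4_gap -12.34 :level_3_gap -16.37}
  {:income_flag "Lower deprivation towns"
   :key_stage_2_gap 0.0 :key_stage_4_gap 0.0 :level_3_gap 0.0}]
 [:key_stage_2_gap :key_stage_4_gap :level_3_gap])
```

Two wide rows with three measures become six long rows, each holding exactly one observation. That is the shape the point chart wants: one column for x, one for y, one for colour.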
Pursuit of higher education
The next question we can answer with this data is how many students from towns of different sizes go on to further and higher education. This chart looks complex but it’s just a simple bar chart faceted by educational attainment milestones. We have the information to make this chart, but again it’s not organized in a nice way for visualizing.
The data we care about for this chart are the town size flag and the three educational attainment values in the columns level_3_at_age_18, activity_at_age_19:_full-time_higher_education, and activity_at_age_19:_sustained_further_education. We’ll pivot our data again, in a different way this time, to organize it for this visualization, then aggregate the values to calculate the average for each town size/attainment level pair. As an added bonus, the data in the activity... columns are strings, not numbers, so we’ll coerce those to numbers too so that we can do math on them (like calculating the average).
(-> english-education
    (tc/select-columns [:size_flag
                        :level_3_at_age_18
                        :activity_at_age_19:_full-time_higher_education
                        :activity_at_age_19:_sustained_further_education])
    (tc/rename-columns {:level_3_at_age_18 "Level 3 qualifications at age 18"
                        :activity_at_age_19:_full-time_higher_education "In full-time higher education at age 19"
                        :activity_at_age_19:_sustained_further_education "In further education at age 19"})
    (tc/pivot->longer ["Level 3 qualifications at age 18"
                       "In full-time higher education at age 19"
                       "In further education at age 19"]
                      {:target-columns :attainment_measure
                       :value-column-name :value})
    ;; coerce all the values to be numbers
    (tc/update-columns :value (partial map #(if (string? %) (parse-double %) %)))
    ;; lose the nil ones, it messes up our calculation
    (tc/drop-missing :value)
    ;; calculate the average value for each size flag/attainment measure pair
    (tc/group-by [:size_flag :attainment_measure])
    (tc/aggregate-columns :value average)
    ;; plot this
    (hanami/plot {:facet :FACET
                  :spec {:encoding :ENCODING
                         :layer :LAYER
                         :width :WIDTH}
                  :data {:values :DATA
                         :format :DFMT}}
                 {:Y :size_flag
                  :YTITLE "Town size"
                  :YTYPE "nominal"
                  :YSORT ["Small towns"
                          "Medium towns"
                          "Large towns"
                          "Cities"
                          "Outer London"
                          "Inner London"]
                  :X :value
                  :XTITLE "Percentage of population"
                  :FACET {:column {:field :attainment_measure
                                   :type "nominal"
                                   :title "Attainment measure"
                                   :sort ["Level 3 qualifications at age 18"
                                          "In full-time higher education at age 19"
                                          "In further education at age 19"]}}
                  :LAYER [{:mark "bar"}
                          {:mark {:type "text" :dx -3 :color "white" :align "right"}
                           :encoding {:text {:field "value" :format ".1f"}}}]
                  :WIDTH 140}))
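The coercion step in that pipeline is worth a closer look, because the activity columns mix numeric strings with "*" placeholders. A minimal sketch, assuming Clojure 1.11 or later (where `parse-double` is part of clojure.core and returns nil for strings that aren't numbers); the names `->number` and `mean-of-present` are hypothetical.

```clojure
(defn ->number
  "Parse strings to doubles (nil when unparseable); pass other values through."
  [v]
  (if (string? v) (parse-double v) v))

(defn mean-of-present
  "Average of the non-nil values after coercion, or nil if nothing is left."
  [vals]
  (let [nums (remove nil? (map ->number vals))]
    (when (seq nums)
      (/ (reduce + nums) (count nums)))))

;; "*" marks a suppressed value in the source data
(mapv ->number ["30.5" "*" 21.5 nil])
(mean-of-present ["30.5" "*" 21.5 nil])
```

This is why the pipeline drops missing values right after coercing: the "*" entries become nil, and leaving them in would break the arithmetic in the averaging step.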
Connection to other town residents
The last graph in the article is one showing the relationship between the educational attainment of older and younger residents. We can see that education scores seem to be correlated with the incidence of high educational attainment among older residents of a town. We don’t have exactly the right data to reproduce this graph, but we can do something similar with the data we do have.
We can plot the educational attainment values vs. the educational attainment classification of residents aged 35-64 and add some jitter to spread out the dots, which will reveal the same general relationship as the more precise data – students in towns with more highly educated older generations tend to have higher educational attainment. It’s still worth pointing out that there is a huge overlap between the highest low-education town educational attainment scores and the lowest high-education town scores, so there are obviously many other factors at play. But it’s still an interesting observation.
(-> english-education
    (tc/drop-missing :level4qual_residents35-64_2011)
    (hanami/plot {:encoding (assoc (hc/get-default :ENCODING)
                                   :yOffset :YOFFSET)
                  :mark {:type "circle"}
                  :transform :TRANSFORM
                  :height 300
                  :width 500
                  :data {:values :DATA
                         :format :DFMT}}
                 {:X :education_score
                  :Y :level4qual_residents35-64_2011
                  :YTYPE "nominal"
                  :YSORT ["High" "Medium" "Low"]
                  :YOFFSET {:field "jitter" :type "quantitative"}
                  :TRANSFORM [{:calculate "sqrt(-2*log(random()))*cos(2*PI*random())"
                               :as "jitter"}]}))
That sums up our work this week. There are so many graphs in this one, it was a lot of fun to play around and see some common patterns emerging. I’m looking forward to many projects revolving around tidying up the dataviz story in Clojure.
See you next week :)