Week 4 - Educational Attainment of Young People in English Towns

(ns year-2024.week-4.analysis
  (:require
   [aerial.hanami.common :as hc]
   [aerial.hanami.templates :as ht]
   [clojure.string :as str]
   [scicloj.kindly.v4.kind :as kind]
   [scicloj.noj.v1.vis.hanami :as hanami]
   [tablecloth.api :as tc]))

This week’s data is about English students. We’ll try to answer the question from the example article: why do children and young people in smaller towns do better academically than those in larger towns?

Plotting the data as a scatterplot

The first graph in the article is a kind of scatterplot segmented by the size of the town. We’ll load the data first:

(def english-education
  (-> "data/year_2024/week_4/english_education.csv"
      (tc/dataset {:key-fn keyword})))
english-education

data/year_2024/week_4/english_education.csv [1104 31]:

:town11cd :town11nm :population_2011 :size_flag :rgn11nm :coastal :coastal_detailed :ttwa11cd :ttwa11nm :ttwa_classification :job_density_flag :income_flag :university_flag :level4qual_residents35-64_2011 :ks4_2012-2013_counts :key_stage_2_attainment_school_year_2007_to_2008 :key_stage_4_attainment_school_year_2012_to_2013 :level_2_at_age_18 :level_3_at_age_18 :activity_at_age_19:_full-time_higher_education :activity_at_age_19:_sustained_further_education :activity_at_age_19:_appprenticeships :activity_at_age_19:employment_with_earnings_above£0 :activity_at_age_19:employment_with_earnings_above£10,000 :activity_at_age_19:_out-of-work :highest_level_qualification_achieved_by_age_22:_less_than_level_1 :highest_level_qualification_achieved_by_age_22:_level_1_to_level_2 :highest_level_qualification_achieved_by_age_22:_level_3_to_level_5 :highest_level_qualification_achieved_by_age_22:_level_6_or_above :highest_level_qualification_achieved_b_age_22:_average_score :education_score
E34000007 Carlton in Lindrick BUA 5456.0 Small Towns East Midlands Non-coastal Smaller non-coastal town E30000291 Worksop and Retford Majority town and city (small) Residential Higher deprivation towns No university Low 65.0 65.00000000 70.76923077 72.30769231 50.76923077 30.76923076923077 21.53846153846154 * 52.307692307692314 36.92307692307693 * * 34.9 39.7 * 3.32307692 -0.53375042
E34000016 Dorchester (West Dorset) BUA 19060.0 Small Towns South West Non-coastal Smaller non-coastal town E30000046 Dorchester and Weymouth Majority town and city (small) Working Mid deprivation towns No university Medium 239.0 69.05829596 71.12970711 85.71428571 60.08403361 41.84100418410041 13.389121338912133 10.0418410041841 51.04602510460251 24.686192468619247 4.184100418410042 * 21.7 44.6 33.3 3.73221757 1.95201875
E34000020 Ely BUA 19090.0 Small Towns East of England Non-coastal Smaller non-coastal town E30000186 Cambridge Majority town and city (Large Towns) Working Lower deprivation towns No university Medium 155.0 71.42857143 56.12903226 83.87096774 45.80645161 35.483870967741936 10.967741935483872 7.741935483870968 57.41935483870968 27.741935483870968 * * 34.4 31.2 32.5 3.54838710 -1.04412799
E34000026 Market Weighton BUA 6429.0 Small Towns Yorkshire and The Humber Non-coastal Smaller non-coastal town E30000220 Hull Majority town and city (Large Towns) Residential Lower deprivation towns No university Medium 58.0 70.96774194 53.44827586 91.22807018 49.12280702 25.862068965517242 25.862068965517242 * 58.620689655172406 31.03448275862069 * * * 66.1 * 3.48275862 -1.24926195
E34000027 Downham Market BUA 10884.0 Small Towns East of England Non-coastal Smaller non-coastal town E30000225 King’s Lynn Majority rural Mixed Higher deprivation towns No university Low 93.0 78.04878049 62.36559140 78.49462366 40.86021505 26.881720430107524 20.43010752688172 15.053763440860216 55.91397849462365 30.107526881720432 15.053763440860216 * 32.6 44.2 * 3.16129032 -1.16907845
E34000039 Penrith BUA 15181.0 Small Towns North West Non-coastal Smaller non-coastal town E30000106 Penrith Majority rural Working Lower deprivation towns No university Low 158.0 73.85620915 74.05063291 91.13924051 46.20253165 31.645569620253166 16.455696202531644 12.658227848101266 62.0253164556962 37.34177215189873 7.59493670886076 * 29.4 40.0 28.1 3.47468354 0.84513109
E34000048 Bolsover BUA 11754.0 Small Towns East Midlands Non-coastal Smaller non-coastal town E30000190 Chesterfield Majority town and city (Large Towns) Residential Higher deprivation towns No university Low 118.0 70.90909091 50.00000000 76.27118644 37.28813559 21.1864406779661 27.11864406779661 20.33898305084746 50.847457627118644 26.27118644067797 11.864406779661017 * 34.5 47.9 * 3.10169492 -3.71791969
E34000055 March BUA 21051.0 Medium Towns East of England Non-coastal Large non-coastal town E30000287 Wisbech Majority town and city (small) Residential Higher deprivation towns No university Low 235.0 71.06382979 57.44680851 84.25531915 40.42553191 25.53191489361702 17.4468085106383 10.212765957446807 47.65957446808511 26.80851063829787 6.808510638297872 * 37.8 37.8 23.6 3.28510638 -2.17130716
E34000056 Southam BUA 6567.0 Small Towns West Midlands Non-coastal Smaller non-coastal town E30000228 Leamington Spa Majority town and city (small) Working Lower deprivation towns No university Medium 92.0 81.92771084 91.30434783 90.21739130 67.39130435 38.04347826086957 21.73913043478261 16.304347826086957 52.17391304347826 25.0 * * 19.6 47.8 32.6 3.84782609 6.43766140
E34000067 Royston BUA 15781.0 Small Towns East of England Non-coastal Smaller non-coastal town E30000186 Cambridge Majority town and city (Large Towns) Working Lower deprivation towns No university Medium 169.0 73.03370787 65.08875740 82.24852071 53.84615385 23.668639053254438 21.893491124260358 14.201183431952662 57.98816568047337 39.053254437869825 * * 23.7 50.9 23.7 3.38461538 0.29540337
E35001501 Reading BUASD 218705.0 Large Towns South East Non-coastal Large non-coastal town E30000256 Reading Majority town and city (Large Towns) Working Mid deprivation towns University Medium 2243.0 73.81263617 62.50557289 82.65715560 52.07311636 37.672759696834596 16.272848863129738 9.80829246544806 48.14979937583593 24.83281319661168 5.21622826571556 2.4 25.7 42.5 29.5 3.53811859 0.39981293
E35001502 Hedge End BUASD 25117.0 Medium Towns South East Non-coastal Large non-coastal town E30000267 Southampton Majority town and city (Large Towns) Mixed Lower deprivation towns No university Medium 307.0 77.63578275 70.68403909 84.03908795 58.30618893 34.20195439739413 19.218241042345277 14.65798045602606 60.91205211726385 35.50488599348534 5.537459283387622 * 18.1 51.5 29.1 3.63517915 2.48698423
E35001503 Westergate BUASD 9783.0 Small Towns South East Non-coastal Smaller non-coastal town E30000191 Chichester and Bognor Regis Majority town and city (small) Mixed Lower deprivation towns No university Medium 108.0 80.00000000 62.96296296 88.88888889 56.48148148 27.77777777777778 14.814814814814813 * 54.629629629629626 25.0 * * 22.9 48.6 27.5 3.53703704 1.56425938
E35001507 Loughton BUASD 31106.0 Medium Towns East of England Non-coastal Large non-coastal town E30000234 London Majority conurbation Working Mid deprivation towns No university Medium 357.0 75.55555556 70.30812325 84.83146067 49.71910112 35.01400560224089 16.5266106442577 10.92436974789916 42.857142857142854 28.57142857142857 4.481792717086835 * 22.9 46.4 29.3 3.57422969 1.26655373
E35001516 Milton Keynes BUASD 171750.0 Large Towns South East Non-coastal Large non-coastal town E30000243 Milton Keynes Majority town and city (Large Towns) Working Mid deprivation towns University Medium 2106.0 70.43314501 62.67806268 85.99240266 52.08926876 42.26020892687559 13.437796771130103 6.837606837606838 43.49477682811016 21.225071225071225 6.172839506172839 1.9 26.4 38.9 32.8 3.65337132 0.34182404
E35001517 Redditch BUASD 81919.0 Large Towns West Midlands Non-coastal Large non-coastal town E30000169 Birmingham Majority conurbation Working Higher deprivation towns No university Low 1006.0 67.58349705 68.09145129 87.17693837 48.40954274 32.30616302186879 18.78727634194831 9.642147117296222 50.0 22.962226640159045 8.846918489065606 1.7 29.7 41.3 27.3 3.47117296 -0.29414288
K06000004 Chester BUASD 86011.0 Large Towns North West Non-coastal Large non-coastal town K01000011 Chester Majority town and city (Large Towns) Working Higher deprivation towns University Medium 739.0 70.72192513 61.02841678 80.62330623 46.20596206 32.47631935047362 19.485791610284167 9.066305818673884 45.33152909336942 18.26792963464141 7.983761840324763 2.0 27.3 43.0 27.7 3.47496617 -0.81093526
Inner London BUAs Inner London BUA London 24630.0 69.91000918 63.21965083 84.88381540 53.47335067 49.26918392204628 14.023548518067397 5.952090946000812 30.07308160779537 10.117742590336988 7.182298010556232 1.9 19.0 44.0 35.1 3.78457166 0.06759137
Outer london BUAs Outer london BUA London 52998.0 74.51446320 66.58364467 86.60700800 57.05331521 48.47352730291709 14.249594324314124 7.462545756443639 36.38816559115438 15.551530246424395 5.520963055209631 1.7 18.8 42.5 37.0 3.86403261 1.26241027
Other Small BUAs Other Small BUAs 59743.0 77.35164037 65.12562141 86.39697264 54.48745856 37.30144117302445 18.96623872252816 12.269219824916727 50.26530304805584 24.33757929799307 5.4985521316304835 1.7 23.3 44.9 30.1 3.65651206 1.22187679
Not BUA Not BUA 21765.0 78.23469971 67.50746612 87.83783784 57.17503218 38.98460831610384 18.111647139903514 11.504709395818976 48.86744773719274 23.06914771422008 4.741557546519641 1.5 21.2 45.1 32.2 3.70797151 1.80208942
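A few cells in the printout above hold `*` instead of a number. Treating that marker as a missing or suppressed value is an assumption here (the data dictionary would be the place to confirm it), but if it holds, a small helper in plain Clojure could convert such cells before doing any arithmetic on them. `parse-cell` is a hypothetical name, not part of tablecloth:

```clojure
(defn parse-cell
  "Hypothetical helper: returns nil for nil or the `*` marker (assumed to
  mean a missing or suppressed value), otherwise parses the string as a double."
  [^String s]
  (when-not (or (nil? s) (= "*" s))
    (Double/parseDouble s)))

;; The values below are copied from the first printed row
(map parse-cell ["21.53846153846154" "*" "52.307692307692314"])
;; => (21.53846153846154 nil 52.307692307692314)
```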

We can plot the education scores sorted by town size like in the article to start:

(hanami/plot english-education
             ht/point-chart
             {:X :education_score
              :Y :size_flag
              :YTYPE "nominal"})

The points are all bunched up because every town in a size category shares the same y position, so we can add jitter to this chart to spread them out along the y axis:

(hanami/plot english-education
             ;; This is not ideal.. one of my many projects this year is to think up ways
             ;; to improve Clojure's dataviz wrappers
             (-> ht/point-chart
                 (assoc :encoding (assoc (hc/get-default :ENCODING)
                                         :yOffset :YOFFSET)))
             {:X :education_score
              :Y :size_flag
              :HEIGHT {:step 60}
              :TRANSFORM [{:calculate
                           ;; Generate Gaussian jitter with a Box-Muller transform. We could
                           ;; also use `random()`, but that jitters the points uniformly and
                           ;; looks worse
                           "sqrt(-2*log(random()))*cos(2*PI*random())"
                           :as "jitter"}]
              :YTYPE "nominal"
              :YOFFSET {:field "jitter" :type "quantitative"}})

This is slightly better, but it’s already obvious that we’re never going to be able to replicate the graph in the article this way: the jittering that vega-lite supports gives us no way to specify precisely how to offset the points along the y axis. To replicate the graph from the article, we can make a beeswarm plot instead. For this we’ll have to use vega itself (as opposed to vega-lite). Vega-lite is awesome, but it’s intentionally a simpler and less complete layer on top of vega. Every once in a while I come across a type of graph that is not well supported by vega-lite and drop into vega. Fortunately there are lots of solutions for rendering vega plots in Clojure notebooks (like Clay, which I’m using to build this book).
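For reference, the jitter expression in the spec above is one half of a Box-Muller transform: it turns two uniform random numbers into one normally distributed number. The same calculation in plain Clojure might look like this. It is only a sketch: `gaussian-jitter` is a made-up name, and the `(- 1.0 (rand))` guard against `log(0)` is an addition that the vega-lite expression does not have.

```clojure
(defn gaussian-jitter
  "Returns one sample from (approximately) a standard normal distribution,
  computed with the cosine branch of the Box-Muller transform."
  []
  (let [u1 (- 1.0 (rand)) ; (rand) is in [0, 1), so u1 is in (0, 1] and log is safe
        u2 (rand)]
    (* (Math/sqrt (* -2.0 (Math/log u1)))
       (Math/cos (* 2.0 Math/PI u2)))))

;; Five jitter offsets, centred on 0 with most values between -2 and 2
(repeatedly 5 gaussian-jitter)
```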

Vega spec for a beeswarm graph

Getting a tablecloth dataset into vega

The first step is getting our data to render in a vega spec through Clojure. For this we can use Daniel Slutsky’s amazing library kindly. It tells our namespace in a notebook-agnostic way how to render different things. If we pass a valid vega spec to kind/vega, our notebook will render it properly as a graph. So cool.

So first, to just get any data rendering in our graph, we’ll read (or in Clojure, slurp) our data from the relevant file. We have to specify the format since csv is not the default. For now we’ll just make a simple scatterplot to get the data on the page.

(kind/vega
 ;; every dataset in vega needs to be named, this is how we reference the
 ;; data in the rest of the spec
 {:data [{:name "english-education"
          :values (slurp "data/year_2024/week_4/english_education.csv")
          :format {:type "csv"}}]
  :width 500
  :height 500
  ;; the minimal vega spec includes scales, marks, and axes
  ;; scales map data values to visual values, defining the nature of visual encodings
  :scales [{:name "yscale"
            :type "band"
            :domain {:data "english-education" :field "size_flag"}
            :range "height"}
           {:name "xscale"
            :domain [-12 12]
            :range "width"}]
  ;; axes label and provide context for the scales
  :axes [{:orient "bottom" :scale "xscale"}
         {:orient "left" :scale "yscale"}]
  ;; marks map visual values to shapes on the screen
  :marks [{:type "symbol"
           :from {:data "english-education"}
           :encode {:enter {:y {:field "size_flag"
                                :scale "yscale"}
                            :x {:field "education_score"
                                :scale "xscale"}}}}]})

This is the minimal reproduction of the vega-lite scatterplot we made above. This is where we’ll dive deeper into vega to do some things vega-lite can’t do. In order to turn this into a beeswarm plot, we can add what vega calls forces.
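Since a vega spec is just Clojure data, small variations on it can be expressed with ordinary map functions rather than by retyping the whole thing. As a rough sketch (the names `base-axes` and `with-x-grid` are made up for illustration and are not part of kindly or vega), this is how a grid could be switched on for the bottom axis only:

```clojure
(def base-axes
  {:axes [{:orient "bottom" :scale "xscale"}
          {:orient "left" :scale "yscale"}]})

(defn with-x-grid
  "Returns `spec` with `:grid true` set on its bottom axis only."
  [spec]
  (update spec :axes
          (fn [axes]
            (mapv (fn [axis]
                    (if (= "bottom" (:orient axis))
                      (assoc axis :grid true)
                      axis))
                  axes))))

(with-x-grid base-axes)
;; => {:axes [{:orient "bottom", :scale "xscale", :grid true}
;;            {:orient "left", :scale "yscale"}]}
```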

Turning a scatterplot into a beeswarm graph in vega

Force transforms compute force-directed layouts, so we can use one to compute the placement of each point in our graph such that they cluster together based on their y value but don’t overlap. To compute a beeswarm layout we’ll tell vega to pull each point toward its target position (its education score on the x axis and the middle of its town size band on the y axis), while a collision force keeps the points from overlapping.
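To build some intuition for what a beeswarm layout does, here is a toy version in plain Clojure. It is not how vega’s force simulation works (that one iterates simulated forces); this sketch uses a simpler greedy rule, and the names `overlaps?` and `toy-beeswarm` are made up. Each point keeps its x value and takes the vertical offset closest to 0 that doesn’t overlap a point placed before it:

```clojure
(defn- overlaps?
  "True when a circle of radius r centred at [x y] would overlap a circle in placed."
  [placed r [x y]]
  (boolean
   (some (fn [[px py]]
           (let [dx (- x px)
                 dy (- y py)]
             ;; two circles of radius r overlap when their centres are closer than 2r
             (< (+ (* dx dx) (* dy dy)) (* 4 r r))))
         placed)))

(defn toy-beeswarm
  "Places circles of radius r at the given x values, choosing for each one the
  vertical offset closest to 0 that avoids the circles already placed."
  [xs r]
  (let [;; candidate offsets: 0, r, -r, 2r, -2r, ...
        offsets (cons 0 (mapcat (fn [k] [(* k r) (- (* k r))])
                                (iterate inc 1)))]
    (reduce (fn [placed x]
              (let [y (first (remove #(overlaps? placed r [x %]) offsets))]
                (conj placed [x y])))
            []
            (sort xs))))

(toy-beeswarm [0 1 2 10] 2)
;; => [[0 0] [1 4] [2 -4] [10 0]]
```

The three crowded points fan out above and below the centre line, while the isolated one stays on it. The vega spec for our actual data uses the force transform instead.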

(kind/vega
 {:data [{:name "english-education"
          :values (slurp "data/year_2024/week_4/english_education.csv")
          :format {:type "csv"}}]
  :width 800
  :height 800
  :scales [{:name "yscale"
            :type "band"
            :domain {:data "english-education" :field "size_flag"}
            :range "height"}
           {:name "xscale"
            :domain [-12 12]
            :range "width"}]
  :axes [{:orient "bottom" :scale "xscale"}
         {:orient "left" :scale "yscale"}]
  :marks [{:type "symbol"
           :from {:data "english-education"}
           :encode {:enter {;; rather than setting x and y directly, we encode the target
                            ;; positions as xfocus and yfocus for the force transform to use
                            :yfocus {:field "size_flag"
                                     :scale "yscale"
                                     :band 0.5}
                            :xfocus {:field "education_score"
                                     :scale "xscale"
                                     :band 0.5}}}
           ;; this is the new part -- simulated forces that pull each point toward its
           ;; focus position while keeping the points from colliding
           :transform [{:type "force"
                        :static true
                        :forces [{:force "x" :x "xfocus"}
                                 {:force "y" :y "yfocus"}
                                 {:force "collide" :radius 4}]}]}]})

This is pretty close to the graph we’re going for! We can add some more aesthetic details to make it look neater, like changing the definition of the points and adding a grid along the x axis:

(kind/vega
 {:data [{:name "english-education"
          :values (slurp "data/year_2024/week_4/english_education.csv")
          :format {:type "csv"}}]
  :width 800
  :height 800
  :scales [{:name "yscale"
            :type "band"
            :domain {:data "english-education" :field "size_flag"}
            :range "height"}
           {:name "xscale"
            :domain [-12 12]
            :range "width"}]
  :axes [{:orient "bottom" :scale "xscale" :grid true}
         {:orient "left" :scale "yscale"}]
  :marks [{:type "symbol"
           :from {:data "english-education"}
           :encode {:enter {:yfocus {:field "size_flag"
                                     :scale "yscale"
                                     :band 0.5}
                            :xfocus {:field "education_score"
                                     :scale "xscale"
                                     :band 0.5}
                            :fill {:value "skyblue"}
                            :stroke {:value "white"}
                            :strokeWidth {:value 1}
                            :zindex {:value 0}}}
           :transform [{:type "force"
                        :static true
                        :forces [{:force "x" :x "xfocus"}
                                 {:force "y" :y "yfocus"}
                                 {:force "collide" :radius 4}]}]}]})

Woohoo. Making progress. Next we can add the average lines. We’ll transform the data we already have into a new grouped dataset with the average computed per size_flag, and add another mark layer on top to show these lines:

(kind/vega
 {:data [{:name "english-education"
          :values (slurp "data/year_2024/week_4/english_education.csv")
          :format {:type "csv"}}
         ;; this is the new dataset, computed from the existing one
         {:name "averages"
          :source "english-education"
          :transform [{:type "aggregate",
                       :fields ["education_score"],
                       :groupby ["size_flag"],
                       :ops ["average"],
                       :as ["avg"]}]}]
  :width 800
  :height 800
  :scales [{:name "yscale"
            :type "band"
            :domain {:data "english-education" :field "size_flag"}
            :range "height"}
           {:name "xscale"
            :domain [-12 12]
            :range "width"}]
  :axes [{:orient "bottom" :scale "xscale" :grid true :title "Education score"}
         {:orient "left" :scale "yscale" :title "Town size"}]
  :marks [{:type "symbol"
           :from {:data "english-education"}
           :encode {:enter {:yfocus {:field "size_flag"
                                     :scale "yscale"
                                     :band 0.5}
                            :xfocus {:field "education_score"
                                     :scale "xscale"
                                     :band 0.5}
                            :fill {:value "skyblue"}
                            :stroke {:value "white"}
                            :strokeWidth {:value 1}
                            :zindex {:value 0}}}
           :transform [{:type "force"
                        :static true
                        :forces [{:force "x" :x "xfocus"}
                                 {:force "y" :y "yfocus"}
                                 {:force "collide" :radius 4}]}]}
          ;; this is the new layer with the lines -- it's second so that the lines show up on top
          {:type "symbol"
           ;; we tell it to use the data from our new, computed dataset
           :from {:data "averages"}
           :encode {:enter {:x {:field "avg" :scale "xscale"}
                            :y {:field "size_flag" :scale "yscale" :band 0.5}
                            :shape {:value "stroke"}
                            :strokeWidth {:value 1.5}
                            :stroke {:value "black"}
                            :size {:value 14000}
                            :angle {:value 90}}}}]})
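The `aggregate` transform in the spec above groups the rows by `size_flag` and averages `education_score` within each group. In plain Clojure the same idea could be sketched like this (`average-by` is a hypothetical helper, and the rows are made up for illustration rather than taken from the dataset):

```clojure
(defn average-by
  "Groups rows (maps) by group-key and averages value-key within each group."
  [rows group-key value-key]
  (into {}
        (map (fn [[group members]]
               [group (/ (reduce + (map value-key members))
                         (count members))]))
        (group-by group-key rows)))

;; Made-up rows, not real scores
(average-by [{:size_flag "Small Towns" :education_score 1.0}
             {:size_flag "Small Towns" :education_score 3.0}
             {:size_flag "Large Towns" :education_score -1.0}]
            :size_flag
            :education_score)
;; => {"Small Towns" 2.0, "Large Towns" -1.0}
```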

We can add the town selection dropdown like in the example graph using vega signals.

(let [town-name-values (-> english-education
                           (tc/select-columns [:town11nm])
                           tc/rows
                           flatten
                           sort
                           (conj ""))]
  (kind/vega
   {:data [{:name "english-education"
            :values (slurp "data/year_2024/week_4/english_education.csv")
            :format {:type "csv"}}
           {:name "averages"
            :source "english-education"
            :transform [{:type "aggregate",
                         :fields ["education_score"],
                         :groupby ["size_flag"],
                         :ops ["average"],
                         :as ["avg"]}]}]
    :width 800
    :height 800
    :scales [{:name "yscale"
              :type "band"
              :domain {:data "english-education" :field "size_flag"}
              :range "height"}
             {:name "xscale"
              :domain [-12 12]
              :range "width"}]
    :axes [{:orient "bottom" :scale "xscale" :grid true :title "Education score"}
           {:orient "left" :scale "yscale" :title "Town size"}]
    :signals [{:name "highlight"
               :value nil
               :bind {:input :select
                      :options town-name-values
                      :labels (map #(str/replace % #" BUAS?D?" "") town-name-values)}}]
    :marks [{:type "symbol"
             :from {:data "english-education"}
             :encode {:enter {:yfocus {:field "size_flag"
                                       :scale "yscale"
                                       :band 0.5}
                              :xfocus {:field "education_score"
                                       :scale "xscale"
                                       :band 0.5}
                              :fill {:value "skyblue"}
                              :stroke {:value "white"}
                              :strokeWidth {:value 1}
                              :zindex {:value 0}}
                      :update {:fill {:signal "datum['town11nm'] === highlight ? 'orange' : 'skyblue'"}
                               :stroke {:signal "datum['town11nm'] === highlight ? 'purple' : 'white'"}
                               :strokeWidth {:signal "datum['town11nm'] === highlight ? 2 : 1"}
                               :zindex {:signal "datum['town11nm'] === highlight ? 1 : 0"}}}
             :transform [{:type "force"
                          :static true
                          :forces [{:force "x" :x "xfocus"}
                                   {:force "y" :y "yfocus"}
                                   {:force "collide" :radius 4}]}]}
            {:type "symbol"
             :from {:data "averages"}
             :encode {:enter {:x {:field "avg" :scale "xscale"}
                              :y {:field "size_flag" :scale "yscale" :band 0.5}
                              :shape {:value "stroke"}
                              :strokeWidth {:value 1.5}
                              :stroke {:value "black"}
                              :size {:value 14000}
                              :angle {:value 90}}}}]}))

The select dropdown isn’t super nice or in a great spot. The way to style and position it better is with CSS, but I’m going to call that out of scope for this exercise. In theory, this dropdown would be above the chart and look much nicer. This is great, though. We have more or less re-created the main graphic from the article. We can see that smaller towns seem to have higher educational attainment on average, and check the value for any given town by selecting its value from a dropdown.
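To see what the `:labels` expression in the signal does, here is the same `str/replace` call applied to a few names that appear in the printed data (`town-label` is just a name for this sketch). Note that the lowercase `s` in the last name is not matched by the pattern, so it survives into the label; whether that matters for the dropdown is a judgment call.

```clojure
(require '[clojure.string :as str])

(defn town-label
  "Strips the BUA/BUASD suffix the same way the dropdown labels do."
  [town-name]
  (str/replace town-name #" BUAS?D?" ""))

(map town-label ["Carlton in Lindrick BUA" "Reading BUASD" "Inner London BUAs"])
;; => ("Carlton in Lindrick" "Reading" "Inner Londons")
```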

One of the many projects I’d like to tackle is making a wrapper for vega to make it intuitive to use from Clojure. Anyway, moving on to the next graph in the article.

Income deprivation and town size

Normalized, stacked bar chart

The next one is a simple normalized stacked bar chart exploring the relationship between town size and income deprivation. For this one we can drop back into vega-lite (and hanami):

(-> english-education
    (hanami/plot ht/bar-chart
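                 {;; count the towns in each group; a sum of :education_score values
                  ;; is not the count that the axis title describes
                  :XAGG "count"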
                  :XSTACK "normalize"
                  :XTITLE "Count"
                  :X :education_score
                  :YTYPE "nominal"
                  :Y :size_flag
                  :YTITLE "Town size"
                  :COLOR "income_flag"
                  :CTYPE "nominal"}))

The data from the “BUA”s and London is making this look less than ideal, so we’ll filter for only the categories of towns that are included in the article’s graph:

(-> english-education
    (tc/select-rows #(#{"Small Towns" "Medium Towns" "Large Towns" "City"}
                       (:size_flag %)))
    (hanami/plot ht/bar-chart
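                 {;; count the towns in each group; a sum of :education_score values
                  ;; is not the count that the axis title describes
                  :XAGG "count"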
                  :XSTACK "normalize"
                  :XTITLE "Count"
                  :X :education_score
                  :YTYPE "nominal"
                  :Y :size_flag
                  :YTITLE "Town size"
                  :COLOR "income_flag"
                  :CTYPE "nominal"}))

This seems to show that the incidence of income deprivation is higher the larger the town size. We can visualize this relationship in another way by making a scatterplot, like in the article.
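What a normalized stacked bar encodes is, for each town size, the share of towns that fall into each income category. A plain-Clojure sketch of that calculation, including the same set-based row filter, could look like this (`town-sizes` and `income-shares` are hypothetical names, and the sample rows are made up):

```clojure
(def town-sizes #{"Small Towns" "Medium Towns" "Large Towns" "City"})

(defn income-shares
  "For each town size in town-sizes, returns the share of rows in each income category."
  [rows]
  (->> rows
       (filter (comp town-sizes :size_flag)) ; a set works as a predicate
       (group-by :size_flag)
       (map (fn [[size members]]
              (let [n (count members)]
                [size (into {}
                            (map (fn [[flag c]] [flag (double (/ c n))]))
                            (frequencies (map :income_flag members)))])))
       (into {})))

;; Made-up rows; the "Not BUA" row is dropped by the filter
(income-shares [{:size_flag "Small Towns" :income_flag "Lower deprivation towns"}
                {:size_flag "Small Towns" :income_flag "Higher deprivation towns"}
                {:size_flag "Not BUA" :income_flag nil}])
;; => {"Small Towns" {"Lower deprivation towns" 0.5, "Higher deprivation towns" 0.5}}
```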

Scatterplot

For this graph we’ll use the dataset provided in the article (since it contains income deprivation scores, not just classifications).

(-> "data/year_2024/week_4/education-and-income-scores.csv"
    tc/dataset
    (hanami/plot ht/point-chart
                 {:X "Educational attainment score"
                  :Y "Income deprivation score"
                  :YSCALE {:zero false}}))

This reveals that income deprivation seems to be worse in larger towns, but this isn’t the whole story. The article also investigates regional differences. To see the effect of region on education scores, we can plot scores vs region and encode the income deprivation category in the color of the points. We’ll go back to our original dataset for this. The rgn11nm column contains the region we’re looking for.

Education score by region

(-> english-education
    (hanami/plot ht/point-chart
                 {:X :education_score
                  :Y :rgn11nm
                  :YTYPE "nominal"
                  :COLOR "income_flag"
                  :CTYPE "nominal"}))

This is pretty close, but we can take the average of the scores for each income deprivation category in a given region to reduce some of the noise, and make the points larger so they’re easier to see.

(-> english-education
    (hanami/plot ht/point-chart
                 {:X :education_score
                  :XAGG "mean"
                  :Y :rgn11nm
                  :YTYPE "nominal"
                  :MSIZE 100
                  :COLOR "income_flag"
                  :CTYPE "nominal"}))

This reveals an interesting relationship between region and education scores. The North West has the highest education scores at all levels of income deprivation.

Effect of being near the coast

The next graph is an interesting one showing that coastal towns have worse outcomes than non-coastal towns. The graph is similar to the previous one, but in order to get the data into a sensible shape to visualize I’m going to wrangle it a bit ahead of throwing it into our viz function. We have the coastal_detailed column that gives us, well, details about the coastal-ness of each town. Inspecting the distinct values in this column shows us it’s a bit of a mess, though.

(-> english-education :coastal_detailed vec distinct)
("Smaller non-coastal town"
 "Large non-coastal town"
 "Smaller seaside town"
 "Large seaside town"
 "Smaller other coastal town"
 "Large other coastal town"
 "Cities"
 nil)

This one column contains strings that describe a town as either “smaller” or “large”, and “non-coastal”, “seaside”, or “other coastal”. The first thing we’ll do is tidy up this data. Each variable should be in its own column, plus we’ll delete all the rows that don’t have a value for coastal_detailed, along with the “Cities” rows, which don’t fit either variable. Then we’ll pass it to our viz function, taking the average of the education score for each category we’re plotting, just like above.

(-> english-education
    (tc/select-columns [:education_score :coastal_detailed])
    (tc/drop-missing :coastal_detailed)
    ;; "Cities" would otherwise fall through to "Large" and "Other coastal" below
    (tc/drop-rows #(= "Cities" (:coastal_detailed %)))
    (tc/map-columns :town_size [:coastal_detailed]
                    (fn [v]
                      (when v
                        (if (str/includes? v "Smaller") "Small" "Large"))))
    (tc/map-columns :coastal_type [:coastal_detailed]
                    (fn [v]
                      (when v
                        (cond
                          (str/includes? v "non-coastal") "Non-coastal"
                          (str/includes? v "seaside") "Seaside"
                          :else "Other coastal"))))
    (hanami/plot ht/point-chart
                 {:X :education_score
                  :XAGG "mean"
                  :Y :town_size
                  :YTYPE "nominal"
                  :COLOR "coastal_type"
                  :CTYPE "nominal"
                  :MSIZE 100
                  :HEIGHT 100}))

This reveals an interesting relationship between the proximity of a town to the coast and education scores. For whatever reason, inland towns have better educational attainment.
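The string-splitting logic used for this chart can be checked on its own, away from the dataset. Here is a sketch as a single pure function (`split-coastal-detail` is a made-up name); it returns nil for “Cities”, on the assumption that this value describes neither a town size nor a coastal type:

```clojure
(require '[clojure.string :as str])

(defn split-coastal-detail
  "Splits a coastal_detailed string into its two variables. Returns nil for
  nil and for \"Cities\" (assumed to belong to neither variable)."
  [v]
  (when (and v (not= "Cities" v))
    {:town_size    (if (str/includes? v "Smaller") "Small" "Large")
     :coastal_type (cond
                     (str/includes? v "non-coastal") "Non-coastal"
                     (str/includes? v "seaside")     "Seaside"
                     :else                           "Other coastal")}))

(map split-coastal-detail
     ["Smaller seaside town" "Large other coastal town" "Cities"])
;; => ({:town_size "Small", :coastal_type "Seaside"}
;;     {:town_size "Large", :coastal_type "Other coastal"}
;;     nil)
```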

Widening educational attainment gap over time

The article also observes that the gap in educational attainment between students from high vs low income deprivation areas widens over time. We can see this in the data by finding the average “key stage 2” (end of primary school in the UK) attainment in a given income deprivation category and comparing it to two later measures of educational attainment (key stage 4 and level 3).

I’ll define a function for computing the average of a sequence of numbers, for the sake of clarifying the data transformation.

(defn- average [vals]
  (/ (reduce + vals) (count vals)))
#'year-2024.week-4.analysis/average
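One caveat with a hand-rolled average like this: `+` throws on nil, so a column with missing values would need them removed first. A more defensive variant might look like the following (a sketch only; `average-present` is a hypothetical name and isn’t used elsewhere in this analysis):

```clojure
(defn average-present
  "Averages the non-nil numbers in xs; returns nil when there are none."
  [xs]
  (let [present (remove nil? xs)]
    (when (seq present)
      (/ (reduce + present) (count present)))))

(average-present [70.0 nil 80.0])
;; => 75.0
```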

We’ll also filter out rows that have “Cities” or nil as their income_flag because these aren’t comparable to the other income flag values.

(-> english-education
    (tc/drop-missing :income_flag)
    (tc/drop-rows #(= "Cities" (:income_flag %)))
    (tc/group-by [:income_flag])
    (tc/aggregate-columns [:key_stage_2_attainment_school_year_2007_to_2008
                           :key_stage_4_attainment_school_year_2012_to_2013
                           :level_3_at_age_18]
                          average))

_unnamed [3 4]:

:income_flag :key_stage_2_attainment_school_year_2007_to_2008 :key_stage_4_attainment_school_year_2012_to_2013 :level_3_at_age_18
Higher deprivation towns 69.46241954 55.61679938 41.59478974
Mid deprivation towns 72.95656558 59.34720275 47.82498013
Lower deprivation towns 79.29026824 67.95547806 57.96412983

We’ll calculate the gap in attainment between the different deprivation categories. I’m going to rename the columns first to make them more succinct. Then we’ll add new columns that show, for each stage, the gap between a given income deprivation level and the lowest deprivation category.

(let [ds (-> english-education
             (tc/drop-missing :income_flag)
             (tc/drop-rows #(= "Cities" (:income_flag %)))
             (tc/group-by [:income_flag])
             (tc/aggregate-columns [:key_stage_2_attainment_school_year_2007_to_2008
                                    :key_stage_4_attainment_school_year_2012_to_2013
                                    :level_3_at_age_18]
                                   average)
             (tc/rename-columns {:key_stage_2_attainment_school_year_2007_to_2008 :key_stage_2
                                 :key_stage_4_attainment_school_year_2012_to_2013 :key_stage_4
                                 :level_3_at_age_18 :level_3}))
      lowest-deprivation-vals (tc/select-rows ds #(str/includes? (:income_flag %) "Lower"))]
  (-> ds
      (tc/map-columns :key_stage_2_gap [:key_stage_2]
                      #(- % (first (:key_stage_2 lowest-deprivation-vals))))
      (tc/map-columns :key_stage_4_gap [:key_stage_4]
                      #(- % (first (:key_stage_4 lowest-deprivation-vals))))
      (tc/map-columns :level_3_gap [:level_3]
                      #(- % (first (:level_3 lowest-deprivation-vals))))))

_unnamed [3 7]:

| :income_flag             | :key_stage_2 | :key_stage_4 |    :level_3 | :key_stage_2_gap | :key_stage_4_gap | :level_3_gap |
|--------------------------|-------------:|-------------:|------------:|-----------------:|-----------------:|-------------:|
| Higher deprivation towns |  69.46241954 |  55.61679938 | 41.59478974 |      -9.82784870 |     -12.33867868 | -16.36934009 |
| Mid deprivation towns    |  72.95656558 |  59.34720275 | 47.82498013 |      -6.33370266 |      -8.60827531 | -10.13914970 |
| Lower deprivation towns  |  79.29026824 |  67.95547806 | 57.96412983 |       0.00000000 |       0.00000000 |   0.00000000 |
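
As a quick sanity check on the gap columns, the same subtraction can be done by hand in plain Clojure. This is only a sketch: `level-3-averages` copies the Level 3 averages from the table above, and `gaps-vs-baseline` is a hypothetical helper written for this illustration, not a tablecloth function. It assumes Clojure 1.11 or later for `update-vals`.

```clojure
;; Level 3 averages copied from the table above.
(def level-3-averages
  {"Higher deprivation towns" 41.59478974
   "Mid deprivation towns"    47.82498013
   "Lower deprivation towns"  57.96412983})

(defn gaps-vs-baseline
  "Subtracts the baseline category's value from every value in `averages`.
  Hypothetical helper for illustration only."
  [averages baseline-key]
  (let [baseline (get averages baseline-key)]
    (update-vals averages #(- % baseline))))

;; Gap of each category relative to the lower deprivation towns.
(gaps-vs-baseline level-3-averages "Lower deprivation towns")
```

The baseline category always comes out as zero, and the other values should agree with the `:level_3_gap` column up to floating-point rounding.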

This data is already pretty easy to interpret just as a table. For the sake of the exercise, we can plot it in a similar way to the article. To do this we’ll tidy up the data in a different way, in all senses of the word. Right now the data is structured so that it’s easy to read when printed as a table, but it’s hard to plot because one variable (the educational attainment stage) is encoded in the column names and spread across multiple columns. To fix this, we’ll use tablecloth’s pivot->longer, so that each row represents one observation and each column represents a single variable. Then we can plot it.

(let [ds (-> english-education
             (tc/drop-missing :income_flag)
             (tc/drop-rows #(= "Cities" (:income_flag %)))
             (tc/group-by [:income_flag])
             (tc/aggregate-columns [:key_stage_2_attainment_school_year_2007_to_2008
                                    :key_stage_4_attainment_school_year_2012_to_2013
                                    :level_3_at_age_18]
                                   average)
             (tc/rename-columns {:key_stage_2_attainment_school_year_2007_to_2008 :key_stage_2
                                 :key_stage_4_attainment_school_year_2012_to_2013 :key_stage_4
                                 :level_3_at_age_18 :level_3}))
      lowest-deprivation-vals (tc/select-rows ds #(str/includes? (:income_flag %) "Lower"))]
  (-> ds
      (tc/map-columns :key_stage_2_gap [:key_stage_2]
                      #(- % (first (:key_stage_2 lowest-deprivation-vals))))
      (tc/map-columns :key_stage_4_gap [:key_stage_4]
                      #(- % (first (:key_stage_4 lowest-deprivation-vals))))
      (tc/map-columns :level_3_gap [:level_3]
                      #(- % (first (:level_3 lowest-deprivation-vals))))
      (tc/select-columns [:income_flag :key_stage_2_gap :key_stage_4_gap :level_3_gap])
      (tc/pivot->longer [:key_stage_2_gap :key_stage_4_gap :level_3_gap]
                        {:target-columns :gap
                         :value-column-name :value})
      (hanami/plot ht/point-chart
                   {:X :gap
                    :XTYPE "nominal"
                    :Y :value
                    :COLOR "income_flag"
                    :CTYPE "nominal"
                    :MSIZE 100})))

This could be tidied up a bit with nicer names and colours, but this is the idea. I think the main takeaway from this graph is that data should be tidy for the sake of visualizing it, and that naming columns is important.
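
To make the wide-to-long reshaping concrete, here is a minimal sketch of the idea in plain Clojure, without tablecloth. The `pivot-longer` function below is a hypothetical stand-in that only mimics what `tc/pivot->longer` does for this simple case, and the values in `wide-rows` are the gaps from the earlier table rounded to two decimals.

```clojure
;; Two rows in "wide" form: one column per attainment stage.
(def wide-rows
  [{:income_flag "Higher deprivation towns"
    :key_stage_2_gap -9.83 :key_stage_4_gap -12.34 :level_3_gap -16.37}
   {:income_flag "Mid deprivation towns"
    :key_stage_2_gap -6.33 :key_stage_4_gap -8.61 :level_3_gap -10.14}])

(defn pivot-longer
  "Turns each of `cols` into its own row, storing the column name under
  `name-key` and the cell value under `value-key`.
  Hypothetical helper for illustration only."
  [rows cols name-key value-key]
  (for [row rows
        col cols]
    (-> (apply dissoc row cols)
        (assoc name-key col
               value-key (get row col)))))

;; Six rows come out: one per income_flag/stage pair.
(pivot-longer wide-rows
              [:key_stage_2_gap :key_stage_4_gap :level_3_gap]
              :gap
              :value)
```

Each resulting row has exactly the three fields the chart needs: the category for the colour, the stage for the x axis, and the value for the y axis.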

Pursuit of higher education

The next question we can answer with this data is how many students from towns of different sizes pursue higher education. This chart looks complex but it’s just a simple bar chart faceted by educational attainment milestones. We have the information to make this chart, but again it’s not organized in a nice way for visualizing.

The data we care about for this chart are the town size flag and the three educational attainment values in the columns level_3_at_age_18, activity_at_age_19:_full-time_higher_education, and activity_at_age_19:_sustained_further_education. We’ll pivot our data again, in a different way this time, to organize it for this visualization, then aggregate the values to calculate the average for each town size/attainment measure pair. As an added bonus, the data in the activity... columns are strings, not numbers, so we’ll coerce those to numbers too so that we can do math on them (like calculating the average). Values that can’t be parsed, such as the * placeholders in the raw data, come back as nil and are dropped before averaging.

(-> english-education
    (tc/select-columns [:size_flag
                        :level_3_at_age_18
                        :activity_at_age_19:_full-time_higher_education
                        :activity_at_age_19:_sustained_further_education])
    (tc/rename-columns {:level_3_at_age_18 "Level 3 qualifications at age 18"
                        :activity_at_age_19:_full-time_higher_education "In full-time higher education at age 19"
                        :activity_at_age_19:_sustained_further_education  "In further education at age 19"})
    (tc/pivot->longer ["Level 3 qualifications at age 18"
                       "In full-time higher education at age 19"
                       "In further education at age 19"]
                      {:target-columns :attainment_measure
                       :value-column-name :value})
    ;; coerce all the values to be numbers
    (tc/update-columns :value (partial map #(if (string? %) (parse-double %) %)))
    ;; lose the nil ones, it messes up our calculation
    (tc/drop-missing :value)
    ;; calculate the average value for each size flag/attainment measure pair
    (tc/group-by [:size_flag :attainment_measure])
    (tc/aggregate-columns :value average)
    ;; plot this
    (hanami/plot {:facet :FACET
                  :spec {:encoding :ENCODING
                         :layer :LAYER
                         :width :WIDTH}
                  :data {:values :DATA
                         :format :DFMT}}
                 {:Y :size_flag
                  :YTITLE "Town size"
                  :YTYPE "nominal"
                  :YSORT ["Small towns"
                          "Medium towns"
                          "Large towns"
                          "Cities"
                          "Outer London"
                          "Inner London"]
                  :X :value
                  :XTITLE "Percentage of population"
                  :FACET {:column {:field :attainment_measure
                                   :type "nominal"
                                   :title "Attainment measure"
                                   :sort ["Level 3 qualifications at age 18"
                                          "In full-time higher education at age 19"
                                          "In further education at age 19"]}}
                  :LAYER [{:mark "bar"}
                          {:mark {:type "text" :dx -3 :color "white" :align "right"}
                           :encoding {:text {:field "value" :format ".1f"}}}]
                  :WIDTH 140}))
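
The coercion step in the pipeline above relies on `parse-double` (Clojure 1.11 or later) returning nil for strings it cannot parse, such as the `*` placeholders in the raw data. A small sketch of that behaviour in isolation follows. The helper names `->number` and `mean-of-present` and the sample values are made up for illustration, and `mean-of-present` is not the `average` function used above.

```clojure
(defn ->number
  "Parses strings as doubles and leaves every other value untouched.
  Unparseable strings such as \"*\" become nil."
  [v]
  (if (string? v) (parse-double v) v))

(defn mean-of-present
  "Averages the values that are still numbers after coercion,
  or returns nil when nothing is left."
  [values]
  (let [nums (remove nil? (map ->number values))]
    (when (seq nums)
      (/ (reduce + nums) (count nums)))))

;; "*" and nil are ignored; the result is the mean of 30.5 and 21.5.
(mean-of-present ["30.5" "*" 21.5 nil])
;; => 26.0
```

Dropping the nils before averaging matters: leaving them in would either throw or quietly skew the result, depending on how the average is computed.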

Connection to other town residents

The last graph in the article is one showing the relationship between the educational attainment of older and younger residents. We can see that education scores seem to be correlated with the incidence of high educational attainment among older residents of a town. We don’t have exactly the right data to reproduce this graph, but we can do something similar with the data we do have.

We can plot the education score against the educational attainment classification of residents aged 35-64, adding some jitter to spread out the dots. This reveals the same general relationship as the more precise data: students in towns with more highly educated older generations tend to have higher educational attainment. That said, there is a large overlap between the highest scores among low-education towns and the lowest scores among high-education towns, so many other factors are clearly at play. It’s an interesting observation all the same.

(-> english-education
    (tc/drop-missing :level4qual_residents35-64_2011)
    (hanami/plot {:encoding (assoc (hc/get-default :ENCODING)
                                   :yOffset :YOFFSET)
                  :mark {:type "circle"}
                  :transform :TRANSFORM
                  :height 300
                  :width 500
                  :data {:values :DATA
                         :format :DFMT}}
                 {:X :education_score
                  :Y :level4qual_residents35-64_2011
                  :YTYPE "nominal"
                  :YSORT ["High" "Medium" "Low"]
                  :YOFFSET {:field  "jitter" :type "quantitative"}
                  :TRANSFORM [{:calculate "sqrt(-2*log(random()))*cos(2*PI*random())"
                               :as "jitter"}]}))
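
The jitter expression in the transform is the Box-Muller formula, which turns two uniform random numbers into one normally distributed value. Here is a rough sketch of the same calculation in Clojure, assuming a seeded `java.util.Random` so the output is repeatable. The names `box-muller` and `jitter-samples` are hypothetical helpers for this illustration, not part of Hanami or Vega-Lite.

```clojure
(defn box-muller
  "Maps two uniform samples, u1 in (0, 1] and u2 in [0, 1), to one
  standard-normal sample: sqrt(-2 * ln u1) * cos(2 * PI * u2)."
  [u1 u2]
  (* (Math/sqrt (* -2.0 (Math/log u1)))
     (Math/cos (* 2.0 Math/PI u2))))

(defn jitter-samples
  "Returns a vector of n jitter values from a seeded generator."
  [n seed]
  (let [rng (java.util.Random. (long seed))]
    (vec (repeatedly n
                     ;; 1 - nextDouble keeps u1 away from 0, where log diverges
                     #(box-muller (- 1.0 (.nextDouble rng))
                                  (.nextDouble rng))))))

;; Five example offsets; most fall within a few units of zero.
(jitter-samples 5 42)
```

Because the offsets are roughly normal rather than uniform, the dots cluster near the centre of each band and thin out toward the edges.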

That sums up our work this week. There are so many graphs in this one, and it was a lot of fun to play around and see some common patterns emerging. I’m looking forward to many projects revolving around tidying up the dataviz story in Clojure.

See you next week :)

source: src/year_2024/week_4/analysis.clj