<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:webfeeds="http://webfeeds.org/rss/1.0" version="2.0">
  <channel>
    <title>Dringtech</title>
    <link>https://dringtech.com/</link>
    <atom:link href="https://dringtech.com/feed.rss" rel="self" type="application/rss+xml"/>
    <lastBuildDate>Sat, 20 Dec 2025 01:17:50 GMT</lastBuildDate>
    <language>en</language>
    <generator>Lume v3.0.0</generator>
    <item>
      <title>Weeknotes 2024-W47</title>
      <link>https://dringtech.com/blog/2024/weeknotes-2024-W47/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2024/weeknotes-2024-W47/</guid>
      <description>Project progress on Bradford 2025 and a draft schema for event data publishing.</description>
      <content:encoded>
        <![CDATA[<p>A few weeks with no weeknotes! Finally getting round to it late on a Friday
evening, so this one might be a bit short.</p>
<p>Good, if slow, progress on publishing the Bradford 2025 data. Volunteering data
is now nearly ready to go. In preparing for this, I've developed some minimal
data governance which identifies corporate risk and seeks sign-off from key
personnel. I've also made links with the communications team, who have expressed
concern that going public will mean they have a more difficult job. My counter
argument is typically that people would criticise in any case, and having the
facts in the form of data should make things easier. We've committed to getting
the data published next week. Watch this space!</p>
<p>Meanwhile, I've been working with colleagues at Open Innovations to plan the
next releases. Some minor progress on technical adapters, which will mean that
pulling the data is easy to do.</p>
<p>Not a lot of other work, although I did develop a
<a href="https://github.com/dringtech/json-schemas/blob/main/schemas/culture/draft/2024-10/events.schema.json">schema for publishing event lists</a>
as part of some work I did to help publish the Hebden Bridge Picture House
listings on the
<a href="https://whatsonhebdenbridge.com/">What's On Hebden Bridge site</a>. Planning to do
the same for The Trades Club.</p>
<p>Links, in no particular order:</p>
<ul>
<li>A very clever
<a href="https://piccalil.li/blog/making-content-aware-components-using-css-has-grid-and-quantity-queries/">design for content responsive layouts</a>
using modern CSS.</li>
<li><a href="https://www.benjystanton.co.uk/blog/inclusive-design-resources/">A list of articles on inclusive design</a></li>
<li>The <a href="https://weird.one/">Weird One</a> prosocial web, where you can set up a home
on the web away from the big social media networks. Mine is here:
<a href="https://giles.dri.ng/">https://giles.dri.ng/</a></li>
<li>A list of
<a href="https://european-alternatives.eu/alternatives-to">European hosted alternatives to US services</a>.</li>
<li>Some
<a href="https://turso.tech/blog/simple-trick-to-save-environment-and-money-when-using-github-actions">guidance on more efficient GitHub workflows</a></li>
</ul>
]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Weeknotes 2024-W42</title>
      <link>https://dringtech.com/blog/2024/weeknotes-2024-W42/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2024/weeknotes-2024-W42/</guid>
      <description>
Missed a week, but now back on it. Some very satisfying progress on a design system and some data governance. Exciting software development ideas.
      </description>
      <content:encoded>
<![CDATA[<p>I missed weeknote-ing last week, given everything that was going on, so this will be a bumper edition…
except that one of the projects is still shrouded in secrecy…</p>
<p>The Bradford 2025 open data work has been progressing at pace. Lots of work on
data governance processes and
<a href="https://open-innovations.github.io/bradford-2025/design/">the design system</a>.
The basic volunteering data is more or less ready to go, once we get approval to make it open.
We've also had some very productive discussions about the ongoing work, which is likely to
be fairly intense for the next couple of months, then drop down to maintenance.
I'm really enjoying the slightly deeper work around setting up governance, design systems
and skills transfer to the evaluation team.</p>
<p>With this in mind, I've also started a short course on
<a href="https://www.futurelearn.com/courses/evaluation-for-arts-culture-and-heritage-principles-and-practice">Evaluation for Arts, Culture, and Heritage: Principles and Practice</a>
that's been set up by the Centre for Cultural Value at Leeds University.
I've met a few of the folk from the CCV through Leeds 2023 and Bradford 2025 culture work,
and my initial impressions of the course are very favourable.
It's reinforcing some of the things that I've discussed when talking about culture data,
notably the difficulty of codifying the actual evaluation outcomes, rather than just the
monitoring aspects. My view, which I'm interested to test as the course unfolds,
is that the monitoring data is a useful backdrop to the actual work of evaluation.
If it's doing its job well, the backdrop provides evidence of the impact of
cultural stuff. There is a tendency for overreach of aims: can culture solve all of society's ills?
Should it? More on this in coming weeks!</p>
<p>Back at the monitoring level, I will need to slightly refine my
<a href="https://dringtech.com/blog/2024/weeknotes-2024-W40/#small-value-masker">data masking technique</a>,
as I realised during the week that if only one value was below the clip level, the replacement value
of the average would be the same as the original value! Looks like all the highly trained statisticians
at the ONS know what they're doing! On that basis, I'll resort to dropping the raw values if there are
fewer than 3 (?) values.</p>
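<p>A sketch of the refined logic (illustrative Python of my own, not the project code): if too few values fall below the clip level, suppress them outright rather than averaging, since a group of one would average to itself.</p>

```python
def mask_small_values(values, clip=5, min_group=3):
    """Mask values below the clip level.

    With fewer than `min_group` small values, averaging would leak the
    original (a group of one averages to itself), so suppress instead.
    """
    small = [v for v in values if v < clip]
    if len(small) < min_group:
        # Too few to mask safely: drop the raw values entirely
        return [v if v >= clip else None for v in values]
    avg = sum(small) / len(small)
    return [v if v >= clip else avg for v in values]
```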
<div style="display: flex;flex-wrap:wrap;">
<p style="flex-basis:20rem;flex-grow:1;">
One thing I did on the secret project that I can talk about, I reckon, is build a new visualisation
of links between clusters. I've been visualising clusters of
<a href="https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic">UK Company SIC codes</a>,
which although not a perfect marker, are at least relatively well documented, and for more traditional
organisations are probably OK. I used a variant of
<a href="https://observablehq.com/@d3/bilevel-edge-bundling">Bilevel Edge Bundling</a>
which ended up being quite a useful visualisation. The server rendering is done as a <a href="https://lume.land/">Lume</a>
component, with some interactivity (highlighting, not updating) added using some minimal Javascript.
Here's what it looked like:
</p>
<img style="margin-inline:auto;flex-basis:20rem;flex-shrink:0;" src="https://dringtech.com/assets/uploads/bilevel-edge-bundling.png" />
</div>
<p>I've latterly been pondering running my own instances of useful services as a way of decoupling from
cloud services which are increasingly being enshittified and sullied with AI. I've currently got an instance
of <a href="https://openproject.org/">Open Project</a> running, where I sometimes plan larger bits of work.
I've tinkered with a few other services, but the two which would logically fit next to that are
<a href="https://www.freeipa.org/">FreeIPA</a> to manage users and access and
<a href="https://forgejo.org/">Forgejo</a> to manage software development (this is the software that powers <a href="https://codeberg.org/">Codeberg</a>).
No firm plans, but I found myself re-researching this and wanted to capture it here.</p>
<p>Some while ago I presented the work I did on the <a href="https://data.leeds2023.co.uk/">Leeds 2023 data site</a>.
The organising team has written up a
<a href="https://www.nottingham.ac.uk/clas/departments/culturalmediaandvisualstudies/research/visioning-a-creative-and-cultural-county/blog/blog-posts/visioning-an-audience-data-strategy.aspx">blog post about the Visioning an audience data strategy event</a>, and kindly sent it to me.</p>
<p>Links, in no particular order:</p>
<ul>
<li>There was some discussion over on Mastodon about <a href="https://www.tldraw.com/">tldraw infinite canvas component</a>,
which was being recommended, but appears to have gone closed-source.
An alternative is the <a href="https://dgrm.net/">DGRM editor</a>. Both look interesting!</li>
<li>Matt Edgar boosted a
<a href="https://petafloptimism.com/2024/10/08/wibble-y-wobble-y-pace-y-wace-y/">blog post about Stewart Brand's pace layers</a>
as applied to services. The conversion of infrastructure to OpEx, and the vesting of more and more
operational stuff in bundled services, was an interesting perspective.</li>
<li>I keep meaning to get into <a href="https://www.kaggle.com/">Kaggle</a>, particularly the competitions.</li>
<li>I should consider joining a union. <a href="https://utaw.tech/">United Tech and Allied Workers (UTAW)</a> looks like a good one.</li>
<li>I came across the brilliant <a href="https://htmlforpeople.com/"><q>HTML is for People</q> course</a> that Blake Watson has created.
It's a really great introduction which I'm going to try to get the kids to have a look at!</li>
<li>More from the CCV: this time a
<a href="https://www.culturalvalue.org.uk/transforming-cultural-sector-data/">project that is aiming to scope out a national Cultural data observatory</a>.</li>
</ul>
<p>And finally (phew!), I ended up going down a rabbit hole on a quote that someone shared and which rang true for me:</p>
<figure style="border-inline-start: 0.5rem solid var(--color-green);">
<blockquote>
<p>
The Biggest Problem in Communication Is the Illusion That It Has Taken Place
</p>
</blockquote>
<figcaption style="font-size:0.9em;font-style: oblique; margin-inline-start: 3rem;margin-block-start:0.5rem;">
&mdash; <a href="https://quoteinvestigator.com/2014/08/31/illusion/">Find out who Quote Investigator thinks (probably) said this</a>
</figcaption>
</figure>
]]>
      </content:encoded>
      <pubDate>Fri, 18 Oct 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Weeknotes 2024-W40</title>
      <link>https://dringtech.com/blog/2024/weeknotes-2024-W40/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2024/weeknotes-2024-W40/</guid>
      <description>Scope wobbles, massive progress and data masking techniques.</description>
      <content:encoded>
        <![CDATA[<p>Short one this week, as it's been a heck of a time for various reasons.</p>
<p>Most of the week has been consumed with some rocks thrown into a project about
the scope of an initial phase. Let's just say that managing expectations and
getting feedback early is very important! It's a pity, as the work is progressing
well overall. Consultancy is hard!</p>
<p>Other projects have been more productive. The Bradford 2025 data publishing
programme attained automated extract from the first source system. Quite a
milestone, which allows / requires getting the data governance bit sorted out.
To this end, I've documented the process and drafted some risks to be discussed
with the risk owner. This should lead to the data being released, with any luck.
We could, by the end of next week, have an initial site sorted out.
Watch this space.</p>
<p>One of the steps on the data release was the handling of small numbers in summaries
with the risk of personal identification. Often small numbers are suppressed in
released data to avoid this eventuality, but there are other techniques.
<a href="https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/methodologies/comparisonofposttabularstatisticaldisclosurecontrolmethods">ONS has published information on disclosure control</a>, but in our case,
a simpler technique would suffice. <span id="small-value-masker">I built a small value masker which bisects the
dataset based on a clip level, then replaces the numbers below the clip with the
average of the small numbers.</span> The thinking here is that the overall distribution
and totals should be similar in this case. I'll do some more detailed analysis of
this next week, but for now, it'll do!</p>
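<p>As a sketch (in Python, with names of my own invention; the actual implementation may differ), the masker might look like this:</p>

```python
def mask_small_values(values, clip=5):
    """Replace values below the clip level with the average of the
    small values, roughly preserving the total and distribution."""
    small = [v for v in values if v < clip]
    if not small:
        return list(values)
    avg = sum(small) / len(small)
    return [v if v >= clip else avg for v in values]
```

<p>Because each below-clip value is replaced by the group average, the overall total is unchanged.</p>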
]]>
      </content:encoded>
      <pubDate>Fri, 04 Oct 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Weeknotes 2024-W39</title>
      <link>https://dringtech.com/blog/2024/weeknotes-2024-W39/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2024/weeknotes-2024-W39/</guid>
      <description>Projects and planning meetings, plus a brief delve into digital marketing.</description>
      <content:encoded>
        <![CDATA[<p>Loads of progress on the Bradford 2025 data publishing project.
I had the first in-person working day with the team on Wednesday,
and we managed to get one of the key datasets extracted and a minimal process working.
This built on the Leeds 2023 data, and I was pleased to dramatically simplify
the state tracking code I wrote for that extract.</p>
<p>This code is needed because the volunteering system does not track when checkpoints
(used to track the onboarding process for volunteers) were achieved.
The Leeds solution (for the same system) was pretty arcane, and meant that the
dataset needed to contain a hashed version of the ID field to match with incoming
data.
The new solution reduces the information stored along with the hash to a list of
checkpoints and dates.
The first step in extracting the data from the system is to establish whether any of the
hashes have a new checkpoint and, if so, to append the updates to the list of state dates.
This avoids some unpleasant round-tripping on the data.
Once this is done, the (possibly updated) state data is read and turned into a lookup
returning the states as a dictionary.
This dictionary replaces the hash, meaning that this potentially personal and disclosive
data is removed from the dataset.</p>
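<p>A minimal sketch of that flow (illustrative Python; the names and storage format are my assumptions, not the actual pipeline):</p>

```python
import hashlib
from datetime import date

def hash_id(raw_id: str) -> str:
    # One-way hash so the raw ID never needs to appear in the dataset
    return hashlib.sha256(raw_id.encode()).hexdigest()

def update_states(states, incoming, today=None):
    """Record the date each checkpoint is first seen for each hashed ID.

    `states` maps hashed ID -> {checkpoint: date first seen};
    `incoming` maps raw ID -> list of checkpoints currently achieved.
    """
    today = today or date.today().isoformat()
    for raw_id, checkpoints in incoming.items():
        seen = states.setdefault(hash_id(raw_id), {})
        for checkpoint in checkpoints:
            # Only set the date the first time a checkpoint appears
            seen.setdefault(checkpoint, today)
    return states
```

<p>The published dataset then carries the checkpoint-to-date dictionary in place of the hash, so the potentially disclosive identifier never leaves the pipeline.</p>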
<p>The hush-hush project also progressed, and I was able to play back the data work
to the client, with some excellent feedback. We have a pilot organisation workshop
next week, and will hopefully be able to speak more about the project after that.
The data science work that I mentioned in <a href="https://dringtech.com/blog/2024/weeknotes-2024-W38/">last week's weeknotes</a>
has progressed, with a fuzzy match of part of the Companies House data.
The next step on that is to add in the identification of personal names.</p>
<p>Finally, today, I met with a digital marketer who is working with one of my clients.
He needed some help getting the Google search console linked in to GA-4, which needed
a bit of tinkering and archaeology to work out where their DNS records were managed!
Will be interesting to keep abreast of this work and see how digital marketing data
is linked together.</p>
<p>A few links:</p>
<ul>
<li>An excellent <a href="https://freeradical.zone/@missiggeek/113170518046838948">visualisation / flowchart of what is considered personal data from missiggeek on Mastodon</a></li>
<li>A brilliant <a href="https://web.archive.org/web/20240919122353/https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web">article about what ChatGPT is, with analogy to lossy compression</a></li>
<li>The <a href="https://h3geo.org/">H3 hexagonal based coordinate system</a> developed by Uber for geospatial analysis.</li>
</ul>
]]>
      </content:encoded>
      <pubDate>Fri, 27 Sep 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Weeknotes 2024-W38</title>
      <link>https://dringtech.com/blog/2024/weeknotes-2024-W38/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2024/weeknotes-2024-W38/</guid>
      <description>
Project kickoffs, progress and some honest-to-goodness data science. Also, multiple examples of DNS being to blame for stuff.
      </description>
      <content:encoded>
        <![CDATA[<p>This week, the Bradford 2025 data site work kicked off.
An excellent meeting where we collaboratively agreed on the areas of focus.
We're already expanding on the Leeds 2023 work, so I'm hopeful we'll be building
some solid culture infrastructure for the team, and legacy for the broader region.</p>
<p>The other project is progressing well too, although is still under a communications
blackout, hence me not mentioning what it is explicitly!
I've had cause to use some actual data science techniques in building an exciting
bit of data infrastructure. Firstly, I'm matching very large datasets
(several million rows in the reference), and this was slowing things down.
Chucking this into DuckDB rather than processing with PETL or Pandas totally sped
this process up.
Right tool for the job.
As an added bonus, the DuckDB bindings for Python have a <code>.df()</code> method which returns the
result set as a Pandas dataframe. Winner!
I've also used some fuzzy matching using <code>thefuzz</code> to identify spelling mistakes in the
source dataset and enhance matching.
Next up will be starting to try to identify personal names in the source data so that I
can separate organisations from individuals, and handle them separately.
I've identified <a href="https://en.wikipedia.org/wiki/Named-entity_recognition">Named-entity recognition (NER)</a>
as the means to do this, and early experiments using the NLTK NER modules are looking
promising. Quite a lot of tuning to be done here.
Finally, I'm going to extend the simple DuckDB SQL matching to take account of fuzzy matches
between the source and reference sets.</p>
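<p>The fuzzy-matching idea can be sketched with the standard library's <code>difflib</code> as a stand-in for <code>thefuzz</code> (which exposes a similar 0-100 ratio); names here are my own, not the project code:</p>

```python
from difflib import SequenceMatcher

def fuzzy_ratio(a: str, b: str) -> int:
    # 0-100 similarity score, similar in spirit to thefuzz's fuzz.ratio
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

def best_match(name, reference, threshold=85):
    """Return the closest reference name, or None below the threshold."""
    score, match = max((fuzzy_ratio(name, ref), ref) for ref in reference)
    return match if score >= threshold else None
```

<p>A misspelling like <q>Acme Lmited</q> scores well above the threshold against <q>Acme Limited</q>, so it resolves to the canonical reference name.</p>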
<p>In other work, I spent quite a lot of time on network fixing. One of my clients, for whom
I've set up a Sentinel license server running on a laptop accessed via a VPN, got a
new router, so I had to open up the VPN ports again. Annoyingly I hadn't written down
exactly what I did, so it took a bit longer than it should have to work out that the port
mapping (51820/udp) shouldn't be limited! Anyway, sorted now.
Another client had a major DNS failure when one of their suppliers decided to turn off a cPanel
server that they had been using, and no longer needed. Unfortunately, their main domain was
associated with this service, and when it went away, everything broke. Badly.
That took a couple of hours to unpick and get back on the road to recovery.
And for the hat-trick, a data microsite that OI had created had its DNS records removed,
so stopped working.
Thankfully, I was able to direct them to fix it fairly quickly, and managed to find the
original email trail from two years ago.</p>
<p>All of this left me thinking that for many small (and not-so-small) businesses this network
configuration was something close to magic. I have begun work on a workbook, focussed on small
businesses, which collects important data about the critical infrastructure that keeps their
website, email and other important services up-and-running. It would encourage the business
owners to review this data frequently so that (for example) credit cards don't run out and
lead to a break in service. Furthermore, it would contain advice about practical matters,
such as which email addresses to use as the primary contact for critical accounts to avoid being locked
out in the event something goes wrong.</p>
]]>
      </content:encoded>
      <pubDate>Fri, 20 Sep 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Weeknotes 2024-W37</title>
      <link>https://dringtech.com/blog/2024/weeknotes-2024-W37/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2024/weeknotes-2024-W37/</guid>
<description>Some thoughts on how and why to automate and encapsulate stuff, and an idea for a business.</description>
      <content:encoded>
        <![CDATA[<p>Full disclosure, I'm almost a week late writing these, so they'll be a bit patchy!</p>
<p>A couple of big thoughts: one about value of various types of work and
one about a business model that might be worth exploring.</p>
<p>I was pondering the things that we spend time on in projects and things we don't.
In an ideal world the balance between time spent on the hard technical stuff
and time spent on the even harder people-related stuff should tip towards the
people-related. I've invested a lot of time in ensuring I know how to make the
techy stuff get out of the way. You want a website: I know how to build it.
You want some visualisations? I know how to do that, and moreover, I've
<a href="https://open-innovations.github.io/oi-lume-viz/">invested time with the people I work with</a>
on making it easy for anyone in the team to create a pretty good map, or chart
or whatever. Pipelines? I've got patterns that help me make robust and repeatable
data processing pipelines pretty fast.
This means that we can prototype at speed and get stuff done.
However, there are times, particularly in more open-ended projects,
when delivering fast is not really desirable. This isn't because we need to spend
longer building the tech, but because we need to spend more
time dealing with the <em>really</em> interesting stuff, which tends to be squishy, human
stuff. That might be researching, pondering, discussing with colleagues, exploring
prior art. Sometimes, delivering fast is a sure-fire way to miss the point.</p>
<p>I'm sure at various points in the week those thoughts were better defined,
so I might loop back to them and see if I can tighten them up a bit.</p>
<p>The other 'biggish thought' I had was about the idea of founding a co-op which
delivers IT services to other co-ops. This came out of a tech incident that I was
called upon to help out with at Equal Care Co-op. They are a small team, and don't
have a dedicated IT Service team. This is arguably problematic given the dependence
that people are beginning to have on the tech. Getting a full service team, with
cover dedicated simply to Equal Care, is likely to be beyond their means, and
it feels like signing up with a commercial organisation might not mesh well with
their ethos and instruments. Maybe a like-minded team, constructed to service a
number of organisations could have the economies of scale to deal with this.
Some initial challenges are seed funding the organisation, defining the offer
and recruiting both service delivery team and the client organisations.
Would this be a co-op of co-ops: each providing their IT folk to support others?
Would there be a TUPE element to this if acting as a de facto outsourcing organisation?
More to think about, but worth developing.</p>
<p>That'll do for now: it's nearly time for the next weeknote!</p>
]]>
      </content:encoded>
      <pubDate>Fri, 13 Sep 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Weeknotes 2024-W36</title>
      <link>https://dringtech.com/blog/2024/weeknotes-2024-W36/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2024/weeknotes-2024-W36/</guid>
      <description>Progress on the cultural-sector projects I'm working on and some R&amp;D, including reviving Open Audience.</description>
      <content:encoded>
        <![CDATA[<p>This week has seen
further clarification on scope for the project that started when I was on holiday,
progress on the contract for the first Bradford 2025 data delivery,
the start of the refresh of <a href="https://openaudience.org/">Open Audience</a>
and a bunch of R&amp;D on optimising SvelteKit.</p>
<p>One thing that I've struggled with a bit on the new project is lack of clarity on the direction of travel.
A colleague (hi, Sarah!) describes this as the North Star, which is great imagery:
I just need to know that I'm heading in broadly the right direction.
I'd raised the idea of writing down some (possibly / probably wrong) hypotheses that we could
use to help set this direction without being too prescriptive.
The same colleague then shared this excellent
<a href="https://benholliday.com/2015/07/16/everything-is-hypothesis-driven-design/">blog post by Ben Holliday about hypothesis driven design</a>
which pretty well encapsulates my reasons for wanting to write something down.
We (me and Sarah) have been working through some high-level / very broad hypotheses
and realised we can use these to identify some lower-level / more fine-grained hypotheses
which can really shape the work.
I've come to realise this is slightly in tension with the design &quot;double diamond&quot;,
which aims for breadth in the first instance.
My response to this is that without
the direction setting, the initial diamond risks effort radiating in all directions from
a point, resulting in a circle. The hypotheses set enough of a direction
to decide what to consider in the first diamond.
I may write this up soon.</p>
<p>My SvelteKit optimisation has resulted in a much better understanding of
how to control what the bundler creates. This is captured in a repo, ready to be
written up. As I was doing this, I was also implementing some of the recommendations
in the Social Value site I'm building. This uses DuckDB-WASM on the client side,
which is excellent... a full OLAP database in a browser? Crikey!
It is, however, quite costly on the network, as it has to download 35MB of Web Assembly code
to start the database engine up.
In a search for an alternative, I've come across <a href="https://oakserver.org/acorn">the @oak/acorn framework</a>
which will allow me to serve just the JSON. Much better for this use case.
I have, however, discovered that I cannot host this on <a href="https://deno.com/deploy">Deno Deploy</a>,
which is my go-to platform.</p>
<p>The final bit of work is refreshing <a href="https://openaudience.org/">Open Audience</a>, a tool
which <a href="https://tomforth.co.uk/">Tom Forth</a> built using 2011 Census data which
can build a profile of attendees at events based on their postcodes.
This has been buzzing around some of the culture work that I've been involved in for a while
and given we now have 2021 Census data, I thought I'd rebuild at least the dataset.
The data has changed ever so slightly, so it might not be possible to recreate completely,
and I might need to rethink the frontend.
I have some other ideas, including:</p>
<ul>
<li><strong>Open Audience as a service</strong>: An API which provides the profile based on postcodes</li>
<li><strong>Open Audience language bindings</strong>: Wrappers in Python / Javascript to allow the data to be used easily in pipelines and web-pages</li>
</ul>
<p>I did have a look at <a href="https://storybook.js.org/">Storybook</a>, and it looks promising as a way of
documenting web design libraries. I'm also wondering if there's a way of using it as an SSG,
as I suspect it might be overkill for some of the work.</p>
<p>Finally, I've added an <a href="https://dringtech.com/feed.rss">RSS feed for this site</a>, so you can read my witterings in your favourite feed reader.
I like <a href="https://netnewswire.com/">NetNewsWire</a>, FWIW, but there are many others.</p>
<p>Plans for next week:</p>
<ul>
<li>Refine the culture project hypotheses</li>
<li>Blog about the SvelteKit optimisations</li>
<li>Research adding <em>ActivityPub</em> to this site.
There's a really nice (albeit incomplete!) <a href="https://maho.dev/2024/02/a-guide-to-implement-activitypub-in-a-static-site-or-any-website/">series of blog posts by Maho Pacheco</a>
covering how to do this.</li>
</ul>
<p>Some links I came across:</p>
<ul>
<li><a href="https://landsat.gsfc.nasa.gov/apps/YourNameInLandsat-main/index.html">Your name in Landsat images</a></li>
<li><a href="https://www.map.signalbox.io/?location=@53.69127,-1.99726,9.922Z">Signalbox live train locations</a>. NB Proved not to be that accurate when I tested it from an actual train!</li>
<li>A really nice <a href="https://digital-land.github.io/blog-post/open-data-and-the-planning-data-platform/">blog post about the importance of clear Open Data licensing by Mike Rose and Kieran Wint</a>.</li>
</ul>
<p>Finally, I really like this pull quote by Ted Chiang, shared in this <a href="https://mas.to/@gleick/113058537194470078">toot by James Gleick on Mastodon</a>:</p>
<blockquote>
<p>The programmer Simon Willison has described the training for large language models as
‘money laundering for copyrighted data,’ which I find a useful way to think about the
appeal of generative-A.I. programs: they let you engage in something like plagiarism,
but there’s no guilt associated with it because it’s not clear even to you that you’re copying.<br>
<em>Source <a href="https://www.newyorker.com/culture/the-weekend-essay/why-ai-isnt-going-to-make-art">https://www.newyorker.com/culture/the-weekend-essay/why-ai-isnt-going-to-make-art</a></em>.</p>
</blockquote>
]]>
      </content:encoded>
      <pubDate>Fri, 06 Sep 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Weeknotes 2024-W35</title>
      <link>https://dringtech.com/blog/2024/weeknotes-2024-W35/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2024/weeknotes-2024-W35/</guid>
      <description>I start weeknoting (again!), do some bikeshedding and come back from holiday to a range of projects.</description>
      <content:encoded>
        <![CDATA[<p>I've decided to start writing weeknotes for myself.
In part to keep track of <em>all the stuff</em> but also to ensure I keep in the habit of writing.
As with all projects, this requires a degree of
<a href="https://en.wiktionary.org/wiki/bikeshedding#English">bikeshedding</a> and / or
<a href="https://en.wiktionary.org/wiki/yak_shaving#English">yak shaving</a>.</p>
<p>Given one of the hardest things to do in tech is deciding on the name of things,
my first task is coming up with distinct names for each of what I assume will be an almost endless collection of blog posts.
I briefly considered naming them like Friends episodes,
but this could quite quickly descend into farce.
I decided on simply the year and <a href="https://www.calendar-365.com/week-number.html">week number</a>,
according to the ISO standard of weeks starting on Monday.
I note from the link above that Monday is Labor Day in the US,
which I think has something to do with wearing white shoes.</p>
<p>This week has been mostly return from holiday and getting my head round
a couple of culture-related projects that I'm involved in as a freelancer with <a href="https://open-innovations.org/">Open Innovations</a>.
The first is the Bradford 2025 City of Culture,
for which I wrote an <a href="https://open-innovations.github.io/bradford-2025/strategy/">Open Data Strategy</a> just before my break.
I've agreed the scope of the next piece of work to start delivering on it.
It builds on the <a href="https://data.leeds2023.co.uk/">LEEDS 2023 Data Microsite</a> that I built with the OI team.
The other project is along the same lines, but much broader, and earlier in the discovery phase.
I consequently spent a fair amount of time reading what I could lay my hands on
and hypothesising about what we might build.
Very happy to be wrong about assumptions at this stage!</p>
<p>Away from Open Innovations,
I've finally broken the back of a website upgrade for the
<a href="https://hebdenbridgepicturehouse.co.uk/">Hebden Bridge Picture House</a> which should enable me to upgrade from (out of support) PHP 7.3.
It turns out that some changes in the Textpattern software have invalidated the way I built the site.
The Textpattern community were very helpful,
responding to <a href="https://forum.textpattern.com/viewtopic.php?id=52408">a topic I posted in the Textpattern forums</a> with some really useful suggestions.</p>
<p>I've also been refreshing a prototype site I'm building for <a href="https://www.chyconsultancy.com/">social value consultants CHY Consultancy</a>.
I've been mindful of recent discussions about JavaScript framework bloat, and I agree wholeheartedly with the criticism.
I really like static site generators, and tend to use <a href="https://lume.land/">Lume</a> as my weapon of choice.
In fact, this very post is built with it.
There are times when having a framework that can be easily extended to do more powerful client-side stuff is useful,
but I really don't like the likes of React, Angular, et al.
I've long been a fan of <a href="https://svelte.dev/">Svelte</a> for componentised client-side code, particularly where the need is a bit more heavyweight.
The prototype uses <a href="https://kit.svelte.dev/">SvelteKit</a>, which builds on this beautifully.
By setting a combination of the <a href="https://kit.svelte.dev/docs/page-options"><code>prerender</code>, <code>csr</code> and <code>ssr</code> options</a> per page,
it's even possible to <a href="https://kit.svelte.dev/docs/adapter-static">generate a static site</a>.
Prerendering can be turned off for any pages which need live server-side rendering (e.g. dynamic pages based on routing parameters)
or client-side processing (e.g. highly interactive islands on a web page).
Anyway, I've been tinkering to make this perform well.</p>
<p>Stuff added to the list to do or look at next week:</p>
<ul>
<li><a href="https://storybook.js.org/">Storybook</a> frontend UI workshop.</li>
<li>Draft a blog post about Generative AI, inspired by a very long and rambling pub chat that I had with a friend last Sunday night.</li>
<li>Draft a blog post about using SvelteKit to make efficient websites.</li>
</ul>
<p>Finally, and in the spirit of full disclosure,
I have just spent 20 minutes automating appending the week of the year to the post title
as a <a href="https://lume.land/docs/core/processors/">Lume preprocessor</a>.
It uses the pretty handy <a href="https://date-fns.org/"><code>date-fns</code></a> library, with the format string <code>RRRR-'W'II</code>.
This is next-level yak shaving.</p>
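As a sanity check of that format string: GNU <code>date</code> has equivalent specifiers, <code>%G</code> for the ISO week-numbering year (date-fns' <code>RRRR</code>) and <code>%V</code> for the ISO week number (<code>II</code>). A quick sketch, assuming GNU coreutils:

```shell
# ISO week-numbering year and week, matching date-fns' RRRR-'W'II
date -u -d '2024-08-30' '+%G-W%V'   # 2024-W35

# Note RRRR/%G rather than yyyy/%Y: around New Year the ISO
# week-numbering year can differ from the calendar year.
date -u -d '2024-12-30' '+%G-W%V'   # 2025-W01
```

The second example is why <code>RRRR</code> matters: 30 December 2024 falls in week 1 of 2025.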
]]>
      </content:encoded>
      <pubDate>Fri, 30 Aug 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Kitchen sync</title>
      <link>https://dringtech.com/blog/2024/kitchen-sync/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2024/kitchen-sync/</guid>
      <description>
        Automatically building and deploying a static website made simple. A few useful tips and pointers concerning use of rsync in a secure way, and dealing with differing user identities.
      </description>
      <content:encoded>
<![CDATA[<p>In my work with <a href="https://open-innovations.org/">Open Innovations</a> (and elsewhere), I frequently create static websites. These suit the work as they don't need much in the way of hosting. Most of the production sites are hosted on GitHub Pages, which works really well. The slight drawbacks are the inability to password-protect pages and the quite reasonable limitation of one GitHub Pages site per repo.
Recently the sites have been getting more complex, with longer-running development processes, so I decided that it was time to host a <strong>dev</strong> version. Luckily, OI has a cloud server running Apache, so all I needed to do was upload the built site to an appropriate directory.</p>
<p>In my goal of automating all the things, I wanted to make this happen whenever anyone pushed to the <code>dev</code> branch in the repository.
It's pretty simple to <a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-including-branches-and-tags">trigger a workflow on a push to a specific branch</a>.
The question was what to run to transfer the files.</p>
<p>The answer to this question is <code>rsync</code>, which you can read all about on the <a href="https://en.wikipedia.org/wiki/Rsync">Wikipedia page for <code>rsync</code></a>. In short, this venerable tool allows files to be transferred and synchronised over connection protocols such as SSH.</p>
<p>The actions pipeline now has the following stages:</p>
<ol>
<li>Build the site into the build folder using the site builder. In my case the builder is <a href="https://lume.land/"><code>lume</code></a>, which deposits the compiled site in <code>_site</code>.</li>
<li>Use <code>rsync</code> to transfer the build folder to the dev host.</li>
</ol>
<p>Step 1 is easy enough, and in any case out of scope for this post! Step 2 needs a bit of careful thought.</p>
<p>The basic incantation is</p>
<pre><code class="language-bash">rsync --recursive --delete $SOURCE_PATH $SSH_HOST:$SSH_PATH
</code></pre>
<p>This recurses through the source path, uploading any new or changed files and deleting any orphans. The source path should end in a <code>/</code> to avoid including the directory itself! Using the environment variables <code>SOURCE_PATH</code>, <code>SSH_HOST</code> and <code>SSH_PATH</code> means that the configuration can be altered and reused for multiple potential targets, which could be useful.
You can <a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#env">set environment variables in GitHub workflows</a>.</p>
<p>When using <code>rsync</code> interactively, it's usual to run as a personal user account, for which SSH credentials (keys, etc.) are likely already set up.
It's not, however, a great idea to use a personal account in an automated pipeline, so I created a locked-down <strong>bot</strong> user.
Passing credentials into a GitHub Actions environment is also slightly fiddly.
It's theoretically possible to set up SSH keys and a config file, but <code>rsync</code> allows a slightly easier approach: the <code>--rsh</code> option, which allows exact specification of the remote shell command.</p>
<pre><code class="language-bash">rsync ... \
  --rsh=&quot;sshpass -e ssh -o StrictHostKeyChecking=no -l $SSH_USER&quot; \
  ...
</code></pre>
<p>This allows the SSH password to be provided in the <code>SSHPASS</code> environment variable (managed via the <a href="https://www.redhat.com/sysadmin/ssh-automation-sshpass"><code>sshpass -e</code></a> command).
It also specifies the user to connect with (<code>SSH_USER</code>) and allows overriding other <code>ssh</code> options such as <code>StrictHostKeyChecking</code>.</p>
<p>So now we have a working sync command which signs in as our bot user. All is not well, however, as the uploaded files and directories are owned by the bot user.
Given these are web content, we'd ideally like them to be owned by the <code>www-data</code> user with a group ownership of <code>www-data</code>.
Thankfully we can provide another option, <code>--rsync-path</code>, which defines the command that is run in the shell created by the connection.</p>
<pre><code class="language-bash">rsync ... \
  --rsync-path=&quot;sudo -u www-data rsync&quot; \
  ...
</code></pre>
<p>This will run the remote rsync command as the user <code>www-data</code>, meaning that the files are written with appropriate ownership and permissions.</p>
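For the <code>sudo -u www-data rsync</code> call to work non-interactively, the bot account needs a matching sudoers rule on the server. Something along these lines should do it (the user name <code>deploybot</code> and the rsync path are assumptions; check the real path with <code>which rsync</code>):

```
# /etc/sudoers.d/deploybot-rsync
# Allow the bot to run rsync (and nothing else) as www-data, with no password
deploybot ALL=(www-data) NOPASSWD: /usr/bin/rsync
```

Editing this via <code>visudo -f /etc/sudoers.d/deploybot-rsync</code> guards against a syntax error locking you out.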
<p>The cherry on the cake is specifying the <code>--info</code> flag to write a report on completion of the sync. The final command is:</p>
<pre><code class="language-bash">rsync \
  --rsh=&quot;sshpass -e ssh -o StrictHostKeyChecking=no -l $SSH_USER&quot; \
  --rsync-path=&quot;sudo -u www-data rsync&quot; \
  --info=STATS2 --recursive --delete \
  $SOURCE_PATH $SSH_HOST:$SSH_PATH
</code></pre>
<p>Wrapping this up in a GitHub Actions script is fairly simple using the <a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun">actions file <code>run</code> directive</a>. Of course, it can also be packaged in another runner, such as a <code>deno.json</code> task or an NPM script. This is left as an exercise for the reader!</p>
<p>I hope that this has been helpful. I'm sure future me will also be thankful!</p>
]]>
      </content:encoded>
      <pubDate>Thu, 09 May 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>One hundred duck-sized horses</title>
      <link>https://dringtech.com/blog/2023/one-hundred-duck-sized-horses/</link>
      <guid isPermaLink="false">https://dringtech.com/blog/2023/one-hundred-duck-sized-horses/</guid>
      <description>
        I've been exploring the capabilities of DuckDB to query data from partitioned parquet files. This blog post collects some useful hints in getting this working with DuckDB-WASM for use in client-side javascript.
      </description>
      <content:encoded>
        <![CDATA[<blockquote>
<p><a href="https://knowyourmeme.com/memes/horse-sized-duck">Would you rather fight 100 duck-sized horses or one horse-sized duck?</a></p>
</blockquote>
<p>I’ve recently been trying out the capabilities of DuckDB to drive visualisations. There’s something quite astounding about writing SQL in client-side JavaScript.
My current platform of choice is Svelte. I’ve come up with some patterns for using DuckDB within a Svelte app — to be written up another day.</p>
<p>Suffice it to say that the official <a href="https://duckdb.org/docs/archive/0.9.1/api/wasm/instantiation">DuckDB WASM client setup guide</a> is a great start.
The basic pattern is to prepare parquet files with the data, and query those via DuckDB in the following format:</p>
<pre><code class="language-sql">SELECT
  strftime(date, '%x') AS date,
  value
FROM 'data.parquet'
WHERE code == 'The Code'
ORDER BY date;
</code></pre>
<p>The only prerequisite is that the parquet files need to be registered when the database connection is made:</p>
<pre><code class="language-javascript">await db.registerFileURL(
  'data.parquet',
  'data.parquet',
  DuckDBDataProtocol.HTTP,
  false
);
</code></pre>
<p>So far, so good, but if the parquet files are large, it'd be nice to take advantage of partitioning to avoid shipping the entire file to the browser.
It's pretty easy to create a partitioned dataset using libraries such as pandas:</p>
<pre><code class="language-python">df.to_parquet(path='data/', partition_cols=['variable'])
</code></pre>
<p>We now have a dataset partitioned by variable name, and can in theory write queries as follows:</p>
<pre><code class="language-sql">SELECT
  strftime(date, '%x') AS date,
  value
FROM 'data/**/*.parquet'
WHERE code == 'The Code'
ORDER BY date;
</code></pre>
<p>The slight wrinkle is that you still need to register each file as before.
It appears that <code>registerFileURL</code> doesn't support wildcards, so each file has to be registered independently.
Having discovered this, I decided that a manifest file would be a sensible way of dealing with the potentially very large number of files that need to be registered.
A simple way to create this is using shell commands and <code>jq</code>.</p>
<pre><code class="language-sh">find data/ -type f |\
  jq --raw-input --slurp 'split(&quot;\n&quot;) | map(select(length &gt; 0))' &gt; manifest.json
</code></pre>
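The manifest step can be checked end-to-end without any real parquet files; a <code>map(select(length > 0))</code> filter drops the empty string that splitting on the trailing newline would otherwise leave behind:

```shell
# Fake a hive-style partition layout, as pandas would write it
mkdir -p 'data/variable=a' 'data/variable=b'
touch 'data/variable=a/part-0.parquet' 'data/variable=b/part-0.parquet'

# Build the manifest, filtering out the trailing empty entry
find data/ -type f -name '*.parquet' |
  jq --raw-input --slurp 'split("\n") | map(select(length > 0))' > manifest.json

jq length manifest.json   # 2
```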
<p>I then fetch the manifest and register each parquet file in the JSON array as follows:</p>
<pre><code class="language-js">const manifest = await fetch('manifest.json').then(r =&gt; r.json());
await Promise.all(
  manifest.map(p =&gt; db.registerFileURL(p, p, DuckDBDataProtocol.HTTP, false))
);
</code></pre>
<p>Here are the rough results for a simple test database that I set up.
This is running on my local network, so the impact on a slower network would be greater.
The whole database is 2.5 MB as a single parquet file, which is already a massive saving on the source 18 MB CSV file.
It's worth noting that subsequent calls were much faster.</p>
<table>
<thead>
<tr>
<th style="text-align:right">Measurement</th>
<th style="text-align:center">Monolithic parquet</th>
<th style="text-align:center">Partitioned parquet</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right">Network transfers</td>
<td style="text-align:center">11 requests</td>
<td style="text-align:center">6 requests</td>
</tr>
<tr>
<td style="text-align:right">Network payload</td>
<td style="text-align:center">3.6 MB</td>
<td style="text-align:center">81.3 kB</td>
</tr>
<tr>
<td style="text-align:right">Time for first query</td>
<td style="text-align:center">659 ms</td>
<td style="text-align:center">278 ms</td>
</tr>
<tr>
<td style="text-align:right">For next query</td>
<td style="text-align:center">~100 ms</td>
<td style="text-align:center">~20 ms</td>
</tr>
</tbody>
</table>
<p>Limitations I ran into, each of which could do with a bit more digging:</p>
<ol>
<li>The libraries I was using don't seem to allow more than 1024 partitions to be created.</li>
<li>I tried using Brotli compression, but the DuckDB WASM library didn't seem to like it.</li>
<li>Within some build systems, I sometimes had to manipulate the URL, prefixing it with the server URL.</li>
</ol>
]]>
      </content:encoded>
      <pubDate>Mon, 06 Nov 2023 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>