Making Typerighter work harder

Rhys Mills

11 December 2023 at 10:12 am

Minimising typos is important to a newspaper. Publishing messy prose might undermine trust in our overall quality control – why trust our political reporting if we can’t spell Thérèse Coffey’s name? We want to provide accurate information, and typos can mutate meaning – a single letter can transform a fiend to a friend and a bowl to a bowel. Worse, a missing word can turn a holy text into a wicked one.

The Guardian is in the business of words, and we’re keen to avoid “idiosyncratic” spelling decisions. That’s why we built Typerighter, a tool that brings recommendations from our style guide into our content management system – Composer – and highlights errors well before articles get in front of our readers.

My engineering colleagues have already written about the construction of Typerighter. This is the story of Typerighter’s next steps – how we improved it to provide more value for the Guardian and its readers.

Where we left off

Until this year, Typerighter used a set of “rules” for checking text, created largely by a Guardian subeditor, Max Walker (who wrote an intro for the tool a few years back), and stored in a Google sheet. These rules were ingested by our Typerighter Checker service, a Scala Play web app which used the LanguageTool library. As writers and editors made changes to text in Composer, Composer would send it to Typerighter, which would spot potential problems and provide suggested corrections.

Typerighter had two kinds of rule: manually curated regular expressions – which would search text for regex patterns relating to the Guardian’s style guide – and more complex XML-based LanguageTool rules, usually finding common grammatical errors. So, for example, if a journalist writes World Health Organisation, Typerighter will suggest the correct World Health Organization. If they start a sentence without a capital letter, Typerighter will suggest adding one.
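To make the first kind of rule concrete, here is a minimal sketch of a regex rule and a checker that applies it. It is written in Python for brevity, whereas the real Checker is a Scala service, and the names RegexRule, Match and check are hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexRule:
    """A hypothetical style-guide rule: a pattern plus a suggested replacement."""
    pattern: re.Pattern
    suggestion: str
    description: str


@dataclass(frozen=True)
class Match:
    start: int       # offset of the first matched character
    end: int         # offset just past the last matched character
    text: str        # the text that triggered the rule
    suggestion: str  # what the rule proposes instead


def check(text: str, rules: list[RegexRule]) -> list[Match]:
    """Run every rule over the text and collect matches in document order."""
    matches = [
        Match(m.start(), m.end(), m.group(0), rule.suggestion)
        for rule in rules
        for m in rule.pattern.finditer(text)
    ]
    return sorted(matches, key=lambda match: match.start)


RULES = [
    RegexRule(
        pattern=re.compile(r"\bWorld Health Organisation\b"),
        suggestion="World Health Organization",
        description="Style guide: Organization, not Organisation",
    ),
]

found = check("The World Health Organisation issued new guidance.", RULES)
for match in found:
    print(f"{match.text!r} at {match.start}-{match.end}: try {match.suggestion!r}")
```

Returning offsets alongside the suggestion is what lets an editor highlight the exact span in the text rather than just reporting that a problem exists somewhere.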

What happened next

Introducing Typerighter helped us reduce the overall volume of errors in our published content. But some errors still snuck past, keeping our readers’ editor busy. Why? For a start, Typerighter wasn’t everywhere it needed to be; most notably, it wasn’t in the editing system for our live blogs, which have become an increasingly important way of delivering news to our readers. This was a problem because the live blog is written and published virtually in real time – a perfect environment for typos to creep in.

The live blog was still based on our deprecated rich text editor, Scribe, which was incompatible with Typerighter. So we migrated the live blog (and various other rich text editors in Composer) to our favoured rich text editor of today – ProseMirror – which Typerighter was designed to operate with. Typerighter was almost everywhere we wanted it to be, but a problem remained.

While we could cover mistakes for our style guide, we’d never catch everything – bespoke regex rules were an impractical solution for catching all the possible typos of valid English language words. The Guardian’s “house dictionary”, Collins, which journalists defer to when the style guide doesn’t have a ruling on something, has about 300,000 words, and, when you factor in all the possible misspellings of those words, we would have to search for many millions of variants.
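A rough way to see the scale of the problem is to enumerate every string that is a single slip away from one word. The sketch below does this for a lowercase 26-letter alphabet (an assumption, since real text also has accents, hyphens and capitals). A word of n letters has at most 54n + 25 such variants: n deletions, n - 1 transpositions, 26n substitutions and 26(n + 1) insertions, before duplicates are removed. The function names are illustrative only.

```python
import string

ALPHABET = string.ascii_lowercase


def edits1(word: str) -> set[str]:
    """All distinct strings exactly one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + c + right[1:] for left, right in splits if right for c in ALPHABET
    ]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    # Some edits reproduce the original word (e.g. replacing a letter with
    # itself), so remove it from the result.
    return set(deletes + transposes + replaces + inserts) - {word}


def upper_bound(word: str) -> int:
    """Maximum number of one-edit variants before removing duplicates."""
    return 54 * len(word) + 25


variants = edits1("guardian")
print(len(variants), "distinct one-edit variants; upper bound", upper_bound("guardian"))
```

With a few hundred single-edit variants per word and about 300,000 words, the total runs comfortably into the many millions, and that is before allowing for two or three slips in the same word. Writing a regex rule for each is not realistic.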

We needed a general-purpose spellchecker, based on the Collins dictionary, to extend Typerighter beyond its existing rules, catching typos through a separate mechanism and making sensible suggestions quickly.

One last hitch

There was one obstacle to our ambitions in the existing architecture of Typerighter: the Google spreadsheet that served as a home for our rules. Until then, it had been mostly fit for purpose – so, in the spirit of avoiding premature optimisation, we had kept it.

Now, it was time to move on. The sheet was no longer practical – with 13,000 rows it already felt clunky for the journalists who maintained and updated the rules, and adding 300,000 more rows representing the words in the dictionary was out of the question. The spreadsheet was easy to break by accidentally deleting things, there wasn’t a clear audit trail for changes and it wasn’t particularly user-friendly because it contained lots of technical data that editors didn’t need to see.

What architecture would we use for its replacement, a new Rule Manager service? Well, this wasn’t quite a greenfield project: we already had a Typerighter repository containing a Scala Play web application – the Checker service. We could have chosen a new architecture for the Rule Manager, but this would have added another set of technologies for Typerighter’s developers to learn and master. So we built the Rule Manager as a second Play app within the repository – we had no reason to introduce a new paradigm when Play would fit our needs.

We already had an abstraction that made it simple to swap in a new service for the Google sheet: the Checker read its rules from a published JSON artefact, and it didn’t matter to the Checker whether that artefact came from the Google sheet or from our new Rule Manager service.
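The value of that arrangement can be sketched in a few lines. Everything here is an assumption made for illustration: the JSON shape, the field names and the two publish functions are hypothetical stand-ins, and the real services are written in Scala. The point is only that the consumer depends on the artefact, not on whoever produced it.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    id: str
    kind: str                  # hypothetical labels, e.g. "regex"
    pattern: str
    suggestion: Optional[str]  # some rules may flag a problem without a fix


def load_rules(artefact: str) -> list[Rule]:
    """Parse a published artefact. The Checker side only ever sees this JSON,
    so it cannot tell, and need not care, which service produced it."""
    payload = json.loads(artefact)
    return [
        Rule(r["id"], r["kind"], r["pattern"], r.get("suggestion"))
        for r in payload["rules"]
    ]


def publish_from_sheet(rows: list[list[str]]) -> str:
    """Stand-in for the old path: spreadsheet rows turned into the artefact."""
    rules = [
        {"id": row[0], "kind": row[1], "pattern": row[2], "suggestion": row[3]}
        for row in rows
    ]
    return json.dumps({"rules": rules})


def publish_from_rule_manager(records: list[dict]) -> str:
    """Stand-in for the new path: database records turned into the artefact."""
    return json.dumps({"rules": records})


PATTERN = r"\bWorld Health Organisation\b"
sheet_artefact = publish_from_sheet(
    [["1", "regex", PATTERN, "World Health Organization"]]
)
manager_artefact = publish_from_rule_manager(
    [{"id": "1", "kind": "regex", "pattern": PATTERN,
      "suggestion": "World Health Organization"}]
)

# Both producers yield rules the consumer treats identically.
print(load_rules(sheet_artefact) == load_rules(manager_artefact))
```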

One addition we would need was somewhere to store our rules, where we could make additions and query existing data in a performant manner. This sounded like a relational database; we went with Postgres, a powerful and flexible database with strong text search options and some use across other Guardian projects, such as Composer.

With a sprinkling of UI magic from our product designer, and making use of React and the ElasticUI interface library (recommended by colleagues in our Investigations & Reporting team), we built the Rule Manager, a modern, user-friendly tool for managing Typerighter’s rules. With a tool powerful enough to manage a dictionary’s worth of rules we could move on to our next phase.

Building the spellchecker

To build our spellchecker, we first needed a dictionary. We struck a deal with the wonderfully helpful people at Collins to get access to a data representation of the Collins English Dictionary. We then ingested Collins’ list of words into our Rule Manager database, creating a third rule type – the Dictionary Rule. With 10 times as many rules as before, we quickly encountered new performance problems, and had to tighten up some inefficiencies that we’d previously gotten away with.

With those teething problems resolved, we could use the words to create our spellchecker. We pair the words with their frequencies in the English language, and provide them to a LanguageTool instance, which finds strings that aren’t valid words in its dictionary, and identifies any valid words within an edit distance of three of those strings, ranking them by word frequency. We plug that LanguageTool instance into the Checker’s existing interface (as a new matcher in our MatcherPool) to provide spellchecking for our users in Composer.
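The idea behind that step can be shown with a brute-force sketch: flag a token that is not in the word list, collect the valid words within an edit distance of three, and rank them by frequency. This is not LanguageTool’s implementation, which will differ and be far more efficient. The function names are hypothetical, and the word frequencies are invented numbers for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def suggest(token: str, frequencies: dict[str, int],
            max_distance: int = 3, limit: int = 5) -> list[str]:
    """Return valid words near `token`, most frequent first."""
    lowered = token.lower()
    if lowered in frequencies:
        return []  # a valid word needs no suggestions
    candidates = [
        word for word in frequencies
        # Cheap length check first: words whose lengths differ by more than
        # max_distance cannot be within max_distance edits.
        if abs(len(word) - len(lowered)) <= max_distance
        and levenshtein(lowered, word) <= max_distance
    ]
    # Rank by descending frequency, breaking ties alphabetically.
    return sorted(candidates, key=lambda w: (-frequencies[w], w))[:limit]


# Invented frequencies, purely for illustration.
FREQUENCIES = {"friend": 900, "bowl": 300, "fried": 120, "bowel": 60, "fiend": 40}

print(suggest("freind", FREQUENCIES))
```

Ranking by frequency is what puts the likeliest intended word first: a writer who types "freind" almost certainly meant the common word rather than a rarer neighbour.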

Because subeditors can modify the words recognised by the spellchecker in our Rule Manager, we can add neologisms as they enter the language, or remove words we don’t particularly want it to suggest as valid spellings (for example, the many pejoratives that are valid English words which we wouldn’t want to write in an article).

We were now ready to enter testing with a small number of users. Initial results were promising – but we quickly hit a snag. The Guardian publishes content on events from around the world, and our writers regularly use proper nouns that aren’t in our dictionary. The spellchecker would often flag them as typos and usually make poor suggestions – we wanted to ignore words that were likely to be unrecognised but valid names.

Fortunately for us, the Data Science team at the Guardian were way ahead of us, and had already built a very good named entity recognition (NER) model, trained on Guardian content, and a service to interact with it – providing a means to identify real-world entities such as the names of people and places. Once we had plugged into that NER service and started ignoring the entities it flagged, the Typerighter Collins spellchecker was good to go. Composer would have its own spellchecker, built exactly for the Guardian’s needs.
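The filtering step might look something like the sketch below, which assumes that both the spellchecker and the NER service report character spans. The Span type, the helper names and the example entity are hypothetical, and the NER service itself is not modelled: its output is simply hard-coded here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # offset of the first character
    end: int    # offset just past the last character


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one character."""
    return a.start < b.end and b.start < a.end


def drop_entity_matches(matches: list[Span], entities: list[Span]) -> list[Span]:
    """Keep only spelling matches that do not touch a recognised entity."""
    return [m for m in matches if not any(overlaps(m, e) for e in entities)]


def span_of(text: str, fragment: str) -> Span:
    """Locate the first occurrence of `fragment` in `text`."""
    start = text.index(fragment)
    return Span(start, start + len(fragment))


text = "Thérèse Coffey spoke to the commitee."

# Pretend the spellchecker flagged three unknown tokens...
spelling_matches = [span_of(text, w) for w in ("Thérèse", "Coffey", "commitee")]
# ...and the NER service recognised one person.
entities = [span_of(text, "Thérèse Coffey")]

kept = drop_entity_matches(spelling_matches, entities)
print([text[s.start:s.end] for s in kept])
```

Only the genuine typo survives, while the unfamiliar but valid name is left alone.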

What’s next?

With the Rule Manager in place, and the Collins spellchecker wired in, Typerighter can do an even better job of supporting consistency in the content we produce. That said, there are bound to be some snags. Some problems can be solved by engineers, but many will be solvable by users of the Rule Manager – journalists who’ll be able to tweak our ruleset themselves.

For now, our core focus as engineers will move on to our other tools, like Composer and the Grid – and trying to figure out new ways to make our colleagues’ lives easier.
