
Robots in the Newsroom


As digitization becomes pervasive in our world, more of its happenings are rendered in data. Advances in the computational sciences, specifically in AI, Machine Learning & Big Data Analytics, along with the accessibility of these to business operations, have made it possible to harvest this data and deliver newer, deeper & more interesting insights. This advance is creating a paradigm in which new digital voices are valued the way we value a human source for bringing a unique perspective to an event. The goal of this paper is to explore the impact of advances in AI on Journalism, as well as how these are extending human capabilities in the reporting of news.

 

1. Introduction

AI in Journalism can be broadly defined as the idea of using algorithms to automatically generate news & analysis from structured data. News organizations are already past the nascent stages of adopting AI to generate news content. Amid the sensationalism in some quarters that a robot journalist may soon win the Pulitzer, write eighty percent of online news or make human reporters redundant, we shouldn’t lose sight of exactly what AI-supported Journalism means to a news organization today. A simplistic take would be leveraging AI to let reporters spend less time on mundane activities (such as transcribing and manually poring over datasets) and more time on complex tasks (such as pursuing leads, asking questions, judging importance, building sources & understanding context).

A more nuanced view of AI’s influence on Journalism will emerge as we delve into the use cases in the upcoming sections.

2. How is AI augmenting Journalism today?

2.1 AI helps reporters scale: Associated Press publishes twelve times more earnings stories

Every three months, the financial journalists of the Associated Press (AP) were burdened with the enormous human task of reporting earnings for as many companies as possible. AP introduced AI to create content from raw earnings data & publicly available digital data, automatically reporting 3,700 corporate earnings stories each quarter, like the one shown below.

Figure 1: A sample automated earnings story

Source – Automated Insights

A key cog of this solution was Automated Insights’ [1] [18] Wordsmith [2] AI platform. This augmentation freed up twenty percent of journalists’ time. In terms of scale, the Associated Press now provides customers with twelve times as many corporate earnings stories as before, including for many very small companies that had never received any coverage.
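To make the mechanics concrete, here is a minimal sketch of the template-filling technique that platforms like Wordsmith automate at scale; the field names, thresholds & phrasing below are illustrative assumptions, not Automated Insights’ actual schema.

```python
# A minimal sketch of template-based story generation; all fields,
# thresholds & phrasing are illustrative, not Wordsmith's real schema.

def describe_vs_forecast(actual, expected):
    """Pick a verb phrase based on how results compare to forecasts."""
    if actual > expected:
        return "beat analyst expectations"
    if actual < expected:
        return "fell short of analyst expectations"
    return "met analyst expectations"

def earnings_story(company, quarter, eps_actual, eps_expected, revenue_m):
    """Merge structured earnings data into a narrative template."""
    verdict = describe_vs_forecast(eps_actual, eps_expected)
    return (
        f"{company} reported {quarter} earnings of ${eps_actual:.2f} per "
        f"share, which {verdict} of ${eps_expected:.2f}. Revenue came in "
        f"at ${revenue_m:,.0f} million."
    )

print(earnings_story("Acme Corp", "Q2 2018", 1.42, 1.35, 812))
```

Run over thousands of rows of structured earnings data, the same template yields thousands of publishable variants, which is precisely the scaling effect AP reported.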

In another case, the legendary news organization Reuters has deployed AI to generate thousands of automated stories [3] each day, in multiple languages, from corporate and government data.

2.2 Finding needles in a haystack: Atlanta Journal investigates sexual abuse by doctors

In special cases that might elude human analysis due to the sheer scale or complexity of data, AI can be a silver bullet. This role fits neatly into standard newsroom processes because, even when AI surfaces cases the human eye would glaze over, the findings can still be fact-checked via human investigative techniques.

The Atlanta Journal-Constitution is the only major daily newspaper in the Atlanta metro area. It was a Pulitzer Prize finalist in 2017 for its comprehensive investigation of sexual abuse by doctors. Using machine learning, natural language processing (NLP) & other techniques, it analyzed more than 100,000 documents [4] and other records to find cases that may have involved doctors’ misconduct. Completing this research in such a remarkable timeframe wouldn’t have been possible without AI-based search & contextual analysis.
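As an illustration of the general technique (not a reproduction of the AJC’s actual pipeline), a supervised triage model can be trained on a handful of hand-labelled documents and then used to rank the unread pile for human review; the sample documents below are invented.

```python
# A sketch of supervised document triage: train on hand-labelled examples,
# then surface the highest-probability documents for a reporter to verify.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled examples: 1 = likely misconduct case, 0 = other.
train_docs = [
    "board order: license suspended after patient complaint of misconduct",
    "renewal of medical license approved, no disciplinary history",
]
train_labels = [1, 0]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_docs)
model = LogisticRegression().fit(X, train_labels)

# Score the unread pile; humans fact-check whatever the model flags.
unread = ["consent order following allegations of sexual misconduct"]
scores = model.predict_proba(vectorizer.transform(unread))[:, 1]
for doc, score in sorted(zip(unread, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc[:60]}")
```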

2.3 Pursuit of trends: ProPublica profiles US Congressmen

Statistical techniques embedded in AI algorithms can help profile vast sets of raw data to bring out unique characteristics, outliers & trends.

ProPublica is an American nonprofit newsroom that produces investigative journalism in the public interest. ProPublica journalist Jeremy Merrill used machine learning [5] to detect the issues uniquely important to each member of Congress.
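The underlying idea can be sketched with TF-IDF: terms a legislator uses far more often than their peers bubble up as that member’s signature issues. The press-release snippets below are invented, and this is a simplification of ProPublica’s actual method.

```python
# A sketch of finding each member's distinctive issues: TF-IDF weights
# terms highly when one member uses them and the others mostly don't.
from sklearn.feature_extraction.text import TfidfVectorizer

press_releases = {  # invented stand-ins for each member's press releases
    "Member A": "water rights drought irrigation farmers water storage",
    "Member B": "veterans affairs veterans healthcare military families",
    "Member C": "broadband rural internet access broadband expansion",
}

names = list(press_releases)
vec = TfidfVectorizer()
tfidf = vec.fit_transform(press_releases[n] for n in names)
terms = vec.get_feature_names_out()

for i, name in enumerate(names):
    row = tfidf[i].toarray().ravel()
    top = row.argsort()[::-1][:3]  # the three most distinctive terms
    print(name, "->", [terms[j] for j in top if row[j] > 0])
```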

2.4 AI battles fake news

Google is pursuing research into alternate ways of scoring web pages [20]; this tries to understand a page’s context independent of external signals such as links. French news outlets have combined forces in an initiative called CrossCheck [21] that identifies & debunks fake news. Technically, fake news is identified using a few overarching principles, namely –

  • Automated weighing of facts using technologies such as Natural Language Understanding (NLU) & Machine Learning
  • Scoring of pages contextually, as Google has attempted
  • Prediction of reputation using Machine Learning and existing tools such as Alexa Ranks
  • Keyword analytics, using techniques such as the detection of sensational words (see the sketch after this list)
  • The French news giant Le Monde has created a database called Le Décodex [22] [8], which lists hundreds of news sites that have been identified and tagged as “satire,” “real,” “fake,” et al.
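Here is the promised keyword sketch: a headline scored by its density of sensational trigger words. The word list & threshold are illustrative assumptions, and such a score would only ever be one weak signal routed to human fact-checkers.

```python
# A minimal sketch of keyword analytics for fake-news triage; the word
# list is an invented assumption, not taken from any cited system.
SENSATIONAL = {"shocking", "unbelievable", "miracle", "exposed",
               "secret", "outrage", "banned"}

def sensationalism_score(headline: str) -> float:
    """Fraction of words in the headline that are sensational triggers."""
    words = [w.strip(".,!?'\"").lower() for w in headline.split()]
    if not words:
        return 0.0
    hits = sum(w in SENSATIONAL for w in words)
    return hits / len(words)

print(sensationalism_score("Shocking secret doctors don't want exposed!"))
# 0.5 -> flag for a human fact-checker; one weak signal among many.
```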

2.5 Significant Others: Use cases from BBC, The NY Times, LA Times

BBC’s Juicer uses AI to aggregate news from within as well as outside the BBC’s realm, extract content intelligently & enable contextual discovery [6].

The NY Times leveraged Google’s Perspective API, among other techniques, to increase the productivity of its content-moderation team [7] by eighty percent in validating comments on its public website.
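For a sense of how such moderation tooling is wired up, here is a sketch of scoring a single comment with the Perspective API; the endpoint & payload follow Google’s public documentation at the time of writing, and the key, comment text & routing threshold are placeholders.

```python
# A sketch of scoring one reader comment for toxicity via Perspective.
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder; requires your own key
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

payload = {
    "comment": {"text": "You are a disgrace and everyone knows it."},
    "languages": ["en"],
    "requestedAttributes": {"TOXICITY": {}},
}
req = urllib.request.Request(
    URL, data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

score = result["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"toxicity = {score:.2f}")  # e.g. route > 0.8 to a human moderator
```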

The Los Angeles Times used AI techniques to investigate the LAPD’s downgrading of crime classifications [9].

 

3. AI technology foundations in Journalism

The following illustration presents the sub-domains of AI that are augmenting Journalism.

Figure 2: Sub-domains of AI that are augmenting Journalism

Though an explanation of how these sub-domains impact Journalism is outside the scope of this paper, the following value chain expands on this topic.

Figure 3: The AI value chain in Journalism

Source – a World Economic Forum publication [10]

 

4. Case Study

Heliograf, the trailblazer amongst AI-powered news bots

Heliograf is an AI-enabled, automated news content generation & publication engine created & operated by The Washington Post. Human editors at The Post create narrative templates for stories, then hook Heliograf up to a source of structured data related to the event (for example, VoteSmart.org in the case of a US election). Heliograf zeroes in on the required data, matches it with the corresponding phrases in the template, merges them, and then publishes versions of the content specific to each destination channel. The system can also alert reporters via Slack to any highlights or outliers it finds in the data. The Post’s original objective for this project was twofold:

  1. Grow its audience. Instead of targeting a national audience with a small number of labor-intensive human-written stories, Heliograf can deliver to several small audiences with many automated stories about targeted or localized topics.
  2. Allow its reporters to focus on the qualitative side of reporting. Jeremy Gilbert of The Post says, “If we took someone like Dan Balz, who’s been covering politics for the Post for more than 30 years, and had him write a story that a template could write, that’s a crime!” [11]

 

Heliograf’s Architecture

Figure 4: Heliograf’s architecture

Source – Image & information on the architecture are sourced from a scholarly paper titled ‘Domain Specific Newsbots’, authored by Al Johri, Eui-Hong Han & Dhrumil Mehta [12]

The following are the technical building blocks of Heliograf.

Observer

A crawler application that monitors output sources (REST API endpoints or webhooks) for any changes to data and caches the most recent version. For example, during the 2016 Olympics, one endpoint per sport would be monitored. Assuming RESTful endpoints are polled, the polling delay defines the “real-time” nature of the stories.
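A minimal sketch of such an observer, assuming a hypothetical per-sport endpoint and an arbitrary 30-second polling delay:

```python
# A sketch of the Observer: poll an endpoint, cache the latest payload,
# and hand off any change. URL and delay are illustrative assumptions.
import json
import time
import urllib.request

ENDPOINT = "https://example.org/api/olympics/swimming"  # hypothetical
POLL_DELAY = 30  # seconds; bounds how "real-time" the stories can be

def notify_detector(payload):
    """Stub hand-off to the Detector component (next section)."""
    print("data changed, latest snapshot:", payload)

cache = None  # most recent version of the data
while True:
    with urllib.request.urlopen(ENDPOINT) as resp:
        latest = json.load(resp)
    if latest != cache:
        cache = latest
        notify_detector(latest)
    time.sleep(POLL_DELAY)
```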

Detector

The detector is where the event-detection logic lies & where triggers are defined to kick off news generation. Heliograf’s triggers consider both the current state of the data and its previous states to find changes; the frequency & intervals at which triggers fire are decided contextually, not pre-defined. In situations where there is no consistent stream of clearly defined events or state changes, Heliograf can “find” a story within the data. Its triggers are stateful: it remembers how it characterized an event in the past in order to determine how the event is characterized in the present. For example, during the 2016 US primary elections, Heliograf published tweets when the votes started coming in, a race was close, the vote counts increased by ten percent, a result was announced, one hundred percent of precincts had reported results, etc.
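A stateful trigger of this kind can be sketched as a comparison of the previous and current snapshots; the field names & ten-percent threshold below mirror the examples in the text but are otherwise assumptions.

```python
# A sketch of stateful event detection: diff the previous and current
# states and emit the events worth a story or tweet.
def detect_events(prev, curr):
    events = []
    if prev is None and curr["votes_in"] > 0:
        events.append("votes have started coming in")
    elif prev is not None:
        if prev["votes_in"] and curr["votes_in"] >= prev["votes_in"] * 1.10:
            events.append("vote count increased by ten percent")
        if curr["precincts_reporting"] == 100 and prev["precincts_reporting"] < 100:
            events.append("one hundred percent of precincts have reported")
    if curr.get("winner") and (prev is None or not prev.get("winner")):
        events.append(f"result announced: {curr['winner']} wins")
    return events

prev = {"votes_in": 1000, "precincts_reporting": 40, "winner": None}
curr = {"votes_in": 1200, "precincts_reporting": 100, "winner": "Smith"}
print(detect_events(prev, curr))  # three events fire on this transition
```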

Writer

The writer, in its simplest avatar, leverages a template to produce output for all of The Post’s distribution channels. But it is also capable of more complex tasks (a sketch follows this list), such as

  1. Handling the permutations that have to be written to account for all the different events that could occur, in any order, and the grammatical nuances of reporting each one
  2. Making appropriate assumptions & decisions when dealing with the varied contexts & situations of an event, as well as with news about separate but related events
  3. Generating news at the right level of abstraction based on the gravity of the story & the distribution channel.
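Here is the promised sketch, a toy illustration of point 1: the same routine must realize whichever events occurred, in whatever order, as one grammatical sentence. The event names & phrasings are invented.

```python
# A toy writer: join whatever events occurred, in order, into one
# fluent sentence. Event names and phrasings are invented examples.
PHRASES = {
    "lead_change": "{leader} took the lead",
    "record_turnout": "turnout set a record",
    "race_called": "the race was called for {leader}",
}

def realize(events, leader):
    """Render any subset and ordering of events grammatically."""
    clauses = [PHRASES[e].format(leader=leader) for e in events]
    if not clauses:
        return "No notable developments yet."
    if len(clauses) == 1:
        body = clauses[0]
    else:
        body = ", ".join(clauses[:-1]) + " and " + clauses[-1]
    return body[0].upper() + body[1:] + "."

print(realize(["lead_change", "race_called"], leader="Jones"))
print(realize(["record_turnout"], leader="Jones"))
```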

The fulcrum of this operation is Natural Language Generation (NLG) [16]; while expanding on NLG’s functioning is outside the scope of this paper, it’s worth mentioning that the Heliograf team employed the University of Aberdeen’s SimpleNLG in its bot. In this context, reading about Automated Insights’ Wordsmith platform, which is popular amongst news-content-generating applications, is a worthy exercise [17].

Distributor

Heliograf’s distributor component works in one of two ways. First, for distribution to one-way channels, it acts as a wrapper around various APIs such as Twitter, Slack, WordPress, etc.; in higher-frequency applications, it can queue items for publication. Second, for distribution to two-way channels, its distributor API can be called by apps such as Amazon Alexa or Facebook Messenger. It supports the differing features of each channel; for example, Facebook Messenger engages the API with the user’s input as an atomic request, while other text-based channels support a user session with persistence of past user interactions. Heliograf is also able to support speech recognition within the platform and provide results in a structure specific to the destination channel.
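A sketch of the one-way fan-out described above, with a queue in front of stand-in channel wrappers; these classes are illustrations, not The Post’s actual integrations.

```python
# A sketch of one-way distribution: publish() buffers stories, flush()
# fans each one out through per-channel API wrappers (stand-ins here).
class SlackChannel:
    def send(self, story):
        print(f"[slack] {story[:80]}")

class TwitterChannel:
    LIMIT = 280
    def send(self, story):
        print(f"[twitter] {story[:self.LIMIT]}")  # trim to channel limit

class Distributor:
    def __init__(self, channels):
        self.channels = channels
        self.queue = []  # buffer for high-frequency publishing

    def publish(self, story):
        self.queue.append(story)

    def flush(self):
        while self.queue:
            story = self.queue.pop(0)
            for ch in self.channels:
                ch.send(story)  # each wrapper adapts to its own API

d = Distributor([SlackChannel(), TwitterChannel()])
d.publish("Smith wins the District 7 race with 54 percent of the vote.")
d.flush()
```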

 

5. The Road Ahead & Challenges

Transparency & Accountability

How much do readers need to know about how a story was built & what choices were made in creating it? Clarity on accountability for the facts & content published, as well as for errors caused by the algorithm, is an important responsibility to the audience. This is all the more important since AI today is not fully able to verify whether the data it uses is authentic or accurate.

Take, for instance, Quakebot, a platform that helped the Los Angeles Times publish an article about the 2014 LA earthquake within three minutes [13] of its occurrence. But in 2015, the bot reported a California earthquake based on data from an earthquake that actually happened in Alaska [14]. The information was clearly published without time for basic human verification.

Additionally, AI’s lack of self-awareness ensures that human validation will always be required for content that goes beyond the uses explained earlier in this piece.

Editorial Bias

News curation by AI represents editorial decision-making. A major problem facing engineers today is integrating editorial values into this curation process & into content generation wherever AI is involved. Further, the Machine Learning that enables content generation relies on human-supervised learning; this process cascades biases whose effects can’t be predicted & could be potentially damaging. Journalists can best leverage AI once they understand the technology.

Thin line called Personalization

AI & analytics enable the personalization of feeds, articles et al., whether for individuals or targeted at a specific audience. Though personalization allows outlets to shape stories to cater to each individual reader’s interests, increasing user loyalty, it walks a fine line. Too much personalization turns journalism into marketing, and that will have a detrimental effect on reader engagement & eventually ROI.

The matter of Ethics & Privacy

Though AI allows journalists to perform massive data crunching in a limited period, this advantage can easily become an Achilles’ heel. Ethical use of data demands respect for sensitive matters like transparency, contextualization and trust. Data privacy & ethical use are fundamental questions every journalist needs to introspect on. AI is often perceived as an amorphous blob where data goes in and insights come out. Journalists need to be as critical as possible & stay aware of these algorithmic decisions, both when using them in their own research and when reporting on them. Of interest here is an ethics checklist for automated Journalism published by Tom Kent of Columbia University [15].

Emergence of monopolies

AI research, tooling & embedding into operations is an expensive affair. The required resources & access to engineering and subject-matter expertise can be afforded only by large news organizations. Considering that the infusion of AI lets large players create the personalized, niche stories that were once the domain of small, localized news organizations, this advance can be a hammer blow to such organizations, leading to the rise of monopolies in an industry where the presence of myriad voices & balances is sacrosanct. Smaller news organizations might look at early partnerships with academia to ensure they aren’t left behind.

 

Conclusion

The emergence of AI holds immense promise to enhance a reporter’s capabilities & to benefit Journalism at large. While the value of this is undeniable, the use & extension of this technology requires careful nurturing & long-term societal consideration.

 

REFERENCES

[1] Automated Insights – https://automatedinsights.com/

[2] Automated Insights, Wordsmith NLG platform – https://wordsmith.readme.io/v1.6/docs

[3] Columbia Journalism Review, The Age of the Cyborg – https://www.cjr.org/analysis/cyborg_virtual_reality_reuters_tracer.php

[4] The Atlanta Journal-Constitution: Doctors & Sexual Abuse – http://doctors.ajc.com/about_this_investigation/?ecmp=doctorssexabuse_microsite_stories

[5] ProPublica, Chamber of Secrets: Teaching a Machine What Congress Cares About – https://www.propublica.org/nerds/teaching-a-machine-what-congress-cares-about

[6] British Broadcasting Corporation, BBC Labs Juicer – http://bbcnewslabs.co.uk/projects/juicer/

[7] Poynter, What if technology could help improve conversations online – https://www.poynter.org/news/new-york-times-teaming-alphabets-jigsaw-expand-its-comments

[8] Nieman Lab, After a rocky reception, Le Monde’s Décodex is almost a year into fighting intox (fake news) in France – http://www.niemanlab.org/2017/11/after-a-rocky-reception-le-mondes-decodex-is-almost-a-year-into-fighting-intox-fake-news-in-france/

[9] Los Angeles Times: How we reported this story – http://www.latimes.com/local/cityhall/la-me-crime-stats-side-20151015-story.html

[10] World Economic Forum, Can you tell if this was written by a robot? 7 challenges for AI in journalism – https://www.weforum.org/agenda/2018/01/can-you-tell-if-this-article-was-written-by-a-robot-7-challenges-for-ai-in-journalism/

[11] Wired, What news-writing bots mean for the future of Journalism – https://www.wired.com/2017/02/robots-wrote-this-story/

[12] ACM, Domain Specific Newsbots: Live Automated Reporting Systems involving Natural Language Communication – Al Johri, Eui-Hong (Sam) Han, Dhrumil Mehta


[13] Slate, The First News Report on the L.A. Earthquake Was Written by a Robot – http://www.slate.com/blogs/future_tense/2014/03/17/quakebot_los_angeles_times_robot_journalist_writes_article_on_la_earthquake.html

[14] CBS, USGS & LA Times ‘Quakebot’ Mistakenly Report Magnitude 5.1 Earthquake Near Redding – http://sanfrancisco.cbslocal.com/2015/05/29/usgs-mistakenly-reports-magnitude-5-1-earthquake-near-redding/

[15] Tom Kent, An ethical checklist for robot Journalism – https://medium.com/@tjrkent/an-ethical-checklist-for-robot-journalism-1f41dcbd7be2

[16] Google Research – https://research.google.com/pubs/pub46311.html

[17] Automated Insights, The ultimate guide to natural language generation –  https://automatedinsights.com/blog/the-ultimate-guide-to-natural-language-generation

[18] Automated Insights, Associated Press case study – https://automatedinsights.com/case-studies/associated-press

[19] A. Graefe, Guide to Automated Journalism – http://towcenter.org/research/guide-to-automated-journalism

[20] Google Research – https://arxiv.org/pdf/1502.03519.pdf

[21] FirstDraft Co’s CrossCheck application – https://firstdraftnews.org/project/crosscheck/

[22] Le Monde’s Fake News directory – http://www.lemonde.fr/les-decodeurs/article/2017/01/23/le-decodex-un-premier-premier-pas-vers-la-verification-de-masse-de-l-information_5067709_4355770.html

 

Additional references

Newswhip – http://www.newswhip.com/2017/08/ap-hidden-force-facebook/

TechEmergence, Automated Journalism – AI Applications at New York Times, Reuters, and Other Media Giants – https://www.techemergence.com/automated-journalism-applications/

Associated Press, AP Insights – https://insights.ap.org

Columbia University, Tow Center blog

Nieman Foundation, http://www.niemanlab.org/
