A note re: social media: after years of not posting anyway, I have fully abandoned Twitter/X and am now on BlueSky as philip-schrodt (the same identifier as my GitHub account). And actually posting and commenting there: until someone figures out how to ruin it [1], BlueSky seems to have much of the communal vibe of the Twitter of old. I do not, as yet, have any invites to offer: that’s a networking thing.
On to our topic.
I just finished coding—okay, annotating—about 200 news articles (from various international, if “mainstream”, sources; this is part of a larger project and, as it happens, I was mostly coding protests) as part of eventual training and validation for a machine learning (ML) event coder (see this) using the PLOVER ontology.[2] Beyond a visceral “I hate this, hates it forever!”, some thoughts and reflections on human coding generally, inspired by this experience.
The Human Component
From way back in the days of reading Kahneman’s Thinking Fast and Slow, I’m increasingly aware of the issue of the cognitive load [3] involved in human coding, and I think it needs to be taken seriously.
Parts of these annotations were pretty easy: PLOVER has a nicely compartmentalized event-mode system for categorizing the events, and identifying the text of relevant individuals and locations was reasonably straightforward: in fact the PLOVER-coded POLECAT data set, one of my sources, includes the text of individuals, organizations, and locations as identified by the open-source spaCy software.
But other aspects of the coding were challenging: the PLOVER/POLECAT system works with texts that are roughly 500 words long (a constraint set by the BERT language models, but pretty typical of news articles), and processing these requires substantially more time and cognitive resources than the single sentences used in the older event data coding projects, which, as a native reader of English, I could generally take in at a glance, and in which multiple events were almost always delineated using readily parsed compound phrase structures.[4]
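For concreteness, here is one way to check that constraint: a minimal sketch assuming the Hugging Face transformers package and a generic bert-base-cased tokenizer (neither is specified in the project described here).

```python
# Hypothetical check of whether an article fits a BERT-family model's
# 512-token window (the ~500-word constraint mentioned above).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def fits_in_window(text: str, max_tokens: int = 512) -> bool:
    # Tokenize without truncation so we see the true length, including
    # the [CLS] and [SEP] special tokens; fine for simple counting.
    n_tokens = len(tokenizer(text, truncation=False)["input_ids"])
    return n_tokens <= max_tokens

article = "Protesters gathered outside the ministry on Tuesday..."
print(fits_in_window(article))  # True for short texts; long wire stories often fail
```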
Furthermore, while I annotated 200 events, I actually read at least 50% more stories, possibly closer to 100% more, and even that ratio of coded to uncoded stories is high given that I was working with a corpus that had already been pretty well filtered for positive cases. At least 50% of the stories contained multiple events (most typically a PROTEST/ASSAULT pair when the state responded repressively), and stories that either lumped together similar events (related protests on the same day) or provided historical context might contain five or more events: this requires close reading.
In addition, PLOVER has an unstructured “context” field drawing on a list [5] of 35 categories, any of which can be included: that is also a heavy cognitive load (all the more so as I’m not entirely happy with the list) and, if we have a chance to develop a dedicated annotation team for PLOVER events, contexts should probably be a separate task, as they are pretty much orthogonal to the event and mode assignment.
Still, I can only code/annotate something at this level of complexity maybe three hours a day. Some of that is due to the novelty of the task, but that can be improved….
Developing machine-assisted coding systems
For a number of years I’ve been generating a couple of near-real-time (monthly updates) data sets, first on attacks on civilians, now on protests, for the late, great Political Instability Task Force, using the US government versions of ICEWS (which include the source texts) as a pre-filter, and on these I could readily do four hours a day (with breaks) despite, for example, the ontology of protest topics having at least fifty distinct categories. I was able to sustain this rate by using machine-assisted software that I have meticulously refined to handle every possible routine task [6] (as well as having a keyboard rather than screen interface), and since I’ve been working with it for years I’ve pretty much memorized both the ontology and the keyboard equivalents. [7] The resulting productivity gains are substantial: for the protest coding, I reduced the time required to code a month of data from about 35 hours to about ten.
But how does one develop such a system, which needs to be hand-crafted? About a decade ago, under NSF funding, I developed (and yes, documented) a “build your own machine-assisted coding site” called CIVET, the Contentious Incident Variable Entry Template. It was used, with a lot of assistance and customizing from me, in a couple of Minerva projects and then…never again.
So, okay, most software has a short shelf life (sometimes zero…at least CIVET got deployed…to say nothing of having an endearing animal print as its logo) but…well, the uptake was not that of TeX or ChatGPT. The use-case is pretty esoteric—how many long-term conflict data collection projects are out there that don’t already have good internal systems? [8]—and it was fairly complicated.
In particular, was it less complicated than just directly writing code in php and javascript (or a javascript framework such as JQuery)? Arguably—which is to say, I’d like to convince myself I didn’t completely waste that time (I was paid…)—it wasn’t, at least not in 2014, when we proposed to NSF what became CIVET. Both php and javascript (and HTML) just kinda grew out of the early days of the web, and substantial parts of each were not all that logically coherent (or necessarily debugged), were constantly changing (and the frameworks even more so [9]), and we were still transitioning from paper-based to web-based (and, most critically, query-based: StackOverflow) documentation.
But today all of this has changed. Things can still be a bit complicated for the likes of me [10], given the need to interact with a server (php), a client (javascript), and one or more databases (multiple possibilities), but, for example, the javascript/HTML DOM (document object model) is brilliant, and php now has almost everything you’d expect in Python (or vice versa: JSON, every data monger’s best friend, started out in the javascript environment). So while I’m still not thrilled about juggling three languages to get something working, it is not that bad, and, critically, there are a gadzillion (too many…) resources describing, in multiple ways and with extensive feedback both useful and hopelessly pedantic, how to do anything you could possibly want to do. Which cannot be said for CIVET.
So for the current annotation work, I simply wrote web pages operating in a client/server environment, and found it straightforward to rapidly modify these as I was working with several different source formats (the project has gone through multiple phases). Moving forward, I’m probably going to use this approach.[11]
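For illustration only (the pages described above were php/javascript, but to keep all the sketches in this post in one language, Python): a hypothetical Flask version of the same client/server pattern, with every file and field name made up.

```python
# A minimal sketch of a client/server annotation page: serve one story
# at a time, record the annotation, move to the next. Hypothetical names.
import json
from flask import Flask, request, jsonify

app = Flask(__name__)
STORIES = json.load(open("stories.json"))   # hypothetical pre-filtered corpus
annotations, cursor = [], 0

@app.route("/next")
def next_story():
    global cursor
    if cursor >= len(STORIES):
        return jsonify({"done": True})
    story = STORIES[cursor]
    cursor += 1
    return jsonify({"done": False, "id": story["id"], "text": story["text"]})

@app.route("/annotate", methods=["POST"])
def annotate():
    # Client posts e.g. {"id": ..., "event": "PROTEST", "mode": ...}
    annotations.append(request.get_json())
    json.dump(annotations, open("annotations.json", "w"), indent=2)
    return jsonify({"saved": len(annotations)})

if __name__ == "__main__":
    app.run(debug=True)
```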
Human vs automated coding: ChatGPT changes everything, right?
I wish.
Let’s start by stipulating three things:
- Near-real-time [13] large scale coding of information on the web [14] is necessarily going to be largely or completely automated: the question is not whether you can do this, but what the quality will be. As always, as I’ve argued innumerable times in this blog, people tend to seriously overestimate the accuracy of human-based coding, particularly coding done in extended multi-institution, multi-generational settings, so the bar that realistically needs to be crossed here is not very high. [15]
- If nothing else, large language models (LLMs) have contributed hugely through embeddings, which more or less resolve the synonym problem that plagued pattern-based approaches (see the sketch following this list).
- As we argue in the ISA-2023 papers linked at the beginning of this essay, future systems will almost certainly be largely example-based, rather than pattern-based.
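On the second point, a minimal sketch of what embeddings buy you, assuming the sentence-transformers package (the model name is just a common default, not anything used in the projects described here).

```python
# Why embeddings "more or less resolve the synonym problem": a keyword
# pattern for "protest" misses "demonstration" and "rally", but embedding
# similarity groups them with no pattern ever written.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

anchor = "Thousands protested against the new law."
candidates = [
    "A large demonstration was held against the legislation.",  # synonym phrasing
    "Crowds rallied in opposition to the measure.",             # synonym phrasing
    "The central bank raised interest rates.",                  # unrelated
]

emb_anchor = model.encode(anchor, convert_to_tensor=True)
for sent in candidates:
    score = util.cos_sim(emb_anchor, model.encode(sent, convert_to_tensor=True)).item()
    print(f"{score:.2f}  {sent}")
# The two paraphrases score far higher than the unrelated sentence.
```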
The third point suggests that—alas, and I hates it forever—human curating of training cases will remain a major task, and probably one that will require non-trivial levels of expertise, and a considerable amount of experimentation, to get right: this is not Mechanical Turk stuff, or a case where pre-labelled training cases are low-hanging fruit on the web [17].
Which, based on my readings of the current ML industry literature, puts political analysts in the same situation as virtually everyone else trying to deploy ML models: the simple cases have been done—distinguishing cats from dogs or purses from bracelets using pre-labeled data from the web—and going forward requires human effort and lots and lots of quality-vs-quantity tradeoffs. Everyone wants to find shortcuts, e.g. in semi-supervised and weakly-supervised training protocols, but it seems pretty clear that one size will not fit all, even if you’ve got billions of dollars available (albeit with much of that going to secure Nvidia chips).
This is not to say that LLMs aren’t an amazing [and amazingly expensive] accomplishment, if for no other reason than being able to watch millions of pedantic arguments about the Turing Test cry out in terror and be suddenly silenced. But I’m less confident that generative models will be relevant to automated coding in the near future, due to at least three factors:
- The aforementioned estimation and deployment costs, far beyond anything social science academics can afford and, in the near future, with the GPU chip shortage, probably beyond even government-funded projects.
- LLMs are, obviously, generative, whereas automated coding is reductive: this is a big deal. Again, embeddings—also reductive—are important, but those are a side effect of LLMs.
- LLM hallucinations are potentially very problematic, particularly given that due to their sheer plausibility they may be more difficult to detect and/or compensate for than classical coding errors.
So, likely due to these and other factors, at a recent workshop I attended that was the kick-off to a new coding development project, everyone [18] was interested in using the smaller BERT family of models, not the GPT family.
Lest this seem too negative, I think the newer models will eventually—and “eventually” may not be that far in the future—be far better (and not just cheaper and faster) than human coding. In some recent experiments—though at this point I still call them “experiments” rather than final results—I seemed to be consistently getting precision and recall scores in the 0.90 to 0.95 range, out-of-sample, in classifying Factiva stories into the PLOVER PROTEST category using only about 150 closely curated positive training cases. That’s hugely better than what any extended human coding project, much less a multi-institutional, multi-generational data set, could achieve. But that’s just one category, and in my experience—which seems pretty consistent with other reports—these models can be very tricky to estimate. [19]
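For a sense of the mechanics (though not of my actual models, which are BERT-family classifiers), here is a hedged stand-in that computes out-of-sample precision and recall for a binary PROTEST classifier; logistic regression over sentence embeddings is used purely to keep the sketch short, and load_curated_cases is hypothetical.

```python
# Stand-in for the kind of experiment described: a small curated positive
# set, out-of-sample precision/recall on a binary PROTEST classification.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

texts, labels = load_curated_cases()   # hypothetical: ~150 positives plus negatives
X = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
```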
The upshot: with LLMs we’re unquestionably in a world with new possibilities, but exploring and exploiting these is not going to happen overnight. To be continued.
The Legal Situation
I’ve made some initial comments on this issue in an update to one of my most-read blog entries, the core point being that the little bitty, and relatively ambiguous, legal niche occupied by event data, specifically the legal status of tiny amounts of very large copyrighted corpora, is suddenly, in a somewhat modified form, in the big leagues. Like really big. You just won’t believe how vastly hugely mind-bogglingly big it is. I mean, you may think your latest research grant is big, but that’s just peanuts compared to what’s going on here. [20]
Cory Doctorow [21] has also been writing on this recently, inter alia here. The key point, which Doctorow alludes to, is that the practice of reading a lot of text, some copyrighted, some not, storing it in unimaginably complex structures that, curiously, are not completely dissimilar from computational neural networks, and then using a generative process to produce text that is derivative of that material but quite different in form, is precisely what every writer, yea every story-teller, from the dawn of human languages, has done. Copyright on the original material not only does not prohibit this; ironically, copyright unambiguously and explicitly protects the output!
When it is produced by a human. What if it is produced by a machine? And that, bunko, is the trillion-dollar question.
As I note at the end of my updated article, we are [now, finally, possibly] in the situation of the bullied little kid who shows up at the playground with his new best friend, the thoroughly tattooed and leather-clad leader of a motorcycle gang. Consider the size of the two most notorious bad-asses in the copyright game, Disney (market cap: $150-billion) and Elsevier ($50-billion) compared to the big dogs in the LLM business: Alphabet/Google ($1.7-trillion), Microsoft ($2.4-trillion), Nvidia ($1.2-trillion), and Meta/Facebook, at the end of the pack with a market cap of “only” $760-billion. To the extent that civil law follows the Golden Rule—”Whoever has the gold makes the rules”—it is likely that at the end of the day, that small greasy spot on the courtroom floor will be all that remains of Elsevier’s legal team, an outcome which will delight academic authors and librarians everywhere.
And finally, “Possession is nine-tenths of the law”. Which is not actually true, but the big dogs have already scraped the entire web, converted it to an incomprehensible but rather useful set of numbers which essentially embody the whole of human knowledge ca. 2022, and conveniently have even “accidentally” released these numbers and the relevant software in the form of LLaMA and its many derivatives. Cat’s out of the bag.
But, but, you say: evil anarchists, you will destroy the entire enterprise of paid journalism! Like it isn’t getting completely destroyed by hedge funds already. Hates you, hates you forever!
Calm down…no. In fact, in my personal behavior I rather thoroughly support subscription-based media, including a forlorn if driven local journalist who is swimming against mighty tides to document the nuances of our local politics [I’m shocked, shocked…] being run of, for, and by real estate developers. [22]
The subscription media produce current news; the institution for which I’d like to see a substitute is archival news, which is a completely different story, though perhaps one not completely dissimilar to how Wikipedia replaced proprietary encyclopedias. But just how much, item-by-item, are those archived texts worth? Leading us to the final—for the moment—observation…
The data-point economic value paradox
The value of an individual news story is closely related to the esoteric if, I believe, widely accepted paradox of the value of an individual’s data on the web, a topic of extensive discussion over the years in the context of whether individuals should be rewarded with a market price for that data.
The problem/paradox: the value of an individual data point—however complex, but in isolation—can be readily and reliably calculated: it is precisely zero. Which is to say, suppose you are an advertiser—and do keep in mind, targeted advertising is what funds virtually all of the web—and you have a single piece of information to work with, say the entire demographic and web-browsing profile of Philip Schrodt. How much good will that do you in determining, say, whether to show Mr. Schrodt, consistently for about a week, advertisements for $32,000 Italian-made industrial-grade potato harvesting machines? [23]
None whatsoever.
Okay, maybe at the grossest level my data could guide some decisions: my age would indicate I should be shown AARP ads and not ads for [nonexistent] tickets to Taylor Swift and Beyoncé concerts, albeit, based on experience, [24] that data would probably be insufficient to ascertain that I already belong to AARP and don’t go to concerts, just as it apparently already indicates I’m a potato farmer with refined tastes for Italian design. But from that single data point, it wouldn’t be worth the effort.
My personal data, in fact, is only of value as one tiny part of a very large collection of data points, whose value is an emergent property. Hence if you figure that in some capitalist utopia your retirement years will be financed by your monetized individual data, think again. Better to join AARP and invest in the finest quality Italian-made potato harvesting equipment (and perhaps some acreage appropriate for growing potatoes).
And thus it is also with individual news reports: not only do these have zero value in isolation, but because most of them are redundant and have the potential to be incorrectly coded, in isolation they arguably have negative value. Rather than dozens, or hundreds, of articles redundantly, and somewhat inconsistently, describing the same event, better to have a single article produced, copyright-free, with automatic summarization software. As is being proposed/imagined/fantasized.
This also has an interesting corollary: a single miscoded event has zero cost/impact. Or should. So yes, yes, sorry, sorry that we coded that bus accident in Mindanao as a terrorist attack, and yes, we know you were stationed nearby as a captain for six months and thus it was of considerable concern to you but really?: ain’t no never mind… [25] A large number of systematic errors—famously, urban bias and media fatigue—will create problems but any single random error?: nah. [26]
So are large news archives such as those maintained by Factiva and LexisNexis worth something? Unquestionably. But are they worth, e.g., the amounts that help provide Elsevier, which owns LexisNexis, with a profit margin of 40%, or which place Factiva in a position where it can threaten entire universities with loss of access? [27] Those sound like monopoly rents to me and, well, returning as usual to the opening key, we hates it, hates it forever.
Footnotes
1. Or as the inimitable Cory Doctorow would phrase this, “enshittify it”.
2. For 160 pages of [open-access] detail on this project, see this and this; for a blogish summary, see this
3. This, it seems, is a surprisingly difficult issue to figure out metabolically, but recent research suggests the culprit may be glutamate. As my bathroom scale will testify, it is not glucose.
4. Two conjectures:
1. Displaying the texts as a delineated set of sentences—spaCy does this quite reliably—would probably substantially reduce the cognitive load, and I’ll probably implement this in the next [hypothetical] iteration of any machine-assisted coding software I create for this project (see the sketch following these conjectures).
2. Should we be coding machine-translated cases at all when the objective is developing training sets? First, when the translation is less than perfect—and the quality varies widely—this really slows down the human processing time and increases the cognitive load. Second, isn’t there a good possibility that poorly translated training cases will reduce the accuracy of the models? Instead, use only standard English, not machine-rendered English, and if the translation of a particular news story is so bad that nothing can be coded from it, well, them’s the breaks. If a non-English source is high quality, develop training sets in the original language, using native speakers as coders.
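A minimal sketch of the first conjecture, assuming spaCy and its small English model are installed:

```python
# Show the coder a story as a numbered list of sentences rather than a
# wall of text, using spaCy's (quite reliable) sentence segmentation.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def display_sentences(text: str) -> None:
    doc = nlp(text)
    for i, sent in enumerate(doc.sents, 1):
        print(f"[{i:2}] {sent.text.strip()}")

display_sentences("Protesters gathered in the capital. Police used tear gas. "
                  "Three organizers were detained.")
```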
5. Probably a mistake…in developing PLOVER we were really trying to get away from the four-level coding hierarchy of CAMEO, but for the contexts a bit more structure would probably be useful. E.g. we currently have a single “economic” context, and giving it some sub-contexts, e.g. [“strike”, “prices/inflation”, “government benefits”, “services”, “inequality”], would be useful. Come to think of it, quite a few contexts could be combined (see the sketch following this list), e.g.
- “political institutions” => [“pro-democracy”, “pro-authoritarian”, “elections”, “legislative”, “legal”],
- “human-rights” => [“gender”, “lgbt”, “asylum”, “repression”, “rights_freedoms”]
- “crime”=>[“corruption”, “cyber”, “illegal_drugs”, “terrorism”]
- “international”=>[“military”, “territory”, “intelligence”, “peacekeeping”, “migration”]
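Written out as a simple Python dict (purely a sketch, not part of the PLOVER specification), with a trivial validation helper:

```python
# The proposed two-level context structure from footnote [5], as a dict.
CONTEXTS = {
    "economic": ["strike", "prices/inflation", "government benefits",
                 "services", "inequality"],
    "political institutions": ["pro-democracy", "pro-authoritarian",
                               "elections", "legislative", "legal"],
    "human-rights": ["gender", "lgbt", "asylum", "repression",
                     "rights_freedoms"],
    "crime": ["corruption", "cyber", "illegal_drugs", "terrorism"],
    "international": ["military", "territory", "intelligence",
                      "peacekeeping", "migration"],
}

def validate(context: str, sub: str) -> bool:
    # An annotator (or coding program) would record a (context, sub) pair.
    return sub in CONTEXTS.get(context, [])

assert validate("economic", "strike")
```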
6. Albeit these are generally static—keyword-based pattern-matching for the most part—rather than dynamic per the various “active learning” methods now available in, e.g., the prodigy annotation platform: for sufficiently uniform inputs, this simple approach can result in massive increases in productivity.
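A minimal sketch of that static, keyword-based approach; the patterns and labels here are invented for illustration:

```python
# Static pre-filtering: no active learning, just compiled patterns that
# route obvious cases to the right bucket so the human sees only the
# hard ones.
import re

PATTERNS = {
    "PROTEST": re.compile(r"\b(protest\w*|demonstrat\w*|rally|rallies|march\w*)\b", re.I),
    "ASSAULT": re.compile(r"\b(tear gas|baton\w*|beat\w*|opened fire)\b", re.I),
}

def prefilter(text: str) -> list[str]:
    return [label for label, pat in PATTERNS.items() if pat.search(text)]

print(prefilter("Police used tear gas as thousands marched downtown."))
# ['PROTEST', 'ASSAULT'] -- the PROTEST/ASSAULT pairing mentioned earlier
```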
7. In the early days of personal computing there was a keyboard-driven word processing program called WordPerfect, and regular users—say, faculty who did a lot of writing—memorized countless complex key combinations and could work at astonishing speeds compared to those of us using screen-based systems. And, of course, there’s emacs…
[For the record, I still use the screen-oriented programming editor BBEdit, whose company—with a [non-] mission statement not unlike that of Parus Analytics—just passed its 30-year birthday/anniversary: this is the only proprietary software I own (I do subscribe to, and gratefully use, some cloud-based software, notably https://data.page/json/csv). BBEdit’s original slogan was, famously, “It doesn’t suck.” It still doesn’t.]
8. Conversely, how many use legal pads or spreadsheets…I don’t want to know…
9. CIVET has still another layer of complexity, the Django system, which again probably made sense at the time, but I doubt I would use it now.
10. An experienced web developer—throw a frisbee at random on Charlottesville’s downtown Mall and you’ll probably hit one, after which it will bounce off and hit someone teaching yoga and mindfulness meditation—would be fluent in these approaches. Whereas I’m still forgetting semicolons.
11. TL;DR: a very long discourse on curses, the package.
Until this most recent project (and CIVET), my machine-assisted programs have been written with the curses terminal package, which works at the character level and is keyboard-driven. This has several clear advantages: it is in Python (and before that C), hence a single language; it is a single-machine rather than server/client system, so everything (notably files) is in one place and it is both very fast and independent of a web connection; and, more generally, keyboards are quicker and safer (re: carpal tunnel and related maladies) than menus and mice. The downsides are that it doesn’t automatically adjust to different screen sizes, that every input tool must be built from basic code (albeit once you have created a few examples you just cut-and-paste), and that it lacks the vast options of HTML and javascript input and display widgets. But in general I can write and modify curses code faster than I can write php/javascript/HTML.
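For the record, the core curses pattern is compact; a minimal sketch, with invented records and key bindings:

```python
# Keyboard-driven curses annotation: one record on screen,
# single-keystroke category assignment, no mouse.
import curses

RECORDS = ["Protesters blocked the highway...", "The minister resigned..."]
KEYS = {"p": "PROTEST", "a": "ASSAULT", "x": "SKIP"}  # hypothetical bindings

def annotate(stdscr):
    codings = []
    for text in RECORDS:
        stdscr.clear()
        stdscr.addstr(0, 0, text[:500])
        stdscr.addstr(10, 0, "[p]rotest  [a]ssault  [x] skip  [q]uit")
        while True:
            ch = chr(stdscr.getch())   # blocking single-keystroke read
            if ch == "q":
                return codings
            if ch in KEYS:
                codings.append((text, KEYS[ch]))
                break
    return codings

# wrapper() sets up and restores the terminal around the annotation loop
print(curses.wrapper(annotate))
```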
That said, the major excuse I used was being able to run the programs on long flights [12], but in point of fact I tend to use long flights either to sleep (eastbound; I have long argued that sleeping on airplanes in economy class is a serious professional skill that must be learned) or to read accumulated magazines and edit my laptop-based journal (westbound), and the screen size on my laptop is about a third that of my desktop, so I’m pretty much limited to simple tasks such as filtering with prodigy-like systems, of which I have many.
This still leaves the issue of being able to do almost all tasks from the keyboard, which remains far faster. While I’ve not implemented such a system yet, my sense now is that a suitably customized—and probably extensively customized—web page could handle this and, as with most things programming, once it has been done once, subsequent iterations are relatively easy. We shall see.
So while my 2014 self was quite happy with curses, my 2024 self will probably work with AJAX variants.
12. I am, alas, one of those people whose carbon footprint is far and away dominated by air travel, and, well, I shouldn’t do this. But wow, are we ever having a post-COVID conference bounceback! Though I am using the Kansas Land Trust for carbon offsets, as prairie grasses sequester carbon underground where it does not burn (the grass burns, but in native, not invasive (the tragic issue in Maui), prairie that’s a [quite dramatic] nutrient-cycling feature, not a problem) and are rather hardy, and the whole area is going back to wild prairie anyway, as industrial farming has pretty much finished off the Ogallala aquifer.
13. “Near-real-time” is a critical caveat: several very high quality and widely-used data sets in political science are human coded—always with sophisticated machine-assisted coding frameworks in the background—but they are not released in near-real-time, instead having lags of a number of months, and typically a year or even a decade. That’s a different animal.
But wait, didn’t you say you’ve been coding near-real-time data?? Yes, but with ICEWS and now POLECAT as pre-filters, so I’m dependent on the automated systems.
14. While my own experience is largely in the context of event data, I think there are four clear general categories of use cases for automated coding of political data:
- Clustering and filtering: huge productivity enhancers
- Sentiment: there’s a huge amount of research on this due to its relevance in commercial applications, and it goes back to the beginning of automated content analysis, with the Ur-program General Inquirer.
- Features, e.g. does a human rights report mention state-sanctioned sexual violence? Again, this is a general problem
- Events, which are the most complicated and fairly specific to political event data, though event extraction has been a long-standing interest of DARPA, leading to a number of specialized developments in the field of computational linguistics.
15. A different question than crossing the accuracy bars set, often as not, by people who have never used data in the context of political analysis. As for those who do use it, repeat after me: “First they say it is impossible, then they say it is crap, then they ask where the data is when you don’t post it on time.” [16]
16. I never claimed to have originated this, but I think I may have now located the source (which, of course, may well also have an earlier source, or be apocryphal):
“All truth passes through three stages: First, it is ridiculed; second, it is violently opposed; and third, it is accepted as self-evident.” Arthur Schopenhauer
17. An interesting, and very real, edge case: ICEWS would quite frequently incorrectly identify “police” as one of the initiators of protest demonstrations, and I used a post-filter to identify and correct these cases. However, I had to manually determine whether to remove them, since every so often the police actually do engage in anti-government demonstrations, typically over wages and benefits, but occasionally because they believe the government is being too restrictive in the police response to demonstrations. It’s complicated…
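A minimal sketch of such a post-filter, with hypothetical field names; note that flagged cases go to manual review rather than being deleted outright:

```python
# Flag events where "police" is listed as a protest initiator, but queue
# them for human review, since police do sometimes genuinely protest.
def flag_police_initiated(events):
    auto_keep, needs_review = [], []
    for ev in events:   # ev is a dict; field names are hypothetical
        if ev["event"] == "PROTEST" and "police" in ev["source_actor"].lower():
            needs_review.append(ev)   # human decides: miscoding or real police protest?
        else:
            auto_keep.append(ev)
    return auto_keep, needs_review

events = [
    {"event": "PROTEST", "source_actor": "Police (India)", "text": "..."},
    {"event": "PROTEST", "source_actor": "Farmers", "text": "..."},
]
keep, review = flag_police_initiated(events)
print(len(keep), "kept;", len(review), "sent to manual review")
```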
18. A random note that, in fact, has next to nothing to do with the topic but I found most curious: at this and another largely independent workshop I attended in the past month, I noted that post-COVID, slide presentations have become vastly simpler—generally black-on-white, only necessary graphics, no cringe-worthy animated subtitles—than in the pre-COVID era. My hypothesis: Zoom bandwidth. You don’t look (or feel) good when “next slide” invokes a ten-second delay.
19. My Google Colaboratory models seem to maintain some sort of state between runs that results in their converging on the same model after a while, despite my efforts to randomize. So what other mistakes am I making in Colaboratory?
20. WTF? This.
21. Doctorow is not for the faint of heart, but he is right a lot more often than he is wrong. Your reaction to his work will doubtless be governed in part by whether you consider “enshittification” to be a word, though it is difficult to dispute the legitimacy of the general concept.
22. So, I’m thinking, I spend a lot on subscription news, but do I spend as much as I spend on oat-milk chai lattes? Maybe I should use that as a benchmark? Mind you, most of the chai latte expenditure goes to local labor. And real estate developers.
23. Yes, I got these—pretty sure it was advertising this, and maybe I’m wrong about the price—as my predominant advertisement across [of course…] multiple web pages on Google Chrome for a couple of weeks, then a pause, then for a couple more weeks. I’m also apparently in the market for machines that can make aluminum gutters on-site. And you thought event data coding was bad?
24. I presume I am not alone in the experience of looking up some product, purchasing it, then receiving ads for that product for at least a week or more. Though I did not purchase the Italian potato harvesting machines.
25. Is this phrase proof I’m not writing this using ChatGPT? Or the opposite?
26. There is a long-standing real-time event data set colloquially known as “The Data Set That Shall Not Be Named” that across at least two independent tests was shown to contain only about 5% of cases that were neither redundant nor miscoded. Can you do meaningful conflict analysis with a 1:20 signal-to-noise ratio? Well, apparently you can, as I’ve heard from multiple projects, and realistically, statistical analysts in all sorts of fields have for decades worked with data as bad or worse. Though I am not suggesting this as a deliberate practice when alternatives are available, and they are.
27. Dow Jones (market cap: $40-billion), which owns Factiva, has a quite modest profit rate of 3.5%, right around the average for companies listed in its eponymous average, and of course Dow Jones, unlike Elsevier, actually produces original research. As to Factiva’s notorious “Nice research project you got here; pity if something happened to it…” approach, they appear to have become more accommodating lately: the knowledge that the LLMs have almost certainly hoovered their entire content probably contributes to this.