A note re: social media: after years of not posting anyway, I have fully abandoned Twitter/X and am now on BlueSky as philip-schrodt (the same identifier as my GitHub account). And actually posting and commenting there: until someone figures out how to ruin it [1], BlueSky seems to have much of the communal vibe of the Twitter of old. I do not, as yet, have any invites to offer: that’s a networking thing.
On to our topic.
I just finished coding—okay, annotating—about 200 news articles (from various international, if “mainstream”, sources; this is part of a larger project and, as it happens, I was mostly coding protests) as part of eventual training and validation for a machine learning (ML) event coder (see this) using the PLOVER ontology.[2] Beyond a visceral “I hate this, hates it forever!”, some thoughts and reflections on human coding generally, inspired by this experience.
The Human Component
From way back in the days of reading Kahneman’s Thinking Fast and Slow, I’m increasingly aware of the issue of the cognitive load [3] involved in human coding, and I think it needs to be taken seriously.
Parts of these annotations were pretty easy: PLOVER has a nicely compartmentalized event-mode system for categorizing the events, and identifying the text of relevant individuals and locations was reasonably straightforward: in fact the PLOVER-coded POLECAT data set, one of my sources, includes the text of individuals, organizations, and locations as identified by the open-source spaCy software.
But other aspects of the coding were challenging: the PLOVER/POLECAT system works with texts that are roughly 500 words long (a constraint set by the BERT language models, but pretty typical of news articles), and processing these requires substantially more time and cognitive resources than the single sentences used in the older event data coding projects, which, as a native reader of English, I could generally take in at a glance, and in which multiple events were almost always delineated using readily parsed compound phrase structures.[4]
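For concreteness, here is one way to check that constraint: a minimal sketch assuming the Hugging Face transformers package and a generic bert-base-cased tokenizer (neither is specified in the project described here).

```python
# Hypothetical check of whether an article fits a BERT-family model's
# 512-token window (the ~500-word constraint mentioned above).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def fits_in_window(text: str, max_tokens: int = 512) -> bool:
    # Tokenize without truncation so we see the true length, including
    # the [CLS] and [SEP] special tokens; fine for simple counting.
    n_tokens = len(tokenizer(text, truncation=False)["input_ids"])
    return n_tokens <= max_tokens

article = "Protesters gathered outside the ministry on Tuesday..."
print(fits_in_window(article))  # True for short texts; long wire stories often fail
```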
Furthermore, while I annotated 200 events, I actually read at least 50% more stories, possibly closer to 100% more, and even that ratio of coded to uncoded stories is high given that I was working with a corpus that had already been pretty well filtered for positive cases. At least 50% of the stories contained multiple events (most typically a PROTEST/ASSAULT pair when the state responded repressively), and stories that either lumped together similar events (related protests on the same day) or provided historical context might contain five or more events: this requires close reading.
In addition, PLOVER has an unstructured “context” field drawing on a list [5] of 35 categories, any of which can be included: that is also a heavy cognitive load (all the more so as I’m not entirely happy with the list) and, if we have a chance to develop a dedicated annotation team for PLOVER events, contexts should probably be a separate task, as they are pretty much orthogonal to the event and mode assignment.
Still, I can only code/annotate something at this level of complexity maybe three hours a day. Some of that is due to the novelty of the task, but that can be improved….
Developing machine-assisted coding systems
For a number of years I’ve been generating a couple of near-real-time (monthly updates) data sets, first on attacks on civilians, now on protests, for the late, great Political Instability Task Force, using the US government versions of ICEWS (which include the source texts) as a pre-filter, and on these I could readily do four hours a day (with breaks) despite, for example, the ontology of protest topics having at least fifty distinct categories. I was able to sustain this rate by using machine-assisted software that I have meticulously refined to handle every possible routine task [6] (as well as having a keyboard rather than screen interface), and since I’ve been working with it for years I’ve pretty much memorized both the ontology and the keyboard equivalents. [7] The resulting productivity gains are substantial: for the protest coding, I reduced the time required to code a month of data from about 35 hours to about ten.
But how does one develop such a system, which needs to be hand-crafted? About a decade ago, under NSF funding, I developed (and yes, documented) a “build your own machine-assisted coding site” called CIVET, the Contentious Incident Variable Entry Template. It was used, with a lot of assistance and customizing from me, in a couple of Minerva projects and then…never again.
So, okay, most software has a short shelf life (sometimes zero…at least CIVET got deployed…to say nothing of having an endearing animal print as its logo) but…well, the uptake was not that of TeX or ChatGPT. The use-case is pretty esoteric—how many long-term conflict data collection projects are out there that don’t already have good internal systems? [8]—and it was fairly complicated.
In particular, was it less complicated than just directly writing code in php and javascript (or a javascript framework such as JQuery)? Arguably—which is to say, I’d like to convince myself I didn’t completely waste that time (I was paid…)—it wasn’t, at least not in 2014, when we proposed to NSF what became CIVET. Both php and javascript (and HTML) just kinda grew out of the early days of the web, and substantial parts of each were not all that logically coherent (or necessarily debugged), were constantly changing (and the frameworks even more so [9]), and we were still transitioning from paper-based to web-based (and, most critically, query-based: StackOverflow) documentation.
But today all of this has changed. Things can still be a bit complicated for the likes of me [10], given the need to interact with a server (php), a client (javascript), and one or more databases (multiple possibilities), but, for example, the javascript/HTML DOM (document object model) is brilliant, and php now has almost everything you’d expect in Python (or vice versa: JSON, every data monger’s best friend, started out in the javascript environment). So while I’m still not thrilled about juggling three languages to get something working, it is not that bad, and, critically, there are a gadzillion (too many…) resources describing, in multiple ways and with extensive feedback both useful and hopelessly pedantic, how to do anything you could possibly want to do. Which cannot be said for CIVET.
So for the current annotation work, I simply wrote web pages operating in a client/server environment, and found it straightforward to rapidly modify these as I was working with several different source formats (the project has gone through multiple phases). Moving forward, I’m probably going to use this approach.[11]
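For illustration only (the pages described above were php/javascript, but to keep all the sketches in this post in one language, Python): a hypothetical Flask version of the same client/server pattern, with every file and field name made up.

```python
# A minimal sketch of a client/server annotation page: serve one story
# at a time, record the annotation, move to the next. Hypothetical names.
import json
from flask import Flask, request, jsonify

app = Flask(__name__)
STORIES = json.load(open("stories.json"))   # hypothetical pre-filtered corpus
annotations, cursor = [], 0

@app.route("/next")
def next_story():
    global cursor
    if cursor >= len(STORIES):
        return jsonify({"done": True})
    story = STORIES[cursor]
    cursor += 1
    return jsonify({"done": False, "id": story["id"], "text": story["text"]})

@app.route("/annotate", methods=["POST"])
def annotate():
    # Client posts e.g. {"id": ..., "event": "PROTEST", "mode": ...}
    annotations.append(request.get_json())
    json.dump(annotations, open("annotations.json", "w"), indent=2)
    return jsonify({"saved": len(annotations)})

if __name__ == "__main__":
    app.run(debug=True)
```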
Human vs automated coding: ChatGPT changes everything, right?
I wish.
Let’s start by stipulating three things:
- Near-real-time [13] large scale coding of information on the web [14] is necessarily going to be largely or completely automated: the question is not whether you can do this, but what the quality will be. As always, as I’ve argued innumerable times in this blog, people tend to seriously overestimate the accuracy of human-based coding, particularly coding done in extended multi-institution, multi-generational settings, so the bar that realistically needs to be crossed here is not very high. [15]
- If nothing else, large language models (LLMs) have contributed hugely through embeddings, which more or less resolve the synonym problem that plagued pattern-based approaches (see the sketch following this list).
- As we argue in the ISA-2023 papers linked at the beginning of this essay, future systems will almost certainly be largely example-based, rather than pattern-based.
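On the second point, a minimal sketch of what embeddings buy you, assuming the sentence-transformers package (the model name is just a common default, not anything used in the projects described here).

```python
# Why embeddings "more or less resolve the synonym problem": a keyword
# pattern for "protest" misses "demonstration" and "rally", but embedding
# similarity groups them with no pattern ever written.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

anchor = "Thousands protested against the new law."
candidates = [
    "A large demonstration was held against the legislation.",  # synonym phrasing
    "Crowds rallied in opposition to the measure.",             # synonym phrasing
    "The central bank raised interest rates.",                  # unrelated
]

emb_anchor = model.encode(anchor, convert_to_tensor=True)
for sent in candidates:
    score = util.cos_sim(emb_anchor, model.encode(sent, convert_to_tensor=True)).item()
    print(f"{score:.2f}  {sent}")
# The two paraphrases score far higher than the unrelated sentence.
```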
The third point suggests that—alas, and I hates it forever—human curating of training cases will remain a major task, and probably one that will require non-trivial levels of expertise, and a considerable amount of experimentation, to get right: this is not Mechanical Turk stuff, or a case where pre-labelled training cases are low-hanging fruit on the web [17].
Which, based on my readings of the current ML industry literature, puts political analysts in the same situation as virtually everyone else trying to deploy ML models: the simple cases have been done—distinguishing cats from dogs or purses from bracelets using pre-labeled data from the web—and going forward requires human effort and lots and lots of quality-vs-quantity tradeoffs. Everyone wants to find shortcuts, e.g. in semi-supervised and weakly-supervised training protocols, but it seems pretty clear that one size will not fit all, even if you’ve got billions of dollars available (albeit with much of that going to secure Nvidia chips).
This is not to say that LLMs aren’t an amazing [and amazingly expensive] accomplishment, if for no other reason than being able to watch millions of pedantic arguments about the Turing Test cry out in terror and be suddenly silenced. But I’m less confident that generative models will be relevant to automated coding in the near future, due to at least three factors:
- The aforementioned estimation and deployment costs, far beyond anything social science academics can afford and, in the near future, with the GPU chip shortage, probably beyond even government-funded projects.
- LLMs are, obviously, generative, whereas automated coding is reductive: this is a big deal. Again, embeddings—also reductive—are important, but those are a side effect of LLMs.
- LLM hallucinations are potentially very problematic, particularly given that due to their sheer plausibility they may be more difficult to detect and/or compensate for than classical coding errors.
So, likely due to these and other factors, at a recent workshop I attended that was the kick-off to a new coding development project, everyone [18] was interested in using the smaller BERT family of models, not the GPT family.
Lest this seem too negative, I think the newer models will eventually—and “eventually” may not be that far in the future—be far better (and not just cheaper and faster) than human coding. In some recent experiments—though at this point I still call them “experiments” rather than final results—I seemed to be consistently getting precision and recall scores in the 0.90 to 0.95 range, out-of-sample, in classifying Factiva stories into the PLOVER PROTEST category using only about 150 closely curated positive training cases. That’s hugely better than what any extended human coding project, much less a multi-institutional, multi-generational data set, could achieve. But that’s just one category, and in my experience—which seems pretty consistent with other reports—these models can be very tricky to estimate. [19]
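For a sense of the mechanics (though not of my actual models, which are BERT-family classifiers), here is a hedged stand-in that computes out-of-sample precision and recall for a binary PROTEST classifier; logistic regression over sentence embeddings is used purely to keep the sketch short, and load_curated_cases is hypothetical.

```python
# Stand-in for the kind of experiment described: a small curated positive
# set, out-of-sample precision/recall on a binary PROTEST classification.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

texts, labels = load_curated_cases()   # hypothetical: ~150 positives plus negatives
X = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
```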
The upshot: with LLMs we’re unquestionably in a world with new possibilities, but exploring and exploiting these is not going to happen overnight. To be continued.
The Legal Situation
I’ve made some initial comments on this issue in an update to one of my most-read blog entries, the core point being that the little bitty, and relatively ambiguous, legal niche occupied by event data, specifically the legal status of tiny amounts of very large copyrighted corpora, is suddenly, in a somewhat modified form, in the big leagues. Like really big. You just won’t believe how vastly hugely mind-bogglingly big it is. I mean, you may think your latest research grant is big, but that’s just peanuts compared to what’s going on here. [20]
Cory Doctorow [21] has also been writing on this recently, inter alia here. The key point, which Doctorow alludes to, is that the practice of reading a lot of text, some copyrighted, some not, storing it in unimaginably complex structures that, curiously, are not completely dissimilar from computational neural networks, and then using a generative process to produce text that is derivative of that material but quite different in form, is precisely what every writer, yea every story-teller, from the dawn of human languages, has done. Copyright on the original material not only does not prohibit this; ironically, copyright unambiguously and explicitly protects the output!
When it is produced by a human. What if it is produced by a machine? And that, bunko, is the trillion-dollar question.
As I note at the end of my updated article, we are [now, finally, possibly] in the situation of the bullied little kid who shows up at the playground with his new best friend, the thoroughly tattooed and leather-clad leader of a motorcycle gang. Consider the size of the two most notorious bad-asses in the copyright game, Disney (market cap: $150-billion) and Elsevier ($50-billion) compared to the big dogs in the LLM business: Alphabet/Google ($1.7-trillion), Microsoft ($2.4-trillion), Nvidia ($1.2-trillion), and Meta/Facebook, at the end of the pack with a market cap of “only” $760-billion. To the extent that civil law follows the Golden Rule—”Whoever has the gold makes the rules”—it is likely that at the end of the day, that small greasy spot on the courtroom floor will be all that remains of Elsevier’s legal team, an outcome which will delight academic authors and librarians everywhere.
And finally, “Possession is nine-tenths of the law”. Which is not actually true, but the big dogs have already scraped the entire web, converted it to an incomprehensible but rather useful set of numbers which essentially embody the whole of human knowledge ca. 2022, and conveniently have even “accidentally” released these numbers and the relevant software in the form of LLaMA and its many derivatives. Cat’s out of the bag.
But, but, you say: evil anarchists, you will destroy the entire enterprise of paid journalism! Like it isn’t getting completely destroyed by hedge funds already. Hates you, hates you forever!
Calm down…no. In fact, in my personal behavior I rather thoroughly support subscription-based media, including a forlorn if driven local journalist who is swimming against mighty tides to document the nuances of our local politics [I’m shocked, shocked…] being run of, for, and by real estate developers. [22]
The subscription media produce current news; the institution for which I’d like to see a substitute is archival news, which is a completely different story, though perhaps one not completely dissimilar to how Wikipedia replaced proprietary encyclopedias. But just how much, item-by-item, are those archived texts worth? Leading us to the final—for the moment—observation…
The data-point economic value paradox
The value of an individual news story is closely related to the esoteric if, I believe, widely accepted paradox of the value of an individual’s data on the web, a topic of extensive discussion over the years in the context of whether individuals should be rewarded with a market price for that data.
The problem/paradox: the value of an individual data point—however complex, but in isolation—can be readily and reliably calculated: it is precisely zero. Which is to say, suppose you are an advertiser—and do keep in mind, targeted advertising is what funds virtually all of the web—and you have a single piece of information to work with, say the entire demographic and web-browsing profile of Philip Schrodt. How much good will that do you in determining, say, whether to show Mr. Schrodt, consistently for about a week, advertisements for $32,000 Italian-made industrial-grade potato harvesting machines? [23]
None whatsoever.
Okay, maybe at the grossest level my data could guide some decisions: my age would indicate I should be shown AARP ads and not ads for [nonexistent] tickets to Taylor Swift and Beyoncé concerts, albeit, based on experience, [24] that data would probably be insufficient to ascertain that I already belong to AARP and don’t go to concerts, just as it apparently already indicates I’m a potato farmer with refined tastes for Italian design. But from that single data point, it wouldn’t be worth the effort.
My personal data, in fact, is only of value as one tiny part of a very large collection of data points, whose value is an emergent property. Hence if you figure that in some capitalist utopia your retirement years will be financed by your monetized individual data, think again. Better to join AARP and invest in the finest quality Italian-made potato harvesting equipment (and perhaps some acreage appropriate for growing potatoes).
And thus it is also with individual news reports: not only do these have zero value in isolation, but because most of them are redundant and have the potential to be incorrectly coded, in isolation they arguably have negative value. Rather than dozens, or hundreds, of articles redundantly, and somewhat inconsistently, describing the same event, better to have a single article produced, copyright-free, with automatic summarization software. As is being proposed/imagined/fantasized.
This also has an interesting corollary: a single miscoded event has zero cost/impact. Or should. So yes, yes, sorry, sorry that we coded that bus accident in Mindanao as a terrorist attack, and yes, we know you were stationed nearby as a captain for six months and thus it was of considerable concern to you but really?: ain’t no never mind… [25] A large number of systematic errors—famously, urban bias and media fatigue—will create problems but any single random error?: nah. [26]
So are large news archives such as those maintained by Factiva and LexisNexis worth something? Unquestionably. But are they worth, e.g., the amounts that help provide Elsevier, which owns LexisNexis, with a profit margin of 40%, or which place Factiva in a position where it can threaten entire universities with loss of access? [27] Those sound like monopoly rents to me and, well, returning as usual to the opening key, we hates it, hates it forever.
Footnotes
1. Or as the inimitable Cory Doctorow would phrase this, “enshittify it”.
2. For 160 pages of [open-access] detail on this project, see this and this; for a blogish summary, see this
3. This, it seems, is a surprisingly difficult issue to figure out metabolically, but recent research suggests the culprit may be glutamate. As my bathroom scale will testify, it is not glucose.
4. Two conjectures:
1. Displaying the texts as a delineated set of sentences—spaCy does this quite reliably—would probably substantially reduce the cognitive load, and I’ll probably implement this in the next [hypothetical] iteration of any machine-assisted coding software I create for this project (see the sketch following these conjectures).
2. Should we be coding machine-translated cases at all when the objective is developing training sets? First, when the translation is less than perfect—and the quality varies widely—this really slows down the human processing time and increases the cognitive load. Second, isn’t there a good possibility that poorly translated training cases will reduce the accuracy of the models? Instead, use only standard English, not machine-rendered English, and if the translation of a particular news story is so bad that nothing can be coded from it, well, them’s the breaks. If a non-English source is high quality, develop training sets in the original language, using native speakers as coders.
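A minimal sketch of the first conjecture, assuming spaCy and its small English model are installed:

```python
# Show the coder a story as a numbered list of sentences rather than a
# wall of text, using spaCy's (quite reliable) sentence segmentation.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def display_sentences(text: str) -> None:
    doc = nlp(text)
    for i, sent in enumerate(doc.sents, 1):
        print(f"[{i:2}] {sent.text.strip()}")

display_sentences("Protesters gathered in the capital. Police used tear gas. "
                  "Three organizers were detained.")
```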
5. Probably a mistake…in developing PLOVER we were really trying to get away from the four-level coding hierarchy of CAMEO, but for the contexts a bit more structure would probably be useful. E.g. we currently have a single “economic” context, and giving it some sub-contexts, e.g. [“strike”, “prices/inflation”, “government benefits”, “services”, “inequality”], would be useful. Come to think of it, quite a few contexts could be combined (see the sketch following this list), e.g.
- “political institutions” => [“pro-democracy”, “pro-authoritarian”, “elections”, “legislative”, “legal”],
- “human-rights” => [“gender”, “lgbt”, “asylum”, “repression”, “rights_freedoms”]
- “crime”=>[“corruption”, “cyber”, “illegal_drugs”, “terrorism”]
- “international”=>[“military”, “territory”, “intelligence”, “peacekeeping”, “migration”]
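Written out as a simple Python dict (purely a sketch, not part of the PLOVER specification), with a trivial validation helper:

```python
# The proposed two-level context structure from footnote [5], as a dict.
CONTEXTS = {
    "economic": ["strike", "prices/inflation", "government benefits",
                 "services", "inequality"],
    "political institutions": ["pro-democracy", "pro-authoritarian",
                               "elections", "legislative", "legal"],
    "human-rights": ["gender", "lgbt", "asylum", "repression",
                     "rights_freedoms"],
    "crime": ["corruption", "cyber", "illegal_drugs", "terrorism"],
    "international": ["military", "territory", "intelligence",
                      "peacekeeping", "migration"],
}

def validate(context: str, sub: str) -> bool:
    # An annotator (or coding program) would record a (context, sub) pair.
    return sub in CONTEXTS.get(context, [])

assert validate("economic", "strike")
```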
6. Albeit these are generally static—keyword-based pattern-matching for the most part—rather than dynamic per the various “active learning” methods now available in, e.g., the prodigy annotation platform: for sufficiently uniform inputs, this simple approach can result in massive increases in productivity.
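A minimal sketch of that static, keyword-based approach; the patterns and labels here are invented for illustration:

```python
# Static pre-filtering: no active learning, just compiled patterns that
# route obvious cases to the right bucket so the human sees only the
# hard ones.
import re

PATTERNS = {
    "PROTEST": re.compile(r"\b(protest\w*|demonstrat\w*|rally|rallies|march\w*)\b", re.I),
    "ASSAULT": re.compile(r"\b(tear gas|baton\w*|beat\w*|opened fire)\b", re.I),
}

def prefilter(text: str) -> list[str]:
    return [label for label, pat in PATTERNS.items() if pat.search(text)]

print(prefilter("Police used tear gas as thousands marched downtown."))
# ['PROTEST', 'ASSAULT'] -- the PROTEST/ASSAULT pairing mentioned earlier
```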
7. In the early days of personal computing there was a keyboard-driven word processing program called WordPerfect, and regular users—say, faculty who did a lot of writing—memorized countless complex key combinations and could work at astonishing speeds compared to those of us using screen-based systems. And, of course, there’s emacs…
[For the record, I still use the screen-oriented programming editor BBEdit, whose company—with a [non-] mission statement not unlike that of Parus Analytics—just passed its 30-year birthday/anniversary: this is the only proprietary software I own (I do subscribe to, and gratefully use, some cloud-based software, notably https://data.page/json/csv). BBEdit’s original slogan was, famously, “It doesn’t suck.” It still doesn’t.]
8. Conversely, how many use legal pads or spreadsheets…I don’t want to know…
9. CIVET has still another layer of complexity, the Django system, which again probably made sense at the time, but I doubt I would use it now.
10. An experienced web developer—throw a frisbee at random on Charlottesville’s downtown Mall and you’ll probably hit one, after which it will bounce off and hit someone teaching yoga and mindfulness meditation—would be fluent in these approaches. Whereas I’m still forgetting semicolons.
11. TL;DR: a very long discourse on curses, the package.
Until this most recent project (and CIVET), my machine-assisted programs have been written with the curses terminal package, which works at the character level and is keyboard-driven. This has several clear advantages: it is in Python (and before that C), hence a single language; it is a single-machine rather than server/client system, so everything (notably files) is in one place and it is both very fast and independent of a web connection; and, more generally, keyboards are quicker and safer (re: carpal tunnel and related maladies) than menus and mice. The downsides are that it doesn’t automatically adjust to different screen sizes, that every input tool must be built from basic code (albeit once you have created a few examples you just cut-and-paste), and that it lacks the vast options of HTML and javascript input and display widgets. But in general I can write and modify curses code faster than I can write php/javascript/HTML.
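For the record, the core curses pattern is compact; a minimal sketch, with invented records and key bindings:

```python
# Keyboard-driven curses annotation: one record on screen,
# single-keystroke category assignment, no mouse.
import curses

RECORDS = ["Protesters blocked the highway...", "The minister resigned..."]
KEYS = {"p": "PROTEST", "a": "ASSAULT", "x": "SKIP"}  # hypothetical bindings

def annotate(stdscr):
    codings = []
    for text in RECORDS:
        stdscr.clear()
        stdscr.addstr(0, 0, text[:500])
        stdscr.addstr(10, 0, "[p]rotest  [a]ssault  [x] skip  [q]uit")
        while True:
            ch = chr(stdscr.getch())   # blocking single-keystroke read
            if ch == "q":
                return codings
            if ch in KEYS:
                codings.append((text, KEYS[ch]))
                break
    return codings

# wrapper() sets up and restores the terminal around the annotation loop
print(curses.wrapper(annotate))
```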
That said, the major excuse I used was being able to run the programs on long flights [12], but in point of fact I tend to use long flights either to sleep (eastbound; I have long argued that sleeping on airplanes in economy class is a serious professional skill that must be learned) or to read accumulated magazines and edit my laptop-based journal (westbound), and the screen size on my laptop is about a third that of my desktop, so I’m pretty much limited to simple tasks such as filtering with prodigy-like systems, of which I have many.
This still leaves the issue of being able to do almost all tasks from the keyboard, which remains far faster. While I’ve not implemented such a system yet, my sense now is that a suitably customized—and probably extensively customized—web page could handle this and, as with most things programming, once it has been done once, subsequent iterations are relatively easy. We shall see.
So while my 2014 self was quite happy with curses, my 2024 self will probably work with AJAX variants.
12. I am, alas, one of those people whose carbon footprint is far and away dominated by air travel, and, well, I shouldn’t do this. But wow, are we ever having a post-COVID conference bounceback! Though I am using the Kansas Land Trust for carbon offsets, as prairie grasses sequester carbon underground where it does not burn (the grass burns, but in native, not invasive (the tragic issue in Maui), prairie that’s a [quite dramatic] nutrient-cycling feature, not a problem) and are rather hardy, and the whole area is going back to wild prairie anyway, as industrial farming has pretty much finished off the Ogallala aquifer.
13. “Near-real-time” is a critical caveat: several very high quality and widely-used data sets in political science are human coded—always with sophisticated machine-assisted coding frameworks in the background—but they are not released in near-real-time, instead having lags of a number of months, and typically a year or even a decade. That’s a different animal.
But wait, didn’t you say you’ve been coding near-real-time data?? Yes, but with ICEWS and now POLECAT as pre-filters, so I’m dependent on the automated systems.
14. While my own experience is largely in the context of event data, I think there are four clear general categories of use cases for automated coding of political data:
- Clustering and filtering: huge productivity enhancers
- Sentiment: there’s a huge amount of research on this due to its relevance in commercial applications, and it goes back to the beginning of automated content analysis, with the Ur-program General Inquirer.
- Features, e.g. does a human rights report mention state-sanctioned sexual violence? Again, this is a general problem
- Events, which are the most complicated and fairly specific to political event data, though event extraction has been a long-standing interest of DARPA, leading to a number of specialized developments in the field of computational linguistics.
15. A different question than crossing the accuracy bars set, often as not, by people who have never used data in the context of political analysis. As for those who do use it, repeat after me: “First they say it is impossible, then they say it is crap, then they ask where the data is when you don’t post it on time.” [16]
16. I never claimed to have originated this, but I think I may have now located the source (which, of course, may well also have an earlier source, or be apocryphal):
“All truth passes through three stages: First, it is ridiculed; second, it is violently opposed; and third, it is accepted as self-evident.” Arthur Schopenhauer
17. An interesting, and very real, edge case: ICEWS would quite frequently incorrectly identify “police” as one of the initiators of protest demonstrations, and I used a post-filter to identify and correct these cases. However, I had to manually determine whether to remove them, since every so often the police actually do engage in anti-government demonstrations, typically over wages and benefits, but occasionally because they believe the government is being too restrictive in the police response to demonstrations. It’s complicated…
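A minimal sketch of such a post-filter, with hypothetical field names; note that flagged cases go to manual review rather than being deleted outright:

```python
# Flag events where "police" is listed as a protest initiator, but queue
# them for human review, since police do sometimes genuinely protest.
def flag_police_initiated(events):
    auto_keep, needs_review = [], []
    for ev in events:   # ev is a dict; field names are hypothetical
        if ev["event"] == "PROTEST" and "police" in ev["source_actor"].lower():
            needs_review.append(ev)   # human decides: miscoding or real police protest?
        else:
            auto_keep.append(ev)
    return auto_keep, needs_review

events = [
    {"event": "PROTEST", "source_actor": "Police (India)", "text": "..."},
    {"event": "PROTEST", "source_actor": "Farmers", "text": "..."},
]
keep, review = flag_police_initiated(events)
print(len(keep), "kept;", len(review), "sent to manual review")
```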
18. A random note that, in fact, has next to nothing to do with the topic but I found most curious: at this and another largely independent workshop I attended in the past month, I noted that post-COVID, slide presentations have become vastly simpler—generally black-on-white, only necessary graphics, no cringe-worthy animated subtitles—than in the pre-COVID era. My hypothesis: Zoom bandwidth. You don’t look (or feel) good when “next slide” invokes a ten-second delay.
19. My Google Colaboratory models seem to maintain some sort of state between runs that results in their converging on the same model after a while, despite my efforts to randomize. So what other mistakes am I making in Colaboratory?
20. WTF? This.
21. Doctorow is not for the faint of heart, but he is right a lot more often than he is wrong. Your reaction to his work will doubtless be governed in part by whether you consider “enshittification” to be a word, though it is difficult to dispute the legitimacy of the general concept.
22. So, I’m thinking, I spend a lot on subscription news, but do I spend as much as I spend on oat-milk chai lattes? Maybe I should use that as a benchmark? Mind you, most of the chai latte expenditure goes to local labor. And real estate developers.
23. Yes, I got these—pretty sure it was advertising this, and maybe I’m wrong about the price—as my predominant advertisement across [of course…] multiple web pages on Google Chrome for a couple of weeks, then a pause, then for a couple more weeks. I’m also apparently in the market for machines that can make aluminum gutters on-site. And you thought event data coding was bad?
24. I presume I am not alone in the experience of looking up some product, purchasing it, then receiving ads for that product for at least a week or more. Though I did not purchase the Italian potato harvesting machines.
25. Is this phrase proof I’m not writing this using ChatGPT? Or the opposite?
26. There is a long-standing real-time event data set colloquially known as “The Data Set That Shall Not Be Named” that across at least two independent tests was shown to contain only about 5% of cases that were neither redundant nor miscoded. Can you do meaningful conflict analysis with a 1:20 signal-to-noise ratio? Well, apparently you can, as I’ve heard from multiple projects, and realistically, statistical analysts in all sorts of fields have for decades worked with data as bad or worse. Though I am not suggesting this as a deliberate practice when alternatives are available, and they are.
27. Dow Jones (market cap: $40-billion), which owns Factiva, has a quite modest profit rate of 3.5%, right around the average for companies listed in its eponymous average, and of course Dow Jones, unlike Elsevier, actually produces original research. As to Factiva’s notorious “Nice research project you got here; pity if something happened to it…” approach, they appear to have become more accommodating lately: the knowledge that the LLMs have almost certainly hoovered their entire content probably contributes to this.