Big Data: Legal+Scientific Anomaly Detection
Revealing Non-Public Information via Public Data
About ten years ago, “Big Data” was a big deal. Then A.I. came along and the big-data hype fizzled. Aside from the fact that most “A.I.” is really just big data, we’re finally reaching the point where we can actually extract signals from large datasets. As it turns out, many (probably most) large legacy datasets were never constructed to be readily ported, manipulated, and tested to reveal their subtextual truths. Well, here’s one such project (and, for those who occasionally ask, a project I dev’d).
Company: Qualis (prototype: Atticus)
Project: Detect the earliest signal of liability (broadly defined) from public data across individuals, corporations, products, and product components.
Potential Users/Areas: Regulators, attorneys (class action), finance, scientific research/research funding, insurance, journalists.
Datasets: all public/global.
The first step was to map out the general pathways by which liability develops by reviewing known cases; that case review usually involved extracting from legal documents the precise catalyst of liability. We then traced that trigger of liability back to confirmable market (i.e., public data) signals. In general, the pathways can be divided into segments: antitrust, banking, data breach, employment, marketing, patents, personal injury, and product liability. Each of those broad categories generally produces non-overlapping liability pathways and often breaks down into sub-categories with some overlap with the parent category. Each broad category also requires generally different and non-overlapping datasets, which in turn require different machines and different algorithms (often different in type, always different in training). Thus, the categories present different problems.
In order to start building, we began with a single category and a single known example of liability; from there, we could identify all possible datasets that might be relevant, then build the machines and test them to see whether they could autonomously discover an otherwise known fact pattern.
We decided to start with RoundUp, Monsanto’s (now Bayer’s) herbicide, over which the company had recently lost a string of court cases (including one for $2 billion) for causing (or possibly causing) cancer. Technically, Monsanto was typically found liable for failing to warn users of RoundUp’s potential dangers. International agencies began to question the carcinogenic properties of RoundUp’s active ingredient, glyphosate, in 2015, and the legal issues began around 2018. The question for us was: what is the earliest point at which a high-confidence public signal correlating RoundUp and/or glyphosate with carcinogenicity was produced?
In breaking down the “product liability” category, we assumed there were at least two broad sub-categories. (Worth noting that we’re not product or liability experts, so we simply worked backward from actual occurrences, as defined by verdicts/judgments, to attempt to discover patterns.) The two broad sub-categories of product liability are (1) products that are functionally and imminently defective, such as exploding crockpots, and (2) products that are defective over time, usually due to dangerous ingredients. From a pattern-recognition perspective, the primary initial difference is temporal. If you buy a crockpot from Walmart on Monday and it blows up on Wednesday (causing severe injury), what public data would we find? We’d probably first look at social media (Twitter, some message boards, possibly Reddit) and local news (broadcast and print); it may show up in reviews or national news later, but we’re interested in the earliest public-data signal.
Another early signal for exploding crockpots we discovered is the company posting (e.g., on social media) instructions or safety reminders for its products; under normal circumstances, companies don’t post product hazard warnings on Twitter. (We back-tested this against a known hazard with an infant device and found such social media posts to be remarkably early signals of liability with infant products.)
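To make that concrete, here’s a rough sketch (with made-up post fields and a toy keyword list, not the actual classifiers) of flagging company posts that read like safety reminders and pulling out the earliest one:

```python
# Minimal sketch: flag company social-media posts that read like safety
# reminders or usage instructions, and report the earliest such post.
# The Post structure and keyword list are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime

SAFETY_TERMS = {"warning", "caution", "do not", "never leave", "unplug",
                "recall", "instructions", "supervise"}

@dataclass
class Post:
    company: str
    text: str
    posted_at: datetime

def is_safety_reminder(post: Post) -> bool:
    """Crude keyword heuristic; the real system would use trained classifiers."""
    text = post.text.lower()
    return sum(term in text for term in SAFETY_TERMS) >= 2

def earliest_safety_signal(posts: list[Post]) -> datetime | None:
    """Timestamp of the earliest safety-reminder-like post, if any."""
    hits = sorted(p.posted_at for p in posts if is_safety_reminder(p))
    return hits[0] if hits else None
```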
(Data note: it’s clearly not desirable to publicize discovered signals; publicizing signals may cause information pollution in the signal chain, and data-production may adapt once the signal becomes known, thus corrupting the mechanisms of signal detection. In other words, infant-product producers may stop posting random owner instructions on social media if they become aware that we’ve identified such postings as strong signals of potential liability. Thus, I’ve provided some examples here of the signals and patterns that the machines can/have detected, but these examples are far from exhaustive or even adequately descriptive of all discovered signals.)
In cases in which the time between purchase/use and the occurrence of liability is short and evident, the ingredients/components of such products aren’t particularly relevant. But in cases where the timeframe is longer, there’s often a component or chemical catalyst. The question then becomes: how can we detect such liabilities? Keep in mind that the legal cases against Monsanto demonstrated that Monsanto failed to provide adequate information/warning about one ingredient of its products. We’d need to know the ingredients of products, the labels/instructions, and the scientific research for the ingredients. Then we’d need to identify points of divergence between a company’s communications regarding an ingredient/product and the scientific research on those ingredients. We’d be particularly interested in ingredients that scientists believe to be possibly dangerous and for which companies communicate legally inadequate information about that possible danger to consumers. That divergence between the public science and the public corporate communications is key to establishing liability in such consumer products. While we usually did not have the data to conclusively demonstrate that a company knew an ingredient was or might be dangerous (because such data is private), we could demonstrate that the company should have known, because the studies were public.
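To make that divergence idea concrete, here’s a deliberately simplified sketch; the keyword heuristic below is a toy stand-in for the trained readers described later, and the term list is invented:

```python
# Toy illustration of the science-vs-communications comparison for one
# ingredient. The real system uses trained models; this keyword heuristic
# only demonstrates the shape of the calculation.

HAZARD_TERMS = {"carcinogen", "carcinogenic", "tumor", "genotoxic",
                "mutagenic", "lymphoma", "toxicity"}

def hazard_fraction(texts: list[str]) -> float:
    """Fraction of texts that mention at least one hazard term (toy stand-in
    for a model that reads hedged scientific conclusions)."""
    if not texts:
        return 0.0
    hits = sum(any(term in text.lower() for term in HAZARD_TERMS) for text in texts)
    return hits / len(texts)

def divergence(science_conclusions: list[str], company_comms: list[str]) -> float:
    """Positive values mean the public science signals more danger than the
    company's labels/press releases acknowledge; a large, sustained positive
    divergence is the liability-relevant anomaly."""
    return hazard_fraction(science_conclusions) - hazard_fraction(company_comms)
```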
In reviewing the public studies, we first ran across a curious signal: studies dated significantly before they were published. Upon researching this phenomenon, we discovered that it is often the result of a study undergoing peer review while the company (particularly the company’s scientists) works to forestall the study’s publication. The result is that some studies are dated years before publication, so the machine would need to notice and weight such dating discrepancies the same way a human’s interest would be piqued by such an anomaly.
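A rough sketch of that dating check (the two-year threshold and the record fields are illustrative assumptions, not the system’s actual parameters):

```python
# Flag studies whose stated study date precedes their publication date by an
# unusually long interval; these get extra weight as potential signals of
# forestalled publication.
from datetime import date

def publication_lag_years(study_date: date, publication_date: date) -> float:
    """Years between when a study is dated and when it actually appeared."""
    return (publication_date - study_date).days / 365.25

def flag_delayed_studies(records: list[dict], threshold_years: float = 2.0) -> list[dict]:
    """Return records whose publication lag exceeds the threshold."""
    return [r for r in records
            if publication_lag_years(r["study_date"], r["publication_date"]) >= threshold_years]
```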
Other curious phenomena were studies conducted by supposedly independent scientists who, upon further review, had a financial relationship with a company that profits from whatever is being studied. First, we had to build a machine that could assess such relationships; then we’d need to weight such findings accordingly. We decided to start by establishing the network of relationships using patents; the machine we developed looks for relationships between inventors (the scientists), assignees (the companies), and the chemicals (or other components). An inventor who assigns an invention to a company likely has or had a relationship with that company (we can also observe whether the inventor’s address is the same as the company’s address, which indicates that the inventor was employed by the company). So this is essentially a LinkedIn for scientist-company-chemical relationships. Once these relationships are established, we can under-weight scientific articles that are positive about the chemical/component if they were authored by scientists with probable financial relationships with the company (again, the same response a typical informed human reader would have).
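A simplified sketch of that patent-derived graph (the patent record shape below is an assumption for illustration, not the actual USPTO schema):

```python
# Inventors, assignee companies, and chemicals become nodes; patents supply
# the edges between them.
import networkx as nx

def build_relationship_graph(patents: list[dict]) -> nx.Graph:
    g = nx.Graph()
    for p in patents:
        for inventor in p["inventors"]:
            # Assigning a patent implies a relationship; a shared address
            # suggests employment rather than, say, consulting.
            employed = inventor.get("address") == p.get("assignee_address")
            g.add_edge(("inventor", inventor["name"]),
                       ("company", p["assignee"]),
                       employed=employed)
            for chem in p.get("chemicals", []):
                g.add_edge(("inventor", inventor["name"]), ("chemical", chem))
        for chem in p.get("chemicals", []):
            g.add_edge(("company", p["assignee"]), ("chemical", chem))
    return g

def has_probable_financial_tie(g: nx.Graph, scientist: str, company: str) -> bool:
    """True if the scientist is directly connected to the company in the graph."""
    return g.has_edge(("inventor", scientist), ("company", company))
```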
None of these weightings is probabilistically determinative, but they enabled the machines to account for the phenomenon of large companies flooding a professional venue (such as science journals) with narrative-framing content. Really, this is a classic example of attempting to minimize a signal by flooding the pipe with noise.
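As a toy illustration of how such a weight might be applied when aggregating the literature on a chemical (the discount factor and field names are invented):

```python
# Down-weight, rather than discard, articles by authors with probable
# financial ties to the producing company, so a flood of company-friendly
# papers cannot drown the underlying signal.

def article_weight(has_financial_tie: bool, base_weight: float = 1.0,
                   conflict_discount: float = 0.25) -> float:
    return base_weight * conflict_discount if has_financial_tie else base_weight

def weighted_hazard_consensus(articles: list[dict]) -> float:
    """Weighted average of per-article hazard scores (0..1)."""
    weights = [article_weight(a["financial_tie"]) for a in articles]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * a["hazard_score"] for w, a in zip(weights, articles)) / total
```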
The Data
Ultimately, we were testing whether the machines could find a high-confidence signal for glyphosate substantially before glyphosate became publicly linked to cancer (around 2015). So while we aimed to gather the universe of all potential data, we limited it in some areas for the sake of sanity and budget. For scientific articles, we had a little over 32 million (peer-reviewed, clustered from the 1970s to the present); for products, we had data for about 700,000 (with label and company information). We included EPA and European regulatory data (chemical data, manufacturing data, etc.). To investigate glyphosate, we included consumer labels and data sheets for six countries (the U.S., Canada, and four in Europe) and press releases from major manufacturers (over 1,000, going back 20 years). For patent information, we API’d into the USPTO patent repo. We also included advertisements (e.g., television ads) for major manufacturers going back to the early 1990s.
Of course, we had to select a starting point, and we selected a European regulatory database of product chemicals, which includes a little over 65,000 chemicals and various known or probable attributes. From there, the machines could link those chemicals to other known attributes, products and companies, scientific articles, patents, people, etc.
(Data note: the European regulatory framework tends to be proactive, in contrast to the U.S. framework, which tends to be reactive. Due to this difference, European datasets tend to be optimal starting points but terrible ending points; in other words, the challenge for the machines is to discover patterns between current European datasets and future U.S. datasets. For this reason, European datasets were used as starting points but substantially under-weighted beyond that, so that they wouldn’t be determinative of results.)
We focused the machine reading the scientific articles on the conclusions but had to train it to understand the circumspect way that scientists write. (A scientist’s definitive conclusion will sound rather benign to most readers.) The machine also required fairly detailed medical ontologies so it could identify related discussions (for example, of different types of cancer); taxonomies of medical terms and legal terms (the latter indicating potential liability) were also required. Ontologies and keyword groupings were also assigned relative weightings, so that medical terms could be understood as more or less “serious” (similar to how a doctor might read a medical journal). In the end, the machine is attempting to discover liability (a legal principle) indicated by an anomaly between the science and the corporate communications. In layman’s terms, we built a lawyer who went to med school and spends his time reading 60 years’ worth of science journals and reviewing corporate communications. I’m fairly certain no one is going to complain that we’re putting people out of work.
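A toy illustration of the weighted-ontology idea (the terms and weights below are invented, not the project’s actual ontology):

```python
# Medical terms carry different "seriousness" weights, so a conclusion
# mentioning "non-hodgkin lymphoma" scores higher than one mentioning
# "dermal irritation" -- a crude proxy for how a trained doctor-reader
# would rank the finding.

ONTOLOGY_WEIGHTS = {
    "non-hodgkin lymphoma": 1.0,
    "carcinogenic": 0.9,
    "genotoxic": 0.8,
    "tumor promotion": 0.7,
    "dermal irritation": 0.2,
}

def seriousness_score(conclusion: str) -> float:
    """Sum of weights for ontology terms present in a study's conclusion."""
    text = conclusion.lower()
    return sum(w for term, w in ONTOLOGY_WEIGHTS.items() if term in text)
```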
The patent API proved to be wonky, and we’ll probably create a static version. And the television advertisements proved useless; the machines extracted no signal value from them. (Useless for now; I suspect training a machine to read advertisements using “marketing” ontologies is possible and desirable, but it seemed out of scope at this time.) And the machines had, but didn’t use, a lot of data, from chemical formulae to manufacturing locations to aquatic toxicity and much else. (Such data is available in the search functions but wasn’t used for discovery … discussed below.)
As noted, we only used “public data,” which means data available without special access; it doesn’t mean the data is easily obtainable or necessarily machine-readable (or, more accurately, machine-usable). Given the state they’re in, it was clear that many of these datasets are not regularly (if at all) being used in AI applications.
Search & Discovery
This revealed network of data has two fundamentally different components: search and discovery. Search allows a user to enter a term (chemical, company, product, location, person, etc.) and view the network in which that term exists. Discovery deploys the machines to identify and rank components of that network. We started with identifying and ranking companies (the machines start with the 65,000+ list of chemicals, but the confidence levels are assigned to products and returned as the companies that produce those products).
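Roughly, and with assumed signatures rather than the platform’s actual API:

```python
# search() returns the network neighborhood of a term for human browsing;
# discover() rolls per-chemical confidence up to companies for ranking.
import networkx as nx

def search(graph: nx.Graph, term: str, depth: int = 2) -> nx.Graph:
    """Sub-network within `depth` hops of the queried entity
    (chemical, company, product, person, ...)."""
    node = next((n for n in graph if term.lower() in str(n).lower()), None)
    if node is None:
        return nx.Graph()  # nothing matched
    return nx.ego_graph(graph, node, radius=depth)

def discover(chemical_scores: dict[str, float],
             chemical_to_companies: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Companies producing affected products, ranked by the highest
    anomaly-confidence among their chemicals."""
    company_scores: dict[str, float] = {}
    for chem, score in chemical_scores.items():
        for company in chemical_to_companies.get(chem, []):
            company_scores[company] = max(company_scores.get(company, 0.0), score)
    return sorted(company_scores.items(), key=lambda kv: kv[1], reverse=True)
```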
So this is what was returned with the discovery function focused on glyphosate (with a few “known safe” chemicals included); we limited the return to just a few possible companies in order to focus on the relative confidence levels of known-dangerous versus known-safe chemicals and to enable us to confirm each result against known patterns.
Clicking on ‘information’ for any result produces the underlying data (essentially, the machines produce the argument). The data includes the date on which the machines have concluded the chemical/product turned legally anomalous. In the case of RoundUp/glyphosate, the anomalous date is 2002, or 13 years before the public became aware of glyphosate’s potential dangers. (Worth noting that we confirmed the date with documents produced during the 2017–2019 glyphosate trials, in which lawyers established that Monsanto was aware in 2002–04 that glyphosate was potentially carcinogenic. Importantly, we thereby demonstrated that the machines could reveal non-public information using only public data.)
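One plausible way to derive such an anomalous date, sketched here with an invented threshold and persistence window, is to find the first year in which the science-versus-communications divergence crosses a threshold and stays there:

```python
# Earliest year from which the yearly divergence (science hazard signal minus
# company-acknowledged hazard) meets the threshold for `persistence`
# consecutive observations; None if it never does.

def anomalous_year(divergence_by_year: dict[int, float],
                   threshold: float = 0.3, persistence: int = 3) -> int | None:
    years = sorted(divergence_by_year)
    for i, start in enumerate(years):
        window = years[i:i + persistence]
        if len(window) == persistence and all(
                divergence_by_year[y] >= threshold for y in window):
            return start
    return None
```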
The platform generally displays all the data sources the machines utilize to produce a ranking, so that human researchers can further confirm it. This also enables human feedback for continuous training.
Obviously, the goal is to run the machines across all 65,000+ chemicals and related data, but that needs to happen as a controlled expansion of the universe of potential discovery results, so that we can test and confirm the initial batches of returns (which is painful and time-consuming).
The Problem
This is a lot of work and consumes money and time … which requires patience and perseverance. Realistically, the training and confirmation process will probably take another two years. (Which, decrypted from coder-talk into real-world terms, probably means three years.)