Case Law & Interactive Visualizations
I created this blog to provide creative ways to visualize and interact with American caselaw. Although the information provided is meant to be engaging and thought-provoking, it is not meant as a research tool. If you would like to use the data provided for research, contact me to discuss the analytical methods and heuristics used to generate the content seen here.
Background
The Harvard Law School’s Caselaw Access Project
In 2018, the Harvard Law School Library’s Innovation Lab launched its Caselaw Access Project, providing public access to the full corpus of published U.S. case law.
“Between 2013 and 2018, the Library digitized over 40 million pages of U.S. court decisions, transforming them into a dataset covering almost 6.5 million individual cases. The CAP API and bulk data service puts this important dataset within easy reach of researchers, members of the legal community and the general public.”
Harvard Law Today (here)
This dataset has not only served the research community; it has also been used to create fun applications, including a caselaw limerick generator and a caselaw color visualizer.
“Decedent was delicate, just sick.
In some spots the undergrowth was thick.
The defendant, Chas.
The defendant, Charles.
The roadway was not oily or slick.”
A Caselaw Limerick (here)
The public availability of these 6.5 million cases opened up the possibility of unprecedented applications of corpus linguistics and natural language processing methods to U.S. caselaw.
I created this Caselaw Visualizer Project as a way to demonstrate, through short examples, what analyses are possible and how this vast amount of data can be visualized.
Methods
Case.Law’s bulk data service was used to download the full dataset of state and federal caselaw. The data was processed in Python 3 with the Natural Language Toolkit (NLTK) and visualized through interactive JavaScript-based charts created using Toast UI Charts.
Data Processing & Normalization
Distribution of Cases
The Caselaw Access Project “includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States.”
Any corpus-based analysis must therefore use relative frequencies rather than raw counts, because the jurisdictions within the U.S. differ vastly in the number of cases and the number of words they have published. The same variation appears within jurisdictions, since courts and judges are not equally prolific.
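To make the point concrete, here is a minimal sketch of one common way to turn a raw count into a relative frequency: occurrences per million words. The function name and the two jurisdictions with their figures are hypothetical, chosen only for illustration; they are not taken from the dataset, and this is not necessarily the exact normalization used for the charts on this site.

```python
def per_million_words(raw_count, total_words):
    """Return occurrences per one million published words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return raw_count * 1_000_000 / total_words


# Hypothetical figures for illustration only (not real dataset values).
corpora = {
    "Jurisdiction A": {"hits": 1_200, "words": 400_000_000},
    "Jurisdiction B": {"hits": 300, "words": 50_000_000},
}

rates = {
    name: per_million_words(c["hits"], c["words"])
    for name, c in corpora.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per million words")
```

In this made-up example, Jurisdiction A has four times as many raw hits, yet Jurisdiction B uses the term twice as often relative to how much it has published.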
<tip>
Instructions: Explore the interactive data visualizations on this site by clicking on the data points you wish to learn more about. On wider screens, depending on the type of visualization, you may see the option to display only the states you wish to learn more about on certain types of charts.
Troubleshooting: If you experience issues viewing or interacting with the visualizations, please use a non-Firefox desktop browser.
</tip>
The following chart demonstrates this point, showing only state court cases:
Because the dataset does not include the following categories of cases, any generalizations drawn from the corpus must be read in context:
- New cases as they are published.
- Cases not designated as officially published, such as most lower court decisions.
- Cases officially published in digital form, such as recent cases from Illinois, Arkansas, New Mexico, and North Carolina.
The following chart contains federal cases divided by circuit and court.
Substantive v. Procedural Cases
As mentioned, the Caselaw Access Project “includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States.” This means that the dataset contains not only substantive caselaw but also procedural cases, such as simple denials of certiorari. See the following examples from the Supreme Court of the United States and the Supreme Court of Alabama, each denying certiorari.
Complete Opinion from the Supreme Court of the United States:
C. A. 6th Cir. Certiorari denied.
515 U.S. 1145 (1995) (SCOTUS)
Complete Opinion from the Supreme Court of Alabama:
HARWOOD, Justice.
Petition of Early Lee Gaskin for Certiorari to the Court of Criminal Appeals to review and revise the judgment and decision of that Court in Gaskin v. State, 53 Ala.App. 64, 297 So.2d 388.
Writ denied.
HEFLIN, C. J., and MERRILL, MADDOX and FAULKNER, JJ., concur.
297 So. 2d 391 (1974) (Alabama Supreme Court)
Case length was used as a proxy for substantive legal analysis. Opinions of 50 or more words were included in the analyses and visualizations; opinions of fewer than 50 words were excluded from both. Words were counted using an NLTK tokenizer in Python, ignoring punctuation.
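The length filter described above can be sketched roughly as follows. This is an approximation under stated assumptions: it uses a plain regular expression from the standard library as a stand-in for the NLTK tokenizer (NLTK's `RegexpTokenizer` with the same pattern should behave similarly), and the names `count_words`, `is_substantive`, and `MIN_WORDS` are hypothetical.

```python
import re

MIN_WORDS = 50  # threshold stated in the post


def count_words(text):
    """Count word tokens, ignoring punctuation.

    The pattern \\w+ matches runs of letters, digits, and underscores,
    so periods, commas, and other punctuation are never counted.
    This is a stand-in for the NLTK tokenizer, not the original code.
    """
    return len(re.findall(r"\w+", text))


def is_substantive(text, min_words=MIN_WORDS):
    """Return True if the opinion meets the minimum word count."""
    return count_words(text) >= min_words


# The complete SCOTUS opinion quoted earlier in this post.
scotus = "C. A. 6th Cir. Certiorari denied."
print(count_words(scotus), is_substantive(scotus))
```

Under this tokenization, the one-line SCOTUS denial falls well below the threshold and would be treated as “procedural.”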
For state cases, the proportion of “procedural” cases (i.e., cases shorter than 50 words) varies significantly across jurisdictions.
A similar phenomenon can be seen in federal cases. The Supreme Court of the United States has the highest number of “procedural” cases, most of which are likely denials of certiorari.
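For readers curious how such a per-court proportion might be computed, here is a small sketch. It assumes the opinions are available as (court, text) pairs and again uses a regular expression as a stand-in for the NLTK tokenizer; the function name `procedural_share` and the sample records are hypothetical and are not drawn from the dataset.

```python
import re
from collections import defaultdict


def procedural_share(cases, min_words=50):
    """Return {court: share of opinions shorter than min_words}.

    `cases` is an iterable of (court, opinion_text) pairs.
    """
    totals = defaultdict(int)
    short = defaultdict(int)
    for court, text in cases:
        totals[court] += 1
        # Count word tokens, ignoring punctuation.
        if len(re.findall(r"\w+", text)) < min_words:
            short[court] += 1
    return {court: short[court] / totals[court] for court in totals}


# Made-up sample records for illustration only.
sample = [
    ("Court A", "Certiorari denied."),
    ("Court A", "word " * 60),
    ("Court B", "Writ denied."),
    ("Court B", "Petition denied."),
]

shares = procedural_share(sample)
for court, share in shares.items():
    print(f"{court}: {share:.0%} procedural")
```

Reporting a share rather than a raw count keeps a prolific court from looking more “procedural” simply because it published more opinions overall.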
Normalization by Word Count and by Case Count
As stated above, the various courts and jurisdictions within the dataset vary drastically both in number of cases published and in number of words published. Here are geographical visualizations of the cases and words published.
The same map measuring the number of cases is discernibly different.
If you would like to learn more about this project or would like to contribute to its growth and maintenance, contact me on Twitter or on LinkedIn.