Categories
Background

Caselaw Visualizer

Case Law & Interactive Visualizations

I created this blog to provide creative ways to visualize and interact with American caselaw. Although the information provided is meant to be engaging and thought-provoking, it is not meant as a research tool. If you would like to use the data provided for research, contact me to discuss the analytical methods and heuristics used to generate the content seen here.

Background

The Harvard Law School’s Caselaw Access Project

In 2018, the Harvard Law School Library’s Innovation Lab launched its Caselaw Access Project, providing public access to the full corpus of published U.S. case law.

“Between 2013 and 2018, the Library digitized over 40 million pages of U.S. court decisions, transforming them into a dataset covering almost 6.5 million individual cases. The CAP API and bulk data service puts this important dataset within easy reach of researchers, members of the legal community and the general public.”

Harvard Law Today (here)

This dataset has not only served the research community; it has also been used to create fun applications, including a caselaw limerick generator and a caselaw color visualizer.

“Decedent was delicate, just sick.
In some spots the undergrowth was thick.
The defendant, Chas.
The defendant, Charles.
The roadway was not oily or slick.”

A Caselaw Limerick (here)

The public availability of these 6.5 million cases opened up the possibility of applying corpus linguistics and natural language processing methods to U.S. caselaw at an unprecedented scale.

I created this Caselaw Visualizer Project as a way to demonstrate, through short examples, what analyses are possible and how this vast amount of data can be visualized.

Methods

Case.Law’s bulk data service was used to download the full dataset of state and federal caselaw. The data was processed in Python 3 with the Natural Language Toolkit (NLTK) and visualized through interactive JavaScript-based charts created using TOAST UI Chart.
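As a rough illustration of the first step, the sketch below reads opinion text out of one downloaded bulk file. It assumes an xz-compressed JSON Lines file in which each line is one case, with the jurisdiction under `jurisdiction.name` and the opinion text under `casebody.data.opinions[].text`. That layout, the field names, and the function name `iter_opinion_texts` are assumptions made for this example, not a description of the code behind this blog.

```python
import json
import lzma
from typing import Iterator, Tuple


def iter_opinion_texts(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (jurisdiction, opinion_text) pairs from one bulk export file.

    Assumed layout: an xz-compressed JSON Lines file, one case per line,
    with opinion text under casebody -> data -> opinions -> text.
    """
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            case = json.loads(line)
            jurisdiction = (case.get("jurisdiction") or {}).get("name", "unknown")
            # Tolerate cases whose body is missing or null.
            data = (case.get("casebody") or {}).get("data") or {}
            for opinion in data.get("opinions", []):
                yield jurisdiction, opinion.get("text", "")
```

Streaming the file line by line keeps memory use flat, which matters when a single jurisdiction can hold hundreds of thousands of cases.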

Data Processing & Normalization

Distribution of Cases

The Caselaw Access Project “includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States.”

Any corpus-based analysis must therefore use relative frequencies rather than raw counts, because the jurisdictions within the U.S. differ vastly in the number of cases and words they have published. The same variation appears within jurisdictions, since courts and judges are not equally prolific.
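To make the point about frequency concrete, here is a minimal sketch that converts a raw count into a rate per million words. The function name `per_million_words`, the jurisdiction labels, and all numbers are hypothetical and chosen only for illustration.

```python
def per_million_words(raw_count: int, total_words: int) -> float:
    """Convert a raw count into occurrences per million words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return raw_count * 1_000_000 / total_words


# Hypothetical numbers: how often some term appears in two corpora.
corpora = {
    "Jurisdiction A": {"hits": 1_200, "total_words": 400_000_000},
    "Jurisdiction B": {"hits": 300, "total_words": 50_000_000},
}

for name, stats in corpora.items():
    rate = per_million_words(stats["hits"], stats["total_words"])
    print(f"{name}: {stats['hits']} raw hits, {rate:.1f} per million words")
```

In this made-up example, Jurisdiction A has four times as many raw hits, yet the term is twice as frequent in Jurisdiction B once corpus size is taken into account.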


<tip>

Instructions: Explore the interactive data visualizations on this site by clicking on the data points you wish to learn more about. On wider screens, some chart types also let you display only the states you are interested in.

Troubleshooting: If you experience issues viewing or interacting with the visualizations, please use a non-Firefox desktop browser.

</tip>


The following chart demonstrates this point, showing only state court cases:


Because the dataset does not include the following categories of cases, any generalizations drawn from the corpus must be read in context.

The following chart contains federal cases divided by circuit and court.


Substantive v. Procedural Cases

As mentioned above, the Caselaw Access Project “includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States.” This means that the dataset contains not only substantive caselaw but also procedural cases, such as simple denials of certiorari. See the following examples, in which the Supreme Court of the United States and the Supreme Court of Alabama each deny certiorari.

Complete Opinion from the Supreme Court of the United States:

C. A. 6th Cir. Certiorari denied.

515 U.S. 1145 (1995) (SCOTUS)

Complete Opinion from the Supreme Court of Alabama:

HARWOOD, Justice.

Petition of Early Lee Gaskin for Certiorari to the Court of Criminal Appeals to review and revise the judgment and decision of that Court in Gaskin v. State, 53 Ala.App. 64, 297 So.2d 388.

Writ denied.

HEFLIN, C. J., and MERRILL, MADDOX and FAULKNER, JJ., concur.

297 So. 2d 391 (1974) (Alabama Supreme Court)

Case length was used as a proxy for substantive legal analysis. Opinions containing 50 or more words were included in the analyses and visualizations; opinions containing fewer than 50 words were excluded from both. Words were counted using Python’s NLTK tokenizer, ignoring punctuation.
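The 50-word rule can be sketched as follows. This example assumes that "ignoring punctuation" means counting runs of word characters, and it uses NLTK's `RegexpTokenizer` with the pattern `\w+` for that purpose, falling back to an equivalent standard-library regex if NLTK is not installed. The tokenizer actually used for this blog may differ, and the names `count_words`, `is_substantive`, and `MIN_WORDS` are hypothetical.

```python
import re

try:
    # NLTK tokenizer that keeps runs of word characters and drops punctuation.
    from nltk.tokenize import RegexpTokenizer
    _tokenize = RegexpTokenizer(r"\w+").tokenize
except ImportError:
    # Equivalent standard-library fallback when NLTK is unavailable.
    _tokenize = re.compile(r"\w+").findall

MIN_WORDS = 50  # threshold described in the text


def count_words(text: str) -> int:
    """Count word tokens, ignoring punctuation."""
    return len(_tokenize(text))


def is_substantive(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True for opinions with at least `min_words` words."""
    return count_words(text) >= min_words


denial = "C. A. 6th Cir. Certiorari denied."
print(count_words(denial), is_substantive(denial))
```

Under this tokenization, the Supreme Court denial quoted above counts as six words and falls well below the threshold. Note that a pattern like `\w+` splits reporter abbreviations such as "So.2d" into two tokens, so counts near the cutoff depend on the tokenizer chosen.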

For state cases, the proportion of “procedural” cases (i.e., cases shorter than 50 words) varies significantly from state to state.
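One way to compute that proportion per jurisdiction is sketched below. It assumes each opinion has already been reduced to a (jurisdiction, word count) pair. The function name `procedural_share` and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

MIN_WORDS = 50  # opinions below this length are treated as "procedural"


def procedural_share(
    records: Iterable[Tuple[str, int]], min_words: int = MIN_WORDS
) -> Dict[str, float]:
    """Return the fraction of short opinions for each jurisdiction."""
    totals: Dict[str, int] = defaultdict(int)
    short: Dict[str, int] = defaultdict(int)
    for jurisdiction, word_count in records:
        totals[jurisdiction] += 1
        if word_count < min_words:
            short[jurisdiction] += 1
    return {name: short[name] / totals[name] for name in totals}


# Hypothetical (jurisdiction, word count) records.
records = [
    ("State A", 12), ("State A", 840), ("State A", 49), ("State A", 50),
    ("State B", 2300), ("State B", 7), ("State B", 150), ("State B", 900),
]

for name, share in procedural_share(records).items():
    print(f"{name}: {share:.0%} of opinions under {MIN_WORDS} words")
```

Reporting a share rather than a count keeps the comparison fair between jurisdictions that publish very different numbers of opinions.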



A similar phenomenon can be seen in federal cases. The Supreme Court of the United States has the highest number of “procedural” cases, most of which are likely denials of certiorari.

Normalization by Word Count and by Case Count

As stated above, the courts and jurisdictions within the dataset vary drastically in both the number of cases and the number of words published. Here are geographical visualizations of the cases and words published.

The same map, measured by number of cases rather than number of words, looks discernibly different.
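A small sketch shows why the two measures can produce different maps. It computes each jurisdiction's share of all cases and of all words from (jurisdiction, word count) pairs. The function name `corpus_shares` and the sample records are hypothetical and do not come from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def corpus_shares(records: Iterable[Tuple[str, int]]) -> Dict[str, Dict[str, float]]:
    """Return each jurisdiction's share of total cases and of total words."""
    cases: Counter = Counter()
    words: Counter = Counter()
    for jurisdiction, word_count in records:
        cases[jurisdiction] += 1
        words[jurisdiction] += word_count
    total_cases = sum(cases.values())
    total_words = sum(words.values())
    return {
        name: {
            "case_share": cases[name] / total_cases,
            # Guard against a corpus with no words at all.
            "word_share": words[name] / total_words if total_words else 0.0,
            "mean_words_per_case": words[name] / cases[name],
        }
        for name in cases
    }


# Hypothetical records: State A publishes many short opinions,
# State B publishes one long opinion.
records = [("State A", 100), ("State A", 200), ("State A", 300), ("State B", 1400)]

for name, stats in corpus_shares(records).items():
    print(
        f"{name}: {stats['case_share']:.0%} of cases, "
        f"{stats['word_share']:.0%} of words, "
        f"{stats['mean_words_per_case']:.0f} words per case"
    )
```

In this made-up example, State A accounts for most of the cases but a minority of the words, so it would look dominant on a case-count map and modest on a word-count map.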


If you would like to learn more about this project or would like to contribute to its growth and maintenance, contact me on Twitter or on LinkedIn.


About Me

My name is João Marinotti and I am a Visiting Fellow at the Yale Law School Information Society Project and a Postdoctoral Fellow at Indiana University Maurer School of Law’s Center for Law, Society and Culture. I started this project as a student at Harvard Law School (J.D. ’20) and had a blast using my educational background in linguistics, informatics, and law to create this blog. Follow me on Twitter and LinkedIn for updates on this and other projects. If you have suggestions or want to contribute to the blog, don’t hesitate to DM me on Twitter.