Table of Contents
- About RIPA
- About the project
- About the data
- How to contribute
About RIPA

Assembly Bill 953 requires each state and local agency in California that employs peace officers to report annually to the Attorney General data on all stops, as defined in Government Code 12525.5(g)(2), conducted by the agency's peace officers. The bill requires the collected data to include specified information, including the time, date, and location of the stop, the identity characteristics of the person stopped, and the reason for the stop. The current dataset (RIPA Stop Data.csv) is composed of data collected by the eight largest agencies in the state between July and December of 2018.
The data is published on California's DOJ Open Justice website: https://openjustice.doj.ca.gov/data
About the project
The idea to explore this data came from listening to the March 3rd, 2020 episode of Pod Save the People (https://overcast.fm/+I9KRmEwJc/04:12), where they mention this article from The Appeal: https://theappeal.org/san-francisco-da-to-announce-sweeping-changes-on-sentencing-policy-and-police-stops/.
> Boudin will announce a second directive today, also reviewed by The Appeal, on what are known as pretextual stops, in which an officer stops someone for a minor offense or infraction, such as a traffic violation, in order to conduct an unrelated search for items like guns or drugs.
>
> According to the new policy, the DA's office will not prosecute possession of contraband cases when the contraband was collected as a result of an infraction-related stop, "where there is no other articulable suspicion of criminal activity." Any deviations from the policy should be made in writing and require approval from the DA or a chief of the criminal division.
>
> Additionally, the ban includes cases in which a person consented to a search "because of the long-standing and documented racial and ethnic disparities in law enforcement requests for consent to search," according to the directive.
In the episode, Sam mentions that Black and brown folks are being stopped and searched at higher rates, and that many of these searches are what they call 'consent searches', meaning the police report no justification at all for searching the person other than asking for, and the person allegedly giving, consent.
This gross racial disparity is heartbreaking but not surprising.
When I first set out to explore the dataset myself, I found it cumbersome to work with. The dataset is composed of 1.8 million rows and 143 columns. At around 650 MB, it is already hard to handle for those who are not analyzing data programmatically (with paid tools like Stata or SAS, or open-source ones like Python and R). This information is not designed to be explored easily, but there are tools that can help with that.
The goal of this project is to deploy a datasette instance serving this data so that anyone can explore it more easily.
About the data
AB 953 tasked the Board with eliminating racial and identity profiling and improving diversity and racial and identity sensitivity in law enforcement by investigating and analyzing law enforcement policy and data. Data statutorily required to be collected by peace officers include both person-level (e.g. race/ethnicity) and stop-level (e.g. time of stop) information. Reporting agencies were required to begin collecting stop data in waves, with each wave's first round of reports due as follows:
- Agencies employing 1,000 or more peace officers: on or before April 1, 2019
- Agencies employing 667 or more but less than 1,000 peace officers: on or before April 1, 2020
- Agencies employing 334 or more but less than 667 peace officers: on or before April 1, 2022
- Agencies employing one or more but less than 334 peace officers: on or before April 1, 2023
| Resource | Link |
| --- | --- |
| Data on California's DOJ Open Data website | https://openjustice.doj.ca.gov/data |
| Direct link to the data (641.4 MB) | https://data-openjustice.doj.ca.gov/sites/default/files/dataset/2020-01/RIPA%20Stop%20Data%202018.csv |
| 2020 RIPA Board Annual Report | https://oag.ca.gov/sites/all/files/agweb/pdfs/ripa/ripa-board-report-2020.pdf |
| RIPA stop data regulations | https://oag.ca.gov/sites/all/files/agweb/pdfs/ripa/stop-data-reg-final-text-110717.pdf |
The data being served on this project was downloaded from the Open Justice website on April 27th, 2020 at 7:30 AM.
Those interested in using the data should read the accompanying README file and the stop data regulations (both are included in the docs folder).
The dataset is 1.8 million rows and 143 columns (around 650 MB). Each stop instance has an attached DOJ_RECORD_ID and each person stopped is assigned a PERSON_NUMBER. It contains stops from the 8 largest law enforcement agencies (LEAs) in California (those employing 1,000 or more peace officers, also known as Wave 1). This is too large to serve as one table. It is also unnecessarily cumbersome.

In order to make it more accessible, the dataset was "broken down" into different tables. Columns related to one another share a prefix (gender variables start with G_, race/ethnicity variables start with RAE_), so we created a table for each group of prefixes. The code used to do this can be found in the scripts under src/data. The "broken down" dataset is served by a datasette instance at http://ripa-2018-db.herokuapp.com.
In order to make this data more usable "out of the box", we're adding supplemental tables to the datasette instance. These tables should be small-ish in comparison. Right now these include the codes and definitions for variables such as AGE_GROUPS. All these tables have the suffix _codes in their name. This information was extracted from the accompanying README file.
From datasette's own description:

> Datasette is a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.
>
> Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of tools and plugins dedicated to making working with structured data as productive as possible.
datasette is the engine powering this project. In short, it takes a SQLite database and creates an interactive, explorable website and accompanying API. To prepare the data we also used a tool from the datasette ecosystem that takes CSV files and creates SQLite databases from them.
To learn about the motivation for this project, you can read the About the project section.
This section gives a more technical description of the project. The data preparation and deployment of datasette are straightforward and can be divided into three phases:
- Acquiring the data
- Preparing the data
- Serving the data
Acquiring the data
The data was retrieved from the Open Justice website of the California Department of Justice: https://openjustice.doj.ca.gov/data. As of May 5th, 2020, the website provides a direct download link. To learn more about the data itself you can read the About the data section.
Preparing the data
The original CSV file is over 650 MB in size, so the very first step was to slice it into 15 CSV files so that each could be uploaded to GitHub with this repository. However, because the dataset is too large to serve as a single table (1.8 million rows by 143 columns), it was also broken down into smaller tables. This means we took related variables (based on their prefixes) and extracted them into their own tables. For example, gender-related variables like G_MULTIGENDER were extracted from the "main" table and added to a gender table in the database. These can be joined back to the main table using the UNIQUE_ID assigned to them.
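The prefix-based split can be sketched in a few lines of Python. This is a toy illustration rather than the project's actual script: the two sample rows and the STOP_TIME column are invented, while G_MULTIGENDER and RAE_FULL are real columns from the dataset.

```python
# Toy "main" table: each row already carries the numeric UNIQUE_ID.
main_rows = [
    {"UNIQUE_ID": 1000000, "G_MULTIGENDER": "0", "RAE_FULL": "4", "STOP_TIME": "08:15"},
    {"UNIQUE_ID": 1000001, "G_MULTIGENDER": "1", "RAE_FULL": "7", "STOP_TIME": "21:40"},
]

def extract_prefix_table(rows, prefix):
    """Pull UNIQUE_ID plus every column sharing `prefix` into its own table."""
    cols = ["UNIQUE_ID"] + [c for c in rows[0] if c.startswith(prefix)]
    return [{c: row[c] for c in cols} for row in rows]

gender = extract_prefix_table(main_rows, "G_")    # gender variables
race = extract_prefix_table(main_rows, "RAE_")    # race/ethnicity variables
```

Each extracted table keeps the UNIQUE_ID column, which is what lets it be joined back to the main table later.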
Each observation or row of this dataset is assigned a DOJ_RECORD_ID and a PERSON_NUMBER, unique to the stop and to the person(s) stopped respectively. This means we could combine them to create a UNIQUE_ID for each row which we could use to join tables together. However, that combination is a 22-character string, which is unnecessarily large. To facilitate things, each row is instead assigned a numeric id starting at 1,000,000. Starting at one million is completely arbitrary; we could have started at 0, but because there are 1.8 million rows we decided to make every numeric id seven digits. This numeric UNIQUE_ID lets us join tables together and is not a big addition to the database in terms of memory.
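A minimal sketch of that id scheme follows. The DOJ_RECORD_ID and PERSON_NUMBER values here are invented (only their lengths and the structure matter):

```python
# Each stop has a DOJ_RECORD_ID; each person within a stop has a PERSON_NUMBER.
records = [
    {"DOJ_RECORD_ID": "0123456789012345678", "PERSON_NUMBER": "001"},
    {"DOJ_RECORD_ID": "0123456789012345678", "PERSON_NUMBER": "002"},
    {"DOJ_RECORD_ID": "9876543210987654321", "PERSON_NUMBER": "001"},
]

for offset, record in enumerate(records):
    # Concatenating the two ids yields a 22-character join key; the numeric
    # UNIQUE_ID replaces it with a compact seven-digit integer.
    record["STRING_KEY"] = record["DOJ_RECORD_ID"] + record["PERSON_NUMBER"]
    record["UNIQUE_ID"] = 1_000_000 + offset
```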
Once the UNIQUE_ID is created, we can extract columns from the "main" table into their own tables and save those as individual CSV files. We then use a tool from the datasette ecosystem to create a SQLite database where each CSV is a table. In this step, we also include the Appendix B Table 3.csv file, obtained also from the DOJ's website, and any other supplemental tables we might have created to accompany the dataset.
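The project relies on a datasette-ecosystem tool for this step; the stdlib-only sketch below shows the equivalent idea of turning each CSV into a table. Table names and sample rows are illustrative:

```python
import csv
import io
import sqlite3

def csv_to_table(conn, table, csv_text):
    """Create `table` with one TEXT column per CSV header and load the rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" * len(header))
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)

conn = sqlite3.connect(":memory:")  # the real build writes a .db file to disk
csv_to_table(conn, "gender", "UNIQUE_ID,G_MULTIGENDER\n1000000,0\n1000001,1\n")
csv_to_table(conn, "age_codes", "code,definition\n1,example definition\n")
```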
Serving the data
After preparing the data and creating the SQLite database, we use datasette to serve it as an interactive website and API. This is as easy as running the datasette command-line tool against the database file.
However, we customize our datasette instance a bit.
We include a title, description, data source URL, and some extra CSS and JS assets. You can explore datasette/metadata.json to find the specifics.
We also include canned queries: queries bundled into our instance that are displayed on the main page and come with their own URLs to facilitate access. These queries are included because they compute useful or interesting facts found in the data. Some of them recreate specific facts published in the 2020 Annual Report.
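In datasette, a canned query is declared in metadata.json. The snippet below shows the shape datasette expects; the query name, SQL, and table/column names here are illustrative stand-ins, not the project's actual queries:

```json
{
  "databases": {
    "ripa-2018-db": {
      "queries": {
        "stops-by-agency": {
          "sql": "select AGENCY_NAME, count(*) as stops from stops group by AGENCY_NAME order by stops desc",
          "title": "Number of stops per agency"
        }
      }
    }
  }
}
```

datasette then serves this query at its own URL (here, /ripa-2018-db/stops-by-agency), which is what makes canned queries easy to share.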
If you run into an interesting fact using this data, please submit the query as a GitHub Issue and tag it as a suggestion.
We also modify some of datasette's templates, specifically the base template and query.html. The first was modified to include some metadata in the <head> (website preview description and such). The second was modified to include a button below the SQL code box that submits the query a user just ran as a suggestion on GitHub, to facilitate sharing.
We also change some of datasette's default options.
To facilitate the updating of this database we automate the whole process using GitHub Actions.
In the [src/data](src/data) folder we include the python scripts that slice the original dataset into 15 pieces, rebuild the dataset to create the UNIQUE_ID variable, extract related variables based on their prefixes, and save those as individual CSV files.
These scripts are orchestrated using shell scripts included in the datasette folder.
After running these scripts we have a database ready to serve and the configuration we want for our datasette instance. We then deploy to Heroku, which we can do directly from datasette.
You can follow each step our Action takes in the main.yml file.
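The workflow roughly follows the shape sketched below. This is not the actual main.yml: the script path and step names are assumptions, though datasette publish heroku is a real datasette command and ripa-2018-db matches the deployed app's name.

```yaml
# Illustrative sketch of the build-and-deploy workflow, not the actual main.yml.
name: Build and deploy
on:
  push:
    branches: [master]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install -r requirements.txt
      # Rebuild the dataset: slice, add UNIQUE_ID, split into prefix tables.
      - run: ./datasette/build.sh   # assumed name for the orchestration script
      # Deploy straight from datasette to Heroku.
      - run: datasette publish heroku data/processed/ripa.db -n ripa-2018-db
```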
```
.
├── AUTHORS.md
├── LICENSE
├── README.md
├── .binder
├── .github            <- All things GitHub
│   └── workflows      <- GitHub Actions
├── datasette          <- All scripts related to building and deploying
│   ├── static         <- Static assets (favicon, custom css, etc)
│   └── templates      <- Any templates to overwrite datasette's defaults.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- Documentation, e.g., doxygen or scientific papers (not tracked by git)
├── notebooks          <- Jupyter/Rmarkdown notebooks
└── src                <- Source code for this project
    ├── apps           <- scripts for apps (flask, streamlit)
    ├── data           <- scripts and programs to process data
    ├── tools          <- Any helper scripts go here
    └── visualization  <- Scripts for visualisation of your results, e.g., matplotlib, ggplot2 related.
```
How to contribute
You can submit SQL queries as GitHub Issues (please tag them as suggestions) to be included as canned queries in future deployments. These canned queries are displayed on the main page and come with their individual URLs to facilitate sharing and use as APIs. At the moment, we're including all the queries necessary to recreate the facts published in the 2020 Annual Report. If you find something interesting using this data, please submit your query so that others can use it in their work and build on it.
Suggested supplemental tables
There are many ways to enrich the data we already have. One example is the _codes tables. We extracted the code-definition tables from the README file and included them as tables in the sqlite database so that the data can be more useful "out of the box". For example, RAE_FULL comes with codes 1-8 for race/ethnicity; an excerpt of its _codes table:

| Code | Definition |
| --- | --- |
| 4 | Middle Eastern/South Asian |

This is a tiny table of just 8 rows which can be a massive help for anyone working with the data.
You can see it in use here: https://ripa-2018-db.datasettes.cimarron.io/ripa-2018-db/race-ethnicity-by-reason-for-stop
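To see why such a lookup table helps, here is a toy join against it using Python's sqlite3. The table and column names (race_ethnicity, RAE_FULL_codes) are hypothetical stand-ins for the tables described above, and all sample rows are invented apart from code 4's definition:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE race_ethnicity (UNIQUE_ID INTEGER, RAE_FULL INTEGER);
    CREATE TABLE RAE_FULL_codes (code INTEGER, definition TEXT);
    INSERT INTO race_ethnicity VALUES (1000000, 4), (1000001, 4);
    INSERT INTO RAE_FULL_codes VALUES (4, 'Middle Eastern/South Asian');
""")

# Translate numeric codes into readable labels while counting stops.
rows = conn.execute("""
    SELECT codes.definition, COUNT(*) AS stops
    FROM race_ethnicity
    JOIN RAE_FULL_codes AS codes ON race_ethnicity.RAE_FULL = codes.code
    GROUP BY codes.definition
    ORDER BY stops DESC
""").fetchall()
```

Without the _codes table, the same query would return bare numeric codes that every user would have to decode by hand from the README.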
If you have any ideas for other supplemental tables that we should include please open a GitHub Issue and let us know!
Optimizing datasette / sqlite
If you know of some other ways we could optimize performance on our queries or on the database itself, please let us know. We're not SQL experts and it could help a lot of people.
Ideas, thoughts, tips, suggestions, complaints
If you have any other ideas, thoughts, or questions about this project, please open up a GitHub Issue to start a conversation.