To request the full text of items for analysis, you must submit a dataset request that includes a list of item IDs taken from the JSTOR bibliographic metadata file.
A document's bibliographic metadata provides information about the document, such as title and year of publication. You may use it to identify the items you need for your research.
Downloading the JSTOR bibliographic metadata file
To download the JSTOR bibliographic metadata file:
- Log in to the JSTOR text analysis support page with your personal JSTOR account.
- If you don't already have a JSTOR account, you may register for a JSTOR account for free.
- Under Download metadata of JSTOR content available for text analysis, read and agree to the Terms and Conditions and then select the Download JSONL button.
Creating an item ID list
Once you've downloaded the JSTOR bibliographic metadata file and understand the data available in it, you'll need to create a list of JSTOR item IDs to submit as part of your dataset request.
Item ID list formatting requirements
Prior to submission, your ID list must meet the following requirements:
- A UTF-8 plain text file with the extension .txt
- One JSTOR item ID per line
- No additional text (no header, commas, or other separators)
- Maximum of 1,500,000 lines
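The requirements above can be checked before you upload a list. The following is a minimal sketch, not an official JSTOR tool: the function name `validate_id_list` is hypothetical, and it assumes item IDs are UUIDs written in the hyphenated form shown in the data dictionary.

```python
import uuid
from pathlib import Path

MAX_LINES = 1_500_000  # line limit stated in the requirements above


def validate_id_list(path):
    """Return a list of problems found in an item ID list file (empty if none)."""
    path = Path(path)
    problems = []
    if path.suffix.lower() != ".txt":
        problems.append("file extension is not .txt")
    try:
        text = path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines exceeds the maximum of {MAX_LINES}")
    for number, line in enumerate(lines, start=1):
        try:
            # Assumption: each ID is a UUID in hyphenated form, with nothing else on the line
            if str(uuid.UUID(line)) != line.lower():
                raise ValueError(line)
        except ValueError:
            problems.append(f"line {number} is not a single item ID: {line!r}")
    return problems
```

Calling `validate_id_list("rembrandt_item_ids.txt")` returns an empty list when the file passes every check; otherwise each entry describes one problem, such as a header row or a trailing comma.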
While you can use a variety of tools to create this list, because of the size of the metadata file when uncompressed, we recommend using tools that can stream the file rather than load it entirely into memory. See an example of stream processing using Python.
Other free and open source utilities that we recommend are:
- R and the jsonlite package for streaming JSON data
- The command line utility jq
Example: Using Python to filter metadata and create an item ID list
In the following example, we want to request all items that don’t require staff review and whose titles contain the word “Rembrandt”.
    import gzip
    import json

    item_ids = []

    # Using gzip.open it is possible to read the file line by line without loading it all into memory
    with (
        gzip.open(
            "jstor_metadata_2025-05-15.jsonl.gz",  # Replace with the name of the most recent metadata file you downloaded from https://www.jstor.org/ta-support/metadata
            "rt",
        ) as f
    ):
        for line_number, line in enumerate(f, start=1):
            # Print the current line number to the console so you can see the processing progress
            print(f"\rProcessing line: {line_number}", end="", flush=True)
            data = json.loads(line)

            # Put your filtering logic here
            # Check if the item will require staff review
            if data.get("review_required") is False:
                # Check if the item has a title
                if data.get("title") is not None:
                    # Check if the title contains the term "Rembrandt"
                    if "Rembrandt" in data["title"]:
                        item_ids.append(data["item_id"])

    print()  # Move to the next line after processing

    # Write the item IDs to a file
    with open("rembrandt_item_ids.txt", "w") as f:
        for item_id in item_ids:
            f.write(f"{item_id}\n")
Save this script as jstor_filter.py in the same directory as the downloaded metadata file, update the file name in the Python code to match the file you downloaded (the metadata file is updated daily), and run it using:
python jstor_filter.py
This will produce the TXT file rembrandt_item_ids.txt, which is ready to upload with your JSTOR text analysis dataset request.
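For broad filters, the list of matching IDs can itself grow large. As a possible variation on the script above, the sketch below writes each ID as soon as it is found and stops at the 1,500,000-line limit. The names `write_matching_ids` and `rembrandt_without_review` are hypothetical helpers introduced for illustration; they are not part of any JSTOR tooling.

```python
import gzip
import json

MAX_IDS = 1_500_000  # maximum number of lines allowed in an item ID list


def write_matching_ids(metadata_path, output_path, keep):
    """Stream the gzipped JSON Lines file and write the item_id of every
    record for which keep(record) is true. Returns the number of IDs written."""
    written = 0
    with gzip.open(metadata_path, "rt", encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines rather than fail on them
            record = json.loads(line)
            if keep(record):
                dst.write(f"{record['item_id']}\n")
                written += 1
                if written >= MAX_IDS:
                    break  # stop once the line limit is reached
    return written


def rembrandt_without_review(record):
    """Same filter as the example above: no staff review, 'Rembrandt' in the title."""
    title = record.get("title") or ""  # a missing or null title becomes an empty string
    return record.get("review_required") is False and "Rembrandt" in title
```

Calling `write_matching_ids("jstor_metadata_2025-05-15.jsonl.gz", "rembrandt_item_ids.txt", rembrandt_without_review)` should produce the same list as the script above. Passing the filter as a function keeps the streaming loop unchanged when the selection criteria change.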
Working with JSON Lines
The JSTOR bibliographic metadata file is made available in JSON Lines file format, compressed with gzip. JSON Lines is a file format for storing structured data, where each line is a valid JSON value.
The file contains one JSON object per line, each representing the bibliographic metadata for one JSTOR item. This allows programs to process the file one record at a time without needing to load the entire dataset into memory at once.
Visit JSON Lines for documentation on the JSON Lines file format.
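To illustrate the format, the following sketch parses a tiny gzip-compressed JSON Lines stream one record at a time. The two records are made up and compressed in memory so the example runs without the real metadata file; `iter_records` is a hypothetical helper name.

```python
import gzip
import io
import json


def iter_records(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


# Two made-up records standing in for the real metadata file
sample = (
    '{"item_id": "00000000-0000-0000-0000-000000000001", "title": "First"}\n'
    '{"item_id": "00000000-0000-0000-0000-000000000002", "title": null}\n'
)
compressed = gzip.compress(sample.encode("utf-8"))

# gzip.open accepts a file object as well as a file name
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as f:
    titles = [record["title"] for record in iter_records(f)]

print(titles)  # ['First', None]
```

Note that a JSON `null` becomes `None` in Python, so filters should allow for missing values.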
JSON data dictionary
You'll encounter the following fields in the JSTOR bibliographic metadata JSON Lines file. Use these fields to identify the content whose full text you want for your research.
Field | Type | Description | Example |
---|---|---|---|
item_id | UUID | Unique identifier for a JSTOR item. | 2c0018ee-094b-3f3c-b677-2b56c0f73b7e |
review_required | Boolean | If this document is included in a dataset request, will that request require ITHAKA staff review? Items where review_required is False are generally items that are Open Access. | true |
ithaka_doi | String | A DOI-like identifier for a JSTOR item. Note that these are not guaranteed to be DOIs resolvable by CrossRef. | 10.2307/25057135 |
identifiers | Dictionary with the following fields: print_isbn, online_isbn, print_issn, online_issn, ssid, catsid, journal_code | Note that some fields are identifiers for a parent item, such as ISBNs for the book that contains a book chapter, or ISSNs for the serial title that contains a research article. | {"print_isbn":null,"online_isbn":null,"print_issn":"00128163","online_issn":null,"ssid":null,"catsid":null,"journal_code":"earlamerlite","aluka_doi":null} |
title | String | Title of the item. Note that title strings may contain subtitles or whitespace such as line breaks. | Book Review : Captivity and Sentiment: Cultural Exchange in American Literature, 1682-1861 Michelle Burnham |
isPartOf | String | Title of the parent item, if it exists. For book chapters, this is the book title. For research articles, this is the journal title. | Early American Literature |
creators_string | String | Text string with names of the item creators, which may include authors as well as editors. | Christopher Castiglia, Michelle Burnham |
creators | List of dictionaries with the following fields: first_name: string, last_name: string, order: integer | When possible, JSTOR attempts to parse the raw creator string information to separate out creators as structured data. | [{"first_name":"Christopher","last_name":"Castiglia","order":"1"},{"first_name":"Michelle","last_name":"Burnham","order":"2"}] |
publishers | List of strings | Publisher(s) of the item. | ["University of North Carolina Press"] |
published_date | String date in the format YYYY-MM-DD | Publication date of the item. | |
languages | List of strings | Language(s) of the item. Note that these language codes are not fully normalized, and contain a blend of ISO 639-1 and ISO 639-2 codes. | ["eng"] |
discipline_names | List of strings | JSTOR discipline headings for the item, usually set at the serial title level. See JSTOR: Browse by Subject. | ["Language & Literature","American Studies"] |
issue_number | String | Issue number, if known. Note that this field can contain punctuation. | 3 |
issue_volume | String | Volume number, if known. Note that this field can contain punctuation. | 33 |
content_type | String | High-level categorization of the content types available on JSTOR (Books, Journals, and Research Reports, also known as "Academic content", plus Primary Sources), applied by JSTOR. | article |
content_subtype | String | A more granular classification of article or book part types available on JSTOR, applied by JSTOR or by journal publishers who submitted content to JSTOR via CSP/JHP. Also used as a more granular classification of contributed primary source content published through JSTOR Digital Stewardship Services, applied by JSTOR. | book-review |
c5_data_type | String | JSTOR maps content types to COUNTER data types for purposes of COUNTER reports provided to subscribing libraries. See Section 3.3.2 of the COUNTER Code of Practice Release 5.1. | Journal |
c5_section_type | String | Section type was deprecated with COUNTER Release 5.1. For reference, see Section 3.3.3 of the COUNTER Code of Practice Release 5.0.3. | Article |
ccda_resource_type | String | High-level categorization of the primary source content types available on JSTOR, applied by JSTOR. Each of the more granular Resource Types (see ccda_resource_subtype) also rolls up to a broad Content Type which enables the narrowing of search results through facets and determines the format for item downloads on JSTOR. See JSTOR Content and Resource Types. | null |
ccda_resource_subtype | String | A more granular classification that reflects the nature or genre of the content of the resource. Cataloged by the contributing institution from a JSTOR controlled list. Each Resource Type also rolls up to a broad Content Type (ccda_resource_type) which enables the narrowing of search results through facets and determines the format for item downloads on JSTOR. See JSTOR Content and Resource Types. | null |
contributed_content | Boolean | Is this item part of content contributed to JSTOR outside of the regular journal archive collections? | false |
collections | List of strings | One or more JSTOR collections that this item belongs to. | ["Arts & Sciences V Collection","Corporate & For-Profit Collection"] |
licensing_status | String | More details about the licensing status of the item when it is known. | open_access:CC BY-NC |
url | String | Stable URL for the item on JSTOR. | www.jstor.org/stable/10.2307/25057135 |
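Several of these fields need defensive handling when used in filters: values may be null, and language codes mix ISO 639-1 and ISO 639-2. The sketch below shows one way to select English-language items published in or after a given year. The helper names and the sample record are hypothetical and only illustrate the field shapes described in the table.

```python
from datetime import date

ENGLISH_CODES = {"en", "eng"}  # ISO 639-1 and ISO 639-2 codes for English


def published_year(record):
    """Return the publication year as an int, or None if the date is missing or malformed."""
    value = record.get("published_date")
    try:
        return date.fromisoformat(value).year
    except (TypeError, ValueError):
        return None


def is_english_since(record, first_year):
    """True if the record lists an English language code and was published in first_year or later."""
    languages = record.get("languages") or []  # treat null as an empty list
    year = published_year(record)
    return (
        any(code.lower() in ENGLISH_CODES for code in languages)
        and year is not None
        and year >= first_year
    )


# Hypothetical record for illustration; the date is not taken from the metadata file
record = {"languages": ["eng"], "published_date": "1998-01-01"}
print(is_english_since(record, 1990))  # True
```

A function like `is_english_since` can replace the "Rembrandt" title check in the filtering step of the Python example on this page.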