- Version: 0.0.12
- Released: 2026/06/11
- Author(s): Bryan Gee (UT Libraries, University of Texas at Austin; bryan.gee@austin.utexas.edu; ORCID: 0000-0003-4517-3290)
- Contributor(s): None
- License: GNU GPLv3
- README last updated: 2026/06/11
This repository includes scripts that are designed for reporting and assessment purposes for the Texas Data Repository (TDR). They are intended for both institution-level and TDR-level analysis.
- dataverse-file-assessment.py: Despite the name, this script retrieves metadata at the dataset and collection levels as well as at the file level. It does so through multiple calls to the Search API and the Native API, and it returns metadata that is not present in the monthly institutional reports generated by TDL. Conversely, some information available in those reports, such as metadata on unpublished datasets for all institutions, cannot be retrieved with these scripts. The script can be run either at the institution level, in which case a regular liaison user can retrieve unpublished collections, datasets, and files, or at the pan-TDR level. Core parts of the codebase have been modified or created in parallel with the Metadata Re-Curation Workflow, so the current version differs considerably from previous versions.
- dataverse-file-assessment.ipynb: The same script as dataverse-file-assessment.py, in Jupyter notebook format.
- tcdl-graphs.ipynb: Jupyter notebook with code to generate graphics for annual TCDL usage reporting. Right now, it also contains all of the code in tcdl-institutional-reports.ipynb for a single-shot run.
- tcdl-institutional-reports.ipynb: Jupyter notebook with code to generate institution-specific slide-decks for annual TCDL usage reporting.
- affiliation-map-primary.csv: This is a file developed for automated re-curation workflows that may be integrated into this dataset/file assessment workflow as well. It provides a mapping of every unique listed affiliation in the UT Austin dataverse (as of 2026/05/28) to a ROR identifier (if one exists).
- funder-map-primary.csv: This is a file developed for automated re-curation workflows that may be integrated into this workflow as well. It provides a mapping of every unique listed funder across all TDR published datasets (as of 2026/05/28) to a ROR identifier (if one exists).
- utils.py: This file contains all of the functions needed for the scripts. As a note, this is a function file being used by the developer across many different projects, so it includes many functions irrelevant to this repository, which is why only the necessary ones are imported in the scripts. It should not be modified except by users with detailed knowledge of Python and this workflow.
The dataverse-file-assessment script will return eight direct output files (listed in the order in which they are generated):
- date_institution_all-deposits.csv: a dataset-level dataframe with an entry for every dataset that is returned from the Search API. For users with the appropriate permissions, this can include unpublished and deaccessioned datasets. This dataframe is merged with one of the TDL data dumps for additional dataset-level metadata.
- date_institution_all-files-deduplicated.csv: a file-level dataframe with an entry for each file retrieved from the search process. If you are only retrieving published records, this will have '-PUBLISHED' appended to the end of the filename.
- date_institution_all-datasets-combined.csv: a dataset-level dataframe that is constructed by aggregating all file-level information into dataset-level entries and then merging it with one of the TDL data dumps for additional dataset-level metadata.
- date_institution_all-dataverses.csv: a collection-level dataframe with an entry for every collection that is returned from the Search API. For users with the appropriate permissions, this can include unpublished and deaccessioned collections. This dataframe is merged with one of the TDL data dumps for additional collection-level metadata.
- date_institution_SUMMARY-unique-format.csv: a dataframe with a summary of the number of unique datasets in which each file format occurs.
- date_institution_SUMMARY-annual-size.csv: a dataframe with a summary of the total file size of files created in a given year.
- date_institution_all-datasets-combined-with-dataverses.csv: a dataset-level dataframe that is essentially the third and fourth files above (all-datasets-combined and all-dataverses) combined. If you enable metrics retrieval, the retrieved metrics will be appended to this file.
- date_institution_all-datasets-combined-with-dataverses-PUBLISHED.csv: the same as the previous file but only for published datasets.
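The two SUMMARY files described above are simple aggregations of file-level records. As a rough illustration only, the plain-Python sketch below counts the unique datasets in which each format occurs and totals file size by creation year. The field names (`dataset_id`, `file_format`, `size_bytes`, `created`) and the sample records are hypothetical, and the script itself works with dataframes rather than lists of dictionaries.

```python
from collections import defaultdict

# Hypothetical file-level records; the field names are assumptions for
# illustration and may not match the columns in the real output files.
files = [
    {"dataset_id": "dataset-A", "file_format": "text/csv", "size_bytes": 1200, "created": "2023-04-01"},
    {"dataset_id": "dataset-A", "file_format": "text/csv", "size_bytes": 800, "created": "2023-04-01"},
    {"dataset_id": "dataset-A", "file_format": "application/pdf", "size_bytes": 5000, "created": "2023-04-02"},
    {"dataset_id": "dataset-B", "file_format": "text/csv", "size_bytes": 300, "created": "2024-01-15"},
]


def unique_format_summary(records):
    """Count the number of unique datasets in which each file format occurs."""
    datasets_per_format = defaultdict(set)
    for record in records:
        datasets_per_format[record["file_format"]].add(record["dataset_id"])
    return {fmt: len(ids) for fmt, ids in datasets_per_format.items()}


def annual_size_summary(records):
    """Sum file sizes (in bytes) by the year in which each file was created."""
    totals = defaultdict(int)
    for record in records:
        year = int(record["created"][:4])  # assumes ISO-style YYYY-MM-DD dates
        totals[year] += record["size_bytes"]
    return dict(totals)
```

Note that a format appearing in several files of the same dataset is counted once for that dataset, which is what distinguishes this summary from a plain file count.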
If you already have affiliation-map-primary.csv, the script will also generate a file called affiliation-map-primary-TEMP.csv. This file is generated by collecting all unique affiliations in the latest run, combining them with the existing file, and de-duplicating (keeping the previous entries, at least some of which will have been ROR-matched). The idea is to build a continually growing reference file for your local TDR instance. After editing the -TEMP file to add any new ROR matches, manually save it as affiliation-map-primary.csv, overwriting the older version, so that the next run of the script pulls the new version. If you don't have affiliation-map-primary.csv to start, the script will save the unique affiliation dataframe under that filename the first time it runs, and you can then start building the database for ROR matching. The same is true for funder-map-primary.csv. Editing these files is optional; the idea is for someone to be responsible for centrally maintaining the currency of these maps.
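The combine-and-de-duplicate step can be pictured with the sketch below. It is not the script's actual implementation (which operates on dataframes); it is a minimal plain-Python version that assumes two hypothetical columns, `affiliation` and `ror_id`, and keeps any existing row in preference to a newly collected one so that earlier ROR matches are preserved.

```python
import csv

# Hypothetical column names; the real map files may use different headers.
FIELDS = ["affiliation", "ror_id"]


def merge_affiliation_maps(existing_rows, latest_affiliations):
    """Combine the existing map with affiliations from the latest run.

    Existing rows win on conflict, so previously entered ROR matches are kept.
    Affiliations not seen before are added with an empty ROR field.
    """
    merged = {row["affiliation"]: dict(row) for row in existing_rows}
    for name in latest_affiliations:
        merged.setdefault(name, {"affiliation": name, "ror_id": ""})
    return list(merged.values())


def write_map(rows, path):
    """Write the merged map to a CSV file (e.g., the -TEMP file)."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

In this sketch, the merged rows would be written to the -TEMP filename, edited by hand, and then saved over the primary file, mirroring the manual step described above.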
Right now, the tcdl-graphs notebook is not set to write any outputs, as generated graphics can be directly copied out of the Jupyter notebook interface and into any desired program. In the future, it will be set up to save these images.
The tcdl-institutional-reports notebook will return one PowerPoint file for each institution with the filename formatted as {institution}report{date}.pptx. The slide-decks are in 16:9 format to facilitate direct adaptation of the pan-TDR graph code, which is mostly set to output plots at 14 x 7 dimensions.
These scripts can be freely re-used, re-distributed, and modified in line with the associated GNU GPLv3 license. If a re-user is only seeking to replicate a UT-Austin-specific output or to retrieve an equivalent output for a different institution, the script will require very little modification - essentially only the defining of affiliation parameters will be necessary. A superuser could have greater functionality in some instances, but superuser-specific functionality has largely not been developed because I have no way to test it.
API keys and numerical API query parameters (e.g., records per page, page limit) are defined in an env.json file. The file included in this repository called env-template.json should be populated with API keys and any other user- or institution-specific information and renamed to env.json.
Users will need to create accounts for Dataverse in order to obtain personalized API keys, add those to the env-template.json file, and rename it as env.json.
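For orientation, reading such a file could look like the sketch below. The key names (`api_key`, `per_page`, `page_limit`, `test`) are assumptions made for illustration; the authoritative names are whatever env-template.json defines, and the scripts may load the file differently.

```python
import json

# Hypothetical key names; check env-template.json for the real ones.
REQUIRED_KEYS = ("api_key", "per_page", "page_limit", "test")


def load_config(path="env.json"):
    """Load settings from a JSON file and fail early if a key is missing."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"{path} is missing keys: {', '.join(missing)}")
    return config
```

Failing early on a missing key is a design choice of this sketch; it surfaces a half-populated template before any API calls are made.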
A Boolean variable called test, defined in the env.json file, can be used to create a 'test environment.' When it is set to TRUE, the script retrieves only a handful of pages of the full response. This is useful for testing new functionality and troubleshooting, provided that any bugs are not edge cases that would be unlikely to be retrieved in a small sample.
Following requests to implement manual rate limiting, large batches of iterative API calls are manually rate limited in the code (via time.sleep commands). This should not be modified.
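The interplay between the test setting, the page limit, and the time.sleep pauses can be sketched as follows. This is an illustration under stated assumptions, not the code in utils.py: `fetch_page` is a hypothetical callable that returns the records for one page, and the three-page test cap and one-second delay are placeholder values.

```python
import time


def fetch_all_pages(fetch_page, page_limit, test=False, test_pages=3,
                    delay_seconds=1.0, sleep=time.sleep):
    """Collect records page by page, pausing between calls.

    fetch_page: hypothetical callable taking a zero-based page number and
        returning a list of records (empty when no more pages remain).
    test: when True, at most `test_pages` pages are requested.
    sleep: injectable so the pause can be replaced in tests.
    """
    max_pages = min(page_limit, test_pages) if test else page_limit
    results = []
    for page in range(max_pages):
        records = fetch_page(page)
        if not records:
            break  # no more results to retrieve
        results.extend(records)
        sleep(delay_seconds)  # manual rate limiting between API calls
    return results
```

Passing the sleep function as a parameter is only a convenience for testing the sketch; the pause itself should stay in place for real API calls.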
In addition to the technical infrastructure needed to run this script, two inputs provided by TDL and a folder of institutional logos are necessary:
- dataverse-reports-YYYYMMDD: this folder contains the biweekly (now monthly?) reports run for each institution. The primary script here will concatenate all of the datasets and dataverses by importing each file's relevant sheets and will output a single concatenated file for each into that same folder.
- Dataverse-users-YYYYMMDD.xlsx: this Excel file contains all users in the system and cannot be reproduced by concatenating the 'users' tab from the biweekly reports. It is only necessary for the graphing components - there are no additional data retrieval components involved with this. This should be converted to a CSV for import.
- logos: this folder contains PNG or JPG images of each institution's logo. This is not shared on GitHub for trademark purposes and can either be requested from this repository's maintainer (Bryan) or recreated yourself by adding a logos subfolder within the same directory as the script and adding images with the name {institution}_logo. For standardization, you should use the TDR collection abbreviation (e.g., 'utexas' for UT Austin) that is used as the alias for your institution's collection.
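If you recreate the logos folder yourself, a lookup along the lines of the sketch below can confirm that each institution has an image following the {institution}_logo naming convention. The function name and the accepted extensions are assumptions for illustration; the notebooks may locate logos differently.

```python
from pathlib import Path


def find_logo(logos_dir, alias):
    """Return the path to {alias}_logo.png/.jpg/.jpeg, or None if absent.

    alias: the TDR collection abbreviation, e.g., 'utexas'.
    """
    for extension in (".png", ".jpg", ".jpeg"):
        candidate = Path(logos_dir) / f"{alias}_logo{extension}"
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_logo("logos", "utexas")` would return the path to a matching image in a local logos subfolder, or None if no such file exists.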
These scripts make use of common modules that should either come pre-installed with a standard installation of Python or that are widely used and maintained:
- ast
- csv
- datetime
- io
- json
- matplotlib
- numpy
- os
- pandas
- pillow (PIL)
- python-pptx
- re
- requests
- sys
- time
- urllib
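As a convenience, the sketch below checks whether the third-party modules from the list above can be found in the current environment before running the scripts. It uses the import names rather than the package names (pillow is imported as `PIL` and python-pptx as `pptx`); the function itself is hypothetical and not part of this repository.

```python
import importlib.util

# Import names of the third-party modules listed above.
THIRD_PARTY_MODULES = ["matplotlib", "numpy", "pandas", "PIL", "pptx", "requests"]


def missing_modules(names=THIRD_PARTY_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` returns an empty list when everything is installed; any names it does return can then be installed with your package manager of choice.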