By Evan Jacobs and Desiree Caldera

Video Presentation:



The Entertainment Technology Center (ETC) at Carnegie Mellon University (CMU) is a two-year Masters program founded in 1998 that focus on building transformational games, innovating by design, and interactive storytelling. With one preparatory semester and three project semesters, teams of 40% technical, 40% art, and 20% of an alternatively focused students by composition, participate in 14-week project development. Our research advisor and professor Dr. Eric Kaltman has been given copies of their project development files and folders up until this last Spring 2022 semester to perform analysis and research. This data comprises all project folders, data, source code, and all assets.

There has been some work prior to our involvement in this research. This can be split up into two parts, the ETC Past Project Listing portion and the DROID file analysis. The ETC Past Project Listing is a major endeavor of obtaining all information about those projects, such as project name, semester, year, advisors, members, client(s), technologies used, meta categories, etc. Much of the data was given to us in an unorganized fashion, and we needed to determine many of those details by investigation.

The DROID file analysis focused on the use of a tool developed by the UK National Archives called DROID (Digital Record Object Identification Program), which uses file extensions, binary signatures and more to identify files to their PRONOM file format identifier. PRONOM is the technical registry detailing information about the structure of those file formats and the software which uses them. Thus, the DROID file analysis portion was focused on the development of a methodology to analyze the data.  Multithreaded Python scripts were created to analyze the files as quickly as possible to create comma-separated-value (CSV) files for each project or semester. This part of the work also combined the ETC Past Project Listing with this analysis to initially create a SQLite3 database file of more than 6GB of metadata.

Two sets of analysis have already been performed, a partial content analysis [1] and a full content profile [2] culminating in two research papers. The partial  content analysis consisted of four projects, including one animation project and three games. The animation project consisted of 275GB of 293GB of project data. They discovered 222 distinct file extensions, 119 were not identified. In the full content profile, they found 9.2 million files, and about 4,982 file extensions. 3,425 file extensions were not identified, about four million files or 44% of the data.


The data visualization tool we plan to use is Tableau Desktop. However, Tableau Desktop does not support SQLite3 very well, so we had to move the data to a better supported database, MySQL. This in turn means we had to modify the data analysis scripts created previously to insert the data into our new database. The raw data now consists of approximately 17TB and more than 13 million files and folders. 

We had to reimagine the data pipeline for our research since we decided to use Tableau. One database we were familiar with, was free, worked with Tableau, and we knew had a good amount of online support available was MySQL. Luckily then, we only had to make a few small modifications to the Python insertion scripts. These scripts would combine the data from the ETC Past Project Listing and the meta data extracted using the DROID wrapper before inserting it into the MySQL database.

The MySQL database is setup in a LAMP stack currently hosted behind a VPN on campus. A LAMP stack is a Linux server hosting an Apache http webserver, the MySQL database, supports Python scripting, and uses PHPMyAdmin. SSL certificates were created to allow easier access to the data for us through JetBrains DataGrip and Tableau. The database is setup in three stages to preserve an unmodified dirty copy, a staging area to clean the data, and a currently clean version of the meta data. The first is a set of dirty tables where the data is directly inserted into from the Python insertion scripts. The data is then copied to the staging tables where cleaning scripts are run from DataGrip. Once the cleaning is completed, we move the data to the clean tables where it can be used in Tableau. DataGrip is used to move the data, clean the data, and create tables from the data that can then be used in Tableau. Once we have the data, we need aggregated into tables, we connected Tableau to them and began creating visualizations. Some of the visualizations we want to create include, file extensions by count, file format by aggregate size, and unidentified file extension by aggregate size. Some of the statistics we wanted to calculate include, total file count, distinct file extension count, distinct count of unidentified file formats, and size of unidentified file formats. For levels of detail, we wanted to look at by single project, project type, and project semester/year.


Our analysis covered over 17TB of data, and over 13 million files. Of those, we found 5383 file extensions, 3788 file extensions were not identified.

The driving question for our research was how do ETC projects and their composition change over time? Within the process of answering our main question, we discovered that our data set has the potential of answering a smaller subset of research questions such as:

  • Which file types are coined rare across all projects?
  • What types of redundancies do we find across all projects? (ex. duplicate copies of the same file)
  • How do project sizes change over time?

We found about 90.5% redundancy across all projects and semesters, and the highest by semester is Spring 2012 with 92.6% redundancy.

Our main goal was to display these results through visualizations that provide an easier and cohesive way to investigate the data. The obstacle that we faced throughout our research was the size of the dataset. The size of dataset forced many changes and adjustments in the ways we structured the view in DataGrip to lower the runtime execution. The lower the runtime execution in DataGrip, the lower the runtime execution time in Tableau. The constraints of the data set size and memory consumption involuntarily put us and our data visualization work on halt at some points. Despite our initial drawbacks, with the help of Dr. Kaltman, we were able to optimize the data set which, in turn, provided a reliable data source for Tableau. Within the remaining weeks of our research we were able to begin our data visualization processes as displayed below. The resources we created serve as milestones in our research as ETC have not had the opportunity to organize and analyze their data for themselves in the past. We hope the results of our work deem helpful to professionals at Carnegie Mellon University and future researchers here at CI.

ETC data analysis at Project Level
ETC data analysis at Semester Level
ETC data analysis at Redundancy


Database creation and data visualization take much grit and patience in order for these tools to make an impact. The state of the data visualization depends on the state of the data set. Working with raw data that has not been processed is an extensive challenge of its own and inevitably requires much trial and error. Through these processes, we were able to obtain new skills and encounter interesting findings. We have found that in many ways we are able to calculate and visualize the trends and changes within ETC projects. There is much value in the work of database management and development. We hope that future researchers can also discover these values as well while taking our research further.


We would like to thank Dr. Kaltman for bringing us on board and supporting our progress throughout this semester. We would also like to thank the previous student researchers that were able to guide us and help us problem solve. We are thankful to have been apart of this research and the SHFT (Software History Futures and Technologies) team.

Further Reading




Literature Cited

[1] E. Kaltman, “Preliminary Analysis of a Large-Scale Digital Entertainment Development Archive: A Case Study of the Entertainment Technology Center’s Projects,” 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 3109-3115, doi: 10.1109/BigData47090.2019.9006606.

[2] E. Kaltman, R. Lorelli, A. Larson and E. Wolfe, “Organizing a Content Profile for a Large, Heterogeneous Collection of Interactive Projects,” 2021 IEEE International Conference on Big Data (Big Data), 2021, pp. 2231-2239, doi: 10.1109/BigData52589.2021.9671904.

Leave a Reply

Your email address will not be published. Required fields are marked *