By Evan Jacobs and Desiree Caldera

Video Presentation:

Poster:

Introduction

The Entertainment Technology Center (ETC) at Carnegie Mellon University (CMU) is a two-year master's program, founded in 1998, that focuses on building transformational games, innovating by design, and interactive storytelling. With one preparatory semester and three project semesters, teams, typically composed of 40% technical students, 40% art students, and 20% students with an alternative focus, participate in 14-week project development cycles. Our research advisor and professor, Dr. Eric Kaltman, has been given copies of the ETC's project development files and folders through the Spring 2022 semester to perform analysis and research. This data comprises all project folders, data, source code, and assets.

Some work preceded our involvement in this research, and it can be split into two parts: the ETC Past Project Listing and the DROID file analysis. The ETC Past Project Listing is a major endeavor to collect all available information about the projects, such as project name, semester, year, advisors, members, client(s), technologies used, and meta categories. Much of this data was given to us in an unorganized fashion, and we had to determine many of the details by investigation.

The DROID file analysis focused on the use of a tool developed by the UK National Archives called DROID (Digital Record Object Identification), which uses file extensions, binary signatures, and more to match files to their PRONOM file format identifiers. PRONOM is the technical registry detailing the structure of those file formats and the software that uses them. The DROID file analysis portion was therefore focused on developing a methodology to analyze the data. Multithreaded Python scripts were created to analyze the files as quickly as possible and produce comma-separated values (CSV) files for each project or semester. This part of the work also combined the ETC Past Project Listing with the analysis to create an initial SQLite3 database file containing more than 6GB of metadata.
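The multithreaded CSV generation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual wrapper: the DROID jar name and CLI flags shown here are assumptions and should be checked against the DROID command-line documentation.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import subprocess

DROID_JAR = "droid-command-line-6.5.jar"  # hypothetical jar path/version

def build_droid_command(project_dir, out_csv):
    """Build a DROID CLI invocation that scans one folder and writes a CSV.
    The flags (-R recurse, -a scan target, -o output) are assumptions for
    illustration; verify against your installed DROID version."""
    return ["java", "-jar", DROID_JAR, "-R",
            "-a", str(project_dir), "-o", str(out_csv)]

def scan_projects(root, out_dir, workers=4):
    """Scan each top-level project folder in parallel, one CSV per project."""
    root, out_dir = Path(root), Path(out_dir)
    jobs = [(p, out_dir / f"{p.name}.csv") for p in root.iterdir() if p.is_dir()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for project, csv_path in jobs:
            pool.submit(subprocess.run,
                        build_droid_command(project, csv_path), check=True)
```

A thread pool suits this workload because each worker mostly waits on an external DROID process rather than competing for the Python interpreter.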

Two sets of analysis have already been performed, a partial content analysis [1] and a full content profile [2], culminating in two research papers. The partial content analysis covered four projects: one animation project and three games. The animation project accounted for 275GB of the 293GB of project data. That analysis discovered 222 distinct file extensions, of which 119 were not identified. The full content profile found 9.2 million files and about 4,982 file extensions; 3,425 of those extensions were not identified, accounting for about four million files, or 44% of the data.

Methodology

The data visualization tool we planned to use is Tableau Desktop. However, Tableau Desktop does not support SQLite3 very well, so we had to move the data to a better supported database, MySQL. This in turn meant modifying the previously created data analysis scripts to insert the data into our new database. The raw data now consists of approximately 17TB and more than 13 million files and folders.

Since we decided to use Tableau, we had to reimagine the data pipeline for our research. MySQL was a database we were familiar with, was free, worked with Tableau, and had a good amount of online support available, so only a few small modifications to the Python insertion scripts were required. These scripts combine the data from the ETC Past Project Listing with the metadata extracted using the DROID wrapper before inserting it into the MySQL database.

The MySQL database is set up in a LAMP stack currently hosted behind a VPN on campus: a Linux server hosting an Apache HTTP web server and the MySQL database, with Python scripting support and phpMyAdmin. SSL certificates were created to give us easier access to the data through JetBrains DataGrip and Tableau. The database is organized in three stages to preserve an unmodified dirty copy, a staging area where the data is cleaned, and a current clean version of the metadata. The first stage is a set of dirty tables into which the Python insertion scripts write directly. The data is then copied to the staging tables, where cleaning scripts are run from DataGrip. Once cleaning is complete, we move the data to the clean tables, where it can be used in Tableau. DataGrip is used to move the data, clean it, and create the aggregate tables that feed Tableau. Once the data was aggregated into tables, we connected Tableau and began creating visualizations. Some of the visualizations we wanted to create include:

  • file extensions by count
  • file formats by aggregate size
  • unidentified file extensions by aggregate size

Some of the statistics we wanted to calculate include total file count, distinct file extension count, distinct count of unidentified file formats, and size of unidentified file formats. For levels of detail, we wanted to look at the data by single project, by project type, and by project semester/year.
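The dirty → staging → clean promotion can be sketched as a short sequence of SQL statements run from a script. This is an illustrative assumption about the flow, not the real cleaning scripts, and sqlite3 again stands in for MySQL so the sketch runs on its own.

```python
import sqlite3  # sqlite3 stands in for MySQL so the sketch is self-contained

# Illustrative three-stage flow: dirty tables receive raw inserts, staging
# tables are cleaned in place, and clean tables feed Tableau. Table and
# column names are assumptions, not the real schema.
STAGES = [
    "CREATE TABLE staging_files AS SELECT * FROM dirty_files",
    # example cleaning pass: normalize extensions and strip whitespace
    "UPDATE staging_files SET ext = LOWER(TRIM(ext))",
    "CREATE TABLE clean_files AS SELECT * FROM staging_files",
]

def promote(conn):
    """Run each stage in order, promoting data toward the clean tables."""
    for stmt in STAGES:
        conn.execute(stmt)
    conn.commit()
```

Keeping the dirty tables untouched means any cleaning mistake can be rolled back by simply rebuilding staging from dirty.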

Results

Our analysis covered over 17TB of data and over 13 million files. Among those, we found 5,383 file extensions, of which 3,788 were not identified.
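Statistics like these can be derived directly from the per-file metadata. The sketch below assumes each file is a simple (extension, PUID) pair and that a missing PUID marks an unidentified file; both are illustrative assumptions about how the DROID output encodes misses.

```python
def extension_stats(rows):
    """Summarize (ext, puid) pairs: total file count, distinct extensions,
    and extensions that were never matched to a PRONOM identifier."""
    identified, unidentified = set(), set()
    total = 0
    for ext, puid in rows:
        total += 1
        (identified if puid else unidentified).add(ext)
    # an extension counts as unidentified only if no file with it matched
    unidentified -= identified
    return {"files": total,
            "extensions": len(identified | unidentified),
            "unidentified_extensions": len(unidentified)}
```

In the real pipeline the same aggregation would be a GROUP BY query over the clean tables rather than a Python loop.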

The driving question for our research was: how do ETC projects and their composition change over time? In the process of answering our main question, we discovered that our data set has the potential to answer a smaller set of research questions, such as:

  • Which file types are rare across all projects?
  • What types of redundancies do we find across all projects? (ex. duplicate copies of the same file)
  • How do project sizes change over time?

We found about 90.5% redundancy across all projects and semesters; the semester with the highest redundancy is Spring 2012, at 92.6%.
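One plausible way to compute a redundancy figure like this is from content hashes: files sharing a hash are duplicate copies, so redundancy is the fraction of files that are not the first copy of their content. The exact definition used in our view may differ; this is a minimal sketch of that reading.

```python
def redundancy(hashes):
    """Fraction of files that duplicate another file's content, judged by
    hash: 1 - distinct_hashes / total_files. An empty input yields 0.0."""
    if not hashes:
        return 0.0
    return 1 - len(set(hashes)) / len(hashes)
```

For example, four files of which three share one hash give 1 - 2/4 = 50% redundancy.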

Our main goal was to display these results through visualizations that provide an easier, more cohesive way to investigate the data. The main obstacle we faced throughout our research was the size of the dataset, which forced many changes and adjustments in how we structured the views in DataGrip to reduce query execution time; the faster a query ran in DataGrip, the faster it loaded in Tableau. The constraints of data set size and memory consumption put our data visualization work on hold at several points. Despite these drawbacks, with the help of Dr. Kaltman we were able to optimize the data set, which in turn provided a reliable data source for Tableau. Within the remaining weeks of our research we were able to begin our data visualization process, as displayed below. The resources we created serve as milestones in our research, as the ETC has not previously had the opportunity to organize and analyze this data for itself. We hope the results of our work prove helpful to professionals at Carnegie Mellon University and to future researchers here at CI.

ETC data analysis at the Project Level
ETC data analysis at the Semester Level
ETC data analysis of Redundancy

Conclusion

Database creation and data visualization take a great deal of grit and patience before these tools can make an impact. The state of a data visualization depends on the state of the data set. Working with raw, unprocessed data is an extensive challenge of its own and inevitably requires much trial and error. Through these processes, we gained new skills and encountered interesting findings. We found that in many ways we are able to calculate and visualize the trends and changes within ETC projects. There is much value in the work of database management and development, and we hope future researchers discover that value as well while taking our research further.

Acknowledgements

We would like to thank Dr. Kaltman for bringing us on board and supporting our progress throughout this semester. We would also like to thank the previous student researchers who guided us and helped us problem-solve. We are thankful to have been a part of this research and the SHFT (Software History, Futures, and Technologies) team.

Further Reading

ETC @ CMU: https://www.etc.cmu.edu/

DROID: https://www.nationalarchives.gov.uk/information-management/manage-information/preserving-digital-records/droid/

PRONOM: https://www.nationalarchives.gov.uk/PRONOM/

Literature Cited

[1] E. Kaltman, “Preliminary Analysis of a Large-Scale Digital Entertainment Development Archive: A Case Study of the Entertainment Technology Center’s Projects,” 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 3109-3115, doi: 10.1109/BigData47090.2019.9006606.

[2] E. Kaltman, R. Lorelli, A. Larson and E. Wolfe, “Organizing a Content Profile for a Large, Heterogeneous Collection of Interactive Projects,” 2021 IEEE International Conference on Big Data (Big Data), 2021, pp. 2231-2239, doi: 10.1109/BigData52589.2021.9671904.


Development Blog #1

August 29th, 2022

  • Met with capstone team to develop a game plan for the rest of the semester
  • Created CI Sync development blog
  • Revisited the work our team had completed last semester

Development Blog #2

September 5th, 2022

  • Met with capstone team to analyze repositories together
  • Created tasks for each member
  • I analyzed the code in the main repository we will be working with and incorporating into a SQLite server: ETC_Data_Analysis_Main
    • CSV Generator: a project folder that contains the droid.py file. droid.py makes use of DROID and creates CSVs given a file-system hierarchy
    • Database: a project folder that contains blocks of code that take in the CSVs generated by droid.py to create a database
    • These two project folders will help us create the data we need for visualizations in Tableau
  • Read up on the documentation for these repositories
  • Was able to generate a mock database with already existing CSVs

Tasks:

  • Study the repository even further and get the code in the CSV Generator and Database project folders working again.

Development Blog #3

September 12th, 2022

  • The code for droid.py in CSV Generator is finally working
    • Installed DROID; droid.py now runs on a working directory of project folders
    • The process was a bit difficult, as there were program and system dependencies that were not detailed in the documentation
    • Much research went into how to install or work around these dependencies
    • In the process, I was able to familiarize myself with the code in droid.py even further, as I had to debug where the errors were coming from
    • I documented the process and the steps, and I hope to incorporate them into the already existing documentation

Tasks:

  • Run the output of droid.py through databasegenerator.py; this should not take long
  • See how we can incorporate this into sqlite server
  • Talk to Evan

Development Blog #4

September 19th, 2022

  • This week consisted of getting databasegenerator.py to run. Contrary to my previous expectations, this in fact took a long time.
    • I ran into errors that I was not quite able to debug effectively. There is not much documentation on DROID online that could point me in the right direction.
    • I attempted to run the csvgenerator.py file differently; this did not seem to help my case
    • Brainstormed different ideas with Evan
    • Professor Kaltman was able to help debug the issue; databasegenerator.py is now running

Tasks:

  • Run this process on the server, take notes on the processes while doing so

Development Blog #5

September 26th, 2022

  • This week consisted of transferring the repository onto the server and figuring out the command line in the process
  • This took much trial and error; Evan went ahead and joined me on this task as well
    • There is a line of code that deals with setting the hash type that, for some reason, was being overlooked when the code was run on the server
    • It was being overlooked when run in general as well; we had to access the GUI to set the hash settings
  • We studied the code to see how the settings were being configured. We found resources that pointed us in the right direction, made a small edit to the line of code, and fixed the hash error
  • We ran csvgenerator on the server and finally got CSVs to generate

Tasks:

  • See how we can do data analysis with Siegfried

Development Blog #6

October 3rd, 2022

  • I attempted to execute the data analysis with Siegfried, and it partially worked
    • I was able to figure out how to tell it to use DROID-style output while also setting the correct hash setting
    • I came to a stop, as there was no way to tell Siegfried to recursively make CSVs for each project folder within the folder hierarchy
    • I tried to create a script that would trigger Siegfried to do this recursively, but that did not work either
    • We made the collective decision, for the sake of time, to move away from Siegfried and start creating the database

Tasks:

  • Start working on the database with Evan

Development Blog #7

October 10th, 2022

  • We began making more CSVs on the newer 2019 semester data
  • We looked into how we can clean the data and organize the database
  • I drew up what our potential statistics view could look like
  • Not much progress; baby steps of cleaning

Tasks:

  • Further develop the database

Development Blog #8

October 17th, 2022

  • Further database creation
  • I revisited the Tableau workbook to see how we may want to display the data
  • Was out sick for some time this week, so not much work was completed on my end

Tasks

  • Further develop database

Development Blog #9

October 24th, 2022

  • Began analyzing the tables that Evan created
    • I went ahead and ran some test and trial runs to understand the database a bit better
    • I revisited my data visualization notes to imagine how we can categorize fields and create calculations
    • We wanted to have data analysis at the project level, technologies level, and year level
      • I attempted to query the data this way. For example, there is a field called “mime type” that describes the type of technology of the project. I began parsing this field so that only a single technology appears. Did not get very far; the database is huge.
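The mime-type parsing sketched above could look like the following. The input formats assumed here (single MIME strings or comma/semicolon-separated lists, with an optional `x-` prefix on the subtype) are guesses about the field, not its documented shape.

```python
def primary_technology(mime_type):
    """Reduce a raw mime-type field to a single technology label.
    Assumes values like 'application/x-shockwave-flash' or separated
    lists such as 'video/quicktime, image/png'."""
    if not mime_type:
        return "unknown"
    # take the first entry, then the subtype after the slash
    first = mime_type.replace(";", ",").split(",")[0].strip()
    subtype = first.split("/")[-1]
    # strip the experimental 'x-' prefix some MIME subtypes carry
    return subtype[2:] if subtype.startswith("x-") else subtype
```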

Tasks

  • Work on the etc statistics view

Development Blog #10

October 31st, 2022

  • Created the ETC Statistics View
    • This view contains the fields and statistics (calculations) that we want to visualize in Tableau
    • I created multiple CTEs that calculate file counts, redundancy, project size, and more
    • We can definitely optimize the code a bit more to lower execution time. I filtered the data to a single project so that I could test these calculations without having to wait for a query over the entire database.
    • As of right now, the view still makes use of ETC_DIRTY
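A stripped-down version of what such CTEs compute is sketched below. The schema is assumed for illustration, and sqlite3 stands in for the MySQL database so the query can run on its own; the real view is larger and lives in MySQL.

```python
import sqlite3  # stand-in; the real view is a MySQL view over ETC tables

# Illustrative statistics query: CTEs for per-project file count, project
# size, and hash-based redundancy. Table/column names are assumptions.
STATS_QUERY = """
WITH counts AS (
    SELECT project, COUNT(*) AS files, SUM(size) AS bytes
    FROM files GROUP BY project
),
dupes AS (
    SELECT project,
           1.0 - COUNT(DISTINCT hash) * 1.0 / COUNT(*) AS redundancy
    FROM files GROUP BY project
)
SELECT c.project, c.files, c.bytes, d.redundancy
FROM counts c JOIN dupes d ON c.project = d.project
"""
```

Splitting each statistic into its own CTE keeps the calculations testable in isolation, which matched how we debugged against a single project first.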

Tasks

  • Possibly optimize the data set
  • Run the view on the entire data set to see how we should move forward

Development Blog #11

November 7th, 2022

  • Worked more on the ETC Statistics View
  • It was taking a long time to finish querying the data
  • We went ahead and combined CTEs as much as we were able to
    • Some trial and error here; we used a GROUP BY clause to group the calculations by project, which brought up some errors.
  • Joined Evan to see how he cleaned the dataset

Tasks

  • Look more how we can optimize the data set / database further as the execution time is still very high

Development Blog #12

November 21st, 2022

  • DataGrip was still taking a very long time to query the data
  • We connected the database to Tableau; the execution time in Tableau went up to 300 minutes before it closed unexpectedly
  • Our data visualization process was put on hold last week; hopefully we are able to build visualizations before Wednesday. Professor Kaltman was able to help us index the database so that it executes everything much more quickly than before
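Indexing the columns the statistics view groups and joins on is what makes this kind of speedup possible. A minimal sketch, again using sqlite3 as a stand-in (MySQL's `CREATE INDEX` syntax differs slightly, e.g. no `IF NOT EXISTS`), with assumed column names:

```python
import sqlite3  # stand-in for the MySQL database

# Indexes on grouping/joining columns let the engine skip full table scans.
INDEXES = [
    "CREATE INDEX IF NOT EXISTS idx_files_project ON files (project)",
    "CREATE INDEX IF NOT EXISTS idx_files_hash ON files (hash)",
]

def add_indexes(conn):
    for stmt in INDEXES:
        conn.execute(stmt)
    conn.commit()
```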

Tasks

  • Create visualizations
