Kaggle metadata using Datasette

  • Understanding metadata about Competitions, Datasets, Kernels, and Users.
  • 2024/03/09
  • Kaggle Meta Datasette sqlite-utils Data analysis CloudRun

About Kaggle meta

Kaggle's public data on competitions, users, submission scores, and kernels.

Per https://www.kaggle.com/datasets/kaggle/meta-kaggle:

Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity. Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

I am using the Datasette and sqlite-utils tools to do some analysis of this data.


Datasette

Per https://datasette.io/, Datasette is a tool for exploring and publishing data. It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API.

Installation
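
Datasette can be installed with pip:

pip install datasette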


sqlite-utils

CLI tool and Python library for manipulating SQLite databases

  • This library and command-line utility helps create SQLite databases from an existing collection of data.

Installation
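
sqlite-utils can be installed the same way:

pip install sqlite-utils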


Loading CSV Datasets into SQLite

I crafted some PySpark scripts for analyzing datasets and extracting key attributes.

My objective was to efficiently explore the Notebooks, Competitions, and Datasets data. Although some Kaggle meta files contain User IDs, resolving an ID to the actual user requires joining against the Users dataset. The Spark notebooks perform the following tasks:

  • Merge Users data with Competitions, Achievements, Datasets, and Kernels.
  • Generate clickable URLs for direct access to the corresponding Kaggle.com pages.

GitHub Repo - https://github.com/ryandam9/data-analysis-with-spark/tree/master/meta-kaggle

Then, using sqlite-utils, I created a database named kaggle_metadata.db and loaded the CSV files into it:

#!/bin/bash

# Location where original & formatted files are present.
base_path="/home/rk/Desktop/data/kaggle-meta"

# Raw Meta Kaggle files.
sqlite-utils insert kaggle_metadata.db tags "${base_path}/Tags.csv" --csv
sqlite-utils insert kaggle_metadata.db competition_tags "${base_path}/CompetitionTags.csv" --csv

# Formatted files produced by the Spark notebooks.
sqlite-utils insert kaggle_metadata.db users "${base_path}/formatted/Users_fmt.csv" --csv
sqlite-utils insert kaggle_metadata.db competitions "${base_path}/formatted/Competitions_fmt.csv" --csv
sqlite-utils insert kaggle_metadata.db kernels "${base_path}/formatted/Kernels_fmt.csv" --csv
sqlite-utils insert kaggle_metadata.db datasets "${base_path}/formatted/DatasetVersions_fmt.csv" --csv

exit 0
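
Once the tables are loaded, the same user enrichment can also be expressed directly in SQL. Below is a minimal sketch, assuming the raw Meta Kaggle column names (Id and UserName on users, AuthorUserId on kernels); the formatted tables produced by the Spark notebooks may use different names:

# Join kernels to their authors and build a clickable profile URL.
# Column names are assumptions based on the raw Meta Kaggle CSVs;
# adjust them if the formatted tables differ.
sqlite-utils kaggle_metadata.db "
SELECT k.Id AS kernel_id,
       u.UserName AS author,
       'https://www.kaggle.com/' || u.UserName AS author_url
FROM kernels k
JOIN users u ON u.Id = k.AuthorUserId
LIMIT 10" --table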

Viewing the data locally

cd /home/rk/Desktop/analysis-using-datasette/kaggle_metadata
datasette kaggle_metadata.db --setting sql_time_limit_ms 3500
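
By default, Datasette serves the database at http://127.0.0.1:8001, where each table can be browsed, filtered, and queried with SQL through the web UI or the accompanying JSON API.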

Publishing to Cloud Run

cd /home/rk/Desktop/analysis-using-datasette/kaggle_metadata
datasette publish cloudrun kaggle_metadata.db --service=kaggle-metadata-service
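
This assumes the Google Cloud CLI (gcloud) is installed and authenticated; datasette publish cloudrun packages the database into a container, builds it with gcloud builds submit, and deploys it as a Cloud Run service.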

I encountered the following errors while creating the Cloud Run service.

Error #1

Command 'gcloud builds submit --tag gcr.io/datasette-xxx-service' returned non-zero exit status 1.
  • Fix: enabled GCR (Container Registry) on this project by logging in to the GCP console.
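
The required APIs can also be enabled from the command line; a sketch, assuming the default gcloud project is already set:

gcloud services enable cloudbuild.googleapis.com
gcloud services enable containerregistry.googleapis.com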

Error #2

Memory limit of 512 MiB exceeded with 556 MiB used. Consider increasing the memory limit, see https://cloud.google.com/run/docs/configuring/memory-limits
  • Fix: raised the container memory allocation to 1024 MiB by editing the service's YAML configuration in the Cloud Run console.
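
Equivalently, the limit can be raised from the CLI, using the service name from the publish command above (a --region flag may also be needed):

gcloud run services update kaggle-metadata-service --memory 1Gi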