Kaggle's public data on competitions, users, submission scores, and kernels.
Per https://www.kaggle.com/datasets/kaggle/meta-kaggle:
Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity. Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
I am using Datasette and sqlite-utils to do some analysis of this data.
Per https://datasette.io/, Datasette is a tool for exploring and publishing data. It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API.
sqlite-utils is a CLI tool and Python library for manipulating SQLite databases.
I wrote some PySpark scripts to analyze the Meta Kaggle files and extract their key attributes.
My objective was to pull together the Notebook, Competition, and Dataset data efficiently. Although several Meta Kaggle files contain user IDs, identifying the actual user requires joining against the Users dataset. The Spark notebooks perform these joins and produce the formatted (*_fmt.csv) files that are loaded into SQLite below.
GitHub Repo - https://github.com/ryandam9/data-analysis-with-spark/tree/master/meta-kaggle
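The actual scripts live in the repo above; the following is only a rough sketch of the kind of join they perform. Column names such as AuthorUserId, UserName, and CurrentUrlSlug are assumptions based on the Meta Kaggle CSV headers, and the output path mirrors the formatted/ directory used later.

```python
# Minimal sketch (not the repo's actual code): resolve each kernel's author ID
# to a readable user name by joining Kernels against Users.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("meta-kaggle-prep").getOrCreate()

base = "/home/rk/Desktop/data/kaggle-meta"

# Raw Meta Kaggle CSVs (column names are assumed from the CSV headers).
users = spark.read.csv(f"{base}/Users.csv", header=True, inferSchema=True)
kernels = spark.read.csv(f"{base}/Kernels.csv", header=True, inferSchema=True)

# Left join so kernels without a matching user are kept.
kernels_fmt = (
    kernels.join(users, kernels["AuthorUserId"] == users["Id"], "left")
           .select(
               kernels["Id"].alias("KernelId"),
               kernels["CurrentUrlSlug"].alias("KernelSlug"),
               users["UserName"].alias("AuthorUserName"),
               users["DisplayName"].alias("AuthorDisplayName"),
           )
)

# Spark writes a directory of part files; coalesce(1) keeps it to a single part.
kernels_fmt.coalesce(1).write.csv(
    f"{base}/formatted/Kernels_fmt", header=True, mode="overwrite"
)
```

Competitions and DatasetVersions can be handled the same way: join the file's user ID column against Users, then write out a formatted CSV for sqlite-utils to pick up.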
Then, using sqlite-utils, I created a database, kaggle_metadata.db:
#!/bin/bash
# Location where original & formatted files are present.
base_path="/home/rk/Desktop/data/kaggle-meta"
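# Each command below loads one CSV into its own table; --csv tells sqlite-utils
# to parse the input as CSV, and the table is created automatically from the header row.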
sqlite-utils insert kaggle_metadata.db tags "${base_path}/Tags.csv" --csv
sqlite-utils insert kaggle_metadata.db competition_tags "${base_path}/CompetitionTags.csv" --csv
sqlite-utils insert kaggle_metadata.db users "${base_path}/formatted/Users_fmt.csv" --csv
sqlite-utils insert kaggle_metadata.db competitions "${base_path}/formatted/Competitions_fmt.csv" --csv
sqlite-utils insert kaggle_metadata.db kernels "${base_path}/formatted/Kernels_fmt.csv" --csv
sqlite-utils insert kaggle_metadata.db datasets "${base_path}/formatted/DatasetVersions_fmt.csv" --csv
exit 0
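As a quick sanity check on the load (not part of the original workflow), the same database can be inspected with the sqlite-utils Python API:

```python
# Inspect the freshly built kaggle_metadata.db with the sqlite-utils Python API.
import sqlite_utils

db = sqlite_utils.Database("kaggle_metadata.db")

# Row count per table created by the load script above.
for name in db.table_names():
    print(name, db[name].count)

# Ad-hoc SQL: peek at a few user records.
for row in db.query("select * from users limit 5"):
    print(row)
```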
To explore the database locally, I ran Datasette against it, raising sql_time_limit_ms so longer queries are not cut off at the default one-second limit:
cd /home/rk/Desktop/analysis-using-datasette/kaggle_metadata
datasette kaggle_metadata.db --setting sql_time_limit_ms 3500
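Everything the Datasette UI shows is also exposed as JSON. A small illustration, assuming the default port 8001 and the users table created above (the exact response shape can vary between Datasette versions):

```python
# Fetch a few rows from the local Datasette JSON API.
import json
from urllib.request import urlopen

# _shape=objects returns rows as dicts keyed by column name; _size caps the row count.
url = "http://127.0.0.1:8001/kaggle_metadata/users.json?_shape=objects&_size=5"

with urlopen(url) as response:
    data = json.load(response)

for row in data["rows"]:
    print(row)
```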
To publish it online, I used datasette publish to deploy the database to Google Cloud Run:
cd /home/rk/Desktop/analysis-using-datasette/kaggle_metadata
datasette publish cloudrun kaggle_metadata.db --service=kaggle-metadata-service
I encountered the following errors while creating the Cloud Run service.
Command 'gcloud builds submit --tag gcr.io/datasette-xxx-service' returned non-zero exit status 1.
This was resolved by enabling GCR (Google Container Registry) on the project after logging in to the GCP console.
The second error was about the service's memory limit:
Memory limit of 512 MiB exceeded with 556 MiB used. Consider increasing the memory limit, see https://cloud.google.com/run/docs/configuring/memory-limits
As the message suggests, the fix is to give the Cloud Run service a memory limit larger than 512 MiB.