Thursday, February 29, 2024
[Definitive edition] 100 open datasets: let's make use of huge amounts of data!

With the rise of AI, the importance of data is being recognized again, and a wide variety of data is now openly released and easy to use.

Open data is essential for machine learning, which learns from huge amounts of data, and you need to select and use the open data that suits the AI you want to build.

On the other hand, open data alone cannot produce an AI with a competitive advantage. It is important to build a unique AI by combining macro-level open data with micro-level data you have collected yourself.

When building a service on open data, the keys are improving the service's UX (user experience) and finding ways to obtain unique data, so don't rely too heavily on open data.

In this article, we classify 100 open datasets into six categories. Look for data that you can use to build your own service or AI.

Table of Contents 

  • Top 100 open data datasets
    • Dataset Catalog/Dataset Summary
    • image
    • movie
    • audio
    • text
    • economy and finance
  • in conclusion

Top 100 open data datasets

Dataset Catalog/Dataset Summary

  • DATA GO JP

A data catalog site run by the Japanese government that lists public data available for secondary use and supports cross-searching of it.

  • National Informatics Research Data Repository

A dataset-sharing project operated by the Center for Dataset Sharing and Collaborative Research (DSC) of the National Institute of Informatics (NII). It provides researchers with datasets supplied by private companies and university researchers.

  • Link Data

A site that supports conversion and publication of various table data.

  • Kaggle

Kaggle is a platform for predictive modeling and analytics competitions, with a wide variety of datasets available for download.

  • Harvard Dataverse

A data repository run by Harvard University. Nearly 500 datasets that can be used for machine learning have been released.

  • UC Irvine Machine Learning Repository

A dataset repository run by the University of California, Irvine. About 400 datasets have been published.

  • Data.gov (US)

Open data from various US government agencies. You can download data across 14 topics.

  • e-stat

This is a government statistics portal site where you can view Japanese statistics.

  • r/datasets

A forum on reddit.com for sharing datasets.

  • Rakuten dataset

Datasets published by Rakuten Institute of Technology. Rakuten product reviews and annotated text images are available.

  • Google Dataset Search

Google’s dataset search service. Officially released in 2020.

  • facebook research Dataset

This dataset is published by Facebook Research.

  • Registry of Open Data on AWS

Public datasets hosted on AWS, including datasets that can be used to train image classification and natural language processing models. The data can also be used directly with AWS services.

  • Microsoft Research Open Data

You can search and download Microsoft's open data. The data can also be used with Azure.

  • arXivTimes datasets

A summary, organized by category, of datasets that can be used for machine learning.

  • awesome-public-datasets

Another summary of datasets that can be used for machine learning.

  • Core Data

A curated collection of high-quality, easy-to-use open data.

  • ARIZONA STATE UNIVERSITY

A collection of datasets compiled by Arizona State University.

  • Network Repository

A compilation of datasets that can be used for network analysis and related tasks.

  • Yahoo Labs Webscope

This dataset is published by Yahoo! Labs.

image

  • Label Me

Annotated dataset for semantic segmentation.

  • MegaFace

A large face dataset with distractor (noise) images mixed in, used in the open face recognition algorithm competition run by the University of Washington.

  • MNIST

A dataset of handwritten digit images, often said to be the first dataset that machine learning beginners use.

  • CIFAR-10 / CIFAR-100

CIFAR-10 consists of 60000 32×32 color images in 10 classes, with 6000 images per class, 50000 training images and 10000 test images. CIFAR-100 has 100 classes of 600 images each, with 500 training and 100 test images per class.
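
The figures in this description can be cross-checked with a little arithmetic. The sketch below only uses the numbers quoted above; the dictionary keys and function names are made up for illustration and are not part of any CIFAR loader.

```python
# Figures quoted in the CIFAR-10 / CIFAR-100 description above.
# The key names are illustrative, not an official schema.
CIFAR10 = {"classes": 10, "images_per_class": 6000, "train": 50000, "test": 10000}
CIFAR100 = {"classes": 100, "train_per_class": 500, "test_per_class": 100}


def cifar10_total(spec):
    """Total image count implied by classes x images per class."""
    return spec["classes"] * spec["images_per_class"]


def cifar100_totals(spec):
    """(train, test) image counts implied by the per-class split."""
    train = spec["classes"] * spec["train_per_class"]
    test = spec["classes"] * spec["test_per_class"]
    return train, test


if __name__ == "__main__":
    print(cifar10_total(CIFAR10))     # 60000, matching 50000 + 10000
    print(cifar100_totals(CIFAR100))  # (50000, 10000)
```

Both datasets therefore come to 60,000 images with the same 50,000 / 10,000 train/test split; CIFAR-100 simply spreads them over ten times as many classes.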

  • Pascal VOC Dataset

Provides standardized image datasets for object class recognition, along with a common set of tools for accessing the datasets and annotations.

  • Fashion-MNIST

A dataset of fashion images labeled with 10 categories such as T-shirts, pants, dresses, and sneakers.

  • Deep Fashion

A fashion image dataset consisting of over 800,000 images in 50 categories.

  • Food 101

A dataset containing 101,000 food images labeled with 101 categories.

  • Google Open Image V4

A Google-published dataset of about 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships.

  • ImageNet

A dataset of more than 14 million images. Searching for a keyword brings up the classes that match it, which makes it easy to find the data you need.

  • CoPhIR

An image dataset of more than 100 million images from the photo-sharing site Flickr.

  • Flickr Logos dataset

Annotated city image data and logo image data published by the National Technical University of Athens.

  • Tiny Images Dataset

The TinyImages dataset consists of 79,302,017 images, each of which is a 32×32 color image.

  • SUN dataset

The Scene UNderstanding (SUN) database, with 899 categories and 130,519 images for scene recognition and classification.

  • COCO – Common Objects in Context

COCO is a large-scale object detection, segmentation, and captioning dataset.

  • Daimler Urban Segmentation Dataset

A dataset recorded in urban traffic. It consists of 5000 rectified stereo image pairs with a resolution of 1024 × 440. 500 frames (every 10th frame of the sequence) contain pixel-level semantic class annotations for 5 classes (ground, building, vehicle, pedestrian, sky).
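
As a quick check, annotating every 10th frame of 5000 does give the 500 annotated frames quoted above. This is a minimal sketch of that sampling; the starting offset of 0 is an assumption made for illustration, since the description does not say which frame the annotation starts from.

```python
TOTAL_FRAMES = 5000      # stereo image pairs, from the description above
ANNOTATION_STRIDE = 10   # every 10th frame is annotated


def annotated_frame_indices(total=TOTAL_FRAMES, stride=ANNOTATION_STRIDE, offset=0):
    """Return the indices of annotated frames.

    `offset` is a hypothetical parameter: the first annotated frame
    is not stated in the description, so 0 is only a placeholder.
    """
    return list(range(offset, total, stride))


if __name__ == "__main__":
    indices = annotated_frame_indices()
    print(len(indices))   # 500
    print(indices[:3])    # [0, 10, 20]
```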

  • DAGM 2007

A dataset created for a competition at a symposium in Germany. It is an image dataset for detecting defects such as scratches on the surface of industrial products.

  • Natural Adversarial Examples

A dataset deliberately assembled to make machine learning models produce wrong predictions.

  • rois-codh/kmnist

Unlike MNIST's handwritten digits, this is a dataset of kuzushiji (cursive Japanese script) characters, including hiragana and kanji.

  • The Oxford-IIIT Pet Dataset

A 37-category pet image dataset containing about 200 images per class.

  • Stanford Drone Dataset

An image dataset captured by drone, published by Stanford University.

  • CelebA Dataset

A large face attribute dataset containing over 200,000 celebrity images with 40 attribute annotations.

  • Face Forensics

A dataset for detecting fake images of people generated by Deepfakes, Face2Face, etc.

  • Indoor Scene Recognition

An indoor image dataset for indoor scene recognition models. It contains 67 indoor categories and a total of 15,620 images.

movie

  • YouTube-8M Dataset

A dataset of 8 million YouTube videos tagged with 4800 Knowledge Graph entities, published by the Google research team. It also includes human-verified labels on about 237K segments in 1000 classes.

  • YouTube-Bounding Boxes Dataset

A large dataset of videos labeled with bounding boxes. It consists of approximately 380,000 video segments of 15-20 seconds each, extracted from 240,000 different publicly available YouTube videos and automatically selected to feature objects in natural settings without editing or post-processing.

  • Kinetics

A video dataset published by DeepMind, in which approximately 650,000 videos are labeled with human actions, including human-object interactions such as playing musical instruments and human-human interactions such as shaking hands.

  • UCF101-Action Recognition Data Set

Provided by the University of Central Florida, UCF101 is an action recognition video dataset with 101 action categories collected from YouTube. It is an extension of the UCF50 dataset, which has 50 action categories. The videos in each of the 101 action categories are divided into 25 groups, and each group consists of 4-7 videos of the action.
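
The grouping figures above imply a range for the total number of clips. The sketch below derives that range from the quoted numbers only (101 categories, 25 groups per category, 4-7 videos per group); the function name is made up for illustration, and the result is a bound, not the dataset's actual size.

```python
def clip_count_bounds(categories=101, groups=25, min_clips=4, max_clips=7):
    """Lower and upper bounds on clip counts implied by the grouping.

    Returns ((min_per_category, max_per_category), (min_total, max_total)).
    """
    per_category = (groups * min_clips, groups * max_clips)
    total = (categories * per_category[0], categories * per_category[1])
    return per_category, total


if __name__ == "__main__":
    per_category, total = clip_count_bounds()
    print(per_category)  # (100, 175)
    print(total)         # (10100, 17675)
```

So each action category holds between 100 and 175 clips, and the whole dataset between 10,100 and 17,675.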

  • 20BN-JESTER DATASET V1

A video dataset with hand gesture labels published by TwentyBN. About 150,000 videos are labeled with 27 hand gestures.

  • Moments in Time Dataset

A joint research project by MIT and IBM: a video dataset in which action labels are attached to 3-second videos.

  • EPIC KITCHENS

A video dataset in which kitchen tasks are labeled with actions. Published by research teams at the University of Bristol, the University of Toronto, and the University of Catania.

  • STAIR Actions

A video caption dataset. There are 399,233 Japanese captions that describe the content of 79,822 videos.

  • BDD100K: A Large-scale Diverse Driving Video Database

This is a driving video dataset published by the AI Lab (BAIR) at the University of California, Berkeley. A 10-second video is labeled with road object bounding boxes, drivable areas, and lane markings.

  • Atomic Visual Actions (AVA)

A dataset published by Google for recognizing human actions. 80 types of labels, such as walking and flying actions, are attached to 57,000 videos.

audio

  • The NES Music Database

A dataset for building automatic music composition systems, published by a postdoctoral fellow at Stanford University. It contains a total of 5,278 songs from 397 game titles.

  • The Largest MIDI Collection on the Internet

A large dataset of MIDI data published on reddit.

  • NSynth Dataset

A data set containing about 300,000 single notes from 1,006 instruments, published by the open source research project Magenta.

  • AudioSet

A dataset released by Google of 10-second sound clips labeled with categories such as human voices, animal sounds, and musical instruments.

  • ToyADMOS-dataset

A machine operating sound dataset: approximately 540 hours of normal machine operating sounds and over 12,000 samples of anomalous sounds, collected with four microphones at a sampling rate of 48 kHz.

  • The Flickr Audio Caption Corpus

The Flickr 8k Audio Captions Corpus contains 40,000 audio captions on 8,000 natural images and is a dataset for unsupervised audio pattern discovery.

  • The Spoken Wikipedia Corpora

A corpus of aligned English, German, and Dutch Wikipedia article audio files published by the University of Hamburg, Germany. Contains hundreds of hours of aligned audio. Annotations can be mapped to the original HTML.

  • Speech Commands Dataset

A voice dataset for speech recognition with TensorFlow, published by Google. It contains 65,000 one-second-long utterances of 30 short words.
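
To get a feel for the scale, the quoted figures can be converted into total audio duration and an average number of clips per word. This is a rough sketch that uses only the numbers above; it assumes every clip is exactly one second long and that clips are spread evenly over the words, neither of which the description guarantees.

```python
def total_hours(clips=65_000, seconds_per_clip=1.0):
    """Total audio duration in hours, assuming uniform clip length."""
    return clips * seconds_per_clip / 3600


def mean_clips_per_word(clips=65_000, words=30):
    """Average number of clips per word, assuming an even spread."""
    return clips / words


if __name__ == "__main__":
    print(f"{total_hours():.1f} hours")             # about 18.1 hours
    print(f"{mean_clips_per_word():.0f} per word")  # about 2167 per word
```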

  • MUSDB18

A dataset for music source separation. It provides 150 music tracks of various genres, together with separated tracks such as drums, bass, and vocals.

  • Mozilla Common Voice

Approximately 1,400 hours of voice data in 18 languages from 42,000 contributors has been released by "Common Voice", the voice dataset collection project developed by Mozilla.

  • Voice Actor Statistics Society of Japan

A speech corpus published by the Voice Actor Statistics Society of Japan. It consists of originally constructed phoneme-balanced sentences and recordings of them read aloud in three patterns by three professional female voice actors.

  • Speech Resource Consortium

A page that lists a wide range of speech corpora.

text

  • Resources for Natural Language Processing

Information on tools and datasets for natural language processing, published by the Kurohashi, Kawahara, and Murawaki Laboratory at Kyoto University.

  • Aozora Bunko

Publishes the text of works whose copyright has expired and works released with the author's permission.

  • Aozora Bunko Morphological Analysis Data Collection

Provides CSV data produced by running morphological analysis on the works in Aozora Bunko.

  • Japanese translation data

Bilingual corpora, bilingual dictionaries, and other resources that can be used to build a machine translation system.

  • SNOW T15: Easy Japanese Corpus

A parallel corpus of 50,000 sentences rewritten in easy Japanese (using a plain Japanese vocabulary). The corpus also includes English translations, so each sentence is aligned across English, Japanese, and easy Japanese.

  • Twitter Japanese Sentiment Analysis Dataset

Publishes the results of crowdsourced sentiment annotation of tweets.

  • SNOW D18 Japanese Emotion Dictionary

A dictionary of Japanese emotional expressions. About 2,000 expressions are labeled with 48 emotion categories, such as fun, familiarity, respect/preciousness, excitement, joy, and sadness.

  • livedoor news corpus

A corpus of livedoor news articles that are published under a Creative Commons license.

  • Cookpad dataset

Data on 1.72 million recipes and menus posted on Cookpad.

  • DBpedia English

A community project that extracts information from Wikipedia and publishes it as LOD (Linked Open Data).

  • Web data: Amazon reviews

A dataset of about 35 million Amazon reviews.

  • niconico dataset

Metadata of approximately 16.7 million videos posted from the start of Nico Nico Douga to November 8, 2018, along with approximately 3.8 billion comment data.

  • Wikipedia Links data

The full text of Wikipedia is published as a dataset.

  • Common Crawl

Crawl data for over 5 billion web pages.

  • Web Data Commons

Structured data extracted from the Common Crawl.

  • babi

A set of datasets from Facebook AI Research covering tasks such as question answering, dialogue, and language modeling.

  • PAWS

A paraphrase identification dataset built around sentence pairs in which slight differences in word order and syntactic structure change the meaning.

  • Tokyo Metropolitan University Natural Language Processing Laboratory (Komachi Lab)

Corpora, dictionaries, and evaluation datasets provided by the Tokyo Metropolitan University Natural Language Processing Laboratory (Komachi Lab).

economy and finance

  • Quandl

A wide variety of financial and economic data sets can be retrieved. There are many articles on data acquisition in Python.

  • Bitcoin Historical Data

Bitcoin data at 1-minute intervals from January 2012 to August 2019 published on Kaggle.
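
One-minute data is often too fine-grained to use directly, so a common first step is to roll it up into daily open/high/low/close bars. The sketch below shows the idea with the standard library only. The `(unix_timestamp, price)` row layout and the sample prices are assumptions made for illustration; check the actual column layout of the Kaggle file before adapting it.

```python
from datetime import datetime, timezone

# Made-up sample rows: (Unix timestamp in seconds, price).
# These are NOT real quotes; the real file's columns may differ.
SAMPLE_TICKS = [
    (1325376000, 4.39),  # 2012-01-01 00:00 UTC
    (1325376060, 4.45),
    (1325376120, 4.30),
    (1325462400, 4.58),  # 2012-01-02 00:00 UTC
    (1325462460, 5.00),
]


def daily_ohlc(ticks):
    """Aggregate (timestamp, price) rows into per-day OHLC bars (UTC days)."""
    days = {}
    for ts, price in sorted(ticks):  # process rows in time order
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        if day not in days:
            days[day] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar = days[day]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # the latest row seen so far closes the day
    return days


if __name__ == "__main__":
    for day, bar in daily_ohlc(SAMPLE_TICKS).items():
        print(day, bar)
```

Grouping by UTC calendar day is a design choice here; a different time zone or session boundary would shift which minutes fall into which bar.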

  • World Bank Open Data

You can easily search and download World Bank data.

  • US macroeconomic data

Data on employment, economic output, and other US macroeconomic variables.

  • US Stock Data

US stock market data since 2009.

  • finance-vix

A CBOE Volatility Index (VIX) time series dataset containing the index's daily open, close, high, and low values. The VIX measures expected near-term volatility implied by S&P 500 index option prices.
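
A daily open/high/low/close file like this can be read with Python's standard `csv` module. The sketch below parses such a file and computes each day's high-low range. The column names (`Date`, `Open`, `High`, `Low`, `Close`) and the sample values are assumptions made for illustration, not the dataset's confirmed schema or real index values.

```python
import csv
import io

# Made-up sample in an assumed column layout; not real VIX values.
SAMPLE_CSV = """Date,Open,High,Low,Close
2020-01-02,13.5,13.7,12.4,12.5
2020-01-03,15.0,16.2,13.1,14.0
2020-01-06,15.5,16.4,13.5,13.9
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with float OHLC fields."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "date": row["Date"],
            "open": float(row["Open"]),
            "high": float(row["High"]),
            "low": float(row["Low"]),
            "close": float(row["Close"]),
        }
        for row in reader
    ]


def daily_ranges(rows):
    """Return (date, high - low) pairs, rounded to 2 decimal places."""
    return [(r["date"], round(r["high"] - r["low"], 2)) for r in rows]


if __name__ == "__main__":
    for date, spread in daily_ranges(load_rows(SAMPLE_CSV)):
        print(date, spread)
```

To use a downloaded file instead of the inline sample, read its contents into a string and pass that to `load_rows`, adjusting the column names if they differ.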

  • Dow Jones Index Data Set

Published by the University of California, Irvine. It provides weekly stock price data for the Dow Jones Industrial Average.

  • Econ Data

Economic time series data published by the University of Maryland. It publishes thousands of economic time series produced by numerous US government agencies and distributed in a variety of formats and media.

  • Ministry of Finance Government Bond Interest Rate Information

Japan's government bond interest rate information (since 1974) is publicly available.

  • Asset Macro

It publishes over 20,000 macroeconomic indicators for over 120 countries.

  • Eurostat Comext

Eurostat compiles statistics on the EU, and its Comext database provides trade data.

  • Nikkei Average Profile

Published by Nihon Keizai Shimbun. You can obtain data for the Nikkei Stock Average, the Nikkei Asian Index, the JPX Nikkei Index 400, and others.

  • IMF DATA

It publishes a series of time-series data on IMF lending, exchange rates, and other economic and financial indicators.

  • Google Finance

Provides securities information such as stock prices and exchange rates. The data can also be retrieved with the GOOGLEFINANCE function in Google Sheets.

  • Financial Data Finder

You can get various financial data such as stocks, exchange rates, bond assets, etc. Published by Ohio State University.

  • EDINET

An electronic disclosure system for disclosure documents under the Financial Instruments and Exchange Act. It links the host computer used by the Cabinet Office, the computers used by submitting companies, and the computers of the financial instruments exchanges (and the Financial Instruments Firms Association), and publishes securities reports, securities registration statements, large-volume holding reports, and other disclosure documents.

In conclusion

New data will continue to be released, so it is worth building the knowledge and technical skills needed to make use of it.

Especially for those who are learning machine learning technology, it would be a good idea to actually build a model using open data.
