BOORU CHARS OPEN DATASET is an attempt to consolidate and arrange the available character-centric,
almost-SFW anime/CG/game art in a localized form suited both to batch processing and to visual assessment.
This release of BOORU CHARS consists of:
- 1.593.429 sample images, mogrified to 1280px (1024px for 1x1)
  - grouped into 18 volumes (directories) by aspect ratio and year
  - zipped in batches of 1000 images, grouped by statistical similarity
  - with verbose file naming: %website% - %id% - %copyright% ~ %characters% (%artist%)
  - with some tags placed into EXIF
- several tab-separated text files with metadata
  - post/image info (for samples, originals and from the imageboard): 1.593.429 rows
  - collected tag info with some add-ons: 35.222.997 rows
  - listing for 32 torrents in total, with 3.839.005 pictures (almost 5 TB, the basis of the dataset)
- an example of dataset usage for body object detection and character “assembling”
  - results of the notAI-tech NudeNet detector for 2 volumes, plus composed output listings
  - ~4000 of the most interesting visualisations (samples with detections drawn)
  - some verbose descriptions
- readme_RU/EN with code examples and a lot of references
- several zipped Excel files with analysis results and SQL queries
- some illustrative screenshots
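
The verbose file naming listed above can be split back into its fields. The following is only a minimal sketch: it assumes the separators are exactly " - ", " ~ " and a trailing "(artist)" group, that %id% is numeric, and that a file extension may follow. The function name `parse_name` and the example file name are hypothetical and not taken from the dataset.

```python
import re
from typing import Optional

# Assumed layout (not verified against the real files):
#   "<website> - <id> - <copyright> ~ <characters> (<artist>).<ext>"
NAME_RE = re.compile(
    r"^(?P<website>\S+) - (?P<id>\d+) - (?P<copyright>.*?) ~ "
    r"(?P<characters>.*) \((?P<artist>[^()]*)\)(?:\.\w+)?$"
)


def parse_name(filename: str) -> Optional[dict]:
    """Split a sample file name into its fields, or return None if it does not match."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    fields = m.groupdict()
    # %website% + %id% is the unique image key described in the feature list.
    fields["key"] = (fields["website"], fields["id"])
    return fields


# Hypothetical file name, used only to illustrate the pattern.
example = "examplebooru - 123456 - some series ~ some character (some artist).jpg"
info = parse_name(example)
```

File names that do not follow the assumed pattern yield `None`, so they can be collected and inspected separately instead of being parsed incorrectly.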
The main features of the dataset are:
- several sources, but unique image identification by %website% + %id%
- original images can be found in torrents (nyaa, rutracker)
- selective re-grab of originals is possible if the source website is still available
- careful deduplication with relative website priorities, high to low (mostly)
- segmentation by chronology (estimated year of release) and by aspect ratio
  - “artbook pages” 7x10 (+/- 4%)
  - “wide pages” 3x4 (+/- 10%)
  - “squares” 1x1 (+/- 20%)
  - “wallpapers and computer screens” 3x2 (+/- 40%)
  - “tall pages” 2x3 (+/- 40%); the folder name contains 1x2
- rather high technical and visual quality of the original images
  - width>=900, height>=900, MPixels>=1.2
  - most comics, line art and text-heavy images excluded; no photos; almost no characterless scenes
- not completely SFW (a little bit of softcore ecchi here and there)
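
The aspect-ratio segmentation and the size thresholds above can be expressed as a short sketch. It rests on several assumptions that the description does not state: AxB is read as width x height, the tolerance is relative to the target ratio, overlapping ranges are resolved by taking the first match in the listed order, and MPixels means width * height / 1e6. The names `BUCKETS`, `aspect_bucket` and `passes_size_filter` are illustrative only.

```python
from typing import Optional

# (label, target width/height ratio, relative tolerance), in the order listed above.
# Some ranges overlap (e.g. 7x10 and 3x4), so the first match wins; this is an assumption.
BUCKETS = [
    ("artbook pages", 7 / 10, 0.04),
    ("wide pages", 3 / 4, 0.10),
    ("squares", 1 / 1, 0.20),
    ("wallpapers and computer screens", 3 / 2, 0.40),
    ("tall pages", 2 / 3, 0.40),
]


def passes_size_filter(width: int, height: int) -> bool:
    """Thresholds from the description: width>=900, height>=900, MPixels>=1.2."""
    return width >= 900 and height >= 900 and width * height / 1e6 >= 1.2


def aspect_bucket(width: int, height: int) -> Optional[str]:
    """Return the first bucket whose tolerance band contains width/height, else None."""
    ratio = width / height
    for label, target, tol in BUCKETS:
        if abs(ratio - target) <= target * tol:
            return label
    return None
```

For example, under these assumptions a 1920x1080 image falls into “wallpapers and computer screens”, while a 1000x1000 image lands in “squares” but fails the size filter because it has only 1.0 MPixels.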
The earlier version of this dataset (2019, 512px) should be treated as obsolete.
I hope that this half-terabyte of data is worth more than a chia coin mining pool of the same size.
NOTE-1 several standalone NSFW datasets are available at Sukebei, also with sample images, metadata and some analysis done.
NOTE-2 the YOLO neural network architecture seems to be very good for art. I already have promising results, stay tuned.
NOTE-3 there is a similar BOORU CHARS 2015 dataset for "early art"
NOTE-4 the next BOORU CHARS 2022 release will cover volumes 2021 b-c-d and 2022 a-b