BOORU CHARS OPEN DATASET is an attempt to consolidate and arrange available character-centric,
almost-SFW anime/CG/game art in a local form suited to both batch processing and visual evaluation.
**This release of BOORU CHARS consists of:**
- **1.593.429 sample images**, mogrified to 1280px (1024px for 1x1)
* grouped into 18 volumes (directories) by aspect ratio and year
* zipped in batches of 1000 images, grouped by statistical similarity
* with verbose file naming **%website% - %id% - %copyright% ~ %characters% (%artist%)**
* with some tags placed into EXIF
- several tab-separated text files with metadata
* post/image info (for samples, for originals, and from the imageboard) - 1.593.429 rows
* collected tag info with some additions - 35.222.997 rows
* listings for 32 torrents with 3.839.005 pictures in total (almost 5 TB, the basis of the dataset)
- an example of dataset usage for body object detection and character "assembling"
* detector [notAI-tech NudeNet](https://github.com/notAI-tech/NudeNet) results for 2 volumes and composed output listings
* ~4000 most interesting visualisations (samples with detections drawn)
- some verbose descriptions
* **readme_RU/EN** with code examples and a lot of references
* several zipped Excel files with analytic results and SQL queries
* some illustrative screenshots
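
The verbose file naming pattern listed above lends itself to simple parsing. Below is a minimal Python sketch; the regular expression, the numeric id, the `.jpg` extension, the helper name `parse_sample_name`, and the example file name are all assumptions for illustration, not part of the release.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed layout: "%website% - %id% - %copyright% ~ %characters% (%artist%)"
# The id is assumed to be numeric and the website name to contain no spaces.
NAME_RE = re.compile(
    r"^(?P<website>\S+) - (?P<id>\d+) - (?P<copyright>.*?) ~ "
    r"(?P<characters>.*) \((?P<artist>[^()]*)\)$"
)

def parse_sample_name(filename: str) -> Optional[dict]:
    """Split a sample file name into its fields, or return None if it does not match."""
    stem = Path(filename).stem  # drop the file extension
    match = NAME_RE.match(stem)
    if match is None:
        return None
    fields = match.groupdict()
    fields["id"] = int(fields["id"])
    return fields

# Hypothetical file name, made up for this example.
example = "examplebooru - 1234567 - some_series ~ some_character (some_artist).jpg"
print(parse_sample_name(example))
```

Names that do not follow the pattern return `None`, so unexpected files can be collected and inspected separately.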
You can find more up-to-date information in the [GitHub directory](https://github.com/aperveyev/booru_processor/tree/master/BC_2021).
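
The tab-separated metadata files can be read with the Python standard library alone. The sketch below groups tag rows by the **%website% + %id%** key; the column names (`website`, `id`, `tag`, `width`, `height`), the presence of a header row, and the inline sample data are assumptions, so check them against the actual files first.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for two of the tab-separated files.
# Real column names and header layout may differ.
POSTS_TSV = (
    "website\tid\twidth\theight\n"
    "examplebooru\t101\t1280\t1829\n"
    "examplebooru\t102\t1024\t1024\n"
)
TAGS_TSV = (
    "website\tid\ttag\n"
    "examplebooru\t101\tlong_hair\n"
    "examplebooru\t101\tsmile\n"
    "examplebooru\t102\tchibi\n"
)

def read_tsv(handle):
    """Yield one dict per row, streaming so that large files fit in memory."""
    yield from csv.DictReader(handle, delimiter="\t")

def tags_by_image(tag_rows):
    """Group tags under the (website, id) key that identifies an image."""
    grouped = defaultdict(list)
    for row in tag_rows:
        grouped[(row["website"], int(row["id"]))].append(row["tag"])
    return dict(grouped)

posts = list(read_tsv(io.StringIO(POSTS_TSV)))
tags = tags_by_image(read_tsv(io.StringIO(TAGS_TSV)))
for post in posts:
    key = (post["website"], int(post["id"]))
    print(key, tags.get(key, []))
```

For a real file, replace `io.StringIO(...)` with `open(path, newline="", encoding="utf-8")`. With tens of millions of tag rows, a database or a dataframe library may be a better fit than an in-memory dict.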
**The main features of the dataset are:**
- several sources, but unique image identification by **%website% + %id%**
* original images can be found in torrents (nyaa, rutracker)
* selective re-download of originals is possible if the source website is available
- careful deduplication with relative website priorities, high to low (mostly)
- segmentation by chronology (estimated year of release) and by aspect ratio
* "artbook pages" **7x10 (+/- 4%)**
* "wide pages" **3x4 (+/- 10%)**
* "squares" **1x1 (+/- 20%)**
* "wallpapers and computer screens" **3x2 (+/- 40%)**
* "tall pages" **2x3 (+/- 40%)** (folder name contains 1x2)
- rather high technical and visual quality of the original images
* width >= 900, height >= 900, megapixels >= 1.2
* most comics, line art, and text-heavy images excluded; no photos; almost no characterless scenes
- not completely SFW (a little bit of softcore ecchi here and there)
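
The aspect ratio classes and the size thresholds above can be expressed as a short Python sketch. Several points are assumptions rather than the method actually used: the ratio is taken as width / height, the tolerance is treated as relative to the target ratio, overlapping ranges are resolved by taking the first match in the order listed, and a megapixel is counted as 10^6 pixels. The function names are hypothetical.

```python
from typing import Optional

# (label, width part, height part, relative tolerance), in the order listed above.
# The tolerance ranges overlap, so first-match order is an assumption.
BUCKETS = [
    ("7x10", 7, 10, 0.04),  # artbook pages
    ("3x4", 3, 4, 0.10),    # wide pages
    ("1x1", 1, 1, 0.20),    # squares
    ("3x2", 3, 2, 0.40),    # wallpapers and computer screens
    ("2x3", 2, 3, 0.40),    # tall pages
]

def passes_quality(width: int, height: int) -> bool:
    """Apply the stated thresholds: both sides >= 900 px and at least 1.2 megapixels."""
    return width >= 900 and height >= 900 and width * height >= 1.2e6

def aspect_bucket(width: int, height: int) -> Optional[str]:
    """Return the first bucket whose target ratio is within tolerance, else None."""
    ratio = width / height
    for label, w, h, tol in BUCKETS:
        target = w / h
        if abs(ratio - target) <= tol * target:
            return label
    return None

for size in [(1400, 2000), (1920, 1080), (1000, 1000), (900, 1500)]:
    print(size, aspect_bucket(*size), passes_quality(*size))
```

Under these assumptions a 1000x1000 image falls into the "squares" class but fails the quality filter, because it holds only 1.0 megapixels.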
The earlier version of this dataset [(2019, 512px)](https://nyaa.si/view/1206322) should be treated as obsolete.
I hope that this half-terabyte of data is worth more than Chia coin mining on the same amount of disk space.
NOTE: several standalone [not SFW datasets at Sukebei](https://sukebei.nyaa.si/user/AlexPUA) are also available, with sample images, metadata and some analysis done.
NOTE-2: the neural network architecture [YOLOv5](https://github.com/ultralytics/yolov5) seems to work very well for art. I already have [promising results](https://www.kaggle.com/printcraft/anime-and-cg-characters-detection-using-yolov5); stay tuned.