30 Real-World Data Sets AI Will Learn From
Data sources AI can (and will) train from
Random Observation/Comment #933: I’m always surprised at how much data I create.
//Generated with ChatGPT off of the conversation around these data capture points. Look at that little see lemons hidden gem.
Why this List?
AI is only as interesting as the data we feed it. I’m pretty sure “everything on the public internet” and “every video we’ve ever uploaded” has already been trained on, so companies will start thinking more about the localized data we create for everyday decision-making. Some of this will live in your IoT devices, or even in the activities you do on your apps for dynamic data.
If Clembot is going to be your OS, it needs to ingest the messy reality of your physical and digital existence. The shift is moving from Big Data (the whole web) to Deep Data (the specific you).
The Infrastructure & Ambient Layer
Public camera videos (Smart-city CCTV) - This is the typical “Spy Movie” layer. Tracking urban flow, density, and the literal heartbeat of a city in real-time.
Smart home sensors - Beyond “lights on/off.” My smart home may also track ambient temperature, humidity, and occupancy patterns that predict when you’re getting sick or stressed.
Connected car telemetry - Every hard brake, lane drift, and horn honk becomes a risk profile for dynamic insurance or a training set for autonomous fleets.
Smart Grid utility pings - Your electricity and water usage are a fingerprint of your household’s daily ritual. Do you unplug your devices or turn off all the lights before going to bed?
Acoustic environment mapping - Microphones in everything (from smart speakers to elevators) learning to distinguish the sound of a falling glass from other sharp sounds that could signal a break-in.
Satellite & Drone multi-spectral imagery - Tracking global crop yields, parking lot fullness, and even illegal construction from orbit.
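To make the smart grid “fingerprint” idea concrete, here’s a minimal sketch of pulling one routine signal (a household’s effective bedtime) out of hourly meter readings. The function name, threshold, and data format are all illustrative assumptions, not any utility’s real API.

```python
# Hypothetical sketch: infer a household's "bedtime" from one day of
# hourly smart-meter readings (kWh per hour, index 0 = midnight).
# The 0.3 kWh idle threshold is an invented assumption for illustration.

def bedtime_hour(hourly_kwh, threshold=0.3):
    """Return the first evening hour (18-23) where usage drops below
    `threshold` and stays there until midnight, else None."""
    for hour in range(18, 24):
        if all(hourly_kwh[h] < threshold for h in range(hour, 24)):
            return hour
    return None

# One day's readings: active evening until 22:00, then near-idle draw.
day = [0.2] * 18 + [1.1, 0.9, 0.8, 0.7, 0.1, 0.1]
print(bedtime_hour(day))  # -> 22
```

Run this over a few weeks of days and the household’s weekday/weekend ritual falls out almost for free, which is exactly why this data is such a fingerprint.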
The Biological & Wellness Layer
Fitness tracker biometrics - Heart Rate Variability (HRV) and SpO2 levels mapping your emotional resilience and sleep architecture from your Apple Watch or Oura Ring.
Anonymized genomic sequences - The ultimate long-tail data, probably leaked from 23andMe. A “digital twin” of your biology that predicts pharmaceutical efficacy before you take a pill could be an interesting path to customized drugs.
Microbiome snapshots - Data from smart toilets or mail-in kits tracking the “gut-brain axis” to adjust your AI-suggested diet. This is pretty creepy.
Prosthetic & Exoskeleton feedback - High-speed data on how humans compensate for physical limitations by refining robotic motor control. I’m pretty sure the exoskeleton route will be another wave of integration that sticks.
Continuous Glucose Monitors (CGM) - Real-time metabolic reactions to every snack can teach AI the true cost of those cheat snacks. I remember The Island, the movie where diet and exercise are obsessively used to maintain top physical condition (as shells for their richer twins).
The Behavioral & Intent Layer
Eye-tracking heatmaps - Training models on what actually grabs human attention versus what we claim to care about. This gets a lot easier if you just turn on your camera and let it feed AI-optimized visuals.
Social graph micro-reactions - There are specific mobile behaviors like dwell time, hover intent, and “scroll-past” speed that define your subconscious preferences while on the app. I guarantee this is already being tracked for social media addiction. It’s even more creepy when they make connections to the people you’re physically meeting.
E-commerce receipts and returns - Mapping the “Buyer’s Remorse” cycle and the “Economic Heartbeat” of a zip code is probably already inside Amazon’s data.
Mobile game usage patterns - Analyzing how you handle frustration, reward cycles, and strategic thinking in low-stakes environments. I’m sure this is somewhere on Candy Crush. I don’t see why this wouldn’t happen as an overlay in the Android OS for collecting data for your Gemini Nano.
Calendar & Productivity metadata - Not just what you do, but the “fragmentation” of your day. AI learning when you’re actually “deep working” and adjusting your phone notifications accordingly. Are human routines so hard to understand?
Voice tonality - Not the words you say to your smart speaker, but the stress in your voice when you say them. I really enjoyed the hume.ai demo, a digital therapist that takes the dimensions of voice into account in its responses.
Browser “Tab Graveyards” - The 50 open tabs you never close. I think this clearly represents my aspirational self and unfinished intent.
The Professional & Industrial Layer
Supply chain logistics pings - Every RFID tag on a pallet training a model on global trade bottlenecks. We tried to do this on the blockchain at some point.
Codebase commit histories - Learning not just the code, but the logic of how humans solve bugs over time. I’m pretty sure the codebase commit process is out of control right now.
Legal & Compliance logs - Millions of hours of “boring” regulatory filings training AI to be the ultimate paralegal. This is definitely happening within the legal space. I can already rent a legal assistant for a fraction of the price via private AI employees.
Agricultural sensor arrays - Soil pH, moisture, and nitrogen levels training models to maximize every square inch of dirt for my wife’s little garden.
Customer Service “Rage” transcripts - Training LLMs on how to de-escalate humans by analyzing millions of failed support calls. I think it’s interesting that customer support has moved from a hub-and-spoke model to a decentralized, independent, generic model. Your AI assistant can already handle all generic customer service requests for your product and write the developer experience for you.
The Deep Local & Niche Layer
Digital Waste/Trash sorting - I don’t think we’d have cameras everywhere, but I just found out that garbage trucks now carry timestamped cameras, so you can’t claim the truck never attempted to pick up trash from your house. There are probably optical sensors at recycling plants learning the lifecycle of consumer packaging.
AR spatial maps - Your phone or smart glasses mapping the 3D geometry of your living room so AI knows where the “physical” couch is. I think the Meta Quest 3 already quietly collects quite a bit of this. I’ll try not to play naked.
Micro-payment flows - The $0.50 tips and digital “coffee” buys that represent the gig economy’s actual velocity. This is definitely in x402 or MPP.
Public Wi-Fi handshake density - Tracking how crowds move through malls or airports without needing individual GPS.
Restaurant POS data - Not just what people eat, but the “modifier” data (e.g., “no onions”) that signals shifting cultural tastes.
Smart Appliance “Health” - The vibration frequency of your fridge compressor training a model to predict mechanical failure.
The “Shadow” Archive - All the photos you didn’t post. The blurry, accidental, and “ugly” photos that represent the un-curated human experience.
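The Wi-Fi handshake density idea a few items up is one of the simplest in this whole list to implement: count distinct device MACs per time window and you have a crowd curve with no GPS involved. A minimal sketch, with an assumed (timestamp, mac) input format standing in for real monitor-mode probe captures:

```python
# Illustrative sketch of Wi-Fi handshake density: unique device MACs
# seen per time window ~ crowd size in that window. The input format
# is an assumption; real data would come from captured probe requests.

from collections import defaultdict

def density_per_window(probes, window_s=60):
    """probes: list of (timestamp_s, mac). Returns {window_start: count}."""
    seen = defaultdict(set)
    for t, mac in probes:
        seen[int(t // window_s) * window_s].add(mac)  # dedupe per window
    return {w: len(macs) for w, macs in sorted(seen.items())}

probes = [(3, "aa:1"), (10, "bb:2"), (45, "aa:1"), (70, "cc:3")]
print(density_per_window(probes))  # -> {0: 2, 60: 1}
```

The set per window is what makes it a density measure rather than a tracker: the same phone chirping five times in a minute still counts once. (MAC randomization on modern phones blurs this, which is its own arms race.)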
~See Lemons Creating Data


