Developers Should Care About the Data They Make

Neel Blair
5 min read · Feb 2, 2022

We’re told data is going to make the future better. We’re told data, machine learning, and AI will reshape the way we do all these really important things. Data will do this, because it will be reused, mined, combined, and maximized beyond its original purpose. The promise our data holds is vast. Organizations of every stripe — governments, schools, companies, non-profits — all promise to get smarter using data, and tell us that data is their most valuable resource.

And yet, many of those who generate that data — <movie trailer voice> “the key to the future… ” — seem to care very little about the state of the data they collect or create. Imagine data is a crop grown on a farm — those harvesting the crops (data scientists / data engineers) are doing their best to harvest a great bounty. But it seems those planting the crops (developers) are just sort of chucking seeds out in the fields, with no regard for how easy, economical, efficient, or just plain crazy-making the harvest will be.

Let me use a very simple, very basic example. One that has data scientists the world over regularly pulling their hair out. Say you’re collecting literally any kind of time-based data. Maybe it’s the time of the last bank transaction, maybe it’s the timestamp of the last time a user logged into a system, maybe the last time some temperature gauge was read. Could be anything. Dates can be recorded in any of a dizzying array of formats.

2/1/2022 11:07:32 PM

February 1st, 2022 at 11:07PM

1/2/2022 23:07:32

1643785738000

(All of the above could be describing effectively the same moment.)

There’s Coordinated Universal Time (UTC), epoch milliseconds, European format, American format, H:M:S, AM/PM, 24-hour time… and on and on.
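To make the ambiguity concrete, here is a minimal Python sketch using only the standard library. It parses one of the strings above under two conventions, then converts the epoch value to ISO 8601. The variable names are mine, and choosing ISO 8601 in UTC as the output is an assumption about what a sensible default looks like, not a rule from any particular style guide.

```python
from datetime import datetime, timezone

raw = "1/2/2022 23:07:32"

# The same string read under two conventions lands on two different days.
as_american = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S")  # January 2, 2022
as_european = datetime.strptime(raw, "%d/%m/%Y %H:%M:%S")  # February 1, 2022

# Epoch milliseconds are unambiguous, but only if you know they are milliseconds.
epoch_ms = 1643785738000
from_epoch = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)

# ISO 8601 with an explicit offset is readable by people and parsers alike.
iso_text = from_epoch.isoformat()
round_trip = datetime.fromisoformat(iso_text)

print(as_american.date(), as_european.date())
print(iso_text)
```

Note that the epoch value comes out as early morning on February 2 in UTC. It only lines up with "late evening, February 1" if you assume a local offset of several hours behind UTC, which none of the human-readable strings actually state. That missing time zone is one more thing the downstream person has to guess.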

Time-series data is one of those things that data science can really draw a lot of value from. Trends, forecasts, and all manner of decision support can be driven using time-series data. So if we’re really going to change the world with data and drive all this value, the choices we make around formatting date and time data are important. A person creating or collecting time-stamped data has a choice to make about format. A choice that will influence every downstream attempt at using that data to drive value. And how do we make this important choice about date format, enabling everyone downstream to maximize the value of the data we’re creating or collecting?

Whatever’s easy. Or whatever we know best. Or we base it on whatever smart-seeming article we read last week. Close our eyes and throw a dart at a board, maybe. Shout over to the next cubicle, or ping the Slack channel, and take whatever answer comes back first.

At least, that’s what it feels like on the other side. If I could reduce, even by a tiny bit, the amount of time spent wrangling weirdly inconsistent date formats… I would have finished Game of Thrones by now. “But Neel,” you are saying, “GoT ended in 2019!” I KNOW! I’m still over here trying to figure out why a single flat file from a single source system covering just 2 years of data changed date formats at random intervals 5 DIFFERENT *&#$ing TIMES!

My annoyance and absurdly long streaming backlog aside, there’s a simple point I’m trying to make. As a data scientist/engineer/manager, I’m constantly told how excited everyone is to extract the hidden, valuable insights in their data. I’m told what a high priority data harvesting, learning from data, and decision support with data are to whatever org I’m working for. But it seems like all this energy and intention showed up a little too late to solve one of the biggest problems — the fact that whoever or whatever created the data in the first place didn’t invest much thought in how it would be consumed later.

Now, I get it — coding standards are boring… style guides are inflexible. They aren’t updated often enough, they’re rigid, our project is a special snowflake, the proverbial square peg mashing itself into a round hole. OK.

But.

Someone is going to come along later, seeing the special genius that our creation represents, and they’re going to want to extract the maximum value from that genius creation. And then that someone is going to CURSE OUR NAMES for weeks as they try to make sense of the mess we left in our wake. And nobody wants that. Developers are the kind of people who take pride in a system they designed working flawlessly through time, becoming a reliable cog in the great machine of whatever organization they’re a part of.

So just a bit of thought toward those of us downstream will be appreciated, and could drive untold value. Data that originates in our systems is going to end up in another person’s hands one day (maybe a future version of you), and a small amount of care and discipline in the early stages of project development can ensure that whatever we’re building keeps its value well into the future. Are you going to remember what sp21mo means in 2 years when you’re sifting through logs or databases for treasure? Me neither. We’re all glad we kept the character count low so we don’t have to type as much every time we call that variable, but future us will have no idea what that means. And c’mon, autocomplete is a thing everywhere now. We can tab, it won’t kill us.
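For what it’s worth, here is a rough sketch of what "a small amount of care" might look like at the point where a record is created. Everything in it is hypothetical: the `TemperatureReading` class, its field names, and the `make_reading` helper are invented for illustration and don’t come from any real system. The idea is simply that the unit and the time zone live in the field names, and the timestamp is written as ISO 8601 in UTC.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class TemperatureReading:
    """One reading from a temperature gauge (hypothetical schema)."""
    sensor_id: str
    temperature_celsius: float  # the unit is part of the name
    recorded_at_utc: str        # ISO 8601 string, always in UTC


def make_reading(sensor_id: str, temperature_celsius: float,
                 recorded_at: datetime) -> TemperatureReading:
    # Refuse naive datetimes so nobody downstream has to guess the zone.
    if recorded_at.tzinfo is None:
        raise ValueError("recorded_at must be timezone-aware")
    return TemperatureReading(
        sensor_id=sensor_id,
        temperature_celsius=temperature_celsius,
        recorded_at_utc=recorded_at.astimezone(timezone.utc).isoformat(),
    )


reading = make_reading(
    "gauge-07", 21.5,
    datetime(2022, 2, 1, 23, 7, 32, tzinfo=timezone.utc),
)
row = asdict(reading)
print(row)
```

It costs a few extra keystrokes compared to three-character names, but whoever opens this table in two years can tell what each column holds without tracking down the person who wrote it.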

When I dig into yet another tower-of-babel data source, I just imagine a disembodied voice, intoning “Look all ye upon my 3 character column names and despair, for you shall never know my secrets. The variables BUH, NUH, and GUH are all integers. We have no idea what they mean, but they can be typed quickly. So there’s that.”

Date formats matter. Naming conventions matter. Field and variable names matter. Documentation matters. And even if you aren’t lucky enough to work in a place with amazing coding standards and style guides and naming conventions that are comprehensible and easy to use, a short consultation from a data scientist, DBA, or other partner would make things a lot easier for everyone downstream.

To go back to my farming analogy — if we know we’re going to want to harvest this crop later, maybe we could place our seeds with more care from the beginning. We’ll all have more to eat later.
