The Data Debt: Why Small Shortcuts in Year One Become Research-Integrity Problems in Year Four

Somewhere in the third year of a PhD, a remarkably common scene plays out in front of a laptop. The student is trying to write up a chapter that rests on an analysis they ran sixteen months ago. The figure is in their thesis draft. The conclusion the figure supports is in the paper they are revising. The data file that produced the figure is in a folder called stuff inside another folder called stuff_new on a hard drive whose backup last completed in April. The student opens the data file. They cannot remember what the columns mean. They cannot find the script that produced the figure. The version of the script they do find produces a slightly different figure. They sit, briefly, in the quiet horror of realising that the past six months of work rests on an analysis they can no longer reconstruct.

This is data debt, and almost every PhD student takes some on. Some take on a manageable amount and pay a small ongoing tax during write-up. Some take on enough that they spend months of their final year reconstructing what should have been a five-minute lookup. A few discover, painfully, that the reconstruction is not possible at all. The amount of debt taken on, and therefore the size of the bill that comes due, is determined almost entirely by habits formed in year one — habits that feel like overhead at the time and like salvation later.

Why the cost is invisible until it isn’t

The reason data debt accumulates is that the cost of every individual shortcut is small and immediate, while the benefit of doing things properly is large and deferred. Setting up a versioned folder structure and a naming convention takes an hour and pays back in a year. Writing a one-paragraph notebook entry about why you chose those parameters takes three minutes and pays back two years later when a reviewer asks. None of this work has any return until something has gone wrong, and the things that go wrong (a missing column, a lost script, a conflict between two near-identical files) happen rarely enough that the student rarely makes the connection between the moment of pain and the choice that caused it.

Senior researchers have made the connection so many times that they consider basic data hygiene a non-negotiable part of doing science. PhD students, who have only just started building their own evidence about which shortcuts hurt, almost universally underweight it. The result is a quiet wave of avoidable problems that hit, predictably, in the months between submitting a chapter and trying to revise it.

The ledger

Same calendar, different bills

Every shortcut taken in the first year becomes a specific kind of problem by the fourth. The bill is paid in the months between submitting the thesis and trying to find the data behind it.

Year-one shortcut

What it becomes by year four

“I’ll remember which file is the final one.”

Seventeen files named final_v2_REAL_USE_THIS_actual.csv. No way to tell which fed which figure.

“I’ll write up the methods later.”

Cannot reconstruct, eighteen months on, what parameters were used in the analysis the chapter rests on.

“I know what these columns mean.”

Two years later, no idea what var_x_adj_2 was meant to encode, and the person who would have known has graduated.

“Cloud sync is enough.”

A sync conflict in March wiped three months of work. The backup that was supposed to catch it last completed in November.

“The raw data is on the lab server.”

Server was retired in the move. Data not migrated. The chain from raw measurement to published figure can no longer be reconstructed.

None of these failures is rare. All of them have prevention costs measured in hours and consequences measured in months.

Five layers that pay back

A good data-management system has roughly five layers. None of them is technical; none of them requires software a student doesn’t already have. What separates the students who write up smoothly from the students who do not, more often than intelligence or ambition, is whether they have built these five things and use them daily.

The first is a naming convention. Every file in the project has a name that tells you what is in it, when it was made, and what version it is, without you having to open it. data_final_v2.csv is not a naming convention; it is a confession. experiment03_subjects_clean_2026-08-12.csv is.

The second is versioning. For code, this means git, even for a project no one else will ever see — because the cost of git is small and the alternative is folders called analysis_old, analysis_old_old, and analysis_USE_THIS_ONE. For data, it means a record of when each dataset was generated, what produced it, and what it superseded.

The third is metadata and notebook discipline. The single highest-leverage habit in PhD research is the lab notebook — electronic or paper — that records, at the moment of doing the work, what was done and why. Six months later, the student who kept the notebook can answer their reviewer in five minutes. The student who did not spends a week reconstructing.

The fourth is backup with redundancy. The convention, taught for decades, is three copies, on two different kinds of media, with one held off-site. A laptop alone is none of these. A laptop plus institutional cloud sync is two of them. A laptop plus institutional cloud plus a personal external drive kept at home is the three-two-one rule, and it has quietly saved more theses than any other single habit.

The fifth is a readme and a retention plan. A short text file at the top level of the project, written once and updated occasionally, that explains the folder structure, the naming convention, the location of the raw data, and what the project is. Future-you, opening the project in year four, becomes a stranger. The readme is the message you leave for that stranger.

The system that pays back

Five layers, built once, used daily

None of these requires software a student doesn’t already have. What separates the smooth write-up from the painful one is whether the layers are in place by the end of year one.

1 A naming convention File names that tell you what, when, and which version, without opening anything.
2 Versioning Git for code, even for solo projects. For data, a record of what was generated when, by which script, superseding what.
3 Notebook and metadata discipline A record, made at the moment of doing the work, of what was done and why. The highest-leverage habit in PhD research.
4 Backup with redundancy Three copies, on two kinds of media, with one off-site. The three-two-one rule has saved more theses than any other single habit.
5 A readme and retention plan A short text file at the project root explaining what is here, where the raw data lives, and what the conventions are. A message to future-you, who has become a stranger.

Built in an afternoon in year one. Pays back, in hours saved and integrity preserved, every month afterwards.

What the debt looks like when it comes due

The most common failure mode is the one that opened this piece: a result whose origin cannot be traced. The second is the version conflict — two near-identical scripts, both runnable, producing different figures, with no record of which produced the one in the paper. The third is the lost backup — a hard-drive failure or a sync conflict that takes out a chapter’s worth of work. The fourth, and the worst, is the integrity question — a reviewer or examiner asking how a result was produced, and the student being unable to demonstrate the chain from raw data to figure in a way that would survive scrutiny.

The last of these is the one that quietly haunts the literature. The vast majority of papers that fail post-publication reproducibility checks do not fail because the underlying science was wrong; they fail because the chain of intermediate steps was not recorded clearly enough to verify, and by the time the question is asked, the people involved no longer remember. None of these failures is rare. All of them have prevention costs measured in hours and consequences measured in months.

A daily habit, not a year-four cleanup

Data debt cannot be paid back at the end of the PhD. The information that would have made the debt manageable — what was done, in what order, with what parameters, on what date — is lost in the moments that pass without it being recorded. The only way to handle it is to build the habits early and run them daily.

The students who do this rarely talk about it; they simply hand in cleanly. The students who do not rarely admit how much of their final year was spent in folders called stuff, looking for a script they were certain they had saved somewhere. Research integrity, in the day-to-day sense the phrase actually carries, is not built in the dissertation defence. It is built in the names of files saved on a Tuesday in November of the first year.