“Just when I thought I was out, they pull me back in!” With a sly grin that I’d soon come to recognize, Paul Ginsparg quoted Michael Corleone from The Godfather. Ginsparg, a physics professor at Cornell University and a certified MacArthur genius, may have little in common with Al Pacino’s mafia don, but both are united by the feeling that they were denied a graceful exit from what they’ve built.
Nearly 35 years ago, Ginsparg created arXiv, a digital repository where researchers could share their latest findings—before those findings had been systematically reviewed or verified. Visit arXiv.org today (it’s pronounced like “archive”) and you’ll still see its old-school Web 1.0 design, featuring a red banner and the seal of Cornell University, the platform’s institutional home. But arXiv’s unassuming facade belies the tectonic reconfiguration it set off in the scientific community. If arXiv were to stop functioning, scientists from every corner of the planet would suffer an immediate and profound disruption. “Everybody in math and physics uses it,” Scott Aaronson, a computer scientist at the University of Texas at Austin, told me. “I scan it every night.”
Every industry has certain problems universally acknowledged as broken: insurance in health care, licensing in music, standardized testing in education, tipping in the restaurant business. In academia, it’s publishing. Academic publishing is dominated by for-profit giants like Elsevier and Springer. Calling their practice a form of thuggery isn’t so much an insult as an economic observation. Imagine if a book publisher demanded that authors write books for free and, instead of employing in-house editors, relied on other authors to edit those books, also for free. And not only that: The final product was then sold at prohibitively expensive prices to ordinary readers, and institutions were forced to pay exorbitant fees for access.
The “free editing” academic publishers facilitate is called peer review, the process by which fellow researchers vet new findings. This can take months, even a year. But with arXiv, scientists could post their papers—known, at this unvetted stage, as preprints—for instant and free access to everyone. One of arXiv’s great achievements was “showing that you could divorce the actual transmission of your results from the process of refereeing,” said Paul Fendley, an early arXiv moderator and now a physicist at All Souls College, Oxford. During crises like the Covid pandemic, time-sensitive breakthroughs were disseminated quickly—particularly by bioRxiv and medRxiv, platforms inspired by arXiv—potentially saving, by one study’s estimate, millions of lives.
While arXiv submissions aren’t peer-reviewed, they are moderated by experts in each field, who volunteer their time to ensure that submissions meet basic academic standards and follow arXiv’s guidelines: original research only, no falsified data, sufficiently neutral language. Submissions also undergo automated checks for baseline quality control. Without these, pseudoscientific papers and amateur work would flood the platform.
In 2021, the journal Nature declared arXiv one of the “10 computer codes that transformed science,” praising its role in fostering scientific collaboration. (The article is behind a paywall—unlock it for $199 a year.) By a recent count, arXiv hosts more than 2.6 million papers, receives 20,000 new submissions each month, and has 5 million monthly active users. Many of the most significant discoveries of the 21st century have first appeared on the platform. The “transformers” paper that launched the modern AI boom? Uploaded to arXiv. Same with the solution to the Poincaré conjecture, one of the seven Millennium Prize problems, famous for their difficulty and $1 million rewards. Just because a paper is posted on arXiv doesn’t mean it won’t appear in a prestigious journal someday, but it’s often where research makes its debut and stays openly available. The transformers paper is still routinely accessed via arXiv.
For scientists, imagining a world without arXiv is like the rest of us imagining one without public libraries or GPS. But a look at its inner workings reveals that it isn’t a frictionless utopia of open-access knowledge. Over the years, arXiv’s permanence has been threatened by everything from bureaucratic strife to outdated code to even, once, a spy scandal. In the words of Ginsparg, who usually redirects interview requests to an FAQ document—on arXiv, no less—and tried to talk me out of visiting him in person, arXiv is “a child I sent off to college but who keeps coming back to camp out in my living room, behaving badly.”
Ginsparg and I met over the course of several days last spring in Ithaca, New York, home of Cornell University. I’ll admit, I was apprehensive ahead of our time together. Geoffrey West, a former supervisor of Ginsparg’s at Los Alamos National Laboratory, once described him as “quite a character” who is “infamous in the community” for being “quite difficult.” He also said he was “extremely funny” and a “great guy.” In our early email exchanges, Ginsparg told me, upfront, that stories about arXiv never impress him: “So many articles, so few insights,” he wrote.
At 69 years old, Ginsparg has the lean build of a retired triathlete, his knees etched with scars collected over a lifetime of hiking, mountain climbing, and cycling. (He still leads hikes on occasion, leaving younger scientists struggling to keep up.) His attire was always relaxed, as though he’d just stepped off the Camino de Santiago, making my already casual clothes seem overdressy. Much of our time together was spent cycling the town’s rolling hills, and the maximum speed on the ebike I rented could not keep up with his efficient pedaling.
Invited one afternoon to Ginsparg’s office in Cornell’s physics building, I discovered it to be not “messy,” per se, because that suggests it could be cleaned. Instead, the objects in the room seemed inert, long since resigned to their fate: unopened boxes from the 1990s, piles of Physics Today magazines, an inexplicable CRT monitor, a tossed-aside invitation to the Obama White House. New items were occasionally added to the heap. I spotted a copy of Stephen Wolfram’s recent book, The Second Law, with a note from Wolfram that read, “Since you can’t find it on arXiv :)” The only thing that seemed actively in use was the blackboard, dense with symbols and equations related to quantum measurement theory, sprawling with bra-ket notation.
As he showed me around the building and his usual haunts, Ginsparg was gregarious, not letting a single detail slip by: the nesting patterns of local red-tailed hawks, the comings and goings of the dining staff, the plans for a new building going up behind his office. He was playful, even prankish. Midway through telling me about a podcast he was listening to, Ginsparg suddenly stopped and said, “I like your hair color, by the way, it works for you”—my hair is dyed ash gray, if anyone cares—before seamlessly transitioning to a story about a hard drive that had failed him.
The drive, which he had sent for recovery, contained a language model, Ginsparg’s latest intellectual fascination. Among his litany of peeves is that, because arXiv has seen a surge in submissions in recent times, especially in the AI category, the number of low-quality papers has followed a similar curve, and arXiv has nowhere near enough volunteers to vet them all. Hence his fussing with the drive, part of a quest to catch subpar submissions with what he calls “the holy grail crackpot filter.” And Ginsparg thinks, as he often has in arXiv’s three-decade history, that the quality will not be up to snuff if he doesn’t do it himself.
Long before arXiv became critical infrastructure for scientific research, it was a collection of shell scripts running on Ginsparg’s NeXT machine. In June 1991, Ginsparg, then a researcher at Los Alamos National Laboratory, attended a conference in Colorado, where a fateful encounter took place.
First came a remark from Joanne Cohn, a friend of Ginsparg’s and a postdoc at the Institute for Advanced Study in Princeton, who maintained a mailing list for physics preprints. At the time, there was no centralized way to access these preprints. Unless researchers were on certain mailing lists—which were predicated on their affiliations with prestigious institutions—or knew exactly whom to contact via email, they had to wait months to read new work in published journals.
Then came an offhand comment from a physicist worried about his computer’s storage filling up with emailed articles while he was traveling.
Ginsparg, who had been coding since high school, asked Cohn if she’d considered automating the distribution process. She hadn’t and told him to go ahead and do it himself. “My recollection is that the next day he’d come up with the scripts and seemed pretty happy about having done it so quickly,” Cohn told me. “It’s hard to communicate how different it was at the time. Paul had really seen ahead.”
Hearing tales from and about Ginsparg, you can’t help but see him as a sort of Forrest Gump figure of the internet age, who found himself at crucial junctures and crossed paths with revolutionary figures. As an undergrad at Harvard, he was classmates with Bill Gates and Steve Ballmer; his older brother was a graduate student at Stanford studying with Terry Winograd, an AI pioneer. The brothers both had email addresses and access to Arpanet, the precursor to the internet, at a time when few others did.
After earning his PhD in theoretical physics at Cornell, Ginsparg began teaching at Harvard. A career there wasn’t to be: He wasn’t granted tenure—Harvard is infamous for this—and started looking for a job elsewhere. That’s when Ginsparg was recruited to Los Alamos, where he was free to do research on theoretical high-energy physics full-time, without other responsibilities. Plus, New Mexico was perfect for his active lifestyle.
When arXiv started, it wasn’t a website but an automated email server (and within a few months also an FTP server). Then Ginsparg heard about something called the “World Wide Web.” Initially skeptical—“I can’t really pay attention to every single fad”—he became intrigued when the Mosaic browser was released in 1993. Soon after, Ginsparg built a web interface for arXiv, which over time became its primary mode of access. He also occasionally consulted with a programmer at the European Organization for Nuclear Research (CERN) named Tim Berners-Lee—now Sir Tim “Inventor of the World Wide Web” Berners-Lee—whom Ginsparg fondly credits with grilling excellent swordfish at his home in the French countryside.
In 1994, with a National Science Foundation grant, Ginsparg hired two people to transform arXiv’s shell scripts into more reliable Perl code. They were both technically gifted, perhaps too gifted to stay for long. One of them, Mark Doyle, later joined the American Physical Society and became its chief information officer. The other, Rob Hartill, was working simultaneously on a project to collect entertainment data: the Internet Movie Database. (After IMDb, Hartill went on to do notable work at the Apache Software Foundation.)
Before arXiv was called arXiv, it was accessed under the hostname xxx.lanl.gov (“xxx” didn’t have the explicit connotations it does today, Ginsparg emphasized). During a car ride, he and his wife brainstormed nicer-sounding names. Archive? Already taken. Maybe they could sub in the Greek equivalent of X, chi (pronounced like “kai”). “She wrote it down and crossed out the e to make it more symmetric around the X,” Ginsparg said. “So arXiv it was.” At this point, there wasn’t much formal structure. The number of developers typically stayed at one or two, and much of the moderation was managed by Ginsparg’s friends, acquaintances, and colleagues.
Early on, Ginsparg expected to receive on the order of 100 submissions to arXiv a year. It turned out to be closer to 100 a month, and growing. “Day one, something happened, day two something happened, day three, Ed Witten posted a paper,” as Ginsparg once put it. “That was when the entire community joined.” Edward Witten is a revered string theorist and, quite possibly, the smartest person alive. “The arXiv enabled much more rapid worldwide communication among physicists,” Witten wrote to me in an email. Over time, disciplines such as mathematics and computer science were added, and Ginsparg began to appreciate the significance of this new electronic medium. Plus, he said, “it was fun.”
As the usage grew, arXiv faced challenges similar to those of other large software systems, particularly in scaling and moderation. There were slowdowns to deal with, like the time arXiv was hit by too much traffic from “stanford.edu.” The culprits? Sergey Brin and Larry Page, who were then busy indexing the web for what would eventually become Google. Years later, when Ginsparg visited Google HQ, both Brin and Page personally apologized to him for the incident.
The biggest mystery is not why arXiv succeeded. Rather, it’s how it wasn’t killed by vested interests intent on protecting traditional academic publishing. Perhaps this was due to a decision Ginsparg made early on: Upon submission, users signed a clause that gave arXiv nonexclusive license to distribute the work in perpetuity, even in the event of future publication elsewhere. The strategic move ensured that no major publishers, known for their typically aggressive actions to maintain feudal control, would ever seriously attempt to shut it down.
But even as arXiv’s influence grew, higher-ups at Los Alamos never particularly championed the project—which was becoming, one could argue, more influential than the lab itself. (This was, of course, long past the heyday of Oppenheimer depicted in Christopher Nolan’s middling 2023 docudrama.) Those early years at Los Alamos were “dreamlike and heavenly,” Ginsparg emphasized, the best job he ever had. But in 1999, a fellow physicist at the lab, Wen Ho Lee, was accused of leaking classified information to China. Lee, a Taiwanese American, was later cleared of wrongdoing, and the case was widely criticized for racial profiling. At the time, the scandal led to internal upheaval. There were travel restrictions to prevent leaks, and even discussions about subjecting employees to lie detector tests. “It just got glummer and glummer,” Ginsparg said. It didn’t help that a performance review that year labeled him “a strictly average performer” with “no particular computer skills contributing to lab programs.” Also, his daughter had just been born, and there weren’t schools nearby. He was ready to leave.
Ginsparg stops short of saying he “brought” arXiv with him, but the fact is, he ended up back at his alma mater, Cornell—tenured, this time—and so did arXiv. He vowed to be free of the project within “five years maximum.” After all, his main job wasn’t supposed to be running arXiv—it was teaching and doing research. At the university, arXiv found a home within the library. “They disseminate material to academics,” Ginsparg said, “so that seemed like a natural fit.”
A natural fit it was not. Under the hood, arXiv was a complex software platform that required technical expertise far beyond what was typically available in a university library. The logic for the submission process alone involved a vast number of potential scenarios and edge cases, making the code convoluted. Ginsparg and other early arXiv members I spoke to felt that the library failed to grasp arXiv’s significance and treated it more like an afterthought.
On the library’s side, some people thought Ginsparg was too hands-on. Others said he wasn’t patient enough. A “good lower-level manager,” according to someone long involved with arXiv, “but his sense of management didn’t scale.” For most of the 2000s, arXiv couldn’t hold on to more than a few developers.
There are two paths for pioneers of computing. One is a life of board seats, keynote speeches, and lucrative consulting gigs. The other is the path of the practitioner who remains hands-on, still writing and reviewing code. It’s clear where Ginsparg stands—and how anathema the other path is to him. As he put it to me, “Larry Summers spending one day a week consulting for some hedge fund—it’s just unseemly.”
But overstaying one’s welcome also risks unseemliness. By the mid-2000s, as the web matured, arXiv—in the words of its current program director, Stephanie Orphan—got “bigger than all of us.” A creationist physicist sued it for rejecting papers on creationist cosmology. Various other mini-scandals arose, including a plagiarism one, and some users complained that the moderators—volunteers who are experts in their respective fields—held too much power. In 2009, Philip Gibbs, an independent physicist, even created viXra (arXiv spelled backward), a more or less unregulated Wild West where papers on quantum-physico-homeopathy can find their readership, for anyone eager to learn why pi is a lie.
Then there was the problem of managing arXiv’s massive code base. Although Ginsparg was a capable programmer, he wasn’t a software professional adhering to industry norms like maintainability and testing. Much like constructing a building without proper structural supports or routine safety checks, his methods allowed for quick initial progress but later caused delays and complications. Unrepentant, Ginsparg often went behind the library’s back to check the code for errors. The staff saw this as an affront, accusing him of micromanaging and sowing distrust.
In 2011, arXiv’s 20th anniversary, Ginsparg thought he was ready to move on, writing what was intended as a farewell note, an article titled “ArXiv at 20,” in Nature: “For me, the repository was supposed to be a three-hour tour, not a life sentence. ArXiv was originally conceived to be fully automated, so as not to scuttle my research career. But daily administrative activities associated with running it can consume hours of every weekday, year-round without holiday.”
Ginsparg would stay on the advisory board, but daily operations would be handed over to the staff at the Cornell University Library.
It never happened, and as time went on, some accused Ginsparg of “backseat driving.” One person said he was holding certain code “hostage” by refusing to share it with other employees or on GitHub. Ginsparg was frustrated because he couldn’t understand why implementing features that used to take him a day now took weeks. I challenged him on this, asking if there was any documentation for developers to onboard the new code base. Ginsparg responded, “I learned Fortran in the 1960s, and real programmers didn’t document,” which nearly sent me, a coder, into cardiac arrest.
Technical problems were compounded by administrative ones. In 2019, Cornell transferred arXiv to the school’s Computing and Information Science division, only to have it change hands again after a few months. Then a new director with a background in, of all things, for-profit academic publishing took over; she lasted a year and a half. “There was disruption,” said an arXiv employee. “It was not a good period.”
But finally, relief: In 2022, the Simons Foundation committed funding that allowed arXiv to go on a hiring spree. Ramin Zabih, a Cornell professor who had been a long-time champion, joined as the faculty director. Under the new governance structure, arXiv’s migration to the cloud and a refactoring of the code base to Python finally took off.
One Saturday morning, I met Ginsparg at his home. He was carefully inspecting his son’s bike, which I was borrowing for a three-hour ride we had planned to Mount Pleasant. As Ginsparg shared the route with me, he teasingly—but persistently—expressed doubts about my ability to keep up. I was tempted to mention that, in high school, I’d cycled solo across Japan, but I refrained and silently savored the moment when, on the final uphill later that day, he said, “I might’ve oversold this to you.”
Over the months I spoke with Ginsparg, my main challenge was interrupting him, as a simple question would often launch him into an extended monologue. It was only near the end of the bike ride that I managed to tell him how I found him tenacious and stubborn, and that if someone more meek had been in charge, arXiv might not have survived. I was startled by his response.
“You know, one person’s tenacity is another person’s terrorism,” he said.
“What do you mean?” I asked.
“I’ve heard that the staff occasionally felt terrorized,” he said.
“By you?” I replied, though a more truthful response would’ve been “No shit.” Ginsparg apparently didn’t hear the question and started talking about something else.
Beyond the drama—if not terrorism—of its day-to-day operations, arXiv still faces many challenges. The linguist Emily Bender has accused it of being a “cancer” for the way it promotes “junk science” and “fast scholarship.” Sometimes it does seem too fast: In 2023, a much-hyped paper claiming to have cracked room-temperature superconductivity turned out to be thoroughly wrong. (But equally fast was exactly that debunking—proof of arXiv working as intended.) Then there are opposite cases, where arXiv “censors”—so say critics—perfectly good findings, such as when physicist Jorge Hirsch, of h-index fame, had his paper withdrawn for “inflammatory content” and “unprofessional language.”
How does Ginsparg feel about all this? Well, he’s not the type to wax poetic about having a mission, promoting an ideology, or being a pioneer of “open science.” He cares about those things, I think, but he’s reluctant to frame his work in grandiose ways.
At one point, I asked if he ever really wants to be liberated from arXiv. “You know, I have to be completely honest—there are various aspects of this that remain incredibly entertaining,” Ginsparg said. “I have the perfect platform for testing ideas and playing with them.” Though he no longer tinkers with the production code that runs arXiv, he is still hard at work on his holy grail for filtering out bogus submissions. It’s a project that keeps him involved, keeps him active. Perhaps, with newer language models, he’ll figure it out. “It’s like that Al Pacino quote: They keep bringing me back,” he said. A familiar smile spread across Ginsparg’s face. “But Al Pacino also developed a real taste for killing people.”