As AI Lowers the Cost of Research, Adjudication and Attention Will Become the Bottlenecks
Whatever one thinks of current frontier LLMs, they are almost certainly the worst ones we will ever use (METR 2026). Given current capabilities and this trajectory, quantitative social science research will likely be transformed by LLMs over the next few years.1 This note pulls apart two ways that this transformation will occur and draws out the implications of each. I assume that the goal of research is the production of socially useful knowledge. When we apply AI to this task, the first change is that AI reduces the cost of many research tasks, thereby raising the value of the scarce remaining ones. Human time and attention will flow to bottlenecks where AI cannot reduce costs as dramatically, which means that what it means for humans to do research will change. This is already occurring. The second change is that at some point I expect the bottleneck in the research ecosystem to shift from production to consumption. If the cost of research production falls enough, then the main constraint will become how we adjudicate research claims, prioritize work for replication, and filter research for attention. I will spend more time on this second, more speculative aspect.
With the rise of LLMs, large parts of research production (data cleaning, routine coding, application of standard identification strategies, mathematical exposition, paper editing) are now very inexpensive. This is a huge change from only a few years ago. Taste in question selection, access to private or difficult-to-collect data, fieldwork capacity, and the judgment required to decide which analyses are actually worth trusting are not currently amenable to automation. Value is now shifting toward these scarce factors of research production. A world in which competent empirical execution is cheap is a world in which proprietary data and research taste become relatively more valuable. This argument is by now familiar, and I think it is both correct and incomplete.2
If costs fall across many factors of research production, such that human effort increasingly concentrates on the remaining stubborn tasks, then we should see overall research output rise. If this occurs, then it seems quite likely that in short order the true bottleneck on the creation of socially valuable knowledge will not be in research production at all. Instead, the bottleneck will be in research consumption. In this world, I expect two constraints to start to bind. The first is adjudication capacity, or simply figuring out which claims are worth trusting. One approach is to read each paper carefully, but everyone wants to defer at least some of this task to someone else. The second is the task of filtering research according to various criteria, such as importance. We currently use the peer-reviewed journal system for both of these tasks.
However, it is quite expensive to evaluate a paper submitted to a journal, and so this system works best when it is also expensive to produce a research paper. If the cost of production falls, then the cost of evaluation will start to bind. One can easily bolt on solutions that force authors to internalize the cost of review, and this “spam filtering” approach is appropriate if one views the new research production as low quality. However, I think this is a mistake because I expect much of the new, cheap research to be good. Ideally, we want ways to adjudicate quality and direct attention that make sense in a world where research production is cheap. With its expensive processes, the peer-reviewed journal system as currently set up is not well positioned to do this. The journal system also, for all its merits, has some serious problems (e.g. Briggs et al. 2026). So we may soon be in a world where a lot of good research is produced but there is no good way to adjudicate it or filter it for attention. Now may be an especially good time to rethink how we adjudicate claims and filter research. The rest of this paper takes each of these two constraints in turn, starting with adjudication.

When we encounter a scientific claim we have to decide how much to believe it. This is the problem of adjudication. There are multiple aspects to this problem, with the most critical ones probably being:
- Is the claim coherent?
- Is the claim tested properly and are the tests interpreted correctly?
- Would the test results be similar if we tested it again on new data?
The first question is one of theory. The second question is empirical and includes checking the link between concepts and measured variables, checking code for bugs, examining whether the assumptions underlying the analysis seem reasonable, checking whether the results are sensitive to reasonable changes in the analysis, and checking the interpretation of the results and how they connect back to the theory. Some parts of this second task are currently more amenable to automation than others: code execution, robustness checks, and consistency checks look increasingly tractable, while judgment about measurement, identification, and interpretation is currently harder to automate, though this could change. The third part is about external validity.3
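To make the automatable portion of this second step concrete, here is a minimal sketch of a robustness harness that re-runs a published specification under reasonable perturbations and flags estimates that move substantially. The data file, variable names, alternative specifications, and the 50% flag threshold are all hypothetical choices, not a description of any existing tool.

```python
# Hypothetical robustness harness: re-run a published OLS specification
# under perturbations and flag estimates that move too much. The data
# file, variables, and 50% cutoff are invented for illustration.
import numpy as np  # used by the log-outcome formula below
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("replication_data.csv")  # hypothetical archived deposit

specs = {
    "published":    "y ~ treatment + age + income",
    "add_control":  "y ~ treatment + age + income + education",
    "drop_control": "y ~ treatment + age",
    "log_outcome":  "np.log(y + 1) ~ treatment + age + income",
}

baseline = smf.ols(specs["published"], data=df).fit().params["treatment"]

for name, formula in specs.items():
    est = smf.ols(formula, data=df).fit().params["treatment"]
    change = abs(est - baseline) / abs(baseline)
    flag = "FLAG" if change > 0.5 else "ok"  # arbitrary sensitivity cutoff
    print(f"{name:>12}: estimate {est: .3f}  change {change:.0%}  {flag}")
```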
This three-part distinction is useful because the first two parts of claim adjudication depend much more heavily on manipulating existing information than the third does, and so they are much more amenable to automation. The third part requires replication on new data and so is inherently bottlenecked by having to interact with the real world. This implies that direct replication4 will end up being a key constraint on our ability to adjudicate claims over time.
Our current research production system has serious problems. For example, in my discipline of political science, most statistical tests are radically underpowered relative to best estimates of true effect sizes (Arel-Bundock et al. 2026), and we have extremely strong selection of significant results over null results (Briggs et al. 2026). The net effect is that published significant estimates are something like 3x larger than results in replication exercises (Aczel et al. 2026). For AI to improve claim adjudication, this is the benchmark it has to beat.
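A quick simulation, with entirely invented parameter values and no claim to match the methods of the studies cited above, shows how underpowered tests combined with selection on significance mechanically produce this kind of exaggeration:

```python
# Illustrative only: underpowered tests plus publishing only significant
# results mechanically inflate estimates. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 0.06, 0.08  # power is far below 80% at alpha = 0.05
estimates = rng.normal(true_effect, se, size=100_000)

# Keep only positive estimates clearing |z| > 1.96, i.e. "publishable".
published = estimates[estimates / se > 1.96]

print(f"true effect:             {true_effect:.3f}")
print(f"mean published estimate: {published.mean():.3f}")
print(f"exaggeration:            {published.mean() / true_effect:.1f}x")
```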
My expectation is that in the near term, AI will harm our ability to adjudicate claims, for three reasons. First, the production of plausible-looking papers should scale with AI capability, while our ability to do direct replication scales with data collection. This means that the stock of unreplicated-but-cited claims will (continue to) grow. Second, when creating theory and running very many analyses is cheap, I expect we will see a rise in papers with coherent theory and compelling empirics that are nevertheless overfit to their data and do not reflect the author’s genuine best guess about an effect in the world. This can be understood as a variant of p-hacking in which all elements of the research process are engineered to produce a compelling article instead of uncovering the truth. When creating theory and testing was expensive, we could use the fact that a paper existed as evidence that the author thought the question was at least worth checking. When creating theory and testing is cheap, the existence of a paper is much less informative about the author’s beliefs. If we think the author’s intuition had any value in finding general claims, then this is a loss. Third, in the near term we will still use the traditional journal system for adjudication and filtering, and so selection on significance, for example, will likely not change.
After the near term, the effects of AI on claim adjudication are harder to predict. If we stick with the idea that AI makes information-processing tasks incredibly cheap, then some kinds of claims will get tested much more frequently. For example, claims that rely on regularly collected surveys could be re-tested cheaply as new survey waves are produced. This could genuinely improve our ability to adjudicate claims. Automatic re-testing on existing data can be done en masse without needing to prioritize, and the interesting design problem there is making such re-testing routine rather than deciding which claims deserve it. Replicating other kinds of research, however, will require expensive data collection, and so we will need to prioritize which claims are worth such investment. That prioritization is itself a filtering problem, and filtering is the subject of the next section.
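As a sketch of what routine re-testing might look like, the snippet below re-estimates a registered claim on each new survey wave as it arrives. The claim record, file layout, and variable names are hypothetical stand-ins for real survey infrastructure:

```python
# Hypothetical routine re-testing: a registered claim is re-estimated on
# each new survey wave. The claim record and files are invented.
import pandas as pd
import statsmodels.formula.api as smf

claim = {
    "id": "example_2026_tab2",
    "formula": "trust ~ contact + age + education",
    "term": "contact",
    "original": 0.21,
}

for year in (2026, 2027, 2028):
    wave = pd.read_csv(f"survey_wave_{year}.csv")  # hypothetical new wave
    est = smf.ols(claim["formula"], data=wave).fit().params[claim["term"]]
    print(f"{claim['id']} on {year} wave: {est:.3f} "
          f"(original {claim['original']:.3f})")
```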
Most academics today filter research predominantly using brands: journal names, institutional affiliations, or the literal names of professors. These are useful shortcuts that let researchers direct their attention to where it is most useful given a voluminous research production system.5 Brands work here for the same reason they work elsewhere: they are cheap heuristics that become valuable when evaluating each alternative is expensive. I stop at McDonald’s on road trips not because it is excellent but because its quality is known, and evaluating each unfamiliar restaurant I drive past would cost more than the meal is worth. Academic brands do similar work for research consumers with limited time. Nevertheless, I think we can do better than using brands in this way, and the first step to doing so is realizing that “directing attention” is at least three separate problems.
The first problem is finding work that is technically correct. This filtering task requires the first two steps of claim adjudication described above, combined with some way to certify work as having passed some test. This is the part of filtering that seems most likely to move toward scalable automation as AI improves, though probably unevenly across different kinds of claims and methods.
The second filtering problem is surfacing which work is most important to replicate. This is a resource allocation question, and one could imagine it being quite important for grant funders, for example. One can imagine various approaches to surfacing this information, from prediction markets (replicate where informed participants disagree most about whether a result will hold) to algorithmic approaches such as PageRank over citation networks.
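To illustrate the algorithmic route, here is a toy version of the PageRank idea: score papers by citation centrality and surface high-scoring papers that lack a direct replication. The citation graph and replication flags are invented:

```python
# Toy PageRank prioritizer: rank papers by citation centrality and list
# high-centrality papers lacking a direct replication. Data is invented.
import networkx as nx

citations = [  # (citing paper, cited paper)
    ("p2", "p1"), ("p3", "p1"), ("p3", "p2"),
    ("p4", "p1"), ("p4", "p3"), ("p5", "p3"),
]
replicated = {"p2"}  # papers with at least one direct replication

graph = nx.DiGraph(citations)  # random surfers drift toward cited work
scores = nx.pagerank(graph)

queue = sorted((p for p in scores if p not in replicated),
               key=scores.get, reverse=True)
for paper in queue:
    print(f"{paper}: centrality {scores[paper]:.3f}, no direct replication")
```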
The final filtering problem is that of finding “the best” work, whether for blue-sky funding, tenure and promotion, and other evaluative purposes, or for readers. This task is irreducibly about values and so will have to be quite personalized to different research consumers.
Right now, we solve claim adjudication and filtering using the journal system. It is amazing that the journal system works as well as it does given the different goals we ask of it. I imagine the best approach going forward is not a single new institution pursuing all of these goals but a three-layer system composed of:
- An open infrastructure layer containing papers, data, and code,
- A linking protocol (a minimal record sketch follows this list),
- Many different aggregation portals, each designed by different communities for their own goals.
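To give a feel for the middle layer, here is one guess at what a single record in such a linking protocol could contain. The schema, identifiers, and URLs are all invented for illustration:

```python
# One invented record in a hypothetical linking protocol: stable
# identifiers tying a claim to its paper, data, code, and replications.
import json

record = {
    "claim_id": "claim:example-2026-003",       # hypothetical identifier
    "paper": "doi:10.9999/example.2026",        # placeholder DOI
    "data": ["doi:10.9999/dataverse.example"],  # placeholder deposit
    "code": "https://example.org/code/xyz",     # placeholder URL
    "statement": "Contact with outgroup members raises trust.",
    "replications": [
        {"claim_id": "claim:example-2027-011", "result": "consistent"},
    ],
}

print(json.dumps(record, indent=2))
```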
Probably the best analogy for how this could work is the World Wide Web, with websites, HTTP, and search engines. We can see hints of how this might work in platforms like arXiv, OSF, and the Harvard Dataverse. Journals could still exist as important specialized filtering tools, but they would build upon a more basic infrastructural layer. The goal is not so much to replace journals as to unbundle what they currently do, so that separate and more diverse mechanisms can address each goal. Because we want to enable diverse types of filtering, and different filtering problems have different structures, no single institution should try to solve all of them. This is the basic idea that leads to my suggestion of an open and layered approach.
Given the three-layer framing above, here are specific steps worth exploring now. To get ahead of claim adjudication, we should be developing infrastructure that lets us extract more signal per unit of new data. LLM tools for reproducing research are already moving us in that direction (Xu and Yang 2026). A second step is to make our public data repositories more conducive to programmatic and LLM interaction. A third step would be to have authors state more clearly the kind of new data that would be most appropriate for future tests of their claims.
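For that third step, one could imagine authors filing a short machine-readable statement alongside the paper. The schema and all values below are hypothetical:

```python
# Hypothetical machine-readable statement of what future data would best
# re-test a claim, filed by the authors alongside the paper.
future_test = {
    "claim_id": "claim:example-2026-003",
    "ideal_data": "nationally representative panel, post-2026 waves",
    "required_measures": ["outgroup_contact", "generalized_trust"],
    "minimum_n": 4000,  # the authors' own power target
    "registered_spec": "trust ~ contact + age + education",
}
```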
To get ahead of attention filtering, we should be experimenting with ways to build on top of the working paper sources that already exist. For example, we may already have the data required to read all psychology papers on OSF, identify which ones are cited most by highly cited papers, and then figure out which of those lack replication. This would be a proof of concept for directing attention toward the studies where replication is highest value. The ultimate goal is not a new and better ranking scheme, but an open standard that makes it cheap to experiment with new ways of ranking, with the aim of growing an ecosystem.
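On toy data, this proof of concept might look like the sketch below: weight each paper’s citations by how cited its citers are, then list the unreplicated papers with the highest weights. In practice the inputs would come from OSF metadata and a replication registry; everything here is invented:

```python
# Toy version of the proof of concept: weight citations by the citer's
# own citation count, then surface unreplicated high-weight papers.
from collections import defaultdict

cites = {  # paper -> papers it cites (invented)
    "a": ["d", "e"], "b": ["d", "a"], "c": ["d", "e"], "d": ["e"],
}
replicated = {"e"}

in_degree = defaultdict(int)
for refs in cites.values():
    for cited in refs:
        in_degree[cited] += 1

weight = defaultdict(int)  # sum of each citer's own citation count
for citer, refs in cites.items():
    for cited in refs:
        weight[cited] += in_degree[citer]

targets = sorted((p for p in weight if p not in replicated),
                 key=weight.get, reverse=True)
print("highest-value replication targets:", targets)
```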
Such an open system could allow for new ways of linking research that were previously impossible. One can imagine literatures being linked in large DAGs of claims with associated data, analyses, and replications all queryable and machine-readable. One can imagine living literature reviews that update automatically as new research is published. One can imagine having one’s own agents do personalized rankings based on complex understandings of one’s own priorities. Much like the web spawned many interesting permutations based on an open standard, an open research base layer could enable many new experiments in filtering.
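As a minimal illustration of the kind of query such a claim DAG could support, with invented claims and replication counts:

```python
# Invented claim DAG: claims are nodes, "builds on" links are edges, and
# replication evidence hangs off the nodes, all queryable.
import networkx as nx

dag = nx.DiGraph()
dag.add_node("foundational_claim", replications=0)
dag.add_node("downstream_claim", replications=2)
dag.add_edge("downstream_claim", "foundational_claim")  # builds on

# Query: which claims rest on foundations lacking replication evidence?
for claim in dag.nodes:
    shaky = [base for base in dag.successors(claim)
             if dag.nodes[base]["replications"] == 0]
    if shaky:
        print(f"{claim} builds on unreplicated claims: {shaky}")
```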
Information manipulation is going to get cheaper much faster than interacting with the world will. I expect that this will lead to the bottleneck in research shifting from production to consumption. We should be preparing for this shift now by building infrastructure that allows us to adjudicate claims and filter work in a world where production is cheap. The current journal system is not well positioned to do this, and so we should be experimenting with new institutions that can.
Note: This is a short position paper written for the Notes on the Future of Quantitative Social Science workshop held at the University of Toronto in May 2026.
Works Cited
Munger, Kevin. 2023. “Temporal Validity as Meta-Science.” Research & Politics 10 (3). https://doi.org/10.1177/20531680231187271.
Footnotes
1. For discussion of what is possible right now, see Hall (2026).
2. For an excellent summary of research in this vein, see Imas and Shukla (2026).
3. External validity includes questions of generalization to new time periods, and so is a perennial question of empirical research (Munger 2023).
4. “Robust replication,” or testing on already-gathered data, should become almost entirely free over time. For example, see Xu and Yang (2026).
5. For example, between 2016 and 2022 total journal output as tracked by Scopus and Web of Science grew by 47% (Hanson et al. 2024). In political science, the number of Scopus-indexed journals nearly tripled between 2000 and 2022 (Kaiser et al. 2023). Growth in economics has also been explosive (Aigner et al. 2025).