The 2019-nCoV pandemic resulted in severe personal and business disruption. More and more individuals and organisations are questioning the accuracy and validity of the Covid-19 data and prediction models published in the media. A forensic data analyst confronted with these concerns will, inter alia, consider a Benford analysis of the available data to determine whether evidence exists that supports these concerns.

Many analysts in the fields of audit, forensic and accounting sciences have heard of Benford’s Law. I have not seen many practical applications of Benford analysis but, in the few cases where I have, it almost always delivered results that were interesting yet difficult to interpret.

I believe the difficulty stems from my lack of understanding of the mathematical basis of Benford’s Law. As an auditor I grasp the concept and how it can assist in identifying data manipulation, but I lack the mathematical understanding to interpret the results of a Benford analysis effectively. In fact, from a review of articles written by mathematicians, it seems they too believe that the exact reason why certain sets of numbers follow this law and others do not still requires further study and understanding.

But let us start at the beginning. Quoting loosely from an article published in the American Scientist (T. P. Hill, 1998), the astronomer and mathematician Simon Newcomb published a two-page article in the American Journal of Mathematics (1881) reporting a mathematical phenomenon. Newcomb described his observation that books of logarithms in the library were quite dirty at the beginning and progressively cleaner throughout. From this he inferred that fellow scientists using the logarithm tables were looking up numbers starting with 1 more often than numbers starting with 2, numbers with first digit 2 more often than 3, and so on.

After a short heuristic argument, Newcomb concluded that the probability that a number has a particular first significant digit (that is, first non-zero digit) d can be calculated as follows:

*Prob (first significant digit= d) = log10 (1 + 1/d), d = 1,2, … ,9*

In particular, his conjecture was that the **first digit is 1 about 30 percent of the time and 9 only about 4.6 percent of the time**. That the digits are not equally likely to appear comes as something of a surprise, but to claim an exact law describing their distribution is indeed striking.
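The formula above is easy to verify numerically. A minimal Python sketch that computes the expected Benford distribution for all nine first digits:

```python
import math

# Benford's expected first-digit probabilities: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"digit {d}: {p:.1%}")
```

Running this shows digit 1 at about 30.1% and digit 9 at about 4.6%, matching Newcomb’s conjecture; the nine probabilities also sum to exactly 1, since the product of the terms (1 + 1/d) telescopes to 10.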

Newcomb’s article went unnoticed, and 57 years later General Electric physicist Frank Benford, apparently unaware of Newcomb’s paper, made exactly the same observation about logarithm books and arrived at the same logarithm law. Evidence indicates that Benford spent several years gathering data, and the table he published in 1938 in the Proceedings of the American Philosophical Society was based on 20,229 observations from such diverse data sets as areas of rivers, American League baseball statistics, atomic weights of elements and numbers appearing in Reader’s Digest articles. Benford’s article received much wider attention and led to the name “Benford’s Law”.

I first got interested in Benford’s Law at a conference in Germany where I met accounting professor Mark Nigrini, originally from South Africa but based in the United States at the time. He wrote his Ph.D. thesis on Benford’s Law and delivered a very interesting presentation on the practical value of the law in accounting data. Since then I have always looked for opportunities to use Benford’s Law in data analysis. I use Arbutus Analyzer, which provides probably the most advanced Benford functionality.

During the Covid-19 pandemic, under lockdown, I naturally started collecting Covid-19 stats and, like many other data analysts, soon found myself tracking and charting the spread of the virus and its impact globally.

Then, as auditors do, I wondered whether the data on which I and the rest of the world rely are complete and accurate, and whether there might be a way to test this.

I’m sure by now you know where I’m going with this and yes, I thought about Benford’s Law. Is it possible that the global Covid-19 statistical data can be tested to determine the level of reliance that can be placed upon it? Is this data appropriate for a Benford analysis?

The answer is: I don’t know. Specialists like Hill and Nigrini might weigh in on these questions. But I am curious, so I decided to develop a program in Arbutus that imports the daily Covid-19 data from the internet and performs daily analytics on it. I wanted to see if this data conforms to Benford’s Law. Specifically, I wanted to see if Benford’s Law can be used to ask questions about the accuracy of Covid-19 statistics.

In short, I decided on the following approach:

- Extract daily Covid-19 data from GitHub (https://github.com/CSSEGISandData/COVID-19), using Python and Arbutus Analyzer.
- Run the Arbutus Benford command, daily, on every country’s data from 1 January 2020.
- Use new confirmed cases as the value to be analysed, testing only the first digit. I chose new cases because it is more random than total cases, which by nature rises over time, and because it is the number I expect to reveal manipulation best over a time-series database.
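The first-digit extraction step above can be sketched in plain Python. This is not the Arbutus Benford command itself, just a minimal illustration of how daily new-case counts map to first significant digits; the `sample` values are hypothetical:

```python
from collections import Counter

def first_digit(n):
    """Return the first significant digit of a non-negative integer,
    or None for zero (which has no significant digit)."""
    n = abs(int(n))
    while n >= 10:
        n //= 10
    return n if n > 0 else None

def first_digit_counts(new_cases):
    """Count first significant digits across a series, skipping zeros."""
    digits = (first_digit(v) for v in new_cases)
    return Counter(d for d in digits if d is not None)

# hypothetical daily new-case values for one country
sample = [102, 987, 15, 0, 3, 1450, 27, 1999, 804]
print(first_digit_counts(sample))
```

Note that zero-case days are skipped entirely rather than counted, since Benford’s Law is defined only for first *significant* (non-zero) digits.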

My first graphical result is strikingly positive. From a table of all global entries of new confirmed cases for each day, which at the moment contains 19,467 records, almost perfect conformance to Benford’s Law is displayed:

Legend: **Top line (blue)** = Upper bound; **Middle line (red)** = Actual count; **Bottom line (green)** = Lower bound

In addition, this conformance seems to exist from very early in the time-series database. It leads me to conclude that the **global** Covid-19 data do not show clear signs of material manipulation. However, and more significantly, the graph shows that the data is most probably appropriate for a Benford analysis.

Obviously, the next question to answer is whether individual country data also conforms to Benford’s Law. Prof. Nigrini told me that he used, inter alia, the US census database to test Benford’s Law during his Ph.D. studies. Overall he found an almost perfect Benford match but, when he drilled down to individual counties, some significant non-conformances were identified. Closer inspection of these exceptions indicated a high probability of manipulated or “guessed” data.

My approach to comparing individual countries’ data with each other is reasonably straightforward. On a daily basis I total the z-statistic (zstat) per country and sort the resulting table on that value in descending order. This places the most non-conforming countries at the top of the table.
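The ranking step can be sketched as follows. This is an illustrative Python sketch, not the Arbutus implementation: it uses the standard per-digit z-statistic with continuity correction (as described in Nigrini’s work), sums it over the nine digits per country, and sorts descending. The two country tables are hypothetical:

```python
import math
from collections import Counter

# expected Benford first-digit probabilities
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def digit_z_stats(counts, n):
    """Per-digit z-statistic with continuity correction."""
    stats = {}
    for d, p_exp in BENFORD.items():
        p_obs = counts.get(d, 0) / n
        se = math.sqrt(p_exp * (1 - p_exp) / n)
        z = (abs(p_obs - p_exp) - 1 / (2 * n)) / se
        # the continuity correction can push z slightly negative; clamp at 0
        stats[d] = max(z, 0.0)
    return stats

def total_z(counts):
    """Sum the nine per-digit z-statistics for one country."""
    n = sum(counts.values())
    return sum(digit_z_stats(counts, n).values())

# hypothetical first-digit counts for two countries (100 observations each):
# "A" is close to Benford; "B" has far too few 1s and too many 6s
countries = {
    "A": Counter({1: 30, 2: 18, 3: 12, 4: 10, 5: 8, 6: 7, 7: 6, 8: 5, 9: 4}),
    "B": Counter({1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 40, 7: 15, 8: 10, 9: 10}),
}
ranked = sorted(countries, key=lambda c: total_z(countries[c]), reverse=True)
print(ranked)
```

With these hypothetical inputs, country “B” sorts to the top, exactly the behaviour the daily table relies on.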

This is where it gets interesting and contentious. Although this list changes over time, right now, the top non-conforming countries can be seen in the list on the left in the graph below. South Africa’s graph is shown as evidence that many countries’ data do closely conform to Benford’s law:

The next graph shows the top non-conforming countries individually:

Interestingly, the top countries not conforming to Benford’s Law all have a similarly low actual count of first digit 1. Also interestingly, Hill mentions in his 1998 article, with reference to Nigrini’s findings, *“….the evidence to date, it appears that both fraudulent and random-guess data tend to have far too few numbers beginning with 1 and too many numbers beginning with 6…..”*. A perfect example of such a combination in the Covid-19 data can be found in the Sweden graph:

However, in our global Covid-19 data we have far too few numbers beginning with 1 and too many numbers starting with 8, followed by numbers starting with 6.

The objective of this article was never to question any country’s Covid-19 reported data. This analysis proves nothing yet. It only indicates that asking questions of the data is probably warranted.

I wanted to know if Benford’s Law can be applied to documented Covid-19 data. I am of the opinion that it can. I have no doubt that purists might argue about my approach or methodology but from my perspective as an auditor and forensic analyst, I would ask appropriate questions based upon my findings to date.

These questions include:

- Do the results point to weaknesses in my methodology, and is new confirmed cases the most appropriate value to measure? A derivative number such as a logarithm or standard deviation should be tested to confirm the results.
- Why do the global Covid-19 numbers match Benford’s law so closely?
- Why do some countries deviate from the law significantly more than others?
- Why is the number 1 in most cases understated? Possibly 1,000 reported as 800, etc.?
- If data were manipulated, what would be possible reasons (understatement or overstatement)? E.g. understatement to support lifting of lockdown or overstatement to support Covid-19 expenditure/lockdown?

Finally, it must be mentioned that the results shown in this article change daily. New countries drift to the top of the list over time. I will keep on monitoring this phenomenon until the virus has been contained.

I believe it is important to use our collective knowledge to ensure transparency and to investigate information published in the media. From a personal perspective, it is encouraging to see laws of nature enabling us to ask independent questions of data not conforming to those laws. I imagine Benford’s Law and other mathematical laws will increasingly form part of artificial intelligence applications.

REFERENCES:

Bogomolny, A. “Benford’s Law and Zipf’s Law.” https://www.cut-the-knot.org/do_you_know/zipfLaw.shtml.

Hill, T. P. “The First Digit Phenomenon.” *Amer. Sci.* **86**, 358-363, 1998.

Nigrini, M. J. *The Detection of Income Tax Evasion Through an Analysis of Digital Frequencies.* Ph.D. thesis. Cincinnati, OH: University of Cincinnati, 1992.

Nigrini, M. “I’ve Got Your Number.” *J. Accountancy* **187**, pp. 79-83, May 1999. https://www.aicpa.org/pubs/jofa/may1999/nigrini.htm.

Nigrini, M. *Digital Analysis Using Benford’s Law: Tests & Statistics for Auditors.* Vancouver, Canada: Global Audit Publications, 2000.