Microsoft Excel blamed for gene name errors in the scientific literature


Microsoft Excel is the world’s most popular spreadsheet software and is used across industries. A new study published in Genome Biology claims that Excel’s automatic conversion of cell contents has affected approximately one-fifth of the genomics journal papers that include supplementary Excel gene lists. With default settings, Excel converts some gene names to dates and floating-point numbers.

For example, gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to ‘2-Sep’ and ‘1-Mar’, respectively. Furthermore, RIKEN identifiers have been reported to be automatically converted to floating-point numbers (e.g. from accession ‘2310009E13’ to ‘2.31E+13’).
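Converted entries of this kind leave a recognizable pattern, so a gene list can be screened for them before it is reused. The Python sketch below is only an illustration and is not the screening method used in the study: the two regular expressions, the function name suspicious_entries and the example column are all assumptions made for this example, and they cover only the day-month and scientific-notation forms quoted above.

```python
import re

# Day-month strings such as "2-Sep" or "1-Mar" (assumed display format).
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)

# Scientific-notation strings such as "2.31E+13".
SCI_LIKE = re.compile(r"^\d+(\.\d+)?E\+\d+$", re.IGNORECASE)


def suspicious_entries(values):
    """Return the entries that look like converted gene names (hypothetical helper)."""
    flagged = []
    for value in values:
        text = str(value).strip()
        if DATE_LIKE.match(text) or SCI_LIKE.match(text):
            flagged.append(text)
    return flagged


# Illustrative column mixing intact symbols with converted ones.
example_column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "2.31E+13"]
print(suspicious_entries(example_column))
```

A check like this can only flag likely damage. It cannot recover the original symbol, because more than one gene name may map to the same date.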

This is not a newly discovered issue. The problem of Excel inadvertently converting gene symbols to dates and floating-point numbers was originally described in 2004. Because most ordinary users expect Excel to convert an entry such as SEP2 to 2-Sep, Microsoft has decided not to change this behaviour. But the gene symbol conversion is problematic because supplementary files are an important resource in the genomics community and are frequently reused. The study screened 35,175 supplementary Excel files and found 7,467 gene lists attached to 3,597 published papers. The authors confirmed gene name errors in 987 supplementary files from 704 published articles.

It seems there is no direct way to permanently deactivate the automatic formatting of dates in Excel, and the issue also occurs in other popular spreadsheet programs such as LibreOffice Calc and Apache OpenOffice Calc. The study was conducted to raise awareness of the problem in the genomics research community.

Read the full report here.

