Martin Heller
Contributor

What is a compiler? How source code becomes machine code

feature
Jan 20, 2023 | 9 mins
Programming Languages | Software Development

Find out how source code is compiled for different programming languages and computer architectures, including the evolution from FORTRAN to CLR and JIT compilers.


A compiler is a computer program that translates from one format to another, most often from a high-level computer language to byte code and machine code. Compilers come in a number of variations, which we will explore in this article.

Compilers, transpilers, interpreters, and JIT compilers

Compilers often translate source code for a high-level language, such as C++, to object code for the current computer architecture, such as Intel x64. The object modules produced from multiple high-level language files are then linked into an executable file.

Compilers intended to produce object code for architectures that differ from the one running the compiler are called cross-compilers. It is common to use cross-compilers running on desktop (or larger) computers to produce executables for embedded systems. Compilers that translate from one high-level language to another, such as from TypeScript to JavaScript or from C++ to C, are called transpilers. Compilers written in the language that they are compiling are called bootstrap compilers.

Compilers for languages intended to be machine-independent, such as Java, Python, or C#, translate the source code into byte code for a virtual machine, which is then run in an interpreter for the current architecture. The interpreter may be boosted by a just-in-time (JIT) compiler, which translates some of the byte code into native code instructions at runtime. JIT compilers sometimes introduce runtime startup delays, which are usually outweighed by the increased speed later in the run, especially for CPU-intensive code. One approach to reducing the startup lag for JIT-compiled executables is to use an ahead-of-time (AOT) compiler when building the executable image.

Historically, there have been interpreters that didn’t use byte code, such as the BASIC interpreter that came with early personal computers. They tended to be slower at runtime than interpreters that ran compact byte code, and much slower at runtime than compiled native code. However, they were often very productive for the overall software development life cycle, since programmers could quickly code, test, debug, modify, and re-run the code.

Let’s dive into the characteristics of some of the more prominent high-level language compilers.

FORTRAN

FORTRAN (Formula Translator, spelled Fortran from 1977 on) was the first successful high-level language, intended for scientific and engineering applications. The FORTRAN I compiler was developed from 1954 to 1957 for the IBM 704 by an all-star team led by John W. Backus, who was also a co-designer of the IBM 704 itself. It was an optimizing compiler written in assembly language, amounting to 23K instructions. The FORTRAN I compiler performed significant optimizations: it parsed arithmetic expressions while respecting operator precedence, performed copy propagation and dead-code elimination, hoisted common subexpressions to eliminate redundant computations, optimized DO loops and subscript computations, and optimized index-register allocation.
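
As a toy illustration of one of those optimizations, here is a Python sketch of common-subexpression elimination over a list of three-address statements. This is purely illustrative (and assumes no variable is reassigned between statements); it bears no resemblance to the actual FORTRAN I implementation.

```python
# Toy common-subexpression elimination: if an expression has already
# been computed into a variable, reuse that variable instead of
# recomputing the expression.
def eliminate_common_subexpressions(stmts):
    seen = {}    # expression text -> variable that already holds it
    out = []
    for var, expr in stmts:
        if expr in seen:
            out.append((var, seen[expr]))  # reuse the earlier result
        else:
            seen[expr] = var
            out.append((var, expr))
    return out

program = [
    ("t1", "a + b"),
    ("t2", "a + b"),   # redundant: same expression as t1
    ("t3", "t1 * c"),
]
print(eliminate_common_subexpressions(program))
# -> [('t1', 'a + b'), ('t2', 't1'), ('t3', 't1 * c')]
```

The second statement becomes a cheap copy of t1 rather than a repeated addition, which is the essence of the optimization.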

Currently, there are over a dozen FORTRAN compilers, four of which are open source, and many of which are free even though they are offered commercially.

LISP

John McCarthy designed LISP (List Processor) at MIT and published the specification in 1960; it was and is closely associated with the artificial intelligence (AI) community. Shortly after the specification was published, Steve Russell realized that the LISP eval function could be implemented in machine code, and did so for the IBM 704 (to McCarthy’s surprise); that became the first LISP interpreter. Tim Hart and Mike Levin at MIT created the first LISP compiler, in LISP, in 1962; the compiler itself was compiled by running Russell’s LISP interpreter on the compiler source code. Compiled LISP ran 40 times faster than interpreted LISP on the IBM 704. That was the earliest bootstrapped compiler; it also introduced incremental compilation, which allows compiled and interpreted code to intermix.
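
To show the idea behind LISP's eval, the function Russell hand-translated into machine code, here is a minimal s-expression evaluator sketched in Python. It is a toy with hypothetical names, handling only numbers, symbols, an if special form, and function application; a real LISP does far more.

```python
# A minimal LISP-style reader and eval, sketched in Python.
def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return lst
    try:
        return int(tok)
    except ValueError:
        return tok  # a symbol

def evaluate(x, env):
    if isinstance(x, str):   # symbol: look it up
        return env[x]
    if isinstance(x, int):   # number: self-evaluating
        return x
    op, *args = x
    if op == "if":           # special form: evaluate one branch only
        cond, then, alt = args
        return evaluate(then if evaluate(cond, env) else alt, env)
    fn = env[op]             # ordinary function application
    return fn(*[evaluate(a, env) for a in args])

env = {"+": lambda a, b: a + b, "*": lambda a, b: a * b,
       "<": lambda a, b: a < b}
expr = parse(tokenize("(if (< 1 2) (+ 3 4) 0)"))
print(evaluate(expr, env))  # -> 7
```

The striking point, then as now, is how little code eval needs: the whole language fits in one recursive function over the parsed expression.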

There have been numerous compilers and interpreters for later versions of LISP and its descendants, such as Common Lisp, Emacs Lisp, Scheme, and Clojure.

COBOL

COBOL (Common Business-Oriented Language) was designed by a committee, CODASYL, starting in 1959 at the prompting of the US Department of Defense, and based on three existing languages: FLOW-MATIC (designed by Grace Hopper), AIMACO (a FLOW-MATIC derivative), and COMTRAN (from Bob Bemer of IBM). The original goal of COBOL was to be a portable high-level language for general data processing. The first COBOL program ran in 1960.

In 1962, a Navy study found that COBOL compiled 3 to 11 statements per minute. This improved over the years as the language specs and compilers were updated; by 1970, COBOL was the most widely used programming language in the world.

Currently, there are four major surviving COBOL compilers: Fujitsu NetCOBOL compiles to .NET intermediate language (byte code) and runs on the .NET CLR (common language runtime); GnuCOBOL compiles to C code that can then be compiled and linked; IBM COBOL compiles to object code for IBM mainframes and midrange computers and the code is then linked, similar to the early COBOL compilers; Micro Focus COBOL compiles either to .NET or JVM (Java virtual machine) byte code.

ALGOL

Edsger Dijkstra and Jaap Zonneveld wrote the first ALGOL 60 compiler in X1 assembly language over nine months between 1959 and 1960, at the Mathematical Centre in Amsterdam. The X1, designed in-house, was built by the new Dutch computer factory Electrologica. ALGOL (Algorithmic Language) itself was a huge advance over FORTRAN for science and engineering, and was influential in the development of imperative languages such as CPL, Simula, BCPL, B, Pascal, and C.

The compiler itself was about 2,000 instructions long, and the runtime library (written by M.J.H. Römgens and S.J. Christen) was another 2,000 instructions long. The compiler loaded from paper tapes, as did the program source code and the libraries. The compiler took two passes through the code; the first (the prescan) to gather identifiers and blocks, and the second (the main scan) to generate object code on another paper tape. Later, the process was sped up by using a “store” (probably a magnetic drum) instead of paper tape. There were eventually about 70 implementations of ALGOL 60 and its dialects.
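
The two-pass scheme can be sketched in a few lines of Python: a first pass (like the prescan) records where each label lives, and a second pass (like the main scan) emits code with forward references resolved. This is a toy illustration with invented opcodes, not the actual ALGOL 60 compiler.

```python
# A toy two-pass translator: resolve forward jumps to labels.
program = [
    "JUMP end",     # forward reference: 'end' is not yet defined here
    "PRINT hello",
    "end: HALT",
]

# Pass 1 ("prescan"): collect label addresses without emitting code.
labels = {}
for addr, line in enumerate(program):
    if ":" in line:
        labels[line.split(":")[0]] = addr

# Pass 2 ("main scan"): emit object code, substituting label addresses.
object_code = []
for line in program:
    instr = line.split(":")[-1].strip()
    op, *operand = instr.split()
    if operand and operand[0] in labels:
        object_code.append((op, labels[operand[0]]))
    else:
        object_code.append((op, *operand))

print(object_code)  # -> [('JUMP', 2), ('PRINT', 'hello'), ('HALT',)]
```

A one-pass translator cannot know the address of a label it has not yet seen, which is exactly why the paper-tape compiler needed to read the source twice.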

ALGOL 68 was intended to replace ALGOL 60 and was extremely influential, but was so complex that it had few implementations and little adoption. Languages influenced by ALGOL 68 include C, C++, the Bourne shell, KornShell, Bash, Ada (by way of the Steelman requirements), and Python.

PL/I

PL/I (Programming Language One) was designed in the mid-1960s by IBM and SHARE (the IBM scientific users group) to be a unified language for both scientific and business users. The first implementation, PL/I F, was for the IBM S/360, written entirely in System/360 assembly language, and shipped in 1966. The F compiler consisted of a control phase and a large number of compiler phases (approaching 100). There were several later implementations of PL/I at IBM, and also for Multics (as a systems language) and the DEC VAX.

Pascal

Niklaus Wirth of ETH in Zürich was a member of the standards committee working on the successor to ALGOL 60 and submitted a smaller language, ALGOL W, which was rejected. Wirth resigned from the committee, kept working on ALGOL W, and released it in simplified form in 1970 as Pascal. Wirth initially tried to implement the Pascal compiler in FORTRAN 66 but could not; he then wrote a Pascal compiler in the C-like language Scallop, and an associate translated it into Pascal for bootstrapping.

Two notable offshoots of Wirth's Pascal are the Pascal P-system and Turbo Pascal. The Zürich P-system compiler generated “p-code” for a virtual stack machine which was then interpreted; that led to UCSD Pascal for the IBM PC, and to Apple Pascal. Anders Hejlsberg wrote Blue Label Pascal for the Nascom-2, then reimplemented it for the IBM PC in 8088 assembly language; Borland bought it and re-released it as Turbo Pascal. Later, Hejlsberg ported Turbo Pascal to the Macintosh, added Apple’s Object Pascal extensions, and ported the new language back to the PC, which eventually evolved into Delphi for Microsoft Windows.
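
The p-code idea — compile to instructions for an idealized stack machine, then interpret those instructions on the real hardware — can be sketched in Python. This is a toy with made-up opcodes, not the actual p-code instruction set.

```python
# A toy stack-machine interpreter in the spirit of Pascal p-code.
def run(pcode):
    stack = []
    for op, *arg in pcode:
        if op == "PUSH":
            stack.append(arg[0])       # push a constant
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()                 # result is left on the stack

# "p-code" for the expression (2 + 3) * 4
print(run([("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]))
# -> 20
```

Because the virtual machine, not the hardware, defines the instruction set, porting the system to a new computer only requires rewriting this small interpreter loop.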

C

C was originally developed at Bell Labs by Dennis Ritchie between 1972 and 1973 to construct utilities running on Unix. The original C compiler was written in PDP-7 assembly language, as was Unix at the time; the port to the PDP-11 was also in assembly language. Later, C was used to rewrite the Unix kernel to make it portable.

C++

C++ was developed by Bjarne Stroustrup at Bell Laboratories starting in 1979. Since C++ is an attempt to add object-oriented features (plus other improvements) to C, Stroustrup initially called it “C with Objects.” Stroustrup renamed the language to C++ in 1983, and the language was made available outside Bell Laboratories in 1985. The first commercial C++ compiler, Cfront, was released at that time; it translated C++ to C, which could then be compiled and linked. Later C++ compilers produced object code files to feed directly into a linker.

Java

Java was released in 1995 as a portable language (using the marketing slogan “Write once, run anywhere”) that is compiled to byte code for the JVM and then interpreted, similarly to the Pascal P-system. The Java compiler was originally written in C, using some C++ libraries. Later JVM releases added a JIT compiler to speed up the interpreter. The current Java compiler is written in Java, although the Java runtime is still written in C.

In the GraalVM implementation of Java and other languages, an AOT compiler runs at build time to optimize the byte code and reduce the startup time.

C#

C# was designed in 1999 by Anders Hejlsberg, who had left Borland for Microsoft, and implemented by Mads Torgersen in C and C++ for the CLR in 2000. C# compiles to CLR byte code (intermediate language, or IL), which is JIT-compiled to native code at runtime. The C# compiler, CLR, and libraries are now written in C#, and the compiler is bootstrapped from one version to another.

Part of the impetus for developing C#, based on the timing, may have been Microsoft’s inability to license Java from Sun, although Microsoft denies this. Hejlsberg says that C# was influenced as much by C++ as it was by Java; in any case, the Java and C# languages have diverged significantly over the years.

Conclusion

As you’ve seen, language compilers are often first implemented in an existing, lower-level language, and later re-implemented in the language that they are compiling, to enable portability and upgrades via bootstrapping. On the other hand, high-level languages are increasingly compiled to byte code for a virtual machine, which is then interpreted and JIT-compiled. When faster runtime speed is needed, however, library routines are often written in C or even assembly language.


Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from his office in Andover, Massachusetts, from 1986 to 2010. More recently, he has served as VP of technology and education at Alpha Software and chairman and CEO at Tubifi.
