If you are new to C you may have wanted to write a function that takes an array argument. You may have learned the syntax to define an array looks like this:
int numbers[5] = {0};
And it might then occur to you that passing that array as an argument to a function would use similar syntax:
void process_numbers(int numbers[5]);
But you would soon discover that this assumption doesn’t hold and your code has some weird errors in it that you don’t understand. This is because array arguments in C decay into a pointer to the first element of the array. This post gets into what that means, why it works that way, and how to write code that avoids problems.
What Code Gets Generated
Let’s start with an example. Here we have a function that prints out the numbers it finds in its array argument. You might imagine a more complex procedure. We’re more interested in what happens to the argument:
#include <stdio.h>
#define NUMBERS_MAX_LEN 10
void process_bad(int numbers[NUMBERS_MAX_LEN]) {
    for (size_t i = 0; i < NUMBERS_MAX_LEN; ++i) {
        printf("Processing: %d\n", numbers[i]);
    }
}
On my machine I will compile this with gcc 13.3.0 using gcc -c array1.c. We can then look at the compiled code with objdump -d array1.o. This will disassemble the binary data in the object file
and show us the code the compiler generated as a listing of assembly
instructions.
Now, if you’re new to C and you haven’t gotten used to reading assembly yet, don’t worry. We’re not interested today in understanding this code. Skim it and move on.
array1.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <process_bad>:
0: f3 0f 1e fa endbr64
4: 55 push %rbp
5: 48 89 e5 mov %rsp,%rbp
8: 48 83 ec 20 sub $0x20,%rsp
c: 48 89 7d e8 mov %rdi,-0x18(%rbp)
10: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
17: 00
18: eb 30 jmp 4a <process_bad+0x4a>
1a: 48 8b 45 f8 mov -0x8(%rbp),%rax
1e: 48 8d 14 85 00 00 00 lea 0x0(,%rax,4),%rdx
25: 00
26: 48 8b 45 e8 mov -0x18(%rbp),%rax
2a: 48 01 d0 add %rdx,%rax
2d: 8b 00 mov (%rax),%eax
2f: 89 c6 mov %eax,%esi
31: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # 38 <process_bad+0x38>
38: 48 89 c7 mov %rax,%rdi
3b: b8 00 00 00 00 mov $0x0,%eax
40: e8 00 00 00 00 call 45 <process_bad+0x45>
45: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
4a: 48 83 7d f8 09 cmpq $0x9,-0x8(%rbp)
4f: 76 c9 jbe 1a <process_bad+0x1a>
51: 90 nop
52: 90 nop
53: c9 leave
54: c3 ret
Now let’s look at another way you could write the same function:
#include <stdio.h>
#define NUMBERS_MAX_LEN 10
void process_better(int* numbers) {
    for (size_t i = 0; i < NUMBERS_MAX_LEN; ++i) {
        printf("Processing: %d\n", numbers[i]);
    }
}
The only change we’ve made is to the argument’s type signature.
Instead of int numbers[NUMBERS_MAX_LEN] we specify int* numbers.
This is a more honest specification, and it is equivalent to
process_bad as far as the generated code is concerned:
array2.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <process_better>:
0: f3 0f 1e fa endbr64
4: 55 push %rbp
5: 48 89 e5 mov %rsp,%rbp
8: 48 83 ec 20 sub $0x20,%rsp
c: 48 89 7d e8 mov %rdi,-0x18(%rbp)
10: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
17: 00
18: eb 30 jmp 4a <process_better+0x4a>
1a: 48 8b 45 f8 mov -0x8(%rbp),%rax
1e: 48 8d 14 85 00 00 00 lea 0x0(,%rax,4),%rdx
25: 00
26: 48 8b 45 e8 mov -0x18(%rbp),%rax
2a: 48 01 d0 add %rdx,%rax
2d: 8b 00 mov (%rax),%eax
2f: 89 c6 mov %eax,%esi
31: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # 38 <process_better+0x38>
38: 48 89 c7 mov %rax,%rdi
3b: b8 00 00 00 00 mov $0x0,%eax
40: e8 00 00 00 00 call 45 <process_better+0x45>
45: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
4a: 48 83 7d f8 09 cmpq $0x9,-0x8(%rbp)
4f: 76 c9 jbe 1a <process_better+0x1a>
51: 90 nop
52: 90 nop
53: c9 leave
54: c3 ret
The reason we don’t need to fully understand every line of this
assembly code is that if we simply compare these two outputs we
find that they are identical apart from the function name. This is
because of what’s been called pointer decay. When we specify
int numbers[NUMBERS_MAX_LEN] as a function argument, the compiler treats
the argument type as a pointer to the first element of the array and
ignores the NUMBERS_MAX_LEN bound. Writing int* numbers declares that
same type directly, which is why I call it a more honest
specification. Under the hood, this is how
the C compiler treats the argument.
Why Does It Decay
When Dennis Ritchie, the inventor of C, was developing the language his prior art for his work was B and BCPL. In those languages there were no arrays in the sense that we know them today. Instead there was the “cell,” which was basically a single machine word. Arrays were constructed from cells by packing addresses and values into them. Dennis Ritchie did something different with C after experimenting with extending B: he added arrays and soon after structs.
Arrays in C were different because they were allocated regions of memory instead of composed from cells. This required Ritchie to add type information to the pointer in order for the compiler to compute offsets into the region from the region’s base address. This allowed C compilers to scale the size of arrays without incurring run-time overhead.
But when Ritchie started thinking about structs he ran into a problem: what do you do about structs that contain arrays?
This is the point where we learn where pointer decay came from. In Dennis Ritchie’s own words from his paper, The Development of the C Language*:
The solution constituted the crucial jump in the evolutionary chain between typeless BCPL and typed C. It eliminated the materialization of the pointer in storage, and instead caused the creation of the pointer when the array name is mentioned in an expression. The rule, which survives in today’s C, is that values of array type are converted, when they appear in expressions, into pointers to the first of the objects making up the array.
The rule he mentions turns up in function arguments as well. And ultimately it turns out to be a good thing. Function arguments are passed by value, which means they are copied. A pointer is a word-sized value and is cheap to copy. If you were able to pass a whole array… that could be a very expensive copy and is probably not what you want your program to do!
What Is Wrong
You might have been tempted to write instead:
#include <stdio.h>
#define NUMBERS_MAX_LEN 10
void process_bad(int numbers[NUMBERS_MAX_LEN]) {
    for (size_t i = 0; i < sizeof(numbers); ++i) {
        printf("Processing: %d\n", numbers[i]);
    }
}
Briefly, this code has a few problems:
- sizeof returns the number of bytes of the argument, which is a pointer
- the NUMBERS_MAX_LEN in the argument is ignored, the type is actually int*
- this subtle conversion of types can lead to code which looks reasonable but is wrong
Fortunately, recent versions of GCC diagnose taking sizeof(numbers) on an
array argument: the -Wsizeof-array-argument warning is on by default, and
building with -Werror turns it into a hard error. Linus had praised the
warning. At least the worst mistakes will be caught by that.
However, the first version of process_bad, which uses the constant as
its loop bound, still compiles without any complaint due to pointer
decay, and should be avoided in my opinion. Pointers do
not carry size information in their type. Even when working on a
program where everyone is experienced with such nuances in C there’s
the potential for errors to sneak in when passing array arguments this
way. The size in the argument is not documentation, as some will
claim, it’s lies. And lying code is bad code.
What to Write Instead
A common solution to this is to pass the size information as an argument along with the array:
void process_even_better(int* numbers, size_t numbers_len) {
    for (size_t i = 0; i < numbers_len; ++i) {
        printf("Processing: %d\n", numbers[i]);
    }
}
Or use a struct:
struct numbers_t {
    size_t len;
    int* values;
};
void process_with_struct(struct numbers_t numbers) {
    for (size_t i = 0; i < numbers.len; ++i) {
        printf("Processing: %d\n", numbers.values[i]);
    }
}
Typically these approaches are used for dynamically sized arrays, and
they catch most errors. For fixed-size arrays it may still be useful to
stick with the approach in process_better using the constant as the
bound. In that case be sure to document this constraint, perhaps
using run-time assertions in a debug build to try and catch breaking
changes.
In the end, this is a small wart in C. It’s worth knowing about so
that you can avoid it. There’s much to say about pointers and how C
could be improved by fixing its type system. Pointers could carry their
size and type. Maybe they could banish void. Other languages
are carrying out these experiments and maybe one of them will become a
de facto C replacement. You should check them out.
But I think C is going to be around for a while. At least until the majority of popular operating systems are written in one of the successor languages. Given how excited the world of commercial software development seems to be about innovating in the space of operating systems… that could be a very long time into the future.
Until then, I hope you found this article useful, and happy hacking out there!