PS:
If you use an off-the-shelf bootloader, you may find that it performs
much of the core initialisation for you, so you will then need only a
fairly conventional C runtime start-up.
Also, to get the most out of an ARM9 you will want to run from RAM; this
means that your runtime code must be linked to run from RAM but be
loaded from Flash. There are a number of ways to achieve this; again
you are largely on your own, although an existing bootloader may well
help.
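
As a rough illustration of "linked to run from RAM, load from Flash",
the simplest approach is a copy loop in the start-up code that moves
the image from its load address to its run address before jumping to
it. This is only a sketch: the function name relocate_image is
hypothetical, and on a real target the three addresses would normally
come from symbols defined in your linker script rather than from
ordinary arrays.

```c
#include <stdint.h>

/* Copy an image from its load address (Flash) to its run address (RAM).
 * load_start/load_end bound the stored copy; run_start is where the
 * code was linked to execute. All three are assumed to be word aligned.
 * On a real target these would typically be linker-script symbols. */
void relocate_image(uint32_t *run_start,
                    const uint32_t *load_start,
                    const uint32_t *load_end)
{
    while (load_start < load_end) {
        *run_start++ = *load_start++;
    }
}
```

The copy has to happen before anything in the RAM-linked image is
called, so the routine doing the copying must itself execute from
Flash.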
When I used it, we had our own bootloader that decompressed the
application from Flash to RAM using zlib. The RAM application image was
compressed at build time, and a utility generated an assembler data
array from the binary image. That output was then assembled and linked
to a start-up stub that ran from Flash, performed the decompression,
and then jumped to the start address. During the boot process, three
separate C run-time environments were established: the bootloader, the
application decompression stub, and then the application itself. It is
simplest if the core initialisation is performed once by the
bootloader, although that need not be the case.
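
To give an idea of the "binary image to assembler data array" step,
here is a minimal sketch of such a host-side utility function. It is
not the tool we used; the name emit_asm_array, the symbol naming and
the choice of GNU assembler directives are all my assumptions for the
example.

```c
#include <stdio.h>
#include <stddef.h>

/* Write len bytes of data to out as a GNU-assembler data array.
 * Emits a global label <symbol> for the data and <symbol>_size holding
 * the length as a 32-bit word. Returns 0 on success, -1 on I/O error. */
int emit_asm_array(FILE *out, const char *symbol,
                   const unsigned char *data, size_t len)
{
    size_t i;

    fprintf(out, "    .section .rodata\n");
    fprintf(out, "    .global %s\n", symbol);
    fprintf(out, "    .global %s_size\n", symbol);
    fprintf(out, "    .balign 4\n");
    fprintf(out, "%s:\n", symbol);

    for (i = 0; i < len; i++) {
        if (i % 8 == 0) {
            fprintf(out, "    .byte ");      /* start a new row of 8 */
        }
        fprintf(out, "0x%02x", data[i]);
        /* end the row after 8 bytes or at the last byte */
        fputc((i % 8 == 7 || i + 1 == len) ? '\n' : ',', out);
    }

    fprintf(out, "    .balign 4\n");
    fprintf(out, "%s_size:\n", symbol);
    fprintf(out, "    .word %lu\n", (unsigned long)len);

    return ferror(out) ? -1 : 0;
}
```

The start-up stub can then declare the two symbols as extern, pass
them to the decompressor, and jump to the decompressed image's entry
point.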
I also recommend that you have a JTAG hardware debugger of some kind. A
degree of GDB gymnastics is required if you have a multistage bootstrap
such as I described, since the debugger treats each stage as a separate
binary.