re: re: Re: A89: the a68k assembler and how it makes machine code
> What I want to know is how the assembler gets the hex code for instructions that you write
> down. Like how does "4e444e750000" stand for "Trap #4" and then "rts"?
> Bryan
The CPU has elementary actions that the hardware is capable of.
Broadly speaking, a CPU does the following: it reads a memory word from
the address pointed to by the internal register PC, interprets that
word and initiates the corresponding hardware activities. When all the
associated activities are finished, it goes and reads a new word from
the memory addressed by PC.
Now these elementary action packs, each triggered by a certain bit
pattern, are called instructions. For example, when the hardware
sees the bit pattern $4e93, it will do this:
 - subtract 4 from the contents of register a7 and write the result
   back to a7
 - add 2 to the PC register and write the result to two consecutive
   words in memory starting at the address contained in a7
 - copy the contents of register a3 to register PC
 - read a word from memory address PC and start interpreting the bit
   pattern as an instruction
This is called a subroutine call instruction with an address register 3
indirect target address.
It is an atomic operation, you can not execute only parts of it.
When the CPU reads the triggering bit pattern it will do all the
things above. If you want to execute, say, only the copy a3 to PC part,
then that is another instruction, namely jmp (a3), and it is encoded
by a different bit pattern ($4ed3).
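To make the difference between the two instructions concrete, here is a minimal Python sketch of what each one does to the machine state. It is only a toy model: the register dictionary, the memory dictionary (byte address to 16-bit word) and the function names are my own assumptions, and the PC is taken to point at the instruction word itself, as in the description above.

```python
def jsr_a3(regs, mem):
    """Toy model of $4e93, jsr (a3)."""
    regs["a7"] -= 4                           # make room on the stack
    ret = regs["pc"] + 2                      # address of the next instruction
    mem[regs["a7"]] = (ret >> 16) & 0xFFFF    # high word first (big-endian)
    mem[regs["a7"] + 2] = ret & 0xFFFF        # then the low word
    regs["pc"] = regs["a3"]                   # continue at the target


def jmp_a3(regs, mem):
    """Toy model of jmp (a3): only the 'copy a3 to PC' part."""
    regs["pc"] = regs["a3"]


# Hypothetical starting state: the jsr sits at $1000, a3 points to $2000.
regs = {"pc": 0x1000, "a3": 0x2000, "a7": 0x8000}
mem = {}
jsr_a3(regs, mem)
```

After the call, a7 has moved down by 4, the return address $00001002 sits on the stack as two words, and PC equals a3. Running jmp_a3 on the same starting state would change PC only.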
The Motorola book will tell you what bit patterns trigger what
instruction and they also suggest a so-called mnemonic for each
instruction. That is, there is an elementary hardware action which
does the things described above, it has the bit pattern of .... and it
is called a "subroutine call". Motorola suggests that you use the letter
combination (mnemonic) "jsr" [Jump to SubRoutine] to identify it.
They also give a suggestion for syntax, that is, they suggest that
when you want to say "address register 3 indirect" use the "(a3)"
notation or that you use the # to indicate immediate operand and so on.
Now from that to an assembler the route is relatively straightforward.
You build a big table in the assembler. The table contains pairs of
a string and a bit pattern (number):
    NOP   $4e71
    RTS   $4e75
    RTE   $4e73
    RTR   $4e77
and so on. Then the assembler reads your source, line by line. It cuts
each line into fields, namely a label, an instruction mnemonic and then
the operands. It looks up the mnemonic in its big table. If it finds it,
say it was an RTE, then it gets $4e73 from the table and writes it out
to the output file. It also increases its internal memory counter by
two, for the bit pattern for the RTE insn occupies 2 bytes of memory.
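The lookup loop just described could be sketched in Python roughly like this. It handles only operand-less instructions; the names OPCODES and assemble, the ';' comment character and the "label ends with a colon" rule are assumptions made for the sketch. Note that RTS is $4e75, which is the "4e75" in the hex string from the question.

```python
# Mnemonic -> bit pattern, for a few instructions without operands.
OPCODES = {"NOP": 0x4E71, "RTE": 0x4E73, "RTS": 0x4E75}


def assemble(source):
    """Translate operand-less mnemonics into big-endian machine code."""
    out = bytearray()                      # len(out) doubles as the memory counter
    for line in source.splitlines():
        fields = line.split(";")[0].split()    # drop comment, cut into fields
        if fields and fields[0].endswith(":"):
            fields = fields[1:]                # skip the label field
        if not fields:
            continue                           # blank or label-only line
        mnemonic = fields[0].upper()
        if mnemonic not in OPCODES:
            raise ValueError("unknown mnemonic: " + mnemonic)
        out += OPCODES[mnemonic].to_bytes(2, "big")   # counter grows by two
    return bytes(out)


code = assemble("start:  nop   ; do nothing\n        rts\n")
print(code.hex())   # 4e714e75
```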
Now what if you have an operand, that is, something after the
instruction? Well, the assembler's table is somewhat more complex
than in the example above. Usually the operands for instructions and
the bit patterns representing them can be divided into groups. You then
write routines that process one group each. Then you put those into your
table:
    Mnemonic  Code   Size  Oper1     Oper2
    -----------------------------------------------
    NOP       $4e71  -     -         -
    RTS       $4e75  -     -         -
    RTE       $4e73  -     -         -
    RTR       $4e77  -     -         -
    SWAP      $4840  -     Dreg1     -
    SUBQ      $5100  S76   Immed1_8  Alterable
and so on. Now Dreg1 means that the operand must be from d0 to d7 and
that it must be encoded in the bottom 3 bits of the insn word.
Immed1_8 means a #X where X is between 1 and 8, and it should be
encoded by bitwise AND-ing X with 7 and then shifting the result to the
left by 9 bits. Alterable means a whole lot of addressing modes which
will be encoded in the bottom 6 bits and, if they encode an operand
which needs explicit addresses, then these will be stored in further
words (they are called extension words in Motorola lingo). S76 means
that the instruction is available in byte, word and long sizes and the
size information is encoded in bits 7 and 6 of the insn bit pattern.
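As a worked example of these field encodings, here is a small Python sketch that builds the instruction word for SWAP and for SUBQ with a data register as the second operand (the simplest Alterable case, where the bottom 6 bits are just 000 followed by the register number). The function names are made up for the sketch; the size codes 00, 01 and 10 for byte, word and long are the ones Motorola uses for this instruction.

```python
SIZE_BITS = {"B": 0b00, "W": 0b01, "L": 0b10}   # the S76 field


def encode_swap(dreg):
    """SWAP Dn: base pattern $4840, register in the bottom 3 bits (Dreg1)."""
    if not 0 <= dreg <= 7:
        raise ValueError("data register must be d0..d7")
    return 0x4840 | dreg


def encode_subq_dn(imm, dreg, size="W"):
    """SUBQ.size #imm,Dn: base pattern $5100 plus three fields."""
    if not 1 <= imm <= 8:
        raise ValueError("immediate must be 1..8")
    if not 0 <= dreg <= 7:
        raise ValueError("data register must be d0..d7")
    word = 0x5100
    word |= SIZE_BITS[size] << 6      # S76: size in bits 7 and 6
    word |= (imm & 7) << 9            # Immed1_8: 8 wraps around to 0
    word |= dreg                      # mode 000 (Dn), register in bits 2..0
    return word


print(hex(encode_swap(2)))               # 0x4842
print(hex(encode_subq_dn(1, 0)))         # 0x5340, i.e. subq.w #1,d0
print(hex(encode_subq_dn(8, 3, "L")))    # 0x5183, i.e. subq.l #8,d3
```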
So, when the assembler sees the mnemonic "SUBQ" it will look for a
size (.B, .W or .L). If it can't find one, it will assume the default,
which is .W. It generates the bit pattern for that. Then it will see
that there is a first operand and call the routine Immed1_8. The
routine will check that the first operand is in the form of #<constant>
and that the constant is between 1 and 8. If yes, it will generate the
bit pattern for that and return to the line processor. The line
processor will again consult the table and call the Alterable routine
to check the second operand. Alterable, in turn, will try to match
patterns on the operand, like a[0-7] (that is a0 to a7) and (a[0-7]) and
<constant>(a[0-7],[ad][0-7]{.[wl]}), that is, a constant expression
followed by an opening parenthesis followed by a0 to a7 then a comma
then a0 to a7 or d0 to d7, optionally followed by a period and the
letter w or l, then a closing paren. If it finds one of the patterns,
it will generate the 6-bit encoding for that pattern and also all the
necessary extension words. (On a 68000 the longest instruction is 5
words: one instruction word and two longwords containing addresses;
on a 68020 or 68030 it peaks at 11 words!)
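A pattern-matching routine of this kind might look like the following Python sketch. It covers only the three patterns mentioned above and accepts a plain decimal number where a real assembler would evaluate a constant expression; the function name encode_ea and the regular expressions are my own. The mode numbers (001, 010, 110) and the layout of the extension word for the indexed mode are the 68000 ones.

```python
import re

AREG = re.compile(r"a([0-7])")                  # a3
AIND = re.compile(r"\(a([0-7])\)")              # (a3)
INDEXED = re.compile(                           # 4(a0,d1.l)
    r"(-?\d+)\(a([0-7]),([ad])([0-7])(?:\.([wl]))?\)")


def encode_ea(operand):
    """Return (6-bit mode/register field, list of extension words)."""
    text = operand.strip().lower()
    m = AREG.fullmatch(text)
    if m:
        return (0b001 << 3) | int(m.group(1)), []
    m = AIND.fullmatch(text)
    if m:
        return (0b010 << 3) | int(m.group(1)), []
    m = INDEXED.fullmatch(text)
    if m:
        disp, base, kind, index, size = m.groups()
        disp = int(disp)
        if not -128 <= disp <= 127:
            raise ValueError("displacement must fit in 8 bits")
        ext = (int(kind == "a") << 15      # index is an address register?
               | int(index) << 12          # index register number
               | int(size == "l") << 11    # long index (default is word)
               | (disp & 0xFF))            # 8-bit signed displacement
        return (0b110 << 3) | int(base), [ext]
    raise ValueError("unsupported operand: " + operand)


field, ext = encode_ea("(a3)")
print(hex(0x4E80 | field))    # 0x4e93, the jsr (a3) from earlier
```

The last two lines put the 6-bit field into the JSR base pattern $4e80, which reproduces the $4e93 discussed at the beginning.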
Another issue is how it knows what bit pattern belongs to
"mylabel", in the context of
    mylabel:
            do something
            jmp mylabel
Well, obviously the bit pattern is dependent on the actual memory
address mylabel represents when the program is running. However, when
the assembler starts assembling, it knows where your program will be
loaded, that is, what address corresponds to the first instruction of
your code. Since from that point on the assembler generates all the
instructions, it knows how much space they take up and therefore it can
keep track of the address of every instruction (and thus every label).
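The bookkeeping can be sketched in Python as a pass over the source that only counts bytes. The SIZES table, the function name collect_labels and the starting address $4000 are assumptions for the sketch; jmp is counted as 6 bytes because with an absolute long target it is one instruction word followed by a 32-bit address.

```python
# Bytes occupied by each instruction in this toy example.
SIZES = {"nop": 2, "rts": 2, "jmp": 6}


def collect_labels(lines, origin):
    """Walk the source once and record the address of every label."""
    labels = {}
    address = origin                         # the internal memory counter
    for line in lines:
        fields = line.split()
        if fields and fields[0].endswith(":"):
            labels[fields[0][:-1]] = address     # label = current address
            fields = fields[1:]
        if fields:
            address += SIZES[fields[0].lower()]  # advance past the insn
    return labels


source = ["start:", "    nop", "mylabel:", "    nop", "    jmp mylabel", "end:"]
labels = collect_labels(source, 0x4000)
print({name: hex(addr) for name, addr in labels.items()})
# {'start': '0x4000', 'mylabel': '0x4002', 'end': '0x400a'}
```

Once the label table exists, a second pass can emit the code and fill in the address wherever a label is used as an operand.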
In reality, the assembler very rarely knows the absolute address of
your code. However, it can assume a starting address of 0, calculate
with that and, in addition to the code belonging to your program, it
can also generate a table (a so-called "relocation table") which lists
all locations in the code which contain position-dependent bit
patterns, together with some attributes of each location. Then the
linker or the loader can generate the final bit pattern for these
places when the real start address is known.
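What the loader does with such a table can be sketched in Python as follows. This is a deliberately simplified model, not any real object file format: the relocation table is just a list of byte offsets, each pointing at a 32-bit absolute address that was calculated with a start address of 0, and the function name relocate is made up.

```python
def relocate(code, reloc_offsets, base):
    """Add the real start address to every position-dependent longword."""
    image = bytearray(code)
    for offset in reloc_offsets:
        value = int.from_bytes(image[offset:offset + 4], "big")
        patched = (value + base) & 0xFFFFFFFF
        image[offset:offset + 4] = patched.to_bytes(4, "big")
    return bytes(image)


# mylabel: nop / jmp mylabel, assembled as if the code started at 0.
# $4ef9 is jmp with an absolute long address, which follows at offset 4.
code = bytes.fromhex("4e714ef900000000")
loaded = relocate(code, [4], 0x20000)
print(loaded.hex())   # 4e714ef900020000
```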
There are other issues with assemblers, like segments, directives,
macros and so on, but they are not relevant to your question.
I hope the above clarifies a few things.
Regards,
Zoltan