Identify processor type from raw binary code?

Question

Not really related to chips, but hopefully I'll get some directions to follow from here.

I got a chunk of code, but I don't know what processor it is intended for. Are there tools available that can help me identify the code type? What statistical methods can help? Byte distribution? Byte-pair distribution? Markov chains, maybe?

This is a fun question. What kind of device are you hacking?
You could try feeding it to a couple of different disassemblers and see what happens.
Great question. Might be a better fit for StackOverflow however.

JustJeff · Answer 1 · 2012-02-09 11:19:05Z

If it was compiled from a language like C or Pascal, there would be certain standard sequences of binary you could look for. With C for instance, almost all functions start with something that saves the stack pointer to a "frame" or "link" pointer. For any given processor, there are usually only a couple of ways to do this. So you could answer "is this code for processor X" by looking for X's binary for these sequences.
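As a rough sketch of that signature search, the snippet below counts a few well-known frame-setup sequences in a blob. The table is my own small illustration, not an exhaustive list, and `PROLOGUES`, `count_occurrences` and `score_prologues` are hypothetical names. Short patterns also match by chance (a 2-byte pattern turns up about once per 64 KiB of random data), so treat the counts as hints only.

```python
# Hypothetical signature table: a few common function-prologue encodings.
PROLOGUES = {
    # push ebp; mov ebp, esp (two equivalent encodings of the mov)
    "x86 (32-bit)": [bytes.fromhex("5589e5"), bytes.fromhex("558bec")],
    # push rbp; mov rbp, rsp
    "x86-64": [bytes.fromhex("554889e5")],
    # link a6, #displacement (only the opcode word, the displacement varies)
    "68000": [bytes.fromhex("4e56")],
}


def count_occurrences(data: bytes, pattern: bytes) -> int:
    """Count (possibly overlapping) occurrences of pattern in data."""
    count = 0
    pos = data.find(pattern)
    while pos != -1:
        count += 1
        pos = data.find(pattern, pos + 1)
    return count


def score_prologues(data: bytes) -> dict:
    """Return, per processor, how many prologue signatures were found."""
    return {
        cpu: sum(count_occurrences(data, p) for p in patterns)
        for cpu, patterns in PROLOGUES.items()
    }
```

A processor whose count is far above what chance would give for a blob of that size is worth a closer look with a disassembler.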

That said, I had some luck differentiating between 8088, 6502 and 68000 binary just using histograms. Any given processor has certain legal instruction opcodes, and these tend to get used slightly more often than average. With a large enough chunk of binary, you can start to see certain trends. This is made difficult, however, by the fact that all the operands in a given piece of binary tend not to correlate to the given processor type, and this essentially just makes noise in your histogram data. Also, even two different programs for the same processor may have noticeably different histograms. Still, it can give you a place to start.
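A minimal sketch of the histogram idea, assuming you already have reference binaries for each candidate processor (those references are not something this thread provides). The function names are mine, and cosine similarity is just one plausible way to compare two histograms.

```python
from collections import Counter
import math


def byte_histogram(data: bytes) -> list:
    """Return the relative frequency of each byte value 0..255."""
    counts = Counter(data)
    total = len(data) or 1  # avoid dividing by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


def cosine_similarity(h1: list, h2: list) -> float:
    """Compare two histograms; 1.0 means identical shape, 0.0 means no overlap."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm1 = math.sqrt(sum(a * a for a in h1))
    norm2 = math.sqrt(sum(b * b for b in h2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

You would compute `byte_histogram` for the unknown blob and for each known sample, then rank the candidates by similarity. Given the operand noise described above, expect the scores to be close together.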

stevenvh · Answer 2 · 2012-02-09 10:52:26Z

Many moons ago, when there weren't that many different processor cores around, I identified Z80 code a few times through frequency analysis. For the Z80, CD is the machine code for "call subroutine" and C9 is "return from subroutine" (I'll never forget), and these are often among the most frequently occurring bytes. This requires that you're acquainted with the instruction set at the machine code level, however. Experience with assembling by hand helps (I did that a lot, and I can still count backwards in hex to calculate offsets).
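That frequency check is easy to sketch. The cutoff of the eight most common bytes is an arbitrary assumption of mine, and `top_bytes` and `looks_like_z80` are hypothetical names. Padding bytes such as 00 or FF will usually sit at the top of the list in a real ROM dump.

```python
from collections import Counter


def top_bytes(data: bytes, n: int = 8) -> list:
    """Return the n most frequent byte values, most frequent first."""
    return [value for value, _ in Counter(data).most_common(n)]


def looks_like_z80(data: bytes, n: int = 8) -> bool:
    """Heuristic: CALL (0xCD) and RET (0xC9) both rank among the top n bytes."""
    top = top_bytes(data, n)
    return 0xCD in top and 0xC9 in top
```

A `True` here is only a hint to try a Z80 disassembler next; other 8080-family code shares these two opcodes.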

supercat · Answer 3 · 2012-02-09 03:32:34Z

If the file is for a 12-bit or 14-bit PIC, every pair of bytes will be a 12- or 14-bit word, typically stored LSB first, with the four or two most significant bits clear, respectively.
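A quick way to test this, assuming the image is a raw little-endian dump with no header (the function name is hypothetical):

```python
def pic_word_fraction(data: bytes, width: int) -> float:
    """Fraction of little-endian 16-bit words that fit in `width` bits."""
    if width not in (12, 14):
        raise ValueError("width must be 12 or 14")
    n_words = len(data) // 2
    if n_words == 0:
        return 0.0
    limit = 1 << width  # words below this have all higher bits clear
    fitting = 0
    for i in range(0, n_words * 2, 2):
        word = data[i] | (data[i + 1] << 8)  # LSB first
        if word < limit:
            fitting += 1
    return fitting / n_words
```

For random data you would expect roughly 0.25 with `width=14` and about 0.06 with `width=12`, so a result close to 1.0 is a strong hint.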

Adam Davis · Answer 4 · 2012-02-08 23:26:18Z

This is a very interesting question. There are millions of instruction sets out there, but only a handful of very commonly used ones.

The first thing I'd look at is the origin and intended use. If you suspect it was designed in the US, you'd primarily be targeting processors with datasheets available in English, for instance. If it was designed in Asia, then there are a number of processors used there for mass-manufactured devices that US engineers rarely see. Even Europe has a few processors that are more common than others.

I'd then take a look at code size and functionality (assuming you know what the code does to some degree). If it's a few megabytes of code, you can pretty much discount most embedded 8-bit processors and start looking at larger devices with external memory. If it is a few kilobytes or less, then you'd instead want to focus on smaller, cheaper devices. If the functionality is simple, it might even be code for a 4-bit processor.

At this point it's worthwhile looking at the memory structure. There's likely to be a program section and a data section at minimum. If it's a raw binary file (versus Intel HEX or Motorola S-record) then you have little insight as to where in memory certain chunks of data are being placed. A hex editor might show some patterns. If it does come in a HEX or S-record format, you might have more information about the memory structure of the processor it's meant for. Some processors reset at program memory location 0, some at the highest memory location. The program might include EEPROM initial values in a separate memory location. If it's meant for a secure processor (as used in banking) it might even have security keys at an odd memory location.
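If the file is Intel HEX, the address ranges it populates can be listed with a few lines of code. This is a minimal sketch that handles only record types 00 (data), 01 (end of file), 02 (extended segment address) and 04 (extended linear address); the function name is mine.

```python
def hex_data_ranges(text: str) -> list:
    """Return merged (start, end_exclusive) address ranges of all data records."""
    base = 0
    spans = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(":"):
            continue  # skip blank lines and anything that is not a record
        raw = bytes.fromhex(line[1:])
        if sum(raw) & 0xFF != 0:  # all bytes including the checksum sum to 0
            raise ValueError("bad checksum in record: " + line)
        count = raw[0]
        addr = int.from_bytes(raw[1:3], "big")
        rectype = raw[3]
        payload = raw[4:4 + count]
        if rectype == 0x00:
            spans.append((base + addr, base + addr + count))
        elif rectype == 0x01:
            break
        elif rectype == 0x02:
            base = int.from_bytes(payload, "big") << 4
        elif rectype == 0x04:
            base = int.from_bytes(payload, "big") << 16
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Isolated ranges far away from the main code block (configuration words, EEPROM contents) are the interesting part, since their addresses tend to be specific to a processor family.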

Depending on the language it was programmed in, you might have some additional clues. If it was programmed in C or a similar procedural language, then functions will almost always start with a sequence of instructions to save certain registers to the stack (lots of pushes) then right before returning lots of pops to return the original values from the stack. If you can do some pattern recognition, you'll find many of these sequences throughout, and may be able to determine which instructions are most likely push/pop instructions, return, etc, which could narrow your choices down a bit.

If it's an embedded device with interrupts, it may have an interrupt vector table, which will look like a bunch of jumps to different memory locations all in a large block, probably at a convenient location (address of 0x???0 for instance). Jump tables are used elsewhere for other things as well, but if you can locate a sequence of instructions that look identical except for what would be the address to jump to, you might be able to infer what a jump instruction looks like, and again narrow your choices down.
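One way to hunt for such a block is to look for runs of fixed-size entries that share the same first byte but differ elsewhere. The sketch below assumes the opcode is the first byte of each entry and that you guess the entry size (`stride`) yourself; both are my assumptions, and the names are hypothetical. On processors with little-endian 16- or 32-bit instruction words the constant byte may sit elsewhere in the entry.

```python
def find_jump_table_candidates(data: bytes, stride: int, min_entries: int = 4) -> list:
    """Return (offset, first_byte, entry_count) for runs that look like jump tables."""
    results = []
    i = 0
    n = len(data)
    while i + stride <= n:
        opcode = data[i]
        entries = 1
        j = i + stride
        # Extend the run while the next entry starts with the same byte.
        while j + stride <= n and data[j] == opcode:
            entries += 1
            j += stride
        if entries >= min_entries:
            # Ignore padding: real table entries differ in their target address.
            distinct = {data[k:k + stride] for k in range(i, j, stride)}
            if len(distinct) > 1:
                results.append((i, opcode, entries))
            i = j
        else:
            i += 1
    return results
```

Any constant byte column will match, including a shared high address byte, so every hit is only a candidate. Trying strides of 2, 3, 4 and 8 covers a fair number of common layouts.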

At that point, I'd start with the most common processor architectures and see if anything correlates: x86, ARM, MIPS, 8051, AVR, PIC, PowerPC, Z80, 68k, 6502, etc. There are lists of common processors and instruction sets, at least in the English-speaking world, that might prove helpful.

I'm not aware of any automated tools that would help with this, but MAME emulates a great many processor architectures, and one possible method is to run the code through a number of processors and watch the registers to see if anything clicks according to what you know about the design.

"Even Europe has a few processors that are more common than others." Living in Europe, this never occurred to me. Can you give examples?
@stevenvh Due to the Acorn and Sinclair companies, 6502 and Z80 based embedded systems were very popular. And, of course, the ARM processor started at Acorn Computers.

J. Polfer · Answer 5 · 2010-05-11 18:47:21Z

Idea: do you know the age of the code, i.e., around what year it was created?

If it was old enough, it might give you a clue as to which processor it was written for. You could take the age / year it was written and determine what processor(s) were popular around that time period, and try loading/executing the hex file on those.

On second thought, given the mass proliferation of processors in just the last 20 years, this might be a needle-in-a-haystack technique and not be very fruitful.

Toby Jaffey · Answer 6 · 2010-05-10 14:30:49Z

Try running it through GNU file. If it's got any standard header, it'll pick it up.

Eg.

jrt@lin:~/src$ file foo
foo: ELF 32-bit LSB executable, Atmel AVR 8-bit, version 1 (SYSV), statically linked, not stripped
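For ELF specifically, the part of the header that names the processor is simple enough to check by hand. The sketch below is not how `file` is implemented; it reads only the class, byte order and `e_machine` fields, and the machine table is a small subset that I picked.

```python
import struct

# Small subset of ELF e_machine values; the ELF specification has the full list.
ELF_MACHINES = {
    3: "Intel 80386",
    4: "Motorola 68000",
    8: "MIPS",
    20: "PowerPC",
    40: "ARM",
    62: "x86-64",
    83: "Atmel AVR 8-bit",
}


def identify_elf(data: bytes):
    """Return a short description of an ELF header, or None if it is not ELF."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        return None
    bits = {1: "32-bit", 2: "64-bit"}.get(data[4], "unknown class")
    if data[5] == 1:
        fmt, order = "<H", "LSB"  # little-endian header fields
    elif data[5] == 2:
        fmt, order = ">H", "MSB"  # big-endian header fields
    else:
        return None
    (machine,) = struct.unpack_from(fmt, data, 18)  # e_machine is at offset 18
    name = ELF_MACHINES.get(machine, "unknown machine %d" % machine)
    return "ELF %s %s, %s" % (bits, order, name)
```

A raw firmware dump has no such header, in which case this returns `None` and you are back to the statistical approaches.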

answered May 10 '10 at 12:35

Tried that. GNU file says it is "data". – mentalist May 10 '10 at 16:41

Could you post some of it? Tried searching for ASCII in it with "strings"? – Toby Jaffey May 10 '10 at 16:44
