Learning Objectives
- Understand the x64 register set — volatile vs. non-volatile and why it matters for shellcode
- Implement correct 16-byte stack alignment before every function call
- Allocate and manage shadow space correctly in NASM
- Pass 1–4+ parameters using the Windows x64 calling convention
- Set up a working NASM build environment on Windows
Registers — Volatile vs. Non-Volatile
The x64 register set is the foundation of everything in this course. Before writing a single line of shellcode, you need to know which registers you can trust across function calls and which ones will be clobbered. Get this wrong and your shellcode will fail in subtle, maddening ways.
The Windows x64 ABI divides registers into two categories. Volatile registers — RAX, RCX, RDX, R8, R9, R10, R11 — are considered scratch registers that any function call is free to destroy. If you store a value in RCX before a call, don't expect it to be there afterward. Non-volatile registers — RBX, RBP, RDI, RSI, R12–R15, RSP — are preserved across calls by the callee. For shellcode, we rely heavily on R12–R15 to store API handles and base addresses we need to keep throughout execution.
| Register | Type | Role in Shellcode | Sub-registers (64→8-bit) |
|---|---|---|---|
| RAX | Volatile | Return value from function calls; scratch | RAX → EAX → AX → AL/AH |
| RCX | Volatile | 1st function parameter | RCX → ECX → CX → CL |
| RDX | Volatile | 2nd function parameter | RDX → EDX → DX → DL |
| R8 | Volatile | 3rd function parameter | R8 → R8D → R8W → R8B |
| R9 | Volatile | 4th function parameter | R9 → R9D → R9W → R9B |
| R10/R11 | Volatile | Scratch; not used for parameters (5th and later arguments go on the stack) | R10 → R10D → R10W → R10B (same pattern for R11) |
| RBX | Non-Volatile | Loop counters, base pointers | RBX → EBX → BX → BL |
| R12 | Non-Volatile | Store DLL base addresses (kernel32, user32) | R12 → R12D → R12W → R12B |
| R13 | Non-Volatile | Store resolved API addresses | R13 → R13D → R13W → R13B |
| R14 | Non-Volatile | Store resolved API addresses | R14 → R14D → R14W → R14B |
| R15 | Non-Volatile | Store resolved API addresses | R15 → R15D → R15W → R15B |
| RSP | Non-Volatile | Stack pointer — must be 16-byte aligned before calls | RSP only |
| RIP | Special | Instruction pointer — used in PIC code for addressing | RIP only |
16-Byte Stack Alignment
This is one of the most important, and most confusing, aspects of x64 assembly for newcomers. The Windows x64 ABI requires that RSP be divisible by 16 (aligned to a 16-byte boundary) immediately before any call instruction. If you violate this, the callee may fault the first time it uses an alignment-sensitive instruction such as movaps on its stack, so the crash shows up inside the API rather than in your own code.
The simplest mental model: the last hex digit of RSP must be 0 before a call. Instructions that modify RSP by 8 bytes each — push and pop — flip it between aligned and unaligned. Track your pushes and pops carefully.
If RSP ends in 8 (e.g., 0x...ff8), you're misaligned by 8. A single push rax or sub rsp, 8 will re-align it. If it ends in 0, you're good to call.
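The rule is easy to sanity-check with a few lines of Python before stepping through real code in a debugger. This is only a sketch of the arithmetic; the function names and the sample RSP values are illustrative assumptions, not part of the course source.

```python
def is_call_aligned(rsp: int) -> bool:
    """True when RSP is a multiple of 16, i.e. safe to execute CALL."""
    return rsp % 16 == 0


def bytes_to_realign(rsp: int) -> int:
    """Bytes to subtract from RSP to reach the next lower 16-byte boundary."""
    return rsp % 16


# Hypothetical RSP values as they might appear in a debugger.
for rsp in (0x14FF28, 0x14FF20):
    if is_call_aligned(rsp):
        state = "aligned, OK to call"
    else:
        state = f"misaligned, sub rsp, {bytes_to_realign(rsp)} needed"
    print(f"RSP={rsp:#x}: {state}")
```

Because push and pop always move RSP by 8, the remainder is only ever 0 or 8 in practice, which is why checking the last hex digit is enough.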
Shadow Space
Shadow space (also called the "home space" or "spill area") is 32 bytes (4 × 8-byte slots) that the caller must allocate on the stack before every call. Even if a function takes zero arguments, you must allocate sub rsp, 0x20 worth of shadow space. The callee uses this space to spill its register parameters to memory if needed — but that's the callee's business. Your job as the caller is just to make sure it's there.
In practice, you'll almost always see sub rsp, 0x28 or sub rsp, 0x30 before a call — the extra 8 or 16 bytes are to maintain 16-byte alignment on top of the 32-byte shadow requirement.
Calling Convention — Passing Parameters
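Windows x64 uses a register-based calling convention for the first four arguments: RCX, RDX, R8, and R9, in that order. Arguments 5 and beyond go on the stack directly above the shadow space. From the caller's side, just before the call instruction, they sit at [RSP+0x20], [RSP+0x28], and so on; the callee sees the same slots at RSP+0x28, RSP+0x30, etc., because call pushes an 8-byte return address. This becomes one of the trickiest parts of the reverse shell modules, where several APIs take more than four arguments and STARTUPINFOA has many fields to populate.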
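A small lookup helper makes the two viewpoints explicit: where the caller writes each integer argument just before call, and where the callee finds it on entry once the return address has been pushed. The helper names are hypothetical, and the sketch covers integer and pointer arguments only.

```python
REGISTER_ARGS = ("RCX", "RDX", "R8", "R9")  # integer/pointer arguments 1-4
SHADOW_SPACE = 0x20                          # 4 slots x 8 bytes, always reserved


def arg_location(index: int) -> str:
    """Where the caller places 1-based argument `index` just before CALL."""
    if index < 1:
        raise ValueError("arguments are numbered from 1")
    if index <= 4:
        return REGISTER_ARGS[index - 1]
    return f"[RSP+{SHADOW_SPACE + 8 * (index - 5):#x}]"


def callee_view(index: int) -> str:
    """The same argument as the callee sees it on entry (return address pushed)."""
    if index <= 4:
        return arg_location(index)
    return f"[RSP+{SHADOW_SPACE + 8 + 8 * (index - 5):#x}]"


for i in range(1, 8):
    print(i, arg_location(i), callee_view(i))
```

The 8-byte difference between the two columns for arguments 5 and up is the return address, which is a frequent source of off-by-one-slot bugs.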
NASM is the assembler of choice for this course. Download the Windows binary from nasm.us and add it to your PATH. You'll also need a linker; MinGW-w64 (via winlibs) is recommended. The exact toolchain used in this course:
- NASM 2.16+: nasm.us/pub/nasm/releasebuilds
- MinGW-w64 (winlibs build): includes ld, gcc, and friends
- x64dbg: for debugging and verifying stack state during development
- Pepper PE Viewer: for manually walking PE headers (Module 2)
Standard compile+link pipeline for a standalone executable:
💡 nasm -f win64 shellcode.asm -o shellcode.obj
then
ld -m i386pep shellcode.obj -o shellcode.exe -lkernel32
For pure shellcode extraction (no PE wrapper), compile with NASM raw binary output:
nasm -f bin shellcode.asm -o shellcode.bin
then extract bytes with a Python script or objdump.
Understanding sub-registers is essential for NULL-free shellcode (Module 3) and for working with PE structure fields that mix WORD, DWORD, and QWORD sizes.
- RAX = full 64-bit register
- EAX = lower 32 bits; writing EAX zero-extends into RAX (important: clears the upper 32 bits)
- AX = lower 16 bits; writing AX does NOT zero-extend the upper bits
- AL = lower 8 bits; AH = bits 8-15
The zero-extension behavior of 32-bit writes is a powerful NULL-free trick: xor eax, eax is a 2-byte instruction that zeroes all 64 bits of RAX, whereas mov rax, 0 produces null bytes in the encoding.
💡 CL is the lower 8 bits of RCX. You'll see it used for shift/rotate amounts, as in shl rax, cl, when the shift count is stored in RCX.
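The write rules above can be modeled in Python to make the difference between 32-bit and narrower writes concrete. This is a toy model of the register file, with hypothetical function names; it illustrates the masking behavior only.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def write_eax(rax: int, value: int) -> int:
    """32-bit write: the upper 32 bits of RAX are cleared (zero-extension)."""
    return value & 0xFFFFFFFF


def write_ax(rax: int, value: int) -> int:
    """16-bit write: bits 16-63 of RAX are preserved."""
    return (rax & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write_al(rax: int, value: int) -> int:
    """8-bit write: bits 8-63 of RAX are preserved."""
    return (rax & ~0xFF & MASK64) | (value & 0xFF)


rax = 0x1122334455667788
print(hex(write_eax(rax, 0)))        # models xor eax, eax: whole register is zero
print(hex(write_ax(rax, 0xBEEF)))    # upper 48 bits survive
print(hex(write_al(rax, 0x00)))      # upper 56 bits survive
```

The practical consequence: clearing a register through AX or AL leaves stale upper bits behind, so only the 32-bit (or 64-bit) form is safe for zeroing.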
x64 is little-endian. When you load a string into a register as a 64-bit immediate and push it, the least significant byte of the register lands at the lowest address. That is why the immediate looks reversed in source, yet the bytes read correctly as a C string: the string pointer points to the lowest address.
To encode calc.exe as a 64-bit immediate, read the characters in reverse, convert each to hex, and concatenate: e=65, x=78, e=65, .=2E, c=63, l=6C, a=61, c=63 → 0x6578652E636C6163
💡 Python encoder: s = "calc.exe"; print(hex(int.from_bytes(s.encode(), 'little'))) prints the immediate directly, already in little-endian order.
⚠ Strings longer than 8 characters must be split across multiple pushes. Plan your string layout carefully: each push places exactly 8 bytes on the stack.
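For strings longer than one QWORD, the same conversion can be applied chunk by chunk. The helper below is a sketch with a hypothetical name; it zero-pads a short final chunk, which is convenient for a first test but not null-free.

```python
def push_immediates(text: str) -> list[int]:
    """Split an ASCII string into 8-byte little-endian QWORDs, in push order.

    The stack grows downward, so the LAST chunk of the string must be pushed
    FIRST; the list returned here is already in that order.
    """
    data = text.encode("ascii")
    chunks = [data[i:i + 8] for i in range(0, len(data), 8)]
    # A short final chunk is zero-padded here; null-free shellcode would use a
    # placeholder byte instead and clear it at runtime.
    return [int.from_bytes(chunk.ljust(8, b"\x00"), "little")
            for chunk in reversed(chunks)]


print([hex(q) for q in push_immediates("calc.exe")])
print([hex(q) for q in push_immediates("ws2_32.dll")])
```

Note that a string of exactly 8 characters, such as calc.exe, fills its QWORD completely, so its terminator has to come from a separate zero pushed beforehand.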
x64dbg is the primary debugging tool for shellcode development. Essential workflow:
- Set breakpoints on your code's entry point to inspect initial register/stack state
- Single-step (F7) through each instruction, watching the register panel
- Watch the stack panel and verify that RSP is divisible by 16 before every CALL
- Use the memory map view to verify allocated regions (VirtualAlloc verification)
- Set hardware breakpoints on memory addresses to catch unexpected writes
The register panel in x64dbg highlights changed registers in red after each step, which is invaluable for spotting unexpected clobbers from function calls and confirming your non-volatile register strategy is working.
💡 Before your CALL instruction, check the last hex digit of RSP in x64dbg's register panel. It must be 0. If it's 8, you need one more push or sub rsp, 8.
Learning Objectives
- Navigate the PEB to locate kernel32.dll base address without any imports
- Walk the PE export directory to resolve function addresses by name hash
- Understand the AddressOfNames → Ordinals → AddressOfFunctions lookup chain
- Use Pepper PE Viewer to manually verify offsets before coding them
- Write the foundational export-table-walking stub that every shellcode in this course builds on
Why Walk the PE Table?
Position-independent shellcode can't import functions the normal way — there's no Import Address Table (IAT) at a fixed address. Instead, we locate what we need at runtime by walking the Process Environment Block (PEB) to find loaded modules, then parsing the PE export table of each module to resolve function addresses by name. This technique is used in virtually every non-trivial shellcode payload in the wild.
The chain: GS:[0x60] → PEB → PEB_LDR_DATA → InMemoryOrderModuleList → kernel32 base → PE headers → Export Directory → AddressOfNames / AddressOfFunctions
PEB Walk — Locating kernel32.dll
The PEB (Process Environment Block) is accessible at GS:[0x60] in x64. It contains a pointer to PEB_LDR_DATA at offset 0x18, which in turn holds the head of InMemoryOrderModuleList at offset 0x20. This doubly-linked list contains every loaded module. In standard Windows processes the third entry is normally kernel32.dll, after the executable itself and ntdll.dll.
Export Directory Walk — Resolving WinExec
With the PE header address in hand, we navigate to the Export Directory and walk three parallel arrays: AddressOfNames (function name strings), AddressOfNameOrdinals (index mapping names to functions), and AddressOfFunctions (actual RVAs). The lookup: search AddressOfNames for our target string → get the ordinal index → use that to index into AddressOfFunctions → add kernel32 base to get the VMA.
Before writing PE-walking code, manually verify the offsets you expect to find using a PE viewer. Pepper (a free x64-capable PE viewer) is the tool used throughout this course. Load kernel32.dll into Pepper and navigate to the Export Directory.
- DOS Header → e_lfanew field at offset 0x3C: points to the PE signature
- Optional Header → Data Directory[0] at PE+0x88: Export Directory RVA
- Export Directory: +0x14 = NumberOfFunctions, +0x18 = NumberOfNames, +0x1C = AddressOfFunctions RVA, +0x20 = AddressOfNames RVA, +0x24 = AddressOfNameOrdinals RVA
Cross-reference what Pepper shows against what your shellcode computes in x64dbg. If they don't match, your offset arithmetic is wrong.
💡 On Windows 11, some PE viewers report that the AddressOfNames index aligns directly with the AddressOfFunctions index without needing the ordinal lookup. Test on your target OS, since behavior may vary.
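The same offsets can be checked outside a PE viewer with a short Python sketch. It assumes a PE32+ (64-bit) image and follows the IMAGE_EXPORT_DIRECTORY field order, in which NumberOfFunctions sits at +0x14 and NumberOfNames at +0x18. The function names are hypothetical, and converting an RVA to a file offset through the section table is deliberately left out.

```python
import struct


def export_directory_rva(image: bytes) -> int:
    """Return the Export Directory RVA from a PE32+ (64-bit) image buffer."""
    if image[:2] != b"MZ":
        raise ValueError("missing DOS header signature")
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)        # offset of "PE\0\0"
    if image[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    (rva,) = struct.unpack_from("<I", image, e_lfanew + 0x88)  # DataDirectory[0]
    return rva


def export_directory_fields(image: bytes, dir_offset: int) -> dict:
    """Decode the count and table-RVA fields that start at +0x14.

    `dir_offset` is an offset into `image`; for a file on disk the RVA must
    first be mapped to a file offset via the section table (not shown).
    """
    names = ("NumberOfFunctions", "NumberOfNames", "AddressOfFunctions",
             "AddressOfNames", "AddressOfNameOrdinals")
    return dict(zip(names, struct.unpack_from("<5I", image, dir_offset + 0x14)))
```

Comparing the values this returns with what Pepper displays is a quick way to catch a wrong offset before it ends up in assembly.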
The three arrays in the Export Directory work together as a cross-referenced lookup table. AddressOfNames and AddressOfNameOrdinals are parallel (NumberOfNames entries each), but AddressOfFunctions can be longer because it also includes functions exported by ordinal only, which have no name. AddressOfNames and AddressOfFunctions hold 4-byte RVAs, while AddressOfNameOrdinals holds 2-byte WORDs, so scale that index by 2 rather than 4 when reading it.
- AddressOfNames[i] → RVA of the name string for the i-th named export
- AddressOfNameOrdinals[i] → ordinal index into AddressOfFunctions for the i-th named export
- AddressOfFunctions[ordinal] → RVA of the actual function
So the flow is: scan AddressOfNames until you find your target function name → record index i → look up AddressOfNameOrdinals[i] to get ordinal j → look up AddressOfFunctions[j] to get the function RVA → add DLL base to get the VMA you can call.
⚠ Always add the DLL base address to RVAs. A common mistake is using the RVA directly as a call target. It will crash because RVAs are relative to the DLL base, not absolute addresses.
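The lookup chain is easier to see with plain Python lists standing in for the three arrays. The table contents and the base address below are made up for illustration; slot 1 of the function table is an ordinal-only export, which is exactly the case where skipping the ordinal lookup gives the wrong answer.

```python
def resolve_export(target, dll_base, names, name_ordinals, functions):
    """Model of the AddressOfNames -> Ordinals -> AddressOfFunctions chain.

    names[i]          name string of the i-th named export
    name_ordinals[i]  index into `functions` for that name
    functions[j]      RVA of the exported function
    """
    for i, name in enumerate(names):
        if name == target:
            j = name_ordinals[i]
            return dll_base + functions[j]   # RVA -> absolute address
    raise KeyError(target)


# Made-up tables: functions[1] is exported by ordinal only and has no name.
DLL_BASE = 0x7FF800000000
names = ["Alpha", "Beta", "Gamma"]
name_ordinals = [0, 2, 3]
functions = [0x1000, 0x1100, 0x1200, 0x1300]

print(hex(resolve_export("Beta", DLL_BASE, names, name_ordinals, functions)))
```

Indexing functions directly with the name index would return the ordinal-only export at 0x1100 for "Beta" instead of the correct 0x1200.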
An interesting observation noted during development of this course: on Windows 11 (tested on two separate machines), the AddressOfNames index appears to align directly with the AddressOfFunctions index — making the AddressOfNameOrdinals lookup potentially redundant for straightforward name-based resolution.
This behavior depends on how a particular DLL build lays out its export tables: the indexes line up only when AddressOfNameOrdinals[i] equals i for every named export, and nothing in the PE format guarantees that. The course code uses the full three-array lookup, which is correct and portable across all Windows versions. If you're targeting exclusively Windows 11 and want to simplify your code, test this observation on your specific build before relying on it.
💡 This is the kind of internals observation worth writing about: small deviations from documented behavior that practitioners discover empirically. Document your findings for the community.
With WinExec resolved into R15, executing calc.exe is straightforward. The key is correct string placement and parameter passing:
- Push a NULL terminator (using a zeroed register, not a literal 0 which creates null bytes)
- Push the "calc.exe" string bytes in little-endian order
- Set RCX = RSP (pointer to the string)
- Set RDX = 1 (SW_SHOWNORMAL)
- Allocate shadow space, call R15
At this point the shellcode contains null bytes, which is acceptable for a first test. Module 3 covers removing them for use in buffer overflows and shellcode loaders that use strcpy-style copying.
⚠ This version of the shellcode is NOT suitable for use via buffer overflow or most shellcode loaders due to embedded null bytes. Module 3 is required before operational use.
Learning Objectives
- Identify every source of null bytes in x64 assembly output using objdump
- Apply shift-based null termination to push strings without null bytes in shellcode
- Use XOR + ADD tricks for loading memory offsets that would otherwise produce nulls
- Verify null-free output and extract clean shellcode bytes
Why NULL Bytes Break Shellcode
Many classic shellcode delivery mechanisms — stack-based buffer overflows, format string exploits, and string-copy loaders — treat 0x00 as a string terminator. If your shellcode contains a null byte, the delivery mechanism stops copying at that point and your payload is truncated. NULL-free shellcode is not optional for real-world use — it's a requirement.
The good news: every instruction that produces a null byte has a null-free equivalent, just less obvious. The core techniques are shift-based string null-termination, XOR/ADD for zero values, and careful register sizing.
The fastest way to audit your shellcode for null bytes is objdump from the MinGW-w64 toolchain. After compiling your .asm file to a .obj, run:
💡 objdump -d -M intel shellcode.obj | grep -E " 00 | 00$"
Any match is a null byte in your machine code that needs to be eliminated.
Common null byte sources to look for:
- mov rax, 0: the zero immediate is encoded as null bytes (four or more, depending on the encoding the assembler picks). Replace with xor rax, rax
- mov rdx, 1: the immediate 0x01 is zero-padded to its full width. Use xor rdx, rdx; inc rdx
- push 0: use xor rax, rax; push rax instead
- String immediates with fewer than 8 significant bytes: high bytes are zero-padded → null bytes
After fixing all nulls, extract your shellcode bytes with:
objcopy -O binary shellcode.obj shellcode.bin
then verify with:
python3 -c "d=open('shellcode.bin','rb').read(); print('CLEAN' if 0 not in d else f'NULLS: {d.count(0)}')"
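When a blob is not clean, knowing where the nulls are is more useful than knowing how many there are. A minimal sketch, using two well-known instruction encodings as sample input; the function name is an assumption for illustration.

```python
def null_offsets(blob: bytes) -> list[int]:
    """Return the offset of every 0x00 byte in `blob`."""
    return [i for i, b in enumerate(blob) if b == 0]


# Sample input:
#   B8 00 00 00 00  -> mov eax, 0    (the immediate is four null bytes)
#   31 C0           -> xor eax, eax  (no null bytes)
samples = (("mov eax, 0", bytes.fromhex("b800000000")),
           ("xor eax, eax", bytes.fromhex("31c0")))

for label, code in samples:
    hits = null_offsets(code)
    status = "CLEAN" if not hits else f"nulls at offsets {hits}"
    print(f"{label:14} {status}")
```

The offsets map directly onto the address column of the objdump listing, so each hit can be traced back to the instruction that produced it.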
This is the most elegant null-free technique in the course and worth deeply understanding. The insight: a null byte is only a problem when it appears in the machine code (the shellcode bytes themselves). A null byte that only exists in memory at runtime (on the stack after a push) is perfectly fine.
The SHL/SHR pair creates a runtime null terminator:
- Load the string with a non-null placeholder byte (e.g., 0x90) in the highest position
- shl rax, 8 shifts the placeholder off the top and brings a 0x00 in at the bottom. This 0x00 is not in the shellcode; it is created by the CPU during execution
- shr rax, 8 shifts everything back, leaving the null at the top (MSB) of RAX
- push rax places the properly null-terminated string on the stack
The SHL and SHR instructions themselves encode as non-null bytes. The null only materializes as data during execution, which is exactly what we need.
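The effect of the shift pair can be traced in Python by masking to 64 bits, which mimics what the register does. The sketch uses the 7-character string "WinExec" with a 0x90 placeholder in the top byte; the helper names are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def shl64(value: int, count: int) -> int:
    """64-bit logical shift left; bits shifted past bit 63 are discarded."""
    return (value << count) & MASK64


def shr64(value: int, count: int) -> int:
    """64-bit logical shift right; zeros enter from the top."""
    return (value & MASK64) >> count


# "WinExec" is 7 characters, so byte 7 (the MSB) holds the 0x90 placeholder.
rax = 0x90636578456E6957
assert 0 not in rax.to_bytes(8, "little")   # the immediate itself is null-free

rax = shl64(rax, 8)                         # placeholder falls off the top
rax = shr64(rax, 8)                         # string returns, MSB is now 0x00
print(rax.to_bytes(8, "little"))            # bytes as they would sit on the stack
```

The terminator lands in the highest byte of the register, which is the highest address once pushed, so the string reads correctly from the lowest address up.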
Once you've confirmed no nulls with objdump, extract the raw shellcode bytes and test them in a C++ loader:
💡 Extract: for /f "tokens=1*" %a in ('objdump -d shellcode.obj') do @echo %b (or use the Python extraction script provided in the course materials).
The standard test harness used throughout this course:
- Declare shellcode as unsigned char shellcode[] with your extracted bytes
- Allocate RWX memory with VirtualAlloc(0, sizeof(shellcode), MEM_COMMIT|MEM_RESERVE, PAGE_EXECUTE_READWRITE)
- Copy shellcode to the allocation with memcpy
- Cast the allocation to a function pointer and call it
If it executes correctly, your shellcode is ready for use in exploit development or payload delivery. The Module 8 loader wraps this into a production-ready harness.
Learning Objectives
- Encode string literals using the NOT instruction to defeat static AV string scanning
- Use the Windows Calculator to pre-compute NOT values without writing a script
- Apply XOR encoding as an alternative to NOT for string obfuscation
- Understand the limits of single-instruction encoding vs. full encryption
Why Encode Strings in Shellcode?
Static analysis tools (antivirus, YARA rules, EDR file scanning) look for recognizable strings like WinExec, calc.exe, cmd.exe, and Windows API names in binary files. If your shellcode contains these as plaintext, a simple string scan will flag it before it ever executes. The NOT instruction gives us a one-instruction encode/decode that adds negligible overhead and removes those plaintext strings from the file.
You don't need a script to compute NOT-encoded string values. Windows Calculator in Programmer mode handles this directly:
- Open Calculator → switch to Programmer mode (Alt+3, or Alt+4 on Calculator builds that include Graphing mode)
- Set the word size to QWORD (64-bit)
- Enter your string's hex value (e.g., 90636578456E6957 for "WinExec" with 0x90 placeholder)
- Click the NOT button; the result is your encoded value
- Verify: paste the result back in and click NOT again. You should get your original value
💡 This is the workflow used for every encoded string in the course. No Python, no script, just the calculator. Fast and reliable for pre-computing a handful of strings.
For larger string sets or automated workflows, a one-liner works:
python3 -c "x=0x90636578456E6957; print(hex(~x & 0xFFFFFFFFFFFFFFFF))"
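For more than a handful of strings, the packing and the NOT can be combined in one helper that starts from the text rather than from a hand-built hex value. This is a sketch under the same conventions as the example above (little-endian packing, 0x90 placeholder); the function name is an assumption.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def not_encode(text: str, placeholder: int = 0x90) -> int:
    """Pack up to 8 ASCII bytes little-endian, pad with `placeholder`, then NOT."""
    data = text.encode("ascii")
    if len(data) > 8:
        raise ValueError("one QWORD holds at most 8 bytes")
    packed = int.from_bytes(data.ljust(8, bytes([placeholder])), "little")
    return ~packed & MASK64


encoded = not_encode("WinExec")
print(hex(encoded))
# Applying NOT a second time restores the original packed value.
print(hex(~encoded & MASK64))
```

One thing to watch for: any 0xFF byte in the packed value becomes 0x00 after NOT, so the encoded immediate should still be checked for nulls.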
NOT encoding defeats static string scanning — AV/EDR tools that look for plaintext strings in binary files. It does not defeat:
- Behavioral detection — the shellcode still calls the same APIs in the same order
- Dynamic analysis / sandboxing — the strings are decoded at runtime and visible in memory
- Memory scanning — after decoding, the strings exist in process memory
- YARA rules targeting the NOT-encoded values themselves (if the rule is updated)
For stronger evasion, the next layer is XOR-based payload encryption (encrypting the entire shellcode blob, not just individual strings) with runtime decryption. That topic is covered in advanced EDR evasion content beyond this 101 course.
⚠ Don't mistake string encoding for security. It's a static analysis speed bump, not a comprehensive evasion strategy. Modern EDR products with behavioral analysis will still catch a loader that calls VirtualAlloc, memcpy, and a function pointer in sequence.
Learning Objectives
- Use GetProcAddress and LoadLibraryA to load user32.dll and resolve MessageBoxA
- Manage the 4-parameter calling convention for MessageBoxA in x64 assembly
- Locate ExitProcess to cleanly terminate shellcode without crashing
- Build a reusable API resolution stub for use in more complex shellcode
Beyond kernel32 — Loading Additional DLLs
Our PE-walking technique from Module 2 locates functions in kernel32.dll. But shellcode often needs APIs from other DLLs — user32.dll for UI functions, ws2_32.dll for sockets, ntdll.dll for native APIs. The solution: use GetProcAddress and LoadLibraryA (both in kernel32) to dynamically load any DLL and resolve any function at runtime. This is the API resolution pattern used in the reverse shell in Modules 6 and 7.
With multiple API calls in flight, register management becomes critical. The strategy used throughout this course:
- R12: kernel32.dll base address (set once, never overwritten)
- R13: GetProcAddress function address
- R14: ExitProcess / current secondary API being used
- R15: primary call target (the API we're about to call)
- RDI: secondary DLL base (user32, ws2_32, etc.)
Before each API call, check that you haven't accidentally clobbered a non-volatile register. x64dbg's register panel makes this obvious: values that changed after a CALL in the non-volatile registers signal a bug in your convention usage (or the called API is non-conformant, which is rare but possible).
💡 When debugging, record your register assignments in comments (R12=kernel32_base, R13=GetProcAddress, etc.). Shellcode debugging without this discipline is a nightmare.
After your shellcode's payload executes, the instruction pointer needs somewhere to go. Without a clean exit, execution continues into whatever memory follows the shellcode — almost certainly crashing. Always end shellcode with a call to ExitProcess.
ExitProcess takes one parameter: the exit code (typically 0). Since it's in kernel32, we can resolve it via our PE walk or via GetProcAddress. The course resolves it via GetProcAddress after establishing the GetProcAddress function pointer.
⚠ In some shellcode scenarios (injected threads, callback-based execution), calling ExitProcess terminates the entire host process — not just your thread. In those cases, use ExitThread instead. Know your execution context before choosing your exit strategy.
Learning Objectives
- Understand the Winsock API chain required for a TCP reverse shell
- Use EXTERN declarations to link against ws2_32.lib and kernel32.lib
- Populate the STARTUPINFOA structure correctly in x64 assembly
- Redirect stdin/stdout/stderr to a socket handle for shell I/O
- Compile and link a functional reverse shell executable with MinGW-w64
The Extern Approach — Learning Before the Deep End
A full dynamic reverse shell (Module 7) requires resolving 6+ socket APIs manually via PE walking — that's 500+ lines of assembly and a significant complexity jump. Module 6 uses EXTERN declarations to link against the APIs directly, producing a clean, readable reverse shell that demonstrates the logic without the noise. Think of it as a scaffold: understand the control flow here, then Module 7 removes the training wheels.
Replace 0x6401A8C0 (192.168.1.100) and 0x5C11 (port 4444) with your actual attacker IP and port in network byte order before testing. Verify your listener is running: nc -lvnp 4444
STARTUPINFOA is notoriously painful in x64 assembly because its fields use mixed sizes (DWORD, WORD, QWORD) and x64 stack alignment requirements mean you can't just push them naively. The key fields for a reverse shell, with their x64 offsets:
- cb (DWORD, +0x00) = 0x68 (104 bytes, the size of the structure)
- dwFlags (DWORD, +0x3C) = STARTF_USESTDHANDLES (0x100), which tells CreateProcess to use hStdInput/Output/Error
- wShowWindow (WORD, +0x40) = 1 (SW_SHOWNORMAL) or 0 (hidden)
- hStdInput (HANDLE, +0x50) = socket handle
- hStdOutput (HANDLE, +0x58) = socket handle
- hStdError (HANDLE, +0x60) = socket handle
In x86 this is much easier because every pointer and handle is 4 bytes and no alignment padding is needed. In x64, the padding after cb and between the WORD fields and the next QWORD-aligned field is where most bugs hide.
💡 Use x64dbg to inspect the structure after pushing all the fields. Navigate to the RSP address and verify each field offset visually against the MSDN STARTUPINFOA documentation.
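The x64 field offsets can be derived rather than memorized by declaring the structure with ctypes and letting it apply natural alignment. In this sketch, pointers and handles are modeled as 64-bit integers so the layout does not depend on the operating system Python runs on; it does assume a 64-bit host where 8-byte integers are 8-byte aligned. The class name is hypothetical.

```python
import ctypes


class STARTUPINFOA64(ctypes.Structure):
    """x64 layout of STARTUPINFOA; pointers and handles modeled as 64-bit ints."""
    _fields_ = [
        ("cb", ctypes.c_uint32),
        ("lpReserved", ctypes.c_uint64),
        ("lpDesktop", ctypes.c_uint64),
        ("lpTitle", ctypes.c_uint64),
        ("dwX", ctypes.c_uint32),
        ("dwY", ctypes.c_uint32),
        ("dwXSize", ctypes.c_uint32),
        ("dwYSize", ctypes.c_uint32),
        ("dwXCountChars", ctypes.c_uint32),
        ("dwYCountChars", ctypes.c_uint32),
        ("dwFillAttribute", ctypes.c_uint32),
        ("dwFlags", ctypes.c_uint32),
        ("wShowWindow", ctypes.c_uint16),
        ("cbReserved2", ctypes.c_uint16),
        ("lpReserved2", ctypes.c_uint64),
        ("hStdInput", ctypes.c_uint64),
        ("hStdOutput", ctypes.c_uint64),
        ("hStdError", ctypes.c_uint64),
    ]


print(f"sizeof = {ctypes.sizeof(STARTUPINFOA64):#x}")
for name in ("cb", "dwFlags", "wShowWindow", "hStdInput", "hStdOutput", "hStdError"):
    print(f"{name:12} +{getattr(STARTUPINFOA64, name).offset:#04x}")
```

The two padding gaps (4 bytes after cb and 4 bytes after cbReserved2) are what push the handle fields out to 0x50 and beyond.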
WSASocketA and WSAConnect both take more than 4 parameters. Parameters 5+ go on the stack directly above the 0x20 bytes of shadow space: at the moment of the call they sit at [RSP+0x20], [RSP+0x28], etc. (the callee sees them at RSP+0x28, RSP+0x30 once the return address has been pushed). This is the first instance in this course with more than 4 parameters being passed.
The push order is reversed: push the last parameter first, working backward to the 5th. Then allocate shadow space with sub rsp, 0x20 and call. After the call, restore RSP by adding back (shadow space + pushed parameter bytes).
⚠ Getting the RSP math wrong for 5+ parameter functions is the #1 cause of crashes in this module. Count your pushes carefully and verify stack alignment before the call in x64dbg.
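The RSP bookkeeping for the push-then-allocate sequence can be worked out ahead of time. The sketch below assumes the sequence described above, with one addition: when an odd number of parameters is pushed, an 8-byte pad is reserved before the pushes so RSP is still 16-byte aligned at the call. The function name and the starting RSP values are hypothetical.

```python
SHADOW_SPACE = 0x20


def plan_pushed_call(rsp: int, stack_args: int) -> dict:
    """RSP math for: [pad] -> push args (last first) -> sub rsp, 0x20 -> call."""
    pushed = 8 * stack_args
    # Shadow space is a multiple of 16, so only the pushes can break alignment.
    pad = (rsp - pushed) % 16            # 0 or 8 when RSP starts 8-byte aligned
    rsp_at_call = rsp - pad - pushed - SHADOW_SPACE
    return {
        "pad": pad,                                  # reserve BEFORE the pushes
        "rsp_at_call": rsp_at_call,
        "fifth_arg": rsp_at_call + SHADOW_SPACE,     # [RSP+0x20] at the CALL
        "cleanup": pad + pushed + SHADOW_SPACE,      # add rsp, <cleanup> afterwards
    }


# WSASocketA has 6 parameters (2 on the stack); WSAConnect has 7 (3 on the stack).
for rsp, n in ((0x14FF00, 2), (0x14FF00, 3)):
    plan = plan_pushed_call(rsp, n)
    print(n, {key: hex(value) for key, value in plan.items()})
```

Reserving the pad before the pushes, not after, keeps the stack parameters contiguous with the shadow space, which is where the callee expects them.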
TCP socket structures use big-endian (network) byte order for IP addresses and ports. Since x64 is little-endian, you need to byte-swap before embedding the values in your shellcode.
Port 4444 conversion: 4444 decimal = 0x115C hex → big-endian bytes = 0x11 0x5C → as a WORD in little-endian storage = 0x5C11
IP 192.168.1.100 conversion: bytes are 0xC0 0xA8 0x01 0x64 → as a DWORD in little-endian storage = 0x6401A8C0
💡 Python: import socket; print(hex(socket.htonl(0xC0A80164))). htonl handles the byte swap for you. For ports: hex(socket.htons(4444))
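Starting from a dotted-quad string instead of a hand-converted hex value removes one manual step. The sketch below builds the network-order bytes explicitly and then reads them back as little-endian integers, so the result does not depend on the byte order of the machine running the script. The function name is an assumption.

```python
import socket
import struct


def sockaddr_immediates(ip: str, port: int) -> tuple[int, int]:
    """Return (ip_dword, port_word) as little-endian immediates whose
    in-memory bytes are in network (big-endian) order."""
    ip_bytes = socket.inet_aton(ip)          # b"\xc0\xa8\x01\x64" for 192.168.1.100
    port_bytes = struct.pack(">H", port)     # b"\x11\x5c" for 4444
    return (int.from_bytes(ip_bytes, "little"),
            int.from_bytes(port_bytes, "little"))


ip_dword, port_word = sockaddr_immediates("192.168.1.100", 4444)
print(hex(ip_dword), hex(port_word))
```

Either value can still contain a null byte (any IP octet of 0, or a port whose high or low byte is 0), which matters once the null-free requirement applies.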
Learning Objectives
- Dynamically resolve all Winsock APIs (WSAStartup, WSASocketA, WSAConnect) via PE walking
- Load ws2_32.dll at runtime using LoadLibraryA resolved from kernel32
- Build a complete NULL-free reverse shell in pure position-independent x64 assembly
- Understand why this is the "final exam" of x64 shellcode development
The Final Exam
This module is the culmination of everything in the course. No externs, no shortcuts — every API is resolved at runtime using the techniques from Modules 2–5. All strings are NULL-free using the techniques from Modules 3–4. The result is position-independent shellcode that can be extracted as raw bytes and executed in any context.
The code is long — 500+ lines — but if you've worked through the previous modules it's not magic. It's the same patterns repeated: PEB walk, PE export walk, string push, API call. The only new complexity is ws2_32.dll (which must be loaded via LoadLibraryA since it's not pre-loaded in most processes) and the full STARTUPINFOA structure population without any extern help.
Every Windows process has kernel32.dll and ntdll.dll pre-loaded in its address space — that's why we can find kernel32 via the PEB's InMemoryOrderModuleList without calling LoadLibrary first. But ws2_32.dll (Winsock) is not loaded by default in most processes. It must be explicitly loaded before its functions can be resolved.
The implication for shellcode: you need LoadLibraryA from kernel32 before you can get WSAStartup from ws2_32. This creates a dependency order: PEB walk → kernel32 base → LoadLibraryA + GetProcAddress → LoadLibraryA("ws2_32.dll") → GetProcAddress(ws2_32, "WSAStartup") etc.
💡 If your target process is a web server, database, or any network-aware application, ws2_32 is probably already loaded. You could skip LoadLibraryA and walk the PEB for it directly, but using LoadLibraryA is safer and works in all cases.
With this many API calls in play, disciplined register allocation is essential. The strategy used in the full course source:
- R8: kernel32.dll base (set once in the PEB walk; R8 is volatile and doubles as the 3rd parameter register, so it has to be saved around any call if the value is needed afterward)
- R13: GetProcAddress (preserved after initial resolution)
- R15: LoadLibraryA / current call target (repurposed as needed)
- RDI: ws2_32.dll base after LoadLibraryA call
- R12: socket handle (set after WSASocketA, preserved through CreateProcess)
- R14: ExitProcess / secondary resolved API
API addresses that are used only once are called immediately after resolution without storing in a non-volatile register — the volatile RAX return value is used directly. Only APIs called more than once (GetProcAddress, LoadLibraryA, ExitProcess) get permanent register homes.
After verifying the assembly compiles and runs correctly as an EXE, extract the raw shellcode bytes for use in a loader:
- nasm -f bin revshell_pure.asm -o revshell.bin (if using raw binary output, no PE wrapper)
- Or: compile to obj, then extract the .text section: objcopy -O binary -j .text revshell.obj revshell.bin
- Verify: python3 -c "d=open('revshell.bin','rb').read(); assert 0 not in d, 'NULLS FOUND'; print(f'CLEAN: {len(d)} bytes')"
- Format as C array: python3 -c "d=open('revshell.bin','rb').read(); print('unsigned char sc[] = {' + ','.join(hex(b) for b in d) + '};')"
💡 The course's final shellcode runs approximately 500-600 bytes. Future optimization using tighter function lookup loops can reduce this significantly — a topic for an advanced follow-on course.
Learning Objectives
- Build a C++ shellcode loader using VirtualAlloc + memcpy + function pointer
- Understand PAGE_EXECUTE_READWRITE vs. safer VirtualProtect patterns
- Embed shellcode as a C array and as a file-read from disk
- Understand why the loader itself is the primary detection surface for modern EDR
The Standard Shellcode Loader
A shellcode loader's job is simple: get the shellcode bytes into executable memory and transfer control to them. The standard approach — VirtualAlloc with PAGE_EXECUTE_READWRITE, memcpy, then cast and call — is functional but heavily signatured by modern EDR. This module teaches the baseline that everything else builds on.
Allocating memory as PAGE_EXECUTE_READWRITE (RWX) in one call is the most detectable loader pattern: EDR products specifically watch for VirtualAlloc with RWX permissions followed by a write and execution. A less obvious pattern uses two-step memory management:
- Allocate with PAGE_READWRITE (not executable)
- Write shellcode into the allocation
- Call VirtualProtect to change to PAGE_EXECUTE_READ
- Execute; memory is never simultaneously writable and executable
This pattern is less conspicuous to EDR and closer to how legitimate JIT compilers work. It doesn't eliminate detection but raises the behavioral analysis bar.
⚠ Neither pattern defeats modern behavioral EDR. The real evasion work happens at the loader level (process injection, stomping, indirect syscalls), topics covered in advanced EDR bypass content beyond this course.
Embedding shellcode as a hardcoded C array is simple but means the loader binary itself contains the shellcode, so file-scanning AV will find it. An alternative is to read shellcode from a file at runtime:
- Store shellcode in an external file (optionally encrypted)
- Open with CreateFileA + ReadFile at runtime
- Decrypt if necessary, then VirtualAlloc + execute
This separates the loader from the payload: the loader binary is clean, and the shellcode file can be fetched from a remote URL, read from an alternate data stream, or stored in the registry to further complicate forensic recovery.
💡 For lab use, embedding as a C array is fine. For red team engagements, always separate loader from payload and consider encrypting the payload at rest.
Modern EDR products don't just scan for shellcode signatures; they monitor the behavioral sequence of API calls that loaders make. The VirtualAlloc → WriteProcessMemory/memcpy → VirtualProtect → CreateThread (or direct function-pointer call) sequence is so well-known that it has dedicated behavioral detection rules in virtually every enterprise EDR product.
The shellcode itself being NULL-free and string-encoded matters primarily for static file scanning. Once you're in memory and executing, the EDR's eyes are on the loader's API call sequence, not the shellcode bytes.
Key detection telemetry a SOC analyst sees from a standard shellcode loader:
- ETW events for VirtualAlloc with executable permissions
- Sysmon Event 8 (CreateRemoteThread) if injection is used
- Network connections from the shellcode's reverse shell (Sysmon Event 3)
- cmd.exe spawned with unusual parent process (Sysmon Event 1)
⚠ This is why understanding the defender's perspective matters: knowing what EDR sees from your loader is as important as knowing how the shellcode works. The next course in this series covers advanced loader techniques and EDR evasion.
You've now completed the full x64 Assembly and Shellcoding 101 curriculum. Here's where these skills lead:
- Advanced x64 Assembly — tighter function lookup loops, position-independent data access via RIP-relative addressing, SYSCALL-based API resolution bypassing ntdll hooks
- EDR Evasion and Shellcode Loaders — process injection techniques, DLL stomping, indirect syscalls, sleep obfuscation, and shellcode encryption
- Exploit Development — applying your shellcode in buffer overflow, use-after-free, and heap exploitation contexts
- Reverse Engineering — reading other people's shellcode in x64dbg / Ghidra with the internals knowledge from this course
💡 The best next step is writing your own variants of everything in this course from scratch — without the notes. That's when you'll know you've actually internalized x64 assembly.
Learning Objectives
- Write a TEB-based kernel32 locator that defeats EDR hooks on the standard PEB walk path
- Use Python on Windows to extract raw shellcode bytes from a compiled .obj — no Linux VM required
- Apply Bitwise NOT + XOR encoding in a single Python script to produce static-analysis-resistant shellcode
- Understand how embedding the key inside the encoded payload enables self-decoding without hardcoding its position
- Use the assembly junk-instruction inserter to produce different bytes on every compilation run
Why a New Kernel32 Walk? — Defeating EDR Hooks
The standard PEB walk from Module 2 traverses InMemoryOrderModuleList and trusts list position to find kernel32 — the third entry. This works on clean systems, but some EDR products (notably Avast) hook the initial loader modules. Walking by position can return the hooked version rather than the real kernel32 base.
The solution: instead of trusting position, search the module list for a module whose Unicode name starts with KERN. Unless the EDR names their hook KERNxx.dll, you skip right past the hook and land on real KERNEL32.DLL. This approach also starts from the TEB rather than directly from GS:[0x60] — a subtle but meaningful structural difference that adds resilience against intercepted fast paths.
Walk chain: GS:[0x30] → TEB base → [TEB+0x60] → PEB → [PEB+0x18] → PEB_LDR_DATA → [Ldr+0x10] → InMemoryOrderModuleList → iterate checking Unicode name bytes for KERN.
0x004E00520045004B encodes "K E R N" as UTF-16LE WORD pairs: K=004B, E=0045, R=0052, N=004E stored as an 8-byte little-endian immediate. A single cmp rbx, rdx checks all four characters simultaneously.
Tool 1 — findhex.py: Windows-Native Shellcode Extraction
Extracting raw shellcode bytes from a compiled .obj file historically required Linux tools. This Python script runs on Windows using the MinGW-bundled objdump. It parses the disassembly output, strips the byte columns, and outputs them as \xNN escape sequences ready to paste into a loader — no VM, no Linux, no context switch.
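As a rough sketch of the parsing step described above (not the course's actual findhex.py), the snippet below pulls the byte column out of objdump-style disassembly text and formats it as \xNN escapes. The sample text, the regular expression, and the function names `extract_bytes` and `to_escapes` are assumptions made for illustration; real objdump output may need a looser pattern.

```python
import re

# Hypothetical objdump -d style lines: "<offset>:\t<hex bytes>\t<mnemonic>".
SAMPLE = (
    "   0:\t48 31 c0             \txor    %rax,%rax\n"
    "   3:\t90                   \tnop\n"
    "   4:\tc3                   \tret\n"
)

# Offset, colon, tab, then one or more "hh " byte pairs.
LINE = re.compile(r"^\s*[0-9a-f]+:\t((?:[0-9a-f]{2} )+)", re.IGNORECASE)


def extract_bytes(disasm: str) -> bytes:
    """Collect the byte column from every disassembly line that has one."""
    out = bytearray()
    for line in disasm.splitlines():
        match = LINE.match(line)
        if match:
            out += bytes.fromhex(match.group(1).replace(" ", ""))
    return bytes(out)


def to_escapes(data: bytes) -> str:
    """Format bytes as \\xNN escape sequences."""
    return "".join(f"\\x{b:02x}" for b in data)


print(to_escapes(extract_bytes(SAMPLE)))  # \x48\x31\xc0\x90\xc3
```

Lines without a byte column (section headers, symbol labels) simply fail the match and are skipped, which is why the pattern anchors on the offset-colon-tab prefix.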
Tool 2 — NOT + XOR Encoder with Embedded Key
This encoder applies Bitwise NOT to every byte first, then XORs with a chosen key (default 0xAC). The result contains no recognizable API name strings and no common shellcode byte patterns. What makes it especially useful: the XOR key is embedded inside the encoded payload at position key_value % payload_length. Change the key and both the encoded bytes and the key's position in the output change — two layers of variability from one parameter.
The decoder stub reverses in order: XOR each byte with the key, then NOT — both operations are self-inverse so the decode is structurally identical to the encode.
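The self-inverse property is easy to check on harmless data. The following is a minimal sketch, assuming a single-byte key; the function names `not_xor` and `xor_not` are illustrative and not taken from not_xor_encoder.py.

```python
def not_xor(data: bytes, key: int) -> bytes:
    """Encode direction: bitwise NOT each byte, then XOR with the key."""
    return bytes((~b & 0xFF) ^ key for b in data)


def xor_not(data: bytes, key: int) -> bytes:
    """Decode direction: XOR each byte with the key, then bitwise NOT."""
    return bytes(~(b ^ key) & 0xFF for b in data)


sample = b"round-trip test"  # placeholder bytes, not a payload
key = 0xAC

encoded = not_xor(sample, key)
assert xor_not(encoded, key) == sample

# NOT is the same as XOR with 0xFF, so the two steps commute and
# collapse into a single XOR with (key ^ 0xFF).
assert encoded == bytes(b ^ (key ^ 0xFF) for b in sample)
```

The second assertion shows why encode and decode are structurally identical: NOT followed by XOR with `key` is equivalent to one XOR with `key ^ 0xFF`, so applying either ordering twice restores the input.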
Tool 3 — The NOT+XOR Decoder in x64 Assembly
With the shellcode encoded, the runtime decoder needs to reverse both operations in correct order: XOR each byte with the key first, then NOT each byte. Because both NOT and XOR are self-inverse, the decode loop is structurally identical to how you'd write the encode — just applied at runtime in memory rather than at script time.
The key lives inside the encoded shellcode at a known index position — printed by the encoder script. For the example below, key index 38 was chosen: mov r9b, [rel encoded_shellcode + 38]. The decoder reads the key directly from the payload, walks every byte applying XOR then NOT, then reloads the base address and jumps to the now-restored shellcode via jmp rax.
The -N linker flag (--omagic) marks the .text section writable as well as executable. This is required for the standalone test binary because the decoder writes decoded bytes back into encoded_shellcode in place, and that buffer lives in .text. The C++ loader below uses PAGE_EXECUTE_READWRITE VirtualAlloc memory instead, so -N is not needed there.
Tool 4 — Alpha/Mix Encoding: Converting to ASCII-Printable Shellcode
The final encoding layer converts the complete payload (decoder stub + encoded shellcode) into a mixed ASCII/hex format where each byte is expressed as its printable ASCII character if one exists, and as a \xNN hex escape otherwise. This is the "alpha/mix" format — not purely alphanumeric, but as human-readable as the byte values allow, and compatible with C string literal delivery.
The workflow: compile decoder.asm to a .obj, run findhex.py on it to extract all bytes (decoder stub bytes + inline encoded shellcode), then pass that full byte string through the alpha/mix script. The output pastes directly into a C const unsigned char shellcode[] initializer, because the compiler concatenates adjacent string literals automatically.
The Final C++ Loader
The alpha/mix output pastes directly into the shellcode[] array. The decoder stub runs first, decodes the embedded payload in-place, then jmp rax executes the original TEB-walk calc shellcode. PAGE_EXECUTE_READWRITE is required because the decoder modifies its own payload bytes at runtime — read+execute alone is insufficient.
Normally a linked executable's `.text` section is marked read+execute but not writable. The decoder writes decoded bytes back to `encoded_shellcode`, which lives in `.text`. Without `-N`, this write triggers an access violation before a single byte is decoded. `-N` (also known as `--omagic`) tells the GNU linker to mark the text segment as writable, giving it read+write+execute permissions. This is only needed for the standalone `decoder.exe` test; when using the C++ loader, the shellcode lives in a `PAGE_EXECUTE_READWRITE` VirtualAlloc region, which is already writable by definition.
💡 A quick test workflow: compile decoder.asm with nasm + ld -N, run decoder.exe, verify calc appears. Then extract bytes with findhex.py, run alpha_mix.py, paste into loader.cpp, compile and run. Same result: this confirms the full pipeline end to end before you swap in a real payload.
Pure alphanumeric encoding (only A-Za-z0-9 bytes) requires a specialized encoder like ALPHA3 that transforms every byte to fall in that ASCII range, at the cost of roughly 2–3x payload size expansion and an additional alphanumeric decoder stub on top. The alpha/mix approach is simpler and avoids the size penalty: just express each byte as its printable ASCII character if it has one, and leave non-printable bytes as `\xNN`.
The benefits of alpha/mix for this use case:
- Visually obscures the payload in source code: a mix of Latin characters, symbols, and hex escapes looks far less like shellcode than a dense block of `\xNN\xNN\xNN`
- No payload size expansion: one byte stays one byte
- Direct C string literal compatibility: the compiler concatenates adjacent string literals automatically
- The three special-case handlers for `0x27`, `0x22`, and `0x20` prevent C string literal syntax errors that would break compilation
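The formatting rule itself can be sketched in a few lines. This is an illustrative approximation, not the course's alpha_mix.py. It assumes that the three special-case bytes are simply emitted as hex escapes, it also escapes the backslash (0x5C), and it splits the literal after a hex escape when the next printable character is a hex digit, because C hex escapes consume every hex digit that follows. The name `alpha_mix` and the sample bytes are placeholders.

```python
import string

# Printable bytes still written as \xNN (assumption): double quote,
# single quote, space, and backslash.
FORCE_HEX = {0x22, 0x27, 0x20, 0x5C}


def alpha_mix(data: bytes) -> str:
    """Render bytes as a C string literal mixing ASCII and \\xNN escapes."""
    parts = ['"']
    after_hex = False
    for b in data:
        if 0x21 <= b <= 0x7E and b not in FORCE_HEX:
            ch = chr(b)
            # "\x00A" would be read as one oversized hex escape in C,
            # so close and reopen the literal before a hex-digit character.
            if after_hex and ch in string.hexdigits:
                parts.append('""')
            parts.append(ch)
            after_hex = False
        else:
            parts.append(f"\\x{b:02x}")
            after_hex = True
    parts.append('"')
    return "".join(parts)


print(alpha_mix(b"Hi\x00Ab!\x90z"))  # "Hi\x00""Ab!\x90z"
```

The `""` in the output is where adjacent-literal concatenation matters: the compiler joins the two pieces back into one string, while the break stops `\x00` from absorbing the `A` that follows it.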
Index 38 is just one of the valid key positions the encoder found. Any index where the encoded output byte equals the XOR key value (0xAC) is a valid choice. The encoder prints all such positions; you pick one and hardcode it as the offset in `mov r9b, [rel encoded_shellcode + N]`.
Choosing a different valid index changes the decoder's instruction bytes, because a different displacement is encoded in the MOV, so the assembled stub differs as well. That gives yet more variability in the final payload signature.
You can also change the XOR key entirely: pick a different key in `not_xor_encoder.py`, and the encoder will produce a completely different encoded payload with different valid index positions. You then update the index in the decoder assembly accordingly. Every combination produces different bytes everywhere in the pipeline.
⚠ After changing the key or index, always recompile and test the decoder standalone before embedding in a loader. A wrong index reads the wrong byte as the key, decodes to garbage, and crashes silently.
The full pipeline from source to deployable shellcode, end to end on Windows:
- calc.asm — TEB-based kernel32 finder + WinExec, NULL-free, bypasses EDR position-based hooks
- not_xor_encoder.py — Bitwise NOT + XOR encoding with self-embedded key at deterministic index
- decoder.asm — x64 stub that reads key from embedded position, XOR+NOT decodes in-place, jmp rax executes
- findhex.py — Windows-native .obj byte extraction, no Linux VM required
- alpha_mix.py — converts binary payload to mixed ASCII/hex C string literal format
- loader.cpp — VirtualAlloc RWX + memcpy + call — runs on fully patched Windows with Defender active
No msfvenom. No Metasploit. No Linux toolchain. Every tool in this pipeline was written from scratch and runs natively on Windows.
Full Pipeline — From .asm to Final Alpha/Mix Shellcode
Both paths reach the PEB, but the standard `mov rax, [gs:0x60]` shortcut is a well-known and well-monitored access pattern. Some EDR products hook or monitor this specific GS segment offset access to detect shellcode performing PEB walks.
Going through the TEB explicitly, with `gs:[0x30]` for the TEB base and then `[rax+0x60]` for the PEB, is architecturally equivalent but takes a different code path. It also mirrors how Windows itself navigates these structures internally, making it harder to distinguish from legitimate code.
💡 In WinDbg: `dt nt!_TEB @$teb` shows the TEB layout. ProcessEnvironmentBlock is at offset 0x060. `dt nt!_PEB @$peb` shows the PEB layout, where Ldr is at 0x018.
Windows stores module names as UTF-16LE strings. Each ASCII character becomes a 2-byte WORD: the ASCII value in the low byte, 0x00 in the high byte. Reading "KERN" as four UTF-16LE WORDs:
- K = 0x004B, E = 0x0045, R = 0x0052, N = 0x004E
- In memory (little-endian): `4B 00 45 00 52 00 4E 00`
- As a 64-bit little-endian immediate: `0x004E00520045004B`
Loading `[rcx]` (the first 8 bytes of the name buffer) into RBX and comparing it against this value checks all four characters in a single comparison. It's both efficient and NULL-free.
💡 To search for a different DLL: take the first 4 characters of its name, encode each as a WORD (char + 0x00), then build the 8-byte little-endian immediate. For ntdll.dll: N=0x004E T=0x0054 D=0x0044 L=0x004C → `0x004C00440054004E`.
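The immediate can be derived mechanically rather than by hand. The helper below is a small sketch; `name_prefix_immediate` is a made-up name, and it assumes a four-character prefix that encodes to exactly 8 bytes of UTF-16LE.

```python
def name_prefix_immediate(prefix: str) -> int:
    """Return the 64-bit value seen when the first 8 bytes of a UTF-16LE
    module name are loaded as one little-endian QWORD."""
    raw = prefix.encode("utf-16-le")
    if len(prefix) != 4 or len(raw) != 8:
        raise ValueError("prefix must be 4 characters that fit in 8 bytes")
    return int.from_bytes(raw, "little")


print("KERN".encode("utf-16-le").hex(" "))         # 4b 00 45 00 52 00 4e 00
print(f"0x{name_prefix_immediate('KERN'):016X}")   # 0x004E00520045004B
print(f"0x{name_prefix_immediate('NTDL'):016X}")   # 0x004C00440054004E
```

The comparison is case-sensitive at the byte level, so the prefix passed in has to match the casing that actually appears in the name buffer being compared.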
Embedding the key at `position = xor_key % len(payload)` means the key byte position is a function of the key value itself. Change `xor_key` from `0xAC` to `0x7F` and two things change simultaneously:
- All encoded bytes change: a different XOR key produces completely different output bytes
- The key's position in the output changes: `0x7F % len` vs `0xAC % len` are almost certainly different offsets
This means a static signature targeting "the key byte is at offset N" is invalidated just by changing the key. The decoder stub has to read the key from that same offset, so the index in its key-load instruction must match the position the encoder used.
⚠ Always verify the round-trip after changing the key. Some keys produce bad characters (0x00, 0x20, 0x0A, 0x0D) in the encoded output that will break delivery through string-handling functions. Test with your specific delivery mechanism.
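Both checks are simple arithmetic and can be sketched as follows. The payload length of 100 and the function names `key_position` and `bad_char_offsets` are hypothetical; the bad-character set is the one listed in the warning above.

```python
BAD_CHARS = {0x00, 0x20, 0x0A, 0x0D}


def key_position(xor_key: int, payload_len: int) -> int:
    """Offset given by the position = xor_key % len(payload) rule."""
    if payload_len <= 0:
        raise ValueError("payload must not be empty")
    return xor_key % payload_len


def bad_char_offsets(encoded: bytes) -> list[int]:
    """Offsets of bytes that commonly break string-based delivery."""
    return [i for i, b in enumerate(encoded) if b in BAD_CHARS]


payload_len = 100  # hypothetical length, for illustration only
print(key_position(0xAC, payload_len))        # 72
print(key_position(0x7F, payload_len))        # 27
print(bad_char_offsets(b"\x41\x00\x42\x20"))  # [1, 3]
```

An empty list from `bad_char_offsets` only says the listed bytes are absent; other delivery paths may have their own forbidden bytes, so the set should be adjusted to match.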
Even when RSP and RBP appear absent from the shellcode's explicit instructions, they always have implicit roles. RSP is the active stack pointer — always in use, always changing with every push/pop/call/ret. Inserting junk that modifies RSP would immediately corrupt the stack and crash execution.
RBP is excluded defensively — even if the shellcode doesn't explicitly use it, the compiler or linker may use frame-pointer conventions that depend on RBP being stable. It's excluded from the candidates list unconditionally:
`[r for r in candidates if r not in ['rsp', 'rbp']]`
For the calc.asm TEB-walk shellcode, the registers available for junk injection are typically r12, r13, r14: the non-volatile callee-saved registers not needed in the KERN search or WinExec call chain. The script prints them on stderr when it runs so you can verify.