NINA: x64 Process Injection
In this post, I will detail an experimental process injection technique with a hard restriction on the use of common and "dangerous" functions, i.e. WriteProcessMemory, VirtualAllocEx, VirtualProtectEx, CreateRemoteThread, NtCreateThreadEx, QueueUserAPC, and NtQueueApcThread. I've called this technique NINA: No Injection, No Allocation. The aim of this technique is to be stealthy (obviously) by reducing the number of suspicious calls without the need for complex ROP chains. The PoC can be found here: https://github.com/NtRaiseHardError/NINA.
Tested environments:
- Windows 10 x64 version 2004
- Windows 10 x64 version 1903
Implementation: No Injection
Let's start with a solution that removes the need for data injection.
The most basic process injection requires a few basic ingredients:
- A target address to contain the payload,
- Passing the payload to the target process, and
- An execution operation to execute the payload
To keep the focus on the No Injection section, I will use the classic VirtualAllocEx to allocate memory in the remote process. It is important to keep pages from having write and execute permissions at the same time, so RW should be set initially and then re-protected with RX after the data has been written. Since I will discuss the No Allocation method later, we can set the pages to RWX for now to keep things simple.
If we restrict ourselves from using data injection, it means that the malicious process does not use WriteProcessMemory to directly transfer data from itself into the target process. To handle this, I was inspired by the reverse ReadProcessMemory documented by Deep Instinct's (complex) "Inject Me" process injection technique (shared with me by @slaeryan). There exist other methods of passing data into a process: using GlobalGetAtomName (from the Atom Bombing technique), and passing data through either the command line options or environment variables (with the CreateProcess call to spawn a target process). However, these three methods have one small limitation in that the payload must not contain NULL characters. Ghost Writing is also an option but it requires a complex ROP chain.
To gain execution, I've opted for a thread hijacking style technique using the crucial SetThreadContext function, since we cannot use CreateRemoteThread, NtCreateThreadEx, QueueUserAPC, or NtQueueApcThread.
Here is the procedure:
- CreateProcess to spawn a target process,
- VirtualAllocEx to allocate memory for the payload and a stack,
- SetThreadContext to force the target process to execute ReadProcessMemory,
- SetThreadContext to execute the payload.
CreateProcess
There are some considerations to take into account when using this injection technique. The first comes from the CreateProcess call. Although this technique does not rely on CreateProcess, there are some reasons why it may be advantageous to use it instead of something like OpenProcess or OpenThread. One reason is that there is no remote (external) process access to obtain handles, which could otherwise be detected by monitoring tools, such as Sysmon, that use ObRegisterCallbacks. Another reason is that it allows for the two aforementioned data injection methods using the command line and environment variables. If you're creating the process, you could also leverage blockdlls and ACG to defeat antivirus user-mode hooking.
VirtualAllocEx
Of course the target process needs to be able to house the payload but this technique also requires a stack. This will be made clear shortly.
ReadProcessMemory
To use this function in a reversed manner, we must consider two issues: passing argument five on the stack and using a valid process handle to our own malicious process. Let's look at the issue with the fifth argument first:
Using SetThreadContext only allows for the first four arguments on x64. If we read the description for lpNumberOfBytesRead, we can see that it's optional:
A pointer to a variable that receives the number of bytes transferred into the specified buffer. If lpNumberOfBytesRead is NULL, the parameter is ignored.
Luckily, if we use VirtualAllocEx to create pages, the function will zero them:
Reserves, commits, or changes the state of a region of memory within the virtual address space of a specified process. The function initializes the memory it allocates to zero.
Setting the stack to the zero-allocated pages will provide a valid fifth argument.
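To make the "fifth argument" point concrete, here is a minimal sketch of where a callee expects its stack slots under the Windows x64 calling convention. It only does address arithmetic on a hypothetical RSP value; the function name entry_layout is made up for illustration and nothing here touches a real process.

```python
# Model of the Windows x64 stack as seen at the first instruction of a callee.
SHADOW_SPACE = 0x20  # four 8-byte home slots reserved for RCX, RDX, R8 and R9


def entry_layout(rsp):
    """Return the addresses of the slots a callee expects relative to RSP."""
    return {
        "return_address": rsp,                       # pushed by the caller's call
        "shadow_space": (rsp + 8, rsp + 8 + SHADOW_SPACE - 1),
        "arg5": rsp + 8 + SHADOW_SPACE,              # first stack-passed argument
    }


if __name__ == "__main__":
    layout = entry_layout(0x1000)  # hypothetical RSP value
    print(hex(layout["arg5"]))
```

If RSP points into freshly zeroed memory, the slot at RSP + 0x28 already reads as NULL, which is why lpNumberOfBytesRead needs no explicit setup.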
The second problem is the process handle passed to ReadProcessMemory. Because we're trying to get the target process to read our malicious process, we need to give it a handle to our process. This can be achieved using the DuplicateHandle function. It will be given our current process handle and return a handle which can be used by the target process.
SetThreadContext
SetThreadContext is a powerful and flexible function that allows reads, writes, and executes. But there is a known issue with using it to pass fastcall arguments: the volatile registers RCX, RDX, R8 and R9 cannot be reliably set to desired values. Consider the following code:
If we execute this code, we expect the volatile registers to hold their correct values when the target thread reaches ReadProcessMemory. However, this is not what happens in practice:
For some unknown reason, the volatile registers are changed, which makes this technique unusable. RCX is not a valid handle to a process, RDX is zero and R9 is too big. There is a method that I have discovered that allows volatile registers to be set reliably: simply set RIP to an infinite jmp -2 loop before using SetThreadContext. Let's see it in action:
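For reference, the loop gadget itself is just two bytes. The sketch below (an illustration, not the author's code) decodes a short jmp and confirms that the encoding EB FE jumps back to its own address; short_jmp_target and JMP_SELF are names invented here.

```python
def short_jmp_target(address, encoding):
    """Return the destination of a 2-byte `jmp rel8` located at `address`."""
    if len(encoding) != 2 or encoding[0] != 0xEB:
        raise ValueError("not a short jmp")
    # rel8 is a signed displacement relative to the end of the instruction.
    rel8 = int.from_bytes(encoding[1:], "little", signed=True)
    return address + 2 + rel8


JMP_SELF = bytes([0xEB, 0xFE])  # jmp -2

if __name__ == "__main__":
    print(hex(short_jmp_target(0x401000, JMP_SELF)))
```

Because the displacement is -2 and the instruction is two bytes long, the target is the instruction itself, so the thread spins in place until its context is changed again.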
The infinite loop can be executed using SetThreadContext, then ReadProcessMemory can be called with the correct volatile registers:
Now we need to handle the return. Note that we allocated and pivoted to our own stack. If we can use ReadProcessMemory to read the shellcode into the stack location at RSP, we can set the first 8 bytes of the shellcode so that it will ret back into itself. Here is an example:
RSP and R8 point to 000001F457C21000. The addresses going upwards will be used for the stack in the ReadProcessMemory call. The target buffer where the shellcode will be written is from R8 downwards. When ReadProcessMemory returns, it will use the first 8 bytes of the shellcode as the return address to 000001F457C21008, where the real shellcode starts:
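The buffer layout described above can be sketched as plain byte packing. This is only an illustration of the layout: ret_into_self is a hypothetical helper, and the single 0xCC byte is a placeholder standing in for the real body.

```python
import struct


def ret_into_self(buffer_address, body):
    """Prefix `body` with an 8-byte little-endian pointer to itself."""
    # The first qword sits where the return address is expected, so a ret
    # lands on the byte immediately after it.
    return struct.pack("<Q", buffer_address + 8) + body


if __name__ == "__main__":
    blob = ret_into_self(0x000001F457C21000, b"\xCC")  # 0xCC: placeholder body
    print(hex(struct.unpack_from("<Q", blob)[0]))
```

With the address from the example, the first qword decodes to 000001F457C21008, matching the return target shown in the text.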
Implementation: No Allocation
Let's now discuss how we can improve by removing the need for VirtualAllocEx. This is a bit less trivial than the previous section because there are some initial issues that arise:
- How will we set up the stack for ReadProcessMemory?
- How will the shellcode be written and executed using ReadProcessMemory if there are no RWX sections?
But why should we need to allocate memory when it's already there for us to use? Keep in mind that if any existing pages in memory are affected, care needs to be taken to not overwrite any critical data if the original execution flow should be restored.
The Stack
If we cannot allocate memory for the stack, we can find an empty RW page to use. If the NULL fifth argument for ReadProcessMemory is a concern, that can be easily solved. If we don't want to overwrite potentially critical data, we can take advantage of section padding within possible RW pages that lie within the executable image. Of course, this assumes that there is padding available.
To locate RW pages within the executable image's memory range, we can locate the image's base address through the Process Environment Block (PEB), then use VirtualQueryEx to enumerate the range. This function returns information such as each region's protection and size, which can be used to find any existing RW pages and check whether they're appropriately sized for the shellcode.
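The filtering step can be modelled on already-collected region records. The sketch below assumes the regions have been gathered elsewhere and reduced to a hypothetical Region tuple; only the MEM_COMMIT and PAGE_READWRITE constant values come from the Windows headers.

```python
from collections import namedtuple

MEM_COMMIT = 0x1000       # region state: committed
PAGE_READWRITE = 0x04     # region protection: read/write, no execute

# Hypothetical record holding the fields of interest from each query result.
Region = namedtuple("Region", "base size state protect")


def rw_candidates(regions, min_size):
    """Keep committed RW regions that are at least `min_size` bytes."""
    return [
        r for r in regions
        if r.state == MEM_COMMIT
        and r.protect == PAGE_READWRITE
        and r.size >= min_size
    ]


if __name__ == "__main__":
    sample = [
        Region(0x140001000, 0x3000, MEM_COMMIT, 0x20),            # RX
        Region(0x140004000, 0x1000, MEM_COMMIT, PAGE_READWRITE),  # RW
    ]
    print([hex(r.base) for r in rw_candidates(sample, 0x200)])
```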
After locating the correct page, the position of the stack should be enumerated upwards from the bottom of the page (due to the nature of stacks) and a 0x0000000000000000 value should be found for ReadProcessMemory's fifth argument. This means that we need to make sure the stack offset is at least 0x28 from the bottom plus space for the shellcode.
Here is some code that demonstrates this:
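As a rough sketch of the search logic only (not the PoC's code), the following works on a local copy of the page's bytes. It reads "upwards from the bottom" as scanning from the highest usable offset towards lower ones, which is an interpretation; find_stack_offset and its parameters are hypothetical names.

```python
ARG5_OFFSET = 0x28  # distance from RSP to the fifth argument at function entry
QWORD = 8


def find_stack_offset(page, payload_size):
    """Return the highest offset whose fifth-argument slot is zero, or None."""
    # Leave room below the candidate for the larger of the argument slots
    # and the payload that will be copied to this location.
    needed = max(ARG5_OFFSET + QWORD, payload_size)
    start = (len(page) - needed) & ~(QWORD - 1)
    for offset in range(start, -1, -QWORD):
        slot = page[offset + ARG5_OFFSET:offset + ARG5_OFFSET + QWORD]
        if slot == bytes(QWORD):
            return offset
    return None


if __name__ == "__main__":
    page = bytearray(b"\xff" * 0x1000)
    page[0x828:0x830] = bytes(QWORD)  # a single zero qword in the page
    print(find_stack_offset(page, 0x100))
```

The function returns None when the page is too small or contains no suitable zero qword, which is the case where the fallback described next applies.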
In the case where there are no RW pages inside the executable's module, we can perform a fallback to write to the stack. To find a remote process' stack, we can do the following:
The result inside Tib will contain the stack range addresses. With these values, we can use the code before to locate the appropriate offset starting from the bottom of the stack.
Writing the Shellcode
A main obstacle with no allocation is that we have to write the shellcode and then execute it in the same page. There is a way to do this without using VirtualProtectEx or complex ROP chains with this special function: WriteProcessMemory.
Okay, I did say we couldn't use WriteProcessMemory to write the data from our process to the target but I didn't say that we couldn't force the target process to use it on itself. One of the hidden mechanisms inside WriteProcessMemory is that it will re-protect the target buffer's page accordingly to perform the write. Here we see that the target buffer's page is queried with NtQueryVirtualMemory:
Then the page is de-protected for writing using NtProtectVirtualMemory:
As you may have noticed, WriteProcessMemory modifies the shadow space at the beginning of the function. In this case, we need to modify the shellcode to pad for the shadow space:
Now we need to call both ReadProcessMemory and WriteProcessMemory sequentially. Going back to the return from ReadProcessMemory, we can simply jump back to the infinite jmp loop gadget to stall execution instead of the shellcode (it's in a non-executable page now):
This allows time for the malicious process to call another SetThreadContext to set RIP to WriteProcessMemory and reuse RSP from ReadProcessMemory. We can read the shellcode from the same location that was copied by ReadProcessMemory (+ 0x30 bytes to the actual shellcode) and target any page with execute permissions (again, assuming that there are RX sections).
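One way to account for the 0x30 offset, offered here as an interpretation rather than the PoC's exact layout: 8 bytes for the return address, 0x20 bytes of shadow (home) space that the callee may overwrite, and 8 bytes for the NULL fifth-argument slot. The sketch below builds such a padded buffer with a placeholder body; padded_buffer is a hypothetical helper.

```python
import struct

RET_SLOT = 8         # return address consumed by ret
SHADOW_SPACE = 0x20  # home slots the callee may overwrite
ARG5_SLOT = 8        # fifth argument, left as NULL
HEADER_SIZE = RET_SLOT + SHADOW_SPACE + ARG5_SLOT  # 0x30


def padded_buffer(return_address, body):
    """Place `body` after a return address and 0x28 bytes of zero padding."""
    header = struct.pack("<Q", return_address) + bytes(SHADOW_SPACE + ARG5_SLOT)
    return header + body


if __name__ == "__main__":
    buf = padded_buffer(0x7FF600001000, b"\xCC")  # hypothetical loop address
    print(hex(HEADER_SIZE), buf[HEADER_SIZE:])
```

Keeping the body past the header means that anything written into the shadow space during the call cannot corrupt it.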
When WriteProcessMemory returns, it should return into the infinite jmp loop again, allowing the malicious process to make the final call to SetThreadContext to execute the shellcode:
Overall, the entire injection procedure is as follows:
- SetThreadContext to an infinite jmp loop to allow SetThreadContext to reliably use volatile registers,
- Locate a valid RW stack (or pseudo-stack) to host ReadProcessMemory and WriteProcessMemory arguments and the temporary shellcode,
- Register a duplicated handle using DuplicateHandle for the target process to read the shellcode from the malicious process,
- Call ReadProcessMemory using SetThreadContext to copy the shellcode,
- Return into the infinite jmp loop after ReadProcessMemory,
- Call WriteProcessMemory using SetThreadContext to copy the shellcode to an RX page,
- Return into the infinite jmp loop after WriteProcessMemory,
- Call the shellcode using SetThreadContext.
Detection Artifacts
To quickly test the stealth performance, I used two tools: hasherezade's PE-sieve and Sysinternals' Sysmon with SwiftOnSecurity's configuration. If there are any other defensive monitoring tools, I would love to see how well this technique holds up against them.
PE-sieve
Something I noticed while playing with PE-sieve is that if we inject the shellcode into the padding of the .text (or otherwise relevant) section, it will not be detected at all:
If the shellcode is too big to fit into the padding, perhaps another module might contain a bigger cave.
Sysmon Events
These are expected results using the CreateProcess call to spawn the target process instead of using OpenProcess. Something else to note is that the DuplicateHandle call might be expected to trigger a process handle event with ObRegisterCallbacks in Sysmon. In practice it does not, because Sysmon does not follow the event if the handle access is performed by the process that owns that same handle. In the case of AVs or EDRs, it may be different.
Further Improvements
I wouldn't doubt that there may be some issues that I have overlooked since I really rushed this (side) project – I just had to explore this idea and see how far I could go. With regards to recovering the hijacked thread execution, it is possible and I have implemented it in the PoC, but it is dependent on the malicious process which might or might not be a good thing. ¯\_(ツ)_/¯
Limitations
One of the limitations of this technique is that the shellcode size is restricted due to the use of existing pages. The shellcode must be able to fit within the RW stack as well as the RX section. Although searching for modules with bigger sections is possible, they may not always be big enough. In this scenario, I would recommend using staging shellcode.
Conclusion
So it's possible to not use WriteProcessMemory, VirtualAllocEx, VirtualProtectEx, CreateRemoteThread, NtCreateThreadEx, QueueUserAPC, and NtQueueApcThread from the malicious process to inject into a remote process. The OpenProcess and OpenThread usage is still debatable because spawning a target process with CreateProcess isn't always an option. However, it does remove a lot of suspicious calls, which is the goal of this technique.
Since SetThreadContext is such a powerful primitive and crucial to this and many other stealthy techniques, will there be more focus on it? From what I can see, there is already native Windows logging available for it in the Microsoft-Windows-Kernel-Audit-API-Calls ETW provider. I'm interested in seeing what the future will hold for process injection...