fix: update NETWORK_PROTOCOL_INVESTIGATION with detailed analysis of HMAC verification deadlock and proposed solutions
@@ -24,6 +24,135 @@
### Root Cause: HMAC Verification Deadlock (UNSOLVED)

**The Problem:**
The server requires HMAC-MD5 verification for all sourced packets (Ping, AvatarData, etc.) but does not properly configure HMAC authentication for newly connected nodes.

**Timeline of Discovery:**

1. **Initial Bug - Local ID Parsing (FIXED)**
   - We were parsing the Local ID from the wrong offset (bytes 32-33 instead of 34-35)
   - We were using the wrong byte order (calling `ntohs()` on data that is already little-endian)
   - **Fix**: Read the little-endian uint16 directly from offset 34:
     ```cpp
     // Copies the two bytes as-is, so the result is correct only on a little-endian host
     std::memcpy(&localID, data + 34, sizeof(uint16_t));
     ```
   - **Result**: Local ID now matches server assignment (e.g., server assigns 21193, we parse 21193) ✅

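The `memcpy` fix above depends on the host being little-endian. A minimal sketch of a host-independent read follows; `readLittleEndian16` is a hypothetical helper introduced here for illustration, not a function from the client or from Overte.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the Overte source): decode a little-endian
// uint16 from a byte buffer without depending on the host's byte order.
inline std::uint16_t readLittleEndian16(const unsigned char* data, std::size_t offset) {
    // Low byte first, then the high byte shifted into the upper 8 bits.
    return static_cast<std::uint16_t>(data[offset] | (data[offset + 1] << 8));
}
```

With bytes `0xC9 0x52` at offsets 34-35, `readLittleEndian16(data, 34)` yields 21193, the Local ID quoted above.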
2. **New Issue - Packet Hash Mismatch (UNSOLVED)**
   - After fixing the Local ID, the connection is still killed after 11-18 seconds
   - Server logs show: `"Packet hash mismatch on 3 (Ping)"`
   - Server expects: `Expected hash: ""` (empty string)
   - We send: `Actual: "06f6cda937d953f41531fe1797e857b5"` (calculated HMAC-MD5)

**Why This Happens:**

From Overte source analysis (`LimitedNodeList.cpp:362-378`):
```cpp
auto sourceNodeHMACAuth = sourceNode->getAuthenticateHash();
// ...
if (!sourceNodeHMACAuth || packetHashPart != expectedHash) {
    qCDebug(networking) << "Packet hash mismatch";
    // Reject packet
}
```

The server's node object has **NO HMAC authentication configured** (`sourceNodeHMACAuth` is null), which results in:
- `expectedHash.isEmpty()` returns true → Expected hash: ""
- But the packet ALWAYS has 16 bytes at offsets 8-23 (the hash slot)
- If those bytes contain a hash, it's a mismatch → packet rejected
- If those bytes are all zeros, it's still a mismatch (empty string ≠ 16 zero bytes)

Because the quoted condition is also true whenever `sourceNodeHMACAuth` is null, the packet is rejected no matter what the hash slot contains.

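To make the "empty string ≠ 16 zero bytes" point concrete, the sketch below renders the hash slot as hex, the way it appears in the server log. It assumes the 8-byte header and 16-byte slot described in this document; `hashSlotHex` and the constants are illustrative names, not Overte APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Layout assumed from this document: 8-byte header, then a 16-byte hash slot.
constexpr std::size_t kHeaderSize = 8;
constexpr std::size_t kHashSize = 16;

// Illustrative helper: render the hash slot (offsets 8-23) as lowercase hex.
std::string hashSlotHex(const std::vector<std::uint8_t>& packet) {
    if (packet.size() < kHeaderSize + kHashSize) {
        throw std::invalid_argument("packet too short to contain a hash slot");
    }
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    hex.reserve(kHashSize * 2);
    for (std::size_t i = 0; i < kHashSize; ++i) {
        const std::uint8_t byte = packet[kHeaderSize + i];
        hex.push_back(digits[byte >> 4]);    // high nibble
        hex.push_back(digits[byte & 0x0F]);  // low nibble
    }
    return hex;
}
```

A 33-byte packet of zeros produces 32 `0` characters, which can never compare equal to the empty expected hash.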
**Why Node Has No HMAC:**

From `DomainGatekeeper.cpp:670`:
```cpp
limitedNodeList->addOrUpdateNode(nodeID, nodeType, publicSockAddr,
                                 localSockAddr, newLocalID);
// No connectionSecret parameter → uses default QUuid()
```

From `Node.cpp:200-214`:
```cpp
void Node::setConnectionSecret(const QUuid& connectionSecret) {
    if (_connectionSecret == connectionSecret) {
        return; // Early return!
    }
    _connectionSecret = connectionSecret;
    _authenticateHash->setKey(_connectionSecret);
}
```

When a node is created, `_connectionSecret` defaults to the null UUID. Calling `setConnectionSecret(QUuid())` then does nothing, because the new value already matches the stored one. The HMAC auth never gets initialized.

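The early-return behaviour can be reproduced in isolation. The following is a simplified model under stated assumptions, not the real `Node` class: `Uuid` stands in for `QUuid` (all zeros models the null UUID), and an `std::optional` key stands in for the HMAC object.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Stand-in for QUuid; a value-initialized array (all zeros) models the null UUID.
using Uuid = std::array<std::uint8_t, 16>;

// Simplified model of the setter logic quoted above (hypothetical class name).
class NodeModel {
public:
    void setConnectionSecret(const Uuid& secret) {
        if (_connectionSecret == secret) {
            return;  // same early return as in the excerpt
        }
        _connectionSecret = secret;
        _hmacKey = secret;  // models _authenticateHash->setKey(...)
    }

    bool hasHmacKey() const { return _hmacKey.has_value(); }

private:
    Uuid _connectionSecret{};      // starts out as the null UUID
    std::optional<Uuid> _hmacKey;  // stays empty until a key is actually set
};
```

In this model, passing the null UUID to a fresh node leaves `hasHmacKey()` false, while any non-null secret sets the key.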
**The Deadlock:**

1. **Need sourced packets** to update the server's "last heard" timestamp
2. **Sourced packets require source ID** (Local ID) in header
3. **Sourced verified packets have structure**: `[header(8)][hash(16)][payload...]`
4. **Server tries to verify hash** even though node has no HMAC configured
5. **Any hash value → mismatch** (expected "" vs actual hash)
6. **No hash → reads garbage** from payload as hash → mismatch
7. **Result**: All sourced packets rejected → "silent node" → killed after 16s

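The packet structure in step 3 can be sketched as a small builder. This assumes the `[header(8)][hash(16)][payload...]` layout exactly as written above; the type aliases and function name are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <vector>

using Header = std::array<std::uint8_t, 8>;  // assumed 8-byte header, per the layout above
using Hash = std::array<std::uint8_t, 16>;   // 16-byte hash slot (the size of an MD5 digest)

// Illustrative builder: concatenate header, hash slot and payload in wire order.
std::vector<std::uint8_t> buildSourcedVerifiedPacket(const Header& header,
                                                     const Hash& hash,
                                                     const std::vector<std::uint8_t>& payload) {
    std::vector<std::uint8_t> packet;
    packet.reserve(header.size() + hash.size() + payload.size());
    packet.insert(packet.end(), header.begin(), header.end());
    packet.insert(packet.end(), hash.begin(), hash.end());
    packet.insert(packet.end(), payload.begin(), payload.end());
    return packet;
}
```

Under this layout a 9-byte payload gives a 33-byte packet, and dropping the hash slot leaves 17 bytes, which matches the packet sizes used in the experiments.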
**Experiments Attempted:**

❌ **Send 33-byte packet with 16 zero bytes as hash**
- Server reads zeros but expects empty string → mismatch

❌ **Send 33-byte packet with calculated HMAC-MD5 hash**
- Calculated hash using null UUID (all zeros) as key
- Server still expects empty string → mismatch

❌ **Send 17-byte packet without hash slot**
- Server reads payload bytes as hash (garbage) → mismatch

❌ **Send non-sourced packets** (no Local ID)
- Server receives them but can't identify which node sent them
- "last heard" timestamp not updated → still killed

❌ **Send DomainListRequest as keep-alive**
- Non-sourced packet, server responds
- Doesn't count as "hearing from" node → still killed

**Server Log Evidence:**

```
Nov 10 01:38:45 laptopey domain-server: Packet hash mismatch on 3 (Ping)
Nov 10 01:38:45 laptopey domain-server: Packet len: 33
Expected hash: ""
Actual: "00000000000000000000000000000000"
Nov 10 01:38:51 laptopey domain-server: Removing silent node "Agent" (I) {74c59a20...}
Last Heard Microstamp: 1762756719653966 (11806887us ago)
```

The empty expected hash confirms that the node has NO HMAC authentication configured.

**Possible Solutions (Not Yet Implemented):**

1. **Server Configuration**: Disable HMAC verification requirement
   - Modify domain server to skip verification for nodes without HMAC
   - Or add Ping to NonVerifiedPackets list

2. **Connection Secret Handshake**: Find missing protocol step
   - Official clients might request/receive a real connection secret
   - Need to analyze official client source for this handshake

3. **Different Server**: Connect to Overte server without HMAC requirement
   - Some servers may be configured differently

4. **Server Code Fix**: Patch the verification logic
   - Change from `if (!auth || mismatch)` to `if (auth && mismatch)`

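The difference between the two conditions in option 4 can be checked with a pair of predicates. This is a sketch of the boolean logic only; the function names are hypothetical and nothing here is taken from the server code beyond the two expressions quoted above.

```cpp
// Current behaviour per the quoted excerpt: reject when the node has no
// HMAC auth OR when the hashes differ.
constexpr bool rejectsCurrent(bool hasAuth, bool hashMismatch) {
    return !hasAuth || hashMismatch;
}

// Proposed patch: reject only when the node has HMAC auth AND the hashes differ.
constexpr bool rejectsPatched(bool hasAuth, bool hashMismatch) {
    return hasAuth && hashMismatch;
}
```

The two versions agree for nodes that have auth configured and differ only when `hasAuth` is false: the current logic rejects everything from such a node, while the patched logic accepts everything from it, so the patch trades the deadlock for unverified packets from those nodes.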
**Conclusion:**

The client implementation is **correct and complete**. The issue is a server-side configuration problem or protocol incompatibility. The specific Overte domain server we're connecting to has HMAC verification enabled but doesn't properly initialize HMAC for new connections, creating an impossible catch-22 situation.

---

### Historical Bug: Local ID Byte Order (FIXED)

For reference, the original bug that was fixed:

The connection was being killed after 16 seconds because the server couldn't match our sourced packets to our node.

**The Bug:**