Need to cache fixed amount of CClaimTrieNode's, and store the rest on disk #166
Currently, lbrycrdd loads the entire claimtrie, composed of CClaimTrieNodes, into memory.
The total size of these nodes can exceed the memory available, so we need a cache that keeps a fixed number of CClaimTrieNodes in memory as needed and stores the rest on disk.
Say we create a class CClaimTrieCache, perhaps something like this in pseudo-code (just trying to brainstorm here, open to different approaches).
CClaimTrieCache(int num_nodes_in_cache): num_nodes_in_cache is the maximum number of CClaimTrieNodes that can be loaded into the cache
public methods:
void loadFromDisk(): loads as many CClaimTrieNodes as possible; runs on startup
CClaimTrieNode* root(): always returns a pointer to the root claim trie node
nodeMapType::iterator children(CClaimTrieNode* node): returns an iterator over CClaimTrieNode.children that allows traversal of the claimtrie. If the children of "node" have not been loaded into the cache yet, CClaimTrieCache will need to load them from disk into the cache. (A rough C++ rendering of this interface follows below.)
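For concreteness, a rough C++ rendering of the brainstormed interface might look like the following. This is only a sketch; the nodeMapType typedef and the private members are assumptions, not existing code.

```cpp
#include <map>

class CClaimTrieNode;  // defined in claimtrie.h
typedef std::map<unsigned char, CClaimTrieNode*> nodeMapType;

class CClaimTrieCache
{
public:
    // num_nodes_in_cache: maximum number of CClaimTrieNodes kept in memory
    explicit CClaimTrieCache(int num_nodes_in_cache);

    // Loads as many nodes from disk as fit in the cache; runs on startup
    void loadFromDisk();

    // Always returns a pointer to the root node (the root stays pinned in memory)
    CClaimTrieNode* root();

    // Returns an iterator over node->children; if the children have not been
    // loaded into the cache yet, they are first read from disk
    nodeMapType::iterator children(CClaimTrieNode* node);

private:
    int maxNodes;           // capacity of the cache
    CClaimTrieNode* pRoot;  // root node, never evicted
};
```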
Note that CClaimTrieNodes are currently stored on disk in key-value storage where the key is the node name. The "children" of a CClaimTrieNode are not stored on disk; they are rebuilt when the nodes are loaded into memory (see InsertFromDisk(), https://github.com/lbryio/lbrycrd/blob/master/src/claimtrie.cpp#L1108).
This means that the on-disk storage of a CClaimTrieNode should also include a vector of its children's letters. That lets us look up all of a node's children efficiently in the key-value storage.
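As an illustration of that layout, a hypothetical on-disk record could carry the child characters next to the node data, so each child can then be fetched with a single keyed read; the record and helper below are made up for this sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical on-disk record: key = node name, value = the node data plus the
// characters of its children, so the subtrie can be walked one keyed read at a time.
struct CClaimTrieNodeDiskRecord
{
    std::vector<unsigned char> childChars;  // e.g. {'a', 'p'} if the children are parentName + 'a' and parentName + 'p'
    // ... claims, takeover height, hash, etc. would follow here
};

// With the child characters stored, each child's key is just the parent name plus one letter:
inline std::string childKey(const std::string& parentName, unsigned char c)
{
    return parentName + static_cast<char>(c);
}
```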
I think that for the cache to perform well, we need to evict based on the depth of the node, and maybe on the distance from the new node being inserted. The logic is that a CClaimTrieNode close to the root will see far more accesses than one further from the root.
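Purely as a sketch of that eviction bias: entries could be ordered so that deeper nodes (longer names) are evicted first, falling back to least-recently-used within a depth. The CacheEntry struct and the access counter below are assumptions for illustration, not existing code.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical cache entry: a node far from the root is the cheapest to evict,
// since every lookup of "abc" must pass through "", "a" and "ab" first.
struct CacheEntry
{
    std::string name;    // node name; its length equals the node's depth in the trie
    int64_t lastAccess;  // monotonically increasing access counter
};

struct EvictionOrder
{
    bool operator()(const CacheEntry& a, const CacheEntry& b) const
    {
        // Deeper nodes (longer names) come out first; ties broken by least recently used.
        if (a.name.size() != b.name.size())
            return a.name.size() < b.name.size();
        return a.lastAccess > b.lastAccess;
    }
};

// top() yields the best eviction candidate under the ordering above.
typedef std::priority_queue<CacheEntry, std::vector<CacheEntry>, EvictionOrder> EvictionQueue;
```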
I took a different approach; you can decide whether it's valuable or not. We can keep basically the same code structure as is, but use a memory-mapped file to allocate space, i.e. all node allocations become placement new into the mapped virtual memory, and shared_ptr uses a custom deleter.
https://www.boost.org/doc/libs/1_67_0/doc/html/interprocess/managed_memory_segments.html
If you have another idea, please share it, because my only other idea is again placement new with custom allocations, which would be error prone.
I implemented it on top of a managed memory-mapped file, but it hits what looks to me like a bug, which I reported at https://svn.boost.org/trac10/ticket/13626. Do you have a suggestion on how to investigate?
Do you have this implemented in a branch somewhere I could look at?
It depends on #160. Is that ready to merge?
Here it is on pastebin: https://pastebin.com/xNuF4YZ4
One possible solution: a CMemoryFile that manages a fixed-size file, and a CMemoryFileManager that manages a set of memory files. For every new allocation, the manager looks through all created files for enough space; if none has it, it constructs a new one. Growing is then just creating a new file of the desired size, and shrinking is searching for empty files to delete.
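A minimal sketch of that design on top of boost::interprocess managed_mapped_file, assuming fixed-size segments; the class names follow the comment above, and the allocation strategy is simplified for illustration.

```cpp
#include <boost/interprocess/managed_mapped_file.hpp>
#include <cstddef>
#include <memory>
#include <new>
#include <string>
#include <vector>

namespace bip = boost::interprocess;

// A single fixed-size memory-mapped file that hands out raw space.
class CMemoryFile
{
public:
    CMemoryFile(const std::string& path, std::size_t size)
        : segment(bip::open_or_create, path.c_str(), size) {}

    // Returns nullptr (instead of throwing) when the segment has no room left.
    void* allocate(std::size_t bytes) { return segment.allocate(bytes, std::nothrow); }

    void deallocate(void* p) { segment.deallocate(p); }

private:
    bip::managed_mapped_file segment;
};

// Searches every existing file for free space; creates a new file if none fits.
class CMemoryFileManager
{
public:
    CMemoryFileManager(std::string prefix, std::size_t fileSize)
        : prefix(std::move(prefix)), fileSize(fileSize) {}

    void* allocate(std::size_t bytes)
    {
        for (auto& file : files)
            if (void* p = file->allocate(bytes))
                return p;
        // Growing is just creating one more file of the configured size.
        files.push_back(std::make_unique<CMemoryFile>(
            prefix + std::to_string(files.size()) + ".dat", fileSize));
        return files.back()->allocate(bytes);
    }

private:
    std::string prefix;
    std::size_t fileSize;
    std::vector<std::unique_ptr<CMemoryFile>> files;
};
```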
@kaykurokawa, I've implemented it as described in my comment above; the tests pass and it looks good to me. I can upload it to pastebin, or I can wait for #160 to be merged.
Here it is on pastebin for now: https://pastebin.com/KBXdEtyw
I want to get #160 merged today. Can you look into using make_shared instead of the new keyword? Return shared pointers from your methods instead of raw pointers.

No, it cannot be used; make_shared does not work with placement new.
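For context: make_shared allocates both the object and its control block in a single heap allocation, so it cannot place the object into a mapped segment. A shared_ptr over a placement-new'd object with a custom deleter still lets the methods return shared pointers; a rough sketch, assuming some segment allocator like the one above:

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <utility>

// Hypothetical hooks into a mapped segment (e.g. the manager sketched above).
void* segmentAllocate(std::size_t bytes);
void  segmentDeallocate(void* p);

// Constructs T with placement new inside the segment and wraps it in a shared_ptr
// whose deleter destroys the object and returns its space to the segment.
template <typename T, typename... Args>
std::shared_ptr<T> makeSharedInSegment(Args&&... args)
{
    void* raw = segmentAllocate(sizeof(T));
    T* obj = new (raw) T(std::forward<Args>(args)...);  // placement new, no heap allocation
    // Note: the shared_ptr control block still lives on the ordinary heap; only the
    // object itself resides in the mapped file.
    return std::shared_ptr<T>(obj, [](T* p) {
        p->~T();
        segmentDeallocate(p);
    });
}
```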
@bvbfan, Roy's approach to this problem (in btcd) was to store all of the claimtrie nodes in LevelDB, using the node's hash as the key. This has the advantage of indexed lookups for replay. He also kept the active claims in a dictionary so that no DB lookup is required for the common case. Do you see any advantages/disadvantages over the mmap approach? I'm assuming that you rebuild the mmap on startup, whereas the LevelDB could resume construction at the last saved height.
I'm not sure how well LevelDB holds up when the size increases significantly, e.g. when we have more than 8GB of claims. Memory-mapped files (the last implementation has a couple of files with a compile-time configured size) benefit from OS-level access to the virtual address space, which should be about as fast as it gets, in my opinion. Also, @kaykurokawa, note:
I made a mistake by closing, sorry.
Now that we have the cache trie separated from the base trie, the next step is to make the base trie actually just be the LevelDB data that we already write to disk to persist that trie. We don't need to load the base trie into RAM. Instead, we can just pull the nodes directly from there as necessary. Things that need to happen:
Merged via https://github.com/lbryio/lbrycrd/pull/308