Implement a hard fork to normalize claim names #159
Reference: LBRYCommunity/lbrycrd#159
No description provided.
This is a complete rewrite of #102 and replaces it, since it is much more complete and handles situations that the other does not.
This PR contains a lot of changes and I expect a detailed review will take some time to ensure correctness. I also expect the reproducible_build script will need further modification (likely won't work as-is again due to the added ICU dependency).
@kaykurokawa The "claimtriebranching_hardfork_disktest" unit test does not seem to work with this PR, if you have a second to see what's going on there, it would help. It came up after things were working and then I rebased to latest, so it's commented out for now.
EDIT: @kaykurokawa This last comment is no longer true, so can be ignored now.
Addresses #65
WIP regarding travis/dependency addition.
Updated.
Delete it if it's obsolete.
Doesn't boost.locale already require this? Why would we need to get it separately from boost?
We default this to true after testing but before the release?
This exists just to convert a char to a string?
A normalized string, yes.
Because we're using icu for normalization, which doesn't work through boost unless it's explicitly enabled.
I don't know what you're asking. It's a flag specified by a caller.
So what is the gain with using boost's wrapper? Suppose we just make ICU calls directly?
It's C++ and style-compatible. I also have previous experience using it, for exactly this kind of purpose.
It appears that the
utf8
object is immutable. Can we move its construction out of our inline method? It seems we could just construct a static instance once and be done with it.
Updated.
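The hoisting suggested here can be sketched as follows. This is illustrative only: Utf8Normalizer and its ASCII-lower-casing body are stand-ins for the real boost.locale/ICU converter, not code from the PR. The point is that an immutable, expensive-to-construct helper becomes a function-local static, built once on first use:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Stand-in for the immutable "utf8" converter; imagine costly one-time
// construction (locale/ICU setup) in its constructor.
struct Utf8Normalizer {
    std::string normalize(const std::string& name) const {
        // Stub transform: ASCII lower-casing stands in for real normalization.
        std::string out(name);
        std::transform(out.begin(), out.end(), out.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        return out;
    }
};

inline std::string normalizeClaimName(const std::string& name) {
    // Constructed exactly once; initialization is thread-safe since C++11.
    static const Utf8Normalizer utf8;
    return utf8.normalize(name);
}
```

Every call after the first reuses the same instance, so the per-call cost is only the conversion itself.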
@kaykurokawa Let's talk.
Remove echo
Remove step output
@ -26,7 +25,6 @@ function HELP {
exit 1
}
Revert
Note: This makes the build times accurate
@ -376,3 +394,4 @@
git fetch origin2
git checkout origin2/master contrib/devtools/clang-format-diff.py
git diff -U0 origin2/master -- '*.h' '*.cpp' | ./contrib/devtools/clang-format-diff.py -p1
}
Note that travis builds will be parallelized by this
All fork heights are deferred to @kaykurokawa
All fork heights are deferred to @kaykurokawa
All fork heights are deferred to @kaykurokawa
Remove debug
Remove debug
To be clear, this fork implements normalization by removing the concept of a claimtrie of chars and we instead have a claimtrie of code points, represented as chars (via utf8).
@ -1426,3 +1361,3 @@
{
LogPrintf("%s: Removing a claim was unsuccessful. name = %s, txhash = %s, nOut = %d", __func__, name.c_str(), outPoint.hash.GetHex(), outPoint.n);
LogPrintf("%s: Removing a claim was unsuccessful. name = %s, txhash = %s, nOut = %d\n", __func__, name.c_str(), outPoint.hash.GetHex(), outPoint.n);
return false;
Style: remove blank lines
Style revert unnecessary changes
Remove fprintf
@kaykurokawa Let's talk about this test at some point.
Do we need to copy the trie node?
Returning const on a non-reference value is nonsense in C++.
I opened a separate issue to discuss how to do character normalization here: https://github.com/lbryio/lbrycrd/issues/204 .
Seems like the method in this PR may not be optimal but we can discuss that separately since I think it will not affect the bulk of the changes here.
My first attempt at compiling this branch gave me this error:
I am using system-installed boost (with the OpenSSL 1.1 PR) as I haven't been able to make the reproducible_build.sh work for me since July (on Bionic). After getting the above error, I validated that "auto" was the path for ICU, googled a bit, and read the configure.ac a bit. I decided to add this to the configure command:
That did alleviate the issue. It seems that configure.ac is not correctly falling back to pkg-config in the situation where
--with-icu
is omitted.
Agreed. So is inline.
Hrm, odd. It should be a separate case in the configure. I was focused on getting the reproducible builds going because we don't want boost or icu shared linkage (at least for the distributed binaries), as it's error prone and system dependent.
No, we don't want the non-normalized value.
I'm keeping inline in place of a storage specifier though, will remove the const.
@BrannonKing reported a performance issue, and it's true, this is slower than it needs to be. Correctness first, then performance is how I've always operated ;-)
Anyway, I've already got a partial fix, but will be updating when I think it's ready.
But why do we want an empty node? It can wipe the DB in some cases.
Leaking this is not a problem right now, but it will be if we use placement new or an in-class operator new (which I have plans to do).
It'll be better
This doesn't address what's needed (and doesn't save any memory, just moves it). A future release will delete the non-normalized trie, which will reduce memory, but can't be done until we're safely past the fork.
My experience with this branch thus far (with fork height at 1M):
main.cpp:2124: bool DisconnectBlock(const CBlock&, CValidationState&, const CBlockIndex*, CCoinsViewCache&, CClaimTrieCache&, bool*): Assertion 'pindex->GetBlockHash() == trieCache.getBestBlock()' failed.
My concern is that operator delete isn't called; it can do many things other than freeing memory.
Understood, but it needs to remain in memory and intact past the fork. A future release will remove it (by deleting it) when it's clear that we can. If you're generally pointing out that the single pointer is leaked, it's not particularly helpful because we know there are leaks in the claimtrie already (and this one isn't one in particular to be concerned about, IMO). The proposed suggestion doesn't change that either. Upstream uses managed pointers, and while our first update to it will not, it will be much easier to transition after that update. I was originally planning to do that before the upstream release, but it makes sense to do it after since we can skip boost pointers altogether and just use std ones.
Normalization can narrow the claimtrie name space.
@BrannonKing I dusted off the webs, please test and find more (fork related) issues.
The reindex issue is resolved. However, I was unable to run past the fork height. I modified the fork height to occur in the next few minutes. However, it never went past the block height one before my fork height. I waited about 20min. This is on mainnet. I'm seeing a lot of this error:
Looks like you were testing this and forgot to change it back?
Yes, same for me. Previous PR told me I needed reindex when starting with mainnet data.
Brannon that sounds like expected behavior, because the fork happened on your codebase but not on mainnet.
If I try to run on mainnet with a fork height that is past I crash with this error:
No, I left it open so it can be decided by all of us. Locale-aware or not. Choose wisely ;-)
According to http://charette.no-ip.com:81/programming/doxygen/boost/namespaceboost_1_1algorithm.html#a1a318a47f9b4482c28552a62d0297a27 , it seems to be using std::locale() which is whatever the global locale is.
And it seems like the global locale always defaults to the "C" locale for C++ programs? (At least that's what it is set to for me, and other sources seem to say that the global locale should always be "C" )
I don't understand exactly what can cause a string to shrink here?
Right, C is usually the default, which is not UTF-8. The goal is to be systematic (i.e. consistently using normalization), and then applying some effort on lower casing (where a perfect system for all encodings doesn't exist). So long as we're consistent, it's best effort IMO. I'm open to other takes on it.
That's for UTF-8 code points after normalization. For example, 0xc0a0 and 0x20 are encodings of the same thing, but collapsing the former (due to the lower casing) is not handled as equivalent. When there's a dispute, we use the original. Perhaps we could opt to use something else in that case, but again, that's for consistency.
Also, technically it isn't a string as you normally think of them anymore. It is a string of byte-length code points, which are variably sized to make the characters. Sometimes a single encoded character may be N bytes, but after normalization it may be M.
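The byte-length change described here can be made concrete with a standalone sketch. The names pickResult, kNfd, and kNfc are illustrative, no real normalizer is invoked, and the fallback expression mirrors the rule quoted elsewhere in this thread:

```cpp
#include <string>

// Fallback rule under discussion: keep the normalized form only when it did
// not shrink relative to the input.
std::string pickResult(const std::string& name, const std::string& normalized) {
    return (normalized.size() >= name.size()) ? normalized : name;
}

// Two UTF-8 encodings of the same user-visible character "é":
// NFD: 'e' + U+0301 COMBINING ACUTE ACCENT -> 3 bytes: 0x65 0xCC 0x81
// NFC: U+00E9 (precomposed)                -> 2 bytes: 0xC3 0xA9
const std::string kNfd = "\x65\xCC\x81";
const std::string kNfc = "\xC3\xA9";
```

NFC-normalizing the NFD form would shrink the byte string from 3 bytes to 2 while still spelling the same character, so under the fallback rule the original (NFD) bytes would be returned.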
Let me rephrase, I get why the size of the un-normalized string could shrink after normalization,
But I don't get why we are using that as the basis for determining whether we return the original un-normalized string, or the normalized string? Is there something special about this rule? Can you elaborate on your specific example of 0x20 and 0xc0a0 here? What are those characters and why aren't they handled as equivalent?
I think we should discuss what happens to bytes that are invalid UTF-8. Because UTF-8 is variable-width, not all byte sequences are valid UTF-8. It seems like the normalization function we have now will parse invalid bytes without throwing an error. I'm not sure exactly what the logic is, but if I pass an invalid byte like 0xFF to the normalization function, it returns 0xFF back because it hits the line below in the normalization function:
return (normalized.size() >= name.size()) ? normalized : name;
The variable "normalized" was actually empty when I inspected it. I created this unit-test gist for testing purposes; you can check it out: https://gist.github.com/kaykurokawa/2c35843eb09da2bc8f31ffac46de4099
So, two questions. First: should we attempt to normalize invalid bytes at all, or should we reject them from ever entering the claimtrie? Second: if we are normalizing invalid bytes, then is the current method correct (is it possible that invalid bytes could be accidentally normalized into valid bytes)?
If our standard is UTF-8, we should reject anything that is not valid UTF-8. That's my vote.
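For reference, the "reject anything that is not valid UTF-8" option needs a validity check roughly like the following. This is a standalone sketch (not code from the PR) that rejects stray lead bytes such as 0xFF, overlong encodings such as 0xC0 0xA0, UTF-16 surrogates, and code points above U+10FFFF:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Minimal UTF-8 validity check; a sketch of one possible gatekeeper for
// claim names, not the PR's actual implementation.
bool isValidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = s[i];
        std::size_t len;
        std::uint32_t cp, min;
        if (c < 0x80) { i += 1; continue; }                              // ASCII
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;          // stray continuation or invalid lead (0xFF)
        if (i + len > s.size()) return false;               // truncated sequence
        for (std::size_t j = 1; j < len; ++j) {
            unsigned char cc = s[i + j];
            if ((cc & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                  // overlong (e.g. 0xC0 0xA0)
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;     // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                    // beyond Unicode
        i += len;
    }
    return true;
}
```

Note that under such a check, the 0xC0 0xA0 example from earlier in the thread would never reach the normalizer at all.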
I tried to test this in a simple fashion this morning with no luck:
~/.lbrycrd/regtest
./src/lbrycrdd -server -txindex -regtest -walletbroadcast=0
./lbrycrd/src/lbrycrd-cli -regtest generate 30
@BrannonKing Thanks, give it a go again, this is corrected.
@kaykurokawa Good questions. There are a few issues that need addressing:
[ Generically, we call both steps normalization in the code ]
The general approach I was going for is that if either fails, fall-back to our current behavior (which is to just treat the bytes as bytes in the claimtrie). Maybe that was a bad assumption on what we need, or maybe it's not working as intended (although it appears to be to me).
Glad we're finally doing this kind of review though because I did expect this process to take some time and we're getting to the hard parts. Open to suggestions. I'm not sure rejecting any non-UTF8 inputs is the correct solution, as it does place a larger burden on the applications using lbrycrd, who really may not care about this feature.
Either way, I think we can all agree that consistency is the most important, in that either the same byte or code-point sequences are stored consistently in the claimtrie.
What if the existing node had additional claims on it?
This line is brutally expensive.
Again, regenerating the hashes for the entire trie (x2) is too expensive for logging.
This line looks suspicious to me; what if the previous node had claims on it? It appears that we don't keep those. Is this where we're losing claims? We need to change them so that all the original ones still complete; don't throw them away.
@ -1663,3 +1593,4 @@
{
name = normalizeClaimName(name);
return removeClaim(name, outPoint, nHeight, nValidAtHeight, true);
}
I know this is a preexisting line, but it seems to make a harsh assumption about the order that data in the trie is read from disk (it must be "pre-order").
@ -2745,3 +2758,4 @@
}
int64_t nTime3 = GetTimeMicros(); nTimeConnect += nTime3 - nTime2;
LogPrint("bench", " - Connect %u transactions: %.2fms (%.3fms/tx, %.3fms/txin) [%.2fs]\n", (unsigned)block.vtx.size(), 0.001 * (nTime3 - nTime2), 0.001 * (nTime3 - nTime2) / block.vtx.size(), nInputs <= 1 ? 0 : 0.001 * (nTime3 - nTime2) / (nInputs-1), nTimeConnect * 0.000001);
Is there an advantage to having this code to enable the normalization go here rather than in the CClaimTrieCache object (since the latter also gets IncrementBlock called)? What rule would guide that?
Why do we need to make a new trieCache object here?
Thinking about this some: an alternative to this would be to simply walk the trie and add necessary changes to the insertion and deletion queues. As written this seems to make an assumption that the vast majority of the trie will have to change. If we modify the cache instead the persisted undo info would allow us to rollback without having to store the duplicated trie in RAM.
This line is missing the
normalizedClaimName
on
it->first
? And for all the items in this copy, do we not need to merge them instead of replacing them, as there might be conflicts?
In what way? Show me the measurements.
We presently have 5.5 million nodes in the trie.
getMerkleHash(true)
triggers a rehash on every one of those. I just measured the call at 3 microseconds each. This was on a non-optimized build, but even if you figure 1 microsecond per hash call, you're looking at a 10-second CPU hit for that one line.
@ -742,12 +749,20 @@ if test x$use_pkgconfig = xyes; then
AC_MSG_ERROR(pkg-config not found.)
fi
if test "${ICU_PREFIX}" != "auto"; then
I investigated why configure fails without --with-icu. This section is for pkg-config; you can replace all of lines 756-765 with
Then it will produce all the needed flags and libs for compilation. The
else
section (lines 792-802) looks correct.
@ -742,12 +749,20 @@ if test x$use_pkgconfig = xyes; then
AC_MSG_ERROR(pkg-config not found.)
fi
if test "${ICU_PREFIX}" != "auto"; then
I just tested this and still got the same error as before:
cannot determine absolute directory name of 'auto/lib'
. I cleaned my repository, modified configure.ac as suggested above, then reran autogen.sh and configure (CXXFLAGS="-g" ./configure --enable-cxx --enable-static --disable-shared --with-pic --with-gui=no) and make.
@ -742,12 +749,20 @@ if test x$use_pkgconfig = xyes; then
AC_MSG_ERROR(pkg-config not found.)
fi
if test "${ICU_PREFIX}" != "auto"; then
That's the diff, it works here.
I'm getting close to having a solution that doesn't require keeping two tries in memory. As part of that work, I've been narrowing in on two rules:
Removal of this assert is fine (it's no longer needed using this approach).
@ -2077,0 +2068,4 @@
bool CClaimTrieCache::incrementBlock(insertUndoType& insertUndo, claimQueueRowType& expireUndo,
insertUndoType& insertSupportUndo, supportQueueRowType& expireSupportUndo,
std::vector<std::pair<std::string, int> >& takeoverHeightUndo) const
Why is this code commented out?
For OpenSSL 1.0 you can use EVP_MD_CTX_create or
Then you can use EVP_MD_CTX_new and EVP_MD_CTX_free unconditionally.
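The unconditional-call approach suggested here is typically done with a small compat shim placed after including the OpenSSL headers. A sketch (OpenSSL 1.1.0 renamed EVP_MD_CTX_create/EVP_MD_CTX_destroy to EVP_MD_CTX_new/EVP_MD_CTX_free):

```cpp
// In a common header, after #include <openssl/evp.h>:
// Map the 1.0 names onto the 1.1 names so the rest of the code can call
// EVP_MD_CTX_new and EVP_MD_CTX_free unconditionally on both versions.
#if OPENSSL_VERSION_NUMBER < 0x10100000L
#define EVP_MD_CTX_new EVP_MD_CTX_create
#define EVP_MD_CTX_free EVP_MD_CTX_destroy
#endif
```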
If this assert fails, it looks like it is comparing a normalized and a non-normalized name; that's pretty likely when it's used in incrementBlock.
@ -2077,0 +2068,4 @@
bool CClaimTrieCache::incrementBlock(insertUndoType& insertUndo, claimQueueRowType& expireUndo,
insertUndoType& insertSupportUndo, supportQueueRowType& expireSupportUndo,
std::vector<std::pair<std::string, int> >& takeoverHeightUndo) const
it was for a previous discussion and now removed
wasn't meant to be part of this review; make sure #221 got it correct
I found and fixed the problem here. Because I was reusing the InsertUndo I had to ensure that the decrementBlock only put appropriate items back into the QueueCache.
@ -2745,3 +2758,4 @@
}
int64_t nTime3 = GetTimeMicros(); nTimeConnect += nTime3 - nTime2;
LogPrint("bench", " - Connect %u transactions: %.2fms (%.3fms/tx, %.3fms/txin) [%.2fs]\n", (unsigned)block.vtx.size(), 0.001 * (nTime3 - nTime2), 0.001 * (nTime3 - nTime2) / block.vtx.size(), nInputs <= 1 ? 0 : 0.001 * (nTime3 - nTime2) / (nInputs-1), nTimeConnect * 0.000001);
removed in my changes
@ -2745,3 +2758,4 @@
}
int64_t nTime3 = GetTimeMicros(); nTimeConnect += nTime3 - nTime2;
LogPrint("bench", " - Connect %u transactions: %.2fms (%.3fms/tx, %.3fms/txin) [%.2fs]\n", (unsigned)block.vtx.size(), 0.001 * (nTime3 - nTime2), 0.001 * (nTime3 - nTime2) / block.vtx.size(), nInputs <= 1 ? 0 : 0.001 * (nTime3 - nTime2) / (nInputs-1), nTimeConnect * 0.000001);
resolved
@ -742,12 +749,20 @@ if test x$use_pkgconfig = xyes; then
AC_MSG_ERROR(pkg-config not found.)
fi
if test "${ICU_PREFIX}" != "auto"; then
included in the latest checkin
It should be static; inlining here is unwanted. Furthermore, the check for whether normalization should be performed does not belong here. For now it should be callable from both CClaimTrie and CClaimTrieCache, depending on their heights.
@bvbfan , excellent analysis. With our work on #44 it appears that we achieved independence from methods taking a name in rpc/claimtrie.cpp. I think that is critical to your suggestion that we don't do string normalization in the CClaimTrie (instead, we do it all in the "cache"). I really like the suggestion! I think that puts a high priority on getting #44 merged and rebasing our normalization on top of that.
fixed in last push
Fixed in current push.
@ -2291,3 +2318,2 @@
{
if (*itNamesToCheck == itSupportQueue->first && itSupportQueue->second.outPoint == itSupportQueueName->outPoint && itSupportQueue->second.nValidAtHeight == itSupportQueueName->nHeight)
{
if (nameToCheck == itSupportQueue->first && itSupportQueue->second.outPoint == itSupportQueueName->outPoint && itSupportQueue->second.nValidAtHeight == itSupportQueueName->nHeight) {
I don't see a reason to make the param in/out.
@ -2291,3 +2318,2 @@
{
if (*itNamesToCheck == itSupportQueue->first && itSupportQueue->second.outPoint == itSupportQueueName->outPoint && itSupportQueue->second.nValidAtHeight == itSupportQueueName->nHeight)
{
if (nameToCheck == itSupportQueue->first && itSupportQueue->second.outPoint == itSupportQueueName->outPoint && itSupportQueue->second.nValidAtHeight == itSupportQueueName->nHeight) {
The out parameter is used in mining.cpp and main.cpp (to avoid duplicate work on that name). Out parameter usage is not very obvious, though. I'm open to other suggestions.
@ -2291,3 +2318,2 @@
{
if (*itNamesToCheck == itSupportQueue->first && itSupportQueue->second.outPoint == itSupportQueueName->outPoint && itSupportQueue->second.nValidAtHeight == itSupportQueueName->nHeight)
{
if (nameToCheck == itSupportQueue->first && itSupportQueue->second.outPoint == itSupportQueueName->outPoint && itSupportQueue->second.nValidAtHeight == itSupportQueueName->nHeight) {
Yeah, I see now. The right approach, to me, is to call normalize before spend; we can prevent double work if we have a function like isNormalized before the real normalization. But I don't think a double call will slow things down significantly.
@ -368,0 +380,4 @@
CPPFLAGS="${CPPFLAGS}" LDFLAGS="${LDFLAGS}" \
./configure --without-gui ${OPTIONS} \
--with-boost="${BOOST_PREFIX}" \
--with-icu="${ICU_PREFIX}" >> "${LBRYCRD_LOG}" 2>&1
This is not going to be a good enough substitute for static linking. When I run
ldd
on a build from this script I see
libicui18n.so.55 => not found, libicuuc.so.55 => not found
. It's possible that we could deploy some shared libraries alongside our executables.
@ -368,0 +380,4 @@
CPPFLAGS="${CPPFLAGS}" LDFLAGS="${LDFLAGS}" \
./configure --without-gui ${OPTIONS} \
--with-boost="${BOOST_PREFIX}" \
--with-icu="${ICU_PREFIX}" >> "${LBRYCRD_LOG}" 2>&1
Notice that line 333 has --enable-shared, so building with this option produces dynamic libraries (.so); for a static build we want static ones (.a), so it should be --disable-shared --enable-static.
The function's opening brace goes on a new line; that's the expected style.
Opening brace on new line.
I haven't run the formatter on this yet. I assume that would pick this up?
Yes, it will.
@ -368,0 +380,4 @@
CPPFLAGS="${CPPFLAGS}" LDFLAGS="${LDFLAGS}" \
./configure --without-gui ${OPTIONS} \
--with-boost="${BOOST_PREFIX}" \
--with-icu="${ICU_PREFIX}" >> "${LBRYCRD_LOG}" 2>&1
fixed
I did finally get a green on the Linux compilation again. Some notes on my issues with the reproducible_build.sh:
The advantage of Docker is that you don't have to recompile the dependencies every time. We can stop wasting time on that. I feel strongly like we need to move that direction.
#235 supersedes this PR.
Pull request closed