Sophie

Sophie

distrib > Mageia > 7 > aarch64 > by-pkgid > 7e647d9940d31b34c253e6f71c416c4b > files > 2678

bzr-2.7.0-6.mga7.aarch64.rpm


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Directory fingerprints &#8212; Bazaar 2.7.0 documentation</title>
    <link rel="stylesheet" href="_static/classic.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    
    <script type="text/javascript" id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <script type="text/javascript" src="_static/language_data.js"></script>
    
    <link rel="shortcut icon" href="_static/bzr.ico"/>

    <link rel="search" title="Search" href="search.html" />
<link rel="stylesheet" href="_static/bzr-doc.css" type="text/css" />
 
  </head><body>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
<li><a href="http://bazaar.canonical.com/">
    <img src="_static/bzr icon 16.png" /> Home</a>&nbsp;|&nbsp;</li>
<a href="http://doc.bazaar.canonical.com/en/">Documentation</a>&nbsp;|&nbsp;</li>

        <li class="nav-item nav-item-0"><a href="index.html">Developer Document Catalog (2.7.0)</a> &#187;</li>
 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body" role="main">
            
  <div class="section" id="directory-fingerprints">
<h1>Directory fingerprints<a class="headerlink" href="#directory-fingerprints" title="Permalink to this headline">¶</a></h1>
<div class="contents local topic" id="contents">
<ul class="simple">
<li><a class="reference internal" href="#introduction" id="id1">Introduction</a></li>
<li><a class="reference internal" href="#use-case-oriented-apis" id="id2">Use-case oriented APIs</a><ul>
<li><a class="reference internal" href="#commit" id="id3"><code class="docutils literal notranslate"><span class="pre">commit</span></code></a></li>
<li><a class="reference internal" href="#log" id="id4"><code class="docutils literal notranslate"><span class="pre">log</span></code></a></li>
</ul>
</li>
<li><a class="reference internal" href="#open-questions" id="id5">Open questions</a></li>
<li><a class="reference internal" href="#conclusions" id="id6">Conclusions</a></li>
<li><a class="reference internal" href="#design-changes" id="id7">Design changes</a></li>
<li><a class="reference internal" href="#api-changes" id="id8">API changes</a></li>
</ul>
</div>
<div class="section" id="introduction">
<h2><a class="toc-backref" href="#id1">Introduction</a><a class="headerlink" href="#introduction" title="Permalink to this headline">¶</a></h2>
<p>The basic idea is that for a directory in a tree (committed or otherwise), we
will have a single scalar value.  If these values are the same, the contents of
the subtree under that directory are necessarily the same.</p>
<p>This is intended to help with these use cases, by allowing them to quickly skip
over directories with no relevant changes, and to detect when a directory has
changed:</p>
<ul class="simple">
<li>diff/status (both local trees and historical trees)</li>
<li>merge</li>
<li>log -v</li>
<li>log on a directory</li>
<li>commit</li>
</ul>
</div>
<div class="section" id="use-case-oriented-apis">
<h2><a class="toc-backref" href="#id2">Use-case oriented APIs</a><a class="headerlink" href="#use-case-oriented-apis" title="Permalink to this headline">¶</a></h2>
<p>Most of this will be hidden behind the Tree interface.  This should cover
<code class="docutils literal notranslate"><span class="pre">log</span> <span class="pre">-v</span></code>, <code class="docutils literal notranslate"><span class="pre">diff</span></code>, <code class="docutils literal notranslate"><span class="pre">status</span></code>, <code class="docutils literal notranslate"><span class="pre">merge</span></code> (and implicit merge during
push, pull, update):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">tree</span><span class="o">.</span><span class="n">iter_changes</span><span class="p">(</span><span class="n">other_tree</span><span class="p">)</span>
<span class="n">tree</span><span class="o">.</span><span class="n">get_file_lines</span><span class="p">(</span><span class="n">file_id</span><span class="p">)</span>   <span class="c1"># and get_file, get_file_text</span>
</pre></div>
</div>
<div class="section" id="commit">
<h3><a class="toc-backref" href="#id3"><code class="docutils literal notranslate"><span class="pre">commit</span></code></a><a class="headerlink" href="#commit" title="Permalink to this headline">¶</a></h3>
<p>Commit is similar to <code class="docutils literal notranslate"><span class="pre">iter_changes</span></code>, but different because it needs to
compare to all the trees.  Commit currently needs to compare the working
tree to all the parent trees, which is needed to update the last_modified
field and would be unnecessary if we removed that field (for both files
and directories) and did not store per-file graphs.
This would potentially speed up commit after merge.</p>
<p>Verbose commit also displays the merged files, which does
require looking at all parents of files that aren’t identical
to the left-hand parent.</p>
</div>
<div class="section" id="log">
<h3><a class="toc-backref" href="#id4"><code class="docutils literal notranslate"><span class="pre">log</span></code></a><a class="headerlink" href="#log" title="Permalink to this headline">¶</a></h3>
<p>Log is interested in two operations: finding the revisions that touched
anything inside a directory, and getting the differences between
consecutive revisions (possibly filtered to a directory):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">find_touching_revisions</span><span class="p">(</span><span class="n">branch</span><span class="p">,</span> <span class="n">file_id</span><span class="p">)</span> <span class="c1"># should be on Branch?</span>
</pre></div>
</div>
<p>Log shows the revisions that merged a change.  At the moment that is not
included in the per-file graph, and it would also not be visible if the
directories were hashed.</p>
</div>
</div>
<div class="section" id="open-questions">
<h2><a class="toc-backref" href="#id5">Open questions</a><a class="headerlink" href="#open-questions" title="Permalink to this headline">¶</a></h2>
<ul>
<li><p class="first">Is this a good idea at all?</p>
<p>If changing a file changes all its parent directories up to the root it
will cause more churn on commit.  (We currently update the all-in-one
inventory, but only have to update one line of it.)</p>
<p>Every time a child changes, we’ll get a new node in the per-directory
graph.  This is generally useful: it allows bzr log to do the default
mode easily, which is to show all changes under that directory.  The
less common operation, <code class="docutils literal notranslate"><span class="pre">log</span> <span class="pre">--no-recursive</span></code> is still possible by
looking only at when the directory itself was renamed, added or removed.
(That is what the directory graph describes in bzr 0.18 and it is rarely
useful.)</p>
</li>
<li><p class="first">Should these be hashes or revision ids or something else?</p>
<p>Pros of using hashes: hashes are easy to generate by a foreign branch
plugin (e.g. bzr-svn).  They don’t need to get recursive last-changed
from the foreign branch, or to walk back through history.  They just
need the relevant directory state, which any system we support can
answer.</p>
<p>Hashes converge: if you modify and then modify back, you get the same
hash.  This is a pro because you can detect that there were ultimately
no significant changes.  And also a con: you cannot use these hashes to form a graph
because they get cycles.</p>
</li>
<li><p class="first">Are the values unique across the whole tree, or only when comparing
different versions of the same object?</p>
<p>If we use last-changed revisions, then they will be very not unique
across the whole tree.  To look up the contents, you must pass a
composite key like <code class="docutils literal notranslate"><span class="pre">(file_id,</span> <span class="pre">last_changed)</span></code>.</p>
<p>If we use hashes they will be same only when the two contain the same
contents.  Since we say that file ids must be unique, this
means they will match if and only if they are empty.  We might relax
that in future when we introduce path tokens.</p>
</li>
<li><p class="first">Is it reasonable to assume hashes won’t collide?</p>
<p>The odds of SHA-1 hashes colliding “accidentally” are vanishingly small.</p>
<p>It is possible that a <a class="reference external" href="http://tools.ietf.org/html/rfc4270">preimage attack</a> against SHA-1 may be discovered
in the future.  Since we’re not proposing in this document to make
revision-ids be SHA-1, if SHA-1 was obsoleted then we could rewrite the
contents of revisions but would not need to rename revisions.  So the
impact of such a migration should just be a format upgrade, and a
recommendation (but not requirement) to re-sign revisions.</p>
</li>
</ul>
<ul>
<li><p class="first">If we use hashes, should it be the hash of the
representation stored for a directory?</p>
<p>In other words, should we pun the representation of the directory with
the form used for validation.</p>
<p>If there’s some data stored that’s not in the hash it’s problematic.
The hash in no longer (effectively) uniquely identifies the
representation.</p>
<p>It is desirable that we have a hash that covers all data, to guard
against bugs, transmission errors, or users trying to hand-hack files.
Since we need one hash of everything in the tree, perhaps we should also
use it for the fingerprint.</p>
<p>Testaments explicitly separate the form used for hashing/signing from
the form used for storage.  This allows us to change the storage form
without breaking existing GPG signatures.  The downside is that we need
to do work O(tree) to make a testament, and this slows down signing,
verifying and generating bundles.  It also means that there is some
stored data which is not protected by the signature: this data is less
important, but corruption of it would still cause problems.
We have encountered some specific problems with disagreement between
inventories as to the last-change of files, which is currently unsigned.
These problems can be introduced by ghosts.</p>
<p>If we hash the representation, there is still a way to support old
signatures, assuming that we never discard irreplaceable information.
The signature should say what format it applies to (similar to
testaments), and we could transform in memory the tree back to that
format.</p>
</li>
<li><p class="first">Is hashing substantially slower than other possible approaches?</p>
<p>We already hash all the plain files.  Except in unusual cases, the
directory metadata will be substantially smaller: perhaps 200:1 as a
rule of thumb.</p>
<p>When building a bzr tree, we spend on the order of 100ms hashing all the
source lines to validate them (about 13MB of source).</p>
</li>
<li><p class="first">Can you calculate one from a directory in the working tree?  Without a basis?</p>
<p>This seems possible with either hashes or revision ids.</p>
<p>Using last_changed means that calculating the fingerprint from a working
tree necessarily requires reading the inventory for the basis
revision, so that we know when unchanged files were last changed.  With
hashes we could calculate them using the working tree information alone.
It’s true that we will often then compare that information to the basis
tree (e.g. for simple <code class="docutils literal notranslate"><span class="pre">bzr</span> <span class="pre">diff</span></code>), but we may only have to compare at
the top level, and sometimes we’re comparing to a
different tree.  This also touches on whether we should store
<code class="docutils literal notranslate"><span class="pre">last_modified</span></code> for files, rather than directories.</p>
<p>For revision ids we need to assign a value to use for uncommitted
changes, but see below about the problems of this.</p>
<p>In some ways it would be elegant to say (hypothetical):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">wt</span><span class="o">.</span><span class="n">get_root</span><span class="p">()</span><span class="o">.</span><span class="n">get_last_modified</span><span class="p">()</span> <span class="o">==</span> <span class="n">branch</span><span class="o">.</span><span class="n">get_last_revision</span><span class="p">()</span>
</pre></div>
</div>
<p>to know that nothing was changed; but this may not be much better than</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">wt</span><span class="o">.</span><span class="n">get_root</span><span class="p">()</span><span class="o">.</span><span class="n">get_hash</span><span class="p">()</span> <span class="o">==</span>
  <span class="n">branch</span><span class="o">.</span><span class="n">get_basis</span><span class="p">()</span><span class="o">.</span><span class="n">get_root</span><span class="p">()</span><span class="o">.</span><span class="n">get_hash</span><span class="p">()</span>
</pre></div>
</div>
</li>
<li><p class="first">Can you use this to compare (directories from) two working trees?</p>
<p>If you can generate it from a working tree, you should be able to use it
to compare them.</p>
<p>This does rule out for example using <code class="docutils literal notranslate"><span class="pre">last_modified=None</span></code> or
<code class="docutils literal notranslate"><span class="pre">='current:'</span></code> to mean “changed in the working tree.”  Even if this is
not supported there seems some risk that we would get the same
fingerprint for trees that are actually different.</p>
<p>We could assign a
hypothetical revision id to the tree for uncommitted files.  In that
case there is some risk that the not-yet-committed id would become
visible or committed.</p>
</li>
<li><p class="first">Can we use an “approximate basis”?</p>
<p>When using radix trees, you may need context beyond the specific
directory being compared.</p>
</li>
<li><p class="first">Can you get the fingerprint of parents directories with only selected file ids
taken from the working tree?</p>
<p>With hashes, we’d want to carry through the unselected files and
directories from the values they had in the parent revision.</p>
</li>
<li><p class="first">Are unbalanced trees a significant problem?  Trees can be unbalanced by having
many directories (deep or wide), or many files per directory.</p>
<p>For small trees like bzr, 744 of 874 are in the bzrlib subtree.  In
general, larger trees are more balanced, because humans, editors and
other tools have trouble managing very unbalanced trees.  But there are
exceptions: Aaron has one tree with 20,000 generated but versioned
entries in one directory.</p>
</li>
<li><p class="first">Should we use a radix tree approach where fingerprints are calculated on a synthetic
tree that is by definition balanced, even when the actual tree is unbalanced?</p>
</li>
<li><p class="first">What are the specific advantages of using recursive-last-modified rather than
hashes?</p>
<p>It may be a smaller step change.</p>
<p>It’s a bidirectional link: given a directory text identifier <code class="docutils literal notranslate"><span class="pre">(file_id,</span>
<span class="pre">last_changed)</span></code> you can look up the revision that last changed it.</p>
<p>From the preceding, even without the per-file graph you can skip through
the history of this file: go to the last-changed revision, look at all
its parents and repeat.</p>
</li>
<li><p class="first">Is it a smaller change to use recursive-last-modified on directories?</p>
<p>Probably yes:</p>
<ol class="arabic">
<li><p class="first">We can just put it into the current inventory format without changing
anything else.</p>
<p>By contrast to use a hash we’d have to either split up the inventory
as stored, or change the sort order for the inventory, or synthesize
per-directory inventories in memory for hashing.</p>
<p>However, XML is somewhat redundant and slow to parse/generate; and
reading the whole thing before comparing some sections is only a
partial win.  It may be a smaller change but we’d be preserving
things we want to change.</p>
</li>
</ol>
<ol class="arabic simple">
<li>At present we rarely hash storage representations, only file texts.
This is not a large technical change, but it is a conceptual change.
This has some consequences for how we can upgrade it in future: all
the changed directories need to be rewritten up to the revision level.</li>
</ol>
<ol class="arabic simple">
<li>If we address directories by hash we need hash-addressed
storage.</li>
</ol>
<ol class="arabic simple">
<li>If we address directories by hash then for consistency we’d probably
(not necessarily) want to address file texts by hash.</li>
</ol>
<ol class="arabic simple">
<li>The per-file graph can’t be indexed by hash because they can converge, so we
need to either rework or dispose of the per-file graph.</li>
</ol>
</li>
<li><p class="first">Any possibilities for avoiding hashes recurring?</p>
<ol class="arabic simple">
<li>Hash along with an identification of the parents (as in hg).  Then you
can’t convert a tree without all its basis trees, and there is still
convergence when the same merge is done by two people, and you can’t
create it directly from the working tree.</li>
</ol>
<ol class="arabic simple">
<li>Include last-modified revision id in the hash.</li>
</ol>
<ol class="arabic simple">
<li>Index by <code class="docutils literal notranslate"><span class="pre">(revision,</span> <span class="pre">hash)</span></code> or vice versa.</li>
</ol>
<ol class="arabic simple">
<li>Store a per-file graph and allow it to have repeated keys.  The graph
would tell you about all the parent texts ever seen; you would need
to use revision graph information to resolve ambiguities.</li>
</ol>
</li>
<li><p class="first">What are the specific disadvantages of using recursive-last-modified rather than
hashes?</p>
<p>To calculate the last-changed revision, given the last-changed
information of the contained files, you need to look at the revision
graph.  They’re not enough because you need to know the relations
between the mentioned revisions.  In a merge it’s possible the correct
directory last-modified will not be the same as that of any of the files
within it.  This can also happen when a file is removed (deleted or
renamed) from a directory.</p>
</li>
<li><p class="first">Should we split up storage of the inventories?</p>
<p>This is not quite the same but connected.</p>
</li>
<li><p class="first">How does this relate to per-file/per-directory hashes?</p>
<p>If the version of a file or directory is identified by a hash, we can’t
use that to point into a per-file graph.  We can have a graph indexed by
<code class="docutils literal notranslate"><span class="pre">(file_id,</span> <span class="pre">hash,</span> <span class="pre">revision_id)</span></code>.  The last-modified could be stored as
part of this graph.</p>
<p>The graph would no longer be core data; it could be always present but
might be rebuilt.  Treating it as non-core data may make some changes
like shallow branches easier?</p>
</li>
<li><p class="first">How do you ask a tree for a given text?</p>
<p>Right now we say</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">revision_tree</span><span class="o">.</span><span class="n">get_file_lines</span><span class="p">(</span><span class="n">file_id</span><span class="p">)</span>
</pre></div>
</div>
<p>so the choice of storage is hidden behind the revision tree: it could be
accessed by <code class="docutils literal notranslate"><span class="pre">(file_id,</span> <span class="pre">last_changed)</span></code> or by hash or otherwise.</p>
<p>At the moment the Repository exports a friend api to RevisionTree,
currently usually talking in VersionedFiles.</p>
<p>We probably wouldn’t want Repository to expose a <code class="docutils literal notranslate"><span class="pre">get_text_for_sha1()</span></code>
interface because that would be very difficult to support on old
repositories or on foreign branches.</p>
</li>
</ul>
</div>
<div class="section" id="conclusions">
<h2><a class="toc-backref" href="#id6">Conclusions</a><a class="headerlink" href="#conclusions" title="Permalink to this headline">¶</a></h2>
</div>
<div class="section" id="design-changes">
<h2><a class="toc-backref" href="#id7">Design changes</a><a class="headerlink" href="#design-changes" title="Permalink to this headline">¶</a></h2>
</div>
<div class="section" id="api-changes">
<h2><a class="toc-backref" href="#id8">API changes</a><a class="headerlink" href="#api-changes" title="Permalink to this headline">¶</a></h2>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">
  <h3><a href="index.html">Table of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">Directory fingerprints</a><ul>
<li><a class="reference internal" href="#introduction">Introduction</a></li>
<li><a class="reference internal" href="#use-case-oriented-apis">Use-case oriented APIs</a><ul>
<li><a class="reference internal" href="#commit"><code class="docutils literal notranslate"><span class="pre">commit</span></code></a></li>
<li><a class="reference internal" href="#log"><code class="docutils literal notranslate"><span class="pre">log</span></code></a></li>
</ul>
</li>
<li><a class="reference internal" href="#open-questions">Open questions</a></li>
<li><a class="reference internal" href="#conclusions">Conclusions</a></li>
<li><a class="reference internal" href="#design-changes">Design changes</a></li>
<li><a class="reference internal" href="#api-changes">API changes</a></li>
</ul>
</li>
</ul>

  <div role="note" aria-label="source link">
    <h3>This Page</h3>
    <ul class="this-page-menu">
      <li><a href="_sources/directory-fingerprints.txt"
            rel="nofollow">Show Source</a></li>
    </ul>
   </div>
<div id="searchbox" style="display: none" role="search">
  <h3>Quick search</h3>
    <div class="searchformwrapper">
    <form class="search" action="search.html" method="get">
      <input type="text" name="q" />
      <input type="submit" value="Go" />
      <input type="hidden" name="check_keywords" value="yes" />
      <input type="hidden" name="area" value="default" />
    </form>
    </div>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
<li><a href="http://bazaar.canonical.com/">
    <img src="_static/bzr icon 16.png" /> Home</a>&nbsp;|&nbsp;</li>
<a href="http://doc.bazaar.canonical.com/en/">Documentation</a>&nbsp;|&nbsp;</li>

        <li class="nav-item nav-item-0"><a href="index.html">Developer Document Catalog (2.7.0)</a> &#187;</li>
 
      </ul>
    </div>
    <div class="footer" role="contentinfo">
        &#169; Copyright 2009-2011 Canonical Ltd.
      Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.8.4.
    </div>
  </body>
</html>