summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorMichał Górny <mgorny@gentoo.org>2017-09-14 23:14:39 +0200
committerUlrich Müller <ulm@gentoo.org>2017-10-09 12:08:51 +0200
commitc6fe2071a2e83be2203196ad7f9459941821a034 (patch)
treed81e1d9898c05917e05203af9803b581dff0d915 /glep-0025.rst
parentglep-0045: Mark Final since GLEP 1 now uses ISO 8601 dates (diff)
downloadglep-c6fe2071a2e83be2203196ad7f9459941821a034.tar.gz
glep-c6fe2071a2e83be2203196ad7f9459941821a034.tar.bz2
glep-c6fe2071a2e83be2203196ad7f9459941821a034.zip
Rename all GLEPs to .rst
Diffstat (limited to 'glep-0025.rst')
-rw-r--r--glep-0025.rst297
1 files changed, 297 insertions, 0 deletions
diff --git a/glep-0025.rst b/glep-0025.rst
new file mode 100644
index 0000000..8a9e96c
--- /dev/null
+++ b/glep-0025.rst
@@ -0,0 +1,297 @@
+GLEP: 25
+Title: Distfile Patching Support
+Version: $Revision$
+Last-Modified: $Date$
+Author: Brian Harring <ferringb@gentoo.org>
+Status: deferred
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 6-Mar-2004
+Post-History: 4-Apr-2004, 11-Nov-2004
+
+Abstract
+========
+
+The intention of this GLEP is to propose the creation of patching support for
+Portage, and iron out the implementation details.
+
+Status
+======
+
+Timed out
+
+
+Motivation
+==========
+
+Reduce the bandwidth load placed on our mirrors by decreasing the amount of
+bytes transferred when upgrading between versions. Side benefit of this is to
+significantly decrease the download requirements for users lacking broadband.
+
+Binary patches vs GNUDiff patches
+=================================
+
+Most people are familiar with diff patches (unified diff for example)- this
+glep is specifically proposing the use of an actual binary differencer. The
+reason for this is that diff patches are line based- you change a single
+character in a line, and the whole line must be included in the patch. Binary
+differencers work at the byte level- it encodes just that byte. In that
+respect binary patches are often much more efficient then diff patches.
+
+Further, the ability to reverse a unified patch is due to the fact the diff
+includes **both** the original line, and the modified line. The author isn't
+aware of any binary differencer that is able to create patches the can be
+reversed- basically they're unidirectional, the patch that is generated can
+only be used to upgrade or downgrade the version, not both. The plus side of
+this limitation is a significantly decreased patch size.
+
+The choice of binary patches over diff patches pretty much comes down to the
+fact they're smaller- example being a kdelibs binary patch for 3.1.4->3.1.5 is
+75kb, the equivalent diff patch is 123kb, and is unable to result in a correct
+md5 [1]_.
+
+Currently, this glep is proposing only the usage of binary patches- that's not
+to say (with a fair amount of work) it couldn't be extended to support
+standard diffs.
+
+Rationale
+=========
+
+The difference between source releases typically isn't very large, especially
+for minor releases. As an example, kdelibs-3.1.4.tar.bz2 is 10.53 MB, and
+kdelibs-3.1.5.tar.bz2 is 10.54 MB. A bzip2'ed patch between those versions is
+75.6 kb [2]_, less then 1% the size of 3.1.5's tbz2.
+
+Specification
+=============
+
+Quite a few sections of gentoo are affected- mirroring, the portage tree, and
+portage itself.
+
+Additions to the tree
+---------------------
+
+For adding patch info into the tree, this glep proposes a global patch list
+(stored in profiles as patches.global), and individual patch lists stored in
+relevant package directories (named patches). Using the kernel packages as an
+example, a global list of patches enables us to create a patch once, add an
+entry, and have all kernel packages benefit from that single entry. Both
+patches.global, and individual package patch files share the same format:
+
+::
+
+ MD5 md5-value patch-url size MD5 md5-value ref-file size UMD5 md5-value new-file size
+
+For those familiar with digest file layout, this should look familiar.
+Essentially, chksum type, value, filename, size. The UMD5 chksum type is just
+the uncompressed md5/size of the file- so if the UMD5 were for a bzip2
+compressed file, it would be the md5 value/size of the uncompressed file.
+And an example:
+
+::
+
+ MD5 ccd5411b3558326cbce0306fcae32e26 http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320
+
+In the above example, the md5sum of
+http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 is
+calculated, compared to the stored value, and then the file size is checked.
+The one difference is the UMD5 checksum type- the md5 value and the size are
+specific to the *uncompressed* file. Continuing, for cases where the patch
+will reside on one of our mirrors, the patch filename would be sufficient.
+
+Finally, note that this is a unidirectional patch- using the above example,
+kdelibs-3.1.4-3.1.5 can **only** be used to upgrade from 3.1.4 to 3.1.5, not
+in reverse (originally explained in `Binary patches vs GNUDiff patches`_).
+
+Portage Implementation
+----------------------
+
+This glep proposes the patching support should be (at this stage) optional-
+specifically, enabled via FEATURES="patching".
+
+Fetching
+''''''''
+
+When patching is enabled, the global patch list is read, and the packages
+patch list is read. From there, portage determines what files could be used
+as a base for patching to the desired file- further, determining if it's
+actually worth patching (case where it wouldn't be is when the target file is
+less then the sum of the patches needed). Any patches to be used are fetched,
+and md5 verified.
+
+Reconstruction
+''''''''''''''
+
+Upon fetching and md5 verification of patch(es), the desired file is
+reconstructed. Assuming reconstruction didn't return any errors, the target
+file has its uncompressed md5sum calculated and verified, then is recompressed
+and the compressed md5sum calculated. At this point, if the compressed md5
+matches the md5 stored in the tree, then portage transfers the file into
+distfiles, and continues on its merry way.
+
+If the compressed md5 is different from the tree's value, then the (proposed)
+md5 database is updated with new compressed md5. Details of this database
+(and the issue it addresses) follow.
+
+Compressed MD5sums:
+'''''''''''''''''''
+
+There will be instances where a file is reconstructed perfectly, recompressed,
+and the recompressed md5sum differs from what is stored in the tree- the
+problem is that the md5sum of a compressed file is inherently tied to the
+compressor version/options used to compress the original source.
+
+=====================
+The Problem in Detail
+=====================
+
+A good example of this problem is related to bzip2 versions used for
+compression. Between bzip2 0.9x and bzip2 1.x, there was a subtle change in
+the compressor resulting in a slightly better compression result- end result
+being a different file, eg a different md5sum. Assuming compressor versions
+are the same, there also is the issue of what compression level the target
+source was originally compressed at- was it compressed with -9, -8 or -7?
+That's just a sampling of the various original settings that must be accounted
+for, and that's limited to gzip/bzip2; other compressors will add to the
+number of variables to be accounted for to produce an exact recreation of the
+compressed md5sum.
+
+Tracking the compressor version and options originally used isn't really a
+valid option- assuming all options were accounted for, clients would still be
+required to have multiple versions of the same compressor installed just for
+the sake of recreating a compressed md5sum *even though* the uncompressed
+source's md5 has already been verified.
+
+=====================
+The Proposed Solution
+=====================
+
+The creation of a clientside flatfile/db of valid alternate md5/size pairs
+would enable portage to handle perfectly reconstructed files, that have a
+different md5sum due to compression differences. The proposed format is thus:
+
+::
+
+ MD5 md5sum orig-file size MD5 md5sum [ optional new-name ] size
+
+Example:
+
+::
+
+ MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 MD5 0733dd85ed44d88d1eabed704d579721 165444187
+
+An alternate md5/size pair for a file would be added **only** when the
+uncompressed source's md5/size has been verified, yet upon recompression the
+md5 differs. For cleansing of older md5/size pairs from this db, a utility
+would be required- the author suggests the addition of a distfiles-cleaning
+utility to portage, with the ability to also cleanse old md5/size pairs when
+the file the pair was created for no longer exists in distfiles.
+
+Where to store the database is debatable- /etc/portage or /var/cache/edb are
+definite options.
+
+The reasoning for allowing for an optional new-name is that it provides needed
+functionality should anyone attempt to extend portage to allow for clients to
+change the compression used for a source (eg, recompress all gzip files as
+bzip2). Granted, no such code or attempt has been made, but nothing is lost
+by leaving the option open should the request/attempt be made.
+
+A potential gotcha of adding this support is that in environments where the
+distfiles directory is shared out to multiple systems, this db must be shared
+also.
+
+
+
+Distfile Mirror Additions
+-------------------------
+
+One issue of contention is where these files will actually be stored. As of
+the writing of this glep, a full distfiles mirror is roughly around 40 gb- a
+rough estimate by the author places the space requirements for patches for
+each version at a total of around 4gb. Note this isn't even remotely a hard
+figure yet, and a better figure is being checked into currently.
+
+Regardless of the exact space figure, finding a place to store the patches
+will be problematic. Expansion of the required mirror space (essentially just
+swallowing the patches storage requirement) is unlikely, since it was one of
+the main arguments against the now defunct glep9 attempt [2]_. A couple of
+ideas that have been put forth to handle the additional space requirements are
+as follows-
+
+1) Identification of mirrors willing to handle the extra space requirements-
+essentially create an additional patch mirror tier.
+
+2) Mirroring only a patch for certain package versions, rather then full
+source. Using kdelibs-3.1.5 as an example, only the patch would be mirrored
+(rather then the full 10.53 MB source). Downside to this approach is that a
+user who is downloading kdelibs for the first time would either need to pull
+it from the original SRC_URI (placing the burden onto the upstream mirror), or
+pull the 3.1.4 version, and the patch- pulling 63k more then if they had just
+pulled the full version. The kdelibs 3.1.4/3.1.5 example is something of an
+optimal case- not all versions will have such miniscule patches.
+
+3) A variation on the idea above, essentially mirroring only the patch for
+the oldest version(s) of a package; eg, kdelibs currently has version 3.05,
+3.1.5, 3.2.0, and 3.2.1- the mirrors would only carry a patch for 3.05, not
+full source (think RESTRICT="fetch"). One plus to this is that patches to
+downgrade in version are smaller then the patches to upgrade in version- there
+are exceptions to this, but they're hard to find. A major downside to this
+approach is A) a user would have to sync up to get the patchlists for that
+version, B) creation of a set of patches to go backwards in version (see
+`Binary patches vs GNUDiff patches`_)..
+
+Of the options listed above, the first is the easiest, although the second
+could be made to work. Feedback and any possible alternatives would be
+greatly appreciated.
+
+Patch Creation
+--------------
+
+Maintenance of patch lists, and the actual patch creation ought to be managed
+by a high level script- essentally a dev says "I want a patch between this
+version, and that version: make it so", the script churns away
+creating/updating the patch list, and generating the patch locally. The
+utility next uploads the new patch to /space/distfiles-local on dev.gentoo.org
+(exempting if it's not a locally generated patch), and repoman is used to
+commit the updated patch list.
+
+What would be preferable (although possibly wishful thinking), is if hardware
+could be co-opted for automatic patch generation, rather then forcing it upon
+the devs- something akin to how files are pulled onto the mirror automatically
+for new ebuilds.
+
+The initial bulk of patches to get will be generated by the author, to ease
+the transition and offer patches for people to test out.
+
+Backwards Compatibility
+=======================
+
+As noted in `The Proposed Solution`_, a system using patching and sharing out
+it's distfiles must share out it's alternate md5 db. Any system that uses the
+distfiles share must support the alternate md5 db also. If this is considered
+enough of an issue, it is conceivable to place reconstructed sources with an
+alternate md5 into a subdirectory of distdir- portage only looks within
+distdir, unwilling to descend into subdirectories.
+
+Also note that `Distfile Mirror Additions`_ may add additional backwards
+compatibility issues, depending on what solution is accepted.
+
+Reference Implementation
+========================
+
+TODO
+
+References
+==========
+.. [1] http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.{patch,diff}.bz2.
+.. [2] kdelibs-3.1.4-3.1.5.patch.bz2, switching format patch, created via diffball-0.4_pre4 (diffball is available at http://sourceforge.net/projects/diffball)
+ Bzip2 -9 compressed, the patch is 75,687 bytes, uncompressed it is 337,649 bytes. The patch is available at http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2 for those curious.
+.. [3] Glep9, 'Gentoo Package Update System'
+ (http://glep.gentoo.org/glep-0009.html)
+
+Copyright
+=========
+
+This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
+Unported License. To view a copy of this license, visit
+http://creativecommons.org/licenses/by-sa/3.0/.