diff options
author | Michał Górny <mgorny@gentoo.org> | 2017-09-14 23:14:39 +0200 |
---|---|---|
committer | Ulrich Müller <ulm@gentoo.org> | 2017-10-09 12:08:51 +0200 |
commit | c6fe2071a2e83be2203196ad7f9459941821a034 (patch) | |
tree | d81e1d9898c05917e05203af9803b581dff0d915 /glep-0025.rst | |
parent | glep-0045: Mark Final since GLEP 1 now uses ISO 8601 dates (diff) | |
download | glep-c6fe2071a2e83be2203196ad7f9459941821a034.tar.gz glep-c6fe2071a2e83be2203196ad7f9459941821a034.tar.bz2 glep-c6fe2071a2e83be2203196ad7f9459941821a034.zip |
Rename all GLEPs to .rst
Diffstat (limited to 'glep-0025.rst')
-rw-r--r-- | glep-0025.rst | 297 |
1 files changed, 297 insertions, 0 deletions
diff --git a/glep-0025.rst b/glep-0025.rst new file mode 100644 index 0000000..8a9e96c --- /dev/null +++ b/glep-0025.rst @@ -0,0 +1,297 @@ +GLEP: 25 +Title: Distfile Patching Support +Version: $Revision$ +Last-Modified: $Date$ +Author: Brian Harring <ferringb@gentoo.org> +Status: deferred +Type: Standards Track +Content-Type: text/x-rst +Created: 6-Mar-2004 +Post-History: 4-Apr-2004, 11-Nov-2004 + +Abstract +======== + +The intention of this GLEP is to propose the creation of patching support for +Portage, and iron out the implementation details. + +Status +====== + +Timed out + + +Motivation +========== + +Reduce the bandwidth load placed on our mirrors by decreasing the amount of +bytes transferred when upgrading between versions. Side benefit of this is to +significantly decrease the download requirements for users lacking broadband. + +Binary patches vs GNUDiff patches +================================= + +Most people are familiar with diff patches (unified diff for example)- this +glep is specifically proposing the use of an actual binary differencer. The +reason for this is that diff patches are line based- you change a single +character in a line, and the whole line must be included in the patch. Binary +differencers work at the byte level- it encodes just that byte. In that +respect binary patches are often much more efficient then diff patches. + +Further, the ability to reverse a unified patch is due to the fact the diff +includes **both** the original line, and the modified line. The author isn't +aware of any binary differencer that is able to create patches the can be +reversed- basically they're unidirectional, the patch that is generated can +only be used to upgrade or downgrade the version, not both. The plus side of +this limitation is a significantly decreased patch size. + +The choice of binary patches over diff patches pretty much comes down to the +fact they're smaller- example being a kdelibs binary patch for 3.1.4->3.1.5 is +75kb, the equivalent diff patch is 123kb, and is unable to result in a correct +md5 [1]_. + +Currently, this glep is proposing only the usage of binary patches- that's not +to say (with a fair amount of work) it couldn't be extended to support +standard diffs. + +Rationale +========= + +The difference between source releases typically isn't very large, especially +for minor releases. As an example, kdelibs-3.1.4.tar.bz2 is 10.53 MB, and +kdelibs-3.1.5.tar.bz2 is 10.54 MB. A bzip2'ed patch between those versions is +75.6 kb [2]_, less then 1% the size of 3.1.5's tbz2. + +Specification +============= + +Quite a few sections of gentoo are affected- mirroring, the portage tree, and +portage itself. + +Additions to the tree +--------------------- + +For adding patch info into the tree, this glep proposes a global patch list +(stored in profiles as patches.global), and individual patch lists stored in +relevant package directories (named patches). Using the kernel packages as an +example, a global list of patches enables us to create a patch once, add an +entry, and have all kernel packages benefit from that single entry. Both +patches.global, and individual package patch files share the same format: + +:: + + MD5 md5-value patch-url size MD5 md5-value ref-file size UMD5 md5-value new-file size + +For those familiar with digest file layout, this should look familiar. +Essentially, chksum type, value, filename, size. The UMD5 chksum type is just +the uncompressed md5/size of the file- so if the UMD5 were for a bzip2 +compressed file, it would be the md5 value/size of the uncompressed file. +And an example: + +:: + + MD5 ccd5411b3558326cbce0306fcae32e26 http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320 + +In the above example, the md5sum of +http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 is +calculated, compared to the stored value, and then the file size is checked. +The one difference is the UMD5 checksum type- the md5 value and the size are +specific to the *uncompressed* file. Continuing, for cases where the patch +will reside on one of our mirrors, the patch filename would be sufficient. + +Finally, note that this is a unidirectional patch- using the above example, +kdelibs-3.1.4-3.1.5 can **only** be used to upgrade from 3.1.4 to 3.1.5, not +in reverse (originally explained in `Binary patches vs GNUDiff patches`_). + +Portage Implementation +---------------------- + +This glep proposes the patching support should be (at this stage) optional- +specifically, enabled via FEATURES="patching". + +Fetching +'''''''' + +When patching is enabled, the global patch list is read, and the packages +patch list is read. From there, portage determines what files could be used +as a base for patching to the desired file- further, determining if it's +actually worth patching (case where it wouldn't be is when the target file is +less then the sum of the patches needed). Any patches to be used are fetched, +and md5 verified. + +Reconstruction +'''''''''''''' + +Upon fetching and md5 verification of patch(es), the desired file is +reconstructed. Assuming reconstruction didn't return any errors, the target +file has its uncompressed md5sum calculated and verified, then is recompressed +and the compressed md5sum calculated. At this point, if the compressed md5 +matches the md5 stored in the tree, then portage transfers the file into +distfiles, and continues on its merry way. + +If the compressed md5 is different from the tree's value, then the (proposed) +md5 database is updated with new compressed md5. Details of this database +(and the issue it addresses) follow. + +Compressed MD5sums: +''''''''''''''''''' + +There will be instances where a file is reconstructed perfectly, recompressed, +and the recompressed md5sum differs from what is stored in the tree- the +problem is that the md5sum of a compressed file is inherently tied to the +compressor version/options used to compress the original source. + +===================== +The Problem in Detail +===================== + +A good example of this problem is related to bzip2 versions used for +compression. Between bzip2 0.9x and bzip2 1.x, there was a subtle change in +the compressor resulting in a slightly better compression result- end result +being a different file, eg a different md5sum. Assuming compressor versions +are the same, there also is the issue of what compression level the target +source was originally compressed at- was it compressed with -9, -8 or -7? +That's just a sampling of the various original settings that must be accounted +for, and that's limited to gzip/bzip2; other compressors will add to the +number of variables to be accounted for to produce an exact recreation of the +compressed md5sum. + +Tracking the compressor version and options originally used isn't really a +valid option- assuming all options were accounted for, clients would still be +required to have multiple versions of the same compressor installed just for +the sake of recreating a compressed md5sum *even though* the uncompressed +source's md5 has already been verified. + +===================== +The Proposed Solution +===================== + +The creation of a clientside flatfile/db of valid alternate md5/size pairs +would enable portage to handle perfectly reconstructed files, that have a +different md5sum due to compression differences. The proposed format is thus: + +:: + + MD5 md5sum orig-file size MD5 md5sum [ optional new-name ] size + +Example: + +:: + + MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 MD5 0733dd85ed44d88d1eabed704d579721 165444187 + +An alternate md5/size pair for a file would be added **only** when the +uncompressed source's md5/size has been verified, yet upon recompression the +md5 differs. For cleansing of older md5/size pairs from this db, a utility +would be required- the author suggests the addition of a distfiles-cleaning +utility to portage, with the ability to also cleanse old md5/size pairs when +the file the pair was created for no longer exists in distfiles. + +Where to store the database is debatable- /etc/portage or /var/cache/edb are +definite options. + +The reasoning for allowing for an optional new-name is that it provides needed +functionality should anyone attempt to extend portage to allow for clients to +change the compression used for a source (eg, recompress all gzip files as +bzip2). Granted, no such code or attempt has been made, but nothing is lost +by leaving the option open should the request/attempt be made. + +A potential gotcha of adding this support is that in environments where the +distfiles directory is shared out to multiple systems, this db must be shared +also. + + + +Distfile Mirror Additions +------------------------- + +One issue of contention is where these files will actually be stored. As of +the writing of this glep, a full distfiles mirror is roughly around 40 gb- a +rough estimate by the author places the space requirements for patches for +each version at a total of around 4gb. Note this isn't even remotely a hard +figure yet, and a better figure is being checked into currently. + +Regardless of the exact space figure, finding a place to store the patches +will be problematic. Expansion of the required mirror space (essentially just +swallowing the patches storage requirement) is unlikely, since it was one of +the main arguments against the now defunct glep9 attempt [2]_. A couple of +ideas that have been put forth to handle the additional space requirements are +as follows- + +1) Identification of mirrors willing to handle the extra space requirements- +essentially create an additional patch mirror tier. + +2) Mirroring only a patch for certain package versions, rather then full +source. Using kdelibs-3.1.5 as an example, only the patch would be mirrored +(rather then the full 10.53 MB source). Downside to this approach is that a +user who is downloading kdelibs for the first time would either need to pull +it from the original SRC_URI (placing the burden onto the upstream mirror), or +pull the 3.1.4 version, and the patch- pulling 63k more then if they had just +pulled the full version. The kdelibs 3.1.4/3.1.5 example is something of an +optimal case- not all versions will have such miniscule patches. + +3) A variation on the idea above, essentially mirroring only the patch for +the oldest version(s) of a package; eg, kdelibs currently has version 3.05, +3.1.5, 3.2.0, and 3.2.1- the mirrors would only carry a patch for 3.05, not +full source (think RESTRICT="fetch"). One plus to this is that patches to +downgrade in version are smaller then the patches to upgrade in version- there +are exceptions to this, but they're hard to find. A major downside to this +approach is A) a user would have to sync up to get the patchlists for that +version, B) creation of a set of patches to go backwards in version (see +`Binary patches vs GNUDiff patches`_).. + +Of the options listed above, the first is the easiest, although the second +could be made to work. Feedback and any possible alternatives would be +greatly appreciated. + +Patch Creation +-------------- + +Maintenance of patch lists, and the actual patch creation ought to be managed +by a high level script- essentally a dev says "I want a patch between this +version, and that version: make it so", the script churns away +creating/updating the patch list, and generating the patch locally. The +utility next uploads the new patch to /space/distfiles-local on dev.gentoo.org +(exempting if it's not a locally generated patch), and repoman is used to +commit the updated patch list. + +What would be preferable (although possibly wishful thinking), is if hardware +could be co-opted for automatic patch generation, rather then forcing it upon +the devs- something akin to how files are pulled onto the mirror automatically +for new ebuilds. + +The initial bulk of patches to get will be generated by the author, to ease +the transition and offer patches for people to test out. + +Backwards Compatibility +======================= + +As noted in `The Proposed Solution`_, a system using patching and sharing out +it's distfiles must share out it's alternate md5 db. Any system that uses the +distfiles share must support the alternate md5 db also. If this is considered +enough of an issue, it is conceivable to place reconstructed sources with an +alternate md5 into a subdirectory of distdir- portage only looks within +distdir, unwilling to descend into subdirectories. + +Also note that `Distfile Mirror Additions`_ may add additional backwards +compatibility issues, depending on what solution is accepted. + +Reference Implementation +======================== + +TODO + +References +========== +.. [1] http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.{patch,diff}.bz2. +.. [2] kdelibs-3.1.4-3.1.5.patch.bz2, switching format patch, created via diffball-0.4_pre4 (diffball is available at http://sourceforge.net/projects/diffball) + Bzip2 -9 compressed, the patch is 75,687 bytes, uncompressed it is 337,649 bytes. The patch is available at http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2 for those curious. +.. [3] Glep9, 'Gentoo Package Update System' + (http://glep.gentoo.org/glep-0009.html) + +Copyright +========= + +This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 +Unported License. To view a copy of this license, visit +http://creativecommons.org/licenses/by-sa/3.0/. |