https://bugs.winehq.org/show_bug.cgi?id=42756
Bug ID: 42756
Summary: test.winehq.org uses a lot of disk space
Product: WineHQ.org
Version: unspecified
Hardware: x86
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: www-unknown
Assignee: wine-bugs@winehq.org
Reporter: fgouget@codeweavers.com
Distribution: ---
According to my estimates the static HTML files for the test.winehq.org site use around 21.5 GB. That seems like a lot for such a simple website.
This estimate is based on the following data, collected from 3 days' worth of reports:
8.3 MB / report
60 reports / day
500 MB / day
44 days of history
-> 21.5 GB total
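For reference, the arithmetic behind this estimate can be reproduced with a short sketch. The variable names are made up for illustration, and the conversion assumes 1 GB = 1024 MB, which is what makes 44 days at 500 MB come out at 21.5 GB (with 1 GB = 1000 MB it would be 22 GB).

```python
# Figures quoted in the report above; names are illustrative only.
MB_PER_REPORT = 8.3
REPORTS_PER_DAY = 60
ROUNDED_MB_PER_DAY = 500   # the report rounds 8.3 * 60 = 498 up to 500
HISTORY_DAYS = 44

mb_per_day = MB_PER_REPORT * REPORTS_PER_DAY
total_mb = ROUNDED_MB_PER_DAY * HISTORY_DAYS
total_gb = total_mb / 1024  # assumption: binary gigabytes

print(f"{mb_per_day:.0f} MB/day before rounding")
print(f"{total_mb} MB = {total_gb:.1f} GB over {HISTORY_DAYS} days")
```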
It's not because the raw reports are big either. On any given day we get:
50 MB of raw report files
180 MB of full report HTML files
285 MB of individual test unit HTML files
3 MB of index files
So most of it is caused by the HTML files.
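To put numbers on that, here is a small sketch computing each category's share of the daily total. It uses only the per-day figures listed above; the dictionary keys are just labels. The index files are counted separately from the two HTML report categories, which is an assumption about how to group them. Note that the four figures sum to 518 MB, slightly above the rounded 500 MB / day.

```python
# Per-day sizes in MB, as listed in the report above.
daily_mb = {
    "raw report files": 50,
    "full report HTML files": 180,
    "individual test unit HTML files": 285,
    "index files": 3,
}

total = sum(daily_mb.values())
html_mb = (daily_mb["full report HTML files"]
           + daily_mb["individual test unit HTML files"])
html_share = html_mb / total

for name, size in daily_mb.items():
    print(f"{name:32s} {size:4d} MB  {size / total:6.1%}")
print(f"HTML reports: {html_mb} MB of {total} MB ({html_share:.1%})")
```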
Here are some ways one could minimize the bloat:
1) Split the raw report into one file per test unit, plus one index file recording the order in which to reassemble them. These would not use more space than the original raw report, except for the filesystem's internal fragmentation. Every file the web server serves would then be generated from these. For the raw report one would just have to concatenate them in the right order. The full report and the individual test unit pages would be generated by parsing the relevant source text files. A drawback is all the extra parsing required to generate each page. In addition, some issues in the raw report can only be detected when parsing the full report, because they show up in the transition from one test result to the next.
2) Another option would be to skip generating the full report HTML file and omit the header and footer from the individual test unit HTML files. Each page could then be generated by assembling the individual HTML fragments with the right HTML glue, to build either a full HTML report or an individual test unit HTML file. This would have the advantage of not requiring any extra parsing to serve each file.
3) Drop the individual test unit HTML files and always link to the full HTML report. The drawback is more bandwidth usage, since everyone looking at a test result would end up downloading an entire full report (180 MB per day across all reports) instead of the small individual test unit reports.
4) Drop the full HTML report and only keep the per-test-unit HTML files. This saves less disk space but does not force users to download a big file when they are only interested in the result of one test unit.
5) Find some way to reduce the size of the links to the Git source. Removing the links altogether saves 191 MB in the test case. Out of the 128 characters of a typical link, 95 are identical from one link to the next, so some JavaScript could perhaps factor out most of this.
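As a rough illustration of option 5, the sketch below splits a set of links into one shared prefix and short per-link suffixes. The URLs and the `factor_links` helper are hypothetical; the real link layout on test.winehq.org may differ. The page would then store the prefix once and rebuild each link on the client. With the figures above, 95 of 128 characters (about 74%) of each link would be shared.

```python
import os

def factor_links(links):
    """Split links into one shared prefix and the per-link suffixes."""
    prefix = os.path.commonprefix(links)  # character-wise common prefix
    return prefix, [link[len(prefix):] for link in links]

# Hypothetical links, for illustration only.
links = [
    "https://git.example.org/wine.git/blob/0123456789abcdef:/dlls/kernel32/tests/file.c#l120",
    "https://git.example.org/wine.git/blob/0123456789abcdef:/dlls/kernel32/tests/file.c#l245",
    "https://git.example.org/wine.git/blob/0123456789abcdef:/dlls/kernel32/tests/path.c#l17",
]

prefix, suffixes = factor_links(links)
chars_before = sum(len(link) for link in links)
chars_after = len(prefix) + sum(len(suffix) for suffix in suffixes)

print("shared prefix:", prefix)
print("suffixes:", suffixes)
print(f"{chars_before} characters before, {chars_after} after")
```

The saving grows with the number of links on a page, since the prefix is stored only once.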
A drawback of options 1 and 2 is that since the files would no longer be static they could probably not be put up on a content distribution network. But I'm not sure test.winehq.org uses a CDN now so this probably does not matter.
It may still be possible to reduce the processing load introduced by 1 and 2 by caching the generated files for a few hours or days. All in all this would increase complexity quite a bit though.
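To give an idea of what option 2 plus caching could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `index.txt` file listing test units in order, the `.frag` extension for header-less HTML fragments, the HTML glue strings, the three-hour lifetime, and the in-memory cache. It is not how the site currently works.

```python
import tempfile
import time
from pathlib import Path

HEADER = "<html><body>\n"        # hypothetical HTML glue
FOOTER = "</body></html>\n"
CACHE_TTL_SECONDS = 3 * 3600     # "a few hours"; the value is arbitrary

_cache = {}  # (report directory, unit) -> (expiry time, html)

def build_page(report_dir, unit=None):
    """Assemble the full report (unit=None) or a single test unit page."""
    report_dir = Path(report_dir)
    if unit is None:
        # The index file gives the order in which to reassemble fragments.
        names = (report_dir / "index.txt").read_text().split()
    else:
        names = [unit]
    body = "".join((report_dir / f"{name}.frag").read_text() for name in names)
    return HEADER + body + FOOTER

def get_page(report_dir, unit=None, now=None):
    """Return a cached page if it has not expired, else rebuild and cache it."""
    now = time.time() if now is None else now
    key = (str(report_dir), unit)
    cached = _cache.get(key)
    if cached is not None and cached[0] > now:
        return cached[1]
    html = build_page(report_dir, unit)
    _cache[key] = (now + CACHE_TTL_SECONDS, html)
    return html

# Small demonstration with two made-up test units.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp)
    (report / "index.txt").write_text("advapi32_registry\nkernel32_file\n")
    (report / "advapi32_registry.frag").write_text("<div>registry ok</div>\n")
    (report / "kernel32_file.frag").write_text("<div>file ok</div>\n")
    full_page = get_page(report)
    unit_page = get_page(report, "kernel32_file")

print(full_page)
```

Only the fragments and the index are stored on disk; the full report and the per-unit pages are both derived from them on request, and repeated requests within the cache lifetime do no file reads at all.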
--- Comment #1 from François Gouget <fgouget@codeweavers.com> ---
A variant on option 2 would be to have JavaScript code fetch the chunk index and the relevant HTML chunks and assemble them on the client side.