I read a forum question from an Opera user who was upset because Opera 9.10 now saves web pages “like IE and Firefox”, meaning it saves them with all the included files. His problem was easily solved with a configuration change, but it got me thinking. Generally this doesn’t seem to be such a bad idea: it allows you to open a saved web page and have it look exactly the same. So I tried to understand why this user was so upset and why I almost never use this feature myself. It seems there are three things.
It doesn’t just create the file the user told it to create but also a directory for the auxiliary files. It isn’t obvious to the user that this will happen and that he has to remove the directory as well when he chooses to remove the saved page. Even if he knows it, it still means some effort locating the directory which is annoying. Yes, if you happen to use Windows Explorer it will remove this directory automatically but this is a hack and a very non-obvious action again.
Solution: save everything into one file. I first thought of using data: URLs to embed all data inside the same HTML file. This would have the advantage of sticking to the HTML format, and nothing other than the web page saving code would need to be changed. However, I noticed disadvantages as well. The file wouldn’t be usable in Internet Explorer (it still doesn’t support data: URLs). More importantly, if the same image is used multiple times on the page it will have to be stored multiple times, since HTML has no way to specify “this image uses the same URL as image XYZ”. That last one is a showstopper, so supporting Microsoft’s MHTML format is probably still the more realistic alternative even though it means much more effort.
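To make the duplication problem concrete, here is a minimal Python sketch. It is not how any browser implements saving; the image bytes, the URL and both function names are made up for illustration. It builds the same page twice: once with the image embedded through data: URLs, and once as a multipart/related message (the container that MHTML is based on), where the image is a single part identified by a Content-Location header.

```python
import base64
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Stand-in bytes for an image; a real page would use the downloaded file.
IMAGE_BYTES = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
IMAGE_URL = "http://example.com/logo.png"  # hypothetical URL


def page_with_data_urls(uses: int) -> str:
    """Embed the image via data: URLs; every use carries its own copy."""
    encoded = base64.b64encode(IMAGE_BYTES).decode("ascii")
    tag = f'<img src="data:image/png;base64,{encoded}">'
    return "<html><body>" + tag * uses + "</body></html>"


def page_as_mhtml(uses: int) -> str:
    """Store the page as multipart/related; the image is one part only."""
    tag = f'<img src="{IMAGE_URL}">'
    html = "<html><body>" + tag * uses + "</body></html>"
    archive = MIMEMultipart("related")
    archive.attach(MIMEText(html, "html"))
    image = MIMEImage(IMAGE_BYTES, "png")
    # The HTML part refers to the image by its original URL.
    image["Content-Location"] = IMAGE_URL
    archive.attach(image)
    return archive.as_string()
```

Comparing `len(page_with_data_urls(n))` for growing `n` shows the encoded image being repeated for every use, while the output of `page_as_mhtml(n)` only grows by the short `<img>` tags.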
If you save this page and open it again, what will you see? Right, two images — one that already was on the web page when it was saved and a second that was created by this code when the saved page was opened. You can get even more images by saving again.
Saving more than necessary
There have been complaints that even though Adblock Plus blocks the ads, the saved page will still have them. The problem is that web page saving doesn’t respect content policies and will download files even if they are blocked. That is especially concerning for web bugs that have been blocked because of privacy concerns. Previously I was thinking that this is the way it should be; after all, “HTML, complete” mode is supposed to create a copy of the original web page. But now I am leaning towards filing a bug on this issue.
Solution: only download files that haven’t been blocked. For a change, the implementation here shouldn’t be difficult: images and objects already implement the imageBlockingStatus property that indicates whether the image has been blocked by a content policy.
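The idea can be sketched in a few lines of Python. This is only an illustration of the filtering step, not Mozilla code: the `Resource` class and its `blocked` flag are hypothetical stand-ins for whatever the saving code would learn from checking imageBlockingStatus on each element.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A file referenced by the page (hypothetical structure)."""
    url: str
    blocked: bool  # assumption: derived from the element's blocking status


def resources_to_save(resources: list[Resource]) -> list[str]:
    """Return only the URLs that no content policy has blocked."""
    return [r.url for r in resources if not r.blocked]
```

With this in place, a blocked web bug would simply never make it into the list of files to download.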
PS: If everything goes well this post should appear on Planet Mozilla. Yay, that’s exciting! :)
That’s why we need “save as PDF”. The HTML thing can safely be removed then — it’s irritating to users and useless for Web developers.
Do you know of any concrete plans? I remember people saying that producing PDFs will be possible with Cairo but I am not sure how serious this is.
As to Adblock Plus – it doesn’t change the DOM. So if you save the DOM it will still contain the images that have been blocked.
Unless I specifically need the saved HTML version, I don’t generally use that feature – mostly for the second reason. If I order something online and want a copy of the order acknowledgement, I print the page to file (currently Postscript, but PDF would be nice) so I’ve got an exact copy.
For “Save as PDF” the back end is almost done:
No progress on the front end:
Yes, there are certainly plans to fix printing using Cairo’s PDF capabilities. Judging by bug 162659 there are no plans to allow saving PDFs however (note that saving a page is in many respects not the same as printing it – e.g. you want to preserve backgrounds and disable printing-specific scaling/page breaks).
Saving as PDF is listed on the Firefox 3 feature list as priority 2.
Linked from here:
Ah, great! Thanks.
Yes, it indeed did show up on planet.m.org — nice!
Why not use a JAR archive? FF3 will support JAR URLs, or will it?
I thought about it. Support for JARs has been there at least since Netscape 6.0 so if they are trying to sell it to you as a new feature in FF3 – don’t believe them :). I see three problems with JARs:
You can solve the last two issues by saving the main HTML page outside the JAR – but then the JAR file doesn’t have any advantages compared to a directory for the same files.
Well, I’m aware of those issues. So, is there any standard for archiving web pages? If yes, it should be followed, otherwise FF should set one. As for JAR and its association with Java, I’d suggest using a different extension, e.g. war (I know it’s been used by Java web apps; this time it’d stand for web archive). It’s an archive file format after all. This way you get a single file in a standardized archive format, accessible by FF (other browsers would learn to use it too).
We really, really do need MHTML support – it’s a wonderful format, that lets you save a complete web page in a single file, and a great way to keep a complete, atomic copy of a web page. Everyone who has saved MHT files in IE before is now stuck with IE if they want to be able to see them in the future – we’re letting them maintain lockin by not implementing what is essentially an open format!
Saving as PDF is also something that’s awesome, and thankfully my grumbling about it made them reopen the bug for it :)
“Even if he knows it, it still means some effort locating the directory which is annoying.”
Huh? The directory is stored in the same directory as the HTML file, so why would it be an effort to locate it?
Because in every file browser I know directories are displayed separately from files so that you have to scroll up and find it by the name. At least if your directory has more than five files which it usually does ;)
You don’t see the difference between pre and post processing: by the time Adblock Plus begins to do its job, the undesirable object is probably already in the browser’s cache. “Save As” probably just functions to pull an exact copy of what’s already in the browser’s cache out into the user’s desired directory. So you get all the web bugs, etc… and depending on the relative/absolute filter paths, the Adblock Plus filter may or may not suppress correctly when opening from the local directories. In order to get pre-processing, where the bugs, etc. are not downloaded, the incoming data needs to be inspected before it reaches the browser.
Agnitum’s free Outpost firewall does this: one configurable filter intercepts the incoming data and if incoming network traffic matches a pattern (such as a 1×1 gif), it dies. Of course this is counterproductive on webpages that use some sort of script to make certain that everything gets downloaded, because it will try unsuccessfully forever (click, click, click…) to get its bugs/ads downloaded. Of course Outpost can identify that script and not load it either and then the webmaster makes the page unviewable until the script is loaded… so the user installs a commercial Anti-ad$ that replaces undesirable script with harmless script for a yearly$, but Anti-ad$ downloads update$, hogs system resources/CPU cycles and on and on ad infinitum.
Thanks for the junk advertising. You are wrong, Adblock Plus prevents things from being downloaded and put into the browser cache. And “Save Page As” does in fact download things it cannot find in the browser cache without consulting content policies – including images blocked by Adblock Plus. Which is a bug in the “Save Page As” feature and should be fixed.
Opera 9.10 now supports MHTML files, offering an alternative to Internet Exploder. I would love to see Firefox support MHTML as well.
Are there any extensions that let Firefox open MHTML files?
Google spits out this one: https://addons.mozilla.org/en-US/firefox/addon/212
Not sure whether this is worth anything, especially since it is abandoned. I think that for these things to work properly they should be done in the core, not as an extension.
I realize this is old; however your blog is always a treasure trove, Wladimir.
MHTML at Wikipedia
I think this is relevant to read and contemplate before moving forward with design and architecture.
Quote from article: Konqueror ..does include a feature for saving web pages as single files (“web archives”, file extension .war) that are actually gzipped tarballs..
TGZ is a suboptimal format for rapid access, if you ask me.
Apple went proprietary with Safari..Thanks Apple. :D
I don’t use IE anymore — mostly because of AB+ — however the MHT concept is great. I have also used them on my phone as a reference.
PDF would work, however last time I looked it was rather static so I think keeping HTML is better.
MHTML should also be compressed which is better than multiple files with the current FF implementation.
Ideally the format would be standardized by the w3c (how long would THAT take??) if RFC 2557 MHTML is insufficient.
FYI for those interested: there is an unpacker for MHTML from the Microsoft Office area, which adds options in the context menu to compile/decompile saved files. However, a quick search didn’t find the addon.
p.s. Wladimir you are an inspiration and a role model, good sir! Thank you for ALL of your amazing work!
Mozilla could easily use the jar: protocol to save an entire webpage in a single file, but will yet another proprietary format really make the world better? As you already mention, there are currently three different and incompatible solutions implemented by different browsers. So I still think that supporting MHTML would offer the most value to the user because it would make 90% of the browser market interoperable. But when I googled I found a big “oops”: http://www.patentstorm.us/patents/6886132/description.html. Guess supporting MHTML is out of the question then.
In http://www.informationweek.com/news/global-cio/showArticle.jhtml?articleID=162100345 it is clearer what this patent is about – guess supporting MHTML in Firefox is still an option, if somebody actually takes the time.