I read a forum question from an Opera user who was upset because Opera 9.10 now saves web pages “like IE and Firefox”, that is, together with all the files they include. His problem was easily solved with a configuration change, but it got me thinking. Generally this doesn’t seem to be such a bad idea: it lets you open a saved web page and have it look exactly the same. So I tried to understand why this user was so upset and why I almost never use this feature myself. It seems there are three reasons.
Saving doesn’t just create the file the user asked for but also a directory for the auxiliary files. It isn’t obvious to the user that this will happen, or that he has to remove the directory as well when he removes the saved page. Even if he knows, locating the directory still takes some effort, which is annoying. Yes, if you happen to use Windows Explorer it will remove this directory automatically, but that is a hack and again a very non-obvious action.
Solution: save everything into one file. I first thought of using data: URLs to embed all data inside the same HTML file. This would have the advantage of sticking to the HTML format, and nothing other than the web page saving code would need to change. However, I noticed disadvantages as well. The file wouldn’t be usable in Internet Explorer (it still doesn’t support data: URLs). Most importantly, if the same image is used multiple times on the page it has to be stored multiple times, because HTML has no way to say “this image uses the same URL as image XYZ”. That last one is a showstopper, so supporting Microsoft’s MHTML format is probably still the more realistic alternative even though it means much more effort.
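To make the duplication problem concrete, here is a minimal sketch in Python, not browser code. The helper name `to_data_url` and the placeholder image bytes are my own assumptions for illustration; the only real API used is the standard `base64` module.

```python
import base64


def to_data_url(payload: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URL (hypothetical helper)."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes standing in for an image file; not a valid PNG.
logo = b"\x89PNG\r\n\x1a\n" + bytes(1000)
url = to_data_url(logo, "image/png")

# The same image used twice on a page: the whole payload is embedded twice,
# because the second <img> cannot point back at the first one.
page = f'<img src="{url}"><img src="{url}">'
```

Base64 also inflates the data by roughly a third, so every repeated image costs more than its original size each time it appears.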
If you save this page and open it again, what will you see? Right, two images — one that already was on the web page when it was saved and a second that was created by this code when the saved page was opened. You can get even more images by saving again.
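The effect can be modeled with a toy example. This is a rough Python simulation rather than real page script, and it assumes that saving writes out the current state of the document, generated elements included. The function names and the file names `add-image.js` and `generated.png` are made up for illustration.

```python
from typing import List


def run_script(dom: List[str]) -> None:
    """Stand-in for a page script that adds one image on every load."""
    dom.append('<img src="generated.png">')


def open_page(saved_source: List[str]) -> List[str]:
    """Build the document from the saved source, then run its script."""
    dom = list(saved_source)
    run_script(dom)
    return dom


def save_page(dom: List[str]) -> List[str]:
    """Save the current document state, including generated elements."""
    return list(dom)


def count_images(dom: List[str]) -> int:
    return sum(1 for element in dom if element.startswith("<img"))


original = ['<script src="add-image.js"></script>']
first_visit = open_page(original)               # one generated image
reopened = open_page(save_page(first_visit))    # saved image plus a new one
resaved = open_page(save_page(reopened))        # and one more per round
```

Each save and reopen round adds another image, because the saved file contains both the script and the output of its previous run.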
Saving more than necessary
There have been complaints that even though Adblock Plus blocks the ads, the saved page will still have them. The problem is that web page saving doesn’t respect content policies and will download files even if they are blocked. That is especially concerning for web bugs that have been blocked because of privacy concerns. Previously I thought this is the way it should be; after all, “HTML, complete” mode is supposed to create a copy of the original web page. But now I am leaning towards filing a bug on this issue.
Solution: only download files that haven’t been blocked. For a change, the implementation here shouldn’t be difficult: images and objects already implement the imageBlockingStatus property, which indicates whether the image has been blocked by a content policy.
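The filtering step could look roughly like the sketch below. It is written in Python as a model of the logic, not as Mozilla code. The `Resource` class, the `resources_to_save` function and the example URLs are hypothetical, and the numeric constants are assumed to mirror the accept and reject values of the content policy interface.

```python
from dataclasses import dataclass
from typing import List

# Assumed to mirror the content policy constants; the values are illustrative.
ACCEPT = 1
REJECT_REQUEST = -1


@dataclass
class Resource:
    url: str
    blocking_status: int  # stand-in for the imageBlockingStatus property


def resources_to_save(resources: List[Resource]) -> List[Resource]:
    """Keep only resources that no content policy has blocked."""
    return [r for r in resources if r.blocking_status == ACCEPT]


page_resources = [
    Resource("http://example.com/photo.png", ACCEPT),
    Resource("http://ads.example.com/banner.gif", REJECT_REQUEST),
    Resource("http://tracker.example.com/bug.gif", REJECT_REQUEST),
]
saved_urls = [r.url for r in resources_to_save(page_resources)]
```

With a check like this the blocked banner and the web bug are never requested during saving, so the saved copy matches what the user actually saw.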
PS: If everything goes well this post should appear on Planet Mozilla. Yay, that’s exciting! :)