How to convert dirty HTML page to...

User 2496587 Photo


Registered User
6 posts

Hello everyone,

Can you help me with the following problem.

I have to convert a PDF to XHTML Strict. To do so, I import the PDF into Serif PagePlus X6, a DTP application that can publish to HTML.

Importing the PDF into PagePlus is no problem, and the layout is the same as in the PDF.

But we also want to show that PDF publication on a website where the background and text styles are very different. The PDF background has to be white, while the website background has to be a shade of yellow. The PDF text is black, while the website text has to be the color "Burgundy", a dark red.

So I let PagePlus generate the HTML page. The problem is that every line is a "div", so there are several hundred divs. That makes cleaning up the PagePlus-generated page into our XHTML Strict version almost impossible.

Here is a sample of just 5 of the many lines:

### Start Example ###

<div class="Wp-Normal-P">
<span class="Normal-C-C4">Dinsdag 9 en 16</span><span class="Normal-C-C5"> </span><span class="Normal-C-C4">oktober -<wbr> informatieavonden rijbewijs via<br></span></div>
<div class="Normal-P">
<span class="Normal-C-C4">vrije begeleiding -<wbr> 19u30<br></span></div>
<div class="Normal-P-P1">
<span class="Normal-C-C6">V</span><span class="Normal-C-C7">oor de l</span><span class="Normal-C-C8">e</span><span class="Normal-C-C7">erl</span><span class="Normal-C-C8">i</span><span class="Normal-C-C7">ngen van </span><span class="Normal-C-C8">h</span><span class="Normal-C-C7">et z</span><span class="Normal-C-C8">e</span><span class="Normal-C-C7">sde j</span><span class="Normal-C-C8">a</span><span class="Normal-C-C7">ar is de voor</span><span class="Normal-C-C8">b</span><span class="Normal-C-C7">erei</span><span class="Normal-C-C8">d</span><span class="Normal-C-C7">ing op het th</span><span class="Normal-C-C8">e</span><span class="Normal-C-C7">oretisch ex</span><span class="Normal-C-C8">a</span><span class="Normal-C-C7">men op</span><span class="Normal-C-C8">g</span><span class="Normal-C-C7">en</span><span class="Normal-C-C8">o</span><span class="Normal-C-C7">men in het g</span><span class="Normal-C-C8">e</span><span class="Normal-C-C7">wo</span><span class="Normal-C-C8">n</span><span class="Normal-C-C7">e<br></span></div>
<div class="Normal-P-P2">
<span class="Normal-C-C7">lesse</span><span class="Normal-C-C8">n</span><span class="Normal-C-C7">p</span><span class="Normal-C-C8">a</span><span class="Normal-C-C7">kket van h</span><span class="Normal-C-C8">e</span><span class="Normal-C-C7">t laats</span><span class="Normal-C-C8">t</span><span class="Normal-C-C7">e jaar secun</span><span class="Normal-C-C8">d</span><span class="Normal-C-C7">ai</span><span class="Normal-C-C9">r</span><span class="Normal-C-C7">.<br></span></div>
<div class="Normal-P-P2">
<span class="Normal-C-C6">V</span><span class="Normal-C-C7">oor di</span><span class="Normal-C-C8">e</span><span class="Normal-C-C7">ge</span><span class="Normal-C-C8">ne</span><span class="Normal-C-C7">n die nad</span><span class="Normal-C-C8">i</span><span class="Normal-C-C7">en opter</span><span class="Normal-C-C8">e</span><span class="Normal-C-C7">n voor</span><span class="Normal-C-C8"> </span><span class="Normal-C-C7">het syste</span><span class="Normal-C-C8">e</span><span class="Normal-C-C7">m</span><span class="Normal-C-C10"> </span><span class="Normal-C-C7">van de vrije </span><span class="Normal-C-C8">b</span><span class="Normal-C-C7">egele</span><span class="Normal-C-C8">i</span><span class="Normal-C-C7">ding w</span><span class="Normal-C-C8">o</span><span class="Normal-C-C7">rden </span><span class="Normal-C-C8">d</span><span class="Normal-C-C7">eze 2 </span><span class="Normal-C-C8">a</span><span class="Normal-C-C7">von</span><span class="Normal-C-C8">d</span><span class="Normal-C-C7">en </span><span class="Normal-C-C8">g</span><span class="Normal-C-C7">eor</span><span class="Normal-C-C8">g</span><span class="Normal-C-C7">ani-<wbr><br></span></div>

### End Example ###

I do not have to tell you that this is a complete mess, and there are hundreds of such lines.

Can you tell me how I can clean up this mess with the CoffeeCup HTML Editor (CHE)? Any time-saving suggestion is welcome. My own idea is to create a template in CHE that links to the external CSS stylesheet, then copy the text manually from the PDF, paragraph by paragraph, paste it between the body tags, and add the markup.

Thank you very much for your advice and the time spent on my problem; it is really appreciated.

Wish you all a very nice day.

Friendly greetings,

Bad_Wolf

Edit : I want to add that CoffeeCup HTML-editor is a great application and I like it very much despite the fact I am a very new user. Great tool for manual coding.
User 38401 Photo


Senior Advisor
10,951 posts

Hiya Bad Wolf,

Firstly, how many pages are there in this PDF, and does it have a lot of images etc. or is it basically all text? That would determine the best way around this. If it's only a few pages, you may find it much easier to just set up your CSS file for the site, create the new page, structure the HTML how you want it, and copy the text from the PDF into <p></p> tag pairs.

EDIT: There used to be a tool in the HTML Editor for stripping code completely, but I don't see it anywhere now so I just changed this paragraph in case you read it already.

There may be a much better way to export a PDF to HTML though, so maybe try searching for some other tools for that as well.
User 122279 Photo


Senior Advisor
14,624 posts

That was indeed some 'dirty html' (*shudder*!)
I would recommend Jo Ann's approach if the layout is simple. But since you say you are very new to this, maybe Visual Site Designer (VSD for short) would be a better tool, especially if the layout has a lot of images. You wouldn't have to mess with the dirty code: just copy the text and images, drop them on a VSD page, and arrange them the way you want. With images, you probably need to copy them from the PDF file, paste them into an image editing programme, save them with a file name, and then insert them into the web page. (You would have to do the same with the images if you use the HTML Editor.)

You can download trial versions of both programmes, so try them out before you decide on one of them.
Have a really good day!
Inger, Norway

My work in progress:
Components for Site Designer and the HTML Editor: https://mock-up.coffeecup.com


User 2496587 Photo


Registered User
6 posts

Hi Jo Ann,

Thank you very much for your fast reply and your interesting information.

The PDF consists of 4 pages. When PagePlus converts the PDF, I get 4 separate (dirty) HTML pages.

There are indeed also images within the text. Even single words are split by divs. To show how bad it is, here is an imaginary word:

C</div><div>offeeCup

and I left the attributes out.

I tried your suggestion about the "Code Stripper" (I think you mean the "Code cleaner") and got rid of most of the unnecessary divs PagePlus used. So the code looks a bit better now, but it is still a mess. I am impressed with the results of the "Code cleaner"; it is a very powerful tool.

I will follow your suggestion and just create an XHTML Strict file with the HTML Editor, paste the text from the PDF directly into the XHTML, and then add the markup. Because I use an external CSS file for the website, I only need to place the correct styling tags in the text.

I tried a lot of applications, but they all produce the same dirty code. I do not want to be negative about visual tools for website creation, but I see a lot of dirty code coming from them. I do not know why it is so difficult for a visual tool to distinguish between a paragraph and a line.

I appreciate your help very much. Wish you a very nice day.

Friendly greetings,

Bad_Wolf
User 38401 Photo


Senior Advisor
10,951 posts

No, the Code Cleaner is a different tool. There used to be a tool that could strip the code completely (I may have this confused with Notetab Plus or Notepad++; I'll check Notepad++ shortly and see). All it leaves is the text that is inside the code, so it usually worked pretty well, but I'll check the other apps I have and see which one it is for sure.
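For anyone who wants to script that kind of "strip everything, keep the text" step, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not the tool mentioned above. The names `DivTextExtractor` and `div_lines` are made up for illustration, and it assumes (as in the PagePlus sample earlier in this thread) that every text line sits inside its own `<div>`.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the visible text of each top-level <div>, ignoring <span>, <wbr> and <br>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []    # one cleaned text line per finished <div>
        self._parts = []   # text fragments of the <div> currently being read
        self._depth = 0    # nesting depth of open <div> elements

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Glue the fragments back together and collapse whitespace.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.lines.append(text)
                self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a <div>.
        if self._depth > 0:
            self._parts.append(data)


def div_lines(html_text):
    """Return the plain text of every top-level <div>, in document order."""
    parser = DivTextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.lines
```

Calling `div_lines()` on the contents of a generated page gives a list of plain text lines with the letter-by-letter spans merged back into words. Joining those lines into real paragraphs still has to be done by hand or by a rule that fits the document.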
User 2496587 Photo


Registered User
6 posts

Hello Inger,

Thank you also very much for your reply and suggestions.

I am new to the CoffeeCup HTML Editor, but I have been coding websites manually in XHTML since 2001.

As a former typesetter (Agfa MCS100), I am much better at coding a website with manual tools than with visual tools. I am just puzzled by the amount of dirty code produced by PagePlus.

If, as Jo Ann suggested, I create a template holding the DTD, the head and the start of the body section, plus the closing body and html tags at the end, it will be faster for me to just copy and paste the text from the PDF into the HTML Editor and then add the styles. I estimate those 4 pages will take me less than 45 minutes, which is acceptable for me.
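A template like that could look roughly as follows. This is only a sketch in Python with the standard library's `string.Template`; the stylesheet name `site.css`, the default language `nl` and the function name `build_page` are assumptions for illustration, not anything taken from the real site.

```python
from html import escape
from string import Template

# XHTML 1.0 Strict skeleton: DTD, head with a link to the external stylesheet,
# opening body, and the closing body/html tags.
XHTML_STRICT_PAGE = Template("""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="$lang" lang="$lang">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>$title</title>
<link rel="stylesheet" type="text/css" href="$stylesheet" />
</head>
<body>
$body
</body>
</html>
""")


def build_page(title, body, stylesheet="site.css", lang="nl"):
    """Fill the skeleton; `body` must already be valid XHTML markup."""
    return XHTML_STRICT_PAGE.substitute(
        title=escape(title),                        # escape &, < and > in the title
        body=body,                                  # inserted as-is
        stylesheet=escape(stylesheet, quote=True),  # safe inside an attribute
        lang=escape(lang, quote=True),
    )
```

For example, `build_page("Nieuws", "<p>...</p>")` returns the complete page as one string, ready to be saved as an .html file and checked with a validator.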

Thank you very much for your input and suggestions.

Friendly greetings,

Bad_Wolf
User 2496587 Photo


Registered User
6 posts

Hello Jo Ann,

Thank you again.

I just searched again for the "Code Stripper" but I could not find it. I think it is in one of the other applications you mentioned.

Wish you a very good night.

Friendly greetings,

Bad_Wolf
User 38401 Photo


Senior Advisor
10,951 posts

Ok, I just found it in Notetab, a text editor. You can get a 30-day trial here:
http://www.notetab.com/

Not sure if it will keep your text formatting or not, but try it and see what it does. Make sure to back up all your files before you load them in there so you don't lose them :P It worked pretty slick on a menu page I just tried it on: it left the text, indents etc. intact, so the result should be readable and easy to copy from one place to the next if it does the same for your PDF pages. Let us know how that works out.
User 2496587 Photo


Registered User
6 posts

Hello Jo Ann,

Thank you very much for your kind advice and your help, which I appreciate very much.

The HTML Editor did a very good job last night by removing all the divs that break up individual lines, so only the main divs were left.

However, I am an application developer and am now looking for a way to drag and drop the text from the PDF into my own application and then add the markup to that text, also by dragging and dropping, much like using the code tab in the HTML Editor.

For the PDF I have to convert, the first line of a paragraph is always a header (h5), so I can add those tags around the first sentence automatically. It is therefore perfectly possible to put every paragraph in one div block.

For the images in the text, I can place the cursor where the image has to go and open a dialog that lets me choose the URL.

If I succeed, I will share the application for free in return for the kind support and help I received from all of you. Without exception, you were all great!!!

Wish you all the best and I will update you on my progress.
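To make that rule concrete, here is a rough sketch in Python (standard library only). It assumes the pasted text has a blank line between paragraphs and treats the first line of each one as the h5 header. The function name `paragraphs_to_xhtml` and the class name `item` are placeholders, not part of any existing application.

```python
import re
from html import escape


def paragraphs_to_xhtml(text, css_class="item"):
    """Wrap each blank-line-separated paragraph in one <div>: first line as <h5>, rest as <p>."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        # Drop empty lines and surrounding whitespace inside the paragraph.
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        if not lines:
            continue
        heading = lines[0]
        body = " ".join(lines[1:])  # re-join the lines broken by the PDF layout
        parts = ['<div class="%s">' % escape(css_class, quote=True),
                 "<h5>%s</h5>" % escape(heading)]
        if body:
            parts.append("<p>%s</p>" % escape(body))
        parts.append("</div>")
        blocks.append("\n".join(parts))
    return "\n".join(blocks)
```

The text is escaped with `html.escape`, so characters such as `&` copied from the PDF do not break XHTML validation. Images are left out here; they would still need to be placed by hand.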

Have a very nice day.

Friendly greetings,

Bad_Wolf
User 38401 Photo


Senior Advisor
10,951 posts

Sounds like you have it pretty much under control, Bad Wolf. As for the drag and drop, copy and paste works just as well, since you really only want the text itself anyway so you can style it how you want on the page. Copy the entire page of text and style around it if need be; hopefully it shouldn't be too daunting for a handful of pages. At least it isn't over 100 pages to do, right :P
