XML Workflow for Publishers, Part - 3

03, 2000
By Dr. Brijesh Kumar,  Digital Media Initiatives

Publishers are confronted with three situations:

(a) backlist titles that exist mostly in print, with no digital copies;

(b) backlist titles that are available as PDFs; and

(c) frontlist titles that are currently in production.

There are different approaches to moving backlist and frontlist titles to an XML-based workflow. We shall not deal with how content is extracted and converted into a processable document; instead, we shall focus on the features of a document that a publisher must generate in the system in order to benefit from XML from then on. One of the simplest ways to structure published content is XHTML (eXtensible HyperText Markup Language) [http://www.w3.org/TR/xhtml1/], which is a reformulation of HTML 4 in XML: a markup language that has the same depth of expression as HTML, but also conforms to XML syntax. [1] We presume that a publisher will use in-house resources to convert print titles to image PDFs, save the PDFs as HTML, and then tidy the HTML so that it can be processed through the XML workflow.

CONVERT FROM WORD TO XHTML

We shall now study the entire workflow of converting a manuscript available as an MS Word file into a processable XHTML file:

  • Open the Word file and Save As "Web Page, Filtered";
  • Open the HTML file in an HTML Editor (such as FrontPage);
  • Remove Tags and Attributes as per the following directives to clean the mark-up:
    • meta: remove tag
    • style: remove tag and content
    • div: remove tag
    • body: remove lang and style attributes
    • p: remove class and style attributes
    • span: remove tag
    • i: convert to em
    • b: convert to strong
    • font: remove tag (keep the enclosed text)
    • h1 to h6: remove style attribute
    • align=center: convert to class="center"
    • align=right: convert to class="right"
    • center: convert to <br /><p class="center">
    • <p>&nbsp;</p>: remove
  • Convert entities such as quotes, ndash, mdash, etc., and foreign-character entities to numeric character references. Please use the Unicode code point value for every such entity; the correct value can be checked in the list at [2].
  • For ©, for example [code point from the list = U+00A9 (169)], you may use either:
    • &#x00A9; or
    • &#169;
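
The clean-up directives and the entity conversion above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production tool: the names WordHtmlCleaner and clean_word_html are hypothetical, comments and the doctype are silently dropped, and the sketch keeps the text enclosed by font, span and div while dropping only the tags, on the assumption that no text should be lost. Non-ASCII characters are written as decimal references such as &#169;.

```python
import re
from html import escape
from html.parser import HTMLParser

UNWRAP = {"meta", "div", "span", "font"}   # tag removed, enclosed text kept (assumption)
DROP_WITH_CONTENT = {"style"}              # tag and content removed
RENAME = {"i": "em", "b": "strong", "center": "p"}
STRIP_ATTRS = {"body": {"lang", "style"}, "p": {"class", "style"}}
STRIP_ATTRS.update({f"h{n}": {"style"} for n in range(1, 7)})
VOID = {"br", "hr", "img", "link"}         # written in XML style as <tag />


def to_ascii(text, quote=False):
    """Escape markup characters and write non-ASCII as numeric references."""
    return escape(text, quote=quote).encode("ascii", "xmlcharrefreplace").decode("ascii")


class WordHtmlCleaner(HTMLParser):
    """Hypothetical helper that applies the clean-up directives listed above."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &copy; arrives as the character itself
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in UNWRAP:
            return
        kept = []
        for name, value in attrs:
            if name in STRIP_ATTRS.get(tag, ()):
                continue
            if name == "align" and (value or "").lower() in ("center", "right"):
                name, value = "class", value.lower()
            kept.append((name, value))
        if tag == "center":
            self.out.append("<br />")
            kept = [("class", "center")]
        tag = RENAME.get(tag, tag)
        attr_text = "".join(f' {n}="{to_ascii(v or "", quote=True)}"' for n, v in kept)
        self.out.append(f"<{tag}{attr_text} />" if tag in VOID else f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in UNWRAP or tag in VOID:
            return
        self.out.append(f"</{RENAME.get(tag, tag)}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(to_ascii(data))


def clean_word_html(source):
    cleaner = WordHtmlCleaner()
    cleaner.feed(source)
    cleaner.close()
    # Drop paragraphs that hold only a non-breaking space or whitespace.
    return re.sub(r"<p>(?:&#160;|\s)*</p>", "", "".join(cleaner.out))


if __name__ == "__main__":
    sample = ('<p class=MsoNormal style="margin:0"><span lang=EN-US><b>Bold</b> and '
              '<i>italic</i> &copy; 2010</span></p><p>&nbsp;</p>')
    print(clean_word_html(sample))
    # prints: <p><strong>Bold</strong> and <em>italic</em> &#169; 2010</p>
```

A real manuscript should still be reviewed by eye afterwards; the sketch does not repair nesting problems in the source HTML.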

  • After the above clean-up, the clean HTML contains only the following tags:
    • <html>
    • <head>
    • <title>
    • <link>
    • <body>
    • <h1> to <h6>
    • <p>
    • <em>
    • <strong>
    • <big>
    • <small>
    • <sup>
    • <sub>
    • <a>
    • <img/>
    • <div>
    • <mbp:pagebreak />
    • <br />
    • <blockquote>
    • <table>
    • <tr>
    • <td>
    • <ul>
    • <li>
    • <hr />
  • Attributes acceptable in the clean HTML:
    • href
    • id
    • class
    • alt
    • src
    • width
    • height
    • rel
    • type
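
The two lists above lend themselves to an automated audit. The following sketch, again assuming Python and its standard html.parser module, reports every tag or attribute in a file that is not on the lists; the class name WhitelistAudit and the function audit are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {
    "html", "head", "title", "link", "body", "h1", "h2", "h3", "h4", "h5", "h6",
    "p", "em", "strong", "big", "small", "sup", "sub", "a", "img", "div",
    "mbp:pagebreak", "br", "blockquote", "table", "tr", "td", "ul", "li", "hr",
}
ALLOWED_ATTRS = {"href", "id", "class", "alt", "src", "width", "height", "rel", "type"}


class WhitelistAudit(HTMLParser):
    """Hypothetical helper: collects tags and attributes outside the lists."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        line, _column = self.getpos()  # line number is 1-based
        if tag not in ALLOWED_TAGS:
            self.problems.append(f"line {line}: tag <{tag}> is not allowed")
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS:
                self.problems.append(f"line {line}: attribute {name} on <{tag}> is not allowed")


def audit(source):
    """Return a list of human-readable problems; an empty list means clean."""
    checker = WhitelistAudit()
    checker.feed(source)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    for problem in audit('<p style="margin:0">Hi <span>there</span></p>'):
        print(problem)
```

An empty result only says that the markup stays within the lists; it says nothing about whether the markup is well-formed.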

Once the above exercise is done, a clean HTML markup is obtained. This markup may then be checked for well-formedness in order to convert the HTML into XHTML, which is characterised by the following properties [3]:

  • XHTML stands for EXtensible HyperText Markup Language
  • XHTML is intended to replace HTML
  • XHTML is almost identical to HTML 4.01
  • XHTML is a stricter and cleaner version of HTML
  • XHTML is HTML defined as an XML application
  • XHTML is a W3C Recommendation
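
Well-formedness can be tested with any XML parser. As a minimal sketch, assuming Python's standard xml.etree.ElementTree module (the function name check_well_formed is hypothetical), the check below reports the first error the parser finds. Note that a plain XML parser does not know HTML named entities such as &nbsp;, which is one reason for converting them to numeric references beforehand; a prefixed tag such as mbp:pagebreak is likewise expected to fail unless the prefix is declared as a namespace.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xhtml):
    """Return (True, message) if the text parses as XML, else (False, error)."""
    try:
        ET.fromstring(xhtml)
    except ET.ParseError as err:
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"
    return True, "well-formed"


if __name__ == "__main__":
    samples = [
        "<html><body><p>Hi<br /></p></body></html>",  # every tag is closed
        "<p>Hi<br></p>",                              # <br> is never closed
        "<p>&nbsp;</p>",                              # named entity unknown to XML
    ]
    for text in samples:
        print(check_well_formed(text))
```

This checks well-formedness only; validating against the XHTML DTD is a separate step that this sketch does not attempt.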

Once we have obtained a clean, stricter version of the markup as XHTML, we are ready to progress to the XML pipeline, which we shall discuss in subsequent posts.

---------

[1] http://en.wikipedia.org/wiki/XHTML

[2] http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_re...

[3] http://www.w3schools.com/xhtml/xhtml_intro.asp

Copyright © 2010, ePubNow! All rights reserved.