
XML DOM
- API: called XML DOM
- XML: was the standard markup language for exchanging information between computers over a network. It's not as dominant anymore because of JSON.
- What is XML
The Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable
	- XML is self describing
	- XML is designed to store and transport data
	- XML separates data from presentation
	- XML tags are not predefined
	- XML is platform independent

XML Terminology
	- tags have names: sample, color
	- Start tags: <sample>, <color>
	- End tags: </sample>, </color>
	- An XML element starts and ends with matching tag names:
	<color>Purple</color>
	- Tag nesting constructs a hierarchy (a tree data structure)
	- Elements can have attributes: <sample id="5309">

	Figure 1. Example code, Most Popular Color Survey:
	<sample id="5309">
		<male>
			<name>John</name>
			<color>Yellow</color>
		</male>
		<male>
			<name>Micheal</name>
			<color>Purple</color>
		</male>
	</sample>
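
To make the terminology concrete, here is a minimal sketch that loads the Figure 1 document with Python's xml.dom.minidom and reads back the attribute and the element text. The SURVEY_XML string, the text_of helper, and the answers list are names made up for this sketch, not part of the original notes.

```python
import xml.dom.minidom

# The survey document from Figure 1, inlined as a string for this sketch.
SURVEY_XML = """<sample id="5309">
    <male>
        <name>John</name>
        <color>Yellow</color>
    </male>
    <male>
        <name>Micheal</name>
        <color>Purple</color>
    </male>
</sample>"""

document = xml.dom.minidom.parseString(SURVEY_XML)
root = document.documentElement  # the <sample> element


def text_of(element):
    """Concatenate the text nodes directly under an element (hypothetical helper)."""
    return "".join(
        child.nodeValue
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


# Collect one (name, color) pair per <male> element.
answers = []
for male in root.getElementsByTagName("male"):
    name = text_of(male.getElementsByTagName("name")[0])
    color = text_of(male.getElementsByTagName("color")[0])
    answers.append((name, color))

print(root.tagName, root.getAttribute("id"))
print(answers)
```

The nesting of <name> and <color> inside <male> inside <sample> is the tree that the later DOM sections navigate.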
Well-Formed XML Documents
- At its base level, a well-formed document requires that:
	1. Content be defined.
	2. Content be delimited with a beginning and end tag
	3. Content be properly nested (parents within roots, children within parents)

- A well-formed document must also follow the rules for declaring and handling entities. Tags are case sensitive, and attribute values are delimited with quotation marks. Empty elements have their own syntax rules. Overlapping tags make a document ill-formed. Ideally, a well-formed document conforms to the design goals of XML. Other key syntax rules provided in the specification include:

	1. It contains only properly encoded legal Unicode characters.

	2. None of the special syntax characters such as < and & appear except when performing their markup-delineation roles.

	3. The begin, end, and empty-element tags that delimit the elements are correctly nested, with none missing and none overlapping.

	4. The element tags are case-sensitive; the beginning and end tags must match exactly. Tag names cannot contain any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot start with -, ., or a numeric digit.

	5. There is a single "root" element that contains all the other elements.
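
One way to try these rules out is to hand small strings to an XML parser and see which ones it rejects. The sketch below assumes Python's xml.dom.minidom as the parser; is_well_formed and the samples dictionary are hypothetical names chosen for illustration.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if the XML parser accepts the text, False if it rejects it."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False


# Each sample exercises one of the syntax rules listed above.
samples = {
    "matching tags": "<color>Purple</color>",
    "overlapping tags": "<a><b></a></b>",
    "case mismatch": "<Color>Purple</color>",
    "bare ampersand": "<name>Tom & Jerry</name>",
    "escaped ampersand": "<name>Tom &amp; Jerry</name>",
    "two root elements": "<a/><b/>",
    "unquoted attribute": "<sample id=5309/>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

Only the "matching tags" and "escaped ampersand" samples should be accepted; every other sample breaks one of the rules.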


What is XHTML?
- XHTML is a standard for HTML that follows all the well-formedness rules of XML. The important differences from HTML are:
	1. XHTML elements must be properly nested
	2. XHTML elements must always be closed
	3. XHTML elements must be in lowercase
	4. XHTML documents must have one root element
	5. Attribute names must be in lowercase
	6. Attribute values must be quoted
- Since XHTML documents are well formed, they can be parsed using standard XML parsers, unlike HTML, which requires a lenient HTML-specific parser.
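
The parser difference can be seen in a short sketch. It assumes Python's html.parser as a stand-in for a lenient HTML parser and xml.dom.minidom as the standard XML parser; the two fragments and the TagCollector class are invented for this illustration.

```python
import xml.dom.minidom
from html.parser import HTMLParser
from xml.parsers.expat import ExpatError

# Typical HTML: unclosed <p> and <br>, unquoted attribute value.
HTML_FRAGMENT = "<body><p class=note>First line<br>Second line</body>"

# The same content rewritten to follow the XHTML rules.
XHTML_FRAGMENT = '<body><p class="note">First line<br/>Second line</p></body>'


class TagCollector(HTMLParser):
    """Lenient HTML parsing: record every start tag that is seen."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def parse_as_xml(text):
    """Return a DOM document, or None if the text is not well-formed XML."""
    try:
        return xml.dom.minidom.parseString(text)
    except ExpatError:
        return None


collector = TagCollector()
collector.feed(HTML_FRAGMENT)
collector.close()

html_as_xml = parse_as_xml(HTML_FRAGMENT)    # rejected by the XML parser
xhtml_as_xml = parse_as_xml(XHTML_FRAGMENT)  # accepted by the XML parser

print(collector.start_tags)
print(html_as_xml is None)
print(xhtml_as_xml.getElementsByTagName("p")[0].getAttribute("class"))
```

The HTML parser copes with the sloppy fragment, while the XML parser only accepts the XHTML version.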

Case Study - From HTML to XHTML:
	$ java -jar tagsoup-1.2.1.jar --files sample.html
	src: sample.html dst: sample.xhtml

	<html>
		<body>
			<h3>Sample</h3>
			<table id="5309">
				<tr><td>John</td><td>Yellow</td></tr>
				<tr><td>Micheal</td><td>Purple</td></tr>
			</table>
		</body>
	</html>
	id - attribute
	John, Micheal - text
	td - element
	tr - element

What is DOM?
- The Document Object Model (DOM) is a platform- and language-neutral application programming interface (API) for XML documents. It defines the logical structure of documents and the way a document is accessed and manipulated.
- With the Document Object Model, programmers can build documents, navigate their structure, and add, modify, or delete elements and content.
- DOM is a logical model that may be implemented in any convenient manner. Most implementations use a logical structure which is very much like a tree.
- The name "Document Object Model" was chosen because it is an "object model" in the traditional object-oriented design sense. The model encompasses not only the structure of a document but also the behavior of the document and the objects of which it is composed.
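
As a rough illustration of building, modifying, and deleting content through the DOM, the sketch below recreates the color survey from scratch. It assumes Python's xml.dom.minidom as the DOM implementation; add_answer is a hypothetical helper, and changing John's color to Green is made-up data for the example.

```python
import xml.dom.minidom

# Build: create an empty document whose root element is <sample>.
impl = xml.dom.minidom.getDOMImplementation()
document = impl.createDocument(None, "sample", None)
root = document.documentElement
root.setAttribute("id", "5309")


def add_answer(name, color):
    """Append a <male> element with <name> and <color> children (hypothetical helper)."""
    male = document.createElement("male")
    for tag, text in (("name", name), ("color", color)):
        child = document.createElement(tag)
        child.appendChild(document.createTextNode(text))
        male.appendChild(child)
    root.appendChild(male)
    return male


john = add_answer("John", "Yellow")
micheal = add_answer("Micheal", "Purple")

# Modify: replace the text inside John's <color> element.
john.getElementsByTagName("color")[0].firstChild.data = "Green"

# Delete: remove the second <male> element from the tree.
root.removeChild(micheal)

print(root.toxml())
```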

API: DOM
The Document Object Model is a platform- and language-neutral application programming interface for XML documents.
- Interface NodeList
- Interface Node
- Interface Element
- Interface Document

A NodeList represents a sequence of nodes:
	- length - the number of nodes in the list
	- item(i) - returns the node at index i
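
A small sketch of these two members, assuming xml.dom.minidom (whose NodeList is also a regular Python list). The two-color document is made up for the example.

```python
import xml.dom.minidom

document = xml.dom.minidom.parseString(
    "<sample><color>Yellow</color><color>Purple</color></sample>"
)

# getElementsByTagName returns a NodeList.
colors = document.getElementsByTagName("color")

# Walk the list with length and item(i).
values = []
for i in range(colors.length):
    values.append(colors.item(i).firstChild.nodeValue)
print(values)

# In minidom, item() returns None when the index is out of range.
print(colors.item(colors.length))
```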

Interface Node
- All of the components of an XML document are subclasses of Node.
	- DOMString nodeName
	- DOMString nodeValue
	- Node parentNode
	- boolean hasChildNodes()
	- NodeList childNodes
	- nodeType, one of:
		- Node.ELEMENT_NODE
		- Node.ATTRIBUTE_NODE
		- Node.TEXT_NODE
		- Node.DOCUMENT_NODE
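
The Node members can be seen together in a recursive walk over a small tree. This is a sketch assuming xml.dom.minidom; the describe function, the TYPE_NAMES table, and the sample document are hypothetical names for the illustration.

```python
import xml.dom.minidom
from xml.dom import Node

document = xml.dom.minidom.parseString(
    '<sample id="5309"><name>John</name><color>Yellow</color></sample>'
)

# Readable labels for the node types listed above.
TYPE_NAMES = {
    Node.ELEMENT_NODE: "ELEMENT_NODE",
    Node.ATTRIBUTE_NODE: "ATTRIBUTE_NODE",
    Node.TEXT_NODE: "TEXT_NODE",
    Node.DOCUMENT_NODE: "DOCUMENT_NODE",
}


def describe(node, depth=0, rows=None):
    """Collect (depth, type, nodeName, nodeValue) for a node and its descendants."""
    if rows is None:
        rows = []
    type_name = TYPE_NAMES.get(node.nodeType, "other")
    rows.append((depth, type_name, node.nodeName, node.nodeValue))
    if node.hasChildNodes():
        for child in node.childNodes:
            describe(child, depth + 1, rows)
    return rows


rows = describe(document)
for depth, type_name, name, value in rows:
    print("  " * depth + f"{type_name} {name} {value!r}")

# Attributes are not child nodes; they are reached through the element.
id_attr = document.documentElement.getAttributeNode("id")
print(TYPE_NAMES[id_attr.nodeType], id_attr.nodeName, id_attr.nodeValue)

# parentNode points back up the tree.
text_node = document.getElementsByTagName("name")[0].firstChild
print(text_node.parentNode.nodeName)
```

Note that the text "John" is a separate TEXT_NODE child of the <name> element, which is why the element's own nodeValue is None.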

Interface Element
- Element is a subclass of Node
	- NodeList getElementsByTagName(tagName)
	- boolean hasAttribute(attributeName)
	- DOMString getAttribute(attributeName)
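
A short sketch of the three Element methods, assuming xml.dom.minidom. The document string and the "year" attribute name are made up for the example.

```python
import xml.dom.minidom

document = xml.dom.minidom.parseString(
    '<sample id="5309">'
    "<male><name>John</name></male>"
    "<male><name>Micheal</name></male>"
    "</sample>"
)
sample = document.documentElement

# hasAttribute / getAttribute on an attribute that exists.
print(sample.hasAttribute("id"), sample.getAttribute("id"))

# For a missing attribute, minidom's getAttribute returns an empty string.
print(sample.hasAttribute("year"), repr(sample.getAttribute("year")))

# getElementsByTagName searches only the descendants of the element it is called on.
all_names = sample.getElementsByTagName("name")
first_male = sample.getElementsByTagName("male")[0]
names_in_first_male = first_male.getElementsByTagName("name")
print(all_names.length, names_in_first_male.length)
```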

Interface Document
- A Document represents an entire XML document, including its constituent elements, attributes, comments, etc.
	- Element documentElement
	- Element getElementById(elementId)
	- NodeList getElementsByTagName(tagName)
documentElement is the one and only root element of the document.
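
A sketch of the Document members, assuming xml.dom.minidom. One minidom-specific caveat shown here: getElementById only finds attributes that are typed as IDs, so without a DTD the sketch marks the attribute by hand with setIdAttribute. The inline XHTML string is a trimmed version of the case-study table.

```python
import xml.dom.minidom

document = xml.dom.minidom.parseString(
    '<html><body><table id="5309"><tr><td>John</td></tr></table></body></html>'
)

# documentElement is the single root element.
root = document.documentElement
print(root.tagName)

# getElementsByTagName on the Document searches the whole tree.
tables = document.getElementsByTagName("table")
print(tables.length)

# Without a DTD, minidom does not know that "id" is an ID attribute.
before = document.getElementById("5309")

# Mark the attribute as an ID (minidom-specific step), then look it up again.
tables[0].setIdAttribute("id")
after = document.getElementById("5309")

print(before, after.tagName)
```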

From XHTML to CSV: comma-separated values
	$ python xhtmlToCsv.py sample.xhtml
	John,Yellow
	Micheal,Purple
Python code:
	import sys
	import xml.dom.minidom

	# Parse the XHTML file named on the command line.
	document = xml.dom.minidom.parse(sys.argv[1])
	tableElements = document.getElementsByTagName('table')
	# Emit one CSV line per row of the first table.
	for tr in tableElements[0].getElementsByTagName('tr'):
		data = []
		for td in tr.getElementsByTagName('td'):
			# Keep only the text nodes directly inside each cell.
			for node in td.childNodes:
				if node.nodeType == node.TEXT_NODE:
					data.append(node.nodeValue)
		print(','.join(data))
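
A possible variation, not part of the original script: joining cells with ',' breaks if a cell itself contains a comma. The sketch below assumes Python's csv module for quoting and parses an inline string instead of a file so it runs on its own. The table_to_csv function and the "Purple, dark" cell are hypothetical, added only to show the quoting. Unlike the script above, it joins all text nodes of a cell into one field.

```python
import csv
import io
import xml.dom.minidom

# Inline stand-in for sample.xhtml; "Purple, dark" is made-up data with a comma.
XHTML = """<html><body><table id="5309">
<tr><td>John</td><td>Yellow</td></tr>
<tr><td>Micheal</td><td>Purple, dark</td></tr>
</table></body></html>"""


def table_to_csv(xhtml_text):
    """Convert the first <table> of an XHTML string to CSV text."""
    document = xml.dom.minidom.parseString(xhtml_text)
    table = document.getElementsByTagName("table")[0]
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for tr in table.getElementsByTagName("tr"):
        row = []
        for td in tr.getElementsByTagName("td"):
            # One field per cell: join the text nodes directly inside the <td>.
            row.append(
                "".join(
                    node.nodeValue
                    for node in td.childNodes
                    if node.nodeType == node.TEXT_NODE
                )
            )
        writer.writerow(row)
    return output.getvalue()


csv_text = table_to_csv(XHTML)
print(csv_text, end="")
```

The csv writer wraps the field containing a comma in double quotes, so a CSV reader still sees two columns per row.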