2013-09-25

[GNOME] Final Report for GXml in the 2013 Google Summer of Code

The Google Summer of Code has ended, and GXml is spoiled with the fruits of labour:


  • the autotools build system has improved

  • documentation is more complete and more accurate

  • many new examples across most classes, especially for C and JavaScript

  • many bugs were flushed out and fixed (e.g. attribute syncing between underlying libxml2 xmlNodes and GXmlElements)

  • it has a mailing list (gxml-list@gnome.org)

  • new stuff


    • document child management, node cloning


  • new memory tests

  • new error handling model



  • new memory handling model (fixing leaks and improving performance!)

  • improved API compliance

  • bug-fix release (0.3.2) without API breaks

  • imminent 0.4.0 with API breaks (pending some updated patches for XPath, Serialization, etc)


I've talked about those before (near the start and while at GUADEC) so for my report I'm going to focus on the outcome in terms of performance.



Look forward to 0.4.0 imminently, and happy hacking.




GXml's performance versus pure libxml2


One question people have had is the difference in performance between libxml2 and GXml, since GXml currently wraps it.  Things should be worse, as there's typically more code for each operation, but how large will the penalty be and will it matter for you?


Tests


I created a simple test suite with the four following tasks:


  1. loading a file from disk

  2. loading a file from memory

  3. stringifying a document

  4. saving a document to disk


The test suite is highly modular, and it's easy to add new tests.  For
each test, you define a setup function, a test function (the measured
test), and a cleanup function.  So if you'd like to see anything else in particular tested, let me know.
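As a rough illustration of that setup / test / cleanup structure, here is a minimal sketch in Python rather than in the suite's own language. The names `PerfTest` and `average_microseconds` are hypothetical and are not taken from GXml's test suite; only the measured function is timed.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PerfTest:
    """One benchmark: only `run` is timed; `setup` and `cleanup` are not."""
    name: str
    setup: Callable[[], Any]        # prepares state, e.g. reads a file into memory
    run: Callable[[Any], Any]       # the measured operation
    cleanup: Callable[[Any], None]  # releases whatever setup created


def average_microseconds(test: PerfTest, trials: int) -> float:
    """Run the test `trials` times and return the mean duration of `run` in microseconds."""
    total_ns = 0
    for _ in range(trials):
        state = test.setup()
        start = time.monotonic_ns()
        test.run(state)
        total_ns += time.monotonic_ns() - start
        test.cleanup(state)
    return total_ns / trials / 1000.0


if __name__ == "__main__":
    # Hypothetical workload, unrelated to XML, just to show the call shape.
    demo = PerfTest(
        name="sort 10000 integers",
        setup=lambda: list(range(10_000, 0, -1)),
        run=lambda data: data.sort(),
        cleanup=lambda data: data.clear(),
    )
    print(demo.name, average_microseconds(demo, trials=10), "us")
```

Adding a new test in a layout like this means supplying three small functions, which is the kind of modularity described above.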


Environment




I ran it on a Lenovo ThinkPad Twist S230u with the following configuration:


  • Intel® Core™ i5-3317U CPU @ 1.70GHz × 4 

  • 4GB RAM, SODIMM DDR3 Synchronous 1333 MHz (0,8 ns)

  • 500GB HD @ 5400 RPM (HGST HTS725050A7)


    • /home, including test files


  • 24GB SSD (Samsung MZMPA024)


    • everything outside of /home, including libraries 


  • Fedora 19, x86_64

  • libxml2-2.9.1-1.fc19

  • GXml from git HEAD





Test Data


The test data was based on my updateinfo.xml files from yum, in particular the one found at: /var/cache/yum/x86_64/19/updates/gen/updateinfo.xml.  It contained 98743 different nodes over 11,136kB.  I created smaller and larger versions of it, resulting in:

name          nodes    size (kB)
test3.xml    22 276        2 784
test4.xml    47 707        5 568
test5.xml    98 743       11 136
test6.xml   197 484       22 268
test7.xml   394 966       44 536

This testing could be improved by using different types of files with different types of data: flatter ones versus deeper ones, for instance.  The different sizes were produced by either duplicating the content within the root element or by deleting the second half of the nodes inside the root element.  test5.xml represents the original updateinfo.xml.
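The two resizing operations can be sketched roughly as follows, using Python's standard xml.etree.ElementTree rather than whatever tool was actually used for the test files (the post does not say). The helper names `halve`, `double`, and `count_nodes` are made up for this example, and `count_nodes` counts elements only, which may differ from how the node totals above were obtained.

```python
import copy
import xml.etree.ElementTree as ET


def halve(root: ET.Element) -> None:
    """Delete the second half of the root element's children, in place."""
    children = list(root)
    for child in children[len(children) // 2:]:
        root.remove(child)


def double(root: ET.Element) -> None:
    """Append a deep copy of every existing child to the root, in place."""
    for child in list(root):
        root.append(copy.deepcopy(child))


def count_nodes(root: ET.Element) -> int:
    """Count elements only (the root included); text nodes are not counted."""
    return sum(1 for _ in root.iter())


if __name__ == "__main__":
    # Tiny made-up document standing in for updateinfo.xml.
    root = ET.fromstring(
        "<updates><update><id>1</id></update><update><id>2</id></update></updates>"
    )
    double(root)
    print("after doubling:", count_nodes(root), "elements")
    halve(root)
    print("after halving:", count_nodes(root), "elements")
```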


Measurements


Three values were measured.  One was time taken to complete a task (like load a file), using g_get_monotonic_time, which reports in microseconds.  One was memory used by the task after it completed, using mallinfo, in particular the uordblks field (total allocated space), and one was memory leaks (also using mallinfo, after we freed memory).
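To make the three measurements concrete, here is a minimal sketch of the same idea in Python. It is not the suite's code: `time.monotonic_ns` stands in for g_get_monotonic_time, and `tracemalloc` stands in for mallinfo, which means it only sees allocations made through Python's allocator rather than the whole C heap. The function name `measure` is hypothetical.

```python
import time
import tracemalloc


def measure(task, cleanup):
    """Return (elapsed_us, bytes_in_use_after_task, bytes_left_after_cleanup)."""
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()

        start = time.monotonic_ns()
        result = task()                      # the measured operation
        elapsed_us = (time.monotonic_ns() - start) // 1000

        # Memory still allocated once the task has completed.
        after_task, _ = tracemalloc.get_traced_memory()

        cleanup(result)                      # free what the task created
        del result

        # Anything still allocated now is a candidate leak.
        after_cleanup, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return elapsed_us, after_task - before, after_cleanup - before


if __name__ == "__main__":
    # Hypothetical task: allocate a 1 MB buffer, then let it go.
    elapsed, used, leaked = measure(lambda: bytearray(1_000_000), lambda buf: None)
    print(f"time {elapsed} us, in use {used} bytes, left over {leaked} bytes")
```

The left-over figure will usually be a small non-zero number here because the bookkeeping variables themselves are allocated, so it is best read as "roughly zero" rather than as an exact leak count.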


Procedure


I ran the tests once averaged over 10 trials for each combination of test and file, and then again over 25 trials.  Ways the procedure could be improved include better isolation on the system from other processes, or providing more detail than the averaged scores, so we can detect any exceptional anomalies (e.g. some other process causes a file load to be delayed by hogging I/O).
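One way to report more than an average is to keep every trial's timing and summarise it with the median, extremes, and spread, so a single delayed run stands out. The sketch below assumes per-trial timings are available as a list; `summarise` is a hypothetical helper and the sample numbers are invented purely to show the effect of one outlier.

```python
import statistics


def summarise(samples_us):
    """Summarise per-trial timings so that outliers are visible, not averaged away."""
    ordered = sorted(samples_us)
    return {
        "trials": len(ordered),
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Invented timings in microseconds: four ordinary trials and one delayed one.
    samples = [100, 102, 98, 101, 400]
    for key, value in summarise(samples).items():
        print(f"{key:>6}: {value}")
```

With data like this the mean is pulled well above the median, which is a quick hint that some trial was disturbed.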


Results


Keep in mind that GXml wraps libxml2 for most functionality, so we
don't expect it to be faster than libxml2; rather, we want to see what
penalty a GObject wrapper (written in Vala) causes.


Memory Leaks


GXml was leaking memory like a sieve before the summer (0.3.2 includes memory leak fixes without the API breaks!), so I wanted to know what memory was left after these tasks from both libxml2 and GXml.  Luckily, neither had any leaks in the cases tested.  (That does not mean there aren't any!  Kudos to those who find them, and more to those who patch them.)


Results

Memory is in bytes (mallinfo's uordblks) and time is in microseconds.  diff is the gxml value divided by the libxml2 value.

data           libxml2         gxml     diff

load disk
  memory
test3.xml     20814019     23667584   1.1371
test4.xml     42604277     48477152   1.1378
test5.xml     86151738     98065217   1.1383
test6.xml    172261657    196126066   1.1385
test7.xml    344483559    392241280   1.1386
  time
test3.xml        37547        56513   1.5051
test4.xml        66747        63797   0.9558
test5.xml       144234       161024   1.1164
test6.xml       284488       287911   1.0120
test7.xml       561406       564904   1.0062

load mem
  memory
test3.xml     24988568     28866015   1.1552
test4.xml     51434229     59523841   1.1573
test5.xml    104192043    120665588   1.1581
test6.xml    208356730    241330737   1.1583
test7.xml    343791009    391564027   1.1390
  time
test3.xml        44199        53860   1.2186
test4.xml        84215        71695   0.8513
test5.xml       172920       184735   1.0683
test6.xml       347157       359909   1.0367
test7.xml       572627       555519   0.9701

save
  time
test3.xml        25610        24513   0.9572
test4.xml        52908        49175   0.9294
test5.xml        96449        98308   1.0193
test6.xml       192197       196295   1.0213
test7.xml       384343       395194   1.0282

stringify
  memory
test3.xml      2735339      3136192   1.1465
test4.xml      5696496      6287776   1.1038
test5.xml     11394656     12592800   1.1051
test6.xml     22789264     25185552   1.1051
  time
test3.xml        22873        26749   1.1695
test4.xml        46166        54537   1.1813
test5.xml        93205       111312   1.1943
test6.xml       198988       235645   1.1842
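For anyone who wants to re-derive the diff column or see the absolute cost, the short sketch below recomputes it for the load-from-disk timings reported above. It assumes that diff is the gxml value divided by the libxml2 value, which is what the figures are consistent with; the `overhead` helper is made up for this example.

```python
# (libxml2, gxml) load-from-disk times in microseconds, taken from the results above.
LOAD_DISK_TIME_US = {
    "test3.xml": (37547, 56513),
    "test4.xml": (66747, 63797),
    "test5.xml": (144234, 161024),
    "test6.xml": (284488, 287911),
    "test7.xml": (561406, 564904),
}


def overhead(libxml2_value, gxml_value):
    """Return (ratio, percent_change) of GXml relative to libxml2."""
    ratio = gxml_value / libxml2_value
    return ratio, (ratio - 1.0) * 100.0


for name, (base, wrapped) in LOAD_DISK_TIME_US.items():
    ratio, percent = overhead(base, wrapped)
    extra_us = wrapped - base  # absolute difference, negative when GXml was faster
    print(f"{name}  ratio {ratio:.4f}  ({percent:+.1f}%)  extra {extra_us} us")
```

Looking at the absolute difference next to the ratio helps separate a large relative penalty on a small file from a cost that actually matters in wall-clock terms.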






Discussion


loading documents from disk

 

When it comes to loading a file from the disk, we compared xmlReadFile versus gxml_document_new_from_path (which uses xmlParseFile). 



GXml's memory usage is consistently ~14% higher than libxml2's.



Time-wise, on smaller files, GXml takes up to 50% longer than using libxml2.  I'm not sure why test4.xml is miraculously lower in this run.  You can see that the larger the file, the smaller the difference, which makes sense, since most of the hard work is done by libxml2 anyway.



loading documents from memory



With memory, again, we see a consistent increase between ~14-16%.



Time-wise, again GXml oddly performs better on test4.xml (and, this time, on test7.xml as well).  Elsewise, we see the same trend: there is little difference with larger files.



saving to disk



We don't report memory differences because GXml's save functionality cleans up its use of xmlSaveCtxt before it exits, so we can't (easily) see how much we used.  Neither leaks, so there is nothing to see there.



Time-wise, it seems to take about the same length of time, but GXml may be trending to more.  This could be due to tasks like synchronising data that is initially stored just in GXmlNodes and needs to be copied into the xmlDoc of libxml2 to make it to disk.



stringification



Memory-wise, we typically see an increase of ~10-15%.  Note that stringification of the largest file, test7.xml, failed, which requires further investigation.  Stringification was done with xmlDocDumpFormatMemory.



Time-wise, the increase was ~16-20%. 


Conclusion


Regarding memory usage, if you use GXml for cases such as these, you can expect around a 15% increase in memory usage.  That makes sense, as GObjects are used instead of the light C structures libxml2 typically uses.  One benefit of wrapping libxml2 is that we don't actually create a GXmlNode for every xmlNode in a document, only the ones we use, so a pure GObject implementation might use more memory.
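The idea of creating wrappers only for the nodes that get used can be sketched as below. This is an illustration of the general pattern, not GXml's implementation: the classes `RawNode`, `Wrapper`, and `Document` are hypothetical stand-ins, and keeping wrappers in a per-document cache is an assumption made for the example.

```python
class RawNode:
    """Stand-in for a lightweight libxml2 xmlNode (hypothetical, for illustration)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


class Wrapper:
    """Stand-in for a heavier GObject-style wrapper such as a GXmlNode."""

    def __init__(self, document, raw):
        self._document = document
        self._raw = raw

    @property
    def name(self):
        return self._raw.name

    @property
    def children(self):
        # Child wrappers are created only when somebody asks for them.
        return [self._document.wrap(child) for child in self._raw.children]


class Document:
    def __init__(self, raw_root):
        self._raw_root = raw_root
        self._wrappers = {}  # id(raw node) -> Wrapper, filled on demand

    def wrap(self, raw):
        """Return the wrapper for a raw node, creating it on first use."""
        key = id(raw)
        if key not in self._wrappers:
            self._wrappers[key] = Wrapper(self, raw)
        return self._wrappers[key]

    @property
    def root(self):
        return self.wrap(self._raw_root)

    @property
    def wrapper_count(self):
        return len(self._wrappers)


if __name__ == "__main__":
    raw = RawNode("updates", [RawNode("update", [RawNode("id")]) for _ in range(3)])
    doc = Document(raw)
    print("wrappers after load:", doc.wrapper_count)
    names = [child.name for child in doc.root.children]
    print("visited", names, "-> wrappers now:", doc.wrapper_count)
```

In this sketch the grandchildren never get wrappers because nothing touched them, which is the source of the memory saving described above.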



Regarding time usage, the difference for some operations is small, a couple percent, and for others, the difference is larger with smaller files, as big as 50% when loading a smaller file.  Larger files in those cases (such as loading documents) see less and less of a penalty.



I feel as though for many common applications, these don't represent a significant penalty (the extra time spent loading a document was under 20 milliseconds for every file tested), and can be worth the benefits of using a GObject API.


Going forward


If you're interested in more about GXml's performance, the test suite will be in gxml/tests/performance/.  Feel free to submit new tests and test files.



Regarding GXml, HEAD will be pushed out in a new feature release including the API changes, fancy new features, and contributions from others, including Daniel Espinosa, Adam Ples, Simon Reimer, and others.



Cheerio!
