Interactive 3D Modelling in Outdoor Augmented Reality Worlds

Research Thesis for the Degree of Doctor of Philosophy
By Wayne Piekarski
Bachelor of Engineering in Computer Systems Engineering (Hons), University of South Australia
wayne@cs.unisa.edu.au
Supervisor
Dr. Bruce Thomas
Adelaide, South Australia
February 2004
Wearable Computer Lab
School of Computer and Information Science
Division of Information Technology, Engineering, and the Environment
The University of South Australia
Table of Contents
Chapter 1 - Introduction
1.1 Problem statement
1.2 Thesis statement
1.3 Contributions
1.4 Dissertation structure
Chapter 2 - Background
2.1 Definition of augmented reality
2.2 Applications
2.3 See through display technology
2.4 3D tracking technology
2.5 Desktop direct manipulation techniques
2.6 Virtual reality interaction techniques
2.7 Physical world capture techniques
2.8 CAD modelling
2.9 Outdoor augmented reality wearable computers
2.10 Summary
Chapter 3 - Augmented reality working planes
3.1 Distance estimation cues
3.2 AR working planes definition
3.3 Coordinate systems
3.4 Plane creation
3.5 Object manipulation
3.6 Vertex placement
3.7 Accurate alignment with objects
3.8 Alignment accuracy using HMDs
3.9 Summary
Chapter 4 - Construction at a distance
4.1 Technique features
4.2 Direct object placement techniques
4.3 Body-relative plane techniques
4.4 AR working planes techniques
4.5 Integrated example
4.6 Operational performance
4.7 Summary
Chapter 5 - User interface
5.1 Design rationale
5.2 Cursor operations
5.3 Command entry
5.4 Display interface
5.5 Tinmith-Metro modelling application
5.6 Informal user evaluations
5.7 Future work
5.8 Summary
Chapter 6 - Software architecture
6.1 Design overview
6.2 Previous work
6.3 Object design
6.4 Object storage
6.5 Implementation internals
6.6 Sensors and events
6.7 Rendering
6.8 Demonstrations
6.9 Summary
Chapter 7 - Hardware
7.1 Hardware inventory
7.2 Tinmith-Endeavour backpack
7.3 Glove input device
7.4 Summary
Chapter 8 - Conclusion
8.1 Augmented reality working planes
8.2 Construction at a distance
8.3 User interfaces
8.4 Vision-based hand tracking
8.5 Modelling applications
8.6 Software architecture
8.7 Mobile hardware
8.8 Future work
8.9 Final remarks
Appendix A - Evolutions
A.1 Map-in-the-Hat (1998)
A.2 Tinmith-2 prototype (1998)
A.3 Tinmith-3 prototype (1999)
A.4 Tinmith-4 and ARQuake prototype (1999)
A.5 Tinmith-evo5 prototype one (2001)
A.6 Tinmith-VR prototype (2001)
A.7 Tinmith-evo5 prototype two with Tinmith-Endeavour (2002)
A.8 ARQuake prototype two with Tinmith-Endeavour (2002)
A.9 Summary
Appendix B - Attachments
B.1 CD-ROM
B.2 Internet
References
List of Figures
Figure 1‑1. Example Sony Glasstron HMD with video camera and head tracker
Figure 1‑2. Example of outdoor augmented reality with computer-generated furniture
Figure 1‑3. Schematic of augmented reality implementation using a HMD
Figure 2‑1. Example of Milgram and Kishino’s reality-virtuality continuum
Figure 2‑2. The first head mounted display, developed by Ivan Sutherland in 1968
Figure 2‑3. External and AR immersive views of a laser printer maintenance application
Figure 2‑4. Virtual information windows overlaid onto the physical world
Figure 2‑5. Worker using an AR system to assist with wire looming in aircraft assembly
Figure 2‑6. AR with overlaid ultrasound data guiding doctors during needle biopsies
Figure 2‑7. Studierstube AR environment, with hand-held tablets and widgets
Figure 2‑8. Marker held in the hand provides a tangible interface for viewing 3D objects
Figure 2‑9. Actors captured as 3D models from multiple cameras overlaid onto a marker
Figure 2‑10. Touring Machine system overlays AR information in outdoor environments
Figure 2‑11. BARS system used to reduce the detail of AR overlays presented to the user
Figure 2‑12. Context Compass provides navigational instructions via AR overlays
Figure 2‑13. Schematic of optical overlay-based augmented reality
Figure 2‑14. Optically combined AR captured with a camera from inside the HMD
Figure 2‑15. Schematic of video overlay-based augmented reality
Figure 2‑16. Example video overlay AR image, captured directly from software
Figure 2‑17. Precision Navigation TCM2 and InterSense InertiaCube2 tracking devices
Figure 2‑18. CDS system with pull down menus and creation of vertices to extrude solids
Figure 2‑19. CHIMP system with hand-held widgets, object selection, and manipulation
Figure 2‑20. Immersive and external views of the SmartScene 3D modelling environment
Figure 2‑21. Partial UniSA campus model captured using manual measuring techniques
Figure 2‑22. Screen capture of Autodesk’s AutoCAD editing a sample 3D model
Figure 2‑23. Venn diagrams demonstrating Boolean set operations on 2D areas A and B
Figure 2‑24. CSG operations expressed as Boolean sets of 3D objects
Figure 2‑25. Plane equation divides the universe into two half spaces, inside and outside
Figure 2‑26. Finite cylinder defined by intersecting an infinite cylinder with two planes
Figure 2‑27. Box defined using six plane equations and CSG intersection operator
Figure 2‑28. Wearable input devices suitable for use in outdoor environments
Figure 3‑1. 3D objects are projected onto a plane near the eye to form a 2D image
Figure 3‑2. Normalised effectiveness of various depth perception cues over distance
Figure 3‑3. Graph of the size in pixels of a 1m object on a HMD plane 1m from the eye
Figure 3‑4. Coordinate systems used for the placement of objects at or near a human
Figure 3‑5. World-relative AR working planes remain fixed during user movement
Figure 3‑6. Location-relative AR working planes remain at the same bearing from the user and maintain a constant distance from the user
Figure 3‑7. Body-relative AR working planes remain at a fixed orientation and distance to the hips and are not modified by motion of the head
Figure 3‑8. Head-relative AR working planes remain attached to the head during all movement, maintaining the same orientation and distance to the head
Figure 3‑9. AR working plane created along the head viewing direction of the user
Figure 3‑10. AR working plane created at a fixed offset and with surface normal matching the view direction of the user
Figure 3‑11. AR working plane created at intersection of cursor with object, and normal matching the user’s view direction
Figure 3‑12. AR working plane created relative to an object’s surface
Figure 3‑13. AR working plane created at a nominated object based on the surface normal of another reference object
Figure 3‑14. Manipulation of an object along an AR working plane surface
Figure 3‑15. Depth translation from the user moving a head-relative AR working plane
Figure 3‑16. AR working plane attached to the head can move objects with user motion
Figure 3‑17. Scaling of an object along an AR working plane with origin and two points
Figure 3‑18. Rotation of an object along AR working plane with origin and two points
Figure 3‑19. Vertices are created by projecting the 2D cursor against an AR working plane
Figure 3‑20. AR working plane attached to the head can create vertices near the user
Figure 3‑21. Example fishing spot marked using various shore-based landmarks
Figure 3‑22. Example of range lights in use to indicate location relative to a transit bearing
Figure 3‑23. Sony Glasstron HMD measured parameters and size of individual pixels
Figure 3‑24. Distant landmarks must be a minimum size to be visible on a HMD
Figure 3‑25. Dotted lines indicate the angle required to separate the two markers’ outlines
Figure 3‑26. Similar triangles used to calculate final positioning error function
Figure 3‑27. Rearrangement and simplification of final positioning error equation
Figure 3‑28. Derivation of alignment equation when marker B is at an infinite distance
Figure 3‑29. 3D surface plot with marker distances achieving alignment accuracy of 2 cm
Figure 3‑30. 3D surface plot with marker distances achieving alignment accuracy of 50 cm
Figure 4‑1. AR view of virtual table placed in alignment with physical world table
Figure 4‑2. VR view of bread crumbs markers defining a flat concave perimeter
Figure 4‑3. AR view showing registration of perimeter to a physical world grassy patch
Figure 4‑4. Example bread crumbs model extruded to form an unbounded solid shape
Figure 4‑5. Infinite carving planes used to create a convex shape from an infinite solid
Figure 4‑6. Orientation invariant planes generated using multiple marker positions
Figure 4‑7. Relationship between GPS accuracy and required distance to achieve better than 1 degree of orientation error for two different GPS types
Figure 4‑8. Orientation invariant planes formed using first specified angle and markers
Figure 4‑9. Box objects can be moved into a building surface to carve out windows
Figure 4‑10. Convex trapezoid and concave T, L, and O-shaped objects
Figure 4‑11. Concave object created using CSG difference of two convex boxes
Figure 4‑12. AR working planes are used to specify vertices and are projected along the surface normal for carving the object’s roof
Figure 4‑13. AR view of infinite planes building created with sloped roof
Figure 4‑14. AR view of infinite planes building being interactively carved with a roof
Figure 4‑15. VR view of building with sloped roof, showing overall geometry
Figure 4‑16. Frames of automobile carving, with markers placed at each corner
Figure 4‑17. Final resulting automobile shown overlaid in AR view, and in a VR view
Figure 4‑18. Schematic illustrating the painting of a window onto a wall surface
Figure 4‑19. Examples showing surface of revolution points for tree and cylinder objects
Figure 4‑20. AR view of surface of revolution tree with markers on AR working plane
Figure 4‑21. VR view of final surface of revolution tree as a solid shape
Figure 4‑22. Outdoor stack of pallets approximating a box, before modelling
Figure 4‑23. VR view of final model with captured geometry and mapped textures
Figure 4‑24. AR view of final abstract model, including street furniture items
Figure 4‑25. VR view of final abstract model, including street furniture items
Figure 5‑1. Each finger maps to a displayed menu option, the user selects one by pressing the appropriate finger against the thumb
Figure 5‑2. Immersive AR view, showing gloves and fiducial markers, with overlaid modelling cursor for selection, manipulation, and creation
Figure 5‑3. Translation operation applied to a virtual tree with the user’s hands
Figure 5‑4. Scale operation applied to a virtual tree with the user’s hands
Figure 5‑5. Rotate operation applied to a virtual tree with the user’s hands
Figure 5‑6. Original WordStar application, showing menu toolbar at bottom of screen
Figure 5‑7. Immersive AR overlay display components explained
Figure 5‑8. Top down aerial view of VR environment in heading up and north up mode
Figure 5‑9. Orbital view centred on the user with a VR style display
Figure 5‑10. User, view plane, 3D world objects, and distant projection texture map
Figure 5‑11. Immersive view of Tinmith-Metro with 3D cursor objects appearing to be floating over the incoming video image
Figure 5‑12. External view of Tinmith-Metro with user’s body and 3D environment
Figure 5‑13. Options available from the top-level of Tinmith-Metro’s command menu
Figure 5‑14. Menu hierarchy of available options for the Tinmith-Metro application
Figure 5‑15. Original horizontal menu design in immersive AR view
Figure 5‑16. View of the user interface being tested in a VR immersive environment
Figure 6‑1. Overall architecture showing sensors being processed using libraries and application components, and then rendered to the user’s HMD
Figure 6‑2. Layers of libraries forming categories of objects available to process data
Figure 6‑3. Data values flow into a node for processing, producing output values
Figure 6‑4. Expanded view of data flow model showing stages of processing
Figure 6‑5. Network distribution is implemented transparently using automatically generated serialisation callbacks and a network transmission interface
Figure 6‑6. Examples demonstrating usage of the hierarchical object store
Figure 6‑7. Simplified layout of composite Position class, showing nested objects
Figure 6‑8. Edited extract from the is-300.h orientation tracker C++ definition file
Figure 6‑9. Complete XML serialisation of the IS-300 orientation tracker object
Figure 6‑10. C++ code demonstrating setup and execution of callbacks
Figure 6‑11. Mathematical operations possible between absolute and relative objects
Figure 6‑12. Distorted view of Tinmith-Metro showing improperly placed avatar objects when the resolution of OpenGL’s internal values is exceeded
Figure 6‑13. User is represented in the 3D world with a hierarchical avatar model
Figure 6‑14. Indoor tracking system with backpack, head and shoulder mounted video cameras, GPS antenna, and fiducial markers on the hands, walls and ceiling
Figure 6‑15. Partial layout of manipulation menu, with internal commands and next path
Figure 7‑1. Data bus interconnect diagram of components used for mobile outdoor AR
Figure 7‑2. Rear view of previous backpack design, showing tangled mess of cabling
Figure 7‑3. Front and rear views of the Tinmith-Endeavour backpack in use outdoors
Figure 7‑4. Design of polycarbonate housing with hinged laptop holder and internals
Figure 7‑5. Backpack shown in desktop configuration, permitting normal use outside
Figure 7‑6. Interior of backpack housing, showing internal components and cabling
Figure 7‑7. Power supply breakout box, with +5V, +9V, and +12V at each connector
Figure 7‑8. Power bus interconnect diagram of components used for mobile outdoor AR
Figure 7‑9. Two USB ports, glove connector, and cables mounted onto shoulder straps
Figure 7‑10. Design of brackets to attach Sony Glasstron and Firefly camera to a helmet
Figure 7‑11. Designs of various versions of the glove and attached fiducial markers
Figure 7‑12. Circuit schematic for the low power glove controller
Figure 7‑13. Example use of the fiducial marker tracking used for a 3D cursor
Figure 7‑14. ARToolKit default camera_para.dat file, with error x=2.5, y=48.0
Figure 7‑15. Graphical depictions showing original and new orthogonal camera model
Figure A‑1. Phoenix-II wearable computer with batteries, cables, and belt mounting
Figure A‑2. Map-in-the-Hat prototype inside ruck sack, with antenna, cables, and HMD
Figure A‑3. Screen shots of Map-in-the-Hat indicating a waypoint on the display
Figure A‑4. Tinmith-2 hiking frame with some equipment attached
Figure A‑5. 2D top down map overlaid on physical world (using offline AR overlay)
Figure A‑6. 2D top down map overlay with current location relative to nearby buildings
Figure A‑7. 3D wireframe overlay of building, with small extension made to the left
Figure A‑8. Tinmith-3 backpack with HMD, head tracker, and forearm keyboard
Figure A‑9. Software interconnect diagram for Tinmith-2 to Tinmith-4 prototypes
Figure A‑10. View of ModSAF tool with simulated entities and a wearable user
Figure A‑11. Wearable user in outdoor environment generates DIS packets
Figure A‑12. MetaVR view of avatar for wearable user and helicopter for ModSAF entity
Figure A‑13. DIS entities overlaid in yellow on HMD with a top down view
Figure A‑14. Visualising artificial CAD building extensions overlaid on physical world
Figure A‑15. ARQuake implemented using optical AR with virtual monsters shown
Figure A‑16. Mock up demonstrating how a modelling system could be used outdoors
Figure A‑17. Side view of original Tinmith-evo5 backpack, with cabling problems
Figure A‑18. Close up view of messy cable bundles and miscellaneous input devices
Figure A‑19. Screen shots of the first Tinmith-Metro release in use outdoors
Figure A‑20. VR immersive system used to control the Tinmith-Metro user interface
Figure A‑21. Side and front views of the Tinmith-Endeavour backpack in use outdoors
Figure A‑22. Screen capture of the latest Tinmith-Metro release in use outdoors
Figure A‑23. USB mouse embedded into a children’s bubble blowing toy
Figure A‑24. Monsters overlaid on the physical world with video overlay ARQuake
List of Tables
Table 2‑1. Comparison between optical and video combined AR systems
Table 2‑2. Comparison between various types of 3D tracking technology
Table 2‑3. Comparison between forms of VR interaction techniques
Table 3‑1. Alignment accuracies for markers at various distances from the user
Table 4‑1. Top down view of building shapes with vertices (v), edges (e), and facets (f)
Table 6‑1. Approximate round trip delays experienced for network serialisation
Table 7‑1. Current backpack components with cost, location, and power consumption
Abbreviations and Definitions
1394 IEEE Standard 1394, also referred to as FireWire or i.Link [IEEE95]
2D Two Dimensional in XY
3D Three Dimensional in XYZ
AAAD Action at a distance, first defined by Mine [MINE95a]
ACRC Advanced Computing Research Centre at UniSA
AGD66 Australian Geodetic Datum 1966 [ICSM00]
AGD84 Australian Geodetic Datum 1984 [ICSM00]
AR Augmented Reality
CIS School of Computer and Information Science at UniSA
COTS Commercial Off The Shelf
CRT Cathode Ray Tube (technology used in television and monitor displays)
CSG Constructive Solid Geometry
DGPS Differential GPS
DIS IEEE Standard 1278, the Distributed Interactive Simulation protocol [IEEE93]
DOF Degrees of Freedom ([X, Y, Z] for position, [θ, φ, ϕ] for orientation)
3DOF Three degrees of freedom: only three measurements are available, such as orientation alone or position alone
6DOF Six degrees of freedom: both orientation and position are available, giving a complete tracking solution
DSTO Defence Science and Technology Organisation, Adelaide, South Australia
ECEF Earth-Centred Earth-Fixed Cartesian coordinates, in metres [ICSM00]
Evo Evolution or version number
FOV Field of View, the angle of the user’s view that a head mounted display or camera can cover
GLONASS Russian Federation, Global Navigation Satellite System (Global'naya Navigatsionnaya Sputnikovaya Sistema in Russian)
GPS US Department of Defense, Global Positioning System
HMD Head Mounted Display
HUD Heads Up Display
ITD Information Technology Division (located at DSTO Salisbury, Adelaide)
IPC Inter-Process Communication
LCD Liquid Crystal Display
LOD Land Operations Division (located at DSTO Salisbury, Adelaide)
LLH Latitude Longitude Height spherical polar coordinates [ICSM00]
LSAP Land Situational Awareness Picture System
MR Mixed Reality
NFS Sun Microsystems’ Network File System [SAND85]
OEM Original Equipment Manufacturer
RPC Sun Microsystems’ Remote Procedure Calls
RTK Real-Time Kinematic (centimetre grade GPS technology)
SERF Synthetic Environment Research Facility (located at DSTO Salisbury, Adelaide)
SES Scientific and Engineering Services (located at DSTO Salisbury, Adelaide)
STL Standard Template Library (for the C++ language)
SQL Structured Query Language
Tinmith This Is Not Map In The Hat (named for historical purposes)
UniSA University of South Australia
USB Universal Serial Bus
UTM Universal Transverse Mercator grid coordinates, in metres [ICSM00]
VE Virtual Environment
VR Virtual Reality
WCL Wearable Computer Lab at the University of South Australia
WIM Worlds in Miniature [STOA95]
WIMP Windows, Icons, Menus, and Pointer
WGS84 World Geodetic System 1984 [ICSM00]
X The X Window System
XML Extensible Markup Language
Summary
This dissertation presents interaction techniques for 3D modelling of large structures in outdoor augmented reality environments. Augmented reality is the process of registering projected computer-generated images over a user’s view of the physical world. With the use of a mobile computer, augmented reality can also be experienced in an outdoor environment. Working in a mobile outdoor environment introduces new challenges not previously encountered indoors, requiring the development of new user interfaces to interact with the computer. Current AR systems support only limited interactions, which in turn limits the complexity of the applications that can be developed.
This dissertation describes a number of novel contributions that improve the state of the art in augmented reality technology. Firstly, the augmented reality working planes technique gives the user the ability to create and edit objects at large distances using line of sight and projection techniques. This technique overcomes limitations in a human’s ability to perceive depth, and requires simple input devices that are available on mobile computers. A number of techniques that leverage AR working planes are developed, collectively termed construction at a distance: street furniture, bread crumbs, infinite planes, projection carving, projection colouring, surface of revolution, and texture map capture. These techniques can be used to create and capture the geometry of outdoor shapes using a mobile AR system with real-time verification and iterative refinement. To provide an interface for these techniques, a novel AR user interface with cursors and menus was developed. This user interface is based around a pair of pinch gloves for command input, and the use of a custom developed vision tracking system for use in a mobile environment. To develop applications implementing these contributions, a new software architecture was designed to provide a suitable abstraction to make development easier. This architecture is based on an object-oriented data flow approach, uses an object repository based on Unix file system notation, and supports distributed objects. The software requires a platform to execute on, and so a custom wearable hardware platform was developed. The hardware is based around a backpack that contains all the equipment required, and uses a novel flexible design that supports simple reconfiguration.
Based on these contributions, a number of modelling applications were developed to demonstrate the usefulness of these techniques. These modelling applications allow users to walk around freely outside, and use proprioception and interactions with the hands to control the task. Construction at a distance allows the user to model objects such as buildings, trees, automobiles, and ground features with minimal effort in real-time, and at any scale and distance beyond the user’s reach. These applications have been demonstrated in the field to verify that the techniques can perform as claimed in the dissertation.
Declaration
I declare that this thesis does not incorporate without acknowledgment any material previously submitted for a degree or diploma in any university and that to the best of my knowledge it does not contain any materials previously published or written by another person except where due reference is made in the text.
______________________
Wayne Piekarski
Adelaide, February 2004
______________________
Dr Bruce Thomas – Thesis Supervisor
Adelaide, February 2004
Acknowledgements
A dissertation does not just appear out of nowhere, and although it is supposed to be a contribution by one person for a PhD, there are still a lot of people who have helped me out over the years. I have been fortunate enough to have had the support of so many people and without it this would not have been possible. While most people did not help directly on the project, every one of them contributed in some way towards helping me to get where I am today, even things like just being a friend and going out and having fun. Others were responsible for giving me a push in the right direction in life, and for everyone listed here I am eternally grateful for their help.
Firstly there is the Wearable Computer Lab crew. Although I initially started in the lab alone, over the years we have grown into a major lab at the university, and I have worked with a number of people - Ben Close, Hannah Slay, Aaron Toney, Ben Avery, Ross Smith, Peter Hutterer, Matthias Bauer, Pierre Malbezin, Barrie Mulley, Matthew Schultz, Scott Sheridan, Leonard Teo, John Squires, John Donoghue, Phil DeBondi, and Dan Makovec. Many of us have spent countless late nights working on projects to meet deadlines and the spirit of our team is truly awesome.
In the CIS department I also have a number of friends apart from those in the WCL who I go to lunch with almost every day, and I also enjoy spending time with them outside of work hours - Grant Wigley, Greg Warner, Malcolm Bowes, Stewart Itzstein, and Toby Richer. I would especially like to thank Grant for his friendship during the PhD program, as we both started at the same time and have helped each other out considerably. On behalf of the lunch crew I would also like to express gratitude to the Brahma Lodge Hotel, for serving up copious amounts of the finest cheesy potatoes on the planet to fuel our lunch time cravings.
In the CIS department I have worked with three heads of school over the years and each has supported me in my endeavours - Andy Koronios, Brenton Dansie, and David Kearney. With their financial and leadership contributions I have been given the resources I need to complete this PhD and also help to develop the WCL into what it is today.
Staff members in the CIS department have also been extremely helpful. The numerous general staff performed the many tasks that are required to keep the department running each day, and were always very happy to help out when required. They helped to organise my teaching trips, the ordering of equipment, and dealing with finances. Greg Warner and Malcolm Bowes ran the department servers and allowed us to perform unorthodox computer networking in the lab. Frank Fursenko and Tony Sobey also discussed with me C++ and graphics programming on a number of occasions. Millist Vincent assisted by proofreading parts of the thesis and provided technical comments.
The DSTO Land Operations Division with Victor Demczuk and Franco Principe were initially responsible for donating various wearable computers and resources to the department. These were used to start the initial projects in the WCL, which would not have existed without them. The Information Technology Division at DSTO has also been instrumental in further research work we have done, giving us a number of large grants for equipment as well as the design of our new backpack. I would especially like to thank Rudi Vernik for his vision in granting this funding to the WCL, which has helped to make our research first class work. The SES group with John Wilson, Paul Zalkauskas, Chris Weckert, and Barry Crook, led by Peter Evdokiou, have to be the finest and most professional group of engineers I have ever met. Together they manufactured the Tinmith-Endeavour backpack design which dazzles people at conferences all over the world.
When I was still growing up in 1993, I had the fortune of using an Internet dial up service run by Mark Newton. He introduced me to the one true operating system Unix, and all its amazing features. I used his machine to learn how to use Unix and the fledgling Internet, and through this I met a number of other people in the area. The late Chris Wood took the time to teach me how to write Makefiles, use Emacs (the one true editor), and how to correct bugs in my first X11 application. These contributions helped to steer my professional development toward Unix development which would come into much use at university. The Linux community has also supported me by inviting me to speak at most of their conferences, allowing me to teach audiences about my research work and to learn from others. The developers who donated their time to the development of drivers for the hardware I use have been most helpful; without them this project would not have been possible.
When I was just starting at university I was fortunate enough to meet Steve Baxter. He had just started up an Internet company called SE Net along with his wife Emily Baxter and friend Chris Foote, and asked me to work as the first employee of the company helping with sales and support. As the company grew I was given the role of Manager of R&D, and given the task of developing the systems that controlled the entire company. Steve trusted the future of his entire company to my abilities. This enabled me to gain a wealth of experience in leadership and design that would never be given to most 18 year olds, and for this I am very grateful. Together we built a company that led the field in a number of areas. As part of the team at SE Net, I have many friends - Matt Altus, David Kuzmak, Richard and Genni Kaye, Robert Gulley, Andrew Xenides, Rebecca Razzano, Lindsay Whitbread, Megan Hehir, Mark Mills, and Scott Smith. Unfortunately SE Net has now been absorbed by the new owners of the company, and so the fine traditions and spirit from SE Net are no longer around, but exist in our memories forever.
During my travels overseas I have met a number of great people who have been friendly and helpful. I would like to especially thank the Computer Science department at the University of North Carolina at Chapel Hill for allowing me to spend three months performing research there. Sally Stearns and Ray Thomas were kind enough to let me stay at their house for the first few days while I found a place to stay. At UNC I made many friends such as Mark Harris, Scott Cooper, Ken Hoff, Benjamin Lok, Samir Nayak, David Marshburn, Andrew Nashel, and Stefan Sain, as well as the Pi Lambda Phi fraternity and Drink Club crew.
There are also a number of other people who do not work with me but have been friends for many years and I would also like to thank them for their support - David Pridgeon, Trent Greenland, Ghassan Abi Mosleh, Rebecca Brereton, Donna Harvey, Krishianthi Karunarathna, Josie Brenko, Tam Nguyen, Sarah Bolderoff, and Derek Munneke.
The most instrumental person for this dissertation was my supervisor Dr Bruce Thomas. I have worked with Bruce for the last five years, first as a final year undergraduate project student, and then as a PhD student. Bruce had the insight to move into wearable computers and augmented reality in the very early days and formed the Wearable Computer Lab we have today. Bruce has helped me to gain an international profile in the AR and wearables community, by generously giving me the funding to travel to numerous places all over the world and to meet other researchers (and obtain a frequent flyer gold card). This international development has strengthened my PhD with experience from a wide range of people and further motivated my research with fresh ideas. I would also like to thank Bruce for the countless hours I have spent with him discussing research, proofreading papers and this dissertation, talking about life in general, and having a beer as friends when travelling.
To achieve a PhD at the University of South Australia, the dissertation must be reviewed by two external experts in the area of research. I was fortunate enough to have Professor Steven Feiner from Columbia University and Associate Professor Mark Billinghurst from HIT Lab New Zealand as reviewers, who are both outstanding researchers in the international community. Each of them carefully read through the hundreds of pages of this dissertation and gave me excellent feedback which has been integrated into this final version. I would like to thank both of them for their time and dedication to reviewing my work and helping to improve it.
Most importantly of all, I would like to thank my mum Kris, dad Spishek, brother Arron, and my grandparents for supporting me for the last 25 years. My family also helped me build some of the early backpack prototypes in the garage, making an important contribution to the project. It is through their encouragement and care that I have made it through all the steps to reach this point in life, and I couldn’t have done it without them. When my dad bought me a Commodore 64 when I was a little boy, who would have thought I would have ended up here today? My family has always taken care of me and I love them all very much.
In summary, I would like to thank everyone for putting up with me for the last couple of years. I believe that this dissertation has made a real contribution to the field of computer science and I hope that everyone that reads this dissertation finds it useful in their work. It has been a fun journey so far, and I look forward to catching up with everyone and having lots of fun and good times, because that is the most important thing of all.
Now it is time to catch up on some sleep and have a holiday! (Well, not really - there is still plenty of other work to do now.)
Regards,
______________________
Wayne Piekarski
Adelaide, February 2004
Inspiration
“Another noteworthy characteristic of this manual is that it doesn't always tell the truth ... The author feels that this technique of deliberate lying will actually make it easier for you to learn the ideas. Once you understand the simple but false rule, it will not be hard to supplement that rule with its exceptions.”
Donald Knuth, from the preface to The TeXbook
“If you do choose to use a computer, beware the temptation it offers to let manuscript preparation displace composition. They are two separate activities, best done separately. Hyphenation and exposition are at war with one another. Pagination vies with content. The mind busy fretting over point size has no time left over to consider clarity. If you need a break from the ardors of composition, try the time-honored ones like poking the fire or baking bread. They smell good, and they don't give you any illusion that your paper is making progress while you indulge in them.”
Mary-Claire van Leunen
Chapter 1 - Introduction
"We all agree that your theory is crazy, but is it crazy enough?"
Niels Bohr (1885-1962)
Augmented reality (AR) is the registration of projected computer-generated images over a user’s view of the physical world. With this extra information presented to the user, the physical world can be enhanced or augmented beyond the user’s normal experience. The addition of information that is spatially located relative to the user can help to improve their understanding of it. In 1965, Sutherland described his vision for the Ultimate Display [SUTH65], with the goal of developing systems that can generate artificial stimulus and give a human the impression that the experience is actually real. Sutherland designed and built the first optical head mounted display (HMD) that was used to project computer-generated imagery over the physical world. This was the first example of an augmented reality display [SUTH68]. Virtual reality (VR) was developed later using opaque display technology to immerse the user into a fully synthetic environment. One of the first integrated environments was by Fisher et al., combining tracking of the head for VR with the use of tracked gloves as an input device [FISH86].
Augmented reality and virtual reality share common features in that they present computer-generated images for a user to experience, with information anchored to 3D locations relative to the user’s display, their body, or the world [FEIN93b]. The physical world seen when using AR can be thought of as a fourth kind of information that the user can experience, similar to world-relative display but not artificially generated. A typical example of a head mounted display is shown in Figure 1‑1, and an example AR scene with both physical and virtual worlds is depicted in Figure 1‑2. The schematic diagram in Figure 1‑3 depicts how a see-through HMD (used to produce AR images for the user) can be conceptualised, and the next chapter discusses in depth their implementation. While other forms of sensory stimulation such as haptics and audio are also available to convey information to the user, these will not be discussed since the focus of this dissertation is HMD-based AR.
Figure 1‑1 Example Sony Glasstron HMD with video camera and head tracker
Although the first HMD was implemented in 1968 and performed augmented reality, the first main area of research with these devices was for virtual reality. VR shares a number of research problems with AR, but does not rely on the physical world to provide images and can also be viewed on display monitors or projectors. Since the user is normally tethered to the VR system, they are not able to walk large distances and virtual movement techniques such as flying are required to move beyond these limits. These same problems restricted initial AR research because, by its very nature, the user would like to walk around and explore the physical world overlaid with AR information without being tethered to a fixed point. While there are some important uses for AR in fixed locations, such as assisted surgery with overlaid medical imagery [STAT96] or assistance with assembly tasks [CURT98], the ability to move around freely is important. When discussing the challenges of working outdoors, Azuma states that the ultimate goal of AR research is to develop systems that “can operate anywhere, in any environment” [AZUM97b].
A pioneering piece of work in mobile augmented reality was the Touring Machine [FEIN97], the first example of a mobile outdoor AR system. Using technology that was small and light enough to be worn, a whole new area of mobile AR research (both indoor and outdoor) was created. While many research problems are similar to indoor VR, there are unsolved domain specific problems that prevent mainstream AR usage. Older survey papers (such as by Azuma [AZUM97a]) cover many technological problems such as tracking and registration. As this technology has improved, newer research is focusing on higher-level problems such as user interfaces, as discussed by Azuma et al. [AZUM01].
Figure 1‑2 Example of outdoor augmented reality with computer-generated furniture
Figure 1‑3 Schematic of augmented reality implementation using a HMD
Many outdoor AR systems produced to date rely only on the position and orientation of the user’s head in the (sometimes limited) physical world as the user interface, with user interfaces that can indirectly adjust the virtual environment. Without a rich user interface capable of interacting with the virtual environment directly, AR systems are limited to changing only simple attributes, and rely on another computer (usually an indoor desktop machine) to actually create and edit 3D models. To date, no one has produced an outdoor AR system with an interface that allows the user to leave behind all their fixed equipment and independently perform 3D modelling in real-time. This dissertation explores user interface issues for AR, and makes a number of contributions toward making AR systems able to operate independently in the future, particularly for use in outdoor environments.
The development of user interfaces for mobile outdoor AR systems is currently an area with many unsolved problems. When technology is available in the future that solves existing registration and tracking problems, having powerful applications that can take advantage of this technology will be important. Azuma et al. state that “we need a better understanding of how to display data to a user and how the user should interact with the data” [AZUM01]. In his discussions of virtual reality technology, Brooks stated that input devices and techniques that substitute for real interactions were still an unsolved problem and important for interfacing with users [BROO97]. While VR is a different environment in that the user is fully immersed and usually restricted in motion, the 3D interaction problems Brooks discusses are similar and still relevant to the AR domain. Working in an outdoor environment also imposes more restrictions due to its mobile nature, and increases the number of problems to overcome.
On desktop computers, the ubiquitous WIMP interface (windows, icons, menus, and pointer - as pioneered by systems such as the Xerox Star [JOHN89]) is the de-facto standard user interface that has been refined over many years. Since mobile outdoor AR is a unique operating environment, many existing input methodologies developed for AR/VR and desktop user interfaces are unavailable or unsuitable for use. In an early paper about 3D modelling on a desktop, Liang and Green stated that mouse-based interactions are bottlenecks to designing in 3D because users are forced to decompose 3D tasks into separate 1D or 2D components [LIAN93]. Another problem is that most desktop 2D input devices require surfaces to operate on, and these are unavailable when walking outdoors. Rather than trying to leverage WIMP-based user interfaces, new 3D interfaces should be designed that take full advantage of the environment and devices available to the user. An advantage of AR and VR is that the user’s body can be used to control the view point very intuitively, although no de-facto standard has emerged for other controls in these applications. While research has been performed in the VR area to address this, many techniques developed are intended for use in immersive environments (which do not have physical world overlay requirements) and with fixed and limiting infrastructure (preventing portability).
With AR systems today (and with virtual environments in general), a current problem is the supply of 3D models for the computer to render and overlay on the physical world. Brooks mentions that 3D database construction and modelling is one of four technologies that are crucial for VR to become mainstream, and is still an unsolved problem [BROO97]. He mentions that there is promising work in the area of image-based reconstruction, but currently modelling is performed using CAD by-products or hard work. An interesting problem area to explore is what types of models can be created or captured directly while moving around outdoors. By integrating the modelling process and user interface, the user is able to control the modelling process directly and take advantage of their extensive knowledge of the environment.
For this dissertation I have investigated a number of different unsolved problems in the field of AR, and then combined the solutions developed to produce real world applications as demonstrations. I have formulated these different problems into research questions that will be addressed in this dissertation:
· How can a user intuitively control and interact with a mobile AR system outdoors, without hampering their mobility or encumbering their hands?
· How can a user perform manipulation tasks (such as translate, rotate, and scale) on existing 3D geometry with an outdoor mobile AR system, in many cases out of arm’s reach and at scales larger than the user’s body?
· How can a user capture 3D geometry representing objects that exist in the physical world, or create new 3D geometry of objects that the user can preview alongside physical world objects, out of arm’s reach and at scales larger than the user’s body?
· What is an appropriate software architecture to develop this system with, to operate using a wide variety of hardware and software components, and to simplify application development?
· What hardware must be developed and integrated into a wearable platform so that the user can perform AR in the physical world outdoors?
· What application domains can take advantage of the novel ideas presented in this dissertation, and to what real world uses can they be applied?
The main goal of this dissertation is to answer the questions discussed previously and contribute solutions towards the problems facing augmented reality today, predominantly in the area of user interfaces for mobile outdoor systems. The specific research goals of this dissertation are as follows:
· Mobile user interface - The user interface should not unnecessarily restrict the user’s mobility, and should use intuitive and natural controls that are simple to learn and use. Requiring the user to carry and manipulate physical props as controls may interfere with the user’s ability to perform a required task, such as holding a tool.
· Real-time modelling enhanced with proprioception - The user should be able to interactively create and capture the geometry of buildings using the presence of their body. Using solid modelling operations and the current position and orientation of the head and hands will make this process intuitive for the user.
· Mobile augmented reality - Outdoor augmented reality requires a mobile computer, a head mounted display, and tracking of the body. These technologies currently suffer from a number of limitations and applications must be designed with this in mind to be realisable. By using current technology, new ideas can be tested immediately rather than waiting for future technology that may not appear in the short term.
Using parts of the body such as the head and hands to perform gestures is a natural way of interacting with 3D environments, as humans are used to performing these actions when explaining operations to others and dealing with the physical world. By using techniques such as pointing and grabbing with objects in positions relative to the body, user interfaces can leverage proprioception, the user’s inbuilt knowledge as to what their body is doing. Mine et al. [MINE97a] demonstrated that designing user interfaces to take advantage of a human’s proprioceptive capabilities produced improved results. Using an input device such as a mouse introduces extra levels of abstraction for the direct manipulation metaphor (as discussed by Johnson et al. [JOHN89]), and so using the head and hands allows more intuitive controls for view point specification and object manipulation.
Trying to leverage existing 2D input devices for use in a naturally 3D environment is the wrong approach to this problem, and designing proper 3D user interfaces that directly map the user to the problem (as well as taking advantage of existing interactive 3D research) will yield improved results. The use of 3D input devices has been demonstrated to improve design and modelling performance compared to 2D desktop systems that force the user to break down 3D problems into separate and unnatural 2D operations [CLAR76] [SACH91] [BUTT92] [LIAN93].
In VR environments the user is able to use combinations of physical movement and virtual flying operations to move about. In contrast, in AR environments the user is required to always move with their physical body, otherwise registration between the physical and artificial worlds will be broken. Using direct manipulation techniques, interacting with objects that are too large or too far away is not possible. Desktop-based CAD systems rely on 2D inputs but can perform 3D operations through the use of a concept named working planes. By projecting a 2D input device cursor onto a working plane, its full 3D coordinates can be calculated unambiguously. By extending the concept of working planes to augmented reality, both the creation of geometry and interaction with objects at a distance can be achieved. These working planes can be created using the physical presence of the body or made relative to other objects in the environment. Accurate estimation of the depth of objects at large distances has been shown to be difficult for humans [CUTT95], and AR working planes provide accurate specification of depth. This functionality is achieved at the cost of a slightly increased number of interaction steps and a reduction in available degrees of freedom.
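The projection of a cursor onto a working plane can be illustrated as a ray-plane intersection. The following is only a minimal sketch under stated assumptions, not the implementation used in this dissertation: it assumes a local Cartesian frame measured in metres, a cursor that has already been converted into a ray leaving the user's eye, and a plane given by a point and a normal. The function and parameter names are hypothetical.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def project_cursor_onto_plane(
    eye: Vec3,
    direction: Vec3,
    plane_point: Vec3,
    plane_normal: Vec3,
    eps: float = 1e-9,
) -> Optional[Vec3]:
    """Intersect the ray eye + t * direction (t >= 0) with a working plane.

    Hypothetical helper: returns the 3D point where the cursor ray meets
    the plane, or None when the ray is parallel to the plane or the plane
    lies behind the viewer.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # ray runs parallel to the plane, no unique point
    offset = (
        plane_point[0] - eye[0],
        plane_point[1] - eye[1],
        plane_point[2] - eye[2],
    )
    t = dot(offset, plane_normal) / denom
    if t < 0.0:
        return None  # intersection is behind the viewer
    return (
        eye[0] + t * direction[0],
        eye[1] + t * direction[1],
        eye[2] + t * direction[2],
    )


# Assumed example: eye 1.7 m above the origin, vertical plane 50 m along x.
hit = project_cursor_onto_plane(
    (0.0, 0.0, 1.7), (1.0, 0.2, 0.1), (50.0, 0.0, 0.0), (1.0, 0.0, 0.0)
)
```

The depth of the resulting point comes entirely from the plane, not from the user's perception of distance, which is why the 2D cursor resolves to an unambiguous 3D coordinate.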
By combining AR working planes with various primitive 3D objects (such as planes and cubes) and traditional constructive solid geometry techniques (such as carving and joining objects), powerful modelling operations can be realised, which I have termed construction at a distance. These operations give users the capability to capture 3D models of existing outdoor structures (supplementing existing surveying techniques), create new models for preview that do not currently exist, and perform editing operations to see what effect changes have on the environment. By taking advantage of a fully tracked AR system outdoors, and leveraging the presence of the user’s body, interactive modelling can be supported in an intuitive fashion, streamlining the process for many types of real world applications.
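The carving and joining operations mentioned above can be illustrated with a small point-membership sketch. This is an assumption-laden illustration only: it represents each solid as a function that reports whether a point lies inside it, which is a simplification chosen for brevity and is not claimed to be the representation used by the dissertation's software. All names are hypothetical.

```python
from typing import Callable, Tuple

Vec3 = Tuple[float, float, float]
Solid = Callable[[Vec3], bool]  # returns True when the point is inside


def box(lo: Vec3, hi: Vec3) -> Solid:
    """Axis-aligned box between two opposite corners."""
    return lambda p: all(l <= c <= h for l, c, h in zip(lo, p, hi))


def half_space(point: Vec3, normal: Vec3) -> Solid:
    """Everything on or behind an infinite plane (opposite to its normal)."""
    return lambda p: sum((c - q) * n for c, q, n in zip(p, point, normal)) <= 0.0


def join(a: Solid, b: Solid) -> Solid:
    """Union: inside either solid."""
    return lambda p: a(p) or b(p)


def intersect(a: Solid, b: Solid) -> Solid:
    """Intersection: inside both solids."""
    return lambda p: a(p) and b(p)


def carve(a: Solid, b: Solid) -> Solid:
    """Difference: inside a but with b cut away."""
    return lambda p: a(p) and not b(p)


# Assumed example: a 10 x 8 x 6 m building with a doorway carved out,
# then joined with a smaller annex and trimmed by a plane at 5 m height.
building = carve(box((0, 0, 0), (10, 8, 6)), box((4, -1, 0), (6, 1, 2)))
site = join(building, box((10, 0, 0), (14, 4, 3)))
trimmed = intersect(site, half_space((0, 0, 5), (0, 0, 1)))
```

Composing solids as functions keeps the sketch short; a practical system would more likely operate on polygonal boundaries so the result can be rendered directly.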
This dissertation makes a number of research contributions to the current state of the art in augmented reality and user interfaces. Some of the initial contributions in this dissertation also require a number of supporting hardware and software artefacts to be designed and developed, each with their own separate contributions. The full list of contributions is:
· The analysis of current techniques for distance estimation and action at a distance, and the formulation of a technique named augmented reality working planes. This new technique can create objects accurately at large distances through the use of line of sight techniques and the projection of 2D cursors against planes. This technique is usable in any kind of virtual environment and is not limited to augmented reality. [PIEK03c]
· The design and implementation of a series of techniques I term construction at a distance, which allow users to capture the 3D geometry of existing outdoor structures, as well as create 3D geometry for non-existent structures. This technique is based on AR and uses the physical presence of the user to control the modelling. Objects can be modelled that are at scales much larger than the user, and out of arm’s reach. [PIEK03c]
· The iterative design and development of an augmented reality user interface for pointing and command entry, allowing a user wearing gloves to navigate through and select menu options using finger presses, without requiring high fidelity tracking that is unavailable outdoors. This user interface can operate without tracking, but when tracking is available it allows interaction at a distance with 3D environments. [PIEK03d]
· The development of a vision-based hand tracking system using custom designed pinch gloves and existing fiducial marker tracking software that can work reliably under outdoor wearable conditions. [PIEK02f]
· The development of applications that allow users to model buildings with the techniques described in this dissertation, in some cases modelling objects that existing surveying techniques cannot capture, or capturing them faster than those techniques allow. [PIEK01b] [PIEK03c]
· The design and implementation of a software architecture that is capable of supporting the research for this dissertation, culminating in the latest design of the Tinmith software. Current software architectures are still immature and do not support all the requirements for this dissertation, and so this architecture was designed to support these requirements and implements many novel solutions to various problems encountered during the design. [PIEK01c] [PIEK02a] [PIEK03f]
· The iterative design and development of hardware required to demonstrate the applications running outdoors. Wearable computing is currently in its infancy, so the devices required outdoors are not necessarily designed for this use; numerous problems were encountered, for which novel solutions were implemented. [PIEK02h]
After this introduction chapter, chapter two contains an overall background discussion introducing the concepts and technology that form a core of this dissertation. This overview provides a general discussion of related research, and each chapter then includes other more specific background information when relevant. Chapter three discusses the problems associated with a user interacting with objects at large distances – while most VR systems take advantage of a human’s ability to work well within arm’s reach, outdoor AR work tends to be performed away from the body where depth perception attenuates very rapidly. The novel idea of using the concept of CAD working planes for AR is introduced, along with various techniques that can be employed to perform interaction at a wide range of scales and distances beyond the reach of the user. Chapter four takes the concepts previously developed in the dissertation as well as constructive solid geometry and develops a new series of techniques named construction at a distance that allow users to perform modelling of outdoor objects using a mobile AR system. These techniques are then demonstrated using a series of examples of outdoor objects to show their usefulness. Chapter five explains how the previously developed techniques are interfaced to the user through pointing with the hands, command entry using the fingers, and the display of data to the HMD. The user interface is required to develop applications that are useable tools - an important part of this dissertation is the ability to test out the techniques and improve them iteratively. Chapter six introduces the software architecture used to facilitate the development of virtual environment applications. The software architecture contains a number of novel features that simplify the programming of these applications, unifying all the components with a consistent design based on object-oriented programming, data flow, and a Unix file system-based object repository. 
By tightly integrating components normally implemented separately, such as the scene graph, user interface, and internal data handling, capabilities that are normally difficult to program can be handled with relative ease. To complete the research, chapter seven describes the hardware components that are an important part of the overall implementation since the software relies on these to execute. The research and development of the mobile backpack, user input glove, and vision tracking system are explained to show some of the important new innovations that have been developed in these areas for use outdoors. After concluding with a discussion of the numerous contributions made and future work, this dissertation contains an appendix with a history of my previous hardware and software implementations, as well as links to where further information can be found about the project.
2
"If you want to make an apple pie from scratch, you must first create the universe."
Carl Sagan
This chapter contains a summary of the state of the art in augmented reality research and related technologies that are relevant to this dissertation. Some of this information has been developed during the period of my research and is included so comparisons can be made. First, the description and definition of augmented reality from chapter one is extended, followed by a discussion of how it fits into the spectrum of virtual environments. The chapter then goes on to discuss various indoor and outdoor AR applications that have been developed, demonstrating the current state of the art. Following this is a discussion of the two techniques for performing real-time AR overlay, as well as a summary of the numerous types of tracking technologies employed. After these technologies have been explained, a history of human computer interaction techniques for desktop and virtual reality systems is covered. Current techniques used for the capture of models in the physical world are then discussed, followed by a section summarising commercially available CAD software and solid modelling techniques. Finally, the problems of working outdoors with wearable computers are described, including how they can be used for mobile augmented reality.
When Sutherland proposed the concept of the Ultimate Display [SUTH65], his goal was to generate artificial stimulus that would give the user the impression that the experience is real. Instead of immersing the user into an artificial reality, a second approach is to augment the user’s senses with extra information, letting them experience both artificial and real stimulus simultaneously. In his excellent survey paper of the field, Azuma defines augmented reality systems as those that contain the following three characteristics [AZUM97a]:
· Combines real and virtual
· Interactive in real-time
· Registered in 3D
This definition does not limit augmented reality to the use of head mounted displays (allowing for monitors, projectors, and shutter glasses), but excludes non-interactive media such as movies and television shows. This dissertation focuses on mobile outdoor augmented reality, and therefore this chapter will focus only on research related to head mounted displays.
Figure 2‑1 Example of Milgram and Kishino’s reality-virtuality continuum (Adapted from [MILG94])
With the availability of real-time computer-generated 3D graphics, computers can render synthetic environments on a display device that can give the user the impression they are immersed within a virtual world. This technology is referred to as virtual reality (VR) and is designed to simulate with a computer the physical world humans normally can see. The opposite of VR is the real physical world typically experienced by a human, although it may be slightly attenuated because it is being viewed via a head mounted display or video camera. Augmented reality is therefore made up of a combination of virtual and real environments, although the exact make-up of this may vary significantly. Milgram and Kishino used these properties to define a reality-virtuality continuum [MILG94], and this can be used to perform comparisons between various forms of mixed reality by placement onto a spectrum. At one end of the continuum is the physical world, the other end is fully synthetic virtual environments, and AR is located somewhere in between since it is a combination of the two. Figure 2‑1 is adapted from Milgram and Kishino’s continuum, with example pictures at different locations on the reality-virtuality spectrum but showing the view from the same location. The first image in Figure 2‑1 shows a view of the physical world seen through a head mounted display, with no virtual information at all. The next image is augmented reality, where artificial objects (such as the table) are added to the physical world. The third image is augmented virtuality, where physical world objects (such as a live display of the user’s view of the world) are added into a fully immersive virtual environment. The final image depicts a completely synthetic environment, with no information from the physical world being presented. Every type of 3D environment can be placed somewhere along this spectrum, which makes it easy to compare and contrast the properties of different environments.
To overlay 3D models on to the user’s view, a mobile AR system requires a HMD to be combined with a device that can measure the position and orientation of the user’s head. As the user moves through the physical world the display is updated by the computer in real-time. The accuracy of the virtual objects registered to the physical world influences the realism of the fusion that the user experiences. A major focus of current AR research has been achieving good registration, as discussed extensively in survey papers by Azuma [AZUM97a] and Azuma et al. [AZUM01]. There are a number of known problems that cause poor registration, such as tracker inaccuracies, HMD misalignment, and delays in the various stages of rendering from the trackers to the display.
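As a rough illustration of how delay alone degrades registration, the following minimal sketch estimates the on-screen displacement of a virtual object while the head is turning. It is not taken from any of the cited systems: the function name, the uniform pixels-per-degree mapping, and the example numbers are all assumptions chosen only to show the order of magnitude involved.

```python
def registration_error_pixels(head_rate_deg_s, latency_s, fov_deg, width_px):
    """Approximate horizontal misregistration caused by end-to-end latency.

    Assumes (hypothetically) that the display maps angle to pixels uniformly,
    which is only a reasonable approximation for narrow fields of view.
    """
    # The overlay is drawn for where the head was latency_s seconds ago,
    # so it trails the physical world by this many degrees.
    angular_error_deg = head_rate_deg_s * latency_s
    pixels_per_degree = width_px / fov_deg
    return angular_error_deg * pixels_per_degree


if __name__ == "__main__":
    # Hypothetical values: 50 deg/s head turn, 100 ms latency,
    # 40 degree horizontal field of view, 800 pixel wide display.
    print(registration_error_pixels(50.0, 0.1, 40.0, 800))
```

Even with these modest assumed values the overlay trails the physical world by a large fraction of the display width, which suggests why delay receives so much attention alongside tracker accuracy and HMD alignment.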
While registration is important for producing AR applications that are realistic (giving the user a sense of presence and hence being more immersive and easier to use), it is not the only important issue in AR research. Other questions, such as how users interface with these systems and what kinds of tasks the systems can perform, are also important and make the registration research useable for building real world applications.
During the evolution of technologies such as virtual reality and augmented reality, there have been a number of applications developed that demonstrate the use of this technology. In the field of augmented reality, this research work initially began indoors where hardware is able to be large and consume considerable electrical power without imposing too many restrictions on its use. As hardware has become smaller in size and more powerful, researchers are demonstrating more complex systems and are starting to move outdoors. This section discusses various applications that have been developed for both indoor and outdoor environments, approximately arranged in chronological order where possible.
For indoor augmented reality, there are a number of applications that have been developed in areas as diverse as information display, maintenance, construction, and medicine. These applications are used to provide extra situational awareness information to users to assist with their tasks. By projecting data onto the vision of a user, information is shown in situ in the environment and the user can better understand the relationship the data has with the physical world. The first working AR demonstration was performed using a HMD designed by Sutherland [SUTH68] and is shown in Figure 2‑2. This HMD is transparent, in that the user can see the physical world as well as computer-generated imagery from small CRT displays overlaid using a half silvered mirror. So while the goal of the Ultimate Display concept was to completely immerse the user’s senses into a virtual environment, Sutherland actually invented the addition of information (augmented reality) with the development of this display. Sutherland’s demonstration projected a simple wire frame cube with line drawn characters representing compass directions on each wall. Other see through HMDs were developed for use in military applications, with examples such as the Super Cockpit project by Furness [FURN86]. The use of HMDs was designed to improve on existing heads up displays (HUD) in military aircraft, providing information wherever the user is looking instead of just projected onto the front of the glass windshield. Similar technology is used to implement displays for virtual reality, except these are opaque and do not use the physical world to provide extra detail.
Figure 2‑2 The first head mounted display, developed by Ivan Sutherland in 1968 (Reprinted and reproduced with permission by Sun Microsystems, Inc)
Figure 2‑3 External and AR immersive views of a laser printer maintenance application (Images courtesy of Steven Feiner – Columbia University)
The KARMA system was developed by Feiner et al. as a test bed for the development of applications that can assist with 3D maintenance tasks [FEIN93a]. Instead of simply generating registered 3D graphics from a database to display information, KARMA uses automatic knowledge-based generation of output depending on a series of rules and constraints that are defined for the task. Since the output is not generated in advance, the system can customise the output to the current conditions and requirements of the user. One example demonstrated by Feiner et al. was a laser printer maintenance application (shown in Figure 2‑3) where the user is presented with detailed 3D instructions showing how to replace toner and paper cartridges.
Figure 2‑4 Virtual information windows overlaid onto the physical world (Image courtesy of Steven Feiner – Columbia University)
The Windows on the World work by Feiner et al. demonstrated the overlay of windows with 2D information onto an AR display [FEIN93b]. While traditional AR systems render 3D information, this system is based on 2D information in an X Windows server. Windows of information can be created in the X server and then attached to the display, the user’s surround, or the physical world. As the user moves about the 3D environment, the system recalculates the position of the windows on the HMD. Since the system is based on X Windows, any standard X application can be used and information always appears facing the user with no perspective warping. Figure 2‑4 shows an example of 2D information windows attached to different parts of the environment.
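The three attachment modes can be pictured with a simplified sketch that only considers horizontal position and head yaw. This is not the implementation of Feiner et al., which operates on an X server; the dictionary layout, function names, field of view, and display width below are hypothetical and exist only to show which inputs each mode depends on.

```python
import math


def bearing_to_screen_x(bearing_deg, head_yaw_deg, fov_deg, width_px):
    """Map a compass-style bearing to a horizontal pixel, or None if off screen."""
    # Signed angle between the view direction and the bearing, in [-180, 180).
    offset = (bearing_deg - head_yaw_deg + 180.0) % 360.0 - 180.0
    if abs(offset) > fov_deg / 2.0:
        return None
    return width_px / 2.0 + offset * (width_px / fov_deg)


def window_screen_x(window, user_xy, head_yaw_deg, fov_deg=40.0, width_px=800):
    """Return where a 2D window appears on the HMD for each attachment mode."""
    mode = window["attach"]
    if mode == "display":
        # Fixed to the display: ignores both head orientation and user position.
        return window["screen_x"]
    if mode == "surround":
        # Fixed on a sphere around the user: depends on head orientation only.
        return bearing_to_screen_x(window["bearing_deg"], head_yaw_deg,
                                   fov_deg, width_px)
    if mode == "world":
        # Fixed to a physical location: depends on position and orientation.
        dx = window["world_xy"][0] - user_xy[0]
        dy = window["world_xy"][1] - user_xy[1]
        bearing = math.degrees(math.atan2(dx, dy))  # 0 degrees is the +y axis
        return bearing_to_screen_x(bearing, head_yaw_deg, fov_deg, width_px)
    raise ValueError("unknown attachment mode: " + str(mode))


if __name__ == "__main__":
    note = {"attach": "world", "world_xy": (0.0, 10.0)}
    print(window_screen_x(note, (0.0, 0.0), 0.0))   # straight ahead
    print(window_screen_x(note, (10.0, 0.0), 0.0))  # user walked away, off screen
```

In all three modes only the window position is recalculated; the window contents are still drawn as an unwarped 2D rectangle, which matches the behaviour described above.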
One of the first commercially tested applications for augmented reality was developed by the Boeing company to assist with the construction of aircraft [CURT98]. One task performed by workers is the layout of wiring bundles on looms for embedding into the aircraft under construction. These wiring looms are complicated and so workers must constantly refer to paper diagrams to ensure the wires are placed correctly. Curtis et al. describe the testing of a prototype AR system that overlays the diagrams over the wiring board so that workers do not have to take their eyes away from the task. Although it was never fully deployed in the factory, this research is a good demonstration of how AR technology can be used to assist workers with complicated real world tasks.
Figure 2‑5 Worker using an AR system to assist with wire looming in aircraft assembly (Image courtesy of David Mizell – Boeing Company)
Figure 2‑6 AR with overlaid ultrasound data guiding doctors during needle biopsies (Image courtesy of Andrei State – University of North Carolina, Chapel Hill)
Using AR to assist doctors with medical imaging is an area that shows much promise in the near future. A current problem with X-ray and ultrasound images is that they are two dimensional, and it is difficult to place this information spatially within the physical world. By overlaying this information onto the patient using AR, the doctor can immediately see how the imaging data relates to the physical world and use it more effectively. State et al. have been performing research into the overlay of ultrasound images onto the body to assist with breast biopsies [STAT96]. During the biopsy, a needle is inserted into areas of the body that the doctor needs to sample and analyse. Normally, the doctor will take many samples and hope that one of them reaches the correct location, damaging areas of tissue in the process. Using AR, the ultrasound overlay can be used to see where the biopsy needle is relative to the area of interest, and accurately guide it to the correct location. This results in less damage to the surrounding tissue and a greater chance of sampling the desired area. Figure 2‑6 shows an example of a needle being inserted into a simulated patient with overlaid ultrasound imagery.
Figure 2‑7 Studierstube AR environment, with hand-held tablets and widgets (Images courtesy of Gerhard Reitmayr – Vienna University of Technology)
Schmalstieg et al. [SCHM00] and Reitmayr and Schmalstieg [REIT01a] describe a collaborative augmented reality system named Studierstube, which can perform shared design tasks. In this environment, users can work together to perform tasks such as painting objects and direct manipulation of 3D objects, as shown in Figure 2‑7. To provide users with a wide range of possible operations, the user carries a Personal Interaction Panel (PIP) [SZAL97]. The PIP can be constructed using either a pressure sensitive tablet or a tracked tablet and pen combination, and the AR system then overlays interactive widgets on top of the tablet. Using the pen on the tablet, the user can control the widgets that are linked up to various controls affecting the environment.
Figure 2‑8 Marker held in the hand provides a tangible interface for viewing 3D objects (Images courtesy of Mark Billinghurst – University of Washington)
The ARToolKit was developed by Kato and Billinghurst to perform the overlay of 3D objects on top of paper fiducial markers, using only tracking data derived from captured video images [KATO99]. Using this toolkit, a number of applications have been developed that use tangible interfaces to directly interact with 3D objects using the hands. Billinghurst et al. [BILL99] use this toolkit to perform video conferencing, with the user able to easily adjust the display of the remote user, as shown in Figure 2‑8. Another application that uses this technology is Magic Book by Billinghurst et al. [BILL01]. Each page of the magic book contains markers that are used to overlay 3D objects with AR. By pressing a switch on the display the user can be teleported into the book and experience immersive VR. Magic Book integrates an AR interface (for viewing the book from a top down view with a tangible interface) with a VR interface (for immersively flying around the book’s 3D world).
Figure 2‑9 Actors captured as 3D models from multiple cameras overlaid onto a marker (Image courtesy of Adrian Cheok – National University of Singapore)
The 3D Live system by Prince et al. [PRIN02] captures 3D models of actors in real-time that can then be viewed using augmented reality. By arranging a series of cameras around the actor, Virtual Viewpoint software from Zaxel [ZAX03] captures the 3D geometry using a shape from silhouette algorithm, and then is able to render it from any specified angle. 3D Live renders this output onto ARToolKit markers, and live models of actors can be held in the hands and viewed using easy to use tangible interfaces, as shown in Figure 2‑9. Prince et al. explored a number of displays for the system, such as holding actors in the hands on a card, or placing down life sized actors on the ground with large markers.
While indoor examples are useful, the ultimate goal of AR research is to produce systems that can be used in any environment with no restrictions on the user. Working outdoors expands the range of operation and has a number of unique problems, discussed further in Section 2.9. Mobile outdoor AR pushes the limits of current technology to work towards achieving the goal of unrestricted AR environments.
The first demonstration of AR operating in an outdoor environment is the Touring Machine (see Figure 2‑10) by Feiner et al. from Columbia University [FEIN97]. The Touring Machine is based on a large backpack computer system with all the equipment necessary to support AR attached. The Touring Machine provides users with labels that float over buildings, indicating the location of various buildings and features at the Columbia campus. Interaction with the system is through the use of a GPS and head compass to control the view of the world, and by gazing at objects of interest longer than a set dwell time the system presents further information. Further interaction with the system is provided by a tablet computer with a web-based browser interface to provide extra information. The Touring Machine was then extended by Hollerer et al. for the placement of what they termed Situated Documentaries [HOLL99]. This system is able to show 3D building models overlaying the physical world, giving users the ability to see buildings that no longer exist on the Columbia University campus. Another feature is the ability to view video clips, 360 degree scene representations, and information situated in space at the site of various events that occurred in the past.
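Selection by gazing at an object for longer than a dwell time can be sketched as a small state machine. The class below is an illustrative assumption rather than the Touring Machine's actual code: the class name, the one second threshold, and the idea of passing in the currently gazed object and a timestamp on every frame are all hypothetical.

```python
class DwellSelector:
    """Selects an object once the user's gaze has rested on it long enough."""

    def __init__(self, dwell_time_s=1.0):
        self.dwell_time_s = dwell_time_s  # hypothetical threshold
        self._target = None               # object currently under the gaze
        self._since = None                # time the gaze arrived on it

    def update(self, gazed_object, now_s):
        """Call once per frame; returns the selected object or None."""
        if gazed_object != self._target:
            # Gaze moved to something else (or to nothing): restart the timer.
            self._target = gazed_object
            self._since = now_s
            return None
        if gazed_object is not None and now_s - self._since >= self.dwell_time_s:
            return gazed_object
        return None


if __name__ == "__main__":
    selector = DwellSelector(dwell_time_s=1.0)
    for t, label in [(0.0, "Library"), (0.5, "Library"), (1.0, "Library")]:
        print(t, selector.update(label, t))
```

In this sketch the selection keeps being reported for as long as the gaze remains on the object, so a caller that wants a single event would need to remember what it has already handled.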
Figure 2‑10 Touring Machine system overlays AR information in outdoor environments (Images courtesy of Steven Feiner – Columbia University)
Figure 2‑11 BARS system used to reduce the detail of AR overlays presented to the user (Images courtesy of Simon Julier – Naval Research Laboratory)
The Naval Research Laboratory is investigating outdoor AR with a system referred to as the Battlefield Augmented Reality System (BARS), a descendant of the previously described Touring Machine. Julier et al. describe the BARS system [JULI00] and how it is planned for use by soldiers in combat environments. In these environments, there are large quantities of information available (such as goals, waypoints, and enemy locations) but presenting all of this to the soldier could become overwhelming and confusing. Through the use of information filters, Julier et al. demonstrate examples (see Figure 2‑11) where only information of relevance to the user at the time is shown. This filtering is performed based on what the user’s current goals are, and their current position and orientation in the physical world. The BARS system has also been extended to perform some simple outdoor modelling work [BAIL01]. For the user interface, a gyroscopic mouse is used to manipulate a 2D cursor and interact with standard 2D desktop widgets.
Figure 2‑12 Context Compass provides navigational instructions via AR overlays (Images courtesy of Riku Suomela – Nokia Research Lab)
Nokia Research has been investigating outdoor wearable AR systems, but with 2D overlaid information instead of 3D registered graphics. The Context Compass by Suomela and Lehikoinen [SUOM00] is designed to give users information about their current context and how to navigate in the environment. 2D cues are rendered onto the display (as depicted in Figure 2‑12). Other applications such as a top down perspective map view have also been implemented by Lehikoinen and Suomela [LEHI02]. To interact with the system, a glove-based input technique named N-fingers was developed by Lehikoinen and Roykkee [LEHI01]. The N-fingers technique provides up to four buttons in a diamond layout that can be used to scroll through lists with selection, act like a set of arrow keys, or directly map to a maximum of four commands.
Apart from the previously mentioned systems, there are a small number of other mobile AR systems that have also been developed. Billinghurst et al. performed studies on the use of wearable computers for mobile collaboration tasks [BILL98] [BILL99]. Yang et al. developed an AR tourist assistant with a multimodal interface using speech and gesture inputs [YANG99]. Pouwelse et al. developed a miniaturised prototype low power terminal for AR [POUW99]. Behringer et al. developed a mobile AR system using COTS components for navigation and control experiments [BEHR00]. The TOWNWEAR system by Satoh et al. demonstrated high precision AR registration through the use of a fibre optic gyroscope [SATO01]. The DWARF software architecture was designed by Bauer et al. for use in writing mobile outdoor AR applications [BAUE01]. Cheok has developed some outdoor games using AR and the 3D Live system discussed previously [CHEO02a] [CHEO02c]. Cheok has also developed accelerometer-based input devices such as a tilt pad, a wand, and a gesture pad for use with wearable computers [CHEO02b]. Fisher presents an authoring toolkit for mixed reality experiences and developed a prototype outdoor AR system [FISH02]. Ribo et al. developed a hybrid inertial and vision-based tracker for use in real-time 3D visualisation with outdoor AR [RIBO02]. Roberts et al. are developing a prototype for visualisation of subsurface data using hand held, tripod, and backpack mounted outdoor AR systems [ROBE02]. The use of AR for visualisation of archaeological sites was performed by Vlahakis et al. [VLAH02].
As previously mentioned, this dissertation focuses on the use of HMDs to merge computer-generated images with the physical world to perform augmented reality. This section describes the HMDs and other supporting technology necessary to display AR information, implemented using either optical or video combination techniques. These techniques are described and then compared so the applications of each can be better understood.
Rolland et al. [ROLL94], Drascic and Milgram [DRAS96], and Rolland and Fuchs [ROLL00] describe in detail the technological and perceptual issues involved with both optical and video see through displays. These authors identified a number of important factors that need to be considered when selecting which technology to use for an application, and these are as follows:
· System latency – the amount of time taken from when physical motion occurs to when the final image reflecting this is displayed.
· Real-scene resolution and distortion – the resolution that the physical world is presented to the user, and what changes are introduced by the optics.
· Field of view – the angular portion of the user’s view that is taken up by the virtual display, and whether peripheral vision is available to the user.
· Viewpoint matching – the view of the physical world may not match the projection of the 3D overlay, and it is desirable to minimise these differences for the user.
· Engineering and cost factors – certain designs require complex optics and so tradeoffs must be made between features and the resources required to construct the design.
· Perceived depth of overlapping objects – when virtual objects are drawn in front of a physical world object, it is desirable that the virtual objects perform correct occlusion.
· Perceived depth of non-overlapping objects – by using depth cues such as familiar sizes, stereopsis, perspective, texture, and motion parallax, users can gauge the depth to distant objects.
· Qualitative aspects – the virtual and physical worlds must be both rendered and these images must preserve their shape, colour, brightness, contrast, and level of detail to be useful to the user.
· Depth of field – when physical and virtual images are passed through optics they will be focused at a particular distance. Keeping the image sharp at the required working distance is important for the user.
· User acceptance and safety – if the display attenuates the physical world it could be unsafe to use in some environments since the user’s vision system is not being supplied with adequate information to navigate.
· Adaptation – some displays have limitations that can be adjusted to by humans over time, and can be used as an alternative to improving the technology if there are no harmful side effects.
· Peripheral field of view – the area outside the field of view of the virtual display is not overlaid with information, but is still useful to the user when navigating in the physical world.
The design of an optically combined see through HMD system may be represented by the schematic diagram in Figure 2‑13, although in practice the design is much more complex due to the internal optics required to merge and focus the images. A small internal LCD screen or CRT display in the HMD generates an image, and an optical combiner (such as a half silvered mirror or a prism) reflects part of this light into the user’s eyes while allowing light from the physical world to pass through to the eyes as well.
Figure 2‑13 Schematic of optical overlay-based augmented reality
In general, most current AR systems based on optically combined displays share the following properties:
· Optical combiners are used to merge physical and virtual world images.
· The computer generates an overlay image that uses black whenever it wants the pixels to be see-through, and so the images are simple and can be rendered quickly.
· The physical world light is seen by the user directly and has high resolution with an infinite refresh rate and no delay, while the generated image is pixelated and delayed.
· The physical world remains at its dynamic focal length, while the overlay image is fixed at a specific focal length.
· Accurate registration of the image with the physical world is difficult because the computer cannot monitor the final AR image to correct any misalignments.
· Ghosting effects are caused by the optical combiner since both virtual and physical images are visible simultaneously (with reduced luminance), and obscuring the physical world with a generated image cannot typically be performed.
· The field of view of the display is limited by the internal optics, and distortions increase at larger values.
· The front of the display must be unoccluded so that the physical world can be seen through the HMD.
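Several of these properties follow from the fact that an optical combiner can only add light. The sketch below models one pixel under that assumption; the 50/50 transmission and reflection split and the function name are hypothetical, and real combiners and display optics are considerably more complicated.

```python
def optical_combine(real_rgb, overlay_rgb, transmit=0.5, reflect=0.5):
    """Light reaching the eye for one pixel through an idealised combiner.

    Colour components are in the range 0..1. The transmit and reflect
    fractions are assumed values for a half silvered mirror.
    """
    return tuple(min(1.0, transmit * real + reflect * virtual)
                 for real, virtual in zip(real_rgb, overlay_rgb))


if __name__ == "__main__":
    wall = (0.8, 0.6, 0.4)
    # A black overlay pixel adds nothing, so the wall shows through (dimmed).
    print(optical_combine(wall, (0.0, 0.0, 0.0)))
    # A drawn pixel is added on top of the wall rather than replacing it,
    # which is the ghosting effect described above.
    print(optical_combine(wall, (0.0, 1.0, 0.0)))
```

Because every term in the sum is non-negative, no overlay value can make the result darker than the attenuated physical world, which is why occlusion of physical objects is not typically possible with this type of display.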
An example image from an optically combined AR system is shown in Figure 2‑14, with a 3D virtual table overlaying the physical world. Some of the problems with the technology are shown by the ghosted image and reflections, caused by sunlight entering the interface between the HMD and the lens of the camera capturing the photo.
Recent technology has improved on some of the problems discussed in this section. Pryor et al. developed the virtual retinal display, using lasers to project images through an optical combiner onto the user’s retina [PRYO98]. These displays produce images with fewer ghosting effects and transmission losses than an LCD or CRT-based design. Kiyokawa et al. produced a research display that can block out the physical world selectively using an LCD mask inside the HMD to perform proper occlusion [KIYO00].
Figure 2‑14 Optically combined AR captured with a camera from inside the HMD
Figure 2‑15 Schematic of video overlay-based augmented reality
Video combined see through HMD systems use video cameras to capture the physical world, with virtual objects overlaid in hardware. This technique was pioneered by Bajura et al. in 1992 [BAJU92]. An example implementation is depicted in the schematic in Figure 2‑15, with a video camera capturing images of the physical world that are combined with graphics generated by a computer. The display for this technique is opaque and therefore the user can only see the physical world through the video camera input. The combination process can be performed using two different techniques: using chroma-keying as a stencil to draw the video where AR pixels have not been drawn, or using the computer to draw the AR pixels on top of the video. The final image is then displayed to the user directly from an LCD or CRT display through appropriate optics.
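The two combination techniques can be contrasted with a minimal sketch operating on a single row of pixels. The key colour, the function names, and the use of Python lists in place of real frame buffers are assumptions for illustration only; an actual system would perform these steps in video hardware or on the graphics card.

```python
KEY_COLOUR = (0, 0, 255)  # hypothetical key colour reserved for "no AR pixel here"


def chroma_key_combine(video, graphics, key=KEY_COLOUR):
    """Use the key colour as a stencil: show video wherever graphics equals the key."""
    return [v if g == key else g for v, g in zip(video, graphics)]


def overdraw_combine(video, ar_pixels):
    """Draw the video frame as a background, then draw AR pixels on top of it."""
    frame = list(video)
    for index, colour in ar_pixels.items():
        frame[index] = colour
    return frame


if __name__ == "__main__":
    video_row = [(10, 10, 10), (20, 20, 20), (30, 30, 30)]
    red = (255, 0, 0)
    print(chroma_key_combine(video_row, [KEY_COLOUR, red, KEY_COLOUR]))
    print(overdraw_combine(video_row, {1: red}))
```

Both calls produce the same row here. In either case an AR pixel fully replaces the video pixel beneath it, in contrast to the additive behaviour of an optical combiner.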
In general, most current AR systems based on video combined displays share the following properties:
· The display is opaque and prevents light entering from the physical world, making it also possible to use for virtual reality tasks with no modifications required.
· Some form of image processing is used to merge physical and virtual world images. Real-time image transformations may be necessary to adjust for resolution differences, spherical lens distortions, and differences in camera and display position.
· The capture of the physical world is limited to the resolution of the camera, and the presentation of both physical and virtual information is limited to the resolution of the display. The final image viewed by the user is pixelated and delayed, with consistency between physical and virtual depending on whether the camera and display have similar resolutions.
· The entire image projected to the user is at a constant focal length, which while reducing some depth cues also makes the image easier to view because the focus does not vary between physical and virtual objects.
· More accurate registration may be achieved since the computer has access to both incoming and outgoing images. The computer may adjust the overlay to improve registration by using a closed feedback loop with image recognition.
· The image overlay has no ghosting effects since the incoming video signal can be modified to completely occlude the physical world if desired.
· By using video cameras in other spectrums (such as infra-red or ultraviolet) the user can perceive parts of the physical world that are not normally visible to the human eye.
· Demonstrating to external viewers on separate monitors or recording to tape is simple since the video signal sent to the HMD may be passed through to capture exactly what the user sees.
An example image from a video combined AR system is shown in Figure 2‑16, with a 3D virtual table overlaying the physical world. Slight blurring of the video stream is caused by the camera resolution differing from that used by the display.
Figure 2‑16 Example video overlay AR image, captured directly from software
Table 2‑1 lists a summary of the information presented concerning optical and video combination techniques, comparing their features and limitations. Neither technology is the perfect solution for AR tasks, so the appropriate technique should be selected based on the requirements of the application.

Table 2‑1 Comparison between optical and video combined AR systems
To render graphics that are aligned with the physical world, devices that track in three dimensions the position and orientation of the HMD (as well as other parts of the body) are required. A tracker is a device that can measure the position and/or orientation of a sensor relative to a source. The tracking data is then passed to 3D rendering systems with the goal being to produce results that are realistic and match the physical world as accurately as possible. There have been a number of survey papers in the area: Welch and Foxlin discuss the state of the art in tracking [WELC02], Holloway and Lastra summarise the technology [HOLL93], and Azuma covers it as part of a general AR survey [AZUM97a]. This section covers the most popular technologies for tracking, with a particular focus on the types that are useful when working outdoors. This section is by no means a complete discussion of tracking and does not present new tracking results. I simply use currently available devices in this dissertation to provide tracking for my applications.
There are a number of different tracking technologies used, varying by the number of dimensions measured and the physical properties used. Holloway and Lastra discuss the different characteristics of various tracking systems [HOLL93], and these are summarised as follows:
· Accuracy – the ability of a tracker to measure its physical state compared to the actual values. Static errors are visible when the object is not moving, while dynamic errors vary depending on the motion of the object at the time.
· Resolution – a measure of the smallest units that the tracker can measure.
· Delay – the time period between reading inputs, processing the sensor data, and then passing this information to the computer. Large delays cause virtual objects to lag behind the correct location.
· Update rate – the number of data values per second the tracker can produce. Faster update rates allow smoother animation in virtual environments.
· Infrastructure – trackers operate relative to a reference source. This reference may need to be measured relative to other objects to provide world coordinates useful to applications.
· Operating range – trackers can only operate within a limited volume defined by the infrastructure. Signals emitted by sources attenuate rapidly over distance, which limits the range of operation.
· Interference – various tracking technologies use emissions of signals that can be interfered with by other sources. External interference can be difficult to cancel out and affects the accuracy of results.
· Cost – trackers range in price depending on complexity and the accuracy provided.
In this section, various aspects of the above factors will be discussed, along with the following extra factors:
· Degrees of freedom – trackers measure a number of degrees of freedom, being able to produce orientation, position, or some combination of these as results.
· Coordinate type – some trackers measure velocity or acceleration that requires integration to produce relative-position values. When integrating values that are not exact, errors accumulate over time and cause drift. Absolute values do not require integration and are stable over time.
Working outdoors introduces a number of problems that are not noticed when dealing with indoor tracking systems. The use of tracking equipment in an indoor environment is simplified due to known limitations of the working environment. In contrast, when working outdoors the environment is virtually unlimited in size and setting up infrastructure may be difficult. The requirement that the technology be mobile further restricts the choices of tracking devices available. Azuma discusses in detail many problems to do with performing tracking outdoors [AZUM97b], and some extra factors to consider for comparison are:
· Portability – the device must be able to be worn by a person for use in a mobile environment, so weight and size are important.
· Electrical power consumption – the tracking system must be able to run using batteries and not have excessive power requirements.
One of the main points stressed by Welch and Foxlin [WELC02] and Azuma et al. [AZUM98] is that to obtain the best quality tracking and to minimise any problems, hybrid tracking should be used. Since no tracking technology is perfect, hybrid trackers combine two or more different types of technologies with varying limitations to produce a better overall tracker. The last part of this section discusses some hybrid systems in detail.
Mechanical trackers rely on a physical connection between source and object, producing absolute position and orientation values directly.
The first tracker developed for interactive 3D computer graphics was the mechanical “Sword of Damocles” by Sutherland along with his new HMD [SUTH68]. This tracker is a mechanical arm with angle sensors at each joint. By knowing the length of each arm segment and the measured angle at each joint, the position and orientation of the tip of the arm can be calculated relative to the base. Measuring angles at a mechanical joint is very accurate with only very slight delays. Due to the mechanical nature of the device, the motion of the user is restricted to the length of the arm and the various joints that connect it together. The arm is quite heavy for a human and so while counterweights help to make it lighter, the inertia of the arm requires the user to perform movements slowly and carefully to avoid being dragged about and injured.
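The calculation described above can be illustrated with a minimal Python sketch. It assumes a simplified planar arm (the real device operated in three dimensions), with each joint angle measured relative to the previous segment; the function name `arm_tip_pose` and the example values are hypothetical and are not taken from [SUTH68].

```python
import math

def arm_tip_pose(segment_lengths, joint_angles):
    """Return (x, y, heading) of the arm tip relative to the base.

    Assumes a planar arm. Each joint angle (radians) is measured
    relative to the direction of the previous segment.
    """
    x = y = heading = 0.0
    for length, angle in zip(segment_lengths, joint_angles):
        heading += angle                    # orientation accumulates along the chain
        x += length * math.cos(heading)     # advance along the current segment
        y += length * math.sin(heading)
    return x, y, heading

# Hypothetical two-segment arm: first joint bent up 90 degrees,
# second joint bent back down 90 degrees.
print(arm_tip_pose([1.0, 1.0], [math.pi / 2, -math.pi / 2]))
```

Because the result depends only on fixed segment lengths and directly measured angles, no integration is involved, which is consistent with the tracker producing absolute values.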
Sutherland also demonstrated a wand like device to use for 3D input when using the HMD. This device uses a number of wires connected to pulleys and sensors that measure location information. While much more lightweight, this device requires that the wires not be touched by other objects in the room as well as the user, and so the user must take this into account when moving about the room, restricting their motion.
Accelerometers measure linear forces applied to the sensor and are source-less, producing relative-position values through double integration. Accelerometers can measure absolute pitch and roll when measuring acceleration caused by gravity.
Accelerometers are small and simple devices that measure acceleration forces applied to an object along a single axis, discussed in detail by Foxlin et al. [FOXL98a]. Modern accelerometers are implemented using micro-electro-mechanical systems (MEMS) technology that has no moving parts and can be embedded into small IC sized components. Accelerometers vibrate small elements internally and measure applied forces by sensing changes in these vibrations. To acquire velocity the measured acceleration must be integrated, and then integrated again if relative position is required. The advantages of accelerometers are that they require no source or infrastructure, support very fast update rates, are cheap to buy, have low power requirements, and are simple to add to a wearable computer. The main disadvantage of this technology is that the process of integrating the measurements suffers from error accumulation and so within a short time period the values drift and become inaccurate. Due to the rapid accumulation of errors, accelerometers are not normally used standalone for position tracking. Accelerometers are commercially available from companies such as Crossbow [XBOW02].
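The error accumulation described above can be sketched numerically. The following Python example assumes a stationary single-axis sensor with a small constant bias; the bias value, sample period, and function name are illustrative assumptions and do not describe any particular device.

```python
def integrate_position(accel_samples, dt):
    """Double-integrate acceleration samples (m/s^2) into relative position (m)."""
    velocity = 0.0
    position = 0.0
    for a in accel_samples:
        velocity += a * dt          # first integration: acceleration -> velocity
        position += velocity * dt   # second integration: velocity -> position
    return position

bias = 0.01   # m/s^2, assumed constant sensor bias while the sensor is at rest
dt = 0.01     # seconds between samples (assumed 100 Hz update rate)

for seconds in (1, 10, 60):
    n = round(seconds / dt)
    drift = integrate_position([bias] * n, dt)
    print(f"after {seconds:>2} s the reported position has drifted {drift:.3f} m")
```

The drift grows roughly as half the bias multiplied by the square of the elapsed time, so even a small uncorrected bias makes the position estimate unusable after a short period.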
When three accelerometers are mounted orthogonally to each other, a tilt sensor is formed that can measure the pitch and roll angles toward the gravity vector. Since gravity is a constant downward acceleration of approximately 9.8 m/s² on Earth, orientation can be calculated by measuring the components of the gravity force that is being applied to each accelerometer. The tilt sensor output is vulnerable to errors caused by velocity and direction changes since these applied forces are indistinguishable from gravity.
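As a rough illustration of the tilt calculation, the following Python sketch recovers pitch and roll from the three measured gravity components. The axis convention (x forward, y right, z down, with a level stationary sensor reading approximately (0, 0, 9.8)) and the function name are assumptions made for this example; real devices may use different conventions.

```python
import math

def tilt_from_gravity(gx, gy, gz):
    """Return (pitch, roll) in radians from gravity components in sensor axes.

    Assumed convention: x forward, y right, z down, so a level sensor at
    rest reads roughly (0, 0, 9.8).
    """
    pitch = math.atan2(-gx, math.hypot(gy, gz))  # nose-up is positive
    roll = math.atan2(gy, gz)                    # right-side-down is positive
    return pitch, roll

g = 9.8
theta = math.radians(30.0)
# Sensor pitched nose-up by 30 degrees while stationary.
pitch, roll = tilt_from_gravity(-g * math.sin(theta), 0.0, g * math.cos(theta))
print(math.degrees(pitch), math.degrees(roll))
```

Any additional acceleration from user motion is added to the same three components, which is why the output is corrupted when the sensor is not at rest.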
Gyroscopes measure rotational forces applied to the sensor and are source-less, producing relative-orientation values through integration.
The first gyroscopes were mechanical devices constructed of a wheel spinning on an axis. Gyroscopes are induced to maintain spinning on a particular axis once set in motion, according to the laws of conservation of angular momentum. When an external force is applied to a gyroscope, the reaction is a motion perpendicular to the axis of rotation and can be measured. Gyroscopes are commonly used for direction measurements in submarines and ships, being accurate over long periods of time but typically very large and not portable.
Gyroscopes may also be constructed using MEMS technology and contain an internal vibrating resonator shaped like a tuning fork, discussed in detail by Foxlin et al. [FOXL98a]. When the vibrating resonator experiences rotational forces along the appropriate axis, Coriolis forces will cause the tines of the fork to vibrate in a perpendicular direction. These perpendicular forces are proportional to the angular velocity and are measured to produce output. Since each gyroscope measures only one axis of rotation, three sensors are mounted orthogonally to measure all degrees of freedom. To obtain orientation the angular velocity from the sensor must be integrated once, but the result drifts over time and so these sensors are not normally used for standalone orientation tracking. These devices are similar to accelerometers in that they require no source or infrastructure, support very fast update rates, are cheap to buy, have low power requirements, and are simple to add to a wearable computer. Another common name for these devices is a rate sensor, and companies such as Crossbow [XBOW02] manufacture gyroscopes for a wide range of non-tracking related commercial uses.
The most accurate gyroscope technology is based on lasers and the change in phase of photons that occurs between two intersecting laser beams and a detector. A Ring Laser Gyroscope (RLG) uses mirrors to bounce a laser beam around back to a detector, while a Fibre Optic Gyroscope (FOG) uses a coil of fibre optic cable wrapped around a rotation axis back to a detector. When a change in rotation occurs, the photons will take slightly more or less time than under no motion, and by measuring the phase difference and integrating it the total motion and hence relative orientation can be calculated. The TISS-5-40 FOG described by Sawada et al. [SAWA01] exhibited results with attitude better than ±0.1 degrees and heading drift less than 1 degree per hour. In comparison, MEMS-based gyroscopes can drift by a degree or more within minutes, or even less.
Ultrasonic tracking measures the time of flight of ultrasonic chirps from transmitter sources to microphones, producing absolute position and orientation values directly.
While the mechanical tracker developed by Sutherland for use with his HMD was accurate and fast [SUTH68], the device was heavy and difficult to work with, and its mechanical linkages limited the motion of the user. Sutherland also developed an acoustic tracker which was not tethered, and worked by sending out pulses of ultrasonic sound from the head, and measuring the time of flight to reach numerous sensors dispersed across the ceiling. While the tracker worked and demonstrated the possibilities of tracking without cumbersome mechanical linkages, problems were encountered with the ultrasonic pulses interfering with each other.
Ultrasonic tracking is limited by the properties of the pulses sent for time of flight detection. Noise in the environment caused by the jingling of keys will cause the tracker to fail, and environmental effects such as wind reduce the quality of the results [WELC02]. Since the pulses travel at the speed of sound, delays in tracking increase as the sensor moves away from the transmitter. By relying on the speed of sound, environmental effects such as temperature, humidity, and air currents can have an impact on the accuracy of the measurements.
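The dependence on the speed of sound can be illustrated with a short Python sketch. It uses the common linear approximation for the speed of sound in dry air, c ≈ 331.3 + 0.606 T m/s with T in degrees Celsius; the function names and the example distances and temperatures are assumptions for illustration only.

```python
def speed_of_sound(temp_celsius):
    """Approximate speed of sound in dry air (m/s), linear approximation."""
    return 331.3 + 0.606 * temp_celsius

def distance_from_time_of_flight(tof_seconds, assumed_temp_celsius):
    """Convert a measured time of flight into a range estimate (m)."""
    return speed_of_sound(assumed_temp_celsius) * tof_seconds

true_distance = 3.0                                # metres, transmitter to microphone
tof = true_distance / speed_of_sound(30.0)         # actual air temperature is 30 C
estimate = distance_from_time_of_flight(tof, 20.0) # tracker assumes 20 C

print(f"time of flight: {tof * 1000:.2f} ms")      # also a lower bound on tracker delay
print(f"range error from wrong temperature: {(true_distance - estimate) * 100:.1f} cm")
```

The time of flight itself is a lower bound on the delay of each measurement, and it grows in proportion to the distance between transmitter and sensor.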
While time of flight can produce accurate position values in a room using triangulation, calculating orientation is more difficult because multiple transmitters and receivers must be adequately spaced apart to get an accurate result. Furthermore, the orientation updates are quite slow compared to other technology. Foxlin et al. [FOXL98b] mention that in the Constellation tracking system, the orientation values are combined with accelerometers and gyroscopes using a Kalman filter to increase the update rate and smooth the output.
Passive magnetic tracking measures the forces generated by the Earth’s magnetic field as a source, producing absolute heading values directly.
When a freely suspended ferromagnetic object is exposed to a magnetic field, it will rotate so that its magnetic domains are in opposing alignment to the applied field. The Earth generates a magnetic field and a ferromagnetic object can be used to find the directions of the north and south poles of this field anywhere on the planet. By attaching a measuring scale to a freely suspended ferromagnetic object (to form a compass), the orientation of a tracking device can be determined relative to magnetic north. A compass is mechanical and due to the inertia of the magnet and attached parts, there is a settling time where the user of the device must wait to make an accurate reading. Electronic trackers have been constructed that use mechanical parts, but a more efficient method is to use solid state components.
A magnetometer is a solid state electronic device that can detect magnetic fields. As a magnetic field passes through a coil of wire, this produces an induced current that is proportional to the strength of the field and the incident angle to the coil. By aligning three magnetometers orthogonally, the direction to magnetic north can be calculated. These devices do not have inertia like the mechanical equivalent and so produce faster and more accurate results. Solid state magnetometers are available from a number of companies such as Crossbow [XBOW02], who manufacture them for a number of non-tracking related commercial uses. Since the Earth’s magnetic field exists everywhere on the surface, no infrastructure needs to be set up and there is no range limitation. Although the magnetic field produced by the Earth is quite strong, at the surface it is relatively weak when compared to the field produced by a local magnetic source. When other ferromagnetic objects are brought close to a magnetometer, the Earth’s magnetic field is distorted locally and this affects the measurements of the sensor.
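The heading calculation can be sketched in Python as follows. The example assumes the sensor is held level so that only the two horizontal field components are needed, with x pointing forward and y pointing right; the function name, axis convention, and field values are assumptions for illustration and do not describe a specific product.

```python
import math

def magnetic_heading_degrees(mx, my):
    """Heading clockwise from magnetic north, in degrees within [0, 360).

    Assumes a level sensor with x forward and y to the right, where
    (mx, my) are the horizontal components of the measured field.
    """
    return math.degrees(math.atan2(-my, mx)) % 360.0

# Undistorted field of 20 units while the sensor faces east:
# magnetic north lies to the left of the sensor, along negative y.
print(magnetic_heading_degrees(0.0, -20.0))

# The same pose with a nearby metal object adding 10 units along the x axis.
print(magnetic_heading_degrees(10.0, -20.0))
```

The second call shows how a local field that is comparable in strength to the Earth's field shifts the computed heading, even though the sensor has not moved.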
Active magnetic tracking measures the magnetic fields generated by a local transmitting source, producing absolute position and orientation values directly.
Rather than just relying on weak magnetic fields generated by the Earth to perform direction sensing, a tracking device may generate its own powerful local magnetic field. The tracking sensor measures this local magnetic field to determine position and orientation measurements between the sensor and source. The first tracker to implement this technique was designed in the 1970s to be used inside aircraft cockpits by Polhemus [POLH02], and is discussed by Raab et al. [RAAB79]. This technology uses three magnetometers arranged orthogonally as with a passive magnetic tracker, and a transmitter to generate a field for it to detect. The transmitter is constructed with three magnetic coils also arranged orthogonally, and each coil is pulsed with an AC signal to generate a magnetic field that is then detected by the sensor coils. By pulsing each transmitter coil separately and measuring the response in the sensor coils, both position and orientation can be reconstructed with good accuracy at close distances. A limitation of the AC signals used by Polhemus trackers is that changing eddy currents form in nearby metal objects and this causes distortions in the measured results. With this limitation in mind, one of the original engineers left the company to create a new company to fix these problems. The new company, Ascension Technologies [ASCE02], developed trackers that are similar but use a DC pulse that generates stable eddy currents in nearby metal. To improve accuracy further, measurements from the sensors when the transmitter is not active are used to measure background magnetic fields. With these two improvements, magnetic tracking is less susceptible to interference by metal but it is still a problem. Both Polhemus and Ascension trackers work in environments where the tracking range is reasonably small and cables must be used to connect both the transmitter and multiple sensors to the controller unit.
GPS tracking measures the time of flight of signals from satellites in space to the user, producing absolute position values directly.
The Global Positioning System (GPS) was developed by the US military to provide reliable and real-time navigation information not previously available using existing methods such as dead reckoning and celestial navigation. The system is based on a constellation of 24 satellites that orbit the Earth, each transmitting specially encoded radio waves that contain highly accurate timing information. A receiver unit (with knowledge of the current position of the GPS satellites) can calculate its position by measuring the time of flight of these signals from space. The GPS satellites broadcast on two frequencies, L1 at 1575.42 MHz and L2 at 1227.60 MHz, and these signals can penetrate atmospheric effects such as cloud, rain, smoke, smog, dust, and air pollution [MCEL98]. These frequencies are blocked by physical objects such as buildings and tree canopies, and so GPS cannot be reliably used amongst these objects. Encoded onto these signals are P-code for military users, C/A code for civilian users, and navigation messages containing satellite information. The L1 channel is intended only for civilian use (containing C/A and navigation messages) while L2 is designed for military use along with L1 and contains the more precise P-code information. Previously, the L1 channel was intentionally degraded by the US military using Selective Availability (SA) but this has now been deactivated. Another navigation system that is operated by the Russian Federation is the Global Navigation Satellite System (GLONASS), which operates in the same way as GPS but with different frequencies, satellite geometry, and signal encoding. Some GPS receivers also come with the ability to use GLONASS satellites to improve accuracy, although GLONASS cannot currently be used standalone because only a small number of the full constellation of satellites are in orbit.
The positioning quality resulting from a GPS receiver depends on the accuracy of the processing performed in the receiver as well as other external effects. The quality of the receiver is important because the time of flight measurements and position calculations rely on having an accurate internal clock and high resolution floating point unit. Using three satellites and an accurate atomic clock makes it possible to find the position of the user, but if the clock is not accurate (as is the case with commercial grade units) then an extra satellite is required to resolve the ambiguity. Further errors are introduced by particles as well as magnetic and electrical effects in the atmosphere that affect L1 and L2 bands. GPS uses time of flight and so cannot derive orientation, except when multiple sensors are spaced sufficiently far apart. This technique is not normally used except on large scale construction projects such as bridge building where adequate distances between sensors can be obtained.
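The importance of the receiver clock mentioned above can be shown with a small worked calculation in Python. Because ranges are derived from time of flight at the speed of light, any receiver clock error translates directly into a range error; the function name and the sample clock errors are illustrative assumptions.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def range_error_from_clock_error(clock_error_seconds):
    """Range error (m) caused by a given receiver clock error (s)."""
    return SPEED_OF_LIGHT * clock_error_seconds

for label, error in (("1 millisecond", 1e-3),
                     ("1 microsecond", 1e-6),
                     ("1 nanosecond", 1e-9)):
    print(f"{label:>13}: {range_error_from_clock_error(error):.3f} m")
```

Since the same clock error affects the range to every satellite equally, it can be treated as a fourth unknown alongside the three position coordinates, which is why an additional satellite is needed when the clock is not accurate.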
Consumer grade GPS receivers come in a variety of form factors, from tiny embedded OEM chip designs to hand-held units with information displays such as the Garmin GPS 12XL [GARM99]. The accuracy of consumer grade GPS units varies depending on environmental conditions, but with the use of differential GPS (DGPS) radio signals, accuracies of 5-10 metres at one update per second can be achieved. DGPS signals are generated by a base station that measures the difference between its known surveyed position and reported GPS position, transmitting corrections for each satellite to GPS receivers located within a few hundred kilometres.
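The idea behind per-satellite DGPS corrections can be sketched as follows. This Python example is a simplified illustration, assuming that the base station and the mobile receiver observe the same range error for each satellite; the function names, dictionary layout, and all coordinates are hypothetical and do not represent any real correction message format.

```python
import math

def range_corrections(base_position, satellite_positions, base_measured_ranges):
    """Per-satellite corrections computed at a base station with a surveyed position."""
    corrections = {}
    for sat_id, sat_position in satellite_positions.items():
        true_range = math.dist(base_position, sat_position)
        corrections[sat_id] = true_range - base_measured_ranges[sat_id]
    return corrections

def apply_corrections(rover_measured_ranges, corrections):
    """Apply broadcast corrections to the ranges measured by a mobile receiver."""
    return {sat_id: measured + corrections[sat_id]
            for sat_id, measured in rover_measured_ranges.items()
            if sat_id in corrections}

# Hypothetical scenario: one satellite directly overhead whose signal is
# delayed by the equivalent of 12 metres at both receivers.
satellites = {"sat1": (0.0, 0.0, 20_000_000.0)}
base = (0.0, 0.0, 0.0)
corrections = range_corrections(base, satellites, {"sat1": 20_000_012.0})
print(apply_corrections({"sat1": 20_000_512.0}, corrections))
```

The assumption of a shared error becomes weaker as the distance between base station and receiver grows, which is consistent with the limited coverage area of each base station.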
With the development of GPS, surveyors have started to use it for their work but require greater accuracy than is possible with DGPS enabled consumer grade receivers. By improving the quality of the internal receiver, the accuracy of GPS calculations can be improved. For example, the Trimble Ag132 [TRIM02] uses signal processing algorithms to filter out GPS signals reflected from nearby objects (referred to as multi path correction) and to compensate for some errors introduced by the atmosphere. By using clocks and processors more accurate than consumer grade units, as well as DGPS corrections, the accuracy of position measurements is improved to around 50 cm at a rate of 10 updates per second.
Even with these improved GPS units, Allison et al. discuss the use of Real-time Kinematic (RTK) techniques to further improve the accuracy of GPS tracking [ALLI94]. RTK GPS units can achieve accuracy in the range of 1-2 centimetres at 30 updates per second, obtained by counting the number of wavelengths between the satellite and the receiver, and using extra L2 frequency information. As the GPS antenna is moved around, the receiver closely monitors the phase changes in the signal and uses the count of the wavelengths to provide 1-2 centimetre accuracy. Although encrypted, the L2 signal still contains some timing information that can be extracted and RTK correlates this with the normal L1 signal. RTK also uses correction techniques similar to those discussed previously and requires a secondary surveyed DGPS source located within a few hundred metres.
A number of different coordinate systems are used for representing locations on Earth, and are discussed extensively in the Geodetic Datum of Australia Technical Manual [ICSM00]. Polar coordinates have been traditionally used by navigators and surveyors and are the most logical for working with a planet that approximates a slightly flattened spheroid. Latitude is measured in degrees north/south from the equator, and longitude is measured in degrees east/west from the prime meridian. Distance from the centre of the Earth is not required in many cases because the user may be assumed to be located on the spheroid surface. A number of different spheroid parameter models (datums) have been developed for representing coordinates on an imperfectly shaped Earth, such as AGD66, AGD84, and WGS84. Polar coordinates are output natively by GPS systems and I refer to these coordinates as LLH (latitude-longitude-height) values.
An alternative to using polar coordinates is the use of Cartesian coordinates relative to the centre of the Earth, referred to as ECEF values. These coordinates are represented in metres as XYZ values, with Z passing through the geographic north pole, X through the equator and prime meridian, and Y through the equator and 90 degrees east. ECEF coordinates are commonly used to transform between coordinate datums but are unintuitive for use by humans. While LLH values only require latitude and longitude to specify position (where height can optionally be assumed to be on the spheroid), ECEF values require all 3 components to be useful at all. For working in small local areas, surveyors have developed special coordinate systems and projections using metres as units with an approximate flat Earth model, referred to as the Universal Transverse Mercator (UTM) grid. The Earth is divided up into a number of zones, each with separate origins so that coordinates within can be expressed easily. UTM coordinates are specified as northings and eastings values in metres from a local origin, and are simple to handle using standard trigonometry. ECEF and UTM coordinates are both useful when dealing with 3D renderers, which are designed to operate using Cartesian and not polar coordinates. There are a number of standard algorithms for accurately converting between LLH, ECEF, and UTM coordinates, although they are beyond the scope of this dissertation and described elsewhere [ICSM00].
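As an illustration of one of these conversions, the following Python sketch implements the standard closed-form transformation from LLH to ECEF coordinates. It assumes the WGS84 spheroid parameters and heights measured above the spheroid; the function name is hypothetical, and the full set of conversions (including UTM) is described in [ICSM00].

```python
import math

WGS84_A = 6378137.0                    # semi-major axis in metres
WGS84_F = 1.0 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)   # first eccentricity squared

def llh_to_ecef(lat_degrees, lon_degrees, height_metres):
    """Convert latitude, longitude, and spheroidal height to ECEF XYZ in metres."""
    lat = math.radians(lat_degrees)
    lon = math.radians(lon_degrees)
    # Radius of curvature in the prime vertical at this latitude.
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + height_metres) * math.cos(lat) * math.cos(lon)
    y = (n + height_metres) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - WGS84_E2) + height_metres) * math.sin(lat)
    return x, y, z

# A point on the equator at the prime meridian lies on the X axis.
print(llh_to_ecef(0.0, 0.0, 0.0))
```

The axis directions produced by this function match the ECEF definition given above: X through the equator and prime meridian, Y through the equator at 90 degrees east, and Z through the north pole.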
Optical tracking can be implemented using active sources or passive features, producing either relative or absolute values depending on the technology in use. Trackers that use a known source such as a transmitter or fiducial marker produce absolute values directly. Source-less trackers require integration to produce relative orientation and position values.
When navigating outdoors, humans primarily rely on their eyesight to capture images of the local environment and match these against previous memories to work out their current position. By using information known about the world as well as cues such as occlusion, relative size, relative density, height, environmental effects, motion perspective, convergence, accommodation, and binocular disparities [CUTT95], people estimate the size and relative position to visible objects. There are current investigations to perform similar operations with video cameras and computer-based vision systems to estimate both position and orientation information. This research involves a number of disciplines such as artificial intelligence and computer vision, and is currently not mature enough to produce a tracker capable of working under arbitrary motion [AZUM01].
For environments where features are not known in advance, a number of techniques have been developed that attempt to perform full position and orientation tracking based on arbitrary objects in the scene. Some examples of these markerless trackers are by Simon et al. [SIMO00], Simon and Berger [SIMO02], Genc et al. [GENC02], and Chia et al. [CHIA02]. These researchers have developed algorithms for finding parts of the scene to track and calculate the motion of the user. There are many problems that include selecting appropriate features to track, dealing with the unexpected movement of non-static objects, blurred images caused by high speed motion, varying lighting conditions, distinguishing between translation and rotation, and drift of the integrated results over time. Bishop describes a tracking system implemented in a custom integrated circuit that measures optical flow at orders of magnitude faster than using video cameras, reducing drift and blur problems [BISH84]. Behringer attempted to match silhouettes on the horizon against those generated from local terrain models for orientation sensing, although this approach suffers from the horizon being obscured by objects such as buildings and trees [BEHR98].
The most successful optical tracking methods so far all involve the use of markers that are placed in the environment for image recognition methods to detect. State et al. discuss problems similar to those mentioned earlier, and that fiducial markers or landmarks simplify the computations needed to analyse the image [STAT96]. Since landmark tracking still has problems with robustness, State et al. use magnetic tracking as the primary source and then correct it with tracking from live video streams. The magnetic tracker simplifies the vision tracking mechanism by providing an estimate for the location of the markers. The markers can then be used to extract out more precise position and orientation values that are pixel accurate. An alternative technique used in other systems is to drag the overlay image in 2D to align with the markers. This is much simpler but produces less accurate results because rotation and depth effects are not handled.
The ARToolKit software library has been developed by Kato and Billinghurst for use in tangible mixed reality applications [KATO99]. Using a single fiducial marker printed on paper, position and orientation information can be extracted from individual video frames. The markers are black squares with a white inner area that can contain a non-symmetrical pattern. By analysing the edges to measure perspective change, the rotation of the pattern, and the distance from the camera, the algorithm can extract tracking information in real-time. This algorithm is simple in that it requires only a single marker for tracking, and generates results that appear to overlay the marker correctly but may be inaccurate from other view points. This software is not intended to be used to implement accurate 6DOF tracking systems and other techniques should be used instead. ARToolKit is widely used in many research applications and is available under the GNU General Public License, while other marker tracking libraries remain proprietary.
Based on the discussions of various tracking technologies in this section, it can be seen that there is no one technology that is able to perform accurate tracking standalone. As discussed earlier, Welch and Foxlin [WELC02] and Azuma et al. [AZUM98] agree that for the implementation of the ultimate tracker, hybrids will be required since each technology has limitations that cannot otherwise be overcome. Tracking outdoors is also even more difficult than indoors because the variables that affect the tracking cannot be controlled as easily. By combining two or more sensor technologies using an appropriate filter, these limitations may potentially be overcome to produce a tracker that is accurate over a wider range of conditions. Many commercial tracking systems implement some kind of hybrid tracking and these will be discussed in this subsection.
For position sensing, strap-down inertial navigation systems (INS) are constructed using a combination of three orthogonal gyroscopes and three orthogonal accelerometers [FOXL98a]. Accelerometers can measure 3D translations in sensor coordinates but the orientation of the sensor in world coordinates must be known to calculate the translation in world coordinates. By integrating the gyroscope values the orientation can be found and used to compensate for gravity in the accelerometers and then to calculate the INS position in world coordinates. The INS is not a complete tracking system however, since all the sensors are only relative devices and are affected by drift. Errors in any of the sensed values will affect other calculations and so introduce additional accumulating errors over time.
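The chain of calculations in a strap-down INS can be sketched in Python. This is a simplified two-dimensional illustration restricted to a vertical plane (one gyroscope and two accelerometers, world axes x horizontal and z up), not a full 3D implementation; the function name, state layout, and bias value are assumptions made for this example.

```python
import math

G = 9.81  # m/s^2, assumed magnitude of gravity

def ins_step(state, gyro_rate, f_body, dt):
    """Advance a planar strap-down INS by one time step.

    state     -- (pitch, vx, vz, x, z) in world coordinates
    gyro_rate -- measured angular velocity in rad/s
    f_body    -- (fx, fz) accelerometer readings in sensor coordinates
    """
    pitch, vx, vz, x, z = state
    pitch += gyro_rate * dt                    # integrate gyroscope to orientation
    c, s = math.cos(pitch), math.sin(pitch)
    fx, fz = f_body
    ax = c * fx - s * fz                       # rotate readings into world coordinates
    az = s * fx + c * fz - G                   # then compensate for gravity
    vx += ax * dt                              # integrate to velocity
    vz += az * dt
    x += vx * dt                               # integrate again to position
    z += vz * dt
    return (pitch, vx, vz, x, z)

# A level, stationary sensor whose gyroscope has a small assumed bias.
state = (0.0, 0.0, 0.0, 0.0, 0.0)
for _ in range(1000):                          # 10 seconds at 100 Hz
    state = ins_step(state, gyro_rate=0.01, f_body=(0.0, G), dt=0.01)
print(f"position error after 10 s: x = {state[3]:.2f} m, z = {state[4]:.2f} m")
```

Although the sensor never moves in this example, the orientation error from the gyroscope causes gravity to be compensated incorrectly, and the resulting false acceleration is then integrated twice. This illustrates how an error in one sensed value propagates into the other calculations.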
For measuring orientation, a simple example is the TCM2 digital compass from Precision Navigation [PNAV02], shown in Figure 2‑17. The TCM2 uses magnetometers to measure the angle of the device to magnetic north, and also tilt sensors based on accelerometers to measure the angle relative to the gravity vector downwards. Each sensor is not capable of measuring the values the other can, and so by combining them full 3DOF orientation values can be measured. These values are absolute and do not drift over time, making them ideal for AR applications, although they are relatively low speed, include a lot of noise in the output, and the magnetic heading can be easily distorted by large metal objects.
Figure 2‑17 Precision Navigation TCM2 and InterSense InertiaCube2 tracking devices
Again for measuring orientation, Foxlin and Durlach [FOXL94] followed by Foxlin et al. [FOXL98a] developed the InertiaCube, which combines the tilt sensing properties of accelerometers with rate sensing gyroscopes (see the previously discussed INS) to produce full 3-axis orientation values. This device requires no infrastructure (apart from gravity to supply an absolute reference for the tilt sensor) and is very small and portable. A filter is used to combine the tilt sensor values with the drifting gyroscopes to produce absolute orientation values with the desirable properties from both sources. Since the heading information is not corrected with an absolute sensor, this value will drift over time while the others remain stable.
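The general idea of combining a drifting rate sensor with an absolute tilt reference can be illustrated with a simple complementary filter. This Python sketch is a stand-in chosen for brevity and is not the filter used in the InertiaCube; the function name, blend factor, and bias value are assumptions for illustration.

```python
def complementary_filter(angle, gyro_rate, tilt_angle, dt, alpha=0.98):
    """Blend an integrated gyroscope rate with an absolute tilt measurement.

    angle      -- previous fused angle estimate (rad)
    gyro_rate  -- measured angular velocity (rad/s), fast but drifting
    tilt_angle -- angle from the accelerometer tilt sensor (rad), absolute
    alpha      -- assumed weighting given to the gyroscope prediction
    """
    predicted = angle + gyro_rate * dt
    return alpha * predicted + (1.0 - alpha) * tilt_angle

# Stationary sensor: true angle is zero, but the gyroscope has an assumed bias.
gyro_bias = 0.02   # rad/s
dt = 0.01          # seconds
fused = 0.0
integrated_only = 0.0
for _ in range(6000):                        # 60 seconds at 100 Hz
    fused = complementary_filter(fused, gyro_bias, 0.0, dt)
    integrated_only += gyro_bias * dt

print(f"gyroscope only: {integrated_only:.3f} rad, fused: {fused:.4f} rad")
```

The fused estimate settles at a small bounded error while the purely integrated value keeps growing. An axis with no absolute reference, such as heading in this device, has nothing to pull the estimate back and so continues to drift.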
To compensate for problems associated with the previous two hybrid orientation trackers, Foxlin combined the InertiaCube sensor with a 3-axis magnetometer (similar to the TCM2). The specially designed Kalman filter reads in all the sensor values and combines them to correct for errors that occur under different conditions. This device is sold as the InterSense IS-300 and InertiaCube2 [ISEN03], and is shown in Figure 2‑17. Both of these devices produce absolute values that do not drift over time since correction is performed on all orientation axes. The relative values then provide smooth and high speed output under fast motion. Azuma also produced a similar tracker that combined a three axis gyroscope with a TCM2 magnetic compass [AZUM99]. A limitation of these hybrids is that if the sensor is exposed to local magnetic distortions for a considerable time, the Kalman filter will incorrectly calibrate the gyroscopes and introduce errors into the output.
Table 2‑2 lists a summary of the information presented concerning 3D tracking technologies, comparing their features and limitations. As described previously, there is no one perfect tracking technology and so combinations of these are typically required.

Table 2‑2 Comparison between various types of 3D tracking technology
In 1963, Sutherland produced the Sketchpad system [SUTH63] that had a revolutionary interface for its time. Instead of using the computer for batch processing or using a keyboard to enter interactive commands, he developed a new user interface that allowed users to draw graphical objects and manipulate them using a light pen on a display. The interactions occur in real-time, and use what is now referred to as direct manipulation to let the user control the system.
A second important piece of early work was the Put-That-There system by Bolt, where a user may interact with data in a room using simple pointing gestures and speech commands [BOLT80]. The goal of this system was to allow a user sitting in a chair to interact with information projected onto the walls. This system used Polhemus magnetic trackers for the pointing, a speech recognition computer with a pre-programmed grammar for interpreting commands, and a data wall showing information of interest to the user.
The term direct manipulation was first defined by Shneiderman, involving the continuous representation of application data as objects that can be manipulated with an input device [SHNE83]. Other important properties are that operations are rapid, incremental, and reversible with immediate feedback, and usable by both novice and expert users. The most powerful direct manipulation interfaces use analogies that a user can easily relate to, such as turning a dial or dragging a box, and are more intuitive than command-based interfaces. Johnson et al. describe direct manipulation using another guideline: “Data objects should be selected and operated on by simulated physical contact rather than by abstract verbal reference” [JOHN89].
One of the first and most complete implementations of a direct manipulation-based user interface was the Xerox Alto project, later released commercially as the Xerox Star [JOHN89]. This system uses a keyboard and monitor like many others, but for the first time implemented what is referred to as the desktop metaphor. Rather than a user invoking tools to perform operations, the system abstracts these concepts to that of an office and working with documents. Instead of using a command line interface, the user controls the system by directly manipulating icons and dialog boxes presented on the display, using a mouse to simulate physical contact. The presentation of the display was also novel by using a high resolution graphical display, with windows for displaying documents and abstract graphical representations of data. The Xerox Star system inspired many other WIMP user interfaces such as the Macintosh and Windows, which are still used today. It is important to realise that not all systems with a mouse and windows are necessarily desktop-based. Only those systems that abstract away operations on data files to an office desktop metaphor meet this criterion [JOHN89]. For example, X Windows and TWM are only 2D windowing environments for running applications. Systems such as the Macintosh, Windows, KDE, and GNOME are all desktop environments. This dissertation uses the term 2D desktop more generally to refer to any computer system that is placed on a desk. These systems use a mouse-like input device, a keyboard for command entry, and a monitor for the display of information.
This section discusses the evolution of interaction techniques for virtual reality, starting from the development of the first HMDs and tracking devices to the current state of the art. The main focus of this section is on the user interfaces that have been developed, with the majority of the research systems described performing the modelling of 3D objects. Modelling applications are good examples for research because they require complicated user interfaces with efficient and easy to use techniques. Applications that can perform 3D modelling are also still an area of research, as discussed by Brooks [BROO97]. This chapter will only review the introduction of techniques to the VR area, since the previously reviewed AR systems borrowed their interfaces from similar VR systems.
After working on the Sketchpad system, Sutherland presented a new idea that he termed the Ultimate Display [SUTH65] in 1965, when computer graphics was still in its infancy. The goal was to create an interface that would provide data for all the senses of a human, fully immersing them into a simulated reality that does not exist. As previously mentioned, in 1968 Sutherland produced the first HMD [SUTH68], which was transparent and able to overlay the physical world with a simple wire frame cube and labels. To track the motion of the HMD a mechanical tracker was used, giving the user real-time 3D graphics based on their point of view. A 3D input wand with a wire filament tracker (described previously) was used to perform simple interactions with the environment. While the display and graphics were primitive by today’s standards, this is the first demonstrated example of technology for virtual and augmented reality.
Although humans are able to understand 2D views of 3D information, VR is based on the notion that it is more intuitive to hold a 3D object in the hands or to walk around and view an object from different angles. Trying to create and edit 3D information using only a 2D view is cumbersome, and a HMD allows intuitive user interfaces based on the motion of the body. A number of research papers (that will be discussed later in this subsection) such as [CLAR76], [SACH91], [BUTT92], and [LIAN93] all have the common theme that 3D direct manipulation interfaces are superior to 2D-based environments that impose limitations on operations that are naturally 3D.
The first interactive 3D editing application on a HMD was a surface editor by Clark [CLAR76] that made use of the HMD, wand, and mechanical trackers designed by Sutherland. This system removes the need for keyboard commands and numeric entry of data that was common in previous 3D systems. A user wearing a HMD can walk around to get different views of a spline surface, and then manipulate points interactively using a wand. This allows designers to freely explore changes to the surface using the direct manipulation metaphor, and focus on the task of designing a suitable surface. Clark concluded his paper with the comment that “3-D computer-aided surface design is best done in real-time with 3-D tools. To expect a designer of 3-D surfaces to work with 2-D input and viewing devices unnecessarily removes a valuable degree of freedom”.
The systems presented by Sutherland and Clark demonstrated important ideas but the technology was too cumbersome for regular use. With the availability of new technology, Fisher et al. implemented the Virtual Environment Display System [FISH86], using a stereo HMD with a wide field of view that matches the capabilities of the human vision system, and active magnetic tracking (discussed previously) that allows much more freedom of motion. To interact with the system, speech recognition for commands and a tracked pair of gloves for direct 3D manipulation are used. While the direct manipulation of objects is intuitive, the level of sophistication of the user interface was very primitive compared to the state of the art methodologies available for 2D interfaces, such as the Xerox Star [JOHN89].
Although Clark’s surface editor [CLAR76] was the first demonstration of modelling using a HMD and 3D input device, it was quite primitive due to its limited functionality compared with existing CAD systems. Later research was performed using 3D tracked input devices to improve usability but still used 2D desktop monitors as displays. These systems allow the user to directly manipulate 3D objects, but do not use the body to move around the environment. Later research then used immersive VR displays that supported the full use of the body to interact with the environment.
Sachs et al. presented a system named 3-Draw [SACH91], which allowed users to perform 3D modelling tasks using a pair of 6DOF tracked Polhemus sensors: one held in the hand as a stylus and the other attached to a tablet. The model being created is positioned relative to the tablet and is represented on a desktop monitor. By rotating and moving the tablet, various views of the object may be seen interactively. Using the tracked stylus, the user can sketch curves in 3D and then deform them into various shapes. Since both stylus and tablet are tracked the user can freely manipulate them to get the best view for comfortable modelling. Although allowing the interactive specification of views, the view cannot be moved beyond the range of the tracker, limiting the size of the modelling universe. Sachs et al. made the observation that “the simultaneous use of two sensors takes advantage of people’s innate ability - knowing precisely where their hands are relative to each other”. Sachs et al. demonstrated that by using 3D input, encouraging results were achieved when compared to existing 2D techniques used by CAD systems. Sachs et al. also state that their approach of focusing on the modelling task was more efficient than working at the non-intuitive control point level where direction and magnitude must be specified manually.
The JDCAD 3D modelling system also uses a desktop monitor as a display, and Liang and Green identified a number of key problems with the use of 2D input devices for 3D modelling, justifying the use of 3D input devices [LIAN93]. A designer using a 2D input device must break down 3D problems into unnatural 2D steps, therefore changing their thinking to suit the modelling tool. Some examples are creating vertices by using the cursor from two different view points, or rotating objects one axis at a time using widgets. Another insightful comment made from testing JDCAD was that users found it hard to control all six degrees of freedom (position and especially rotation) at the same time. Being able to constrain position and orientation separately is useful - while having only 2D inputs is limiting for a designer, full 6DOF controls can be hard to control accurately. Compared to the previously discussed 3-Draw system that is limited to the work area of the tablet, JDCAD implements techniques for flying and action at a distance. These techniques allow the user to perform modelling tasks at any location and obtain arbitrary views of objects. Pioneering techniques (described later) were also developed for selection and manipulation of objects out of arm’s reach, and the execution of commands using 3D menus.
Using techniques developed in previous work, Butterworth et al. developed a fully immersive HMD-based modelling system named 3DM [BUTT92]. Traditional keyboards and mice are not available when immersed in VR, and so alternative user interface techniques are required to interact with the system. 3DM was the first immersive system to support a user interface with 3D menus and tool palettes. The interface performs the selection of options using direct manipulation, similar to traditional desktop user interfaces. The authors state that the application was inspired by the ease of use of the interface for the MacDraw program, which uses similar menus and toolbars. 3DM is able to create simple geometric objects such as cylinders, boxes, cones, and triangle strips. During the process of creation, as well as during edit operations, the user may directly manipulate these objects at a number of levels: vertices, objects, and groups of objects in hierarchies. For objects that are too far away or at a size that is difficult to work with, the user’s scale and location in the world can be adjusted as desired by the user. The 3D menus and tool palettes pioneered in 3DM are concepts still used in many VR applications.
Figure 2‑18 CDS system with pull down menus and creation of vertices to extrude solids (Images courtesy of Doug Bowman – Virginia Polytechnic Institute)
The CDS system designed by Bowman [BOWM96] takes concepts from VR described previously, and extends these with extra features for the construction of new objects at a distance. Rather than just placing down objects within arm’s reach, CDS is capable of projecting points against the ground plane using a virtual laser beam originating from the user’s hand, as shown in Figure 2‑18. The points on the ground plane can then be connected together using lines and extruded upwards to form solid shapes.
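The projection and extrusion steps can be sketched as follows. This is an illustrative outline only and not the CDS implementation: it assumes a flat ground plane at a fixed height, a z-up coordinate system, and hypothetical function names.

```python
def ray_ground_intersection(origin, direction, ground_z=0.0):
    """Point where a ray from the hand hits the plane z = ground_z, or None."""
    ox, oy, oz = origin
    dx, dy, dz = direction
    if abs(dz) < 1e-12:
        return None                      # ray is parallel to the ground
    t = (ground_z - oz) / dz
    if t <= 0:
        return None                      # plane lies behind the hand
    return (ox + t * dx, oy + t * dy, ground_z)

def extrude(footprint, height):
    """Extrude a ground polygon upwards into a prism.

    Returns (vertices, side_faces), where each side face is a tuple of four
    indices into the vertex list (bottom ring first, then top ring).
    """
    n = len(footprint)
    bottom = [(x, y, z) for x, y, z in footprint]
    top = [(x, y, z + height) for x, y, z in footprint]
    sides = [(i, (i + 1) % n, (i + 1) % n + n, i + n) for i in range(n)]
    return bottom + top, sides
```

Each point marked with the virtual laser would be passed through `ray_ground_intersection`, and the resulting outline handed to `extrude` once the user closes the shape.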
Figure 2‑19 CHIMP system with hand-held widgets, object selection, and manipulation (Images courtesy of Mark Mine – University of North Carolina, Chapel Hill)
Mine et al. produced the CHIMP system [MINE97a], integrating many of the techniques from previous VR systems (such as laser pointing, 3D widgets, and menus) as well as the concept of proprioception, the intuitive knowledge the user has about objects placed on or near the body. Rather than working at a distance, CHIMP is designed to allow users to interact with the environment within arm’s reach, since humans are more adept at working within this range. To interact with objects at a distance, Mine et al. introduced the concept of scaled world grab, where the world is scaled so that the selected object appears at a workable size in the hand. The user can then easily adjust the object with widgets that are within arm’s reach, such as shown in Figure 2‑19. By placing items such as menus and tools near the body, the user can reach out and grab them using proprioception without having to see them directly with the eyes. Users can indicate commands that are similar to the physical world using intuitive gestures.
Figure 2‑20 Immersive and external views of the SmartScene 3D modelling environment (Images courtesy of Paul Mlyniec – Digital ArtForms Inc)
SmartScene from Multigen [MULT01] is a commercial 3D immersive modelling system and is shown in Figure 2‑20. It implements many of the techniques presented previously, combining them together to produce a powerful 3D modelling system that is capable of performing a wide range of tasks on both HMD and projector displays. It uses direct manipulation, tool palettes, menus, scaled world operations, and the creating and editing of geometry in real-time, controlled using two 6DOF tracked pinch gloves.
This subsection presents a detailed summary of interaction techniques that are currently used in VR modelling systems. The previous discussion only mentioned the development of the most notable systems over time, while this subsection introduces the many techniques contributed by various researchers. Techniques are discussed for interaction within arm’s reach, action at a distance, and command entry. Bowman and Hodges [BOWM97] and Poupyrev [POUP98] both provide surveys and comparisons of some of the 3D manipulation techniques discussed.
The most intuitive way to manipulate objects in a VR system is to use the concept of direct manipulation - reaching out and grabbing an object using a tracked prop with buttons or a set of gloves with finger press sensors. When an object is selected, it is slaved to the user’s hand and can be freely translated and rotated in all 6DOFs. The implementation of operations such as flying, scaling, and grabbing using affine transformations in a scene graph is discussed by Robinett and Holloway [ROBI92]. Based on existing direct manipulation work in 2D user interfaces, the next natural progression is to implement concepts such as menus, tool palettes, and icons, as implemented first by Butterworth et al. in 3DM [BUTT92].
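Slaving an object to the hand can be expressed with rigid transformations: the pose of the object relative to the hand is recorded at the moment of the grab and then reapplied to every new hand pose. The sketch below is a hedged illustration of that idea rather than the formulation given by Robinett and Holloway; poses are represented as hypothetical `(rotation, translation)` pairs and all helper names are assumptions.

```python
import math

def rot_z(angle):
    """3x3 rotation matrix about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def compose(a, b):
    """Pose that applies b first and then a: x -> Ra (Rb x + tb) + ta."""
    ra, ta = a
    rb, tb = b
    moved = mat_vec(ra, tb)
    return mat_mul(ra, rb), [moved[i] + ta[i] for i in range(3)]

def inverse(pose):
    """Inverse of a rigid pose: rotation R^T, translation -R^T t."""
    r, t = pose
    r_inv = [[r[j][i] for j in range(3)] for i in range(3)]
    return r_inv, [-x for x in mat_vec(r_inv, t)]

def begin_grab(hand_pose, object_pose):
    """Object pose expressed in the hand's frame at the moment of selection."""
    return compose(inverse(hand_pose), object_pose)

def update_grab(hand_pose, offset):
    """World pose of the grabbed object for the current hand pose."""
    return compose(hand_pose, offset)
```

Calling `begin_grab` once and `update_grab` every frame keeps the object fixed relative to the hand in all six degrees of freedom.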
Conner et al. introduced the concept of 3D widgets [CONN92], based on extensive previous work in the 2D desktop area. Conner et al. define a widget as being an encapsulation of geometry and behaviour, with different widgets implementing a range of geometry and behaviours. Widgets were first introduced in 2D user interface toolkits to assist with application development, designed to abstract away the user interface from the program performing the task. By extending this concept to 3D, Conner et al. propose the development of 3D user interface toolkits with similar goals in mind, but supporting more powerful interactions using the extra DOFs available. Since 3D environments contain a number of extra DOFs, the definition of 3D widgets is much more complex than in previous 2D environments.
With research by Mine et al. into working within arm’s reach and proprioception [MINE97a], the use of hand-held widgets was proposed as a way of efficiently adjusting controls in the environment. Instead of grabbing and rotating an object at a distance, the user grabs an object and it appears in their hands with various widget handles around it. Using the other hand, the user can then grab handles to perform operations such as scaling or rotation. This interaction is much more efficient because users are familiar with the concept of holding a physical object in one hand and manipulating it with the other hand. By holding 3D dialog boxes in the hands, users can manipulate widgets to enter values and perform operations that have no direct mapping with the physical world. Mine et al. also introduce the concept of storing interaction widgets relative to the body that the user intuitively knows where to find using proprioception. For example, to access a menu the user lifts their hand into the air and pulls down, causing the menu to be dragged down into view. To delete an object, the user grabs it and uses a throwing over the shoulder gesture.
While direct manipulation may seem intuitive, one problem is the lack of haptic feedback. Users are familiar with reaching out and feeling an object while grabbing it, making virtual grabbing difficult because the user can only rely on visual cues for feedback. Interacting with user interface objects such as tool palettes and menus is also difficult due to the same lack of haptic feedback. Another limitation of direct manipulation is that if the object is not within arm’s reach, the user must walk or virtually fly to a location that is closer. When walking or flying is not possible, alternative metaphors to direct manipulation are required.
In some cases it is not possible to express an operation using an intuitive direct manipulation operation. Operations such as selecting a lighting model or adjusting the number of triangles used in a primitive have no physical world counterpart and so an abstraction must be introduced. Brooks offers an insight into what interactions are suitable for different operations in virtual environments [BROO88]. For cases where a discrete interactive change to a virtual world parameter is required, Brooks suggests the use of menu selections. For dynamically changing parameters, dynamic input devices should be used instead. The only time Brooks recommends character-based commands be used is when retrieving an object by name or attaching a name to an object. In general, designers of VR and AR systems should avoid using keyboard entry where possible because command-based systems are highly abstract and unintuitive.
Brooks also suggests that the specification of commands and operations in the view area should be separate and assigned to different cursors, since it “interrupts both the visual and tactile continuity inherent in the operand cursor’s natural movement”. Using this methodology, the command interface can be changed to other formats without affecting the interactions in the view. Brooks suggests that command selection is also a natural candidate for speech recognition engines. As previously mentioned, the Put-That-There system by Bolt implemented an interface that used separate hand tracking for pointing and speech recognition for command entry [BOLT80].
Liang and Green implemented the first 3D interaction menus, named Daisy and Ring [LIAN93]. The Daisy menu presents a series of commands arranged evenly around a sphere, and by rotating the input device the desired menu option can be moved to within a selection basket and a button pressed. Liang and Green found that rotating a Polhemus sensor about all 3 axes was difficult because of the hanging cable and so developed an improved menu. The Ring menu presents the options in a ring shape (so there are fewer options than with Daisy) and only one DOF is used to rotate the ring and move an item into the selection basket. The Ring menu was demonstrated to improve usability even though fewer options are available at any time. The HoloSketch system by Deering also uses ring style menus, and is used to support a stereo desktop 6DOF modelling application [DEER95]. Deering’s 3D pie menus pop up around the 6DOF cursor when a wand button is pressed, and the user then moves the wand to the desired option and selects it. Sub-menus are supported by pushing the menu back and floating a new pie menu on top. These pie menus are designed to use accurate position tracking and minimise both travelling distance and screen real estate by appearing around the wand.
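The single degree of freedom used by a ring style menu can be sketched as a mapping from one rotation angle to an item index. The mapping below is an assumption made for illustration (items spaced evenly, with the first item in the basket at zero rotation) and is not taken from the JDCAD implementation.

```python
import math

def ring_menu_selection(items, roll_angle):
    """Return the item currently rotated into the selection basket.

    roll_angle is the single tracked DOF in radians; each item occupies an
    equal sector of the ring, and the nearest sector centre wins.
    """
    n = len(items)
    sector = 2.0 * math.pi / n
    index = int(round(roll_angle / sector)) % n
    return items[index]
```

Because only one angle is read, the other two rotation axes of the sensor can move freely without disturbing the selection.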
As discussed earlier, Butterworth et al. [BUTT92] and Mine et al. [MINE97a] both implement 3D menus and tool palettes that appear in front of the user. The user indicates 3D items with their hands to directly make selections, but requires accurate tracking to be usable. The pull down menus and tool palettes implemented are structured similar to those used in 2D desktop style user interface toolkits. This form of command entry can be quite tiring if the user is constantly reaching out in front of their body to make selections, and lacks the haptic feedback that is expected by the user.
Bowman and Wingrave developed a VR menu system [BOWM01] that employs Pinch Gloves [FAKE01] as the input device. Instead of using a pull down menu, the top-level items are mapped to the fingers on one hand and the second-level options to the other hand, with no 3D tracking of the hands required. The user selects a top-level option with the matching finger, the system updates the display with a list of second-level options, and the user then picks one using the other hand’s fingers. Using their small finger, the user can cycle through options if there are more than three options available. This menu design is limited to a depth of two, and is not easily scalable to a large number of hierarchical commands.
Instead of using a menu or tool palette to execute commands, Zeleznik’s SKETCH system uses a three button 2D input device to sketch out 3D pictures using gestures [ZELE96]. Hand gestures are analysed to initiate various commands such as selection, manipulation, and deletion. Gesture-based inputs are limited to working in environments where fast and accurate tracking is available, and to a limited set of commands that are expressed using real life actions. When abstract concepts that have no logical gesture mapping are performed, unintuitive gestures must be created and learned, or less direct interfaces such as those discussed previously must be used.
Wloka and Greenfield developed a device named the Virtual Tricorder, a generic tool that can be used to perform many operations in VR [WLOK95]. This device is tracked in 3D and contains a set of input buttons that are mapped to different operations. By overloading the device’s operations the number of tools to be carried is reduced while increasing the functionality. The Virtual Tricorder is limited to the number of buttons that are available, and the overloading of functionality may become complicated for the user to understand.
In many cases it is inconvenient or not possible to fly to an object to be directly manipulated. In the JDCAD system [LIAN93], Liang and Green developed a number of techniques that were later described by Mine using the term action at a distance [MINE95b]. Rather than directly manipulating the object, the user can manipulate it through pointing. The first technique Liang and Green developed is virtual laser pointing, and allows the intuitive selection of objects by projecting a virtual laser from the hands toward an object, just as can be achieved in the physical world. Once the object is selected, it may be attached to the laser beam like a long rod and manipulated by rotating the hand. While this technique can perform remote manipulation very intuitively, it suffers from the amplification of fine hand movements and tracker noise to large motions at a distance. Other transformations such as rotation along an arbitrary axis or varying the distance along the laser beam are also not supported without further extensions to the technique. Bowman and Hodges implemented a fishing reel metaphor that can adjust object translation toward and away from the user after selection [BOWM97]. Liang and Green discovered that at large distances and with small objects the thin laser beam was difficult to select with, mostly due to the amplification of tracker noise. The spotlight technique was then developed, using cone shapes that increase in radius over distance.
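The difference between laser and spotlight selection can be illustrated by treating the laser as a cone with a very small half-angle. The sketch below is a simplified illustration under stated assumptions: objects are reduced to centre points, the object closest to the cone axis is chosen, and the function names are hypothetical; the selection rule used in JDCAD may differ.

```python
import math

def angle_off_axis(origin, direction, point):
    """Angle (radians) between the pointing axis and the line from origin to point."""
    to_point = [p - o for p, o in zip(point, origin)]
    dist = math.sqrt(sum(c * c for c in to_point))
    if dist == 0.0:
        return 0.0
    length = math.sqrt(sum(c * c for c in direction))
    cos_a = sum(d * c for d, c in zip(direction, to_point)) / (length * dist)
    return math.acos(max(-1.0, min(1.0, cos_a)))

def spotlight_select(origin, direction, objects, half_angle):
    """Name of the object nearest the axis inside the cone, or None."""
    best, best_angle = None, half_angle
    for name, centre in objects.items():
        angle = angle_off_axis(origin, direction, centre)
        if angle <= best_angle:
            best, best_angle = name, angle
    return best

def lateral_error(distance, angular_noise):
    """Sideways displacement at a given distance caused by angular tracker noise."""
    return distance * math.tan(angular_noise)
```

For example, one degree of angular noise moves the end of a 20 metre beam by roughly 0.35 metres, which shows why a widening cone is easier to aim at distant objects than a thin beam.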
Forsberg et al. improved on the spotlight technique to create a new technique named selection apertures [FORS96]. Rather than using a laser or cone originating from the user’s hand, this technique originates a cone from the user’s head, with the axis of the cone passing through the cursor on the user’s hand. A circular selection cursor mapped to the user’s hand defines the radius of the cone at the hand, affecting the overall size of the selection cone. An interesting property of this technique is that the cone does not originate from the hands, and so only the position of the cursor is required instead of full 6DOF tracker values. Devices with poor orientation sensing can be used, and selection accuracy can be improved since the user is looking through a virtual scope rather than aiming a virtual laser from the waist.
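The aperture geometry can be sketched using positions only, which reflects the property that no hand orientation is needed. This is a minimal illustration assuming point-like objects and hypothetical names; it is not the implementation by Forsberg et al.

```python
import math

def aperture_select(head, hand, aperture_radius, objects):
    """Names of objects inside the cone from the head through the hand cursor.

    The cone has radius aperture_radius at the hand and grows linearly with
    distance along the head-to-hand axis.
    """
    axis = [h - e for h, e in zip(hand, head)]
    reach = math.sqrt(sum(a * a for a in axis))
    unit = [a / reach for a in axis]
    selected = []
    for name, centre in objects.items():
        rel = [c - e for c, e in zip(centre, head)]
        depth = sum(r * u for r, u in zip(rel, unit))   # distance along the axis
        if depth <= 0:
            continue                                    # behind the user
        off_axis = math.sqrt(max(0.0, sum(r * r for r in rel) - depth * depth))
        if off_axis <= aperture_radius * depth / reach:
            selected.append(name)
    return selected
```

Enlarging the aperture circle at the hand, or moving the hand closer to the head, widens the cone and so admits more objects.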
Another alternative to the direct manipulation of objects is by mapping the user’s hand to a selection cursor using a non-linear function. Mine first described such techniques [MINE95a], and discussed how the location of the hands can be used to control the velocity of a cursor flying through the environment. If the user moves their hand beyond a central point, the object will move away with increased velocity, and by bringing their hand closer the object will move toward the user. This technique is similar to using a joystick to control the motion of the cursor, except this technique is in 3D and uses the hands. The use of joystick controls adds a layer of abstraction from direct manipulation that may degrade performance in the environment.
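A velocity-based mapping of this kind can be sketched as below. The neutral distance, dead zone, gain, and quadratic response are all assumptions chosen for illustration, since the text does not specify the function Mine used.

```python
def cursor_velocity(hand_distance, neutral=0.35, dead_zone=0.05, gain=4.0):
    """Cursor velocity (m/s) from the hand's distance to the body (m).

    Beyond the neutral point the cursor moves away, closer than the neutral
    point it moves toward the user; speed grows with the square of the offset.
    """
    offset = hand_distance - neutral
    if abs(offset) <= dead_zone:
        return 0.0
    sign = 1.0 if offset > 0 else -1.0
    return sign * gain * (abs(offset) - dead_zone) ** 2

def step_cursor(cursor_distance, hand_distance, dt):
    """Advance the cursor for one frame, never passing behind the user."""
    return max(0.0, cursor_distance + cursor_velocity(hand_distance) * dt)
```

As with a joystick, the hand position controls the rate of change rather than the cursor position itself, which is the extra layer of abstraction noted above.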
The GoGo arm [POUP96] was developed by Poupyrev et al. as a way of manipulating objects at a distance with a similar technique as described by Mine, but using absolute position values instead of a velocity abstraction. This technique uses the volume within reach of the user’s hands and maps the closest two thirds directly to the cursor for direct manipulation. The remaining one third of the volume away from the user is mapped to a non-linear quadratic function that increases rapidly with distance. The overall function used has a smooth transition and allows working within arm’s reach and at long distances without changing modes. Since this technique controls a 3D cursor, it can be used for both selection and manipulation, although the accuracy of the technique will degrade according to the function used as the distance is increased.
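The mapping described above can be sketched as a piecewise function of the hand's distance from the body: linear inside two thirds of the arm's length, and linear plus a quadratic term beyond it. The arm length and the coefficient `k` below are assumed values for illustration, and the function names are hypothetical.

```python
import math

def gogo_distance(hand_distance, arm_length=0.6, k=30.0):
    """Virtual cursor distance (m) for a real hand distance (m)."""
    threshold = 2.0 * arm_length / 3.0
    if hand_distance < threshold:
        return hand_distance                     # one-to-one within easy reach
    return hand_distance + k * (hand_distance - threshold) ** 2

def gogo_cursor(chest, hand, arm_length=0.6, k=30.0):
    """Place the cursor along the chest-to-hand direction at the mapped distance."""
    offset = [h - c for h, c in zip(hand, chest)]
    real = math.sqrt(sum(o * o for o in offset))
    if real == 0.0:
        return list(chest)
    scale = gogo_distance(real, arm_length, k) / real
    return [c + o * scale for c, o in zip(chest, offset)]
```

Both the value and the slope of the two pieces agree at the threshold, which gives the smooth transition between direct manipulation and extended reach. The quadratic term also shows why accuracy degrades with distance, since a small hand movement near full extension produces a much larger cursor movement.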
On a standard desktop display, 3D objects can be selected with a mouse by positioning the 2D cursor on top of the object of interest. To select an object, the system finds the intersection with all objects underneath the cursor and returns the closest one. This technique can also be used in virtual environments by placing the cursor over an object at a distance, and projecting a selection ray from the head through the hand cursor. Pierce et al. describe these as image plane techniques [PIER97], and indicate how they can be used to perform selection of objects out of arm’s reach. Four selection techniques are proposed: head crusher, sticky finger, lifting palm, and framing hands, as an alternative to previous laser and spotlight techniques. Although Pierce et al. did not discuss manipulation of objects using this technique, the same mechanism (as in laser and aperture selection) can be used to adjust the position and orientation of the object at a distance. An interesting comment from the discussion of an informal user study by Pierce et al. is that “no user has had any trouble understanding how the techniques work”, and that arm fatigue is minimised since hand selection time is reduced compared to other VR techniques.
Rather than trying to extend the direct reach of a user with the use of extensions such as laser beams and non-linear mappings, Stoakley et al. proposed a new interaction metaphor named Worlds-in-Miniature [STOA95]. In this metaphor, the user holds a small copy of the 3D world in their hands. By viewing the WIM in the hands, objects that are currently obscured in the immersive VR view can be easily seen from overhead. Objects in the WIM can also be manipulated directly using the hands, with these changes made visible in the immersive VR view. The advantage of this technique is that it can perform selection and manipulation tasks using direct manipulation, even though the object may be very far away from the user. For cases where the world is very large however, the WIM model must be scaled to fit the user’s hand and so small objects may be invisible to the user.
The scaled world grab technique by Mine et al. [MINE97a] uses similar concepts to perform remote interactions within arm’s reach. After selecting an object the entire world is scaled and translated so that the object of interest appears in the hand. The user can then interact with the object and others nearby, with the world being returned back to its original scale when finished. Since the entire world is scaled during the grab operation, the user can still see other nearby objects and there is no need to divide up screen space between a WIM and the immersive view.
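The scale and translate step can be sketched as a uniform scale about the user's head followed by a shift that places the selected object at the hand. Scaling about the head and the names used below are assumptions made for this illustration rather than details confirmed by the text.

```python
import math

def scaled_world_grab(head, hand, target):
    """Build the world transform used while an object is grabbed.

    Returns (scale, to_scaled, to_world): the scale factor, a function mapping
    world points into the scaled world, and its inverse for restoring edits.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    s = dist(head, hand) / dist(head, target)
    scaled_target = [h + s * (t - h) for h, t in zip(head, target)]
    shift = [a - b for a, b in zip(hand, scaled_target)]

    def to_scaled(p):
        return [h + s * (c - h) + d for h, c, d in zip(head, p, shift)]

    def to_world(p):
        return [h + (c - d - h) / s for h, c, d in zip(head, p, shift)]

    return s, to_scaled, to_world
```

Every object is passed through `to_scaled` during the grab, so neighbouring objects shrink with the target and remain visible around the hand; `to_world` converts the edited positions back when the grab ends.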
Another technique designed to overcome the shortcomings of WIMs is the Voodoo Dolls technique by Pierce et al. [PIER99]. In this technique, the user works in a standard immersive view and then selects an object of interest. When selected, the system creates a unit sized “doll” in the hands that represents the object in the environment. Changes made to a doll held in the hand are reflected immediately in the normal environment. When dolls are held in both hands, the dolls are scaled around the non-dominant doll of unit size. By varying the position and rotation of the hands the relative placement of the dolls can be adjusted in the environment. Dolls can be created by selecting, released by letting go with the hands, and passed between the hands. To provide context for the user, the dolls are rendered with the selected object as well as others that are close by.
Users intuitively know how to manipulate objects in the physical world, and so by using tracked physical props these can be used as user interaction devices. Previously mentioned work normally uses gloves or button controllers to interact with the environment, but these are generic devices that do not have real world equivalents. Hinckley et al. evaluated the use of props for the visualisation of 3D models of the brain [HINC94b]. A small doll’s head with an embedded Polhemus sensor is used to represent the brain, while a tracked cutting plane and pointer are used to select slices or points in the virtual model. The surgeon can very intuitively interact with these props since their operation is obvious and uses the surgeon’s existing manipulation skills. In other research, Hinckley et al. again demonstrated that well designed tracker props are easier to understand and use than the generically-shaped tracker sensors supplied by the manufacturer [HINC97]. The use of props can be cumbersome if there are too many discrete operations to represent, or if the task is too abstract to map to any physical world prop. The use of props also prevents the use of the hands for other tasks that may be required.
The Personal Interaction Panel developed by Szalavari and Gervautz [SZAL97] makes use of tablets as a prop-based input device, and has been used in collaborative AR work by Schmalstieg et al. [SCHM00] and Reitmayr and Schmalstieg [REIT01a]. The PIP is held in the hands and uses AR to overlay 3D widgets indicating various controls that can be adjusted with a hand-held pen (see Figure 2‑7). The tablet may be quite small and implemented using a variety of technologies such as pen tracking or pressure sensing, making it portable and easily carried. Another feature of the PIP is that it provides haptic feedback for the user as they press the pen against the tablet, in contrast to the hand-held widgets and tool palettes discussed previously. Lindeman et al. demonstrated that by providing passive-haptic feedback to the user in precise direct manipulation tasks, user performance is significantly increased [LIND99]. In this study, the best results were achieved when the user was able to hold a tablet in one hand and then press against it with a tracked finger. Other methods such as fixing the tablet to the world or having no haptic feedback produced lower user performance values.
Table 2‑3 lists a summary of the information presented concerning interaction techniques for virtual reality (and also augmented reality), comparing their features and limitations.

Table 2‑3 Comparison between forms of VR interaction techniques
Being able to capture the physical world into a digital model has become a critical part of modern professions such as surveying, building, and architecture. These areas have traditionally used paper to record information, but over time the amount of data required has increased and now computers are used to streamline these tasks. Brooks discusses the problems associated with the capture of physical world objects from a computer graphics perspective [BROO97]. One suggestion he makes is that an iterative refinement strategy is desirable, where the most resources are focussed on complex objects and not on those that can be approximated with no noticeable loss of detail. This section discusses various techniques used to capture physical world data into a computer.
Surveyors are responsible for measuring and capturing the geometry of landscapes for various uses such as construction and the division of property boundaries. Using a known reference point on the Earth, coordinates of other points may be found from relative orientation and distance measurements. Originally, surveying was performed by using measuring chains, where the chain is pulled between two points and the length is calculated by counting the number of links. The accuracy of the chain is affected by its physical properties as well as distortions caused by gravity. The angle is measured using a theodolite, which resembles a small telescope mounted onto a tripod. By aiming the theodolite’s crosshairs at a target, the angle can be mechanically measured relative to the base. Laser range finders are now also used on theodolites to instantly measure distances without the use of chains, achieving accuracies in the order of millimetres. With integrated angle and distance measurements in theodolites, quick and accurate measurements can be performed in the field.
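The calculation of a point from a known reference, an angle, and a distance can be sketched as follows. This is only an illustrative example and not a procedure taken from the surveying literature: it assumes a flat local grid with east and north axes, a bearing measured clockwise from north, and a hypothetical function name survey_point.

```python
import math


def survey_point(ref_east, ref_north, bearing_deg, distance_m):
    """Locate a target from a known reference point.

    Assumes a flat local grid and a bearing measured clockwise from north
    (both are assumptions made for this sketch).
    """
    bearing = math.radians(bearing_deg)
    east = ref_east + distance_m * math.sin(bearing)
    north = ref_north + distance_m * math.cos(bearing)
    return east, north


# Hypothetical reading: the target is 50 m away on a bearing of 90 degrees (due east).
east, north = survey_point(1000.0, 2000.0, 90.0, 50.0)
print(round(east, 3), round(north, 3))
```

Over long distances a real survey would also need to account for the curvature of the Earth and the map projection in use, which this sketch ignores.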
GPS technology has also improved since its introduction and is now widely used in the surveying field as well. As previously mentioned, RTK technology allows accuracies of 1-2 centimetres and is accurate enough to be used for surveying. Using a pole-mounted GPS, surveyors can instantly record the position of an object by placing the pole down at the location and pressing a button. The use of modern equipment such as GPS, theodolites, and laser range finders enables the surveyor to be more efficient and accurate compared to traditional techniques.
To capture the model of a building, the most basic method is to physically measure the structure with a tape measure, and then record the dimensions and geometry on paper. This information can then be used to recreate the object as a 3D graphical model using a desktop CAD system. The main problem with this technique is that it is very time consuming, as each point needs to be manually measured, recorded on paper, and then entered into a computer. This process is also prone to errors, and it will only be obvious during entry into the computer that a mistake has been made when points do not line up correctly. Errors require the user to repeatedly go back outside and make new measurements until the model is satisfactory. While existing plans for buildings can be used as 3D models, Brooks points out that in many cases these plans show the object as designed but not as actually constructed [BROO97]. Apart from just discussing the capture of models into the computer, Brooks argues that working with static world models is a substantial engineering task, similar in magnitude to a software engineering task. A rule of thumb he proposes is that the complexity of an object can be measured by counting the number of polygons, similar to counting lines of source code when programming.
Figure 2‑21 Partial UniSA campus model captured using manual measuring techniques (Image courtesy of Arron Piekarski)
At the start of this thesis work, Arron Piekarski used the manual techniques described above to capture a number of buildings on the university campus. This process took about a week to achieve the level of detail and accuracy required and was used to create the AutoCAD model shown in Figure 2‑21.
Images captured with cameras from two known positions may be used to reconstruct the 3D geometry of objects. Cameras placed at even slightly different locations will receive different images due to perspective depth effects. By matching features between the images and measuring the differences in position, 3D mesh surfaces can be automatically produced. This technique can be applied using stereo cameras at a fixed distance apart or from a single camera that is moving along a known path (such as on an aircraft or a vehicle). Sester et al. describe a number of problems with these image-based techniques [SEST00]. Environments containing large height discontinuities or occlusion by other objects will prevent features being accurately measured. If the sampled images contain repetitive patterns or lack unique textures then matching between images becomes difficult or impossible.
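The relationship between the difference in feature position and depth can be illustrated with a short sketch. It assumes the simplest textbook case of two identical, parallel pinhole cameras with rectified images, where depth equals focal length multiplied by baseline and divided by disparity. The function name and the numeric values are hypothetical and are not taken from any of the systems cited above.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a matched feature for a rectified, parallel stereo pair.

    focal_px     -- focal length expressed in pixels (assumed known)
    baseline_m   -- distance between the two camera centres in metres
    disparity_px -- horizontal shift of the feature between the images
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity, and a negative
        # value indicates a bad match, so no finite depth can be returned.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical values: 800 pixel focal length, 0.5 m baseline, 20 pixel shift.
print(depth_from_disparity(800.0, 0.5, 20.0))
```

Because depth is inversely proportional to disparity, a one pixel matching error matters far more for distant features than for near ones, which is one reason why poor feature matching degrades these techniques.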
Stereo cameras can only capture a 3D surface from one particular view and cannot produce a fully closed model unless multiple surfaces are merged together. The Façade system by Debevec et al. uses photographs taken from multiple view points to accurately capture the geometry and textures of objects such as buildings [DEBE96]. While the final models are completely closed surfaces compared to those produced by stereo image capture, the data must be processed offline and requires intervention by the user. The user manually specifies similar feature points between various images to assist with the processing. Image-based reconstruction is still an active research area and no fully automated techniques currently exist for arbitrary shaped objects.
Laser scanning devices are commonly used to capture large outdoor structures, with examples such as the commercially available I-SiTE [ISIT02] and Cyrax [CYRA03] scanners. These devices sweep a laser over an outdoor area and capture distance measurements, constructing a detailed 3D mesh of the area. Since the scanner cannot view all faces of the object it must be repositioned at different view points and the resulting 3D meshes merged together. The specifications for the I-SiTE scanner claim to measure a point every 30 cm at a distance of 100 metres, and to be able to reconstruct a large building from four different angles within one hour. Laser scanners suffer from occlusion by other objects, absorption of the laser on very dark surfaces, no reflections from sharply angled surfaces, and bright reflections caused by specular surfaces [SEST00]. For objects such as buildings, laser scanning produces much more accurate results than those available from image-based reconstruction, although the scanning devices are much more expensive.
An alternative technique usually deployed from aircraft or satellites is the use of Synthetic Aperture RADAR devices (SAR). SAR sends out RADAR pulses along a range of angles and measures the returning pulse’s intensity and phase to form an image. Images are repeatedly captured as the SAR platform moves through the world, and are not obscured by clouds or trees since the visible light spectrum is not used. By matching features between images the phase differences can be used to calculate relative distances for each point on the image [SEST00]. This approach is mostly used for the capture of large areas such as mountains or cities and suffers from similar problems to stereo image reconstruction.
Stereo images, laser scanning, and SAR imaging all require line of sight with the particular light spectrum to capture the geometry of objects. Any objects that are occluding the model will introduce areas of uncertainty, and some objects may include features that are self occluding, such as the pillars in front of a building. Areas that cannot be scanned will form shadows in the 3D mesh that incorrectly represent the physical world shape. If the scanner cannot be repositioned at a different angle to see around the occluding object, it will be impossible to scan correctly. While multiple 3D meshes from different view points can be merged together to form a single mesh, this is difficult and requires accurate correspondences to be made.
Scanning techniques rely on the brute force capture of millions of data points to sample objects, but for flat surfaces there will be many unnecessary points and for areas with sharp changes in features there will not be enough points. For example, when modelling a house, the doors and windows each only need a single polygon but the frames around them require highly detailed geometry to represent their structure. The capture time for models also remains the same no matter what the complexity of the object is, whether it is a simple cube or a detailed sculpture. For simple capture tasks, it is not possible to spend only a couple of minutes to capture the approximate outline of a building since the key corner points may be missed by the scanner as the step size is increased. Large models with millions of polygons are also time-consuming to render or simplify on current hardware, and so these capture processes can require extra processing to achieve the desired geometry detail.
All of the described capture techniques in this section produce 3D output that is relative to the device that performed the capturing. To provide absolute positions for the 3D model, a GPS (or some other positioning device) must be used to measure the location of the scanner. The accuracy of the final world-relative model will therefore depend on the least accurate of all the devices in use. Finally, these techniques are limited to objects that already exist in the physical world. For capturing objects that do not exist, CAD modelling tools are needed for users to express their design and visualise it.
Computer Aided Design (CAD) systems are used to create accurate 2D and 3D representations of physical world objects. These systems form a core part of most design work, since the computer can perform a wide range of useful calculations and processes that help to reduce time and costs. Much research has gone into the development of these systems, and this section provides an overview of their use and underlying technologies that will be useful in later chapters of this dissertation.
Figure 2‑22 Screen capture of Autodesk’s AutoCAD editing a sample 3D model
Systems used in commercial environments have evolved on 2D desktop-based machines, with a typical example of a popular system being Autodesk’s AutoCAD [ACAD03] (see Figure 2‑22). Mine reviews a wide range of CAD systems and discusses their features extensively [MINE97b]. Simple drawing systems are capable of creating 2D vector-based primitives such as points, lines, arcs, and text. CAD systems extend these primitives to support other higher-level features such as layers, dimensioning, and template objects that can be used for complex 2D designs. 3D CAD systems can be used to model real life objects using solid modelling operations. These 3D models may be analysed before construction to ensure they meet the design requirements, and then possibly sent to an automated milling machine to produce a physical representation. There are many powerful CAD tools that can perform a wide range of tasks but these will not be discussed in this dissertation.
CAD systems are generally 2D applications that project the specified view point on to a display, and allow the user to draw and edit both 2D and 3D objects. Given only a 2D input device and a keyboard, CAD systems implement a number of techniques to enter in 3D information. Using direct keyboard entry of numerical values for 3D locations is the most exact input method since there is no ambiguity. An alternative is the use of multiple views from different angles, where the user can select the same point in each view and the 3D location is calculated through intersection. CAD systems also introduced a concept named working planes that is described by Mine [MINE97b]. Working planes are surfaces of infinite size that are placed into the modelling environment, and the 2D cursor on the display is projected against these. Given any 2D cursor position, the working plane can be used to calculate 3D coordinates for other operations. Working planes can be defined using numeric input, graphically while perpendicular to a view point angle, or relative to the surface of another object.
Working planes have features that can be explained best using an example of constructing a model of a house. Given an empty model, the 2D top down view is selected and a working plane is created at height zero. The user can click a series of points to define lines forming the perimeter of the house. The user switches to a side view and then extrudes the perimeter up to create a solid shape. Inbuilt objects supplied with the CAD system such as pyramids and wedges (or previous extrusion techniques) can be used to form the roof shape. Up to now this example has only used working planes that are perpendicular to the current view. The true power of working planes is most apparent when the object cannot be aligned to the view point. In this scenario, instead of using the coordinate system to specify working planes, an object facet itself can be used. To draw a window onto a wall of the house, the user nominates the wall and then simply draws against the surface. As each point is entered it is projected against the wall (working plane) and used to create a 3D vertex. If a picture is hanging on the wall, it can be moved along the surface of the working plane instead of on the plane perpendicular to the view point. Working planes act as a constraint mechanism that assists with the specification of three degrees of freedom using only a two degree of freedom input.
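The projection of a 2D cursor against a working plane can be sketched as a ray and plane intersection. The example below is a minimal illustration and not the implementation of any particular CAD system: it assumes the cursor has already been converted into a ray with an origin at the view point and a direction into the scene, and all function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def sub(a, b):
    """Component-wise difference a - b of two 3D vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def project_onto_working_plane(eye, direction, plane_point, plane_normal):
    """Return the 3D point where the cursor ray meets the working plane.

    Returns None when the ray is parallel to the plane or when the plane
    lies behind the view point.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-9:
        return None  # ray never meets the plane
    t = dot(sub(plane_point, eye), plane_normal) / denom
    if t < 0:
        return None  # intersection is behind the viewer
    return tuple(e + t * d for e, d in zip(eye, direction))


# A wall 10 units in front of the viewer, used as the working plane.
wall_point = (0.0, 0.0, -10.0)
wall_normal = (0.0, 0.0, 1.0)
print(project_onto_working_plane((0.0, 0.0, 0.0), (0.1, 0.2, -1.0),
                                 wall_point, wall_normal))
```

Each cursor position yields exactly one 3D vertex on the nominated surface, which is how the plane supplies the third degree of freedom that the 2D input device lacks.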
While a shape may be defined by specifying each vertex manually and then joined into polygons, this is a very time consuming process. In many cases, it can be seen that shapes contain surfaces similar to those in primitive objects such as boxes, cylinders, spheres, and cones. Using constructive solid geometry (CSG) techniques, CAD systems can take objects that are mathematically defined and combine them using Boolean set operations. In his detailed survey paper of the field of solid modelling, Requicha describes the foundations of using CSG representations for 3D objects in this way [REQU80]. An example of traditional Boolean set operations is depicted by the Venn diagrams in Figure 2‑23, with two overlapping closed 2D regions and various logical operators. The examples depicted are inverse, union, intersection, and difference operators.
Figure 2‑23 Venn diagrams demonstrating Boolean set operations on 2D areas A and B
For the examples depicted in Figure 2‑23, any kind of closed space (such as a 3D solid object) can be used instead. A point is defined as being part of a set if it is enclosed by the surface, and the region inside the surface is assumed to be a solid volume. Figure 2‑24 demonstrates similar Boolean set operations but using 3D solid objects and operating on a pyramid as object A and a sphere as object B. The union, intersect, and difference operators produce shapes that would be difficult to express otherwise and yet it is obvious what input primitives were used.
Panels: Pyramid A; Sphere B; Union (A ∪ B); Intersect (A ∩ B); Difference (A – B), (A ∩ !B)
Figure 2‑24 CSG operations expressed as Boolean sets of 3D objects (Images courtesy of Leonard Teo)
To simplify the calculations for computing CSG objects, the input objects should all be definable using mathematical surface equations. Surface equations can be used to generate polygon meshes and surface normals at any level of detail since they are continuous functions with well defined values. 3D shapes may be defined mathematically using equations that equal zero when the X, Y, and Z coordinates are on the surface of the object. For example, a sphere surface of unit dimensions can be defined using the equation x² + y² + z² - 1 = 0. If coordinates inside the sphere are used then the equation will return a negative value, and it will be positive for all points outside the sphere. The surface equation of a cylinder that is infinite along the Z axis and of unit radius in X and Y can be similarly represented using the equation x² + y² - 1 = 0. Similarly, the surface equation for a plane is Ax + By + Cz + D = 0 and when zero is calculated then the point lies exactly on the surface. The surface normal of the plane is represented in the equation using the vector [A, B, C]. An interesting property of the plane equation is that it is not enclosed since the plane moves off to infinity in all directions and so one would assume that it does not have an inside or outside. A plane does have an inside and outside though, since it possesses a surface normal vector for direction and cuts the universe into two halves (see Figure 2‑25). The portion above the plane is defined as outside while the portion below is inside. The surface equations previously defined can all be categorised as quadric surfaces, each expressed using the general equation Ax² + By² + Cz² + Dxy + Exz + Fyz + Gx + Hy + Iz + J = 0. Other more exotic shapes can be represented using this and other higher order equations, but will not be discussed here.
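The sign convention described above can be demonstrated with a short sketch that evaluates the sphere, cylinder, and plane equations at sample points. The function names and the tolerance value are assumptions made for this illustration only.

```python
def sphere(x, y, z):
    """Unit sphere: x^2 + y^2 + z^2 - 1."""
    return x * x + y * y + z * z - 1.0


def cylinder_z(x, y, z):
    """Unit radius cylinder that is infinite along the Z axis."""
    return x * x + y * y - 1.0


def make_plane(a, b, c, d):
    """Plane Ax + By + Cz + D with surface normal [A, B, C]."""
    def plane(x, y, z):
        return a * x + b * y + c * z + d
    return plane


def classify(surface, point, eps=1e-9):
    """Report whether a point is inside, outside, or on a surface."""
    value = surface(*point)
    if abs(value) <= eps:
        return "surface"
    return "inside" if value < 0 else "outside"


ground = make_plane(0.0, 0.0, 1.0, 0.0)  # normal points up along +Z

print(classify(sphere, (0.0, 0.0, 0.0)))        # centre of the sphere
print(classify(sphere, (1.0, 0.0, 0.0)))        # exactly on the sphere
print(classify(cylinder_z, (0.0, 0.0, 100.0)))  # far along the cylinder axis
print(classify(ground, (0.0, 0.0, -5.0)))       # below the plane
print(classify(ground, (0.0, 0.0, 5.0)))        # above the plane
```

The point far along the Z axis is still inside the cylinder because the equation places no limit on z, and the plane reports inside for any point on the side opposite to its normal.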
Figure 2‑25 Plane equation divides the universe into two half spaces, inside and outside
Figure 2‑26 Finite cylinder defined by intersecting an infinite cylinder with two planes
The definition of the infinite cylinder introduced previously is not useable for rendering since it is infinite in length while physical world objects are generally finite in length. What is desired is the ability to place end caps on the cylinder to limit its length, but this cannot be easily expressed using a single surface equation. A capped cylinder can instead be defined with the combination of a cylinder and two planes, the planes being used to provide the end caps. Using the layout shown in Figure 2‑26 it can be seen that the planes define a region bound along the Z axis, but infinite along X and Y, while the cylinder defines a region bound in X and Y, but infinite in Z. By combining these regions using the intersection operator, a new solid region that includes the common parts between all inputs is defined and appears as a capped cylinder. It is critical that the surface normals for the two planes are correctly placed pointing out from the cylinder, otherwise the final object will be non-existent since there is no common inside.
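The capped cylinder construction can be sketched using point membership tests, where a point belongs to the intersection only if it is inside every input region. This is a minimal illustration using a unit cylinder with caps at z = 1 and z = -1. The function names are hypothetical and the approach does not generate a boundary representation.

```python
def infinite_cylinder(x, y, z):
    """Unit radius cylinder, infinite along Z; negative values are inside."""
    return x * x + y * y - 1.0


def make_plane(a, b, c, d):
    """Plane Ax + By + Cz + D with outward surface normal [A, B, C]."""
    def plane(x, y, z):
        return a * x + b * y + c * z + d
    return plane


def inside_intersection(surfaces, point):
    """A point is in the intersection if it is inside (or on) every input."""
    return all(surface(*point) <= 0.0 for surface in surfaces)


# End caps with normals pointing out from the cylinder.
top_cap = make_plane(0.0, 0.0, 1.0, -1.0)      # inside where z <= 1
bottom_cap = make_plane(0.0, 0.0, -1.0, -1.0)  # inside where z >= -1
capped = [infinite_cylinder, top_cap, bottom_cap]

# The same caps with both normals reversed: the half spaces no longer overlap.
flipped = [infinite_cylinder,
           make_plane(0.0, 0.0, -1.0, 1.0),    # inside where z >= 1
           make_plane(0.0, 0.0, 1.0, 1.0)]     # inside where z <= -1

print(inside_intersection(capped, (0.0, 0.0, 0.5)))   # within the caps
print(inside_intersection(capped, (0.0, 0.0, 2.0)))   # beyond the top cap
print(inside_intersection(flipped, (0.0, 0.0, 0.5)))  # no common inside
```

The flipped configuration shows why the direction of the surface normals matters: no point can satisfy both reversed half spaces, so the resulting object is empty.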
Using a similar technique as used previously, a box can be defined using six plane equations. Figure 2‑27 shows how these planes can be arranged so that their inside regions all overlap – when this area is intersected a box shape is formed. This construct is interesting because while a box shape cannot be defined using a single quadric surface, it can be defined using a collection of planes. Using CSG techniques, most shapes can be created given enough inputs and iterations, making this a very powerful way of expressing objects with only simple inputs.
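The six plane definition of a box can be sketched in the same way. The example below builds the plane coefficients for an axis-aligned box from its minimum and maximum corners and tests points against all six half spaces. Restricting the box to be axis-aligned is a simplification made for this sketch, and the function names are hypothetical.

```python
def box_planes(lo, hi):
    """Return six (A, B, C, D) planes with outward normals for a box.

    lo and hi are the minimum and maximum corners of an axis-aligned box.
    """
    planes = []
    for axis in range(3):
        normal = [0.0, 0.0, 0.0]
        normal[axis] = 1.0
        # Face at the maximum extent: normal points along the positive axis.
        planes.append((normal[0], normal[1], normal[2], -hi[axis]))
        # Face at the minimum extent: normal points along the negative axis.
        planes.append((-normal[0], -normal[1], -normal[2], lo[axis]))
    return planes


def inside_box(planes, point):
    """A point is in the box if it is inside (or on) all six planes."""
    x, y, z = point
    return all(a * x + b * y + c * z + d <= 0.0 for a, b, c, d in planes)


planes = box_planes((0.0, 0.0, 0.0), (2.0, 3.0, 4.0))
print(len(planes))
print(inside_box(planes, (1.0, 1.0, 1.0)))
print(inside_box(planes, (2.5, 1.0, 1.0)))
```

A box at an arbitrary orientation would use the same membership test, with only the plane coefficients changing.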
Figure 2‑27 Box defined using six plane equations and CSG intersection operator
One drawback to CSG is that since the objects are defined using quadric equations, they cannot be directly rendered on typical 3D graphics hardware that supports only polygons. Requicha describes in detail the use of boundary representations made up of facets to represent solid models before and after CSG operations [REQU80]. Many algorithms have been developed for the calculation of CSG boundary representations, including real-time ones such as those by Laidlaw et al. [LAID86] and Thibault and Naylor [THIB87].
This section introduces the concept of a wearable computer, and how one can be used to perform outdoor augmented reality. Wearable computers are defined, input devices are reviewed, and the problems associated with performing mobile AR are then discussed. Much of the information presented in this section is based on knowledge I have accumulated after designing mobile computers since 1998.
For the purposes of this dissertation, I will define a wearable computer to be a self powered computing device that can be worn on the body without requiring the hands to carry it, and can be used while performing other tasks. The first known instance of the development of a wearable computing device was in the late 1950s by Thorp and Shannon [THOR98]. This device was used to predict the motion of roulette wheels at casinos and contained twelve transistors worn by the user, with a foot mounted switch as an input device and a small concealed earpiece for feedback. This was pioneering work because the wearable was completely hidden in clothing and able to pass the careful inspection of casino security staff.
Wearable computers have now progressed to the wearing of hardware with internal architectures similar to that available on standard computers, with systems developed by Carnegie Mellon University (such as VuMan [BASS97] and Spot [DORS02]) and the Massachusetts Institute of Technology (such as MIThril [MITH03]). Systems are now commercially available from companies such as Xybernaut [XYBE03] and Charmed [CHAR03], and combined with a HMD are being deployed to assist workers with tasks that require information to be presented while keeping the hands free. Systems have been tested in the field with studies such as those by Siegel and Bauer [SIEG97] and Curtis et al. [CURT98]. Work such as designing for wearability by Gemperle et al. [GEMP98] and embedding wearables into business suits by Toney et al. [TONE02] are examples of research focusing on making computers a part of everyday clothing.
A key feature of a wearable computer is the ability for a user to operate the computer while being mobile and free to move about the environment. When mobile, traditional desktop input devices such as keyboards and mice cannot be used, and so new user interfaces are required. Thomas and Tyerman performed a survey of various input devices for wearable computers and how they can be used for collaboration tasks [THOM97a]. Thomas et al. evaluated three different input devices for text entry on wearable computers: a virtual keyboard controlled by trackball, a forearm keyboard, and a chordic keyboard [THOM97b]. While these devices demonstrated improvements in accuracy and speed after training the user, they still are not as efficient as a standard desktop keyboard. Although many devices have been developed for communication with wearable computers, there is still much research to perform in this area as the devices are still cumbersome and force the user to use unnatural interactions. Some currently available devices include:
· Chord-based keyboards (the Twiddler2 [HAND02] is shown in Figure 2‑28a)
· Forearm-mounted keyboards (the WristPC [LSYS02] is shown in Figure 2‑28b)
· Track-ball and touch-pad mouse devices (a generic track-ball mouse is shown in Figure 2‑28c, and the Easy Cat touch-pad [CIRQ99] is shown in Figure 2‑28d)
· Gyroscopic and joystick-based mouse devices (the Gyration wireless mouse [GYRA04] is shown in Figure 2‑28e)
· Gesture detection of hand motions
· Vision tracking of hands or other features
· Voice recognition
Figure 2‑28 Wearable input devices suitable for use in outdoor environments (a) chordic keyboard, (b) forearm keyboard, (c) track-ball mouse, (d) touch-pad mouse, and (e) gyroscopic mouse
Many wearables such as the ones discussed previously are small and can be concealed on the body, with construction that is able to survive daily use. When working with AR and interactive 3D graphics however, most wearable computers lack the computational power available on standard desktop and laptop systems to perform these tasks. Instead of using small and compact wearable computers, the applications presented in this dissertation require equipment that is bulky, power inefficient, and heavy, such as laptops, trackers, and batteries. If current trends in the miniaturisation of hardware continue, these devices will reduce in size to that of today’s wearable computers. Since the equipment currently required is quite bulky, I have used a mobile backpack configuration similar to that used by Feiner et al. [FEIN97]. Although large and heavy, these systems are still wearable just like any smaller computer, but they are less comfortable and ergonomic.
The design of mobile backpack systems that can perform AR has a number of problems that are not experienced when working on a desktop computer. Designing a system that is powerful enough, uses mobile power sources, is able to work outside, and withstand tough environmental conditions is difficult and introduces tradeoffs. Azuma also discusses some of the problems associated with making AR work outdoors [AZUM97b]. The designs possible vary depending on the requirements, and some of the constraints are as follows:
· Weight and size - Wearable computers should not be a burden on the user to carry and use.
· Power supply and run time - Components that have large electrical energy requirements need more batteries, which add to weight and size. The amount of time the system can run for is controlled by the amount of power supplied by the batteries and the efficiency of the components.
· Performance - Calculations for rendering and tracking require large amounts of processing power, and to meet certain requirements larger and more energy intensive devices may be required. The power consumption of devices is directly proportional to clock speed and heat dissipation.
· Ruggedness - Sensitive electronic equipment needs to be protected from the environment or it will be damaged easily. Connectors, cables, and components normally used indoors may be unsuitable outside due to forces caused by the user moving around, as well as being immersed in an atmosphere with dust, moisture, and heat.
· Price - Cheaper devices are always desirable when possible.
The previous requirements are all interdependent. If the designer optimises for one particular category such as increasing performance, the weight, size, and price of the system will also increase. Optimising for low weight and small size means the wearable will not require as many batteries, but results in the ruggedness protections being removed, diminished performance in the computer, and an increase in the price to pay for miniaturisation of the components. A standard design guideline is that it is possible to optimise for one case at the expense of most of the others, so there are always tradeoffs to be made. Mobile AR systems tend to be optimised for performance, with most other factors being sacrificed. This makes them large, energy intensive, extremely fragile, and very expensive. While technology is improving, there is still much progress to be made before these systems may be easily and commonly used.
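The relationship between power supply and run time described in the constraints above can be made concrete with a simple calculation. The sketch below divides the stored battery energy by the total power drawn by the components. The component names and wattage figures are hypothetical values chosen for illustration and are not measurements of any system described in this dissertation.

```python
def run_time_hours(battery_wh, component_watts):
    """Estimate run time from battery energy and component power draw.

    battery_wh      -- total stored energy in watt-hours
    component_watts -- mapping of component name to power draw in watts
    """
    total_watts = sum(component_watts.values())
    if total_watts <= 0:
        raise ValueError("total power draw must be positive")
    return battery_wh / total_watts


# Hypothetical backpack configuration (illustrative values only).
load = {"laptop": 30.0, "gps": 2.0, "hmd": 8.0}
print(run_time_hours(120.0, load))
```

Doubling the run time under these assumptions requires either doubling the battery capacity, which adds weight and size, or halving the power drawn by the components, which typically reduces performance.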
HMDs operated in an outdoor environment also suffer from portability problems, as well as dealing with a wide and dynamic range of lighting conditions, from darkness to full bright sunlight, and effects such as shadows and reflections. For optical displays, bright lighting can cause the optical overlay to not be visible and can also enter the user’s eyes via the sides of the HMD. Darkness causes the opposite effect with the physical world not easily visible while the overlay is very bright. When working with video overlay displays, video cameras are required that can operate over wide ranges of lighting. These cameras still do not have performance even approximating that of the human eye however.
This chapter has discussed the current state of the art in AR technology. While the current level of technology is very impressive, much of this technology has been around since the time of the first HMD by Sutherland in 1968. Although there have been a number of improvements and the quality of systems has increased substantially, there are a number of unsolved and important problems that prevent mainstream use of AR applications. This dissertation will address some of the problems with user interfaces and 3D modelling tasks for AR, with a particular focus on operating in mobile outdoor environments.
3
"If you can find a path with no obstacles, it probably doesn't lead anywhere"
Frank A. Clark
This chapter describes new interaction techniques for augmented reality to support manipulation and construction of geometry at large distances away from the user. Existing 3D techniques previously described in Chapter 2 extend the user’s interaction beyond arm’s reach, but focus on operating at distances relatively close to the user where there are many cues available to accurately estimate relative position. This chapter references various studies to show that beyond a certain distance the ability of humans to perceive depth is severely attenuated. This affects the accuracy of interactions that can be performed at large distances, which are important when interacting in an outside world with structures beyond the depth perception capability of humans. A new technique named AR working planes is described, using the projection of 2D cursors onto 3D planes to avoid the specification of depth values directly by the user. This technique only requires the use of 2D inputs and so can be implemented using a wide range of input devices, making it ideal for use in mobile environments. The AR working planes concept is described in detail, discussing their placement in various coordinate systems and creation relative to the user or the world. The use of AR working planes for action and construction at a distance is then described, including the manipulation of existing objects and the placement of vertices to create new geometry. To perform operations using AR working planes, it is important that the plane is correctly aligned with the physical world to ensure the accurate capture of information. I demonstrate that an accurate way to perform this is by using the eye to align physical world features and therefore ensuring the body and head are correctly placed. 
Using the AR working planes technique developed in this chapter, the human is capable of performing interactions that are limited only by the accuracy of the tracking equipment in use and not by their lack of depth estimation capabilities.
Figure 3‑1 3D objects are projected onto a plane near the eye to form a 2D image
Humans gauge the distance to objects and their layout through visual cues acquired with the eyes, along with any other available senses such as sound, touch, and smell. The human sense of vision is unique in that it is capable of gathering information from a virtually infinite range of distances, whereas other senses tend to be useful only within close range. Human vision can be approximately modelled as a 2D array of pixels (similar to a video camera) gathering light to produce a 2D image representing the 3D environment. While horizontal and vertical placement of objects in the image is easily obtainable, depth is ambiguous due to the flattened representation of the image, as depicted in Figure 3‑1. Depth information can only be estimated by analysing the contents of the images captured. The eyes and brain process a number of vision cues that occur in images to determine the depth positioning of objects in the scene, and these cues are combined to improve accuracy. Drascic and Milgram [DRAS96] present a survey of perceptual issues in AR, discussing various depth cues and how mixed reality systems are limited in presenting them to the user. Cutting and Vishton [CUTT95] followed by Cutting [CUTT97] [CUTT02] provide detailed surveys on previous work in the area of perception and the determination of distance relationships between objects using visual cues. Cutting and Vishton collected results from a large number of previous studies and categorised nine cues (rejecting another six), describing the range they are accurate over and the kind of depth information that can be extracted. Not all visual cues can produce absolute measurement information however; some cues can only provide relative ratios between objects or simple ordering information. The nine cues described by Cutting and Vishton [CUTT95] are as follows:
· Occlusion – when objects at varying distances are projected onto the retina of the human, objects that are closer will overlap objects that are further away. This allows ordering information to be extracted and works over any distance, but cannot be used to form any absolute measurements.
· Relative size – by measuring the size of a projected image on the retina and knowing that two objects are of a similar size, both ordering and size ratios can be calculated but without absolute values. No prior knowledge of the objects’ size is required for this cue except that the objects are of the same size, and they can be placed at any visible distance.
· Relative density – this cue is similar to relative size and uses two similar objects (but of unknown sizes) and compares the density of textures that are placed on them. By comparing the texture patterns, ordering and size can be calculated, although absolute values are still not possible. This cue is also useable at any distance, assuming the objects are visible.
· Height in the visual field – this cue relies on gauging the distance of objects by comparing their heights relative to each other. Assuming the objects are all placed onto the ground plane, that the eye is at a known height, and that the ground is indeed flat, then this cue can produce absolute distance measurements. In most cases however, not all the previous conditions can be met and so only ordering is available. This cue is only effective from about 2 metres onwards as the human must be able to see the objects touching the ground plane.
· Aerial perspective – when objects such as mountains and buildings are at very large distances, environmental effects such as fog, lighting, and distortion begin to affect the image of objects. As distance increases, the image of objects becomes gradually more attenuated and so this can be used as a measure of distance. Since this cue is only effective at large distances, calculating absolute values based on the attenuation of objects may be difficult since they might not be easily visible.
· Motion perspective – when moving sideways through the environment, images of objects that are at a distance will move across the retina slower than closer objects, caused by perspective distortion. This cue attenuates over distance and works best when the eye can easily focus onto the objects in motion; therefore objects that are so close that they move by very quickly will be difficult to process. While absolute distances may be extracted given knowledge of the movement and height of the eye, motion perspective is best used for relative ratios and ordering.
· Convergence – when objects are in close range, the eyes will adjust their angle to point toward the object of interest. As the distance increases, the angle of the eyes gradually widens to the point where they are both looking in parallel directions when an object is at very large distances. Convergence requires knowledge of the distance between the eyes, and when used within a range of about two metres, this cue is able to give accurate absolute distance measurements.
· Accommodation – in order to perceive objects clearly, the eyes will focus the image by adjusting internal lenses controlled by a muscle, similar to a camera. This cue can be used to calculate distance for an object, and is usually combined with convergence. Accommodation operates up to approximately two metres, although the eye’s ability to focus deteriorates with age. Similar to convergence, absolute distance information within its limitations can be calculated.
· Binocular disparities – when two eyes are both focused on the same object, if it is within a close range the images presented to each eye will vary slightly. Using the eyes in stereo may capture depth information with absolute values, assuming that the distance between the eyes and the convergence are known, and that correspondences between points in the images can be found. This cue produces absolute distances from very close ranges and attenuates linearly as distance increases.
Cutting and Vishton also mention a number of other cues discussed in various literature, but eliminate them from consideration because they are based on the previously identified cues, or were not demonstrated as being effective during user studies. In normal daily life, the brain combines these cues together to produce situational awareness for the human. In VR environments, some of these cues can be simulated with the use of HMDs. HMDs can produce stereo images with offsets to match the distance between the eyes, and software can simulate fog and some environmental effects. While stereo HMDs give the user some feeling of depth perception, this is limited because the brain may be confused by inconsistencies in the sensory information normally acquired.
To summarise the various cues and their effectiveness at different distances, Cutting and Vishton produced a graph depicted in Figure 3‑2 that indicates the accuracy of each cue. This figure uses a log scale for distance along the X axis, and a normalised log scale along the Y axis with the smallest distance change measurable divided by distance. A value of 0.1 on the Y axis may indicate the ability to discern a 1 metre change at a distance of 10 metres, or a 10 metre change at a distance of 100 metres. Each of these curves is based on the data from numerous previously performed user studies and demonstrates that each cue is effective at different distances.
Based on their analysis of the nine available cues, Cutting and Vishton defined three separate spaces around the body at different distances to better categorise the depth estimation available. The first area defined is named personal space and extends from the body out to 2 metres. Personal space is where humans perform most of their close up interactions, and so depth perception is highly refined due to its importance in daily life. From 2 to 30 metres is a second area termed action space. In this space, users may interact reasonably accurately with other objects (such as throwing a ball to hit a target), but with fewer cues and less accuracy than in personal space. Beyond 30 metres is vista space, where objects appear flat and distance estimations become quite poor compared to closer spaces. Figure 3‑2 includes divisions showing where the three spaces are located relative to the accuracy curves previously described.
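The three spaces can be expressed as a simple classification over distance. The following is a minimal sketch only; the function name `depth_space` is hypothetical, and the treatment of the exact boundary values (2 metres counted as personal space, 30 metres as action space) is an assumption, since the text gives only approximate ranges.

```python
def depth_space(distance_m: float) -> str:
    """Name the space (after Cutting and Vishton) containing a distance in metres.

    Boundary handling is an assumption: 2 m falls in personal space
    and 30 m falls in action space.
    """
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if distance_m <= 2.0:
        return "personal"
    if distance_m <= 30.0:
        return "action"
    return "vista"
```

A building wall 45 metres away would be classified as vista space, which is the range where the depth cues described above become least reliable.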
Figure 3‑2 Normalised effectiveness of various depth perception cues over distance (Adapted from Cutting and Vishton [CUTT95])
Based on this discussion, it seems that a human’s ability to reconstruct 3D information about a scene is most capable when operating within close range to the body. Since humans mainly deal with objects that are within arm’s reach, this sense (named proprioception) is highly refined and was used by Mine to improve user interfaces for 3D environments [MINE97a]. At larger distances however, these abilities attenuate very rapidly to the point where beyond 30 metres or so it is difficult to perceive absolute distances. When modelling large outdoor structures such as buildings, distances of 30 metres or greater are quite common. If distances cannot be perceived accurately for the modelling tasks required, then performing action and construction at a distance operations will require extra assistance to be useable.
Figure 3‑3 Graph of the size in pixels of a 1m object on a HMD plane 1m from the eye
Previously described techniques such as working planes [MINE97a], selection apertures [FORS96], and image planes [PIER97] were developed to project the locations of display-based cursors onto a 3D environment. These techniques are useful for selection and manipulation operations in vista space because there are no restrictions on the range of use, and the techniques are just as easy to use within arm’s reach or kilometres away. Image planes and selection apertures are not capable of specifying distance however, since the plane from the view frustum is used for the cursors and depth is not required to be resolved. Assuming a typical perspective projection, the accuracy of vertical and horizontal motion in all of these techniques is proportional to the size and distance of the object, but attenuates at a constant rate less than that of human depth perception. With the use of HMDs, the cursor is represented using pixels and introduces a pyramid of uncertainty specified by the pixel size at the projection plane. To simplify this argument, I will ignore anti-aliasing effects that may occur when points and lines are drawn smoothly onto pixel arrays. Figure 3‑3 plots the effect of distance on the projection of a 1 metre object onto a Sony Glasstron PLM-700E HMD with pixels of size 0.618 mm at 1 metre from the eye (the derivation of this value is described in Section 3.8). From Figure 3‑3, it can be observed that the 1 metre object is not properly visible beyond approximately 1618 metres since it is less than a single pixel in size. An important property of interactive modelling is that users can only perform manipulations that are visually verifiable. There is no need to provide the user with the capability to move a mountain on the horizon 5 centimetres to the right because it is not visually noticeable. Only by approaching the object will the user notice any accuracy problems, and these can then be corrected since it is a change that can be verified. 
Based on this argument, the use of projection techniques imposes no accuracy limitations noticeable by the user.
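The single-pixel limit quoted above follows from similar triangles. The sketch below assumes a simple pinhole projection onto a plane 1 metre from the eye and uses the 0.618 mm pixel size stated in the text; the function names are hypothetical and anti-aliasing is ignored, as in the argument itself.

```python
PIXEL_SIZE_AT_1M = 0.000618  # metres; pixel size at 1 m quoted in the text


def projected_size_pixels(object_size_m, distance_m,
                          pixel_size_at_1m=PIXEL_SIZE_AT_1M):
    """Size in pixels of an object projected onto a plane 1 m from the eye."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    projected_m = object_size_m / distance_m  # similar triangles, plane at 1 m
    return projected_m / pixel_size_at_1m


def single_pixel_distance(object_size_m, pixel_size_at_1m=PIXEL_SIZE_AT_1M):
    """Distance at which the object shrinks to exactly one pixel."""
    return object_size_m / pixel_size_at_1m
```

For a 1 metre object, `single_pixel_distance(1.0)` gives roughly 1618 metres, matching the limit read from Figure 3‑3.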
Using the previously discussed projection concepts, these can be extended into the AR domain to perform interactive modelling outdoors. I have developed a concept named augmented reality working planes that is based on the working planes concept used in traditional CAD systems. AR working planes can be created in the environment relative to the user or other objects, and stored in one of four possible coordinate systems. These planes can then be used as a surface to project a 2D cursor on to, resolving full 3D coordinates to manipulate existing objects and create new vertices. Since planes are by definition infinite in size, the user can project the cursor onto the plane from almost any location, although the accuracy decreases as the plane becomes parallel to the user’s view. AR working planes improves on existing image plane-based techniques because the plane can be any arbitrary surface, allowing the calculation of depth at any distance and interaction in all three dimensions. This technique is also a mobile alternative to desktop CAD systems because the 3D view and working planes can be specified using the body in the physical world. To control the cursor projected against the AR working plane, any 2D input device can be used. The cursor is projected onto the surface of the plane and so no depth information is required, allowing a wide range of input devices to be used. Chapter 5 focuses on the implementation of a mobile input device suitable for use with AR working planes.
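Resolving a 2D cursor to a 3D point on a working plane amounts to a ray-plane intersection. The following is a minimal sketch under assumptions not stated in the text: the cursor has already been converted into a world-space ray from the eye, the plane is stored as a point and a normal, and the name `project_cursor` is hypothetical.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_cursor(eye, ray_dir, plane_point, plane_normal, eps=1e-9):
    """Intersect the cursor ray with an infinite plane.

    Returns the 3D intersection point, or None when the ray is (nearly)
    parallel to the plane or the plane lies behind the eye.
    """
    denom = _dot(plane_normal, ray_dir)
    if abs(denom) < eps:
        return None  # plane is edge-on to the view; no usable intersection
    to_plane = tuple(p - e for p, e in zip(plane_point, eye))
    t = _dot(plane_normal, to_plane) / denom
    if t < 0:
        return None  # intersection is behind the viewer
    return tuple(e + t * d for e, d in zip(eye, ray_dir))
```

As `denom` approaches zero the intersection point moves rapidly for small cursor changes, which corresponds to the loss of accuracy noted above when the plane becomes parallel to the user's view.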
The use of AR working planes does impose some limitations on the user, and requires them to specify distance by creating a plane and then drawing against it from a different direction. Two operations from separate locations and orientations are usually required so that depth can be extracted without requiring the user to estimate it. While Chapter 2 reviewed previous research by Liang and Green that indicated that decomposing 3D tasks into 1D or 2D units was not efficient [LIAN93], in the scenario of working in vista space there is no alternative. In support of my argument, Ware [WARE88] and Hinckley [HINC94a] both state that reducing degrees of freedom is useful when it is hard to maintain precision in certain degrees while adjusting others. In vista space, depth estimation is poor and so removing this degree of freedom is the best option to preserve accuracy.
In CAD systems, working planes can be placed in the environment using exact numeric keyboard entry, by drawing the plane’s cross section from a perpendicular view, or by selecting another object’s facet [MINE97b]. The first two cases may be difficult and unintuitive because people think in terms of objects relative to their body rather than abstract coordinate systems and view points. My extensions to working planes for AR can create these planes using the user’s body, making them much more intuitive to use when operating outdoors. An important improvement is that these AR working planes can be created and fixed to a number of coordinate systems that humans intuitively understand.
Figure 3‑4 Coordinate systems used for the placement of objects at or near a human
Feiner et al. discuss the presentation of information in AR displays and how this information can be in surround-fixed, display-fixed, or world-fixed coordinates [FEIN93b]. As the user moves around the virtual environment, information in each coordinate system will be displayed differently. By selecting an appropriate coordinate system for each type of information, it can be more intuitively understood by users. Mine and Brooks discuss the placement of tools such as menus and tool palettes relative to the body, and how the user can find these easily since they are carried around relative to the user [MINE97a]. Using these concepts, a number of different coordinate systems can be identified that are suitable for performing modelling tasks, as depicted in Figure 3‑4. I have named these coordinate systems world, location, body, and head. In Figure 3‑4, the user operates in a world coordinate system that is anchored to some fixed point in the physical world. Using a positioning device, location coordinates are measured relative to world coordinates and represent the location of the user’s feet but without direction. Using an orientation sensor mounted on the hips, body-relative coordinates can be calculated by applying an offset to transform from the feet to the hips and then applying the orientation. Head-relative coordinates are similarly calculated with the appropriate height and orientation of the user’s head. The height values used for body and head coordinates can be either measured once and stored as a constant, or captured from a tracking device. I have only identified these coordinate systems as the main ones of importance for this research, but there are many others if appropriate tracking devices are available.
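The chain of coordinate systems can be sketched as a set of transforms into world coordinates. This is an illustration only, with several assumptions that are not in the text: a z-up world frame, orientation reduced to a single heading angle (yaw, in radians, counter-clockwise about the vertical axis), and hypothetical function names.

```python
import math


def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def _rotate_yaw(p, yaw):
    """Rotate p about the vertical (z) axis by yaw radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)


def location_to_world(p, user_pos):
    """Location-relative: translated with the user's feet, never rotated."""
    return _add(p, user_pos)


def body_to_world(p, user_pos, hip_height, hip_yaw):
    """Body-relative: hip orientation, then the offset from feet to hips."""
    at_hips = _add(_rotate_yaw(p, hip_yaw), (0.0, 0.0, hip_height))
    return _add(at_hips, user_pos)


def head_to_world(p, user_pos, head_height, head_yaw):
    """Head-relative: head orientation, then the offset from feet to head."""
    at_head = _add(_rotate_yaw(p, head_yaw), (0.0, 0.0, head_height))
    return _add(at_head, user_pos)
```

A point stored one metre ahead in head coordinates turns with the head, while the same point stored in location coordinates keeps its compass bearing regardless of where the user looks.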
Information can be stored relative to any of the coordinate systems described in Figure 3‑4. The surround-fixed windows by Feiner et al. map to body-relative, display-fixed windows map to head-relative, and world-fixed windows map to world-relative. The menus and tool palettes floating about the user implemented by Mine and Brooks map to body-relative. Using the coordinate systems defined here, I extend the concepts of Feiner et al. to include not only the presentation of information, but also the placement of AR working planes so that points may be created and objects manipulated at a distance. This section describes AR working planes that have been created relative to each of the four coordinate systems and the effect that user motion has on the created planes. Although body-relative coordinates are described here, they are not implemented in later chapters since no sensor is used to measure body rotation, and are included only for comparisons to existing work. Based on the orientation and position sensors that I have used, figures are used to show the effect on each AR working plane of body translation, head rotation, or combination movements in the environment.
Figure 3‑5 World-relative AR working planes remain fixed during user movement (panels: Translate, Head Rotate, Translate / Head Rotate)
World coordinates are the top-level coordinate system used to represent positions over a planet or other large areas of interest. Objects that are specified relative to the origin of the world coordinate system are anchored to a fixed place in the physical world, and are completely independent of the user’s motion, as depicted by (1) in Figure 3‑4. In virtual environments, most objects are created world-relative since they are not attached to the user and may move independently, with examples being buildings, trees, and automobiles. The user’s coordinate systems are also specified in world coordinates, since their position and orientation are returned from tracking devices that are world-relative. Figure 3‑5 depicts a user moving in the environment with the AR working plane remaining fixed since it is in coordinates independent of the user. World-relative AR working planes are commonly used when working with buildings and the user desires to keep the planes fixed relative to the walls at all times.
Figure 3‑6 Location-relative AR working planes remain at the same bearing from the user and maintain a constant distance from the user (panels: Translate, Head Rotate, Translate / Head Rotate)
Location coordinates are derived by taking the current position of the user from a tracking device and adding this to the origin of the world coordinate system. The axes for both location and world coordinates are still aligned except there is a translation offset between the two, as depicted by (2) in Figure 3‑4. With location coordinates the orientation of the user has not been applied, and so any changes in rotation will have no effect. An object placed in location-relative coordinates will always appear at the same true compass bearing from the user and maintain the same distance during motion. Location-relative coordinates are particularly useful for displaying an immersive compass to the user - the compass labels are attached around the user at a fixed radius and stay at the same orientation no matter what direction the user is looking. Another use is to attach a virtual camera at a fixed distance and direction from the user at all times, which follows the user’s location but does not move with head or body rotation. Figure 3‑6 depicts the effects of user motion on an AR working plane that is location-relative, where the plane moves with the user around the world. With user translation the plane moves with the same transformation, but rotation has no effect. The main uses for location-relative coordinate systems are the placement of vertices and object manipulation at fixed orientations. These fixed orientations are useful when working with buildings, keeping the AR working plane parallel to the walls but still moving relative to the user.
Figure 3‑7 Body-relative AR working planes remain at a fixed orientation and distance to the hips and are not modified by motion of the head (panels: Translate, Head Rotate, Translate / Head Rotate)
Although it is possible to define any number of coordinate systems, this will not be performed since in many cases it does not make sense to create objects relative to arbitrary parts of the body. A user’s sense of proprioception is focused about its main components such as the hips and the head, and so these will be the main focus. Body coordinates are defined relative to location coordinates except that orientation of the hips is added, as depicted by (3) in Figure 3‑4. Objects placed in body-relative coordinates will always appear in the same location relative to the hips as the user moves around, with a good example being a tool belt worn by a worker. When walking around or when moving the head, the tool belt always remains in the same fixed position, ready to be accessed by the hands. Body-relative differs from location-relative in that the rotation of the hips affects the attached objects, whereas location-relative ignores any rotations by the user. The cockpit of an aircraft is also similar, where controls are always at the same location relative to the user’s hips but the aircraft can fly around and keep the controls mapped to the same locations. Figure 3‑7 depicts the effects of user motion of the body on an AR working plane that is body-relative, where the plane is attached to the hips of the user as they move around the world. Although body coordinates are very intuitive within arm’s reach due to proprioception, they become more confusing at further distances since extra visual inspection is usually required. Some possible uses for body-relative coordinate systems are the placement of tools on a belt for easy access and display of non-critical status information.
Head-relative coordinates are similar to body-relative in that they add rotations to the location-relative coordinates, and can be defined relative to either location or body coordinates, as depicted by (4) in Figure 3‑4. The only difference between head-relative and body-relative coordinates is the part of the body that the information is attached to. Objects placed in head coordinates will always appear in the same location relative to the user’s head, with a good example being a floating status indicator on a HMD. No matter what the position or orientation of the user, the status indicator will always be visible at the same location. Figure 3‑8 depicts the effects of user motion of the head on an AR working plane that is head-relative, where the plane is attached to the head of the user as they move around the world. When the user moves through the world, the plane will be translated and rotated to remain fixed within the field of view. The main use for head-relative coordinate systems is the placement of display status information and object manipulation. Head-relative mode is the most natural choice for object movement since it allows the user to adjust all three degrees of freedom by moving the body.
Figure 3‑8 Head-relative AR working planes remain attached to the head during all movement, maintaining the same orientation and distance to the head (panels: Translate, Head Rotate, Translate / Head Rotate)
In order to take advantage of AR working planes, the plane must first be created in the environment. During creation, AR working planes must be located in one of the coordinate systems defined earlier, which will affect the operations that can be performed. This section discusses different methods of creating planes that may then be used for manipulation and vertex creation.
Figure 3‑9 depicts a user creating a plane originating from the user’s head, parallel to the direction that the head is viewing. If the user is viewing in the direction of true north, then the plane will be infinite in the north and south directions, with east and west divided by the plane. Constraints may be applied to the orientation of the head so that only some degrees of freedom are used to create the plane. Since AR working planes are only useful when facing the user for cursors to be projected onto them, the user must be able to move independently of the plane to new viewing locations. This method is only relevant with world-relative coordinates since the plane is decoupled from the user’s motion.
Figure 3‑9 AR working plane created along the head viewing direction of the user
Figure 3‑10 AR working plane created at a fixed offset and with surface normal matching the view direction of the user
Figure 3‑10 depicts a user creating a plane that is located at a fixed distance away and with surface normal matching the user’s view direction. If the user is viewing in the direction of true north, the plane will have a surface normal pointing north and be infinite in the east and west directions. Constraints may be used to restrict the degrees of freedom of the orientation of the head for creating the plane. Since the plane is facing the user it is ready to draw on and is suitable for use with all coordinate systems defined previously. The limitation of these planes is that the distance from the user must be specified with another input method, and the user may not be able to perform this accurately.
This technique is very similar to the previous in that the plane’s surface normal is based on the user’s view direction. The difference is that the plane is created so that it passes through the intersection point a user has selected on an object in the world. Figure 3‑11 depicts a user creating a plane at the intersection point of an object. These planes are most useful when created in head-relative coordinates for manipulation operations, although any other coordinate system is also possible.
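The three head-based creation methods described so far can be sketched as constructors that return a plane as a (point, normal) pair. This is a hedged illustration: the z-up world frame, the restriction of the first method to a vertical plane (one possible orientation constraint), and all function names are assumptions rather than details taken from the text.

```python
import math

UP = (0.0, 0.0, 1.0)  # assumed z-up world frame


def _normalise(v):
    length = math.sqrt(sum(c * c for c in v))
    if length == 0:
        raise ValueError("cannot normalise a zero-length vector")
    return tuple(c / length for c in v)


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plane_along_view(head_pos, view_dir):
    """Vertical plane through the head that contains the view direction."""
    normal = _normalise(_cross(_normalise(view_dir), UP))
    return head_pos, normal


def plane_facing_user(head_pos, view_dir, distance):
    """Plane at a fixed distance ahead, with normal matching the view direction."""
    n = _normalise(view_dir)
    point = tuple(h + distance * c for h, c in zip(head_pos, n))
    return point, n


def plane_through_point(hit_point, view_dir):
    """Plane through a selected intersection point, normal from the view direction."""
    return hit_point, _normalise(view_dir)
```

With x pointing east and y pointing north, a user looking north produces an east-facing normal from `plane_along_view`, so the plane divides east from west as described for Figure 3‑9.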
Figure 3‑12 depicts a plane created to match the surface of a nominated facet on an object. Each of the objects has an AR working plane that is coincident with the selected facet, making it invariant to the user’s current position and orientation. As long as the object facet is visible and can be selected, then it can be used to spawn an AR working plane in the environment. Since the plane is created visible to the user it is immediately ready to draw on and is suitable for use with all coordinate systems. World coordinates are the most logical usage however, since the planes are defined relative to an object that is typically in world coordinates. Uses for other coordinate systems are discussed in the next sections.
Figure 3‑11 AR working plane created at intersection of cursor with object, and normal matching the user’s view direction
Figure 3‑12 AR working plane created relative to an object’s surface
Using a similar technique to that discussed previously, the facet of an object may supply a surface normal for an AR working plane created at another object. Figure 3‑13 depicts a plane created at the point where the user’s cursor projection intersects an object in the environment. The surface normal is copied from an object selected previously with the same method. This technique is useful for manipulating objects relative to the surfaces of others and so is the most logical with world-relative coordinates, although other coordinate systems are possible as well.
Figure 3‑13 AR working plane created at a nominated object based on the surface normal of another reference object
Figure 3‑14 Manipulation of an object along an AR working plane surface
Figure 3‑15 Depth translation from the user moving a head-relative AR working plane
Figure 3‑16 AR working plane attached to the head can move objects with user motion
Figure 3‑17 Scaling of an object along an AR working plane with origin and two points
Figure 3‑18 Rotation of an object along AR working plane with origin and two points
Given the ability to place AR working planes in the environment, one possible use is the implementation of translate, scale, and rotate operations. The first step is to create a working plane in the environment using one of the previously described techniques relative to an appropriate coordinate system. The choice of coordinate system determines the type of operations that can be performed. When using AR working planes in head coordinates, these techniques share similar properties to selection using image planes [PIER97].
Translation operations where the object is accurately moved across the AR working plane surface can be performed as shown in Figure 3‑14. Two points are projected onto the plane and are used to calculate a translation. This translation is then applied to the object to move it to the desired location, with the offset always being along the surface of the plane. If the AR working plane is attached to the location, body, or head then varying the user’s position will drag the object around, as depicted in Figure 3‑15. When using body or head coordinates, translations and rotations can be combined together, such as depicted in Figure 3‑16. By combining these techniques with cursor motion along an AR working plane, complex manipulation operations can be performed.
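The translation step can be sketched as follows, assuming both cursor positions have already been projected onto the same AR working plane so that their difference lies along the plane surface. The function name and the representation of an object as a list of vertex tuples are hypothetical.

```python
def translate_on_plane(vertices, start, end):
    """Move vertices by the offset between two points projected onto a plane.

    Because start and end both lie on the plane, the offset is always
    along the plane surface.
    """
    offset = tuple(e - s for s, e in zip(start, end))
    return [tuple(v + o for v, o in zip(vertex, offset)) for vertex in vertices]
```

For a plane at a constant depth of 5 units, dragging from (0, 0, 5) to (2, 3, 5) shifts every vertex by (2, 3, 0) and leaves its depth unchanged.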
Scaling operations can be performed along the surface of an AR working plane and require three input points – an origin for the scaling operation, and two points to specify a direction and magnitude vector. The two cursor points are used to calculate a new scaling transformation relative to the origin, which is then applied to the object, as depicted in Figure 3‑17.
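The text does not specify how the two cursor points map to a scale factor. One plausible reading, used in the sketch below purely as an assumption, is a uniform scale equal to the ratio of the two points' distances from the origin. The function name is hypothetical.

```python
import math


def scale_about_origin(vertices, origin, p1, p2):
    """Uniformly scale vertices about origin by |p2 - origin| / |p1 - origin|.

    The uniform-ratio interpretation is an assumption, not the thesis method.
    """
    d1 = math.dist(origin, p1)
    if d1 == 0:
        raise ValueError("first cursor point must differ from the origin")
    factor = math.dist(origin, p2) / d1
    return [tuple(o + factor * (v - o) for v, o in zip(vertex, origin))
            for vertex in vertices]
```

Dragging the cursor from one unit to two units away from the origin would double the size of the object about that origin.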
Figure 3‑19 Vertices are created by projecting the 2D cursor against an AR working plane
Figure 3‑20 AR working plane attached to the head can create vertices near the user
Rotation operations can be performed about the surface normal of an AR working plane with three input points – an origin for the axis of rotation, and two points to specify an angle. The two cursor points are used to calculate a new rotation transformation relative to the axis of rotation and then applied to the object, as depicted in Figure 3‑18.
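A sketch of this rotation follows, assuming the plane normal is a unit vector, that the angle is measured from the first cursor point to the second about the origin, and that Rodrigues' rotation formula is used to apply it. The function names are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def signed_angle(origin, p1, p2, normal):
    """Angle in radians from (p1 - origin) to (p2 - origin) about the unit normal."""
    a, b = _sub(p1, origin), _sub(p2, origin)
    return math.atan2(_dot(normal, _cross(a, b)), _dot(a, b))


def rotate_about_normal(point, origin, normal, angle):
    """Rotate point about the axis through origin along the unit normal."""
    v = _sub(point, origin)
    c, s = math.cos(angle), math.sin(angle)
    n_cross_v = _cross(normal, v)
    n_dot_v = _dot(normal, v)
    # Rodrigues' formula: v*cos + (n x v)*sin + n*(n . v)*(1 - cos)
    return tuple(o + vi * c + nv * s + n * n_dot_v * (1 - c)
                 for o, vi, nv, n in zip(origin, v, n_cross_v, normal))
```

Components of a vertex along the plane normal are unchanged by the rotation, so an object is turned within the plane without drifting away from it.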
The second more novel use for AR working planes is the placement of points in the environment. Selection and manipulation of existing objects has been implemented previously using a number of techniques, but there is still a lack of techniques for the creation of new geometry at a distance. Figure 3‑19 depicts how a user can project the cursor against an AR working plane and create vertices anywhere on the surface. Similar to the previous object manipulation section, this operation can be performed using an AR working plane in any coordinate system and created using any technique.
Apart from just creating points against fixed surfaces, if the AR working plane is relative to user coordinates then it will move with the motion of the user, as depicted in Figure 3‑20. As the user translates and rotates, the AR working plane will also move and points will be created in world coordinates against the current surface. While this technique may be used to create complex collections of vertices, this can be tedious for many objects. Chapter 4 will introduce techniques designed to simplify the creation of object geometry given certain assumptions.
When creating AR working planes using the position and orientation of the body, it is important that the user be as accurately placed as possible. To create vertices specifying the outline of a building, working planes must be created that are in alignment with the walls. Any errors in the placement of the working planes will cause projected vertices to deviate from the true physical wall surface.
The eye is an incredibly accurate measuring device that can notice even minute shifts between two objects that are in alignment. While fishing offshore with my father, I was shown how to look at large features along the coastline such as hills, towers, and buildings. When a fishing spot was discovered that we would like to come back to, my father would look along the shore to find landmarks that were visually aligned. After selecting aligned landmarks, these would then be recorded in his diary, producing a diagram similar to the example in Figure 3‑21. Lining up two landmarks would place the boat along a particular bearing, and then lining up a further two landmarks along another bearing would fix the position of the boat down to the intersection of the two lines. We could accurately find previous fishing spots within a few metres accuracy without the use of any tools except visual inspection using the eye. Even when using his GPS unit, my father would only use it to get within its 5-10 metre accuracy and then use line of sight techniques to improve the position of the boat. The alignment of landmarks varies even when walking around the boat and so performing measurements from the same seating position is required to achieve the best accuracy. The main difficulty with this technique is that it is limited to spots where landmarks can be found to align. Books for amateurs by Pescatore and Ellis [PESC98] and the web site by Poczman [POCZ97] are examples of collections of fishing locations around Adelaide marked using this technique.
Bowditch describes similar techniques used by professional sailors when navigating close to shore [BOWD02]. Figure 3‑22 shows the placement of official navigational aids named range lights, which are used to indicate safe channels that boats can travel along. In many harbours there are obstacles that can easily damage ocean vessels, and so by keeping the range lights aligned the navigator can keep the ship very accurately in the marked channel without straying off course.
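The two-transit position fix described above, where each aligned pair of landmarks defines a line and the observer sits at their intersection, can be sketched as a 2D line intersection. The flat local coordinate frame and the function name `transit_fix` are assumptions for illustration.

```python
def transit_fix(a1, a2, b1, b2, eps=1e-12):
    """Intersect the line through landmarks a1, a2 with the line through b1, b2.

    Each argument is an (x, y) position in a flat local frame. Returns the
    observer position, or None if the two transit lines are parallel.
    """
    dax, day = a2[0] - a1[0], a2[1] - a1[1]
    dbx, dby = b2[0] - b1[0], b2[1] - b1[1]
    denom = dax * dby - day * dbx
    if abs(denom) < eps:
        return None  # parallel transits give no unique fix
    t = ((b1[0] - a1[0]) * dby - (b1[1] - a1[1]) * dbx) / denom
    return (a1[0] + t * dax, a1[1] + t * day)
```

The fix is best conditioned when the two transit lines cross at close to a right angle; nearly parallel transits make `denom` small and amplify any alignment error.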
Figure 3‑21 Example fishing spot marked using various shore-based landmarks (Sketch courtesy of Spishek Piekarski)
Figure 3‑22 Example of range lights in use to indicate location relative to a transit bearing (Adapted from Bowditch [BOWD02])
The alignment of landmarks can also be performed using a video-based HMD but with reduced accuracy compared to the eye since the resolution is much lower. According to Rose, the human eye has the capability to resolve single dots at approximately 1-2 minutes of arc [ROSE73], although the brain is capable of achieving resolutions at least one order of magnitude higher by processing the image further. In comparison, the HMD described in this section has a resolution of approximately 2 minutes of arc with no further enhancements possible. This section uses the easily measurable properties of a HMD to simplify the calculations, as the human vision system contains a wide range of processing that is not fully understood and difficult to model.
Figure 3‑23 Sony Glasstron HMD measured parameters and size of individual pixels
This section uses geometry to prove that the alignment of landmarks is useable with a video overlay mobile AR system to assist with the specification of planes in the environment. With landmark alignment, the creation of planes is limited mainly by the tracking equipment and not by the user’s perceptive capabilities. Using the known parameters of a HMD and using the distance to two marker objects from the user, the maximum sideways translation the user can move without observing a change in alignment can be modelled. To simplify the calculations, subtle visibility effects that occur at sub-pixel levels when two objects appear to visually interact with each other will be ignored.
Figure 3‑23 depicts the layout for a Sony Glasstron PLM-700E HMD, which has a resolution of 800x600 pixels projected onto a focal plane 1.25 metres from the user’s eye. The perceived image has approximate measured dimensions of 0.618 metres by 0.464 metres at the focal plane. Given this layout information, the size of each pixel may be calculated using similar triangles. Each pixel is approximately square and so is 0.773 millimetres in width and height at 1.25 metres. At a normalised focal distance of 1 metre, the pixels are 0.618 millimetres in width and height. Since each pixel is assumed to be square, normalised horizontal and vertical sizes are both represented using D.
If a landmark at some distance is to be visible on the HMD, it must be projected onto at least one pixel (or a significant portion of a pixel) on the display. Given the previous distance of 0.618 mm for a pixel at one metre, this can be extended out for any distance with similar triangles. For example a 100 metre distant marker must be 61.8 mm wide to be visible as a single pixel on the HMD. Using this concept, a diagram of similar triangles can be drawn (see Figure 3‑24) with a marker A of width aD at distance a, and marker B of width bD at distance b. The minimum required size of these markers is proportional to the distance from the HMD.
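The similar-triangle arithmetic above can be checked with a short sketch. The constants are the measured values quoted in the text; the function names are hypothetical and chosen only for illustration.

```python
import math

# Measured values quoted in the text for the Sony Glasstron PLM-700E.
IMAGE_WIDTH_M = 0.618     # perceived image width at the focal plane
FOCAL_DISTANCE_M = 1.25   # distance from the eye to the focal plane
HORIZONTAL_PIXELS = 800


def pixel_size_at_focal_plane():
    """Width of one pixel at the focal plane, in metres."""
    return IMAGE_WIDTH_M / HORIZONTAL_PIXELS


def normalised_pixel_size():
    """Pixel width per metre of viewing distance (the constant D)."""
    return pixel_size_at_focal_plane() / FOCAL_DISTANCE_M


def min_marker_width(distance_m):
    """Smallest marker width in metres that covers one pixel at this distance."""
    return normalised_pixel_size() * distance_m


def pixel_angle_arcmin():
    """Angle subtended by a single pixel, in minutes of arc."""
    return math.degrees(math.atan(normalised_pixel_size())) * 60.0


if __name__ == "__main__":
    print(f"pixel at focal plane: {pixel_size_at_focal_plane() * 1000:.4f} mm")
    print(f"D at 1 metre:         {normalised_pixel_size() * 1000:.3f} mm")
    print(f"marker at 100 metres: {min_marker_width(100) * 1000:.1f} mm")
    print(f"angle per pixel:      {pixel_angle_arcmin():.2f} arcmin")
```

Only the horizontal dimension is used here, since the text assumes square pixels. The angle per pixel comes out slightly above 2 minutes of arc, in line with the approximate HMD resolution stated earlier.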
Figure 3‑24 Distant landmarks must be a minimum size to be visible on a HMD
Figure 3‑25 Dotted lines indicate the angle required to separate the two marker’s outlines
When the user and both markers are in line, there will be an exact overlap between the markers, and when viewed separately, each will form an image on the HMD that is exactly the same size. If the user moves sideways any distance at all, the objects will no longer overlap exactly (as in Figure 3‑22) and will appear to gradually separate. The goal of the following calculations is to estimate d from Figure 3‑25, the distance the user must move so that the projections of both markers no longer overlap (with a small gap between), ensuring visibility on the HMD. The distance d also represents the error in positioning possible using line of sight techniques, and can help to analyse their usefulness. The diagonal dotted line in Figure 3‑25 depicts the line that the user must look along to notice a separate pixel for each marker. To simplify the calculations, I assume that the markers appear on the display small enough that the geometry can be treated as straight lines rather than arcs. This is possible given the small size of the pixels in millimetres and the large distance of the markers in metres. Based on the dotted lines from Figure 3‑25, Figure 3‑26 depicts the arrangement of the similar triangles that need to be solved to estimate the error distance.
Figure 3‑26 Similar triangles used to calculate final positioning error function
Table 3‑1 Alignment accuracies for markers at various distances from the user
Using the similar triangles in Figure 3‑26, the equations can be derived to calculate a final equation that represents the accuracy d of this technique, shown in Figure 3‑27. This final error equation is useful because it allows a simple analysis of the accuracy of landmark alignment over a variety of distances. A single constant is used to linearly scale the equation depending on the pixel size calculated earlier. As the markers both approach the same distance, the error of the technique increases rapidly due to an asymptote in the function when a=b. However, when the markers are sufficiently spaced apart from each other, the accuracy of the technique is quite incredible considering the distances involved. Table 3‑1 demonstrates this with the accuracies achieved using markers placed at different distances from the user.
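The equation in Figure 3‑27 is not reproduced in this text. The form below is a reconstruction under the small-angle assumption stated earlier, not a copy of the figure. It assumes the two markers separate once a sideways movement d shifts their projections apart by one normalised pixel width D, with marker A at distance a and marker B at distance b. It has the asymptote at a=b described above, and it reproduces the 18.54 centimetre value quoted later in this section for markers at 100 and 150 metres.

```latex
\[
\frac{d}{a} - \frac{d}{b} = D
\quad\Longrightarrow\quad
d = \frac{a\,b\,D}{b - a}, \qquad b > a
\]
\[
a = 100,\; b = 150,\; D = 0.000618:
\qquad
d = \frac{100 \times 150 \times 0.000618}{50} = 0.1854\ \text{m}
\]
```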
Figure 3‑27 Rearrangement and simplification of final positioning error equation
Figure 3‑28 Derivation of alignment equation when marker B is at an infinite distance
When working in a 3D environment, the tracking hardware will also impose limitations on the measuring accuracy of the system. If the landmark alignment is more accurate than that of the tracking hardware, it will be adequate for the required modelling task. Figure 3‑29 depicts a 3D surface with contour lines for the accuracy equation and is capped at the 2 cm limit of a Real-Time Kinematic GPS unit. Figure 3‑30 depicts a similar 3D surface restricted to the 50 cm accuracy obtainable from a high quality differential GPS unit. The sloped regions indicate distances where the accuracy of the technique is within the performance of the respective GPS units. As an example, using a building with corners at 100 metres and at 150 metres, this gives an accuracy of 18.54 centimetres that is within the accuracy of a 50 cm high quality GPS unit. This accuracy is quite poor compared to the 2 cm accuracy of an RTK GPS unit however, and to achieve accuracies better than 2 cm the near corner must be closer than 22 metres (therefore the far marker must be closer than 72 metres). These values may be calculated using the equation in Figure 3‑27.
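The worked values in this paragraph can be checked numerically. The sketch below assumes the error equation has the form d = abD / (b - a), which is a reconstruction that matches the quoted figures rather than the equation printed in Figure 3‑27. The function names and the 50 metre marker spacing argument are illustrative.

```python
import math

D = 0.000618  # normalised pixel size in metres per metre, from the text


def alignment_error(a, b, pixel=D):
    """Sideways movement in metres before markers at distances a and b separate.

    Assumed form: d = a * b * pixel / (b - a), taking a as the nearer marker.
    """
    if a <= 0 or b <= 0:
        raise ValueError("marker distances must be positive")
    if a == b:
        return math.inf  # asymptote: markers at the same distance never separate
    near, far = sorted((a, b))
    return pixel * near * far / (far - near)


def max_near_distance(target_error, spacing, pixel=D):
    """Largest near-marker distance that still meets target_error.

    Solves pixel * a * (a + spacing) / spacing = target_error for a.
    """
    return (-spacing + math.sqrt(spacing ** 2 + 4 * target_error * spacing / pixel)) / 2


if __name__ == "__main__":
    print(f"{alignment_error(100, 150) * 100:.2f} cm")   # building corners at 100 m and 150 m
    print(f"{max_near_distance(0.02, 50):.1f} m")        # RTK limit with corners 50 m apart
```

With corners 50 metres apart, the 2 cm limit is reached with the near corner at a little over 22 metres, which agrees with the distances given in the text.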
Figure 3‑29 3D surface plot with marker distances achieving alignment accuracy of 2 cm
Another property of landmark alignment is that as one landmark approaches an infinite distance, the slope of the 3D surface begins to match a linear approximation, most visible in Figure 3‑30. This slope is then only controlled by the distance of the closer marker, and as it moves toward the user the technique becomes more accurate. This property is useful when working with very long buildings at a close distance for example, where the further marker is so distant that only the close marker affects the accuracy of the technique. The equation in Figure 3‑27 can be rewritten into Figure 3‑28 to calculate the error in this case by using a limit with marker B approaching an infinite distance.
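The limit can be written out explicitly. Figure 3‑28 is not reproduced in this text, so the starting point below is an assumed form of the error equation, d = abD / (b - a), chosen because it agrees with the values quoted in this section. The symbols a, b, and D follow the definitions used earlier.

```latex
\[
\lim_{b \to \infty} \frac{a\,b\,D}{b - a}
= \lim_{b \to \infty} \frac{a\,D}{1 - \dfrac{a}{b}}
= a\,D
\]
```

Under this form, a near marker at 20 metres with D = 0.000618 gives an error of about 1.2 centimetres, regardless of how distant the far marker is.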
The previously discussed equations and graphs show that by visually aligning landmarks through a HMD, very accurate positioning of the body can be obtained. While a human’s ability to perceive depth rapidly attenuates as distance increases, the landmark alignment process is highly accurate over any distance given visible markers at a suitable distance apart.
Figure 3‑30 3D surface plot with marker distances achieving alignment accuracy of 50 cm
Existing techniques for VR have been developed mainly to solve the problem of manipulating existing virtual objects at a distance, and do not address the problem of creating new vertices and geometry that are out of arm’s reach. This chapter demonstrated that a human’s ability to perceive depth rapidly attenuates in the vista space beyond 30 metres, making it difficult to correctly specify distances. Accurate depth specification is required to perform the modelling of large outdoor structures and these are almost always within vista space, and so suitable techniques are required to overcome the limitations of humans. I developed the concept of augmented reality working planes based on previously developed CAD and VR systems, performing the projection of 2D cursors onto 3D surfaces to specify depth information. AR working planes restricts degrees of freedom that the user is not capable of specifying accurately, and breaks the operation into logical tasks that can be easily understood by the user. AR working planes can be created using a number of methods, stored relative to world, location, body, and head coordinates, and used for object manipulation and vertex placement. By implementing working planes in AR, I take advantage of features that are only possible with the physical presence of the user in the environment. By using accurate positioning based on the alignment of objects in the environment, operations can be performed at large distances with only minor accuracy degradation caused by the user. AR working planes is a core concept for outdoor modelling used to support action at a distance, and the construction at a distance concept introduced in the next chapter.
Chapter 4 - Construction at a distance
“Applying computer technology is simply finding the right wrench to pound in the correct screw”
Anonymous
This chapter presents a series of new augmented reality user interaction techniques to support the capture and creation of 3D geometry of large outdoor structures. I have termed these techniques construction at a distance, based on the action at a distance term used by other researchers. My techniques address the problem of AR systems traditionally being consumers of information, rather than being used to create new content. By using information about the user’s physical presence along with hand and head gestures, AR systems may be used to capture and create the geometry of objects that are orders of magnitude larger than the user, with no prior information or assistance. While existing scanning techniques can only be used to capture existing physical objects, construction at a distance also allows the creation of models that exist only in the mind of the user. Using a single AR interface, users can enter geometry and verify its accuracy in real-time. Construction at a distance is a collection of 3D modelling techniques based on the concept of AR working planes, landmark alignment, CSG operations, and iterative refinement to form complex shapes. The following techniques are presented in this chapter: street furniture, bread crumbs, three types of infinite planes, CSG operations, projection carving, projection colouring, surface of revolution, and texture map capture. These techniques are demonstrated with a number of examples of real objects that have been modelled in the physical world, demonstrating the usefulness of the techniques.
Current research in AR applications (as discussed previously in Chapter 2) has focused mainly on obtaining adequate tracking and registration, and then developing simple interfaces to present display information to the user. One important problem that has not been properly addressed is the authoring of the content that is displayed to the user. Since most AR systems are being used simply as a visualisation tool, the data is prepared offline with standard editing tools and then transferred to the AR system. Brooks states that one of the still unsolved problems in VR is the creation and capture of 3D geometry [BROO97], which is also relevant for AR models. To develop content for AR systems, I have developed a number of techniques I collectively termed construction at a distance. These techniques use the AR system itself to capture the 3D geometry of existing structures in the physical world, and create new 3D models for virtual objects that do not yet exist. Construction at a distance makes use of the AR working planes and landmark alignment techniques discussed previously in Chapter 3, and builds higher-level operations to perform the capture and creation of 3D models. This section describes some of the main features of my modelling techniques: supplementing existing physical capture techniques, working at a fixed scale in the environment, taking advantage of the previously defined AR working planes, performing iterative refinement of models, and the use of simplified operations to avoid the entry of vertices where possible.
The purpose of these techniques is not to replace existing object capture methods discussed in Chapter 2, which are highly accurate and can produce excellent results given the correct conditions. When working in unfavourable conditions, construction at a distance may be able to overcome these and capture a 3D model. Some examples of where my techniques are most useful are:
· I use a human operator that is capable of accurately estimating the geometry of planar shapes, even when partially occluded by other objects in the environment. When trees occlude the edges of a building, a human can estimate the layout based on the image information available and other knowledge.
· The eye is a highly accurate input device capable of aligning along the walls of buildings within the limitations discussed previously. Accurate modelling is still possible when working from a distance and direct access to the object is not available.
· Existing capture techniques have a fixed operation time no matter what the complexity of the scene is, whereas in my methods the human can judge the most appropriate level of detail. In many cases the user wants to create only simple shapes such as boxes to represent buildings, and so these techniques are ideal for quick operations.
· Existing techniques require the object to already exist so it can be captured, whereas my methods allow the human to specify any geometry desired. This allows the creation of new shapes that do not physically exist and may be used to plan future construction work.
While the eye and brain are powerful capture devices, there are limitations introduced by the resolution and accuracy of the tracking devices used to record the inputs. For example, when using a GPS accurate to 50 centimetres the object size that can be modelled is in the order of metres (such as a car), while using a 1 millimetre magnetic tracker allows much smaller objects (such as a drink can). This research does not attempt to address problems with registration or accuracy of tracking devices, but instead works within the limitations of current technology to provide the best solutions that can be achieved.
In previously discussed VR research, a number of techniques have been developed for use in modelling applications. These applications traditionally provide tools to create and manipulate objects in a virtual world, and to fly around and perform scaling operations to handle a variety of object sizes. While techniques for action at a distance such as spot lights [LIAN93], selection apertures [FORS96], and image plane techniques [PIER97] have been developed, these only perform simple manipulations on existing objects and cannot be used to create new ones due to the lack of generating distance values. Techniques such as flying [ROBI92], Worlds in Miniature [STOA95], scaled world grab [MINE97a], and Voodoo Dolls [PIER99] can perform the creation of points by bringing the world within arm’s reach, but accuracy is affected by the scale. Due to their non-exact freehand input methods, all of these systems are also limited to conceptual modelling tasks and not precision modelling. CAD systems use snapping functions or exact numerical entry to ensure accurate inputs, but require an existing reference to snap to or non-intuitive command-based entry.
Although AR environments share some similar functionality with VR, AR is unique in that it requires registration of the physical and virtual worlds. Flying and scaling operations require the breaking of AR registration and so cannot be used. Scaled world representations force the user to divert their attention from the physical world to perform miniature operations within the hands. Existing VR techniques cannot create models of objects the size of skyscraper buildings without breaking the 1:1 relationship between the user and the virtual world. With construction at a distance techniques, the scale of the world is fixed and only the user’s head position controls the view. The virtual geometry is created using absolute world coordinates and is always registered and verifiable against the physical world in real-time. By using the physical presence of the user as an input device, the body can be directly used to quickly and intuitively control the view rather than relying on a separate input device.
As discussed previously, humans are much more capable of accurately estimating and specifying horizontal and vertical displacements compared to distances. By using the AR working planes and landmark alignment techniques formulated earlier, simple 2D input devices can be used to draw points in 3D. An AR working plane can be defined at any time from the body along the direction of view (maximising accuracy with landmark alignment) or relative to an existing object (maintaining the same accuracy as the source object), and the user can then move around to a different angle to draw against this surface. With AR working planes, the user is able to draw points that are at large distances and at locations that are not normally reachable, maintaining a 1:1 relationship between the virtual and physical worlds.
In order to draw against the working plane surface, a 2D input device must be used. This chapter is written without specifying any particular input device technology to demonstrate the generic nature of these techniques. Some of the examples in this chapter show a glove with fiducial marker-based tracking in use as the input device. A 2D cursor is overlaid on top of the fiducial marker and this is used to project points onto the AR working planes. Other input devices such as trackballs or joysticks can just as easily be used for this task. Chapter 5 will discuss further the implementation of the user interface, the input devices used, and how construction at a distance is implemented within applications.
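Projecting a 2D cursor onto a working plane amounts to intersecting a ray from the eye with the plane. The sketch below is a minimal illustration that assumes the cursor has already been converted into a 3D ray direction in world coordinates. The function name and the tuple-based vectors are hypothetical and do not describe the implementation discussed in Chapter 5.

```python
def project_cursor(eye, ray_direction, plane_point, plane_normal):
    """Return the 3D point where the cursor ray meets the working plane, or None."""
    denom = sum(d * n for d, n in zip(ray_direction, plane_normal))
    if abs(denom) < 1e-9:
        return None  # the ray runs parallel to the plane
    offset = [p - e for p, e in zip(plane_point, eye)]
    t = sum(o * n for o, n in zip(offset, plane_normal)) / denom
    if t < 0:
        return None  # the plane lies behind the user
    return tuple(e + t * d for e, d in zip(eye, ray_direction))


if __name__ == "__main__":
    # Eye 1.7 m above the ground, looking north at a plane 50 m away.
    print(project_cursor((0, 0, 1.7), (0, 1, 0), (0, 50, 0), (0, -1, 0)))
```

Because the depth comes entirely from the plane, the input device only needs to supply the two degrees of freedom that the user can specify accurately.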
Construction at a distance relies on a set of fundamental operations that by themselves cannot generally model a physical world object. Combining a series of these fundamental operations by making iterative improvements can produce complex shapes however. As the modelling operation is taking place the user can see how well the virtual and physical objects compare, repeatedly making changes until the desired quality is gained. The previously discussed CSG techniques used by CAD systems also rely on this principle to produce highly complicated shapes that would otherwise be difficult to specify. The ability to instantly verify the quality of models against the physical world helps to reduce errors and decrease the total modelling time.
The process of iterative refinement for VR modelling is discussed by Brooks [BROO97], and he recommends a breadth-first iterative refinement strategy as the most efficient. Each major object should be created using a simple representation at first, followed by each minor object of lesser importance. By refining the objects that require it, guided by the eye of the user, Brooks suggests that poor approximations will be immediately obvious and should be the first objects to correct. I have used these VR guidelines for my construction at a distance techniques, and take the refinement process one step further by using the unique ability of AR to compare virtual and physical worlds simultaneously. Instead of attempting to capture millions of polygons, construction at a distance focuses on simplicity and can be used to capture models at very simple detail levels if desired by the user. As mentioned in the background chapter, a laser scanner typically takes about one hour to capture a building from four different angles, producing a fixed number of millions of polygons with no control over detail. In comparison, if a simple model with only important features is required, the user can focus their efforts on these parts only and reduce the modelling time considerably. As an example, a building with four walls and a sloped roof was captured in the few minutes it takes to walk around and sight along each wall of the building.
Table 4‑1 Top down view of building shapes with vertices (v), edges (e), and facets (f)
Some techniques have been developed previously for the interactive creation of data in virtual environments with no prior information. The previously discussed CDS system by Bowman can create vertices by projecting a virtual laser beam against the ground plane [BOWM96]. By connecting these points together and extruding the 2D outline upwards, full 3D solid objects can be created although they are limited in complexity by being constant across the height axis. The previously discussed work by Baillot et al. performed the creation of vertices located at the intersection of two virtual laser beams drawn from different locations [BAIL01]. After defining vertices these can then be connected together to form other shapes of arbitrary complexity, limited only by the time available to the user. Since these techniques both operate using vertex primitives that are then connected into edges, polygons, and objects, the complexity of this task increases as the number of facets on the object increases. Given a building object with n walls (along with a roof and a floor) there will be 2n vertices, 3n edges, and n+2 facets to process. Table 4‑1 shows some example objects with a varying number of faces, with linear growth for vertex and edge complexity. Rather than treating objects as collections of vertices, construction at a distance mainly operates using surfaces and solid objects, so an object with 10 walls can be modelled in 10 steps rather than as 20 vertices and 30 edges.
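The counts given above for a building with n walls can be tabulated with a short sketch. The function name is hypothetical, and the building is assumed to be a prism with a flat roof and floor, as in the text.

```python
def prism_complexity(walls):
    """Vertex, edge, and facet counts for a prism-shaped building with n walls."""
    if walls < 3:
        raise ValueError("a closed building outline needs at least three walls")
    return {"vertices": 2 * walls, "edges": 3 * walls, "facets": walls + 2}


if __name__ == "__main__":
    for n in (4, 5, 8, 10):
        c = prism_complexity(n)
        # Euler's formula for a convex solid: V - E + F = 2
        assert c["vertices"] - c["edges"] + c["facets"] == 2
        print(n, c)
```

The counts satisfy Euler's formula for every n, which gives a quick consistency check on the expressions 2n, 3n, and n+2.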
This section describes techniques involving the direct placement of objects within arm’s reach. While not being truly construction at a distance, these techniques may be used as inputs for other operations. These techniques are the simplest for the user to understand, although they can be time consuming due to the physical movements required.
Figure 4‑1 AR view of virtual table placed in alignment with physical world table
The simplest way to perform modelling is to use prefabricated objects and place them at the feet of the user as they stand in the environment. The object is placed when commanded by the user, with its orientation specified relative to the viewing direction of the user and always placed level to the ground plane. I have termed this technique street furniture, as it can be used to place down objects that commonly occur on the street, such as the table in Figure 4‑1. This method works well when objects to create are known in advance, and the user can avoid having to model the object each time. Due to the direct nature of the technique, no abstractions are required for the user to understand since physical movements are used to control the object placement. By instantiating objects at the user’s feet, tracking of the hands can be completely avoided to simplify the task further. While this is not construction at a distance according to the definition, it is the most basic and simplest operation that can be performed using a mobile outdoor AR computer. Later techniques described in this chapter use direct placement for the creation of infinitely sized plane surfaces in the environment. The main limitation of this technique is that the user must be able to walk up to the location desired. When the object cannot be reached, techniques that can perform true construction at a distance must be used.
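The placement rule described above (position at the user's feet, orientation from the viewing direction, always level with the ground) can be sketched as follows. The coordinate convention is an assumption made for illustration: positions are east, north, and up in metres, and headings are in degrees clockwise from north. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    east: float
    north: float
    up: float
    heading_deg: float  # yaw only; pitch and roll are always zero (level to ground)


def place_street_furniture(user_east, user_north, ground_up, view_heading_deg):
    """Create an object pose at the user's feet, facing along the view direction."""
    return Pose(user_east, user_north, ground_up, view_heading_deg % 360.0)


def to_world(pose, local_points):
    """Transform prefabricated (right, forward, up) offsets into world coordinates."""
    h = math.radians(pose.heading_deg)
    world = []
    for right, forward, up in local_points:
        east = pose.east + right * math.cos(h) + forward * math.sin(h)
        north = pose.north - right * math.sin(h) + forward * math.cos(h)
        world.append((east, north, pose.up + up))
    return world


if __name__ == "__main__":
    table = place_street_furniture(10.0, 20.0, 0.0, 90.0)  # user facing east
    print(to_world(table, [(0.0, 1.0, 0.7)]))              # a point 1 m in front
```

Only the head heading is read from the orientation sensor; pitch and roll are discarded so the object stays level, and no hand tracking is involved.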
Figure 4‑2 VR view of bread crumbs markers defining a flat concave perimeter
Figure 4‑3 AR view showing registration of perimeter to a physical world grassy patch
Using the previously defined street furniture technique to place down prefabricated shapes is useful but limited to objects created in advance. For large objects such as rivers, lakes, trails, roads, and other ground features I have developed a technique named bread crumbs. In many cases it is possible to walk near the edges of these ground features, and so a direct system of marking out their vertices is realisable. The bread crumbs technique is inspired by the children’s fairytale Hansel and Gretel [GRIM22]. In this story, the children are taken out into the forest by their parents in the hope they will not come back home again. Hansel was carrying a loaf of bread however, and dropped small crumbs of bread where they walked, enabling them to find their way back home again. On the second occasion, the children found themselves unable to get home because the birds in the forest ate up the trail.
Figure 4‑4 Example bread crumbs model extruded to form an unbounded solid shape
With a mobile AR system, less edible and more reliable virtual markers can be placed down on the ground to simulate bread crumbs. The user walks along the perimeter of the object they wish to model and manually specifies markers to place at points of interest under the feet, as shown in Figure 4‑2. This is the same as marking waypoints when using a hand-held GPS while walking. When the user fully walks around the object, a closed perimeter is automatically formed and converted into polygons, as in Figure 4‑3. While the initial perimeter defined is a thin polygon, it can be infinitely extruded up to define a solid building outline, or infinitely extruded down to approximate the bottom of a lake or river. The resulting extruded object is unbounded in the vertical direction as shown in Figure 4‑4, and must be completed using other techniques described later.
The bread crumbs technique has been used to model roads, parking lots, grassy areas on campus, and other concave outline style shapes. Paths and navigation routes may also be defined using bread crumbs, except treated as a line segment instead of a solid polygon.
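The bread crumbs steps (drop markers, close the perimeter, extrude) can be sketched as below. This is a minimal illustration under assumed conventions: markers are (east, north) positions in metres, and the unbounded extrusion is approximated by finite bottom and top heights supplied by the caller. The function names are hypothetical.

```python
def close_perimeter(crumbs):
    """Turn an ordered list of dropped markers into a closed polygon outline."""
    if len(crumbs) < 3:
        raise ValueError("at least three markers are needed to enclose an area")
    return list(crumbs)


def signed_area(polygon):
    """Shoelace formula; positive for anticlockwise outlines, valid for concave shapes."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0


def extrude(polygon, bottom, top):
    """Build one vertical wall quad per perimeter edge between two heights."""
    walls = []
    n = len(polygon)
    for i in range(n):
        (x0, y0), (x1, y1) = polygon[i], polygon[(i + 1) % n]
        walls.append(((x0, y0, bottom), (x1, y1, bottom), (x1, y1, top), (x0, y0, top)))
    return walls


if __name__ == "__main__":
    # An L-shaped (concave) grassy patch walked anticlockwise.
    patch = close_perimeter([(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)])
    print(signed_area(patch), len(extrude(patch, 0.0, 1000.0)))
```

For a path or navigation route, the same marker list would be kept as an open line rather than passed to close_perimeter.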
This section describes a series of construction at a distance techniques based on the user’s physical presence in the environment. Using simple head-based pointing, the geometry of planes originating from the body can be specified, taking advantage of the user’s sense of proprioception. Using CSG techniques, these planes can be used to easily define solid building shapes out of arm’s reach. Since many buildings in the physical world can be modelled using planes, the process of modelling can be accelerated compared to the simplistic approach of creating each vertex and edge manually.
Figure 4‑5 Infinite carving planes used to create a convex shape from an infinite solid
Buildings in the physical world tend to approximate collections of angled walls in most cases. As described in Chapter 2, a solid convex cube can be formed with the specification of six planes arranged perpendicular to each other and a CSG intersection operator. Instead of specifying these planes numerically, the user can create these same planes in an AR environment by projecting them along the line of sight. By looking along the plane of a wall of a building and aligning the two ends (a very accurate method of positioning discussed previously) the user can project an infinite plane along this wall (in a similar way to AR working planes). Each plane defines a half space that when combined with a CSG intersect operation will form a finite solid shape.
Figure 4‑5 depicts a five-sided building and the location of the mobile AR user as they are sighting down each of the walls, showing the infinite volume being iteratively bound by the infinite planes. At the beginning of the operation, the AR system creates an (approximately) infinite solid volume that will be used for carving. When the user is aligned with a wall, they project an infinitely long vertical plane along the view direction into the world. This plane divides the previous infinite solid into two parts and the left or right portion (decided by the user) is carved away from the solid and removed. As the user sights along each wall, the solid will be reduced down to an object that is no longer infinite in the X and Y axes. At completion, a floor is automatically created at ground level, and the roof is left unbounded for carving using other techniques, since it is impractical to sight along the roof of a very tall building. The final 3D shape is stored using absolute world coordinates and reflects the geometry of the physical building.
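Viewed from above, the carving process is a sequence of half-plane clips applied to a very large polygon. The sketch below illustrates this in 2D. It assumes (east, north) coordinates, represents the approximately infinite solid as a large square, and clips in the style of Sutherland-Hodgman. The function names and the keep parameter are hypothetical and are not taken from the thesis software.

```python
def carve(polygon, point, direction, keep="left"):
    """Clip a convex polygon by the vertical plane through point along direction.

    keep selects which side of the view direction survives the carve.
    """
    px, py = point
    dx, dy = direction
    sign = 1.0 if keep == "left" else -1.0

    def side(v):
        # Cross product is positive when v lies to the left of the view direction.
        return sign * (dx * (v[1] - py) - dy * (v[0] - px))

    result = []
    n = len(polygon)
    for i in range(n):
        cur, nxt = polygon[i], polygon[(i + 1) % n]
        s_cur, s_nxt = side(cur), side(nxt)
        if s_cur >= 0:
            result.append(cur)
        if (s_cur > 0 and s_nxt < 0) or (s_cur < 0 and s_nxt > 0):
            t = s_cur / (s_cur - s_nxt)  # where the edge crosses the plane
            result.append((cur[0] + t * (nxt[0] - cur[0]), cur[1] + t * (nxt[1] - cur[1])))
    return result


def area(polygon):
    """Unsigned polygon area using the shoelace formula."""
    n = len(polygon)
    return abs(sum(polygon[i][0] * polygon[(i + 1) % n][1]
                   - polygon[(i + 1) % n][0] * polygon[i][1] for i in range(n))) / 2.0


if __name__ == "__main__":
    big = 1000.0
    solid = [(-big, -big), (big, -big), (big, big), (-big, big)]
    # Sight along each wall of a 10 m x 10 m building from 20 m away.
    solid = carve(solid, (0, -20), (0, 1), keep="right")   # west wall, looking north
    solid = carve(solid, (10, -20), (0, 1), keep="left")   # east wall, looking north
    solid = carve(solid, (-20, 0), (1, 0), keep="left")    # south wall, looking east
    solid = carve(solid, (-20, 10), (1, 0), keep="right")  # north wall, looking east
    print(solid, area(solid))
```

Each call corresponds to one sighting along a wall, so a building with n walls needs n carve operations. Keeping the intermediate polygons would support the immediate undo described in the text.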
With this technique, the object can be carved away iteratively and the user receives real-time feedback of the infinite volume being bounded, allowing immediate undo in case of a mistake. Compared to the direct methods described previously, this plane-based technique allows the capture of buildings from a distance without having to actually stand next to or on top of the building. Since the user is in direct control of the modelling process, the positions of occluded surfaces can be estimated using their knowledge of the environment. These features are useful because many existing physical capture methods require a full view of the object, GPS trackers do not work well near large buildings, and standing on top of a building may not be possible or may be too dangerous. This technique is also much more efficient than vertex and edge specification since each wall is handled with a single primitive that is easy to create accurately. A limitation of this technique is that using only planes and a CSG intersection to define objects restricts usage to convex buildings with no indentations, and this will be addressed further at the end of this section.
Another limitation of the orientation infinite planes technique is the dependence on an orientation sensing device for the head. While GPS units may have reliable accuracies in the order of 2 cm, orientation sensors vary in accuracy due to problems with interference and limitations of the technology. These variations affect the placement of planes in the environment and as the distance from the user increases, angular errors cause increasing positional errors. The last section of this chapter will discuss the accuracy of these techniques in more detail, but techniques that avoid the use of orientation sensing should be able to produce much more accurate results.
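The growth of positional error with distance can be illustrated with simple trigonometry. The sketch below assumes a plane projected from the user's position with a fixed angular error in heading. The function name and the 1 degree figure are illustrative and are not measurements of a particular sensor.

```python
import math


def plane_offset(distance_m, angular_error_deg):
    """Sideways displacement of a projected plane at a given distance from the user."""
    return distance_m * math.tan(math.radians(angular_error_deg))


if __name__ == "__main__":
    for distance in (10, 50, 100):
        print(f"{distance:>4} m: {plane_offset(distance, 1.0):.2f} m")
```

Under this model a 1 degree heading error displaces the plane by about 1.75 metres at 100 metres, far more than the 2 cm position accuracy mentioned above.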
Figure 4‑6 Orientation invariant planes generated using multiple marker positions
In order to take advantage of the stability of position tracking, the orientation infinite planes technique described earlier can be modified to use two or more position points to specify orientation, making it invariant to errors in orientation tracking devices. Using the same landmark alignment concept discussed previously, the user can accurately sight along a wall and mark a position. To indicate direction, the user walks closer while maintaining their alignment and marks a second point. These two points can then be used to project an infinite carving plane, with Figure 4‑6 depicting the position markers and planes used to form a complete building shape. By increasing the spacing of the marker points or using a line of best fit amongst multiple points, the accuracy of this technique can be further improved.
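Deriving a plane orientation from marked positions can be sketched as follows. Points are assumed to be (east, north) positions in metres and headings are degrees clockwise from north. The line of best fit is computed as the principal axis of the marker positions, which is one reasonable choice and not necessarily the fitting method used in the thesis. The function names are hypothetical.

```python
import math


def bearing_from_two_points(first, second):
    """Heading in degrees clockwise from north, from the first marker to the second."""
    return math.degrees(math.atan2(second[0] - first[0], second[1] - first[1])) % 360.0


def best_fit_direction(points):
    """Centroid and unit direction of the principal axis through several markers."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two markers are needed to define a direction")
    mean_e = sum(p[0] for p in points) / n
    mean_n = sum(p[1] for p in points) / n
    s_ee = sum((p[0] - mean_e) ** 2 for p in points)
    s_nn = sum((p[1] - mean_n) ** 2 for p in points)
    s_en = sum((p[0] - mean_e) * (p[1] - mean_n) for p in points)
    # Angle of the major axis, measured from the east axis.
    theta = 0.5 * math.atan2(2.0 * s_en, s_ee - s_nn)
    return (mean_e, mean_n), (math.cos(theta), math.sin(theta))


if __name__ == "__main__":
    print(bearing_from_two_points((0.0, 0.0), (0.0, 10.0)))      # walking north
    print(best_fit_direction([(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0)]))
```

The fitted direction is an undirected axis, so its sign carries no meaning. The order in which the markers were placed would still be needed to decide which side of the plane to carve.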
Figure 4‑7 Relationship between GPS accuracy and required distance to achieve better than 1 degree of orientation error for two different GPS types
To measure the accuracy of this technique, Figure 4‑7 depicts how the angular error is affected by the positional error in the GPS and the distance between the two marker points. To make this technique useful it must have an accuracy that is better than is available using traditional orientation sensors. Assuming a maximum error of 1 degree, Figure 4‑7 contains the calculations to find the distance required between markers for 50 cm and 2 cm accurate GPS units. Using the RTK example, if 1.1 metres is required for less than 1 degree error then 10 or more metres will produce orders of magnitude better results than previously possible. For a 50 cm GPS these results are not so promising since the marker distance required of 29 metres is quite large - the user will have to choose between the errors introduced by the GPS or those from the orientation sensor.
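The distances quoted above can be reproduced with a short calculation. The sketch below assumes a simple model in which the GPS error acts as a sideways offset between the two markers, so that the angular error is the arctangent of the GPS error divided by the marker spacing; the function names are illustrative only.

```python
import math

def required_marker_spacing(gps_error_m, max_angle_deg=1.0):
    """Marker spacing (metres) at which the angular error equals max_angle_deg."""
    return gps_error_m / math.tan(math.radians(max_angle_deg))

def angular_error_deg(gps_error_m, spacing_m):
    """Angular error (degrees) for a given GPS error and marker spacing."""
    return math.degrees(math.atan2(gps_error_m, spacing_m))

if __name__ == "__main__":
    for label, error in (("2 cm RTK GPS", 0.02), ("50 cm GPS", 0.50)):
        spacing = required_marker_spacing(error)
        print(f"{label}: {spacing:.1f} m spacing for 1 degree of error")
    print(f"2 cm GPS at 10 m spacing: {angular_error_deg(0.02, 10.0):.2f} degrees")
```

Under this model the 2 cm unit needs a spacing of about 1.1 metres and the 50 cm unit about 29 metres, consistent with the values given for Figure 4‑7.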
Figure 4‑8 Orientation invariant planes formed using first specified angle and markers
This technique is similar to the position infinite planes technique in that it is invariant to orientation sensing errors. The previous technique required the user to specify the orientation for each plane by using two points, but if the angles at each corner are known to be equal then only one orientation is needed and the others can be calculated automatically. The user creates the first plane using the same method described previously, but for each additional plane only one position marker is recorded. Based on the number of positions marked, the system knows the number of walls to create and calculates the orientation for each position point based on the first plane. This technique uses nearly half the number of points and yet produces the same accuracy if the first plane is properly placed and the building meets the required properties. Figure 4‑8 depicts the markers created by the user and how they are used to project planes through the environment to form a solid shape.
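As a small illustration of how the remaining orientations might be derived, the sketch below assumes a convex building whose corner angles are all equal, so that each successive wall turns by the exterior angle of 360 degrees divided by the number of walls. The function name and the `clockwise` flag (the direction in which the user walks around the building) are assumptions made for this example.

```python
def wall_bearings(first_bearing_deg, wall_count, clockwise=False):
    """Bearing of each wall plane, given the first wall and equal corner angles."""
    if wall_count < 3:
        raise ValueError("a closed shape needs at least three walls")
    step = 360.0 / wall_count          # exterior angle of an equiangular polygon
    if clockwise:
        step = -step
    return [(first_bearing_deg + i * step) % 360.0 for i in range(wall_count)]
```

For a four-walled building whose first wall was sighted at a bearing of 10 degrees, `wall_bearings(10, 4)` yields 10, 100, 190, and 280 degrees, with one plane projected through each recorded position marker.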
Many objects in the world are not the same shape as simple boxes, cylinders, spheres, and cones. While it may seem that many objects are too complicated to model, they may usually be described in terms of combinations of other objects. For example, the process of defining a cube with a hole using vertices is time consuming, but can be easily specified with a CSG operation. As discussed in the background chapter, CSG is a technique commonly used by CAD systems, supporting Boolean set operations such as inversion, union, intersection, and subtraction [REQU80]. The manufacture of objects in the physical world is also performed in a similar manner - to produce the previous example a drill is used to bore out a hole in a solid cube.
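The Boolean set operations can be illustrated with point membership tests, where each solid is represented as a function that reports whether a point lies inside it. This is only a sketch of the set semantics: CAD systems and the modeller described here operate on surface geometry rather than on predicates, and all names below are hypothetical.

```python
def box(cx, cy, cz, half):
    """Axis-aligned cube centred at (cx, cy, cz) with the given half-width."""
    centre = (cx, cy, cz)
    return lambda p: all(abs(p[i] - c) <= half for i, c in enumerate(centre))

def cylinder_z(cx, cy, radius):
    """Infinite cylinder along the z axis, standing in for the drill."""
    return lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= radius ** 2

def inversion(a):
    return lambda p: not a(p)

def union(a, b):
    return lambda p: a(p) or b(p)

def intersection(a, b):
    return lambda p: a(p) and b(p)

def difference(a, b):
    # Subtraction is intersection with the inverted second operand.
    return intersection(a, inversion(b))

# The cube with a hole from the text: a cube minus a drilled cylinder.
cube_with_hole = difference(box(0, 0, 0, 1.0), cylinder_z(0, 0, 0.5))
```

A point near a corner of the cube, such as (0.9, 0.9, 0), is inside the result, while the centre (0, 0, 0) has been bored out.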
Figure 4‑9 Box objects can be moved into a building surface to carve out windows
To demonstrate CSG operations outdoors, Figure 4‑9 depicts an example where a user is applying the CSG difference operator to subtract cubes from a building shape. This could be used when the user needs to carve out indented windows. The first example shows a cube placed at a distance (1a), and then dragged sideways until it enters the building shape (2a). The second example shows a cube attached to a working plane (1b) and then pushed (2b) into the surface of the building (similar to a cookie cutter), requiring close access to the building. As the cube is being positioned by the user, the CSG difference operator is interactively calculated and displayed to the user.
The infinite planes examples previously described have only dealt with convex shapes, limiting the types of buildings that can be modelled. A convex shape can be simply described as one that appears to have the same shape when a sheet of rubber is stretched over its surface, such as the trapezoid in Figure 4‑10. A concave shape is more complex and contains holes or other indentations that will not be visible when a sheet of rubber is stretched over it. Concave buildings similar to the T, L, and O shapes in Figure 4‑10 cannot be modelled directly using a single set of infinite planes because planes from one part of the object will exclude other parts. Concave objects can, however, be created by combining convex objects, such as with the subtraction of one box from another depicted in Figure 4‑11. The two boxes can be individually created using infinite planes, and then combined using a CSG difference operator to produce a concave object. When combined with CSG techniques, infinite planes become more useful with the ability to also model concave objects.
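The reason a single set of planes fails for concave shapes can be seen in a small 2D footprint example. The sketch below treats each infinite plane as a half-plane test and a convex shape as the intersection of those tests; the coordinates and names are invented for illustration.

```python
def inside_half_planes(point, half_planes):
    """True if the point satisfies nx*x + ny*y <= d for every (nx, ny, d)."""
    x, y = point
    return all(nx * x + ny * y <= d for (nx, ny, d) in half_planes)

def rectangle(xmin, ymin, xmax, ymax):
    """Four carving planes bounding an axis-aligned rectangular footprint."""
    return [(-1, 0, -xmin), (1, 0, xmax), (0, -1, -ymin), (0, 1, ymax)]

outer = rectangle(0, 0, 4, 4)   # overall bounding footprint
notch = rectangle(2, 2, 4, 4)   # corner removed to leave an L shape

def inside_l_shape(point):
    """CSG difference of two convex footprints gives the concave L shape."""
    return inside_half_planes(point, outer) and not inside_half_planes(point, notch)

# Adding the inner wall at x = 2 as one more carving plane cuts too much away.
single_set = outer + [(1, 0, 2)]
```

The point (3, 1) belongs to the L-shaped footprint, yet `inside_half_planes((3, 1), single_set)` is false because the plane along the inner wall removes the whole arm of the L. Building the two boxes separately and subtracting one from the other keeps that point.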
Figure 4‑10 Convex trapezoid and concave T, L, and O-shaped objects
Figure 4‑11 Concave object created using CSG difference of two convex boxes
This section describes a series of construction at a distance techniques based on AR working planes. The previous techniques are capable of placing prefabricated objects and capturing bounding boxes for large objects, but detailed modelling is not provided. Using AR working planes and a 2D input glove (as used in these examples), the user can specify much more intricate details to create realistic 3D models.
The projection carving technique modifies existing objects by projecting points against surfaces and then cutting away extrusions to produce new highly concave shapes. This technique provides the ability to construct features such as zig-zag roofs and holes that are difficult or impossible to model using previously described techniques. Figure 4‑12 depicts an example of how this technique can be used to carve two peaked roofs onto a building model. These building models may have been created using infinite planes and projection carving can be used to restrict the infinite roof to a finite volume. The AR working plane is created relative to a polygon that has been selected by the user. The object that contains the polygon is then used as the input for the upcoming carving operation. The user then creates vertices along the surface of the AR working plane and these are connected together to form a 2D concave outline. This outline is then extruded along the surface normal of the working plane and used as an input tool for a CSG difference carving operation.
Figure 4‑12 AR working planes are used to specify vertices and are projected along the surface normal for carving the object’s roof
The projection is performed using orthogonal extrusion from the AR working plane, and is position invariant so points can be entered from any location in front of the polygon. This enables the user to cut a flat roof on a 100 metre high building while standing at ground level and looking up. If the cursor were used to carve the object directly like a laser beam, the system would produce pyramid-shaped extrusions. For some buildings, the user may only desire to create a flat roof or a single slope: by creating only one point the system will create a horizontal cutting plane, and with two points a diagonal cutting plane is created. More than two points implies the user wishes to cut with an outline and so it must be fully specified as in Figure 4‑12. The CSG operation can be switched from difference to intersect if desired, with the effect being that the user can cut holes or split an object into separate parts instead of carving the outside. Used in this form, orthogonal extrusion is limited to carving operations that can be seen in a silhouette representation – other features such as indentations that are not visible from the side cannot be captured with this technique. Some of these limitations can be overcome by limiting the depth of the extrusion used for carving. By using a small fixed value or controlling it by moving the body forward or backward, the extrusion can be controlled by the user and used for features such as windows or doors.
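The position invariance of orthogonal extrusion can be sketched in two steps: the cursor ray is first intersected with the AR working plane to place a vertex, and the vertex is then pushed along the plane normal rather than along the viewing ray. The code below is a simplified illustration with hypothetical function names, and the sign convention for the normal is an assumption.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def intersect_ray_plane(origin, direction, plane_point, plane_normal):
    """Point where a cursor ray meets the working plane, or None if it misses."""
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-9:
        return None                      # ray is parallel to the plane
    offset = [p - o for p, o in zip(plane_point, origin)]
    t = dot(offset, plane_normal) / denom
    if t < 0:
        return None                      # plane is behind the user
    return tuple(o + t * d for o, d in zip(origin, direction))

def extrude(outline, plane_normal, depth):
    """Far cap of the carving tool: each vertex moved along the plane normal."""
    return [tuple(v + depth * n for v, n in zip(vertex, plane_normal))
            for vertex in outline]
```

Two users standing in different places who sight the same point on the working plane obtain the same vertex, and therefore the same extruded tool, whereas extending each viewing ray beyond the plane would give diverging results. Passing a small `depth` corresponds to the limited extrusion suggested for windows and doors.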
This technique is first demonstrated with a simple example of modelling a building with a sloped roof such as in Figure 4‑13. The user first captures the overall geometry of the building using orientation infinite planes with an unbounded roof. To carve the roof, the user positions their body somewhere in front of the building so that the entire slope is easily visible. The user then indicates with the cursor the first vertex defining the roof on the left, then the peak of the roof, and then the right side of the roof, as shown in Figure 4‑14. To complete the selection, the user must enter vertices around the overall building to indicate that the rest of the object should be kept. The selection region is then used to define a carving tool that removes all objects outside the region, to produce the final shapes shown in Figure 4‑13 and Figure 4‑15.

Figure 4‑13 AR view of infinite planes building created with sloped roof

Figure 4‑14 AR view of infinite planes building being interactively carved with a roof

Figure 4‑15 VR view of building with sloped roof, showing overall geometry
A second example demonstrating this technique is a small automobile being modelled outdoors in Figure 4‑16. Firstly, a prefabricated box is created and scaled to approximate the overall dimensions of the car. The user next views the car from the side and intersects points against the box surface to define the silhouette. Figure 4‑16 shows the markers being placed on the box, and Figure 4‑17 shows the final solid shape of the car approximately matching the physical world. The object can then be carved along any other polygons to further refine the model until it suits the user’s requirements.
Figure 4‑16 Frames of automobile carving, with markers placed at each corner
Figure 4‑17 Final resulting automobile shown overlaid in AR view, and in a VR view
Once a building has been created, the user may desire to place windows, doors, and other extra details onto the model. While it may be possible to draw these details onto a texture map (which cannot be zoomed arbitrarily), or to place extra polygons outside the building to represent these (covering the original building), the building model itself remains untouched. If these new polygons are removed or manipulated, the original solid object remains since the changes are only superficial. A more desirable scenario is that polygons of a different colour are actually cut into the subdivided surface of an object, so that if they are deleted it is possible to see features inside the object that were previously concealed. I have named this technique projection colouring and its operation is depicted in Figure 4‑18. Using the same steps as projection carving, vertices are projected against an AR working plane created relative to the surface and then connected into an outline. Instead of carving away the outline, the surface is subdivided and the colour of the outlined polygon is modified. The newly coloured polygons may then be deleted or manipulated freely by the user if desired. The window and door in Figure 4‑18 have been cut into the surface using this technique, with the door then openable using a rotation.
Figure 4‑18 Schematic illustrating the painting of a window onto a wall surface
Figure 4‑19 Examples showing surface of revolution points for tree and cylinder objects
When working outdoors and modelling natural features such as trees and artificial features such as fountains, box-shaped objects are usually poor approximations to use. In an attempt to model these objects, I have used surface of revolution techniques (as used in many desktop CAD systems) to capture geometry that is rotated about an axis. The user starts by creating an AR working plane in the environment, with the most intuitive way being to sight toward the central trunk of the tree and project the AR working plane along the view direction. The user then projects vertices onto the AR working plane, defining one half of the outline of the object. After specifying the vertices along the axis of rotation, the system generates a solid object by rotating the outline around the axis, as depicted in Figure 4‑19. Figure 4‑20 shows an example where the vertices of a tree have been specified with a preview shape generated, with Figure 4‑21 showing the final shape in the environment. This technique generates good results when modelling natural objects such as pine trees that are highly symmetrical about the trunk. For trees that grow with deformities and other non-symmetrical features this technique may not generate suitable approximations. To improve the approximation, previously described carving techniques may be applied to refine the model until the user is satisfied with the object.
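The rotation step can be sketched as below, assuming the half outline is given as (radius, height) pairs measured from a vertical axis of rotation and that the solid is approximated by a fixed number of segments. The function name and the sample profile are invented for this example.

```python
import math

def revolve(profile, segments=16):
    """Rotate a half outline of (radius, height) pairs about the vertical axis.

    Returns one ring of (x, y, z) vertices per profile point; adjacent rings
    can then be joined with quads or triangles to form the solid surface.
    """
    rings = []
    for radius, height in profile:
        ring = []
        for i in range(segments):
            angle = 2.0 * math.pi * i / segments
            ring.append((radius * math.cos(angle), radius * math.sin(angle), height))
        rings.append(ring)
    return rings

# A rough pine tree: tip, widest branches, then a narrow trunk down to the ground.
tree_profile = [(0.0, 6.0), (1.5, 2.0), (0.2, 2.0), (0.2, 0.0)]
tree_rings = revolve(tree_profile, segments=8)
```

Every vertex in a ring lies at the same distance from the axis, which is why the result suits symmetrical objects and approximates irregular trees poorly.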
Figure 4‑20 AR view of surface of revolution tree with markers on AR working plane
Figure 4‑21 VR view of final surface of revolution tree as a solid shape
When implementing live AR video overlay, the mobile computer includes a video camera that may also be used to supply textures for polygons. By using the same tracking devices that are already required for AR, the system can automatically match up images from the camera to polygons in the scene. Captured models are normally only presented using a single colour and texture maps increase the realism for users without having to add extra polygons for detail. This texture map capturing technique is an alternative to the previously described projection colouring technique when only superficial details are required. To perform texture map capture, the user stands at a location where the texture for an object’s polygon is clearly visible to the camera. The user selects the polygon to activate capture mode and the system projects the polygon vertices onto the AR video overlay to map the still image as a texture. The user repeats this operation for each polygon until the object is completely textured. An example is shown in Figure 4‑22, where a stack of pallets is modelled using a 1 metre box and then textures are captured for the polygons. The final resulting model is shown in Figure 4‑23.
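The projection of polygon vertices onto the captured image can be sketched with a pinhole camera model. The code assumes the vertices have already been transformed into the camera frame using the tracker pose (x to the right, y up, z forward, in metres), that the focal length in pixels is known, and that texture coordinates run from 0 to 1 with the origin at the bottom left. These conventions and the function name are assumptions for illustration, not details taken from the implementation.

```python
def project_to_texcoords(vertices_cam, focal_px, width, height):
    """Texture coordinates (s, t) for polygon vertices seen by the camera."""
    coords = []
    for x, y, z in vertices_cam:
        if z <= 0:
            raise ValueError("vertex is behind the camera; move to a better viewpoint")
        px = width / 2 + focal_px * x / z     # pixel column
        py = height / 2 - focal_px * y / z    # pixel row, measured from the top
        coords.append((px / width, 1.0 - py / height))
    return coords
```

A vertex on the optical axis maps to the centre of the image, (0.5, 0.5). Coordinates falling outside the 0 to 1 range indicate that part of the polygon is not visible in the captured frame.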
Figure 4‑22 Outdoor stack of pallets approximating a box, before modelling
Figure 4‑23 VR view of final model with captured geometry and mapped textures
The best results for this technique are obtained when the object is fully visible and fills as much of the HMD as possible, as well as being perpendicular to the user’s viewing direction. Since OpenGL only supports linear texture mapping, if the angle to the polygon is large (such as the roof of the box in Figure 4‑22 and Figure 4‑23), the texture will be highly distorted since a non-linear mapping is required. Although techniques for capturing textures of 3D models have been described previously, this has not been performed in a mobile outdoor AR environment. Previously discussed work by Debevec et al. implemented the capture of 3D models from photographs and extracted textures for each facet [DEBE96]. Lee et al. also implemented the capture of textures in AR but with surfaces being modelled within arm’s reach using a wand, with the system automatically capturing textures when video frames were deemed suitable [LEE01]. The video stream used with mobile AR suffers from problems with motion blur and tracker registration, and having the user choose the moment to capture the texture generates the highest quality output.
While a number of techniques can perform the modelling of simple and useful shapes, the true power of construction at a distance is expressed when used in combinations. This integrated example is designed to highlight all of the features of the described modelling techniques with the construction of an abstract building in an outdoor environment. The user walks outside to an empty piece of land and creates a landscape that they would like to preview and perhaps construct in the future. As an added feature, this model may be viewed on an indoor workstation either in real-time during construction or at a later time. Similar applications include creative tools for abstract art or landscape gardening design.
Figure 4‑24 and Figure 4‑25 show different views of this example at the end of the construction process. The first step is to create the perimeter of the building shape using the bread crumbs technique - the user walks around the building site and places down markers at the desired ground locations, forming a flat outline. Next, the outline is extruded upwards into a solid 3D shape. Using projection carving, the user cuts a main roof to make the object finite and then carves a slope using various control points. After the overall roof structure is created, the object is lifted into the air. At this point, the supporting columns, trees, tables, and avatar people are created using street furniture placement of prefabricated models at the desired locations. The building is then lowered by visual inspection onto the supporting columns. Next, the user performs further carving and a large hole is created through the centre of the building. Projection carving is then used to cut out two large sections of the building, causing it to exist as three unattached solid shapes most visible in Figure 4‑24. After around 10 minutes for this example, the desired model is complete and the user can now move around the environment to preview it from different view points.
Figure 4‑24 AR view of final abstract model, including street furniture items
Figure 4‑25 VR view of final abstract model, including street furniture items
The construction at a distance techniques rely on the position and orientation sensors for all tracking, and so increasing the accuracy of these devices will produce improved results and affect the minimum model size that can be properly captured. Errors from each sensor have different effects on the captured models since one is measured as a distance and the other as an angle. When rendering the AR display, results are affected not only by the errors in the current tracker data, but also by those from the capture process. Tracking devices affect the overall operation of the system and how it can be deployed in the physical world.
The position sensor used in these examples is a Trimble Ag132 GPS, with an accuracy of approximately 50 centimetres and working reliably amongst small buildings and light tree cover. To ensure the most accurate positioning possible, an indicator on the HMD shows the quality of the GPS signal to the user. For orientation, an InterSense IS-300 hybrid magnetic and inertial sensor is used, although the tracking is unreliable when there are magnetic distortions present in the environment or when the user is moving quickly and disturbing the sensor. Since the accuracy of the IS-300 may vary unpredictably, this is the most critical component in terms of reliability and accuracy.
When modelling a new object, the accuracy of projection-based techniques is dependent on the user’s current location and the direction they are looking. For the highest accuracy, it is desirable to be as close to the object as possible, minimising the distance the projection can stray from the desired direction caused by angular errors in the orientation sensor. When viewing an existing virtual object, the registration errors with the physical world caused by the GPS will be least noticeable when viewed from a distance due to perspective, while standing very close to an object will cause these errors to be more noticeable. Registration errors caused by the IS-300 remain constant on the display at all distances due to their angular nature.
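These two effects can be put into rough numbers with the sketch below. It assumes small, independent errors, with the GPS error treated as a sideways offset and the orientation error as a fixed angle; the example values are illustrative and the function names are hypothetical.

```python
import math

def gps_registration_error_deg(gps_error_m, viewing_distance_m):
    """Apparent angular misregistration on the display caused by a position error."""
    return math.degrees(math.atan2(gps_error_m, viewing_distance_m))

def projection_drift_m(orientation_error_deg, distance_m):
    """Sideways drift of a projected point caused by an orientation error."""
    return distance_m * math.tan(math.radians(orientation_error_deg))

if __name__ == "__main__":
    for distance in (5.0, 50.0):
        angle = gps_registration_error_deg(0.5, distance)
        print(f"50 cm GPS error seen from {distance:.0f} m: {angle:.2f} degrees")
    for distance in (10.0, 100.0):
        drift = projection_drift_m(1.0, distance)
        print(f"1 degree orientation error at {distance:.0f} m: {drift:.2f} m drift")
```

With these values a 50 cm position error appears as roughly 5.7 degrees at 5 metres but only about 0.6 degrees at 50 metres, while a 1 degree orientation error moves a projected point by about 0.17 metres at 10 metres and 1.75 metres at 100 metres.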
While GPS is fast and reliable (with accuracies of RTK GPS units at 2 cm in ideal conditions), improvements in orientation sensing technology are required to make the construction at a distance techniques more accurate. While the IS-300 is one of the best mid-range priced trackers on the market, it is still not completely adequate for modelling outdoors. Due to its use of magnetic sensors, distortion caused by sources of magnetic interference such as the backpack, underground pipes, and street lamps affects the accuracy of the tracking. To correct for this, the backpack contains a touchpad mouse attached to the user’s chest to apply a correction offset. The touchpad allows the user to fine tune the calibration while moving outdoors through varying magnetic fields. I am currently investigating methods of automating this calibration, with methods such as optical natural feature tracking holding promise, but these are still current research problems.
This chapter has presented my novel construction at a distance techniques, designed to support the capture and creation of 3D models in outdoor environments using AR. Construction at a distance takes advantage of the presence of the user’s body, AR working planes, landmark alignment, CSG operations, and iterative refinement to perform modelling tasks with mobile AR systems. The techniques described in this chapter include street furniture, bread crumbs, three types of infinite planes, CSG operations, projection carving, projection colouring, surface of revolution, and texture map capture. When used in an AR environment, users can capture the geometry of objects that are orders of magnitude larger than themselves without breaking AR registration or having to touch the object directly. These modelling techniques are intuitive and support iterative refinement for detail in areas that require it, with AR providing real-time feedback to the user. While existing techniques are available for the capture of physical world objects, these still have limitations and also cannot be used to create models that do not physically exist. The construction at a distance techniques were field tested using a number of examples to show how they may be applied to real world problems. By discussing insights gained from these examples I have identified areas for improvement that currently cause accuracy problems.
5
"Research is what I'm doing when I don't know what I'm doing."
Wernher von Braun (1912-1977)
This chapter presents new user interface technology that I have designed and developed for use with mobile outdoor AR systems. This user interface is used to control the previously described AR working planes and construction at a distance techniques, and is implemented in a mobile outdoor AR modelling application named Tinmith-Metro. The implementation of the techniques described in this dissertation demonstrates their use, and allows different ideas to be tested and iteratively refined through user feedback. Testing the techniques out in the field has also given insights into more efficient methods after experimenting with early prototypes. The user interface described in this chapter is designed to be implemented within the limitations of current technology while still being intuitive to use. The solution developed uses a pair of vision tracked pinch gloves that interact with a new menuing system, designed to provide a cursor for AR working planes and command entry to switch between different operations. The user requires feedback on the HMD while using the system, and so the design of the display with its various on screen objects is discussed. To implement the full Tinmith-Metro application, a large hierarchy of commands was created to control the various modelling techniques and these are presented in a format similar to a user manual. This chapter ends with a summary of a number of informal user evaluations that were used to iteratively refine the interface.
Augmented and virtual reality both use the motion of the body to control the image that is displayed to the user. While tracking the body is critical to rendering AR and VR displays, fine-grained manipulation and selection operations are typically specified most intuitively using the hands. The previously described Smart Scene [MULT01] and CHIMP [MINE96] applications have intuitive user interfaces that involve the use of hand gestures supplied by high quality tracking systems. These user interfaces implement many best practice techniques for direct manipulation in VR that are desirable for modelling tasks. While outdoor AR is similar and using these techniques is desirable, there are a number of unique problems (discussed in Chapter 2) that require solutions. Previously discussed outdoor AR systems use indirect style user interfaces involving tablets, keyboards, joysticks, trackballs, gyro mice, and other hand-held devices but do not support direct manipulation operations. The user instead steers a cursor across the display using indirect motions of the hands. Some devices such as velocity-based inputs have a further level of indirection and so are even less intuitive to use. Shneiderman [SHNE83] and Johnson et al. [JOHN89] both discuss how adding levels of indirection may make user interfaces less intuitive to operate. Furthermore, any AR user interface developed must use mobile technology so the user may roam freely in the outside environment. It must take into account restrictions on size, performance, electric power consumption, and weight.
I was initially inspired by the elegant combination of commands and pointing implemented by Bolt in the Put-That-There system [BOLT80]. The user is able to point their hand toward objects projected onto the walls and then speak commands to perform operations. The use of menus by many other applications is an unintuitive abstraction that forces users to select suitable options from a list, usually with the same input device. The direct command entry in Put-That-There is similar to how people operate in the real world, and can be observed when watching a supervisor explain an operation to a worker: “Pick up that brick and place it over there”, while at the same time pointing with the hands to indicate exactly what object and where to put it. Brooks also agrees and discusses how commands and operations should be separated in interactive systems [BROO88].
Based on this initial inspiration from Bolt and Brooks, I have developed a user interface involving the use of tracked gloves so the user can point at objects and interact with them in a mobile outdoor AR system. Since the user is wearing gloves, there is no need for any other devices and the hands are free to perform physical world tasks. Implemented using vision tracking, the interface can locate the user’s hands in the field of view. Using optical methods (discussed further in Chapter 7) is one of the only ways to perform tracking of the hands outside, although the 3D accuracy is quite poor compared to indoor magnetic tracking. One property of the optical tracker is that its 2D accuracy is quite good, and 2D cursors can be calculated from the hand positions very easily. These 2D cursors are suitable for input to the AR working planes techniques and used to perform interactions in 3D environments. Due to poor distance and orientation detection, these values are not used to control the user interface. The use of tracked gloves to perform selection and manipulation of objects is a very direct user interface because natural hand gestures are used to express actions. Although the object at a distance is not being touched directly, humans still intuitively understand this operation. When talking with others, humans naturally point and understand these gestures even when the object is very far away. Pierce et al. demonstrated this with a number of simple selection techniques based on image planes [PIER97].
Brooks suggested that commands and operations should be kept separate and so I have included this important idea into the user interface. Many systems implementing direct manipulation also make use of menus or other widgets that the user must reach out and grab. These systems use the same cursors for both interaction and command entry, and these ideas do not work well when implemented with poor tracking. Rather than having unreliable command entry caused by poor tracking, I have developed an interface based on the pinching of fingers to select commands from menus fixed to the display. The pinching of fingers is a very reliable input that is easy to implement in any environment, and allows the hands to be outside the field of view and untracked during command entry. The use of finger pinches mapped directly to the menu also allows the same hand to perform simultaneous interaction and command entry.
By separating the command interface from pointing, Brooks describes how the command input method can be changed without affecting interaction. He also suggests that speech recognition is a natural candidate for command selection, as used by Bolt in Put-That-There. Johnson et al. also point out that the use of command interfaces such as speech input encourages the user to think of the computer as an assistant or co-worker rather than a tool [JOHN89]. While it would be interesting to explore the use of speech recognition for command entry, the current software available to implement it is very CPU intensive. When using a mobile computer there are limited resources available, and speech recognition would prevent the rest of the AR system from functioning adequately. Problems such as lack of accuracy and interference from noise (especially when outside) were common in older systems, and although improvements have been made they have not been eliminated. Currently this user interface does not implement speech recognition, but it is designed to support it, and speech input is discussed as a future area of research at the end of this chapter.
The user interface is made up of three components: a pointer based on the tracking of the thumbs with a set of gloves worn by the user; a command entry system where the user’s fingers interact with a menu for performing actions; and an AR display that presents information back to the user. The display for the interface is fixed to the HMD’s screen and presents up to ten possible commands as menu options at any one time. Eight of these commands are mapped to the fingers as depicted in Figure 5‑1, and the user activates a command by pressing the appropriate finger against the thumb. When an option is selected, the menu refreshes with the next set of options that are available. Ok and cancel operations are activated by pressing the fingers into the palm of the appropriate hand and are indicated in the topmost boxes of the menu. The interaction cursors are specified using fiducial markers placed on the tips of the thumbs, as shown in Figure 5‑2. With this user interface, the user can perform AR working planes, action at a distance, and construction at a distance techniques in mobile outdoor AR environments.
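The finger-to-menu mapping can be modelled as a small state machine over a tree of commands. The sketch below is illustrative only: the finger names, the sample command labels, the left-to-right ordering, and the assignment of ok to the right palm and cancel to the left are all assumptions, and the real Tinmith-Metro command hierarchy is not reproduced here.

```python
FINGERS = [
    "left_little", "left_ring", "left_middle", "left_index",
    "right_index", "right_middle", "right_ring", "right_little",
]

class PinchMenu:
    def __init__(self, tree):
        self.tree = tree   # nested dicts: label -> submenu (empty dict for a leaf)
        self.path = []     # labels chosen so far

    def _current(self):
        node = self.tree
        for label in self.path:
            node = node[label]
        return node

    def options(self):
        """Map each finger to the option currently shown in its menu box."""
        return dict(zip(FINGERS, self._current()))

    def pinch(self, finger):
        """Select the option under a finger; descend if it has a submenu."""
        label = self.options().get(finger)
        if label is not None and self._current()[label]:
            self.path.append(label)
        return label

    def palm_press(self, hand):
        """Assumed mapping: right palm is ok, left palm is cancel; both return to the top."""
        self.path.clear()
        return "ok" if hand == "right" else "cancel"

# Hypothetical command tree used only to exercise the menu.
menu = PinchMenu({
    "manipulate": {"translate": {}, "rotate": {}, "scale": {}},
    "create": {"box": {}},
})
```

Because each pinch is a discrete event, command entry in this model never depends on the cursor position, so it continues to work when the hands are outside the camera's field of view.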
Figure 5‑1 Each finger maps to a displayed menu option, the user selects one by pressing the appropriate finger against the thumb
The concept of AR working planes and two example uses were presented in Chapter 3, one for the manipulation of objects and the other for the creation of new vertices. AR working planes are designed to support interactions at large distances and only 2D input is required, simplifying the tracking requirements. Using the previously mentioned gloves, fiducial markers are placed on the thumbs to provide 3D tracking information using the ARToolKit [KATO99]. Due to the implementation of ARToolKit, the 6DOF accuracy can be quite poor but overlaid objects always appear to register with the targets. When the 3D coordinates from the tracker are projected onto the image plane of the display, an accurate 2D cursor can be obtained, as displayed in Figure 5‑2. With the availability of suitable tracking, the user interface contains cursors for both hands as well as a head cursor fixed to the centre of the HMD. Each operation is implemented using varying combinations of these input cursors depending on what is most appropriate. This section describes the implementation of one, two, and zero-handed AR working planes techniques with examples demonstrating their use outdoors.
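One way to see why the 2D cursor can be accurate even when the 3D estimate is poor is that an error in range along the viewing ray does not change where the point lands on the image plane. The sketch below assumes a pinhole camera with the marker position expressed in the camera frame (x right, y up, z forward); the numbers and the function name are invented for illustration and do not describe ARToolKit's internals.

```python
def project_to_screen(point_cam, focal_px, width, height):
    """Pixel position of a 3D point given in the camera frame."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (width / 2 + focal_px * x / z, height / 2 - focal_px * y / z)

# Thumb marker 0.5 m in front of the camera, and the same marker with its
# range overestimated by 40 percent along the same viewing ray.
true_thumb = (0.10, -0.05, 0.50)
misjudged_thumb = tuple(1.4 * c for c in true_thumb)

cursor_true = project_to_screen(true_thumb, 500, 640, 480)
cursor_misjudged = project_to_screen(misjudged_thumb, 500, 640, 480)
```

Both positions project to the same pixel, so a cursor derived this way is unaffected by the range error that makes the full 3D pose unreliable.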
Figure 5‑2 Immersive AR view, showing gloves and fiducial markers, with overlaid modelling cursor for selection, manipulation, and creation
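The projection of a tracked 3D thumb position onto the image plane can be illustrated with a simple pinhole camera model. The following is only a minimal sketch: the function name, the axis convention, and the focal length and display size are assumptions made for illustration and are not taken from ARToolKit or from the implementation described here.

```python
def project_to_cursor(point_cam, focal_px, width, height):
    """Project a camera-space point onto the display as a 2D pixel cursor.

    point_cam is (x, y, z) in metres with x right, y up, and z pointing
    forward from the camera (an assumed convention for this sketch).
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    # Perspective divide: the depth value only affects the apparent scale.
    u = width / 2 + focal_px * x / z
    v = height / 2 - focal_px * y / z  # pixel rows grow downwards
    return (u, v)


# Hypothetical marker position half a metre in front of the camera.
cursor = project_to_cursor((0.1, 0.05, 0.5), focal_px=800, width=640, height=480)
```

In this model, an error that lies purely along the viewing ray (for example, the marker being reported at twice its true distance) leaves the projected cursor unchanged, which is one way to see why a 2D cursor can remain usable when the depth estimate is poor.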
To simplify manipulation tasks for the user, operations such as translate, rotate, and scale are implemented using separate commands. This is on the assumption that the user will wish to work with certain degrees of freedom without affecting others. Researchers such as Hinckley [HINC94a] and Masliah and Milgram [MASL00] have demonstrated that users have difficulty controlling both the position and orientation of the hands at the same time. Constrained manipulation is also useful when tracking is poor, so that other degrees of freedom are not affected when simple changes are made. Manipulation handles (as used in many previous 2D and 3D systems) are not used since they would be difficult to grab with the poor tracking available. The user instead activates the desired manipulation operation with a menu command and may grab any visible part of the object. This is also more intuitive since objects in the physical world can be moved by grabbing any part of the object.
Users may perform selection operations using a single hand controlling the 2D cursor. The cursor is projected into the environment and intersected against the nearest object, similar to the image plane techniques described by Pierce et al. [PIER97]. The object closest to the user will be selected, placed into a selection buffer, and rendered transparently in a selection colour for user feedback. Since the most common operations performed by users only involve single objects, the implementation has been streamlined so that selection is automatically performed when a manipulation command is activated.
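Selection by projecting the cursor into the environment can be sketched as a ray cast from the eye through the cursor, keeping the closest hit. The example below is a simplified illustration only: it assumes each object is approximated by a bounding sphere, and the function names and scene contents are hypothetical.

```python
import math


def ray_sphere_distance(origin, direction, centre, radius):
    """Distance along a unit-length ray to the first hit on a sphere, or None."""
    oc = [o - c for o, c in zip(origin, centre)]
    b = sum(d * k for d, k in zip(direction, oc))
    c = sum(k * k for k in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None  # the ray misses the sphere entirely
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t >= 0:
            return t
    return None  # the sphere lies behind the ray origin


def pick_nearest(origin, direction, objects):
    """Return the name of the object whose bounding sphere is hit first."""
    best_name, best_t = None, math.inf
    for name, centre, radius in objects:
        t = ray_sphere_distance(origin, direction, centre, radius)
        if t is not None and t < best_t:
            best_name, best_t = name, t
    return best_name


# Hypothetical scene: (name, bounding sphere centre, radius in metres).
scene = [
    ("tree", (0.0, 0.0, 10.0), 1.0),
    ("house", (0.0, 0.0, 30.0), 5.0),
    ("bench", (5.0, 0.0, 10.0), 1.0),
]
selected = pick_nearest((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene)
```

Here the tree and the house both lie under the cursor, and the tree is chosen because it is closer to the user. A real scene graph would intersect against the actual geometry rather than bounding spheres.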
To operate on many objects simultaneously, the user can select several objects and collect them together into a selection buffer. When a manipulation task is then specified by the user, they may select any object that is a part of the collection and then all objects will be used as inputs. The ability to collect objects together is also useful for specifying operations such as CSG where two inputs are required and both may contain multiple objects. Multiple objects are currently collected together by selecting each object, although other techniques such as two-handed rubber banding are simple to implement if required. To support modelling, an arbitrary number of selection buffers may be stored in a circular list for later use. Overall then, a modelling session may have a number of selection buffers (with one being active), and each buffer contains collections of objects that must all be operated on together. Selection buffers are similar to groups used in drawing editors, except no changes to the scene graph are made to express this relationship. Each selection buffer is represented using a different transparent colour to assist the user with identifying groupings. The active selection buffer may be nominated by cycling through the available list or selecting an object that is part of the desired selection buffer. Within a selection buffer, references to objects are also stored in a circular list, with one object nominated as the last object. When a new object is added to the selection buffer it will become the new last object, and extra commands are provided to operate on only the last object instead of the entire selection buffer. The last object pointer is useful for performing a type of undo to allow objects to be deleted in the order of creation.
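The relationship between selection buffers, the active buffer, and the last object can be sketched as a small data structure. This is an illustrative sketch rather than the implementation described above: the class and method names are hypothetical, and a plain Python list with a wrapping index stands in for the circular lists.

```python
class SelectionBuffer:
    """A collection of objects that are operated on together."""

    def __init__(self, colour):
        self.colour = colour  # transparent highlight colour for feedback
        self.objects = []     # the final entry plays the role of the last object

    def add(self, obj):
        if obj in self.objects:
            self.objects.remove(obj)
        self.objects.append(obj)  # a newly added object becomes the last object

    def last(self):
        return self.objects[-1] if self.objects else None

    def remove_last(self):
        """Undo-style removal, in reverse order of addition."""
        return self.objects.pop() if self.objects else None


class SelectionManager:
    """Holds any number of buffers, with exactly one active at a time."""

    def __init__(self):
        self.buffers = []
        self.active_index = None

    def new_buffer(self, colour):
        self.buffers.append(SelectionBuffer(colour))
        self.active_index = len(self.buffers) - 1
        return self.buffers[-1]

    @property
    def active(self):
        return None if self.active_index is None else self.buffers[self.active_index]

    def cycle(self):
        """Step to the next buffer, wrapping around at the end of the list."""
        if self.buffers:
            self.active_index = (self.active_index + 1) % len(self.buffers)
        return self.active

    def activate_containing(self, obj):
        for index, buffer in enumerate(self.buffers):
            if obj in buffer.objects:
                self.active_index = index
                return buffer
        return None

    def operands(self, obj):
        """Picking one member yields every object in its buffer as input."""
        buffer = self.activate_containing(obj)
        return list(buffer.objects) if buffer else [obj]
```

Note that the buffers hold only references to objects, so no change to the scene graph is needed to express the grouping.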
Direct manipulation-based translation operations may be performed by the user with a single hand. At the start of the operation, the user selects an object (and automatically any other sibling objects which are part of the same selection buffer, if any) and an AR working plane is created using head-relative coordinates with normal pointing toward the user, located at the point of intersection with the object. As the user’s hand moves, the cursor is continuously projected onto the AR working plane and used to offset the object accordingly, performing translation and maintaining the same object to user distance. If the user moves their body during the translation operation then the object will be moved in the same direction as this motion. If the user changes their head direction then the object will rotate around with the user’s head. Figure 5‑3 shows an animation of a typical translation operation with a virtual tree being directly manipulated across the landscape by the user’s hand and head motion. Instead of using head-relative coordinates, the AR working plane could also be defined in world, location, or body-relative coordinates to support different motion constraints. The properties of AR working planes relative to these different coordinate systems were described previously in Chapter 3.
Figure 5‑3 Translation operation applied to a virtual tree with the user’s hands
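The translation operation can be sketched as two ray and plane intersections: one where the cursor first grabs the object, and one at the current cursor position, with the difference applied as an offset. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the working plane is treated as fixed for a single update rather than being re-derived from the head pose each frame.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def intersect_plane(eye, direction, plane_point, plane_normal):
    """Point where a ray from the eye meets the working plane, or None."""
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-9:
        return None  # the ray runs parallel to the plane
    t = dot(sub(plane_point, eye), plane_normal) / denom
    if t < 0:
        return None  # the plane is behind the user
    return add(eye, tuple(d * t for d in direction))


def drag_translate(object_pos, eye, plane_point, plane_normal, start_ray, current_ray):
    """Offset the object by the cursor's motion measured on the working plane."""
    start = intersect_plane(eye, start_ray, plane_point, plane_normal)
    current = intersect_plane(eye, current_ray, plane_point, plane_normal)
    if start is None or current is None:
        return object_pos
    return add(object_pos, sub(current, start))


# Hypothetical example: an object 10 m ahead, plane normal facing the user.
moved = drag_translate(
    object_pos=(0.0, 0.0, 10.0),
    eye=(0.0, 0.0, 0.0),
    plane_point=(0.0, 0.0, 10.0),
    plane_normal=(0.0, 0.0, -1.0),
    start_ray=(0.0, 0.0, 1.0),
    current_ray=(0.1, 0.0, 1.0),
)
```

In this example a small sideways cursor motion moves the object one metre across the plane while its depth along the plane normal stays at 10 m.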
Using AR working planes, the hand tracker and 2D cursor may also be used to project 3D vertices. Chapter 4 described a number of techniques involving the placement of vertices for the carving of objects and creating surfaces of revolution. When these techniques are activated, an appropriate AR working plane is created and the user can place vertices onto the plane using finger pinch commands.
The AR working planes scale and rotate operations require more inputs than the translation operation because an axis to scale or rotate about must be defined along with the adjustment vector. Scale and rotate may be performed more naturally through the use of two-handed input, which has been shown to improve performance and accuracy for users. Two-handed input techniques were pioneered by Buxton and Myers in a study using 2D environments [BUXT86], and then discussed in terms of 3D environments in a survey paper by Hinckley et al. [HINC94a]. Sachs et al. found that using a tracked tablet in one hand and a pen in the other was more natural than having a fixed tablet and one-handed input [SACH91]. Zeleznik et al. presented some two-handed 3D input methods for desktop applications, some of which are implemented here [ZELE97]. These previous works all demonstrate that by using the non-dominant hand as an anchor, the dominant hand can accurately specify operations relative to it. Scaling may be performed by stretching the object with the two hands. Instead of using the orientation of the hand, rotation may be performed by using the angle between the two hands.
Figure 5‑4 Scale operation applied to a virtual tree with the user’s hands
Scaling operations are initiated by selecting an object and any associated selection buffer sibling objects with the non-dominant hand, and an AR working plane is created at the object in head-relative coordinates. Since the operation begins immediately after the scale command is issued, the non-dominant hand must be selecting the object and the dominant hand should be in a suitable position to specify the new relative scale adjustment. Similar to translation, the AR working plane could also be expressed in other coordinate systems to support different scaling constraints. The AR working plane is then used to capture initial 3D coordinates for the input cursors so that any relative changes can be applied to the object selected. Since the scale operation is performed against an AR working plane, only the two dimensions perpendicular to the surface normal of the AR working plane may be adjusted. This implementation uses the head cursor as the scaling axis and relative changes in the distance between the hands to control the scaling factor applied. Figure 5‑4 shows an animation using a sequence of frames with a tree being scaled about the head cursor using both hands. Scaling about the head cursor was implemented because I did not want the scaling axis to vary as both hands are being stretched apart. An alternative implementation is where the object is grabbed by the non-dominant hand to specify the scaling axis, and the dominant hand specifies a relative scaling adjustment to apply. If the non-dominant hand moves during the operation then the object may either move with the hand or remain still while the scaling axis moves instead. Performing simultaneous translation and scaling on an object may be undesirable because it involves too many degrees of freedom to control with relatively inaccurate tracking.
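The two-handed scale adjustment can be sketched as the ratio between the current and initial distances separating the hands on the working plane, applied about the point where the head cursor meets the plane. The following is an illustrative sketch only: it assumes points are already expressed in a hypothetical plane frame of (u, v, depth), with depth measured along the plane normal, and the function names are not from the original implementation.

```python
import math


def scale_factor(left_start, right_start, left_now, right_now):
    """Ratio of the current to the initial distance between the two hand cursors."""
    initial = math.dist(left_start, right_start)
    if initial == 0:
        raise ValueError("hands must start at distinct positions")
    return math.dist(left_now, right_now) / initial


def scale_vertex(vertex, pivot, factor):
    """Scale the two in-plane coordinates about the pivot; depth is untouched."""
    u, v, depth = vertex
    pu, pv = pivot
    return (pu + (u - pu) * factor, pv + (v - pv) * factor, depth)


# Hypothetical example: the hands move from 2 units apart to 4 units apart.
factor = scale_factor((-1.0, 0.0), (1.0, 0.0), (-2.0, 0.0), (2.0, 0.0))
scaled = scale_vertex((3.0, 1.0, 5.0), pivot=(1.0, 1.0), factor=factor)
```

Because only the distance between the hands is used, moving both hands together without changing their separation leaves the scale factor at one.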
Figure 5‑5 Rotate operation applied to a virtual tree with the user’s hands
Rotation operations are performed using two-handed input techniques similar to those used for scaling, and are started by selecting an object and any associated selection buffer sibling objects with the non-dominant hand. Similar to the previous scaling operation, the non-dominant and dominant hands must be in the desired positions because the rotation operation begins when the rotate command is selected. An AR working plane relative to the head is created at the object (similar to translation and scaling) and the initial 3D coordinates of the input cursors are captured. Since rotation is performed against an AR working plane, the rotation is fixed to occur on an axis parallel to the surface normal of the AR working plane. The head cursor is used as the axis of rotation and the angle between the hands is used to apply a relative rotation to the selected objects. Using two-handed rotation has an added advantage of performing well with tracking systems that produce high quality position but poor quality rotation values. By ignoring the orientation values and using only the positions for rotation, I avoid these errors and maximise the accuracy of the tracking system in use. Figure 5‑5 shows an animation using a sequence of frames with a tree being rotated about the head cursor using two-handed input. Similar to scaling, an alternative implementation is where the non-dominant hand grabs the object at the desired rotation axis while the dominant hand rotates the object. If the non-dominant hand moves then the object will move or the axis of rotation will move, and this may be an undesirable number of degrees of freedom to control.
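Deriving a rotation purely from hand positions can be sketched by comparing the direction of the line joining the two hand cursors before and after the motion. This is a minimal sketch under assumptions: the cursors are given as 2D points on the working plane, and the function names are hypothetical.

```python
import math


def hand_angle(left, right):
    """Direction of the line from the left to the right cursor, in radians."""
    return math.atan2(right[1] - left[1], right[0] - left[0])


def rotation_delta(left_start, right_start, left_now, right_now):
    """Relative rotation between the initial and current hand positions."""
    delta = hand_angle(left_now, right_now) - hand_angle(left_start, right_start)
    # Wrap into the range -pi to pi so the shortest rotation is applied.
    return math.atan2(math.sin(delta), math.cos(delta))


def rotate_about(point, pivot, angle):
    """Rotate a 2D point on the working plane about the pivot."""
    x, y = point[0] - pivot[0], point[1] - pivot[1]
    c, s = math.cos(angle), math.sin(angle)
    return (pivot[0] + x * c - y * s, pivot[1] + x * s + y * c)


# Hypothetical example: the hands turn from horizontal to vertical.
angle = rotation_delta((-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0))
rotated = rotate_about((2.0, 0.0), pivot=(1.0, 0.0), angle=angle)
```

No orientation value from the tracker appears anywhere in this calculation, which is the property that makes the approach tolerant of poor rotation tracking.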
Since the user interface has very powerful command entry which is not based on the cursor inputs, a number of manipulation and creation operations are possible without requiring the hands to be visible. The nudge operation allows the user to perform precise manipulations based on fixed increments. The head cursor can be used similarly to the hands by pointing with a cursor fixed to the centre of the display.
To perform accurate translation, scale, and rotate commands, an operation named nudging is provided to operate in fixed increments and without requiring vision tracking. Nudging is based on the use of arrow keys or other discrete controls for interaction which are commonly used in 2D desktop-based modelling systems. The nudge move operation is used to offset the object in one metre increments in any direction relative to the user (left, right, up, down, towards, and away) and is useful in cases where the distance is known exactly or the user cannot walk to drag the object. Scaling is performed similarly using fixed units in any of the three axis directions or with the axes linked together to ensure uniform scaling. Rotation implements fixed angular offsets and can rotate about any of the three axes individually. The scale and rotate nudge operations also provide the ability to scale away from the user or rotate around the ground plane normal, which cannot be performed normally since AR working planes flatten these directions. Nudging does not require any input device tracking at all, and so is useful when objects are too far to judge adjustments, or when tracking is of extremely poor quality.
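The nudge move operation can be sketched as adding a fixed offset along one of six user-relative directions. The sketch below makes several assumptions that are not stated in the text: world coordinates are taken as x east, y north, and z up, the user's heading is measured clockwise from north, and the function name is hypothetical. Only the one metre step comes from the description above.

```python
import math


def nudge_move(position, heading_deg, direction, step=1.0):
    """Offset a position by a fixed step in a direction relative to the user."""
    h = math.radians(heading_deg)
    away = (math.sin(h), math.cos(h), 0.0)    # the way the user is facing
    right = (math.cos(h), -math.sin(h), 0.0)  # 90 degrees clockwise from away
    up = (0.0, 0.0, 1.0)
    basis = {
        "away": (away, 1.0),
        "towards": (away, -1.0),
        "right": (right, 1.0),
        "left": (right, -1.0),
        "up": (up, 1.0),
        "down": (up, -1.0),
    }
    axis, sign = basis[direction]
    return tuple(p + sign * step * a for p, a in zip(position, axis))


# Hypothetical example: a user facing north nudges an object away, then left.
pos = nudge_move((0.0, 0.0, 0.0), heading_deg=0.0, direction="away")
pos = nudge_move(pos, heading_deg=0.0, direction="left")
```

Each menu command maps to exactly one call, so the result depends only on the user's heading and the number of commands issued, not on any hand tracking.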
While the head cursor can be used to perform translation operations similar to those implemented using the hands, the main use is for the placement of objects in the world. AR working planes and orientation infinite planes may be projected from the head into the environment, and the head cursor is used to define the direction from the user that this plane will follow. Using the head cursor is an obvious way to project planes since the user is already aligning objects by sighting along the head cursor cross in the centre of the display. The placement of street furniture objects is also performed relative to the head cursor, with the objects being placed either with zero distance at the user’s feet, or at a distance in front of the user and placed in the direction of the head cursor. The head cursor is further used to define surface normals for head-relative AR working planes definitions, with the normal being calculated using the location of the user’s eye and the virtual head cursor in the centre of the display.
The cursor can be used to perform a range of powerful interaction operations, but a method of specifying which operation to perform is required to support complex applications. This section describes the hierarchical menu system I have developed to support a large number of commands. As previously mentioned, finger pinches on the gloves are utilised to control the menu system, allowing the user to select other menus and specify commands. Since no tracking is required the user can hold their hands anywhere they desire, and there is no need to dock the cursor onto menu options for selection. This especially increases the speed of command entry under poor tracking conditions. A menu-based system was chosen as a simple way of grouping common tasks to reduce searching time and form the basis for a command entry system. Having a logical and efficient layout of menu options is also critical to making it intuitive for users. This section compares the design of the menu system against a number of other relevant AR and VR command entry systems to demonstrate its novel features.
This dissertation has presented the benefits of using hand pointing for interaction and so a menu design that is compatible with this style of input is required. Using another 2D input device would be cumbersome and force the user to switch between different input methodologies. Therefore the design of the menu system is not based on the traditional WIMP 2D style interface with selection of objects using a cursor. Instead, the design is closer to that of the original MicroPro WordStar application shown in Figure 5‑6 and used in the 1970s and 1980s [XTRE03]. This application uses a menu bar along the bottom of the display to show the commands that are mapped to the F1-F10 keys on the keyboard. To select a menu option, the user presses the appropriate F-key and either the command is performed or the menu moves into a sub-level of extra options. Before the introduction of the mouse to user interface design, this style of menu design was quite common and is discussed by Shneiderman [SHNE92]. Interacting with menus of this type requires a discrete number of inputs and the user can directly select what they want without having to control a dynamic input device.
Figure 5‑6 Original WordStar application, showing menu toolbar at bottom of screen (Image courtesy of Mathias Winkler - XTree Fan Page)
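The discrete style of menu selection described above, where each input either performs a command or descends into a sub-level, can be sketched as a walk over a tree of menu nodes. This is an illustrative sketch only: the class names and menu labels are hypothetical, and the behaviours of returning to the top level after a command and of cancel stepping back one level are assumptions rather than details taken from the text.

```python
class MenuNode:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


class MenuSystem:
    SLOTS = 8  # one displayed option per finger, as described in the text

    def __init__(self, root):
        self.root = root
        self.path = [root]

    def options(self):
        """Labels currently shown, in slot order."""
        return [child.label for child in self.path[-1].children[: self.SLOTS]]

    def press(self, slot):
        """Handle a discrete selection; return a command label or None."""
        visible = self.path[-1].children[: self.SLOTS]
        if not 0 <= slot < len(visible):
            return None  # no option is mapped to this slot
        node = visible[slot]
        if node.children:
            self.path.append(node)  # descend and refresh the displayed options
            return None
        self.path = [self.root]  # assumption: return to the top level afterwards
        return node.label

    def cancel(self):
        """Assumed behaviour: step back up one level."""
        if len(self.path) > 1:
            self.path.pop()


# Hypothetical menu layout for illustration.
root = MenuNode("top", [
    MenuNode("manipulate", [MenuNode("translate"), MenuNode("rotate"), MenuNode("scale")]),
    MenuNode("nudge"),
])
menu = MenuSystem(root)
```

Every interaction is a single discrete input, so no cursor position is consulted at any point during command entry.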
The command entry interface I have designed uses the fingers of the glove to select the desired menu node from a set of vertical menu strips placed on the sides of the display. Figure 5‑1 depicts how each finger on the glove uniquely maps from top to bottom onto the lower four coloured boxes forming the menu display. Transparency is used to allow the user to see the physical world through the menu, reducing visual clutter. The vertical placement of the menu strips intentionally matches the orientation of the hands when held comfortably at rest and when performing cursor operations. The direct mapping of finger pinches to menu options helps the user to quickly select the correct menu node. In addition to the user’s eight fingers available for selection, two extra