Spirit has a mind of its own?
Jon Berndt - 24 Jan 2004 03:44 GMT
What do you make of this
(http://marsrover.jpl.nasa.gov 8:35 p.m. CDT 1/23/2004)
"NASA's Spirit rover did not go to sleep today even after ground controllers sent commands twice for it to do so.
Shortly before noon, controllers were surprised to receive a relay of data from Spirit via the Mars Odyssey orbiter. Spirit sent 73 megabits at a rate of 128 kilobits per second. The transmission included power subsystem engineering data, no science data, and several frames of "fill data." Fill data are sets of intentionally random numbers that do not provide information."
---
It's good news that the high rate transmission is working, and I hope the engineering data is helpful. But it sounds worrisome that the rover would not go to sleep. Will that wear the battery down to nothing and silence it permanently? Or does the rover recharge every day? I think it's the latter, though over time the battery will degrade.
Jon
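As a rough check on the figures in the quoted status report, 73 megabits at 128 kilobits per second works out to about nine and a half minutes of transmission. The sketch below assumes decimal units (1 megabit = 10^6 bits, 1 kilobit = 10^3 bits), which the report does not state, and the function name is made up for illustration.

```python
def transfer_seconds(megabits: float, kilobits_per_second: float) -> float:
    """Time to send `megabits` of data at `kilobits_per_second`.

    Assumes decimal units: 1 megabit = 1_000_000 bits, 1 kilobit = 1_000 bits.
    """
    bits = megabits * 1_000_000
    bits_per_second = kilobits_per_second * 1_000
    return bits / bits_per_second


seconds = transfer_seconds(73, 128)
print(f"{seconds:.0f} s, about {seconds / 60:.1f} minutes")
```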
Joe Delphi - 24 Jan 2004 06:50 GMT
> What do you make of this
> [quoted text clipped - 7 lines]
>
> Jon
I think the rover's battery is charged with solar panels.
From what I have heard, the problems started when they used an electric motor to do something. A malfunctioning electric motor could cause a voltage spike which could fry some of the chips associated with the computer. So maybe the reason it is not shutting down is because the computer is damaged to the point where it cannot process the shutdown command.
JD
John Doe - 25 Jan 2004 02:17 GMT
> computer. So maybe the reason it is not shutting down is because the
> computer is damaged to the point where it cannot process the shutdown
> command.
A couple years ago, IBM had some fancy 9000 series computer kill humans because it didn't want to allow them to shut him down.
My guess is that the NSA knows about the monolith on Mars and has planted secret mission objectives/code in the rover which conflicts with those of NASA. And when NASA tries to shut down the rover, the NSA's code takes control and prevents the shutdown procedure. During this time, the rover travels at 100km/h to the monolith so it can take pictures and readings, and then returns to its previous location and resumes contact with NASA.
Brian Gaff - 24 Jan 2004 11:48 GMT
I think the problem looks like software that encountered something that generated an error that was not tested for.
At least, I'd rather believe that than the frying of some chips. I'd hope that designers were better at their job than to allow a motor in any failure mode to affect the rest of the vehicle. If that is the case, then once again we see how thick we really are!
Brian
--
Brian Gaff.... graphics are great, but the blind can't hear them
Email: briang1@blueyonder.co.uk
| What do you make of this
| [quoted text clipped - 19 lines]
|
| Jon
Hallerb - 24 Jan 2004 13:41 GMT
>"NASA's Spirit rover did not go to sleep today even after ground controllers
>sent commands twice for it to do so.
That's bad; it might deplete the batteries. I wonder if Opportunity will have the same fate? Let's hope not.
Dave Donnelly - 24 Jan 2004 13:46 GMT
> What do you make of this
>
> (http://marsrover.jpl.nasa.gov 8:35 p.m. CDT 1/23/2004)
>
> "NASA's Spirit rover did not go to sleep today even after ground
> controllers sent commands twice for it to do so.
I find it interesting they make a press release saying they're sending commands to Spirit to tell it to go to sleep, but they don't say why they'd like that to happen.
I presume it's to conserve power, or to avoid overheating the craft, but then again I'm presuming.
They could have just said so, but they didn't.
And they wonder how conspiracy theories start.
In my experience as a computer engineer, I generally don't want the computer to shut down while I'm debugging it - the longer it stays up, the more I learn about what it's doing. So I have to presume they must be concerned about power use or heat.
I did read that the computer seems to be in a reset loop, i.e., it runs for a while and then resets itself. So I imagine it's pretty unpredictable exactly how long it will stay up.
Perhaps trying to force the shutdown will serve to keep the rover in a known state for a while so they can focus on solving the problem.
My best wishes to the folks at NASA/JPL/Cornell, etc. It's a difficult thing you are up against. I wish I could help.
-DD-
Brian Gaff - 24 Jan 2004 20:56 GMT
| > What do you make of this
| > [quoted text clipped - 30 lines]
|
| -DD-
Yes, I wondered about that as well.
If it keeps resetting, I'd suspect some kind of obvious input that caused it. Assuming they can run the system with, say, just diagnostic routines and comms running, then they ought to be able to find out what is going on. But if they are thinking "overheating trip here", then not being able to force it to sleep is not a good sign.
The thing would seem to be hardware resetting randomly.
Cannot exactly send a service engineer to fix it.
Has the computer got any redundancy, or is it a case of if it's shot, so is the whole thing?
Brian
spaceprojects.tk - 24 Jan 2004 22:36 GMT
>If it keeps resetting, I'd suspect some kind of obvious input that caused
>it. Assuming they can run the system with say, just diagnostic routines and
>comms running, then they ought to be able to find out what is going on, but
>if they are thinking, Overheating trip here, then not being able to force it
>to sleep is not a good sign.
Actually, the problem has been upgraded to serious. Engineers managed to reset the computer. Although it will stay put for up to 3 weeks, chances are now good that (if nothing else goes wrong) the rover may be able to continue with a nearly normal mission.
http://spaceprojects.tk
Hagar - 24 Jan 2004 23:27 GMT
> | > What do you make of this
> | > [quoted text clipped - 11 lines]
> |
> | They could have just said so, but they didn't.
Actually, Theisinger said just that - it's to conserve power at night. Last night the batteries died - but that's not fatal, just undesirable. Today they managed to get it to shut down.
> Yes, I wondered about that as well.
> [quoted text clipped - 10 lines]
> Has the computer got any redundancy, or is it a case of if it's shot, so is
> the whole thing?
Bill Gates persuaded NSA HQ to use off-the-shelf Windows XP to save money.
But once a hardware component failed, the activation expired, and it's trying to phone Microsoft...
~~~~~~~~~~~~~~~~~~~~~ This message was posted via one or more anonymous remailing services. The original sender is unknown. Any address shown in the From header is unverified.
Brian Gaff - 25 Jan 2004 09:01 GMT
Giggle, and the random numbers are the product code....:-)
Brian
| > | > What do you make of this
| > | > [quoted text clipped - 52 lines]
| The original sender is unknown. Any address shown in the From header
| is unverified.
David Stribling - 24 Jan 2004 23:32 GMT
> | > What do you make of this
> | >
> | > (http://marsrover.jpl.nasa.gov 8:35 p.m. CDT 1/23/2004)
> | >
> | > "NASA's Spirit rover did not go to sleep today even after ground
> | > controllers sent commands twice for it to do so.
[Martian dad, sitting in his easy chair, puts down paper]
"Will you kids quit playing with that thing and come down here? Supper's ready!"
[Giggling is heard coming from the cave entrance behind the big rock]
 Signature David Stribling Remove the to reply
Brian Thorn - 24 Jan 2004 21:41 GMT
>> What do you make of this
>> [quoted text clipped - 11 lines]
>
>They could have just said so, but they didn't.
They have other things on their minds besides specifying why they want Spirit to go to sleep. I doubt it's a power issue or a thermal issue (if Spirit isn't doing anything but rebooting its computer, it won't be generating much heat); they probably want the computer to stop the reboots so that they can get a chance to fix it.
>Perhaps trying to force the shutdown will serve to keep the rover in a
>known state for a while so they can focus on solving the problem.
Bingo.
Brian
RDG - 24 Jan 2004 15:56 GMT
It's those damn Ja-Wahs....
Ool - 24 Jan 2004 16:59 GMT
> It's those damn Ja-Wahs....
Rrright! Like, who are they gonna sell a robot with a bad motivator?
Signature
"A good leader knows when it's best to ignore the screams for help and focus on the bigger picture."
©OOL mmiv
http://home.t-online.de/home/ulrich.schreglmann/redbaron
RDG - 24 Jan 2004 19:08 GMT
Well then, how about that blue one? Uncle Owen? ?????
Clark - 24 Jan 2004 20:15 GMT
> What do you make of this
>
> (http://marsrover.jpl.nasa.gov 8:35 p.m. CDT 1/23/2004)
>
> "NASA's Spirit rover did not go to sleep today even after ground
> controllers sent commands twice for it to do so.
I tole'em not to use Win98, but nooooo they said. They said Mars landers weren't on the supported hardware list for W2K and we had to go with 98. That'll teach'em...
Kent Betts - 24 Jan 2004 21:17 GMT
> > (http://marsrover.jpl.nasa.gov 8:35 p.m. CDT 1/23/2004)
http://marsrovers.jpl.nasa.gov./home/index.html
gov dot slash? weird.
Jan C. Vorbrüggen - 26 Jan 2004 14:31 GMT
> http://marsrovers.jpl.nasa.gov./home/index.html
>
> gov dot slash? weird.
AFAIK, the trailing dot stops the DNS resolver from trying local modifications of the name (i.e., appending local suffixes) and makes it look up the full name as given.
Jan
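The trailing-dot behavior can be illustrated with a simplified model of how a stub resolver builds its list of names to try. This is only a sketch, loosely modeled on the search-list and `ndots` options of common Unix resolvers; the function name and the `example.org` search domain are placeholders, not anything from the thread.

```python
def lookup_candidates(name, search_domains, ndots=1):
    """List the fully qualified names a stub resolver might try, in order.

    Simplified model: a trailing dot marks the name as absolute, so the
    local search list is skipped entirely.
    """
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    as_is = f"{name}."
    # Names with at least `ndots` dots are tried as-is before the search list.
    if name.count(".") >= ndots:
        return [as_is] + expanded
    return expanded + [as_is]


print(lookup_candidates("marsrovers.jpl.nasa.gov.", ["example.org"]))
print(lookup_candidates("marsrovers", ["example.org"]))
```

With the trailing dot only the name as written is looked up; without it, the short name is first expanded with the local search domain.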
Kent Betts - 24 Jan 2004 21:12 GMT
I wonder if it is something simple, like a short in the camera motor connector that causes an intermittent drop in supply voltage, which the computer senses as a power loss, triggering a reboot. Wish the engineers would figure out what is wrong with the thing.
I also wonder how much redundancy there is in the design. Seems like a dual power supply, dual processor, dual memory, and dual transmitters would be prudent if it would fit with the weight constraints.
Kent Betts - 24 Jan 2004 21:23 GMT
> What do you make of this
"The transmission included power subsystem engineering data, no science data..."
So if the science data has its own board for interfacing and signal conditioning, it might be that the CPU is okay but the interface to the instruments is smoked.
Kent Betts - 25 Jan 2004 04:21 GMT
Sat Jan 24 2004 7:45 pm PST
Per rover project manager
The flash memory and the software perform a handshake op. Spirit computer maintains a prioritized "to do" list. If tasks are not completed according to schedule, a reset occurs. Spirit's computer performed a reset 60 times in a 30-hour period.
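The mechanism summarized here (a prioritized task list, with a reset when tasks miss their schedule) resembles a watchdog. Sixty resets in 30 hours averages one reset every half hour. The sketch below is a toy simulation of the idea only: the task names, deadlines and durations are invented, and nothing here reflects the actual flight software.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int     # lower number = more urgent (assumed convention)
    deadline: float   # seconds since boot by which the task must finish
    duration: float   # simulated run time in seconds


def run_until_reset(tasks):
    """Run tasks in priority order on a simulated clock.

    Returns (completed_names, offender). `offender` is the first task that
    finished after its deadline, where a watchdog would force a reset, or
    None if every task met its deadline.
    """
    clock = 0.0
    completed = []
    for task in sorted(tasks, key=lambda t: t.priority):
        clock += task.duration
        if clock > task.deadline:
            return completed, task.name
        completed.append(task.name)
    return completed, None


# Hypothetical boot sequence in which mounting flash overruns its deadline.
boot = [
    Task("comms", priority=1, deadline=10.0, duration=2.0),
    Task("mount_flash", priority=0, deadline=5.0, duration=8.0),
]
print(run_until_reset(boot))
```

In this toy run the flash-mount task overruns, so the loop stops before communications ever get a turn, which is one way a reset cycle could keep other scheduled work from completing.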
Engineers looked at the "to do" list to try to determine which application was causing problems. The reset problem was duplicated in the lab and the offending app was isolated.
The present workaround calls for onboard RAM to be put into service while engineers troubleshoot the flash memory problem. The RAM is not as efficient but will do the job.
The Spirit did not go into sleep mode overnight. The batteries are low but will recharge fully on the next sol.
Seventeen sols of operation were complete prior to the anomaly. Spirit has returned 2 1/2 gB of data, performed egress, and returned some micro images.
Condition of the rover is upgraded from critical to serious.
Brian Gaff - 25 Jan 2004 09:06 GMT
| Sat Jan 24 2004 7:45 pm PST
| [quoted text clipped - 20 lines]
|
| Condition of the rover is upgraded from critical to serious.
OK, so it is software detecting a problem then. I'd have thought that it might have been designed to spot a problem if it rebooted so often, but obviously not. They just had to catch it at the right moment to stop the loop.
Hard job when the delay factor is taken into account, they got lucky!
I'd be very interested in knowing if the problem was caused by some environmental problem, or was just a failure due to bad luck.
Brian
Kent Betts - 25 Jan 2004 19:28 GMT
"Brian Gaff"
> I'd be very interested in knowing if the problem was caused by some
> environmental problem, or was just a failure due to bad luck.
>
> Brian
Yeah that's for sure. Actually it sounds like the 1201 alarms on Apollo 11. In the A-11 instance, the software had a task list, and if the tasks did not receive processing an alarm was set but the machine continued running.
It also reminds me of Galileo, which was supposed to have a 128K downlink but completed the mission on ten bits per second. Using the RAM instead of flash memory...may slow down their operations. Come to think of it, Spirit was capable of autonomous navigation; like they are really going to try that now.
I saw where it got down to -175F overnight, so yeah the environment may be a factor.
starman - 25 Jan 2004 21:23 GMT
> I'd be very interested in knowing if the problem was caused by some
> environmental problem, or was just a failure due to bad luck.
What are the chances that the flash chip got hit by high energy radiation like a cosmic ray? Does the flash have any redundancy for such an event?
JazzMan - 25 Jan 2004 22:08 GMT
> > I'd be very interested in knowing if the problem was caused by some
> > environmental problem, or was just a failure due to bad luck.
>
> What are the chances that the flash chip got hit by high energy
> radiation like a cosmic ray? Does the flash have any redundancy for such
> an event?
I would have presumed that the operating system would have provisions for marking bad cells in the flash much like hard drives have bad sectors marked.
JazzMan
Signature
Please reply to jsavage"at"airmail.net. Curse those darned bulk e-mailers!
"Rats and roaches live by competition under the laws of supply and demand. It is the privilege of human beings to live under the laws of justice and mercy." - Wendell Berry
Kent Betts - 25 Jan 2004 22:42 GMT
Now what......looks like one software app, otherwise known as a portion of the software, was causing problems. The app duplicated the problem in the engineering mock-up. So is hardware involved or not?
-------
4:06 PM EST Sun Jan 25 2004
"Spirit is still serious but we are moving toward guarded condition now," rover project manager Pete Theisinger reports. "I think we got a patient well on the way to recovery." In the past day, engineers have determined that Spirit's flash memory hardware is OK. A leading theory today is that a portion of the rover's software simply couldn't cope with all that was happening on Wednesday when the trouble began.
The rover's batteries are now fully charged and the craft shortly will be going to sleep for the night. But before nighttime it will be relaying data to the Mars Odyssey orbiter including engineering and diagnostic information.
Theisinger predicts that Spirit will resume driving around the surface in a couple of weeks.
Steve Hix - 25 Jan 2004 23:24 GMT
> Now what......looks like one software app, otherwise known as a portion of
> the software, was causing problems. The app duplicated the problem in the
> engineering mock-up. So is hardware involved or not?
They still don't know. They're working the issue.
Hobbs aka McDaniel - 26 Jan 2004 03:11 GMT
> > Now what......looks like one software app, otherwise known as a portion of
> > the software, was causing problems. The app duplicated the problem in the
> > engineering mock-up. So is hardware involved or not?
>
> They still don't know. They're working the issue.
A press release on the NASA website says that the problem is with the file management software... but maybe that means it was a fundamental flaw in the OS, which was developed by a third party. Still, I would think that proper stress testing on Earth would have caught this problem before launch. Apparently the amount of data the rover was dealing with at the time of the failure had something to do with triggering the problem.
-McDaniel
Steve Hix - 25 Jan 2004 23:21 GMT
> > I'd be very interested in knowing if the problem was caused by some
> > environmental problem, or was just a failure due to bad luck.
[quoted text clipped - 6 lines]
> http://www.newsfeeds.com - The #1 Newsgroup Service in the World!
> -----== Over 100,000 Newsgroups - 19 Different Servers! =-----
Steve Hix - 25 Jan 2004 23:24 GMT
> > I'd be very interested in knowing if the problem was caused by some
> > environmental problem, or was just a failure due to bad luck.
>
> What are the chances that the flash chip got hit by high energy
> radiation like a cosmic ray? Does the flash have any redundancy for such
> an event?
(Oops...previous post went out sans response content.)
During this afternoon's briefing, it was mentioned that the flash memory is susceptible to this sort of problem during read or write operations; they are looking at solar activity during the past couple of days to see if anything matches the observed failures.
Apparently, when it isn't reading/writing, there's no problem.
starman - 26 Jan 2004 07:31 GMT
> > > I'd be very interested in knowing if the problem was caused by some
> > > environmental problem, or was just a failure due to bad luck.
[quoted text clipped - 11 lines]
>
> Apparently, when it isn't reading/writing, there's no problem.
Do they know if Spirit was in daylight or night when the hardware/flash damage may have occurred?
Kent Betts - 26 Jan 2004 22:19 GMT
http://spaceflightnow.com/mars/mera/040125spirit.html
Bogged down software could explain Spirit's ailment
BY SPACEFLIGHT NOW
Posted: January 25, 2004
The group working to unravel the glitch with Spirit and return the rover to action has narrowed the possible cause of its trouble to three potentials, officials said Sunday afternoon.
"Spirit is still serious but we are moving toward guarded condition now," rover project manager Pete Theisinger said. "I think we got a patient well on the way to recovery."
The rover experienced a problem last Wednesday, disrupting communications with Earth and halting all science activities. The breakdown happened while Spirit was performing a calibration of motors on the Mini-TES instrument.
"The leading theory is that the file management software module in the software has gone to some condition that it could not cope with -- that it was not robust enough for the operations we were engaged in when we had the flaw on Wednesday," Theisinger said.
On Saturday, engineers began focusing on the rover's flash memory and the way the software communicates with the computer memory. To get the rover operating, it was told to avoid using the flash memory for now.
On Sunday, the team was able to reset Spirit's computer to the non-flash utilization mode, Theisinger said. Also Sunday, the ongoing diagnostics determined the flash memory hardware aboard the rover to be healthy.
"There are two other theories that are not as well in competition but cannot be discounted, and they are being worked by anomaly subteams," Theisinger added.
"One is there was some kind of error or hardware issue on the motor control board. That's the circuit board with the electronics that control the motors. That's being examined.
"Also...there was a solar event Wednesday and the timing of that is being looked at with respect to correlation to the onset of our problems. The flash memories are sensitive to high-energy ions and neutrons when they are being read from and written to, and we were certainly engaged in a lot of that activity that day."
Theisinger remains hopeful that Spirit will resume its exploration adventure of Gusev Crater by mid-February.
"I think we've got a patient well on the way to recovery, and I think we have a very good chance now we will have a very good rover when we are done getting this thing back up. Although, once again, it will take some time to make sure that we have completely characterized the problem and that we are able to check out all of the functionality on the vehicle.
"You can't take anything for granted here. So I don't expect to be driving for a couple of weeks, maybe three."
Hagar - 26 Jan 2004 22:34 GMT
> > > > I'd be very interested in knowing if the problem was caused by some
> > > > environmental problem, or was just a failure due to bad luck.
[quoted text clipped - 14 lines]
> Do they know if Spirit was in daylight or night when the hardware/flash
> damage may have occurred?
Daylight when they first noticed it.
Hagar - 27 Jan 2004 09:49 GMT
> > > > I'd be very interested in knowing if the problem was caused by some
> > > > environmental problem, or was just a failure due to bad luck.
[quoted text clipped - 14 lines]
> Do they know if Spirit was in daylight or night when the hardware/flash
> damage may have occurred?
Daylight when they first noticed it.
Kent Betts - 27 Jan 2004 18:18 GMT
Here is a story from spaceflightnow.com. The software problem was causing the flash memory to fill up. Sounds like a problem we used to have with a C compiler. After using an ordered list, we would do a "mark and release" to make the system memory resources available again. Didn't work in our machine, either. There was also something called a "memory leak" in the Windows 95 browser. After an hour of browsing the machine would slow down or stop because the RAM was full of memory that was marked "in use". It appears that the problem app was dealing with log files.
-------
http://spaceflightnow.com/mars/mera/040126spirit.html
Reconstructing Spirit's hopeful road to recovery
BY SPACEFLIGHT NOW
Posted: January 26, 2004
NASA's Mars Exploration Rover Spirit appeared to be teetering on the brink of failure last week when ground controllers lost contact with the craft sitting in Gusev Crater, its arm extended to a rock as the scientific adventure was beginning. Now, engineers are cautiously hopeful that Spirit will soon be restored to full working order.
"Spirit is doing better. It is kind of like we have a patient in re-hab here, and we are nursing her back to health," Jennifer Trosper, rover mission manager, said Monday.
It is now believed that the rover's flash memory had become so full of files that the craft couldn't manage all of the information stored aboard. Spirit bogged down because it didn't have enough random access memory, or RAM, to handle the current amount of files in the flash -- including data recorded during its cruise from Earth to Mars and the 18 days of operations on the red planet's surface.
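The failure mode described here, where RAM runs short because of the number of files held in flash, can be pictured with a toy model in which the file system keeps one fixed-size index entry in RAM per file. Every number below (entry size, RAM budget, file counts) is invented for illustration; the article gives no such figures, and the real file system may work quite differently.

```python
def index_ram_bytes(file_count, bytes_per_entry=512):
    """RAM needed for an in-memory file index, one fixed-size entry per file."""
    return file_count * bytes_per_entry


def can_mount(file_count, ram_budget_bytes, bytes_per_entry=512):
    """True if the index for `file_count` files fits in the RAM set aside for it."""
    return index_ram_bytes(file_count, bytes_per_entry) <= ram_budget_bytes


RAM_BUDGET = 1_000_000  # hypothetical budget in bytes

print(can_mount(1_000, RAM_BUDGET))   # early in the mission: few files
print(can_mount(3_000, RAM_BUDGET))   # after files accumulate
```

In such a model, deleting old files shrinks the index again, which is consistent with the plan to remove cruise files and see whether the problem goes away.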
"I think we just found an issue with the number of files that eventually were on the spacecraft at this time in the mission that we were unaware of because of the accumulation that happened over the course of cruise and the 18 sols on the surface," Trosper said.
Flash memory is used in electronics, such as digital cameras, because it retains stored information even when the power is turned off. The rover also has random access memory, which doesn't keep stored data when the rover goes to sleep each night.
Controllers are preparing to delete hundreds of cruise files in hopes of lessening the burden.
"We don't know yet whether Spirit will be perfect again. Our current theory is one in which software would fix the problem," Trosper said. "There are other health checks that we have to do with the flash, the high-gain antenna, the Pancam Mast Assembly and the motor control board to make sure our current theory fully checks out."
Some triggering event, not yet fully pinpointed, caused the rover's computer brain to begin a continuous series of resets until engineers on Earth were able to regain control of the craft late last week.
"You have to keep in mind that the problem we've had actually is associated with our ability to collect and maintain recorded data (on the rover). So the flash memory where we store this data that would tell us what had happened over the past days actually is part of the problem we are seeing. So we don't have a lot of information," Trosper told reporters at the daily spacecraft status briefing Monday.
"Let me go back to Sol 18 and tell you a little bit about what we think happened -- to try and reconstruct it. As we get more data, I guarantee you that some of these things will change, but let me tell you what we think today," Trosper said, launching into a detailed explanation that begins last Wednesday.
"Sol 18 we had some weather problems at the (Earth communications) station, and about 10 minutes early for the morning antenna pass we lost the signal. It wasn't clear whether that was the result of a spacecraft problem or a station problem.
"We've done some tracking of that, it's still not completely clear, but it's entirely possible that was a spacecraft problem at that time. We believe that was possibly a reset on the spacecraft that would've caused our signal to be lost when...the software would reset and come up and power off all of the loads and put itself into a safe state.
"Due to the reset, we have actually confirmed that the morning activities that we were trying to do that morning did not complete. So if you recall, we were moving the IDD (science arm), getting ready to (use the Rock Abrasion Tool). The IDD, the arm, position is actually in the same position it was on Sol 18 before we attempted to do that move.
"Some time the morning, early afternoon of Sol 18 (Wednesday) we encountered the problem. That problem, initially, was most likely a reset. We don't understand exactly where that reset came from but we have some ideas. It caused us to get into this belief that the flash system was corrupted in a way that we got into continuous reset loops.
"Then in the afternoon, we actually sent a command sequence to the vehicle with a little bleep in it to tell us that the sequence got there. We sent that sequence and got the bleep with no problems.
"Twenty minutes after that we expected to see a session from the vehicle on the high-gain antenna communicating with us. We had been on the high-gain antenna since Sol 2. We didn't see that communications session. That, in addition to the 10 minute drop out early in the morning, that was one of the early indications that there was something wrong.
"In the afternoon Odyssey pass we did not see any data from the vehicle. The early Sol 19 (Thursday) morning MGS (Mars Global Surveyor) pass, we only saw two minutes of data from vehicle and it wasn't really data from the vehicle -- it was 'the UHF radio was on and nobody was home' kind of data. And then the morning Odyssey pass we received no data.
"On Sol 20 (Friday) in the morning we attempted to command the rover at the nominal uplink rate where it should be if everything is fine, and we received no data. We have pre-loaded communications windows when the rover should attempt to communicate with us and those windows did not execute on the morning of Sol 20.
"One of the things that the vehicle will do if it encounters a system-level fault is change the rate at which it will accept commands, and that is for the vehicle's protection as well as for our knowledge. And so in the afternoon we sent a command at a different rate for the vehicle to send us a beep, and we actually got that beep back. The rate we sent it at was a rate that the software would have autonomously put us in if it had some sort of system-level fault. So we knew at that point that there were about four scenarios that would put us at that rate and we started to go down that path of those four scenarios.
"Then we didn't receive data in the overnight UHF passes that night.
"On Sol 21 (Saturday) we were actually trying to establish the same commandability we had the previous day -- we now knew that there was a system-level fault, we didn't know if it was a power issue, if it was a thermal issue, if it was an X-band communications issue. So we sent, essentially, the same command to get a beep on the morning of Sol 21 and we didn't get the beep.
"Then, as we were getting ready to send the next beep command, the vehicle decided to communicate with us in one of its nominal communications windows at which point we got a little bit of data that had very little information in it. In fact, originally we started to decode it and it was from the year 2053 and we thought 'this is not good!' Eventually we found out the data was corrupted, and we were all cheering at that point because there weren't a lot of scenarios that would put us in 2053 on Mars.
"That signal actually dropped out nine minutes or 10 minutes after we got it. And that was at 10 bits per second, so there was very little data and the data we got was corrupted.
"We sent another command to the spacecraft to give us a 30-minute communications session at 120 bits per second. And that command was received and we got the signal on the ground -- we got one frame of data, which told us that it was sending us data. Then it stopped. And that session then ended about 10 minutes early.
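To put these rates in perspective, a session that drops after roughly nine minutes at 10 bits per second can carry at most about 675 bytes, and a full 30-minute session at 120 bits per second about 27,000 bytes. The sketch below assumes 8-bit bytes, no protocol overhead, and the nine-minute figure from the account above; the function name is made up for illustration.

```python
def payload_bytes(bits_per_second: float, minutes: float) -> float:
    """Upper bound on bytes delivered in a session of the given length.

    Ignores framing and error-correction overhead, so real payloads are smaller.
    """
    return bits_per_second * minutes * 60 / 8


print(payload_bytes(10, 9))     # session that dropped out at 10 bits per second
print(payload_bytes(120, 30))   # full 30-minute session at 120 bits per second
```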
"We tried the same thing again and we modified some of the parameters in the command to try and get a different set of data. That different set of data actually gave us a very limited state of the current state of the vehicle -- some channelized telemetry. It told us how many flight software resets happened over the course of those two nights and that's where the big 77 numbers came from, and we realized we had a reset problem, that certain tasks were failing and it was keeping us from doing the communications that we intended to do.
"As a result of that knowledge, we also realized the vehicle may not have shut down because the reset could be associated with the shutdown of the vehicle. So we attempted to shut the vehicle down, and then we send a beep after shutdown to make sure it has shut down. [The rover would not reply with a beep if it was asleep.]
"It's sort of like feast or famine -- we didn't hear from it for a day-and-a-half and then we shut it down and we send a beep and we get the beep, then we shut it down again and send a beep and we get the beep, and then we shut it down again and send a beep and we get the beep. The vehicle was clearly not able to shut itself down and the reset was causing a problem with the shutdown.
"We knew that the power system was struggling, the battery wasn't charged as much as we expected it to be or wanted it to be. So we deleted our overnight UHF passes in case the vehicle decided to do them -- or attempted to. In the same way the reset cycle had caused those commands not to get in and so we got the first Odyssey UHF pass when we had hoped not to hear from the vehicle because we did want it to be asleep and charge the batteries.
"We asked Odyssey and MGS to turn off their radio beacons so (Spirit) didn't use that energy during the night to transmit because we were getting close to entering our low-power mode. Low-power mode is the mode that will safe the vehicle, take the batteries off-line and sit there, basically, and bask in the sun until the voltage gets high enough for the vehicle to wake up.
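The low-power mode described above can be sketched as a small state machine that drops into a safe state when the voltage falls too low and wakes only once it has recovered. Using two separate thresholds (hysteresis) is an assumption made here for illustration; the account only says the rover waits until the voltage is high enough. The threshold values are invented.

```python
def power_states(voltages, sleep_below=24.0, wake_above=28.0):
    """Track a two-threshold low-power mode over a series of voltage readings.

    Threshold values are made up; the real fault protection is not described
    in this much detail.
    """
    low_power = False
    states = []
    for v in voltages:
        if not low_power and v < sleep_below:
            low_power = True      # safe the vehicle, take batteries off-line
        elif low_power and v > wake_above:
            low_power = False     # voltage has recovered enough to wake up
        states.append("low-power" if low_power else "awake")
    return states


print(power_states([29, 23, 26, 27, 29]))
```

The gap between the two thresholds keeps the vehicle from flipping in and out of low-power mode while the voltage hovers near a single cutoff.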
"So we woke up the morning of Sol 21 (Saturday) on solar array wake up and saw that we had indeed entered low-power mode and the fault protection had worked exactly as designed. In the low-power mode we don't get our morning communications session until about 11 a.m. because that is when the sun is nice and high, the Earth is nice and high (in the sky) and you can get good data rates and transmit.
"And in that we realized that we had this reset problem. Based on just kind of the hunch of our lead software architect, he believed that the problem was probably associated with the mounting of flash and initialization. There is a hardware command that we can send that bypasses the software where we can actually tell the hardware to not allow us to mount flash on initialization. When we the next day actually sent the command to do that, software initialized normally and was behaving like the software that we had always known. It was a fantastic moment.
"Once we got into the mode where we could command the vehicle to get into a software state that we understood, then we were able to collect data. That is the path that we are on right now.
"Right now, our most likely candidate for the issue has been narrowed down a little bit. It is really an issue with the file system in flash. Essentially, the amount of space required in RAM to manage all of the files we have in flash is apparently more than we initially anticipated.
"We have been collecting data and collecting data thanks to (the science team) and we have lots and lots of files on the spacecraft. That's good -- we intended to have lots and lots of files on the spacecraft. This is a new problem that we encountered based on having many files.
"We are currently in a much more specific debugging activity. Today (Monday), we started to dump out some of flash. We are actually loading a script that we get kind of the task trace on the software and identify exactly where the problem was in the code so we can make sure that our hunch is correct.
"Tomorrow, we might try to access flash and do a little bit of a health check on it. The next day we might try to delete some files to see if our hunch is correct that it's really due to the number of files that we are trying to manage on the flash file system.
"And in parallel we are trying to work a less likely scenario that something happened with the high-gain antenna and the motor control board when we were doing this engineering checkout of the Mini-TES elevation actuator (Wednesday morning). We are still working that as well to make sure that we can get back on the high-gain antenna in a very cautious way.
"In summary, I would like to say that -- as it has always been -- it's humbling to work with a team of such excellent people. I just want to tell you the folks who are working on the details of this problem are the best of the best in the world that we have. Every day when I come into work, their innovation, their persistence, their talent and their hard work has almost overwhelmed me and certainly humbles me. But that is what has got us where we are today and that's what is going to get us to having a healthy rover on the surface shortly."
Joe Knapp - 28 Jan 2004 03:44 GMT > Here is a story from spaceflightnow.com. The software problem was causing > the flash memory to fill up. "The Mars Exploration Rovers' software packages were developed using Tornado 1.0.1 for Mars Exploration Rover (a special release) and debugger with updates spanning VxWorks releases from 5.1 through 5.5. The flexibility and portability of Wind River's software has allowed NASA JPL engineers to leverage work for future projects and enables concurrent development and debugging of the application, which is critical in the unlikely event the system needs to be adjusted from millions of miles away."
A Usenet Google search reveals a number of posts over the years about flash file systems getting corrupted under VxWorks when long file names are used.
Judging by the file names of the "raw" images on the marsrovers.nasa.jpl.gov web site, the 27.3 filename standard under the "Planetary Data System" has recently been adopted over the older 8.3 standard. ("PDS is currently investigating adapting a 27.3 convention for file names. A decision on this will be made at the March, 1999 PDS Management Council.") http://starlight.jpl.nasa.gov/tools/faq.html
Here's a typical image file name on the marsrovers.nasa.jpl.gov web site:
1P128365104EFF0100P2205L4M1.JPG
Which is 27.3.
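For anyone who wants to check the count, here is a minimal Python sketch that measures the file name above against the 8.3 and 27.3 limits. The helper name fits_convention is made up for illustration; it only counts characters and says nothing about how the rover's flight software actually handles names.

```python
import os

def fits_convention(filename: str, max_base: int, max_ext: int) -> bool:
    """Return True if the base name and extension fit the given lengths.

    Hypothetical helper for illustration; it only counts characters.
    """
    base, ext = os.path.splitext(filename)
    ext = ext.lstrip(".")  # splitext keeps the leading dot
    return 0 < len(base) <= max_base and len(ext) <= max_ext

name = "1P128365104EFF0100P2205L4M1.JPG"
base, ext = os.path.splitext(name)

print(len(base), len(ext) - 1)        # base-name and extension lengths
print(fits_convention(name, 8, 3))    # does it fit 8.3?
print(fits_convention(name, 27, 3))   # does it fit 27.3?
```

The base name is 27 characters, so it fits 27.3 exactly and is far too long for 8.3.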
Usenet posts identify a number of pitfalls in using VxWorks long filenames, which rely on a proprietary filesystem format that is not compatible with Windows-formatted filesystems. The reported symptom is that the flash filesystem gets periodically corrupted if the flash was formatted and files were loaded from a Windows development system before being installed on the target VxWorks system.
A Usenet poster in comp.os.vxworks writes:
"We are using Tornado 1.0.1... We are using NFS and when performing create, read and write operations from the NFS server on the NFS client the files on the NFS mounted flashdisk becomes corrupt. We get an error message, "error reading entry (errno=0x300002)", which indicates that the folder isn't a valid entry. In addition file sizes are not represented correctly.
"When searching for errors by using Scandisk on a Win98 PC we also get the error message "The second media byte is missing or incorrect" which indicates that the disk is not correct formatted. Note that this error doesn't exist on a 'never used' flashcard.
"By using Scandisk we can fix the errors but they will return after a while again.
"We are using the Win98 PC to add folders and files to the flashcard before using it on our target.
"Does anyone know how to solve this problem? <end quote>
The reply:
There are two known problems you are dealing with:
"1. The current dosFs in VxWorks does not handle Win9x/NT style long file names, and that is the most probable reason for the "invalid entry" error you see.
"2. The NFS server is known to use the second media byte for its own purposes, the Scandisk complaint being the only negative impact thereof. <end quote>
Another piece of advice to a similar problem:
"TrueFFS is very nosy about the DosFs structure stored on its volumes, and the only safe way to use it is to format a TrueFFS volume only with the format function supplied with TrueFFS. Never use dosFsMkfs() to format a TrueFFS volume.
"DosFs long names are non-standard, and can be created only with dosFsMkfs(), and are known to cause file data to disappear when used with TrueFFS, because TrueFFS's idea of the meaning of some fields in the boot block and DosFs's idea vary. This has been documented in the SPR database.
"In principle, TFFS provides a block device so it could interoperate with any file system, but in practice, TFFS is aware of the MS-DOS file system structure in order to implement its wear leveling, hence it can result in data corruption if the MS-DOS file system format is even slightly changed, for example if the VxWorks proprietary 40-chars long file names are used.
<end quote>
Just a thought!
John Doe - 28 Jan 2004 04:21 GMT > A Usenet Google search reveals a number of posts over the years about flash > file systems getting corrupted under VxWorks when long file names are used. Is it really possible that the folks writing the software for NASA would have been unaware of this?
Also, wouldn't they have rehearsed/simulated the whole mission to ensure the software worked fine? Or were the tests/simulations done independently by separate groups responsible for different functions of the robots, with none of the independent tests running on a system that also had other groups' data on it?
jimmydevice - 28 Jan 2004 04:48 GMT >>Here is a story from spaceflightnow.com. The software problem was causing >>the flash memory to fill up. <snip>
> Just a thought! It could be even simpler than the previous FFS bug. In a press conference, Trosper mentioned that the fault logging was stored in flash. I can see a scenario where FFS faults and the exception handler tries to post a message to the flash, which may generate another fault. Eventually the process stack overflows recursively because of the unserviced fault-log requests, the system resets because the watchdog is not being serviced, and we start all over. Jim Davis.
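The loop described above can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the names (write_fault_log, boot, simulate), the stack limit of 50, and the idea that every flash write fails. It is not the rover's flight software, only the shape of the proposed failure: a fault logger that faults while logging, recurses until the stack runs out, and forces a reset on every boot.

```python
class StackOverflow(Exception):
    """Stands in for the task stack running out (hypothetical)."""

STACK_LIMIT = 50  # made-up depth at which the toy "stack" overflows

def write_fault_log(message: str, depth: int, flash_ok: bool) -> int:
    """Try to log a fault to flash; if the write itself faults, log that too."""
    if depth >= STACK_LIMIT:
        raise StackOverflow
    if flash_ok:
        return depth  # log written, recursion stops
    # The write faulted, so the handler tries to log the new fault.
    return write_fault_log("fault while logging: " + message[:20],
                           depth + 1, flash_ok)

def boot(mount_flash: bool) -> str:
    """One boot cycle: returns 'running' or 'reset'."""
    if not mount_flash:
        return "running"  # flash bypassed, so nothing faults
    try:
        write_fault_log("mount failed", 0, flash_ok=False)
    except StackOverflow:
        return "reset"  # the watchdog would expire here and reboot
    return "running"

def simulate(mount_flash: bool, max_boots: int = 5) -> list:
    """Boot repeatedly until the system stays up or max_boots is reached."""
    history = []
    for _ in range(max_boots):
        state = boot(mount_flash)
        history.append(state)
        if state == "running":
            break
    return history

print(simulate(mount_flash=True))
print(simulate(mount_flash=False))
```

With flash mounted, the model resets on every boot. With the mount skipped, it comes up on the first try, which matches the account in the press conference that telling the hardware not to mount flash on initialization let the software start normally.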