LoadLeveler: job stuck in "(alloc)" state, won't run
Posted in the AIX Operating System forums, part of the Unix Operating Systems category.
I just set up the following class:

    com_rg4: type = class              # class for medium jobs
        priority = 60                  # ClassSysprio
        # cpu_limit= 08:00:00          # 2 hour run time limit
        wall_clock_limit = 00:10:00    # Needed for BACKFILL scheduler
        max_processors = 4             # default max processors for class (no limit)
        max_total_tasks = 4

In LoadL_config.local, I have:

    CLASS = small(8) medium(5) large(2) inter_class(8) all_spec(0) com_rg8(0) com_rg32(0) com_sb8(0) com_sb32(0) com_rg4(4) com_sb4(4)

I submitted some test jobs with the following stanza:

    #@ job_name = tstclm01
    #@ class = com_rg4
    #@ node = 1
    #@ tasks_per_node = 4
    #@ output = $(job_name).txt
    #@ error = $(job_name).txt
    #@ job_type = parallel
    #@ network.MPI = csss,shared,us
    #@ node_usage = not_shared
    #@ account_no = 36271012
    ## @ wall_clock_limit = 3800
    #@ queue

However, llq reports:

    bash-2.05b$ llq
    Id                       Owner      Submitted   ST PRI Class        Running On
    ------------------------ ---------- ----------- -- --- ------------ -----------
    esmf04m.498.0            zender      6/13 18:53 R  50  com_rg32     esmf08m
    esmf04m.485.0            strombrg    6/10 16:55 I  50  all_sp8
    esmf04m.529.0            zender      6/14 23:33 I  50  com_rg32
    esmf04m.540.0            testacct    6/15 11:23 I  50  com_rg4      (alloc)
    esmf04m.541.0            testacct    6/15 11:23 I  50  com_rg4      (alloc)
    esmf04m.542.0            testacct    6/15 11:23 I  50  com_rg4      (alloc)
    esmf04m.543.0            testacct    6/15 11:23 I  50  com_rg4      (alloc)

    7 job step(s) in queue, 6 waiting, 0 pending, 1 running, 0 held, 0 preempted
    bash-2.05b$

and llq -s 540 reports:

    =============== Job Step esmf04m.540.0 ===============
    Job Step Id: esmf04m.540.0
    Job Name: tstclm01
    Step Name: 0
    Structure Version: 10
    Owner: testacct
    Queue Date: Tue Jun 15 11:23:30 PDT 2004
    Status: Idle
    Execution Factor: 1
    Dispatch Time:
    Completion Date:
    Completion Code:
    User Priority: 50
    user_sysprio: 0
    class_sysprio: 0
    group_sysprio: 0
    System Priority: -412109
    q_sysprio: -412109
    Notifications: Complete
    Virtual Image Size: 15 kb
    Large Page: N
    Checkpointable: no
    Ckpt Start Time:
    Good Ckpt Time/Date:
    Ckpt Elapse Time: 0 seconds
    Fail Ckpt Time/Date:
    Ckpt Accum Time: 0 seconds
    Checkpoint File:
    Restart From Ckpt: no
    Restart Same Nodes: no
    Restart: yes
    Hold Job Until:
    Env:
    In: /dev/null
    Out: tstclm01.txt
    Err: tstclm01.txt
    Initial Working Dir: /u/strombrg/clm
    Dependency:
    Resources:
    Step Type: General Parallel
    Node Usage: not_shared
    Submitting Host: esmf04m
    Notify User: testacct@esmf04m
    Shell: /usr/local/bin/bash
    LoadLeveler Group: No_Group
    Class: com_rg4
    Ckpt Hard Limit: undefined
    Ckpt Soft Limit: undefined
    Cpu Hard Limit: undefined
    Cpu Soft Limit: undefined
    Data Hard Limit: undefined
    Data Soft Limit: undefined
    Core Hard Limit: undefined
    Core Soft Limit: undefined
    File Hard Limit: undefined
    File Soft Limit: undefined
    Stack Hard Limit: undefined
    Stack Soft Limit: undefined
    Rss Hard Limit: undefined
    Rss Soft Limit: undefined
    Step Cpu Hard Limit: undefined
    Step Cpu Soft Limit: undefined
    Wall Clk Hard Limit: 00:10:00 (600 seconds)
    Wall Clk Soft Limit: undefined
    Comment:
    Account: 36271012
    Unix Group: franklin
    NQS Submit Queue:
    NQS Query Queues:
    Negotiator Messages:
    Adapter Requirement: (csss,MPI,shared,US)
    Step Cpus: 0
    Step Virtual Memory: 0.000 mb
    Step Real Memory: 0.000 mb
    Step Adapter Memory: 0 bytes
    --------------------------------------------------------------------------------
    Node
    ----
    Name :
    Requirements : (Arch == "R6000") && (OpSys == "AIX51")
    Preferences :
    Node minimum : 1
    Node maximum : 1
    Node actual : 0
    Allocated Hosts :

    Master Task
    -----------
    Executable : /u/strombrg/clm/clm.sh
    Exec Args :
    Num Task Inst:
    Task Instance:

    Task
    ----
    Num Task Inst:
    Task Instance:

    ==================== EVALUATIONS FOR JOB STEP esmf04m.540.0 ====================

    SUMMARY
    This LoadLeveler cluster has sufficient resources to run this job step.
    Dynamic constraints and other scheduling requirements may prevent the job
    step from running at the present time.

    ANALYSIS
    Basic Requirements :
    Class : com_rg4
    Machine : (Arch == "R6000") && (OpSys == "AIX51")
    Network/Adapter : (csss,MPI,shared,US)
    Consumable Resource :
    Requirements of Node Type 0 :
    Minimum Instance(s) : 1
    Number of Initiator(s)/Task(s) : 4

    Status of machines in the LoadLeveler cluster:

    The following machine(s) can be assigned to Node Type 0.
    esmf04m

    The following machines are unable to meet the Basic Requirements.
    esmf08m : class = com_rg4 is not supported by this machine.
    esmf07m : class = com_rg4 is not supported by this machine.
    esmf06m : class = com_rg4 is not supported by this machine.
    esmf05m : class = com_rg4 is not supported by this machine.
    esmf03m : class = com_rg4 is not supported by this machine.
    esmf02m : class = com_rg4 is not supported by this machine.
    esmf01m : class = com_rg4 is not supported by this machine.
    bash-2.05b$

What do I need to do to get these jobs out of the (alloc) state? Google web search and Google Groups turned up nothing.

Thanks in advance.
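For anyone comparing the numbers in the post, here is a minimal Python sketch that splits the CLASS line quoted above into per-class counts and checks the com_rg4 count against the job's tasks_per_node. It is not part of LoadLeveler: the helper name `parse_class_slots` is made up for illustration, and the sketch assumes the number in parentheses is how many initiators of that class the machine offers. It only restates what was posted and does not diagnose the (alloc) state.

```python
import re

# CLASS value copied from the post (LoadL_config.local on the submitting host).
CLASS_LINE = (
    "small(8) medium(5) large(2) inter_class(8) all_spec(0) "
    "com_rg8(0) com_rg32(0) com_sb8(0) com_sb32(0) com_rg4(4) com_sb4(4)"
)


def parse_class_slots(class_line):
    """Hypothetical helper: turn 'name(count) name(count) ...' into {name: count}."""
    return {
        name: int(count)
        for name, count in re.findall(r"(\w+)\((\d+)\)", class_line)
    }


slots = parse_class_slots(CLASS_LINE)

# Values taken from the job command file in the post.
job_class = "com_rg4"
tasks_per_node = 4

# Assumption: the count must be at least tasks_per_node for one node to hold the job.
fits_on_one_node = slots[job_class] >= tasks_per_node
print(job_class, slots[job_class], fits_on_one_node)
```

With the values from the post, the com_rg4 count (4) matches the 4 tasks requested on a single node, which is consistent with llq -s listing esmf04m as assignable.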