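I tried to run it with the debugger in VS Code, but it did not behave as expected.
First, I logged in to the cluster as usual using VS Code and remote sync, and that works fine. Then I requested an interactive job with the following command: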
condor_submit -i request_cpus=4 request_gpus=1
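A node/GPU was then allocated successfully.
Once I have it, I try to run the debugger, but somehow it logs me out of the remote session (and, judging from the print statements, it appears to run on the head node). That is not what I want. I want to run the job inside the interactive session on the node/GPU that was allocated to me. Why is VS Code running it in the wrong place? How can I run it in the right place?
Some output from the integrated terminal: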
source /home/miranda9/miniconda3/envs/automl-meta-learning/bin/activate
/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python /home/miranda9/.vscode-server/extensions/ms-python.python-2020.2.60897-dev/pythonFiles/lib/python/new_ptvsd/wheels/ptvsd/launcher /home/miranda9/automl-meta-learning/automl/automl/meta_optimizers/differentiable_SGD.py
conda activate base
(automl-meta-learning) miranda9~/automl-meta-learning $ source /home/miranda9/miniconda3/envs/automl-meta-learning/bin/activate
(automl-meta-learning) miranda9~/automl-meta-learning $ /home/miranda9/miniconda3/envs/automl-meta-learning/bin/python /home/miranda9/.vscode-server/extensions/ms-python.python-2020.2.60897-dev/pythonFiles/lib/python/new_ptvsd/wheels/ptvsd/launcher /home/miranda9/automl-meta-learning/automl/automl/meta_optimizers/differentiable_SGD.py
--> main in differentiable SGD
hello world torch_utils!
vision-sched.cs.illinois.edu
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
-> initialization of DiMO done!
---> i = 0, iteration/it 1 about to start
lp_norms(mdl) = 18.43514633178711
lp_norms(meta_optimized mdl) = 18.43514633178711
[e=0,it=1], train_loss: 2.304989814758301, train error: -1, test loss: -1, test error: -1
---> i = 1, iteration/it 2 about to start
lp_norms(mdl) = 18.470401763916016
lp_norms(meta_optimized mdl) = 18.470401763916016
[e=0,it=2], train_loss: 2.3068909645080566, train error: -1, test loss: -1, test error: -1
---> i = 2, iteration/it 3 about to start
lp_norms(mdl) = 18.548133850097656
lp_norms(meta_optimized mdl) = 18.548133850097656
[e=0,it=3], train_loss: 2.3019633293151855, train error: -1, test loss: -1, test error: -1
---> i = 0, iteration/it 1 about to start
lp_norms(mdl) = 18.65604019165039
lp_norms(meta_optimized mdl) = 18.65604019165039
[e=1,it=1], train_loss: 2.308889150619507, train error: -1, test loss: -1, test error: -1
---> i = 1, iteration/it 2 about to start
lp_norms(mdl) = 18.441967010498047
lp_norms(meta_optimized mdl) = 18.441967010498047
[e=1,it=2], train_loss: 2.300947666168213, train error: -1, test loss: -1, test error: -1
---> i = 2, iteration/it 3 about to start
lp_norms(mdl) = 18.545459747314453
lp_norms(meta_optimized mdl) = 18.545459747314453
[e=1,it=3], train_loss: 2.30662202835083, train error: -1, test loss: -1, test error: -1
-> DiMO done training!
--> Done with Main
(automl-meta-learning) miranda9~/automl-meta-learning $ conda activate base
(automl-meta-learning) miranda9~/automl-meta-learning $ hostname
vision-sched.cs.illinois.edu
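The hostname printed above is how I can tell where the script actually ran. A minimal sketch of such a check is below. The function name `describe_execution_host` is hypothetical, and it assumes that HTCondor exports `_CONDOR_SLOT` inside a job and that `CUDA_VISIBLE_DEVICES` is set when a GPU is assigned; both may differ on a given cluster.

```python
import os
import socket


def describe_execution_host(environ=None):
    """Collect hints about which machine and scheduler slot this process runs in.

    The environment variable names are assumptions about the cluster setup:
    they may be absent, in which case the value is None.
    """
    env = os.environ if environ is None else environ
    return {
        # Name of the machine the interpreter is running on.
        "hostname": socket.gethostname(),
        # Assumed to be set by HTCondor inside a job; None on the head node.
        "condor_slot": env.get("_CONDOR_SLOT"),
        # Assumed to list the GPU(s) assigned to the job, if any.
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    for key, value in describe_execution_host().items():
        print(f"{key}: {value}")
```

If the script is launched from the debugger and the hostname is still the head node with no slot reported, the process was not started inside the interactive job.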
Running without debug mode does not work either
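The problem is worse than I thought. Not only can I not run the debugger in the interactive session, I cannot even use "Run Without Debugging", because it automatically switches to the Python Debug Console. That means I have to run python main.py manually, but then I cannot use the variables pane, which is a big loss!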
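I switch my terminal to the condor_ssh_to_job session and then click the Run Without Debugging button (or ^F5, or Control + fn + F5). Even though I make sure the interactive session is the one selected at the bottom of my integrated window, it still jumps automatically to the Python Debugger window/pane, which is not connected to the interactive session I requested from the cluster...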
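Related:
The ssh -J ... part works correctly. The best approach is to configure the connection in your .ssh/config file. Create an SSH key pair and install the public key on your cluster. Also check your cluster documentation for information about connecting to compute nodes over SSH. - damienfrancois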
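As a rough illustration of the .ssh/config suggestion in the comment above, the fragment below sketches a two-hop setup using the `ProxyJump` keyword (the config-file equivalent of `ssh -J`). The host aliases, the compute node name, and the key path are placeholders and not taken from the question; only the head node name and user name appear in the terminal output. Whether direct SSH to a compute node is permitted at all depends on the cluster.

```
# ~/.ssh/config (sketch; aliases, compute node name and key path are placeholders)
Host cluster-head
    HostName vision-sched.cs.illinois.edu
    User miranda9
    IdentityFile ~/.ssh/id_ed25519

Host cluster-compute
    # Replace with the node that the interactive job was allocated on.
    HostName compute-node-placeholder
    User miranda9
    # Hop through the head node, like `ssh -J cluster-head ...`.
    ProxyJump cluster-head
    IdentityFile ~/.ssh/id_ed25519
```

With an entry like this, a remote VS Code window could in principle be opened against the `cluster-compute` alias so that the debugger starts on the allocated node rather than on the head node.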